Llama 2 is a free and open-source large language model that you can run locally on your own machine. It is an improvement over the earlier Llama model.
In this post, you will learn:
- What the Llama 2 model is.
- How to install and run the Llama 2 models on Windows.
What is the Llama 2 model?
Llama 2 is a free and open-source large language model released by Meta in July 2023. Two model families were released.
- Llama 2: 40% more pretraining data than Llama 1 and double the context length. Good for sentence completion.
- Llama 2-Chat: Optimized for conversational use, like ChatGPT.
Both families are available in 7B, 13B, and 70B sizes.
What’s new in Llama 2?
The model architecture is very similar to that of the original Llama model. The improvements are:
- Increased context length
- Grouped-query attention
The context length is the number of tokens (similar to words or subwords) a language model can consider from the input text when generating a response. The original Llama model has a context length of 2,048 tokens; Llama 2 doubles it to 4,096.
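To make the idea concrete, here is a toy sketch (this is not Llama's real tokenizer, just whitespace splitting for illustration): whatever the input length, a model with a fixed context window can only attend to the most recent tokens that fit.

```python
def truncate_to_context(tokens, context_length=4096):
    """Keep only the most recent tokens that fit in the context window."""
    return tokens[-context_length:]

# Toy "tokenizer": split on whitespace (real tokenizers use subwords).
tokens = "the quick brown fox jumps over the lazy dog".split()

# A 2,048-token window keeps everything here; a tiny window drops the start.
print(len(truncate_to_context(tokens, context_length=2048)))  # 9
print(truncate_to_context(tokens, context_length=2))          # ['lazy', 'dog']
```

Doubling the context length to 4,096 means twice as much preceding text survives this truncation, at the cost of more memory for cached keys and values.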
Grouped-query attention (GQA) is a new optimization to tackle high memory usage due to increased context length and model size. It reduces memory usage by sharing the cached keys and values of the previous tokens. GQA is only used in the 34B and 70B Llama 2 models.
The pretraining of Llama 1 and Llama 2 is similar, except that Llama 2 uses a larger pretraining dataset: 2.0 trillion tokens, up from 1.0 and 1.4 trillion tokens for the Llama 1 models.
Indeed, the larger pretraining dataset has resulted in higher performance across all metrics evaluated.
Supervised Fine-Tuning (SFT)
Quality matters more than quantity when fine-tuning the model. The model was fine-tuned with the following techniques.
- SFT Annotations: High-quality prompt and response pairs.
- Reinforcement Learning from Human Feedback (RLHF): Human labelers indicate which of two responses they prefer. The model is then trained to produce the kinds of answers humans prefer.
- Human preference: The model learns to provide safe and helpful responses based on human feedback.
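To give a flavor of how preference data is turned into a training signal, here is a minimal sketch of the pairwise ranking loss commonly used to train a reward model in RLHF. The scalar rewards below are stand-ins for reward-model outputs; this is an illustration of the general technique, not Meta's exact training code.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the reward model scores the human-preferred response
    higher than the rejected one; large when the ordering is wrong.
    """
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the margin in favor of the preferred answer grows:
print(preference_loss(2.0, -1.0))  # preferred answer scored higher: small loss
print(preference_loss(-1.0, 2.0))  # rejected answer scored higher: large loss
```

Minimizing this loss pushes the reward model to rank responses the way human annotators do; that reward model then steers the policy during RLHF.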
Running Llama 2 locally
Step 1: Install text-generation-webUI
Follow this installation guide for Windows.
Step 2: Download Llama 2 model
Now that you have text-generation-webui running, the next step is to download the Llama 2 model. There are many variants; which one you need depends on your machine's hardware.
- Download models in GPTQ format if you use Windows with an Nvidia GPU.
- Download models in GGML format if you use the CPU on Windows or an M1/M2 Mac.
Download the largest model size (7B, 13B, or 70B) your machine can run. Download the Chat models if you want to use them in a conversational style like ChatGPT.
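A quick way to judge what your machine can run is the rule of thumb "parameters times bits per weight, divided by 8". The sketch below is a ballpark only; real quantized files vary with metadata and mixed-precision layers, and the quantization bit widths shown are just common examples.

```python
def approx_model_gb(n_params_billion, bits_per_weight):
    """Rough size estimate in gigabytes: parameters x bits per weight / 8.

    Treat this as a ballpark for "can my machine hold this model?",
    not an exact file size.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Common quantization levels, as a rough guide:
for size in (7, 13, 70):
    for bits in (4, 8, 16):
        print(f"{size}B @ {bits}-bit: ~{approx_model_gb(size, bits):.1f} GB")
```

For example, a 7B model at 4-bit quantization needs roughly 3.5 GB, while the 70B model at 4-bit needs roughly 35 GB, which is why the 70B models are out of reach for most consumer machines.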
Option 1: Windows users with Nvidia GPU
In text-generation-webui, navigate to the Model page. In the Download custom model or LoRA section, enter the Hugging Face path listed below for the model you want to download. Refresh the Model list and load the newly downloaded model.
Option 2: Mac users or Windows CPU users
If you use a Mac or Windows without a GPU card, you can download the Llama 2 models from the following pages. There are multiple files on each model page; they are different quantizations that aim to reduce the file size. You DON'T need to download them all. You only need ONE .bin file to run the model.
Download one .bin file and put it in the text-generation-webui > models folder. Refresh the Model list on the Models page, then select and load the model to start using it.