Llama 2 is a free and open-source large language model that you can run locally on your own machine. It is an improvement over the earlier Llama model.
In this post, you will learn:
- What the Llama 2 model is.
- How to install and run the Llama 2 models on Windows.
What is the Llama 2 model?
Llama 2 is a free and open-source large language model released by Meta in July 2023. Two model families were released.
- Llama 2: Trained on 40% more pretraining data than Llama 1, with double the context length. Good for sentence completion.
- Llama 2-Chat: Optimized for conversational use, like ChatGPT.
7B, 13B, and 70B models are available in both families.

What’s new in Llama 2?
Model architecture
The model architecture is very similar to that of the original Llama model. The improvements are:
- Increased context length
- Grouped-query attention
The context length is the number of tokens (similar to words or subwords) a language model can consider in the input text when generating a response. The original Llama model has a context length of 2,048. Llama 2’s context length is doubled to 4,096.
Grouped-query attention (GQA) is a new optimization that tackles the high memory usage caused by the increased context length and model size. It reduces memory usage by sharing the cached keys and values of previous tokens across groups of query heads. GQA is only used in the 34B and 70B Llama 2 models.
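To make the idea concrete, here is a minimal sketch of grouped-query attention. The dimensions are toy values chosen for illustration, not the actual Llama 2 configuration, and the sketch assumes PyTorch is installed.

```python
# Minimal sketch of grouped-query attention (GQA).
# Toy dimensions for illustration only, not the real Llama 2 config.
import torch

batch, seq_len = 1, 8
n_q_heads, n_kv_heads, head_dim = 8, 2, 16  # 8 query heads share 2 KV heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Each group of query heads attends to one shared KV head, so the KV cache
# only needs to store n_kv_heads entries instead of n_q_heads.
group_size = n_q_heads // n_kv_heads
k = k.repeat_interleave(group_size, dim=1)  # expand KV heads to match Q heads
v = v.repeat_interleave(group_size, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v  # (batch, n_q_heads, seq_len, head_dim)
print(out.shape)
```

Because only `n_kv_heads` key/value tensors are cached per token, the KV cache shrinks by a factor of `group_size` compared with standard multi-head attention.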
Pre-training
The pretraining of Llama 1 and Llama 2 is similar, except that Llama 2 has a larger pretraining dataset. It is increased to 2.0 trillion tokens, up from 1.0 and 1.4 trillion tokens for the Llama 1 models.
Indeed, the larger pretraining dataset has resulted in higher performance across all metrics evaluated.

Supervised Fine-Tuning (SFT)
Quality is more important than quantity when it comes to fine-tuning the model. The model was fine-tuned with the following techniques.
- SFT annotations: High-quality prompt-and-response pairs.
- Reinforcement Learning from Human Feedback (RLHF): Human annotators indicate which of two answers they prefer. The model is then trained to respond with answers that humans prefer (see the sketch after this list).
- Human preference: The model learns to provide safe and helpful responses based on human feedback.
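To illustrate the preference step, below is a minimal sketch of the pairwise ranking loss commonly used to train RLHF reward models. The scores are made-up numbers; this is not Meta’s actual training code, and it assumes PyTorch is installed.

```python
# Pairwise ranking loss for a reward model: push the score of the
# human-preferred answer above the score of the rejected answer.
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.3, 0.2])    # reward scores for preferred answers (toy values)
r_rejected = torch.tensor([0.1, 0.5])  # reward scores for rejected answers (toy values)

# Loss is low when r_chosen > r_rejected for each pair.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())
```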

Running Llama 2 locally
Step 1: Install text-generation-webUI
Follow this installation guide for Windows.
Step 2: Download Llama 2 model
Now that you have text-generation-webui running, the next step is to download a Llama 2 model. There are many variants. Which one you need depends on your machine’s hardware.
- Download models in GPTQ format if you use Windows with an Nvidia GPU card.
- Download models in GGML format if you use CPU-only Windows or an M1/M2 Mac.
Download the largest model (7B, 13B, or 70B) your machine can run. Download the Chat models if you want to use them in a conversational style like ChatGPT.
Option 1: Windows users with Nvidia GPU
In text-generation-webui, navigate to the Model page. In the Download custom model or LoRA section, enter the Huggingface path listed below for the model you want to download. Refresh the Model list and load the newly downloaded model.
Huggingface paths:
- Llama 2 Chat 7B: localmodels/Llama-2-7B-Chat-GPTQ
- Llama 2 Chat 13B: localmodels/Llama-2-13B-Chat-GPTQ
- Llama 2 Chat 70B: localmodels/Llama-2-70B-Chat-GPTQ
- Llama 2 7B: localmodels/Llama-2-7B-GPTQ
- Llama 2 13B: localmodels/Llama-2-13B-GPTQ
- Llama 2 70B: localmodels/Llama-2-70B-GPTQ
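If you prefer to fetch a model from the command line instead of through the web UI, here is a minimal sketch using the huggingface_hub library (assuming it is installed via pip install huggingface_hub). The local_dir path is an example and should point to your text-generation-webui models folder.

```python
# Download a GPTQ model repo directly into the web UI's models folder.
# The repo_id is one of the Huggingface paths listed above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="localmodels/Llama-2-7B-Chat-GPTQ",
    local_dir="text-generation-webui/models/Llama-2-7B-Chat-GPTQ",
)
```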
Option 2: Mac users or Windows CPU users
If you use a Mac or Windows without a GPU card, you can download the Llama 2 models from the following pages. There are multiple files on each model page. You DON’T need to download them all. You only need to download ONE .bin file to run the model. They are different quantizations that aim at reducing the file size.
Download ONE .bin file for a model and put it in the text-generation-webui > models folder. Refresh the Model list on the Models page. Then select and load the model to start using it.
Llama 2 Chat 7B GGML model – Download link
Llama 2 Chat 13B GGML model – Download link
Llama 2 7B model – Download link
Llama 2 13B model – Download link
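As an aside, GGML .bin files can also be run outside the web UI. Below is a minimal sketch using the llama-cpp-python library (assuming it is installed via pip install llama-cpp-python); the model file name is an example and should match the .bin file you downloaded.

```python
# Load a GGML quantized model and generate a short completion.
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.ggmlv3.q4_0.bin")
output = llm("Q: What is Llama 2? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```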