
What’s new in Llama 2 & how to run it locally


Llama 2 is a free and open-source large language model that you can run locally on your own machine. It improves on the earlier Llama model.

In this post, you will learn:

  • What the Llama 2 model is.
  • How to install and run the Llama 2 models on Windows.

What is the Llama 2 model?

Llama 2 is a free and open-source large language model released by Meta in July 2023. Two model families were released:

  • Llama 2: Trained on 40% more pretraining data than Llama 1, with double the context length. Good for sentence completion.
  • Llama 2-Chat: Optimized for conversational use, like ChatGPT.

7B, 13B and 70B models are available for both families.

What’s new in Llama 2?

Model architecture

The model architecture is very similar to that of the original Llama model. The main architectural changes are a longer context length and grouped-query attention.

The context length is the number of tokens (similar to words or subwords) a language model can consider in the input text when generating a response. The original Llama model has a context length of 2,048 tokens. Llama 2 doubles this to 4,096.
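As a rough illustration of what a context limit means in practice, the sketch below keeps only the most recent tokens that fit in the window. It is a minimal sketch under stated assumptions: the whitespace split is only a stand-in for the real Llama tokenizer (which works on subword units), and `fit_to_context` is a hypothetical helper, not part of any library.

```python
LLAMA_1_CONTEXT = 2048  # context length of the original Llama, as stated above
LLAMA_2_CONTEXT = 4096  # context length of Llama 2, as stated above


def fit_to_context(tokens, context_length, reserve_for_reply=0):
    """Keep the most recent tokens that fit in the context window.

    reserve_for_reply leaves room for the tokens the model will generate.
    """
    budget = context_length - reserve_for_reply
    if budget <= 0:
        raise ValueError("reserve_for_reply must be smaller than context_length")
    return tokens[-budget:]


# Whitespace splitting is only a stand-in for a real subword tokenizer.
prompt_tokens = ("word " * 5000).split()

kept_v1 = fit_to_context(prompt_tokens, LLAMA_1_CONTEXT)
kept_v2 = fit_to_context(prompt_tokens, LLAMA_2_CONTEXT)
print(len(kept_v1), len(kept_v2))
```

With the same 5,000-token prompt, the larger window simply lets twice as much of the prompt survive truncation.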

Grouped-query attention (GQA) is a new optimization that tackles the high memory usage caused by the increased context length and model size. It shrinks the cache of keys and values for previous tokens by letting groups of query heads share the same key and value heads. GQA is only used in the 34B and 70B Llama 2 models.
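To get a feel for the saving, the following back-of-the-envelope sketch compares the key/value cache size with and without shared key/value heads. The layer count, head counts, head dimension, and 16-bit cache values are assumptions for a 70B-class model, not figures taken from this post; check the model's own configuration for the real numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape for a 70B-class model; not taken from this post.
N_LAYERS = 80
N_QUERY_HEADS = 64
N_KV_HEADS_GQA = 8   # with GQA, groups of query heads share one K/V head
HEAD_DIM = 128
SEQ_LEN = 4096       # Llama 2 context length

# Without GQA, every query head has its own key/value head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS_GQA, HEAD_DIM, SEQ_LEN)

GIB = 1024 ** 3
print(f"multi-head attention cache:    {mha / GIB:.2f} GiB")
print(f"grouped-query attention cache: {gqa / GIB:.2f} GiB")
print(f"reduction factor: {mha // gqa}x")
```

Under these assumed numbers, the cache for a full 4,096-token sequence drops from 10 GiB to 1.25 GiB, an 8x reduction that matches the ratio of query heads to key/value heads.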


The pretraining of Llama 1 and Llama 2 is similar, except that Llama 2 has a larger pretraining dataset: 2.0 trillion tokens, up from 1.0 trillion and 1.4 trillion tokens for the Llama 1 models.
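The arithmetic behind these figures can be checked in a few lines. This is only a sanity check on the numbers quoted above; `percent_increase` is a hypothetical helper.

```python
LLAMA_2_TOKENS = 2.0e12            # 2.0 trillion, as stated above
LLAMA_1_TOKENS = (1.0e12, 1.4e12)  # Llama 1 figures, as stated above


def percent_increase(new, old):
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100


for old in LLAMA_1_TOKENS:
    increase = percent_increase(LLAMA_2_TOKENS, old)
    print(f"{old / 1e12:.1f}T -> 2.0T: +{increase:.0f}%")
```

Relative to 1.4 trillion tokens the increase is about 43%, which is roughly in line with the "40% more" figure quoted earlier. Relative to 1.0 trillion tokens it is a doubling.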

Indeed, the larger pretraining dataset has resulted in higher performance across all metrics evaluated.

Performance comparison between Llama 1 and Llama 2. (Table from the Llama 2 paper)

Supervised Fine-Tuning (SFT)

Quality is more important than quantity when it comes to fine-tuning the model. The model was fine-tuned with the following techniques.

  • SFT Annotations: High-quality prompt and response pairs.
  • Reinforcement Learning from Human Feedback (RLHF): Human annotators indicate which of the model’s answers they prefer. The model is then trained to respond with answers that humans prefer.
  • Human preference: The model learns to provide safe and helpful responses based on human feedback.
Training of Llama 2 (Image from Llama 2 paper.)
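
One common way to turn pairwise human preferences into a training signal is a binary ranking loss on a reward model's scores. The sketch below illustrates that general idea only. It is not the training code from the Llama 2 paper, and the scores are made-up numbers.

```python
import math


def pairwise_ranking_loss(score_chosen, score_rejected):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form.

    The loss is small when the reward model scores the human-preferred
    answer higher than the rejected one, and large otherwise.
    """
    diff = score_chosen - score_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Made-up reward scores for two candidate answers to the same prompt.
agrees = pairwise_ranking_loss(2.0, -1.0)     # reward model agrees with the human
undecided = pairwise_ranking_loss(0.0, 0.0)   # reward model cannot tell them apart
disagrees = pairwise_ranking_loss(-1.0, 2.0)  # reward model prefers the rejected answer

print(f"agrees: {agrees:.3f}  undecided: {undecided:.3f}  disagrees: {disagrees:.3f}")
```

Minimizing this loss pushes the reward model to score preferred answers higher. The chat model can then be optimized against that reward.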

Running Llama 2 locally

Step 1: Install text-generation-webUI

Follow this installation guide for Windows.

Step 2: Download Llama 2 model

Now that you have text-generation-webui running, the next step is to download a Llama 2 model. There are many variants. Which one you need depends on your machine's hardware.

  • Download the models in GPTQ format if you use Windows with an Nvidia GPU.
  • Download the models in GGML format if you run on the CPU on Windows or on an M1/M2 Mac.

Download the largest model size (7B, 13B, or 70B) that your machine can run. Download the Chat models if you want to use them in a conversational style, like ChatGPT.
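
A rough way to judge which size might fit is to estimate the memory needed for the weights alone. The sketch below assumes 4-bit quantized weights and a guessed 20% headroom for the cache and runtime overhead. Both are illustrative assumptions, the helper names are hypothetical, and real quantized files vary in size.

```python
def weight_memory_gib(n_params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1024 ** 3


MODEL_SIZES = {"7B": 7, "13B": 13, "70B": 70}


def largest_model_that_fits(available_gib, bits_per_weight=4, headroom=1.2):
    """Pick the largest size whose weights (plus headroom) fit in memory.

    headroom is a guessed multiplier for cache and runtime overhead.
    Returns None if nothing fits.
    """
    best = None
    # Walk sizes from smallest to largest and remember the last one that fits.
    for name, billions in sorted(MODEL_SIZES.items(), key=lambda item: item[1]):
        needed = weight_memory_gib(billions, bits_per_weight) * headroom
        if needed <= available_gib:
            best = name
    return best


for name, billions in MODEL_SIZES.items():
    print(f"{name}: about {weight_memory_gib(billions, 4):.1f} GiB of 4-bit weights")

print("8 GiB available ->", largest_model_that_fits(8))
```

Treat the result as a starting point only. Longer prompts and other running programs also consume memory.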

Option 1: Windows users with Nvidia GPU

In text-generation-webui, navigate to the Model page. In the Download custom model or LoRA section, enter the Huggingface path listed below for the model you want to download. When the download finishes, refresh the Model list and load the newly downloaded model.

Llama-2-Chat 7B

Huggingface path:


Llama-2-Chat 13B

Huggingface path:


Llama-2-Chat 70B

Huggingface path:



Huggingface path:



Huggingface path:



Huggingface path:


Option 2: Mac users or Windows CPU users

If you use a Mac or Windows without a GPU, you can download the Llama 2 models from the following pages. Each model page lists multiple files, but you DON’T need to download them all. You only need ONE .bin file to run the model. The files are different quantizations of the same model, which aim to reduce the file size.

Download ONE .bin file for a model and put it in the text-generation-webui > models folder. Refresh the Model list on the Model page, then select and load the model to start using it.
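
If the model does not show up in the list, a quick check is to confirm that the .bin file actually landed in the models folder. The sketch below lists the .bin files it finds there. The folder path is an assumption about where text-generation-webui is installed, and `list_model_files` is a hypothetical helper.

```python
from pathlib import Path


def list_model_files(models_dir):
    """Return (file name, size in GiB) for each .bin file in models_dir."""
    folder = Path(models_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"models folder not found: {folder}")
    return [
        (path.name, path.stat().st_size / 1024 ** 3)
        for path in sorted(folder.glob("*.bin"))
    ]


if __name__ == "__main__":
    # Assumed location: adjust to wherever text-generation-webui is installed.
    try:
        for name, size_gib in list_model_files("text-generation-webui/models"):
            print(f"{name}: {size_gib:.2f} GiB")
    except FileNotFoundError as err:
        print(err)
```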

Llama 2 Chat 7B GGML model: Download link

Llama 2 Chat 13B GGML model: Download link

Llama 2 7B model: Download link

Llama 2 13B model: Download link


4 thoughts on “What’s new in Llama 2 & how to run it locally”

  1. Hi,
    1. When I try to load a model, it complains that all GPU memory (4 GB) is used up by Torch, so there is no room left to load the model. How can I resolve this?

    2. How can I switch from the GPU to the CPU?

