A brief history of LLaMA models

LLaMA (Large Language Model Meta AI) is a language model released by Meta (Facebook). It is Meta’s answer to OpenAI’s GPT models. The LLaMA base model was released in February 2023. Since then, a handful of fine-tuned LLaMA models have been released.

It is literally a brief history, but a lot has happened for sure. So let’s do a brief review.

I will cover some developments in models and briefly touch on tools.

  • LLaMA base model
  • Alpaca model
  • Vicuna model
  • Koala model
  • GPT4-x-Alpaca model
  • WizardLM model
  • OpenAssistant model
  • Software to run LLaMA models locally

Below is an overview of the models.

| Model | Size | Training data |
| --- | --- | --- |
| LLaMA (base model) | 7B, 13B, 33B, 65B | Various |
| Alpaca | 7B, 13B | 52k GPT-3 instructions |
| Vicuna | 7B, 13B | 70k ChatGPT conversations |
| Koala-distill | 7B, 13B | 117k cleaned ChatGPT conversations |
| GPT4-x-Alpaca | 13B | 20k GPT4 instructions |
| WizardLM | 7B | 70k instructions synthesized with ChatGPT/GPT-3 |
| OpenAssistant LLaMA | 13B, 30B | 600k human interactions (OpenAssistant Conversations) |

Model comparison

LLaMA base model

Like GPT, LLaMA is intended to be a general-purpose foundational model suitable for further fine-tuning.

LLaMA models have the following variants

  • 7B parameters
  • 13B parameters
  • 33B parameters
  • 65B parameters

The larger the number of parameters, the more powerful the model, but it also takes up more resources to run.


Unlike GPT, LLaMA is an open model, although the weights were released under a non-commercial research license. You can download, study, and run the models locally. Officially, you need to request the model weights through a Google form.

However, the weights were leaked via BitTorrent in March 2023, less than a month after their release.


The objective of LLaMA is to build the best-performing model for a given inference budget, for example, running on an NVIDIA 3090 using less than 10GB VRAM.

Model architecture

LLaMA is a transformer model similar to GPT with the following modifications.

  • Normalize the input of each transformer sub-layer to improve training stability.
  • Use SwiGLU instead of ReLU to improve performance.
  • Use rotary embedding instead of absolute positioning to improve performance.
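
To make the three modifications concrete, here is a minimal sketch in plain Python, working on a single token vector. It assumes RMSNorm as the normalization (which is what the LLaMA paper uses) and one common layout for rotary embeddings, where consecutive feature pairs are rotated. The function names and the tiny dimensions are my own choices for illustration, not Meta’s code.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Pre-normalization: divide x by its root mean square, then rescale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: a SiLU-gated linear unit, then a projection back."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def rotary_embedding(x, position, base=10000.0):
    """Rotate each consecutive feature pair by an angle that grows with position."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

x = [0.5, -1.0, 2.0, 0.25]
h = rms_norm(x, weight=[1.0] * 4)      # normalize the sub-layer input
q = rotary_embedding(h, position=3)    # position is encoded as a rotation
print(h, q)
```

Because the position enters only as a rotation, the dot product between two rotated vectors depends on the distance between their positions, not on the absolute positions.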

The table below summarizes the model parameters.

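| Parameters | Layers | Attention heads | Embedding dimension |
| --- | --- | --- | --- |
| 7B | 32 | 32 | 4096 |
| 13B | 40 | 40 | 5120 |
| 33B | 60 | 52 | 6656 |
| 65B | 80 | 64 | 8192 |

Model parameters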

For reference, GPT-3 has 175B parameters. LLaMA models are small by comparison.


The pre-training data used in LLaMA are

  • English CommonCrawl (67%): Removed non-English text and duplicated content. Pages were further filtered with a classifier trained to tell pages used as references in Wikipedia from random pages.
  • C4 (15%): A cleaned version of CommonCrawl. Similar filters were applied.
  • GitHub (4.5%): Public GitHub dataset available on Google BigQuery.
  • Wikipedia (4.5%): Dumps from the June to August 2022 period, covering 20 languages.
  • Gutenberg and Books3 (4.5%): Both are book datasets.
  • ArXiv (2.5%): Scientific data.
  • StackExchange (2%): High-quality Q&As covering science and engineering topics.

The tokenizer uses byte-pair encoding (BPE), implemented with SentencePiece.
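
If byte-pair encoding is new to you, the toy sketch below shows the core idea: repeatedly merge the most frequent pair of adjacent symbols. It is a simplified illustration in plain Python, not how SentencePiece is implemented, and the function names and the tiny corpus are made up for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn merge rules from a whitespace-separated toy corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(vocab)
```

Frequent character sequences end up as single tokens, while rare words stay split into smaller pieces, so the vocabulary can cover any text with a fixed number of tokens.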

The training data has 1.4T tokens.
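
As a rough back-of-the-envelope check, the sketch below turns the data mix into token counts. It assumes the percentages are sampling proportions over the 1.4T training tokens and takes the ArXiv share as 2.5%, so that the shares add up to 100%. Treat the results as approximate token counts drawn from each source, not as dataset sizes.

```python
# Sampling shares of the pre-training mix (ArXiv taken as 2.5%, an assumption
# that makes the shares sum to 100%).
SAMPLING_SHARE = {
    "CommonCrawl": 0.67,
    "C4": 0.15,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Books": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.02,
}
TOTAL_TOKENS = 1.4e12

def tokens_per_source(shares, total_tokens):
    """Approximate number of training tokens drawn from each source."""
    return {name: share * total_tokens for name, share in shares.items()}

for name, tokens in tokens_per_source(SAMPLING_SHARE, TOTAL_TOKENS).items():
    print(f"{name:14s} ~{tokens / 1e9:,.0f}B tokens")
```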


The authors evaluated the models on tasks such as common sense reasoning, reading comprehension, and code generation.

Summary of performance:

  • Larger is better: Larger models perform better in most tasks.
  • More examples in the prompt are better: Giving 5 examples to the LLaMA 7B model is almost as good as giving none to the 65B model on the Natural Questions tasks.
  • Smaller performant model: LLaMA 13B’s performance is similar to GPT-3’s, despite being more than 10 times smaller (13B vs 175B parameters).
  • LLaMA is not very good at quantitative reasoning, especially the smaller 7B and 13B models.
  • LLaMA is not tuned for instruction following like ChatGPT. However, the 65B model can follow basic instructions. We will wait for Alpaca (not for long).
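
Since a base model only continues text, “giving examples” simply means putting solved examples in front of the question. Below is a minimal sketch of building such a few-shot prompt. The Q/A layout and the sample questions are assumptions for illustration, not the exact format used in the paper’s evaluation.

```python
def build_few_shot_prompt(examples, question):
    """Format solved examples, then leave the last answer open for the model."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Hypothetical examples; a 5-shot prompt would list five of these.
examples = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]
print(build_few_shot_prompt(examples, "What is the largest planet?"))
```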

Model size comparison

How much do you gain by using a bigger LLaMA model? The following table summarizes performance on tasks in different categories. The values are calculated from the scores reported in the research article, assuming linear scales.

Average | Common sense reasoning | Natural Questions | Reading comprehension | TriviaQA | Quantitative reasoning | Code generation | Multitask language understanding
Performance of LLaMA models (normalized to 65B as 100%).
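
The normalization itself is simple: divide each model’s score by the 65B score and multiply by 100. Here is a minimal sketch. The scores in it are hypothetical placeholders to show the arithmetic, not numbers from the paper.

```python
def normalize_to_reference(scores, reference="65B"):
    """Express each score as a percentage of the reference model's score."""
    ref = scores[reference]
    return {model: 100.0 * score / ref for model, score in scores.items()}

# Hypothetical benchmark scores, for illustration only.
hypothetical_scores = {"7B": 40.0, "13B": 50.0, "33B": 70.0, "65B": 80.0}
print(normalize_to_reference(hypothetical_scores))
```

The “linear scale” assumption means a score of 40 is treated as exactly half as good as a score of 80, which is a simplification for most benchmarks.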

Is it worth using a bigger model? You can expect roughly a 50% improvement on average when switching from the 7B to the 65B model.

But it also depends on what you use the models for. You will only see a small gain for common sense reasoning and reading comprehension tasks. You will see a big gain for code generation and technical reading tasks.

Summary for LLaMA

The take-home message of this study is that small models can perform well if you train them with enough data. This opens up the possibility of running a “local ChatGPT” on a PC.

But the LLaMA base model was not trained to follow instructions. This is saved for later development.

To sum up, LLaMA is designed to be a base model for further fine-tuning. Its advantages are

  • Small size
  • Performant – thanks to extensive training
  • Open source

Alpaca model

Alpaca is a fine-tuned LLaMA model, meaning that the model architecture is the same, but the weights are slightly different. It is aimed at resolving the lack of instruction-following capability of LLaMA models.

It behaves like ChatGPT and can follow conversations and instructions.

The 7B and 13B Alpaca models are available.


It was trained to follow instructions like ChatGPT.

The authors generated the training data with OpenAI’s GPT-3 (text-davinci-003). Starting from a small set of human-written seed instructions, they used the Self-Instruct pipeline to produce 52k instruction-following examples.

Training pipeline of Alpaca (Source: Alpaca model page)
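
To give a feel for what instruction-following data looks like, here is a sketch of turning one record into a training prompt. The record fields and the template wording are modeled on what the Stanford Alpaca repository publishes, but treat the exact text as an assumption and check the repository before relying on it. The sample record is made up.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def format_alpaca_prompt(instruction, input_text=""):
    """Pick the template depending on whether the record has an input field."""
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)

# Hypothetical record in the instruction / input / output layout.
record = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
prompt = format_alpaca_prompt(record["instruction"], record["input"])
print(prompt + "\n" + record["output"])
```

During fine-tuning, the model sees the prompt and learns to produce the text in the output field.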

As a result, Alpaca is fine-tuned to respond to conversations like ChatGPT.


A blinded evaluation for instruction-following ability performed by some of the authors ranked the responses of Alpaca 7B and GPT-3 (text-davinci-003 specifically, which is also trained with instructions) roughly equally.

This is a surprising result because Alpaca 7B is about 25 times smaller than GPT-3 (7B vs 175B parameters).

Of course, this is just a narrow aspect of performance. It doesn’t mean Alpaca performs equally with GPT-3 in other areas like code generation and scientific knowledge, which were not tested in the study.


Alpaca is a nice first step in fine-tuning the LLaMA model. As we see in the next section, it is outperformed by a similar fine-tuning effort, Vicuna.

Vicuna model

Vicuna is trained by fine-tuning the LLaMA base models on user-shared conversations collected from ShareGPT. So it is basically fine-tuned with ChatGPT conversations.

It comes in two sizes: 7B and 13B.


The model was fine-tuned by an academic team from UC Berkeley, CMU, Stanford, and UC San Diego.

It was trained with user-contributed ChatGPT conversations, so you can expect its behavior to mimic ChatGPT. Precisely, it was trained with 70,000 ChatGPT conversations that users shared on ShareGPT.

It cost only about $140 to train the 7B model and $300 to train the 13B model.


How good is Vicuna? According to their website, the output quality (as judged by GPT-4…) is about 90% of ChatGPT, making it the best language model you can run locally.

Response quality as judged by GPT-4. (from Vicuna site)

The authors used an interesting method to evaluate the model’s performance: using GPT-4 as the judge. They asked GPT-4 to generate some challenging questions and let Vicuna and other leading language models answer them.

They then asked GPT-4 to evaluate the quality of the answers on different aspects, such as helpfulness and accuracy.
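
A figure like “90% of ChatGPT” can be obtained by adding up the judge’s scores for each model and taking the ratio. The sketch below shows that aggregation. It assumes one numeric score per question for each model, and the scores are hypothetical, not taken from the Vicuna evaluation.

```python
def relative_quality(candidate_scores, reference_scores):
    """Total judge score of the candidate as a fraction of the reference's total."""
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("both models must be scored on the same questions")
    return sum(candidate_scores) / sum(reference_scores)

# Hypothetical per-question scores (1 to 10) assigned by the judge model.
vicuna_scores = [9, 8, 10, 9]
chatgpt_scores = [10, 10, 10, 10]
print(f"{relative_quality(vicuna_scores, chatgpt_scores):.0%}")
```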

Here are the results of comparing Vicuna with LLaMA, Alpaca, Bard, and ChatGPT. In the eyes of GPT-4, Vicuna is almost as good as ChatGPT, beating LLaMA and Alpaca by a large margin.

GPT-4’s judgment. (source: Vicuna model page)


The Vicuna model is considered to be one of the best LLaMA models that you can run locally. But I won’t be surprised if things change in the coming weeks.


Koala model

Koala is a pair of LLaMA models (7B and 13B) fine-tuned with publicly available dialog data by an academic team at UC Berkeley.


The training data includes filtered data from multiple datasets.

They trained two models

  1. Koala-All: Used all datasets
  2. Koala-Distill: Used only the data distilled from ChatGPT


The authors evaluated the performance of Koala-All and Koala-Distill by comparing them with Alpaca and ChatGPT. 100 evaluators from Amazon Mechanical Turk judged the responses of these models to the same prompts.

The results are

  • Koala-All is better than Alpaca but worse than ChatGPT.
  • Koala-Distill is slightly better than Koala-All. This is surprising because Koala-All was fine-tuned with more data.


The take-home message is that the quality of the data is more important than the quantity. Koala-Distill, fine-tuned with ChatGPT data alone, outperforms Koala-All, which was trained with additional data.

Going forward, finding or generating high-quality data to fine-tune LLaMA models is going to be important.


GPT4-x-Alpaca model

GPT4-x-Alpaca is a LLaMA 13B model fine-tuned with GPTeacher, a collection of GPT-4 conversations. There’s not a lot of information on its training and performance.

Below are some community efforts to evaluate the model


WizardLM model

WizardLM is a fine-tuned 7B LLaMA model. It was fine-tuned with a large number of instruction-following conversations of varying difficulty. The novelty of this model is using an LLM to generate training data automatically.


The WizardLM model was trained on 70k computer-generated instructions produced by a new method called Evol-Instruct. The method produces instructions with varying levels of difficulty.

Evol-Instruct expands a prompt with these five operations

  • Add constraints
  • Deepening
  • Concretizing
  • Increase reasoning steps
  • Complicate input

These operations were applied sequentially to an initial instruction to make it more complex.

The responses were generated by an LLM.
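
The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the rewriting directives are my own paraphrases, not the prompts from the WizardLM paper, and `stub_llm` is a stand-in that tags the instruction instead of calling a real model.

```python
# Hypothetical one-line directives for the five operations.
OPERATIONS = {
    "add constraints": "Add one more constraint or requirement to the instruction.",
    "deepening": "Make the instruction ask for more depth on the same topic.",
    "concretizing": "Replace general concepts in the instruction with specific ones.",
    "increase reasoning steps": "Rewrite the instruction so it needs more reasoning steps.",
    "complicate input": "Add a more complex input, such as a table or code, to the instruction.",
}

def evolve(instruction, llm, operations=OPERATIONS):
    """Apply each operation in turn; return every intermediate instruction."""
    history = [instruction]
    for directive in operations.values():
        prompt = f"{directive}\n\nInstruction: {history[-1]}\n\nRewritten instruction:"
        history.append(llm(prompt).strip())
    return history

def stub_llm(prompt):
    """Stand-in for a real LLM call: marks the instruction as one step harder."""
    current = prompt.split("Instruction: ", 1)[1]
    current = current.rsplit("\n\nRewritten instruction:", 1)[0]
    return current + " +"

for step in evolve("Sort a list.", stub_llm):
    print(step)
```

With a real model plugged in as `llm`, each evolved instruction would then be sent to the model once more to generate the response that goes into the training set.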


The authors compared the performance of WizardLM with Alpaca 7B, Vicuna 7B, and ChatGPT. They recruited 10 people to blindly judge the responses of WizardLM and the other models on five aspects: relevance, knowledge, reasoning, calculation, and accuracy.

The authors conclude that:

  • The instructions generated by Evol-Instruct are superior to ShareGPT (used by Vicuna).
  • WizardLM significantly outperforms Alpaca and Vicuna.
  • ChatGPT is better overall, but WizardLM excels in high-complexity questions.

WizardLM excels in answering complex instructions. (Source: WizardLM paper)

The community generally agrees that WizardLM is the current state-of-the-art for 7B models.


OpenAssistant model

OpenAssistant is an open-source effort to develop AI chatbots that are freely available to everyone. The training dataset, OpenAssistant Conversations, contains more than 600k interactions on diverse topics for training various models.

They have released the instruction-tuned LLaMA 13B and 30B models, along with other models trained with the same dataset. Not much performance information is available on the web yet.

Software tools

The development on the software engineering side is equally breathtaking. Currently, the two main ways to run LLaMA models on your PC are

  • llama.cpp (for Mac or CPU only)
  • Oobabooga text-generation-webui


llama.cpp

llama.cpp is written in C++ from the ground up. The goal is to enable running LLaMA models on MacBooks. It is optimized for Apple Silicon M1/M2.

It supports 4-bit quantization to reduce the resources needed for LLaMA models. Quantizing the models reduces the storage and RAM usage at the expense of a slight quality reduction.
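
To see where the slight quality loss comes from, here is a toy sketch of 4-bit quantization: a block of weights shares one scale, and each weight is stored as a small integer. This illustrates the general idea only. It is not llama.cpp’s actual file format, and the block of weights is made up.

```python
def quantize_4bit(block):
    """Map floats to integer codes in [-8, 7] with one shared scale per block."""
    max_abs = max(abs(v) for v in block)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, codes

def dequantize_4bit(scale, codes):
    """Recover approximate weights from the scale and the integer codes."""
    return [scale * c for c in codes]

weights = [0.62, -0.31, 0.08, -0.70, 0.45, 0.0, -0.12, 0.27]  # hypothetical block
scale, codes = quantize_4bit(weights)
restored = dequantize_4bit(scale, codes)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"largest rounding error: {worst_error:.4f}")
```

Each restored weight is off by at most half a quantization step, which is why the model still works but loses a little accuracy.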

A 7B model originally takes about 13 GB of disk space and RAM to load. It takes only about 4 GB after 4-bit quantization.
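
These numbers follow from simple arithmetic: the number of parameters times the bits stored per weight. The sketch below assumes 16-bit weights for the original model and, for the quantized one, 4-bit codes plus one 16-bit scale shared by every 32 weights. The block size and scale width are assumptions for illustration, not a description of any specific file format.

```python
def model_size_gb(num_params, bits_per_weight):
    """Storage needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 6.7e9  # the "7B" LLaMA model has roughly 6.7 billion parameters

fp16_gb = model_size_gb(NUM_PARAMS, 16)
# Assumption: 4-bit codes plus one 16-bit scale per block of 32 weights.
q4_bits = 4 + 16 / 32
q4_gb = model_size_gb(NUM_PARAMS, q4_bits)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

This gives roughly 13.4 GB and 3.8 GB, in line with the figures above. Actual RAM usage is somewhat higher because of the context cache and other runtime buffers.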

Due to its native Apple Silicon support, llama.cpp is an excellent choice for running LLaMA models on Mac M1/M2.

However, it only supports usage in a text terminal. Technically, you can use text-generation-webui as a GUI for llama.cpp. But, as of writing, it could be a lot slower.

See the installation guide on Mac.


text-generation-webui

Oobabooga text-generation-webui is a GUI for using LLaMA models. It can be run on Windows, Linux and Mac.

You should go with this GUI if you have a GPU card on Windows or Linux.

Like llama.cpp, it supports 4-bit quantization (but in a different file format) for model size reduction.

See the installation guide on Windows and Mac.
