
Beginner’s guide to Llama models


This guide is for you if you are new to Llama, a free and open-source large language model. You will find some basic information and answers to common questions.

  • What is Llama?
  • What can you do with it?
  • What model variants are available?
  • How can you install it locally?

What is Llama?

Llama (Large Language Model Meta AI) is a family of large language models (LLMs). It is Meta's (Facebook's) answer to ChatGPT.

But the two companies take different paths. ChatGPT is proprietary: you don't know the model's code, the training data, or the training method. Llama is open-source software. The code, the training method, and descriptions of the training data are public.

Llama was the first major open-source large language model, and it gained instant popularity upon release. In addition to being free and open source, it is fairly small and can run on a personal computer. The 7-billion and 13-billion parameter models are very usable on a good consumer-grade PC.

How does Llama work?

Llama is an AI model designed to predict the next word. You can think of it as a glorified autocomplete. It is trained with text from the internet and other public datasets. Llama 2 was trained on about 2 trillion tokens.
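
You can see this next-word machinery directly. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint name is illustrative (the official meta-llama checkpoints require access approval, and any Llama-family model works the same way):

```python
# Ask a Llama model for its single most likely next word.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; requires access
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every word in the vocabulary

next_token_id = logits[0, -1].argmax().item()  # highest-scoring next token
print(tokenizer.decode([next_token_id]))       # likely " Paris"
```

A chat application simply repeats this step, feeding each predicted token back into the input until the answer is complete.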

You might be curious about why the Llama model appears to be intelligent: It provides reasonable answers to tough questions. It’s capable of rewriting your essay and offering the positives and negatives of various topics.

The training text was written by people. Think of it as human thoughts projected onto text files. When the model learns to finish a sentence, it also learns an aspect of being human.

Can the Llama model understand logic? People have different opinions on this. One side says it can’t because it was created to learn correlations. All it does is guess which word is most likely to come next.

But the other side believes it can. Imagine the training text is a mystery story about a murder, and the model needs to finish the last sentence, “The murderer is”. To guess the next word correctly, it has no choice but to learn logic.

Why use Llama instead of ChatGPT?

ChatGPT requires zero setup, and a free version is available, so it is highly accessible. Why use Llama, then? Here are the reasons to use Llama over ChatGPT.

  • Privacy. You can use Llama locally on your own computer. You don’t need to worry about the questions you asked being stored in a company’s server indefinitely.
  • Confidentiality. You may not be able to use ChatGPT for work-related queries because you are bound by a non-disclosure agreement. You don’t have an NDA with OpenAI, after all.
  • Customization. There are many locally fine-tuned models to choose from. If you don’t like one model’s answers, you can switch to another.
  • Train your model. Finally, you have an opportunity to train your own model using techniques such as LoRA (see the sketch after this list).
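
For a taste of what LoRA training involves, here is a minimal sketch using the Hugging Face peft library. The checkpoint name is illustrative and the actual training loop is omitted; the point is that LoRA adds small trainable adapters instead of updating all the weights:

```python
# Attach LoRA adapters to a Llama model; only the adapters get trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attach to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```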

What can you do with Llama models?

You can use Llama models in the same ways you use ChatGPT.

  • Chat. Ask questions about things you want to know.
  • Coding. Ask for a short program to do something in a specific programming language.
  • Outlines. Ask for an outline of a technical topic.
  • Creative writing. Let the model write a story for you.
  • Information extraction. Summarize an essay, or ask specific questions about it.
  • Rewriting. Restate your paragraph in a different tone and style.

What language does Llama support?

Mostly English. The training data is 90% English.  

Other supported languages include German, French, Chinese, Spanish, Dutch, Italian, Japanese, Polish, Portuguese, and others. But don’t count on them.

This means you shouldn’t use Llama for translation tasks.

What computer hardware do I need?

It depends on the model size. The following table shows the VRAM needed to run a GPTQ model on a GPU card.

| Model | 8-bit | 4-bit |
| --- | --- | --- |
| 7B | 10 GB | 6 GB |
| 13B | 20 GB | 10 GB |
| 30B | 40 GB | 20 GB |
| 70B | 80 GB | 40 GB |

GPU VRAM requirements.

And the following are the RAM requirements for GGML models (for Mac, or for CPU-only Windows or Linux).

| Model | 4-bit quantized |
| --- | --- |
| 7B | 4 GB |
| 13B | 8 GB |
| 30B | 20 GB |
| 70B | 39 GB |

RAM requirements.
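
These numbers follow from simple arithmetic: parameters times bytes per parameter, plus runtime overhead. A minimal sketch (the function is mine, for illustration; real requirements run higher than the raw weight size because of activations and other overhead, which is why the tables above are larger than the raw math):

```python
def weights_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Raw size of the model weights alone, in gigabytes."""
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

print(weights_size_gb(13, 16))  # 26.0 -- the original 16-bit 13B model
print(weights_size_gb(13, 8))   # 13.0 -- 8-bit quantized
print(weights_size_gb(13, 4))   #  6.5 -- 4-bit quantized
```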

What are those 8-bit and 4-bit models?

Large language models are… large. The amount of memory a computer has quickly becomes a bottleneck for running a model.

A parameter of an AI model is typically encoded in 16-bit numbers, which equals 2 bytes. In other words, loading a 13B Llama model takes 26GB, which is impractical for most people.

Quantization is a method to reduce a model’s size while preserving quality. The benefits to you are a smaller file on your hard drive and a lower RAM requirement to run the model.

An 8-bit quantized model takes 8 bits or 1 byte of memory for each parameter. A 4-bit quantized model takes 4 bits or half a byte for each parameter. A 4-bit quantized 13B Llama model only takes 6.5 GB of RAM to load.

Of course, there’s no free lunch. You may see a slight degradation in quality when using the 8-bit and the 4-bit models.
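
To see where that degradation comes from, here is a toy round trip in Python. It uses plain symmetric rounding rather than the actual GPTQ or GGML schemes, which are considerably more sophisticated, but the principle is the same: fewer bits means coarser steps between representable values.

```python
import numpy as np

# Quantize 16-bit weights to 4-bit integers and back, then measure the error.
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=4096).astype(np.float16)

bits = 4
levels = 2 ** (bits - 1) - 1            # 7 integer levels on each side of zero
scale = float(np.abs(weights).max()) / levels
quantized = np.clip(np.round(weights / scale), -levels, levels).astype(np.int8)
restored = quantized.astype(np.float16) * scale

print("max round-trip error:", float(np.abs(weights - restored).max()))
# The restored weights are close but not identical -- the source of the
# slight quality loss in quantized models.
```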

What are the different versions of Llama?

Official models

There are two versions of the official models released by Meta — Llama 1 and Llama 2.

Llama 1

Llama 1 came out in February 2023. The release caused big excitement because it was the first major open-source LLM. It was a big surprise then, though it already feels like a long time ago. Llama 1 spurred many efforts to fine-tune and optimize the model so that it could run locally. Running an LLM on a personal computer was initially thought to be impossible, but hobbyists solved it within a short time.

Llama 2

Although it held great promise, Llama 1 was released with a license that does not allow commercial use. This limited the adoption of the Llama 1 model.

Llama 2 came out in July 2023. It brings incremental improvements in training and model architecture. The most significant change is the license term: Llama 2 is free for commercial use. It is widely expected that this will spark a new round of development, like what happened with Stable Diffusion.

Fine-tuned models

Unlike with ChatGPT, you can make your own Llama model if you are unhappy with its responses. You do that by training it with additional data, a process called fine-tuning.

Here are some popular fine-tuned models.

WizardLM


WizardLM is a family of models fine-tuned with many instruction-following conversations. The novelty of this model is using an LLM to generate training data automatically.

Download links

| Model | Base model | Download links |
| --- | --- | --- |
| WizardLM 7B uncensored | Llama 1 | GPTQ, GGML |
| WizardLM 13B V1.1 | Llama 1 | GPTQ, GGML |
| WizardLM 30B V1.0 | Llama 1 | GPTQ, GGML |

WizardLM models.

Vicuna

Vicuna is fine-tuned with ChatGPT conversations.

| Model | Base model | Download links |
| --- | --- | --- |
| Vicuna 7B v1.3 | Llama 1 | GPTQ, GGML |
| Vicuna 13B v1.3 | Llama 1 | GPTQ, GGML |
| Vicuna 30B v1.3 | Llama 1 | GPTQ, GGML |

Vicuna models.

How to compare the performance of models?

There are so many models to choose from. How do you know which one is the best, whatever that means? And how do the Llama models compare with ChatGPT?

LMSYS hosts a leaderboard comparing the performance of LLMs, including proprietary ones like ChatGPT. It measures three metrics:

  • Chatbot Arena: The answers of two LLMs are presented to users blindly, and the users pick the better one. A ranking score is then calculated for each LLM.
  • MT-bench: GPT-4 judges the LLMs’ answers (this metric favors GPT models).
  • Massive Multitask Language Understanding (MMLU): Test the LLM in 57 tasks, including elementary mathematics, US history, computer science, law, and more.

How much memory does the Llama model have?

The Llama model is stateless. It doesn’t remember your last input. In chat applications, it remembers the previous conversation because it is included as part of the input.

The amount of information the Llama model can process at a time is determined by the context length, the maximum length of the input. It is 2,048 tokens (about 3 pages of text) for Llama 1 and 4,096 tokens (about 6 pages) for Llama 2.
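
Here is a minimal sketch of how a chat application works around this statelessness. The prompt format is illustrative, not the official Llama 2 chat template, and `generate` stands in for any text-completion function:

```python
# Replay the whole conversation in the prompt on every turn.
history = []

def chat(user_message: str, generate) -> str:
    history.append(f"User: {user_message}")
    prompt = "\n".join(history) + "\nAssistant:"
    reply = generate(prompt)              # any LLM completion function
    history.append(f"Assistant: {reply}")
    return reply

# Once the history exceeds the context length (4,096 tokens for Llama 2),
# older turns must be truncated or summarized.
```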

What are GPTQ and GGML model formats?

GPTQ is a quantization method (paper) that reduces an LLM’s size. It is fast and introduces minimal degradation in performance.

GGML is another quantization method (GitHub) that focuses on fast CPU inference, including on Apple Silicon M1/M2 and Intel CPUs.

Which model format should I use?

If you have an Nvidia GPU card, the GPTQ format gives you the best performance.

If you use Mac, Windows without GPU, or Linux without GPU, use the GGML format.

How to install Llama models?

See the installation guide for Windows and the installation guide for Mac.

What is the software to use Llama?

Text-generation-webui is a graphical user interface for running Llama models. It is powerful and easy to use. I recommend this software for general users.

If you prefer a text-only experience and are comfortable with the terminal, llama.cpp is a good choice.
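
There are also Python bindings for llama.cpp (the llama-cpp-python package) if you want to script it. A minimal sketch, assuming a 4-bit GGML file you have already downloaded (the file name is illustrative):

```python
# Run a local GGML model and print its completion.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.ggmlv3.q4_0.bin")
output = llm("Q: What is a llama? A:", max_tokens=64)
print(output["choices"][0]["text"])
```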

Can I use Llama commercially?

No for Llama 1.

Yes for Llama 2.
