Large language models (LLMs), such as Llama and GPT, are typically trained on broad datasets, so they are weak on specialized domain knowledge. Instead of training a new model, Retrieval-Augmented Generation (RAG) is a quick and inexpensive way to fill that gap.
You can use RAG to provide new data to an LLM without retraining it. With RAG, you can pull data from sources like document repositories, databases, or APIs.
In this article, we will study RAG using a Google Colab notebook. You will learn:
- The benefits of using RAG
- How to convert documents into embeddings
- How to store the embeddings in a vector database
- How to use the vector database with a language model (RAG)
What is RAG?
RAG, or Retrieval-Augmented Generation, is an AI framework for improving the responses of large language models (LLMs). RAG helps LLMs give better answers by combining their own knowledge with external information sources.
You typically use a generic LLM, such as Llama and GPT, with RAG.
In the retrieval phase, an algorithm finds relevant information based on the user’s question. The source can be public information like the internet or a set of private documents.
In the generation phase, the LLM uses both the retrieved information and its own knowledge to form an answer. It can also provide source links for transparency.
Benefits of using RAG
The benefits of using Retrieval-Augmented Generation (RAG) are:
- More factual responses: The LLM's responses are grounded in the provided information, so the model is less likely to "hallucinate" incorrect or misleading content.
- Consistency: You are more likely to get the same answer to the same question.
- Cost: Building a RAG pipeline is less expensive than fine-tuning. When the information changes, you only update the database instead of training a new model.
- Currency: The LLM's responses are based on up-to-date data.
- Accessible sources: Users can access the sources for cross-checking. The LLM acts as a helper while the sources serve as the ground truth.
When to use Retrieval-Augmented Generation?
RAG is typically used for knowledge-intensive NLP (Natural language processing) tasks. It is a good way to incorporate new information into LLMs.
For example, the Llama 2 model was trained on data that contains no mention of Llama 2 itself. So if you ask the model a question about Llama 2, such as
what is so special about llama 2?
The LLama 2 model says
Llama 2 is a unique and special animal for several reasons. Here are some of the most notable features that make it stand out…
This is not quite what you meant. But with RAG and the Llama 2 publication as the document source, it says
Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) developed and released by GenAI, Meta. The models are optimized for dialogue use cases and outperform open-source chat models on most benchmarks tested…
That’s pretty much spot on! So RAG is a good way to incorporate new or proprietary information into LLMs.
Retrieval-augmented generation vs fine-tuning
RAG and fine-tuning are the two most popular ways to incorporate new information into LLMs. Both require additional data, but they use it differently.
Fine-tuning performs additional training with the new data. You get a new LLM that captures your data, and you then replace the old model with the new one.
In contrast, RAG does not require changing the model. It incorporates the new data into the prompt, as sketched below.
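To make this concrete, here is a minimal sketch of what "incorporating the data into the prompt" looks like. The retrieved passages and the prompt wording are illustrative, not taken from the notebook.

```python
# Illustrative only: how retrieved passages might be stitched into the prompt.
retrieved_chunks = [
    "Llama 2 is a collection of pretrained and fine-tuned LLMs released by Meta.",
    "The models are optimized for dialogue use cases.",
]
question = "What is so special about Llama 2?"

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(retrieved_chunks) + "\n\n"
    "Question: " + question + "\nAnswer:"
)
print(prompt)  # the prompt, not a retrained model, carries the new information
```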
Cost
RAG is cheaper than fine-tuning. Fine-tuning requires training an LLM, which is typically large.
Performance
Both can achieve good performance. In general, fine-tuning requires more work to get there.
Hallucination
Both can hallucinate (give inaccurate information). RAG gives you more control over hallucination because it supplies accurate context directly in the prompt.
Transparency
RAG is a more transparent approach. You keep the LLM fixed and unmodified. The ability to respond with new information is controlled by the quality of retrieval and how well the prompt is constructed.
Compared with fine-tuning, RAG is therefore easier to debug.
How does RAG work?

You can use a RAG workflow to let an LLM answer questions about documents the model has not seen before.
In this RAG workflow, documents are first broken down into chunks of sentences. They are then transformed into embeddings (a bunch of numbers) using a sentence transformer model. Those embeddings are then stored in a vector database with indexing for fast search and retrieval.
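For example, the chunking step can be done with a text splitter. Here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter; the chunk size, overlap, and file name are arbitrary choices for illustration.

```python
# A minimal chunking sketch; chunk_size/chunk_overlap are illustrative values.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

# "llama2_paper.txt" is a hypothetical file holding the document to index
document_text = open("llama2_paper.txt").read()
chunks = splitter.split_text(document_text)
print(f"{len(chunks)} chunks ready to be embedded and stored")
```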
The RAG pipeline is implemented using LangChain's RetrievalQA chain. It uses similarity search to match the question against the database. The matching sentences and the question together form the input to the Llama 2 Chat LLM.
That’s why the LLM can answer questions based on the document: RAG uses vector search to find relevant sentences and includes them in the prompt!
Note that there are many ways to implement RAG. Using a vector database is just one of the options.
Before you start
We will use this notebook from Pinecone to build a RAG pipeline with Llama 2.
Don’t run the notebook yet. You should first have a free Pinecone account and approval to use the Llama 2 model. Otherwise, you will get stuck in the middle of the notebook.
Getting approval for using Llama 2
Step 1: Fill in the Llama 2 access request form
Fill in the Llama access request form. You will need the Llama 2 & Llama Chat models, but it doesn’t hurt to request the others in one go. You must use the email address associated with your Hugging Face account.
Typically, you will receive the approval email within an hour.
Step 2: Request access to the Llama 2 model
Visit the Llama 2 13B Chat model page. You should see another request form for downloading the model…

Submit the request and wait. You should receive approval within an hour or so.

Now, you are ready to use the notebook.
Using RAG with Llama 2
Open the notebook for RAG with Llama 2. There are three strings you need to replace in the notebook in order to run it:
- PINECONE_API_KEY: your Pinecone API key
- PINECONE_ENV: the environment associated with the API key
- HF_AUTH_TOKEN: your Hugging Face authorization token
They are scattered throughout the notebook, so you can replace each one as you encounter it.
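In effect, you will end up with assignments like the ones below; the values are placeholders, not real credentials.

```python
# Replace these placeholders with your own values; they are not real credentials.
PINECONE_API_KEY = "<your Pinecone API key>"
PINECONE_ENV = "<the environment shown next to the key>"
HF_AUTH_TOKEN = "<your Hugging Face access token>"
```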
Pinecone API Key
PINECONE_API_KEY
Go to Pinecone and create a free account if you don’t have one.
After signing in, click API Keys on the right panel. It should show your API keys. You can use the default one or create a new one.
The PINECONE_API_KEY can be copied using the copy button.

Pinecone environment
PINECONE_ENV

Copy the Pinecone environment PINECONE_ENV under the Environment header.
HF_AUTH_TOKEN
Go to the Hugging Face Access Tokens page. Create a new token or reuse an existing one.

Retrieval-Augmented Generation workflow
The notebook has pretty good notes explaining what it is trying to do. I will describe what each part does to give you a high-level picture, along with some background information.
Step 1. Initializing the Hugging Face Embedding Pipeline
A sentence transformer maps a sentence to an embedding space. The specific model used is sentence-transformers/all-MiniLM-L6-v2, which maps any sentence to a 384-dimensional vector. Sentences with similar meanings are mapped to nearby points in that space.
These embeddings will be used later to match candidate sentences against the question with a vector-based semantic search, which compares how similar they are in meaning (as opposed to matching vocabulary).
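The notebook sets this up through LangChain, but the idea is equivalent to the following sketch, which uses the sentence-transformers library directly; the two example sentences are just for illustration.

```python
# A minimal sketch of the embedding step with the sentence-transformers library.
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = embed_model.encode([
    "Llama 2 is a collection of pretrained and fine-tuned LLMs.",
    "The models are optimized for dialogue use cases.",
])
print(vectors.shape)  # (2, 384): one 384-dimensional vector per sentence
```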
Step 2. Building the Vector Index
Let’s store the embeddings in a database. It is not ideal to use a typical database like MySQL because searching these embeddings for similarity can be time-consuming. We typically store embeddings in a vector database with indexing specially designed for searching for similar vectors.
This section uses a Pinecone vector database, but many others, like Weaviate and Chroma, would also work. The database is hosted remotely on a Pinecone server.
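Roughly, this step creates an index sized for the 384-dimensional embeddings and upserts each chunk together with its original text as metadata. The sketch below uses the classic environment-based pinecone-client API that matches the PINECONE_ENV setting above; the index name and the sample chunk are made up.

```python
# A condensed sketch of index creation and upsert with the classic
# environment-based pinecone-client API.
import pinecone
from sentence_transformers import SentenceTransformer

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

index_name = "llama-2-rag"  # illustrative name
if index_name not in pinecone.list_indexes():
    # 384 matches the all-MiniLM-L6-v2 embedding size; cosine for semantic search
    pinecone.create_index(index_name, dimension=384, metric="cosine")
index = pinecone.Index(index_name)

embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = ["Llama 2 is a collection of pretrained and fine-tuned LLMs."]  # sample chunk
embeddings = embed_model.encode(chunks)

# Store each chunk's embedding along with its original text as metadata
index.upsert(vectors=[
    (f"chunk-{i}", emb.tolist(), {"text": chunks[i]})
    for i, emb in enumerate(embeddings)
])
```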
Step 3. Initializing the Hugging Face Pipeline
Now it’s time to load the Llama 2 Chat model.
An LLM is big, so we reduce its memory footprint by quantizing each parameter to 4 bits. The bitsandbytes library does just that. With quantization, the whole 13B Llama 2 model fits into the GPU's VRAM.
We also need to load the tokenizer that Llama 2 uses so that we can translate our input text into tokens (unique IDs for words and word pieces).
Putting them together, we have a run-of-the-mill Llama 2 chat pipeline: you ask a question, the question is translated into tokens and fed to the Llama 2 model, and its output tokens are translated back into a human-readable response.
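The sketch below shows what this step amounts to: a 4-bit BitsAndBytesConfig, the model and tokenizer loaded with your Hugging Face token, and a standard transformers text-generation pipeline. The generation settings are illustrative rather than the notebook's exact values.

```python
# A sketch of loading Llama 2 13B Chat in 4 bits with bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "meta-llama/Llama-2-13b-chat-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_AUTH_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # place the quantized weights on the GPU
    token=HF_AUTH_TOKEN,
)

# A run-of-the-mill chat pipeline: question in, generated text out
generate = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
)
```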
Step 4: Initializing a RetrievalQA Chain
We now have a vector database for the external knowledge and the LLM. We will use LangChain to put them together.
These code cells set everything up through LangChain's interfaces. We will use the RetrievalQA chain to tie them together. This is how RetrievalQA works (see the sketch after this list):
- You ask a question.
- RetrievalQA translates the question to vector embeddings.
- It searches for similar embeddings in the Pinecone vector database (i.e., searches for similar meanings)
- The original text of the embeddings is added to the prompt as context.
- It adds your question to the prompt.
- It feeds the prompt to the LLM.
- It returns the LLM's response.
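Here is a condensed sketch of that wiring, assuming the text-generation pipeline (generate) from Step 3 and the Pinecone index from Step 2. The index name and the "text" metadata field are assumptions, and the imports follow the classic LangChain layout rather than a specific notebook version.

```python
# A minimal RetrievalQA sketch; assumes `generate` from Step 3 is already defined.
import pinecone
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Pinecone

llm = HuggingFacePipeline(pipeline=generate)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

index = pinecone.Index("llama-2-rag")  # illustrative index name
vectorstore = Pinecone(index, embeddings.embed_query, "text")  # "text" holds each chunk's text

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                # stuff retrieved chunks into the prompt
    retriever=vectorstore.as_retriever(),
)
print(qa.run("What is so special about Llama 2?"))
```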
This completes the RAG pipeline!