
Context length in LLMs: All you need to know


The context length of an LLM is crucial for its use. In this post, we’ll discuss the basics of context length, recent developments, and how to adjust it in text-generation-webui.

What is context length?

Context length is the number of tokens a language model can process at once: the maximum length of the input sequence. Think of it as the model's memory or attention span. It is a fixed, predefined number for a transformer-based model such as ChatGPT or Llama.

A token is the model's way of representing a piece of text as a number. 100 words is roughly 130 tokens. If a model does not recognize a word, it breaks the word into multiple tokens.
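
As a rough illustration, here is a minimal sketch of counting tokens with OpenAI's tiktoken library (the exact word-to-token ratio depends on the tokenizer and on the text):

```python
# Minimal sketch: counting tokens with OpenAI's tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5 / GPT-4

text = "Context length is the number of tokens a language model can process at once."
tokens = enc.encode(text)
print(len(text.split()), "words ->", len(tokens), "tokens")

# A rare or unfamiliar word is split into several sub-word tokens.
print(enc.encode("supercalifragilistic"))
```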

Why is context length important?

Context length is important because

  • Complexity of input. It limits the number of words a language model can process.
    • A model can only summarize an article no longer than the context length.
    • Long-term planning tasks require long input sequences.
    • Longer, more complex input generates richer output content.
  • Memory and coherence. In chat apps, the context length dictates how much of the previous conversation the model can see. Language models like ChatGPT or Llama 2-Chat do not remember anything between requests; they are stateless. They appear to know the past conversation only because it is included in your current input before it is fed to the model (see the sketch below).
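
A minimal sketch of that idea, with count_tokens() and the prompt format as hypothetical stand-ins for a real tokenizer and chat template:

```python
# Hypothetical sketch: a chat app re-sends the whole conversation every turn
# and drops the oldest turns once the prompt no longer fits the context length.
CONTEXT_LENGTH = 4096  # e.g. Llama 2

def count_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)  # rough stand-in for a real tokenizer

def build_prompt(history: list[str], user_message: str) -> str:
    turns = history + [f"User: {user_message}", "Assistant:"]
    # Trim from the oldest turn until the prompt fits in the context window.
    while count_tokens("\n".join(turns)) > CONTEXT_LENGTH and len(turns) > 2:
        turns = turns[1:]
    return "\n".join(turns)
```

Anything trimmed out of the prompt is gone for good: the model never sees it again.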

Context length of GPT and Llama

Below are the context lengths of the GPT and Llama models.

Model      | Context length | Number of English pages*
GPT 3.5    | 4,096          | 6
GPT 4      | 8,192          | 12
GPT 4-32k  | 32,768         | 49
Llama 1    | 2,048          | 3
Llama 2    | 4,096          | 6
Context length comparison. (*Assuming 500 words per page.)
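
A quick back-of-the-envelope check of the "pages" column, using a rule of thumb of roughly 0.75 words per token and the 500-words-per-page assumption above:

```python
# Rough sanity check of the "pages" column (~0.75 words per token, 500 words/page).
for model, context in [("GPT 3.5", 4096), ("GPT 4", 8192), ("GPT 4-32k", 32768),
                       ("Llama 1", 2048), ("Llama 2", 4096)]:
    words = context * 0.75
    print(f"{model}: {context} tokens ~ {words:.0f} words ~ {int(words // 500)} pages")
```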

Read more: The clock is ticking to improve context length.

How to set context length?

The max_seq_len setting is the context length. In Oobabooga text-generation-webui, navigate to the Model page and use the following settings:

  • Model loader: Exllama or Exllama_HF
  • max_seq_len and compress_pos_emb:

max_seq_len | compress_pos_emb
2,048       | 1
4,096       | 2
8,192       | 4
Context length settings for Llama 1 models.

max_seq_len | compress_pos_emb
4,096       | 1
8,192       | 2
16,384      | 4
Context length settings for Llama 2 models.
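
The tables follow a simple rule: compress_pos_emb is the ratio of the target context length to the model's native context length. A sketch:

```python
# compress_pos_emb = desired context length / native context length
def compress_pos_emb(max_seq_len: int, native_len: int) -> int:
    assert max_seq_len % native_len == 0, "pick a multiple of the native length"
    return max_seq_len // native_len

print(compress_pos_emb(8192, 2048))   # Llama 1 (native 2,048) at 8k  -> 4
print(compress_pos_emb(16384, 4096))  # Llama 2 (native 4,096) at 16k -> 4
```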

Context length setting in text-generation-webui.

The native context lengths for Llama 1 and 2 are 2,048 and 4,096 tokens, respectively. You should NOT use a different context length unless the model is fine-tuned for an extended context length.

How is context length implemented?

The context length is simply the maximum length of the input sequence. A transformer layer itself doesn't care about the length of the input sequence: it produces one output vector per input token, whatever the length.
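
A minimal PyTorch sketch of this point: the same transformer layer accepts inputs of different lengths and returns one output vector per input token.

```python
# The same transformer layer handles any sequence length and returns one
# output vector per input token.
import torch
import torch.nn as nn

d_model = 64
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)

for seq_len in (8, 128, 2048):
    x = torch.randn(1, seq_len, d_model)  # (batch, tokens, embedding dim)
    y = layer(x)
    print(tuple(x.shape), "->", tuple(y.shape))  # output length equals input length
```

The limit comes from how positions are encoded and from the sequence lengths seen during training, not from the layer itself, as the next section explains.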

Challenges of increasing context length

Increasing context length is not as simple as feeding the model longer sequences. The model won't break, but the output quality degrades.

So Llama 2 now has a context length of 4,096. Can it reach 32k, like the proprietary GPT 4-32k model? The answer is yes, but it relies on a recent discovery. Before getting to it, we will go through a few ideas for increasing the context length. Then we will know what to do with Llama.

Training with long sequences

In machine learning, more training often solves the problem. However, that isn't very effective for increasing context length. Research shows that finetuning a pretrained model with longer sequences only provides a slight improvement in context length.

So fine-tuning with longer sequences is not the answer.


Better positional encoding

Transformer-based language models process all input tokens at once. How do they know the order of the words? The answer is positional encoding, the transformer's way of telling the model which word goes where.

There are two widely used positional encoding methods:

  • Sinusoidal encoding (used in the original Transformer; GPT uses learned absolute position embeddings instead): A wave-like function of the token's position is added to the input token embedding.
  • Rotary encoding (used in Llama): It also uses a wave-like function, but instead of being added to the input, it is multiplied with the keys and queries in every layer. (Both are sketched below.)
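
Here is a simplified sketch of the two ideas, following the formulas from the original Transformer and RoPE papers (real implementations differ in detail):

```python
# Simplified sketches of the two encodings (real implementations differ in detail).
import math
import torch

def sinusoidal_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Wave-like position vectors that are ADDED to the token embeddings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2) / dim)
    enc = torch.zeros(seq_len, dim)
    enc[:, 0::2] = torch.sin(pos * freqs)
    enc[:, 1::2] = torch.cos(pos * freqs)
    return enc

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key vectors by an angle that depends on their position (RoPE)."""
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos * freqs                                                 # (seq_len, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]                                      # pair up dimensions
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * torch.cos(angles) - x2 * torch.sin(angles)
    out[:, 1::2] = x1 * torch.sin(angles) + x2 * torch.cos(angles)
    return out
```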

As noted above, a transformer layer doesn't care about the length of the input sequence (context length): it produces one output vector per input token regardless of length. But research shows the performance of both methods drops once the context length exceeds the value used during training.

ALiBi (Attention with Linear Biases) is a newer positional encoding method designed to increase the context length without training on longer sequences. It encodes position as a linear bias term added to the attention score (query times key). The bias is proportional to the distance between the two tokens' positions.

ALiBi encodes position information by adding a linear bias term to the attention scores. (Figure from the ALiBi paper.)
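
A minimal sketch of that bias for a single attention head with slope m (in the real method, each head gets a different slope, and a causal model only uses positions j ≤ i):

```python
# ALiBi sketch: a penalty that grows linearly with the distance between tokens,
# added directly to the attention scores (query @ key.T). Single head, slope m.
import torch

def alibi_bias(seq_len: int, m: float = 0.5) -> torch.Tensor:
    pos = torch.arange(seq_len)
    distance = (pos.unsqueeze(1) - pos.unsqueeze(0)).abs().float()  # |i - j|
    return -m * distance

print(alibi_bias(4))  # each entry is -0.5 * |i - j|; nearby tokens are penalized less
```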

This simple encoding works well: the context length can be extended by more than 10 times without much performance degradation!

Perplexity (a performance measure; the lower, the better) holds up well for ALiBi.

Positional interpolation

ALiBi is good, but you can only use it with a new model trained from scratch. There's no way to apply it to pretrained models that don't use ALiBi to begin with, like GPT-3 or Llama.

Is there a way to extend the context length without training a new model? Yes. The Meta team found that interpolating the positional encoding, so that the longer sequence is squeezed into the position range the model was trained on, works well. They finetuned the model a bit to make it work. But hey, that's way cheaper than retraining a new model.

Interpolate the position encoding to extend the context length from 2048 (top) to 4096 (bottom). (Figure from the Interpolation paper.)
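
A sketch of the idea for rotary encodings: positions for the longer sequence are rescaled so they stay within the range seen during training (this is, in effect, what compress_pos_emb controls in text-generation-webui):

```python
# Positional interpolation sketch: squeeze the positions of a longer sequence
# into the position range the model was trained on.
import torch

def interpolated_angles(seq_len: int, dim: int, trained_len: int,
                        base: float = 10000.0) -> torch.Tensor:
    scale = min(1.0, trained_len / seq_len)                     # e.g. 2048 / 4096 = 0.5
    pos = torch.arange(seq_len, dtype=torch.float32) * scale    # compressed positions
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return pos.unsqueeze(1) * freqs                             # angles for queries/keys
```

With a model trained on 2,048 tokens and a 4,096-token input, every position is halved, which corresponds to compress_pos_emb = 2 in the tables above.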

To top it off, a community developer kaiokendev made the same discovery when fine-tuning a Llama model. Thanks to Llama’s open-source development model, researchers can no longer afford to take their time to publish!

Towards a 32k Llama model

How do we get to a Llama model with a 32k context length? There are two options:

  1. Switch to ALiBi positional encoding and train a new Llama 3 model from scratch. This is going to be expensive.
  2. Finetune the Llama 2 model with positional interpolation. This is cheaper, but comes with a slight performance degradation.

A LoRA fine-tune should be possible for option 2.

