The context length of an LLM is crucial for its use. In this post, we’ll discuss the basics of context length, recent developments, and how to adjust it in text-generation-webui.
What is context length?
Context length is the number of tokens a language model can process at once. It is the maximum length of the input sequence. It's like the memory or attention span of the model. It is a predefined number in transformer-based models such as ChatGPT and Llama.
A token is the model's unit of text: a word or piece of a word mapped to a number. 100 words is about 130 tokens. If a model does not recognize a word, it breaks the word into multiple tokens.
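To get a feel for the words-to-tokens ratio, here's a minimal sketch using OpenAI's tiktoken library. The cl100k_base encoding is the one used by GPT-3.5 and GPT-4; Llama uses its own tokenizer, so exact counts will differ.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5 / GPT-4 tokenizer

text = "Context length is the number of tokens a language model can process at once."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
print(tokens[:10])           # token IDs are just integers
print(enc.decode(tokens))    # decoding round-trips back to the original text
```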
Why is context length important?
Context length is important because
- Complexity of input. It limits the number of words a language model can process.
- A model can only summarize an article no longer than the context length.
- Long-term planning tasks require long input sequences.
- Longer, more complex input generates richer output content.
- Memory and coherence. In chat apps, the context length dictates how much of the previous conversation is remembered. Language models like ChatGPT or Llama 2-Chat do not actually remember anything; they are stateless. They appear to know the past conversation only because the previous messages are included in your current input before it is fed to the model (see the sketch below).
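To make "stateless" concrete, here is a minimal sketch of how a chat front end might rebuild the whole prompt every turn. The role labels and prompt format are illustrative, not any particular app's actual template.

```python
# Illustrative sketch: the model has no memory, so the client re-sends the
# entire conversation on every turn. Everything it "remembers" must fit
# within the context length.
history = [
    {"role": "user", "content": "My name is Alice."},
    {"role": "assistant", "content": "Nice to meet you, Alice!"},
    {"role": "user", "content": "What is my name?"},
]

def build_prompt(messages):
    """Concatenate past turns into a single input string for the stateless model."""
    lines = [f"{m['role'].upper()}: {m['content']}" for m in messages]
    lines.append("ASSISTANT:")  # cue the model to produce the next reply
    return "\n".join(lines)

print(build_prompt(history))
```

Once the conversation grows past the context length, the oldest turns have to be dropped or summarized, which is exactly when a chatbot starts "forgetting".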
Context length of GPT and Llama
Below are the context lengths of the GPT and Llama models.
| Model | Context length | Number of English pages* |
|---|---|---|
| GPT 3.5 | 4,096 | 6 |
| GPT 4 | 8,192 | 12 |
| GPT 4-32k | 32,768 | 49 |
| Llama 1 | 2,048 | 3 |
| Llama 2 | 4,096 | 6 |
Read more:
- What is the difference between the GPT-4 models? (OpenAI)
- Llama 2 – Resource Overview (Meta AI)

How to set context length?
The max_seq_len setting is the context length. In Oobabooga text-generation-webui, navigate to the Model page and use the following settings:
- Model loader: Exllama or Exllama_HF
- max_seq_len and compress_pos_emb
For Llama 1 models (native context length 2,048):

| max_seq_len | compress_pos_emb |
|---|---|
| 2,048 | 1 |
| 4,096 | 2 |
| 8,192 | 4 |

For Llama 2 models (native context length 4,096):

| max_seq_len | compress_pos_emb |
|---|---|
| 4,096 | 1 |
| 8,192 | 2 |
| 16,384 | 4 |

The native context lengths for Llama 1 and 2 are 2,048 and 4,096 tokens, respectively. You should NOT use a different context length unless the model is fine-tuned for an extended context length.
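The pattern in the tables above boils down to one ratio. Here is a tiny sketch of the rule of thumb (an illustrative helper, not text-generation-webui's actual code):

```python
# compress_pos_emb is the desired context length divided by the model's
# native (training) context length.
def compress_pos_emb(max_seq_len: int, native_len: int) -> float:
    return max_seq_len / native_len

print(compress_pos_emb(8192, 2048))    # Llama 1 stretched to 8k  -> 4
print(compress_pos_emb(16384, 4096))   # Llama 2 stretched to 16k -> 4
print(compress_pos_emb(4096, 4096))    # Llama 2 at native length -> 1
```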
How is context length implemented?
The context length is simply the maximum length of the input sequence. A transformer layer doesn't care about the length of the input sequence; it produces one output vector per input token, whatever the length.
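A quick sketch with PyTorch's stock encoder layer illustrates this (a generic layer, not any specific LLM's code): the output always has one vector per input token, regardless of sequence length.

```python
import torch
import torch.nn as nn

# A generic transformer encoder layer; real LLMs also add positional
# information, which is where the practical length limit comes from.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

for seq_len in (8, 512, 4096):
    x = torch.randn(1, seq_len, 64)       # (batch, tokens, embedding dim)
    y = layer(x)
    print(seq_len, "->", tuple(y.shape))  # one output vector per input token
```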
Challenges of increasing context length
Increasing the context length is not as simple as feeding the model longer sequences. Although the model won't break, the output quality degrades.
So Llama 2 has a context length of 4,096. Can it reach 32k like the proprietary GPT 4-32k model? The answer is yes, but it relies on a recent discovery. Before getting to it, we will go through a few ideas for increasing the context length. Then we will see what to do with Llama.
Training with long sequences
In machine learning, more training often solves problems. However, it isn't very effective for increasing context length. Research shows that finetuning a pretrained model with long sequences provides only a slight improvement in context length.
So fine-tuning with longer sequences is not the answer.

Better positional encoding
Transformer-based language models process all input tokens at once. How does the model know the order of the words? The answer is positional encoding, the transformer's way of telling the model which word comes first.
There are two widely used positional encoding methods:
- Sinusoidal encoding (used in GPT): A wave-like function of the token's position is added to the input token embedding (see the sketch after this list).
- Rotary encoding (used in Llama): It similarly uses a wave-like function, but instead of being added to the input, it is multiplied with the queries and keys in every attention layer.
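As a concrete reference for the first method, here is a minimal sketch of sinusoidal positional encoding, following the standard sine/cosine formula (the dimensions are illustrative):

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, dim: int) -> np.ndarray:
    """One wave-like vector per position, to be added to the token embeddings."""
    positions = np.arange(num_positions)[:, None]            # (positions, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))    # (dim/2,)
    angles = positions * freqs                               # (positions, dim/2)
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(angles)   # even dimensions get sine
    enc[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return enc

pe = sinusoidal_encoding(num_positions=2048, dim=128)
print(pe.shape)   # (2048, 128): one encoding vector per position
```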
As noted above, a transformer layer itself doesn't care about the length of the input sequence (context length); it produces one output vector per input token. But research showed that the performance of both methods drops when the context length exceeds the value used during training.
ALiBi (Attention with Linear Biases) is a newer positional encoding method designed to increase the context length without training on longer sequences. Position is encoded as a bias term added to the attention score (query times key), and the bias is linear in the distance between the two tokens.

This simple encoding works well for extending the context length. Context length can be extended by more than 10 times without much performance degradation!
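Here is a minimal sketch of the ALiBi bias for a single attention head. The slope value is illustrative (ALiBi assigns each head its own fixed slope), and the causal mask is handled separately.

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    """Bias added to the attention scores: -slope * distance between query and key."""
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]   # i - j for query i, key j
    # Only past/current keys matter; future keys are removed by the causal mask.
    return -slope * np.maximum(distance, 0)

# bias[i, j] grows more negative the further key j lies behind query i,
# so distant tokens get less attention.
print(alibi_bias(seq_len=5, slope=0.5))
```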

Positional interpolation
ALiBi is good, but you can only use it in a new model trained from scratch. There's no way to apply it to pretrained models that don't use ALiBi to begin with, like GPT-3 or Llama.
Is there a way to extend the context length without training a new model? Yes, the Meta team found that stretching the positional encoding to cover the new context length, by squeezing more positions into the range the model was trained on, works well. They finetuned the model a bit to make it work. But hey, that's way cheaper than training a new model.
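Here is a minimal sketch of the idea applied to rotary (RoPE) positions, assuming the usual RoPE frequency formula; this is illustrative, not Meta's actual implementation. Scaling positions down by the interpolation factor is what the compress_pos_emb setting above corresponds to.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, scale: float = 1.0) -> np.ndarray:
    """Rotation angles for each (position, frequency) pair; scale < 1 interpolates."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions * scale, freqs)

native_len, new_len = 2048, 8192
positions = np.arange(new_len)

# Squeeze 8,192 positions into the 0..2,047 range the model was trained on
# (scale = native / new = 1 / compress_pos_emb).
angles = rope_angles(positions, dim=128, scale=native_len / new_len)
print(angles[-1, 0])                                        # ~2047.75
print(rope_angles(np.arange(native_len), dim=128)[-1, 0])   # 2047.0 (native max)
```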

To top it off, community developer kaiokendev made the same discovery independently while fine-tuning a Llama model. Thanks to Llama's open-source development model, researchers can no longer afford to take their time to publish!
Towards a 32k Llama model
How do we get to a Llama model with a 32k context length? There are two options:
- Switch to ALiBi positional encoding and train a new Llama 3 model from scratch. This is going to be expensive.
- Finetune the Llama 2 model with positional interpolation. This is cheaper but with slight performance degradation.
A LoRA should be possible for option 2.
Further readings
- Extending Context Window of Large Language Models via Positional Interpolation (2023) – The positional interpolation paper for extending the context length of a pretrained model.
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (2021) – The ALiBi positional encoding extends context length.
- RoFormer: Enhanced Transformer with Rotary Position Embedding (2021) – The rotary positional encoding used in Llama models.