What is a context window?


A context window refers to the amount of information a large language model (LLM) can process in a single prompt. Context windows are like a human’s short-term memory: like us, LLMs can only “look” at so much information at once. So in Q&A applications such as Anthropic’s Claude, OpenAI’s ChatGPT, or Google’s Gemini, the information loaded into the context window is combined with the model’s vastly larger body of pretrained knowledge to deliver the best responses to user prompts.

Get to know and directly engage with senior McKinsey experts on context windows

Lareina Yee is a senior partner in McKinsey’s Bay Area office, where Roger Roberts is a partner and Michael Chui is a senior knowledge fellow; Mara Pometti is a consultant in the London office; and Stephen Xu is senior director of product management in McKinsey’s Gen AI Lab in Toronto.

When ChatGPT first went public in late 2022, consumers were surprised and delighted by generative AI’s (gen AI’s) new abilities to write poems and summarize meeting notes. Developers, however, were grappling with a different challenge. At the time, GPT-3 (the LLM on which ChatGPT was initially built) could process only 2,048 tokens at a time (about 1,500 words). That’s not enough for many enterprises, which frequently need to process large documents, from insurance plans to vendor contracts and technical manuals. Since then, context windows have grown dramatically. In mid-2023, Anthropic announced LLMs with context windows that could process 100,000 tokens. Today, Google’s Gemini model can process two million tokens at once, or roughly 3,000 pages of text. (For more, see sidebar, “What is a token?”)
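The token counts above can be made concrete with a common rule of thumb: in English, one token averages roughly 0.75 words. A minimal sketch, assuming that heuristic in place of a real subword tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~0.75-words-per-token heuristic.

    Real models use subword tokenizers (such as BPE), so actual counts
    vary; this is only a ballpark for sizing text against a window.
    """
    return round(len(text.split()) / 0.75)

def fits_in_window(text: str, window_tokens: int = 2048) -> bool:
    """Check whether text plausibly fits in a given context window."""
    return estimate_tokens(text) <= window_tokens

# A 1,500-word document sits just under GPT-3's original 2,048-token limit.
doc = "word " * 1500
print(estimate_tokens(doc), fits_in_window(doc))  # 2000 True
```

By the same arithmetic, a two-million-token window corresponds to roughly 1.5 million words, on the order of 3,000 pages at about 500 words per page.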

Learn more about QuantumBlack, AI by McKinsey.

How do short context windows work?

Before these breakthroughs in context windows, teams working with gen AI had to get creative with prompt engineering to make the most of their 1,500 words. The goal was to carefully select the right prompt to maximize output quality and performance. This involved concision on two levels: first, when crafting the system prompt (or instructions to the system on how to act) and second, when providing business context (such as relevant snippets of business knowledge).
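In practice, that concision amounted to a budgeting exercise: the system prompt, the business context, and the user’s question all had to share the same 2,048 tokens. A minimal sketch of the idea, where the function name and the words-per-token heuristic are illustrative, not any vendor’s API:

```python
def build_prompt(system_prompt: str, context_snippets: list[str],
                 question: str, budget_tokens: int = 2048) -> str:
    """Assemble a prompt that stays within a fixed token budget.

    Tokens are approximated as words / 0.75; a real system would use
    the model's own tokenizer.
    """
    def tokens(text: str) -> int:
        return round(len(text.split()) / 0.75)

    # Reserve space for the instructions and the question first.
    used = tokens(system_prompt) + tokens(question)
    kept = []
    for snippet in context_snippets:  # ordered most relevant first
        if used + tokens(snippet) > budget_tokens:
            break  # drop whatever no longer fits
        kept.append(snippet)
        used += tokens(snippet)
    return "\n\n".join([system_prompt, *kept, question])
```

Everything past the budget is simply left out, which is why selecting the most relevant snippets mattered so much.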

Given these constraints, gen AI practitioners developed new techniques to improve output. First, retrieval-augmented generation (RAG) allowed engineers to find the most relevant snippets of information across a repository of documents. Second, fine-tuning enabled practitioners to change the model and steer its behavior with examples. Finally, prompt orchestration made it possible to chain prompts together, each tackling a different part of the problem.
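The retrieval step at the heart of RAG can be sketched without any model at all: score each snippet against the query and keep only the top few. Plain word overlap stands in here for the embedding-based similarity real systems use, and the documents are made up for illustration:

```python
def retrieve(query: str, snippets: list[str], k: int = 2) -> list[str]:
    """Return the k snippets sharing the most words with the query.

    Word overlap is a stand-in for embedding similarity; it is enough
    to show the shape of the retrieval step.
    """
    query_words = set(query.lower().split())
    ranked = sorted(snippets,
                    key=lambda s: len(query_words & set(s.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "The vendor contract renews every January.",
    "Claims must be filed within 30 days of the incident.",
    "Our insurance plan covers water damage and theft.",
]
print(retrieve("What does the insurance plan cover?", docs, k=1))
# → ['Our insurance plan covers water damage and theft.']
```

Only the retrieved snippets, not the whole repository, are then placed into the prompt, keeping the context window small.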

What are the advantages of longer context windows?

Models with long context windows have many advantages, including the following:

  • Ingesting vast amounts of fresh data. New use cases might include debugging the current version of a large codebase or extracting insights from all reviews of a global company’s products in the past week.
  • Supporting increased developer productivity. In simple terms, when more data can be taken into context, less work must be done outside the model to improve output. Open-source models with long-context capabilities, such as Google’s Gemma or Meta’s Llama, are now making this more accessible.
  • Making use of multimodal data. For example, an insurance claims agent might upload video, audio, images, and text to an application built on a long-context-window model to aid in drafting a report quickly.


What has been the process for extending context windows?

Engineers have made impressive progress in expanding context windows; some speculate that context windows of near-infinite length may someday be possible. While this work involves unique challenges, it has also led to valuable innovations:

  • Enhanced model performance over longer contexts. As context windows grew in length, researchers noticed that the models tended to focus on the beginning or the end of a prompt. The latest models seem to have overcome this tendency, demonstrating improved abilities to retain the start-to-finish coherence needed in lengthy inputs. Google DeepMind researchers published a study in April 2024 demonstrating these improved capabilities.
  • Expanded training data sets. New, long-context data sets help models learn to process more extensive texts and other forms of content, enhancing their ability to work with lengthier, more complex inputs.
  • Scaled hardware capabilities. Architecture and hardware are continually being optimized, allowing for faster processing of extended contexts with low latency, even for vast inputs. These developments ensure responsive performance, allowing models to serve users with speed and accuracy across diverse applications.

As researchers continue to rapidly expand context windows via novel model structures, more efficient long-context data sets, and innovative training techniques, the field of AI is pushing the boundaries of what can be achieved.


What challenges are associated with longer context windows?

Model builders are working to address several drawbacks of longer context windows. These include loss of explainability (due to decreased visibility into how a model arrived at an output), slower response times given the increased number of computations, and higher costs per query given typical token-based pricing and the larger volumes of data passed to the model. These challenges must be weighed against the value they unlock.
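The cost point is easy to quantify: under token-based pricing, a query’s cost grows roughly linearly with the number of tokens passed in. The per-token rates below are hypothetical placeholders, not any provider’s actual prices:

```python
def query_cost(input_tokens: int, output_tokens: int,
               input_price_per_1k: float = 0.003,
               output_price_per_1k: float = 0.015) -> float:
    """Estimate one query's cost under hypothetical per-1K-token rates."""
    return (input_tokens / 1000) * input_price_per_1k \
        + (output_tokens / 1000) * output_price_per_1k

# Filling a two-million-token window costs hundreds of times more than
# sending a short prompt, even at these illustrative rates.
print(f"{query_cost(2_000_000, 1_000):.2f}")  # long-context query
print(f"{query_cost(2_000, 1_000):.3f}")      # short prompt
```

The same arithmetic explains the latency point: more input tokens mean more computation per query.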

Given the fast-paced nature of AI innovation, these limitations may well be overcome in the near future. Teams should continue to test and monitor the latest technologies for their tasks and use cases to make informed decisions.


What can organizations do today to take advantage of these capabilities?

Long context windows accelerate the already blistering pace of gen AI development, enabling models to process immense and diverse data sources—from expansive text collections to hours of multimedia. With these extended windows, organizations can prototype and iterate quickly on solutions grounded in deep, nuanced domains and deliver more relevant output.

Developing new architectures beyond today’s chat-based interfaces will be key to taking full advantage of these capabilities. These architectures may mean less work outside the model to create great results, which in turn could accelerate innovation and provide the scaffolding needed to move beyond this initial set of gen AI applications. They may also open new avenues of creativity for designers who shape experiences for employees and customers. Increasingly, the constraint is imagination, not computation.

However, like any other shiny new technology, longer context windows are no silver bullet. They’re just one more tool teams can deploy—on top of existing, well-functioning operations, data processes, and strategic planning—to build and scale better AI-powered products and services for users.

