Context Windows and Token Limits Explained
Every AI model has a context window — the maximum amount of text it can "see" at one time. Understanding how context windows work is essential for using models effectively, especially for long documents, extended conversations, or complex codebases. This guide explains tokenization, context limits, and practical strategies for working with them.
What Is a Token?
Models don't process text character by character or word by word — they process tokens. A token is a chunk of text that the model treats as a unit. Tokens are defined by the model's tokenizer and roughly correspond to:
- ~4 characters of English text per token on average
- 100 tokens ≈ 75 words
- Common words are single tokens ("the", "is", "code")
- Long or unusual words may be split into multiple tokens
- Punctuation, spaces, and newlines count as tokens
- Code is tokenized differently than prose — some code is very token-efficient, some isn't
Token Counting Examples
"Hello, world!" → ~4 tokens
"The quick brown fox jumps over the lazy dog" → ~9 tokens
"Supercalifragilisticexpialidocious" → ~8 tokens
A typical business email is 200-400 tokens. A book chapter is 2,000-5,000 tokens. A large codebase could be millions of tokens.
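The rules of thumb above can be folded into a quick back-of-the-envelope estimator. This is a hypothetical helper based on the ~4-characters-per-token heuristic, not a real tokenizer — actual counts come from the model's own tokenizer and can differ noticeably, especially for code:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters of English per token.

    A planning heuristic only; the model's real tokenizer may count
    more or fewer tokens, especially for code or unusual words.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello, world!"))                                # 3
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 11
```

Note that the heuristic lands near, but not exactly on, the real counts in the examples above — for budgeting purposes, that's usually close enough.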
What Is a Context Window?
The context window is the maximum number of tokens the model can process in a single inference call — combining your input (prompt, conversation history, attached documents) and the model's output.
Input + Output ≤ Context Window
If your input is 100,000 tokens and your context window is 128,000 tokens, the model can generate a maximum of 28,000 tokens in response.
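That budget arithmetic is worth making explicit. A minimal sketch of the calculation, using the numbers from the example above:

```python
def max_output_tokens(input_tokens: int, context_window: int) -> int:
    """Remaining room for generation: output = context window - input."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("Input already fills or exceeds the context window")
    return remaining

print(max_output_tokens(100_000, 128_000))  # 28000
```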
Context Window Comparison (2025-2026)
| Model | Context Window | Typical Cost |
|---|---|---|
| Claude Opus 4 | 200,000 tokens | High |
| Claude Sonnet 4 | 200,000 tokens | Medium |
| Gemini 1.5 Pro | 2,000,000 tokens | Medium |
| Gemini 2.5 Pro | 1,000,000 tokens | High |
| GPT-4o | 128,000 tokens | Medium |
| GPT-4o mini | 128,000 tokens | Low |
| o3 | 200,000 tokens | Very High |
Larger context windows enable: processing entire books, analyzing full codebases, maintaining very long conversations, and multi-document analysis.
How Models Use Context
Attention and the "Lost in the Middle" Problem
Models use an attention mechanism to connect information across the context. Research shows that model performance on recall tasks degrades for information that appears in the middle of very long contexts — models pay more attention to the beginning and end.
Practical implication: Put critical instructions and key information at the start and end of your prompt for long-context tasks:
[State task clearly — what you need and what to focus on]
[Long document / codebase / conversation]
[Restate the specific question or task so it's close to the generation point]
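The "sandwich" layout above can be captured in a small template helper. `build_long_context_prompt` is an illustrative function, not a library API — the separators and wording are assumptions you'd tune for your own use:

```python
def build_long_context_prompt(task: str, document: str, question: str) -> str:
    """Sandwich layout for long-context prompts: instructions at the start,
    the long material in the middle, and the specific question restated at
    the end, near the generation point, to counter lost-in-the-middle
    recall degradation."""
    return f"{task}\n\n---\n{document}\n---\n\n{question}"

prompt = build_long_context_prompt(
    "Review this contract for liability clauses.",
    "[long contract text]",
    "List every clause that limits or transfers liability.",
)
```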
Conversation History
In multi-turn conversations, all previous messages are included in the context for each new message. As conversations grow:
- Earlier messages get farther from the generation point
- The context window fills up
- Eventually, the oldest messages are truncated
Most chat interfaces handle this automatically (truncating from the oldest), but you lose early context silently.
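A sketch of what that silent truncation looks like, assuming a simple characters-per-token estimate (real interfaces use the actual tokenizer and often preserve system messages):

```python
def truncate_history(messages: list[str], budget: int) -> list[str]:
    """Drop the oldest messages until the estimated token total fits
    the budget — keep the newest turns, discard from the front.

    Token costs here are heuristic (~4 chars/token), not real counts.
    """
    kept: list[str] = []
    total = 0
    for msg in reversed(messages):  # walk newest-first
        cost = max(1, len(msg) // 4)
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))     # restore chronological order
```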
Practical Strategies for Token Management
1. Start Fresh for New Tasks
Don't continue a long conversation just because you're working with the same model. Start a new session for new tasks to avoid:
- Irrelevant context from previous topics
- Context window exhaustion
- Old information conflicting with new
2. Summarize Long Conversations
When a conversation is getting long:
"Summarize our conversation so far into a brief context summary I can use
to start a fresh session. Include: key decisions made, current approach,
and remaining open questions."
Start a new session with that summary as your opening context.
3. Be Selective About What You Include
For code tasks:
- Include function signatures without implementations for adjacent files
- Include only the specific test file that's failing
- Include only the relevant migration, not the entire schema history
For document analysis:
- Paste only the relevant sections
- Or explicitly tell the model which section to focus on
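The first code-task tip — signatures without implementations — can be automated for Python files with the standard-library `ast` module. A minimal sketch (it handles plain positional arguments only; decorators, defaults, and async functions are left out for brevity):

```python
import ast

def signatures_only(source: str) -> str:
    """Return just the def lines of a Python module, bodies stripped.

    Gives a model the shape of adjacent files without spending
    tokens on their implementations.
    """
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
    return "\n".join(lines)

print(signatures_only("def add(a, b):\n    return a + b\n"))
# def add(a, b): ...
```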
4. Chunking for Very Long Documents
For documents too long even for large context windows:
I'm going to send you a long document in sections.
After each section, just say "Received section N" and note
3 key points. After all sections, I'll ask for analysis.
Section 1 of 5:
[content]
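Splitting the document itself can also be scripted. A minimal chunker, assuming paragraphs are separated by blank lines and using a character budget as a stand-in for token counts:

```python
def chunk_document(text: str, chunk_chars: int = 2000) -> list[str]:
    """Split a long document into pieces of at most ~chunk_chars,
    breaking at paragraph boundaries so no paragraph is cut mid-way
    (a lone paragraph longer than the budget still gets its own chunk).
    """
    paragraphs = text.split("\n\n")
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as "Section N of {len(chunks)}" following the prompt pattern above.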
5. Use the Right Model for the Task
- Long document analysis: Gemini 1.5 Pro (2M tokens) or Claude (200K)
- Short, fast tasks: GPT-4o mini or Gemini Flash
- Extended coding sessions: Models with large context that handle code well
Token Estimation for Planning
Rough estimates for planning:
| Content | Approximate Tokens |
|---|---|
| 1 page of prose | ~500 tokens |
| 1,000 words | ~1,330 tokens |
| A short story (5,000 words) | ~6,700 tokens |
| A novel chapter (10,000 words) | ~13,300 tokens |
| 100 lines of Python | ~200-400 tokens |
| 1,000 lines of Rust | ~2,000-4,000 tokens |
| A PDF page with tables | ~300-600 tokens |
Most model interfaces display current context window usage. Monitor it when working with large inputs.
Output Tokens vs. Input Tokens
Input tokens (your prompt) and output tokens (the model's response) are usually priced differently:
- Input tokens are typically cheaper
- Output tokens cost more
For long generation tasks:
- Estimate how many output tokens you need
- Don't set `max_tokens` too low (you'll get cut-off responses)
- Don't set it too high if you only need a short response (costs more)
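One way to estimate a reasonable `max_tokens` value is to work backwards from a target word count using the 100-tokens-≈-75-words rule. `suggest_max_tokens` is a hypothetical helper, and the 20% headroom is an assumed safety margin, not a standard:

```python
def suggest_max_tokens(target_words: int, headroom: float = 1.2) -> int:
    """Convert a target response length in words to a max_tokens value.

    Uses the 100 tokens ≈ 75 words rule of thumb, plus a safety
    margin (default 20%) so responses aren't cut off mid-sentence.
    """
    tokens = target_words * 100 / 75
    return round(tokens * headroom)

print(suggest_max_tokens(750))  # 1200
```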
Key Takeaways
- Tokens ≈ 4 characters of English; input + output must fit within the context window
- Information in the middle of very long contexts is recalled less reliably — put key instructions at start and end
- Start fresh sessions for new tasks; don't carry irrelevant conversation history
- Summarize long conversations rather than continuing indefinitely
- Match model context window size to your actual needs — bigger isn't always better if you're paying for it