Context Windows and Token Limits Explained
Every AI model has a context window — the maximum amount of text it can "see" at one time. Understanding how context windows work is essential for using models effectively, especially for long documents, extended conversations, or complex codebases. This guide explains tokenization, context limits, and practical strategies for working with them.
What Is a Token?
Models don't process text character by character or word by word — they process tokens. A token is a chunk of text that the model treats as a unit. Tokens are defined by the model's tokenizer and roughly correspond to:
- ~4 characters of English text per token on average
- 100 tokens ≈ 75 words
- Common words are single tokens ("the", "is", "code")
- Long or unusual words may be split into multiple tokens
- Punctuation, spaces, and newlines count as tokens
- Code is tokenized differently than prose — some code is very token-efficient, some isn't
Token Counting Examples
"Hello, world!" → ~4 tokens
"The quick brown fox jumps over the lazy dog" → ~9 tokens
"Supercalifragilisticexpialidocious" → ~8 tokens
A typical business email is 200-400 tokens. A book chapter is 2,000-5,000 tokens. A large codebase could be millions of tokens.
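The rules of thumb above can be folded into a quick back-of-the-envelope estimator. This is a hypothetical helper based on the ~4-characters-per-token heuristic, not a real tokenizer — actual counts come from the model's own tokenizer and can differ noticeably, especially for code:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters of English per token.

    A planning heuristic only; the model's real tokenizer may count
    more or fewer tokens, especially for code or unusual words.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello, world!"))                                # 3
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 11
```

Note that the heuristic lands near, but not exactly on, the real counts in the examples above — for budgeting purposes, that's usually close enough.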
What Is a Context Window?
The context window is the maximum number of tokens the model can process in a single inference call — combining your input (prompt, conversation history, attached documents) and the model's output.
Input + Output ≤ Context Window
If your input is 100,000 tokens and your context window is 128,000 tokens, the model can generate a maximum of 28,000 tokens in response.
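That budget arithmetic is worth making explicit. A minimal sketch of the calculation, using the numbers from the example above:

```python
def max_output_tokens(input_tokens: int, context_window: int) -> int:
    """Remaining room for generation: output = context window - input."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("Input already fills or exceeds the context window")
    return remaining

print(max_output_tokens(100_000, 128_000))  # 28000
```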
Context Window Comparison (2025-2026)
| Model | Context Window | Typical Cost |
|---|---|---|
| Claude Opus 4 | 200,000 tokens | High |
| Claude Sonnet 4 | 200,000 tokens | Medium |
| Gemini 1.5 Pro | 2,000,000 tokens | Medium |
| Gemini 2.5 Pro | 1,000,000 tokens | High |
| GPT-4o | 128,000 tokens | Medium |
| GPT-4o mini | 128,000 tokens | Low |
| o3 | 200,000 tokens | Very High |
Larger context windows enable: processing entire books, analyzing full codebases, maintaining very long conversations, and multi-document analysis.
How Models Use Context
Attention and the "Lost in the Middle" Problem
Models use an attention mechanism to connect information across the context. Research shows that model performance on recall tasks degrades for information that appears in the middle of very long contexts — models pay more attention to the beginning and end.
Practical implication: Put critical instructions and key information at the start and end of your prompt for long-context tasks:
[State task clearly — what you need and what to focus on]
[Long document / codebase / conversation]
[Restate the specific question or task so it's close to the generation point]
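The "sandwich" layout above can be captured in a small template helper. `build_long_context_prompt` is an illustrative function, not a library API — the separators and wording are assumptions you'd tune for your own use:

```python
def build_long_context_prompt(task: str, document: str, question: str) -> str:
    """Sandwich layout for long-context prompts: instructions at the start,
    the long material in the middle, and the specific question restated at
    the end, near the generation point, to counter lost-in-the-middle
    recall degradation."""
    return f"{task}\n\n---\n{document}\n---\n\n{question}"

prompt = build_long_context_prompt(
    "Review this contract for liability clauses.",
    "[long contract text]",
    "List every clause that limits or transfers liability.",
)
```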
Conversation History
In multi-turn conversations, all previous messages are included in the context for each new message. As conversations grow:
- Earlier messages get farther from the generation point
- The context window fills up
- Eventually, the oldest messages are truncated
Most chat interfaces handle this automatically (truncating from the oldest), but you lose early context silently.
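A sketch of what that silent truncation looks like, assuming a simple characters-per-token estimate (real interfaces use the actual tokenizer and often preserve system messages):

```python
def truncate_history(messages: list[str], budget: int) -> list[str]:
    """Drop the oldest messages until the estimated token total fits
    the budget — keep the newest turns, discard from the front.

    Token costs here are heuristic (~4 chars/token), not real counts.
    """
    kept: list[str] = []
    total = 0
    for msg in reversed(messages):  # walk newest-first
        cost = max(1, len(msg) // 4)
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))     # restore chronological order
```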
Practical Strategies for Token Management
1. Start Fresh for New Tasks
Don't continue a long conversation just because you're working with the same model. Start a new session for new tasks to avoid:
- Irrelevant context from previous topics
- Context window exhaustion
- Old information conflicting with new
2. Summarize Long Conversations
When a conversation is getting long:
"Summarize our conversation so far into a brief context summary I can use
to start a fresh session. Include: key decisions made, current approach,
and remaining open questions."
Start a new session with that summary as your opening context.
3. Be Selective About What You Include
For code tasks:
- Include function signatures without implementations for adjacent files
- Include only the specific test file that's failing
- Include only the relevant migration, not the entire schema history
For document analysis:
- Paste only the relevant sections
- Or explicitly tell the model which section to focus on
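The first code-task tip — signatures without implementations — can be automated for Python files with the standard-library `ast` module. A minimal sketch (it handles plain positional arguments only; decorators, defaults, and async functions are left out for brevity):

```python
import ast

def signatures_only(source: str) -> str:
    """Return just the def lines of a Python module, bodies stripped.

    Gives a model the shape of adjacent files without spending
    tokens on their implementations.
    """
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
    return "\n".join(lines)

print(signatures_only("def add(a, b):\n    return a + b\n"))
# def add(a, b): ...
```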
4. Chunking for Very Long Documents
For documents too long even for large context windows:
I'm going to send you a long document in sections.
After each section, just say "Received section N" and note
3 key points. After all sections, I'll ask for analysis.
Section 1 of 5:
[content]
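Splitting the document itself can also be scripted. A minimal chunker, assuming paragraphs are separated by blank lines and using a character budget as a stand-in for token counts:

```python
def chunk_document(text: str, chunk_chars: int = 2000) -> list[str]:
    """Split a long document into pieces of at most ~chunk_chars,
    breaking at paragraph boundaries so no paragraph is cut mid-way
    (a lone paragraph longer than the budget still gets its own chunk).
    """
    paragraphs = text.split("\n\n")
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as "Section N of {len(chunks)}" following the prompt pattern above.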
5. Use the Right Model for the Task
- Long document analysis: Gemini 1.5 Pro (2M tokens) or Claude (200K)
- Short, fast tasks: GPT-4o mini or Gemini Flash
- Extended coding sessions: Models with large context that handle code well
Token Estimation for Planning
Rough estimates for planning:
| Content | Approximate Tokens |
|---|---|
| 1 page of prose | ~500 tokens |
| 1,000 words | ~1,330 tokens |
| A short story (5,000 words) | ~6,700 tokens |
| A novel chapter (10,000 words) | ~13,300 tokens |
| 100 lines of Python | ~200-400 tokens |
| 1,000 lines of Rust | ~2,000-4,000 tokens |
| A PDF page with tables | ~300-600 tokens |
Most model interfaces display current context window usage. Monitor it when working with large inputs.
Output Tokens vs. Input Tokens
Input tokens (your prompt) and output tokens (the model's response) are usually priced differently:
- Input tokens are typically cheaper
- Output tokens cost more
For long generation tasks:
- Estimate how many output tokens you need
- Don't set `max_tokens` too low (you'll get cut-off responses)
- Don't set it too high if you only need a short response (costs more)
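One way to estimate a reasonable `max_tokens` value is to work backwards from a target word count using the 100-tokens-≈-75-words rule. `suggest_max_tokens` is a hypothetical helper, and the 20% headroom is an assumed safety margin, not a standard:

```python
def suggest_max_tokens(target_words: int, headroom: float = 1.2) -> int:
    """Convert a target response length in words to a max_tokens value.

    Uses the 100 tokens ≈ 75 words rule of thumb, plus a safety
    margin (default 20%) so responses aren't cut off mid-sentence.
    """
    tokens = target_words * 100 / 75
    return round(tokens * headroom)

print(suggest_max_tokens(750))  # 1200
```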
Key Takeaways
- Tokens ≈ 4 characters of English; input + output must fit within the context window
- Information in the middle of very long contexts is recalled less reliably — put key instructions at start and end
- Start fresh sessions for new tasks; don't carry irrelevant conversation history
- Summarize long conversations rather than continuing indefinitely
- Match model context window size to your actual needs — bigger isn't always better if you're paying for it