Token Budget Pattern
What It Is
The token budget pattern enforces hard limits on the number of input and output tokens per LLM request. When a request would exceed its budget, the system automatically trims, summarizes, or truncates the input to fit, rather than sending an oversized request that fails or costs more than expected.
The Problem It Solves
Without token budgets, costs grow unpredictably. A single RAG query that retrieves too many chunks, a conversation with a long history, or a verbose system prompt can easily produce requests with 50,000+ input tokens, costing 10-50x what the same request would cost with proper budgeting.
Common runaway scenarios:
- A retrieval pipeline returns 20 chunks when 5 would suffice, inflating input tokens.
- Conversation history grows unbounded across turns, because each turn resends the full history so far.
- A system prompt combined with a user query and retrieved context exceeds the model’s context window, causing a hard failure.
- No per-request guardrail means one pathological request can consume a significant portion of the daily budget.
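To make the cost multiplier concrete, here is a rough sketch that compares a budgeted request with a runaway one. The price per million input tokens and both token counts are illustrative assumptions, not figures for any real model.

```python
# Illustrative only: the price and token counts below are assumed values.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical USD price


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Return the input-side cost in USD for a single request."""
    return tokens / 1_000_000 * price_per_million


budgeted_tokens = 2_000   # e.g. system prompt, 5 chunks, short history
runaway_tokens = 50_000   # e.g. 20 large chunks plus the full conversation

budgeted = input_cost(budgeted_tokens)
runaway = input_cost(runaway_tokens)
multiplier = runaway_tokens / budgeted_tokens

print(f"budgeted: ${budgeted:.4f}, runaway: ${runaway:.4f}, multiplier: {multiplier:.0f}x")
```

Under these assumed numbers the runaway request lands inside the 10-50x range mentioned above; the exact multiplier depends entirely on how much context leaks into the prompt.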
How It Works
flowchart TD
A["Request assembly"] --> B["Count tokens by component"]
B --> C{"Within token budget?"}
C -->|"Yes"| D["Reserve output tokens"]
D --> E["Send prompt to model"]
C -->|"No"| F["Trim variable context"]
F --> F1["Drop lowest-relevance chunks"]
F --> F2["Summarize older conversation turns"]
F --> F3["Truncate as last resort"]
F1 --> G["Recalculate token counts"]
F2 --> G
F3 --> G
G --> H{"Still above hard limit?"}
H -->|"Yes"| I["Reject request with clear budget error"]
H -->|"No"| D
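The control flow in the diagram can be sketched as a small loop: apply trimming strategies in order, recount after each step, and reject if the request is still over the limit. This is a minimal sketch under stated assumptions: `fit_to_budget`, `BudgetExceededError`, the `parts` layout, and both strategy functions are hypothetical names, the token counter is a crude word-based approximation, and dropping the oldest turn stands in for summarization.

```python
from typing import Callable


class BudgetExceededError(Exception):
    """Raised when trimming cannot bring a request under its hard limit."""


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: about 4 tokens per 3 words.
    return len(text.split()) * 4 // 3


def total_tokens(parts: dict[str, list[str]]) -> int:
    return sum(approx_tokens(t) for texts in parts.values() for t in texts)


def drop_last_chunk(parts: dict[str, list[str]]) -> bool:
    # Assumes chunks are sorted most relevant first, so the last is the weakest.
    if parts["chunks"]:
        parts["chunks"].pop()
        return True
    return False


def drop_oldest_turn(parts: dict[str, list[str]]) -> bool:
    # Stand-in for summarization: simply discards the oldest history turn.
    if parts["history"]:
        parts["history"].pop(0)
        return True
    return False


def fit_to_budget(
    parts: dict[str, list[str]],
    input_budget: int,
    strategies: list[Callable[[dict[str, list[str]]], bool]],
) -> dict[str, list[str]]:
    """Trim `parts` in place until it fits `input_budget`, or raise."""
    for strategy in strategies:
        # Keep applying one strategy while it still helps and we are over budget.
        while total_tokens(parts) > input_budget and strategy(parts):
            pass
    if total_tokens(parts) > input_budget:
        raise BudgetExceededError(
            f"Request needs {total_tokens(parts)} tokens, budget is {input_budget}"
        )
    return parts


parts = {
    "system": ["You are a helpful assistant."],
    "chunks": ["alpha " * 30, "beta " * 30, "gamma " * 30],
    "history": ["old turn " * 10, "recent turn " * 10],
    "query": ["What is alpha?"],
}
fitted = fit_to_budget(parts, input_budget=110, strategies=[drop_last_chunk, drop_oldest_turn])
```

Ordering the strategies from least to most destructive mirrors the diagram: relevance-based chunk dropping runs first, and history is only touched if that is not enough.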
- Define per-component budgets: Allocate the total context window across fixed components (system prompt, few-shot examples) and variable components (retrieved context, conversation history, user input).
- Measure token counts: Count tokens for each component before assembling the prompt. Use the model’s tokenizer for accurate counts.
- Apply trimming strategies: When a component exceeds its budget:
  - Retrieved context: Drop the lowest-relevance chunks until the budget is met.
  - Conversation history: Summarize older turns or drop them from oldest to newest.
  - User input: Truncate only as a last resort (rare, since user inputs are usually short).
- Reserve output tokens: Always reserve a portion of the context window for the model’s output. A common mistake is filling the entire window with input, leaving no room for a meaningful response.
- Enforce hard limits: If trimming cannot bring the request within budget, reject the request with a clear error rather than sending a degraded prompt.
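The first and fourth steps (per-component budgets and an output reserve) can be sketched as a single allocation function. This is one possible approach, not a prescribed one: `allocate_budgets`, the 8,000-token window, and the share weights are all assumptions chosen for illustration.

```python
def allocate_budgets(
    context_window: int,
    output_reserve: int,
    fixed: dict[str, int],
    variable_shares: dict[str, int],
) -> dict[str, int]:
    """Split a context window into per-component token budgets.

    fixed: component name -> tokens reserved exactly (e.g. system prompt).
    variable_shares: component name -> relative weight of the remaining tokens.
    """
    # Reserve the output first so input can never crowd out the response.
    remaining = context_window - output_reserve - sum(fixed.values())
    if remaining <= 0:
        raise ValueError("Fixed components and output reserve leave no room for variable context")

    total_share = sum(variable_shares.values())
    budgets = dict(fixed)
    for name, share in variable_shares.items():
        # Integer division rounds down, so the total never exceeds the window.
        budgets[name] = remaining * share // total_share
    budgets["output_reserve"] = output_reserve
    return budgets


budgets = allocate_budgets(
    context_window=8_000,   # assumed window size for illustration
    output_reserve=1_000,
    fixed={"system_prompt": 500, "few_shot": 500},
    variable_shares={"context": 3, "conversation_history": 2, "user_query": 1},
)
print(budgets)
```

Deriving the variable budgets from whatever is left after the fixed components and the output reserve keeps the allocation valid when the system prompt or the model's window changes.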
When to Use It
- You process requests with variable-length context (RAG, conversations, document analysis).
- Cost predictability is important for your business model (per-request pricing, free tier limits).
- You have experienced cost spikes from pathological inputs or runaway context accumulation.
- You need to prevent hard failures from exceeding model context windows.
When Not to Use It
- All your requests are short, fixed-format prompts with predictable token counts. Budgeting adds complexity to a system that does not need it.
- You are doing long-context tasks where the value comes from processing the full input (summarizing a specific document, analyzing an entire codebase). Trimming destroys the value of the task.
- Cost is not a concern and latency optimization (fewer tokens = faster) is not needed. Some research or internal workloads have effectively unlimited budgets.
Trade-offs
- Information loss: Trimming context removes potentially relevant information. The system may produce worse answers because it was forced to drop useful context.
- Implementation complexity: Accurate token counting requires the model’s specific tokenizer. Different models tokenize differently, so budgets are model-specific.
- User experience: In conversation scenarios, summarizing or dropping history changes what the model “remembers.” Users may experience inconsistent context awareness.
- Miscalibrated budgets: Setting budgets too tight wastes context window capacity and drops useful context. Setting them too loose defeats the purpose.
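The history trade-off can be softened by summarizing older turns instead of dropping them outright. The sketch below keeps the newest turns verbatim and folds the rest into one summary message. `compact_history` and `naive_summarizer` are hypothetical names; in practice the summarizer would likely be an LLM call, and the placeholder here just keeps the first few words of each turn.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: about 4 tokens per 3 words.
    return len(text.split()) * 4 // 3


def naive_summarizer(messages: list[dict]) -> str:
    # Placeholder for a real summarizer: first 5 words of each older turn.
    return "; ".join(" ".join(m["content"].split()[:5]) for m in messages)


def compact_history(
    messages: list[dict],
    budget: int,
    keep_recent: int,
    summarize: Callable[[list[dict]], str] = naive_summarizer,
) -> list[dict]:
    """Keep the newest turns verbatim and fold older ones into a summary."""
    total = sum(approx_tokens(m["content"]) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return list(messages)

    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return [summary] + recent


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn{i} " * 30}
    for i in range(6)
]
compacted = compact_history(history, budget=150, keep_recent=2)
compacted_tokens = sum(approx_tokens(m["content"]) for m in compacted)
```

The result is not guaranteed to fit the budget, since the summary itself costs tokens, so the caller should recount and fall back to dropping turns if needed.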
Implementation Example
from dataclasses import dataclass


@dataclass
class TokenBudget:
    system_prompt: int
    context: int
    conversation_history: int
    user_query: int
    output_reserve: int

    @property
    def total_input(self) -> int:
        return (
            self.system_prompt
            + self.context
            + self.conversation_history
            + self.user_query
        )

    @property
    def total(self) -> int:
        return self.total_input + self.output_reserve


def count_tokens(text: str) -> int:
    # Rough approximation: about 4 tokens for every 3 whitespace-separated words.
    return len(text.split()) * 4 // 3


def trim_context_to_budget(
    chunks: list[dict],
    budget: int,
) -> list[dict]:
    # Assumes chunks are sorted most relevant first; stops at the first
    # chunk that does not fit, so lower-relevance chunks are dropped.
    trimmed = []
    used = 0
    for chunk in chunks:
        chunk_tokens = count_tokens(chunk["content"])
        if used + chunk_tokens > budget:
            break
        trimmed.append(chunk)
        used += chunk_tokens
    return trimmed


def trim_conversation_history(
    messages: list[dict],
    budget: int,
) -> list[dict]:
    total = sum(count_tokens(m["content"]) for m in messages)

    if total <= budget:
        return messages

    # Drop the oldest messages first until the history fits.
    trimmed = list(messages)
    while trimmed and total > budget:
        removed = trimmed.pop(0)
        total -= count_tokens(removed["content"])

    return trimmed


def assemble_prompt(
    system_prompt: str,
    context_chunks: list[dict],
    conversation: list[dict],
    user_query: str,
    budget: TokenBudget,
) -> dict:
    # Fixed components are never trimmed: fail loudly if they are too large.
    system_tokens = count_tokens(system_prompt)
    if system_tokens > budget.system_prompt:
        raise ValueError(
            f"System prompt ({system_tokens} tokens) exceeds budget ({budget.system_prompt})"
        )

    query_tokens = count_tokens(user_query)
    if query_tokens > budget.user_query:
        raise ValueError(
            f"User query ({query_tokens} tokens) exceeds budget ({budget.user_query})"
        )

    # Variable components are trimmed to their individual budgets.
    trimmed_context = trim_context_to_budget(context_chunks, budget.context)
    trimmed_history = trim_conversation_history(conversation, budget.conversation_history)

    context_text = "\n\n".join(c["content"] for c in trimmed_context)

    messages = [
        {"role": "system", "content": system_prompt},
    ]

    if context_text:
        messages.append(
            {"role": "system", "content": f"Relevant context:\n{context_text}"}
        )

    messages.extend(trimmed_history)
    messages.append({"role": "user", "content": user_query})

    return {
        "messages": messages,
        "max_tokens": budget.output_reserve,
        "metadata": {
            "context_chunks_used": len(trimmed_context),
            "context_chunks_available": len(context_chunks),
            "history_messages_used": len(trimmed_history),
            "history_messages_available": len(conversation),
        },
    }


STANDARD_BUDGET = TokenBudget(
    system_prompt=500,
    context=3000,
    conversation_history=1000,
    user_query=500,
    output_reserve=1000,
)

For production systems, replace the approximation in count_tokens with the model’s actual tokenizer (tiktoken for OpenAI models, the relevant Hugging Face tokenizer for others).
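As a hedged sketch of that swap, the counter below uses tiktoken when it is available and falls back to the word-based approximation otherwise. `make_token_counter` is a hypothetical helper, and `cl100k_base` is an assumed encoding name; choose the encoding that matches your model (for example via `tiktoken.encoding_for_model`).

```python
from typing import Callable


def approx_count(text: str) -> int:
    # Fallback approximation: about 4 tokens per 3 whitespace-separated words.
    return len(text.split()) * 4 // 3


def make_token_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Return a token counter backed by tiktoken, or the approximation."""
    try:
        import tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
    except Exception:
        # tiktoken is not installed, or the encoding data could not be loaded.
        return approx_count

    def count(text: str) -> int:
        # disallowed_special=() makes strings such as "<|endoftext|>" in user
        # text count as ordinary text instead of raising an error.
        return len(encoding.encode(text, disallowed_special=()))

    return count


# Drop-in replacement for the approximate count_tokens.
count_tokens = make_token_counter()
print(count_tokens("How many tokens is this sentence?"))
```

Building the counter once and reusing it avoids reloading the encoding on every call, which matters when every component of every request is counted.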
Tool Landscape
| Tool | Type | Notes |
|---|---|---|
| tiktoken | Open-source (OpenAI) | Fast BPE tokenizer for accurate token counting with OpenAI models |
| LangChain token counters | Framework utility | Built-in token counting and context trimming for chains |
| LiteLLM | Open-source proxy | Token counting and budget enforcement at the gateway level |
| Semantic Kernel | Framework (Microsoft) | Token management as part of the prompt orchestration layer |
Related Patterns
- Prompt Compression Pattern: Reduces the token count of context before it enters the budget. Compression runs first; budgeting then enforces the limit.
- Cost Circuit Breaker: Enforces an aggregate budget at the system level rather than per request.
- Tiered Model Strategy: Routes requests to cheaper models to stretch the per-request cost budget.
- Cost Attribution Pattern: Tracks actual token usage against budgets to identify over- or under-allocation.
- Chunking Strategy Pattern: Chunk size directly affects how many chunks fit in the context budget.