Token Budget Pattern

The token budget pattern enforces hard limits on the number of input and output tokens per LLM request. When a request would exceed its budget, the system automatically trims, summarizes, or truncates the input to fit — rather than sending an oversized request that fails or costs more than expected.

Without token budgets, costs grow unpredictably. A single RAG query that retrieves too many chunks, a conversation with long history, or a verbose system prompt can easily create requests with 50,000+ input tokens — costing 10-50x what the same request should cost with proper budgeting.

Common runaway scenarios:

  • A retrieval pipeline returns 20 chunks when 5 would suffice, inflating input tokens.
  • Conversation history grows without bound because every turn resends all previous turns.
  • A system prompt combined with a user query and retrieved context exceeds the model’s context window, causing a hard failure.
  • No per-request guardrail means one pathological request can consume a significant portion of the daily budget.
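
To make the history scenario concrete, here is a minimal sketch of how input tokens accumulate when each request resends the whole conversation. It assumes every turn adds a fixed number of tokens; the function name and the 200-token figure are illustrative, not taken from any library.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent across a conversation when every
    request resends the full history (hypothetical helper)."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn   # the new turn joins the history
        total += history             # the whole history is sent again
    return total


# Assumed figure: 200 tokens per turn.
print(cumulative_input_tokens(10, 200))   # 11000
print(cumulative_input_tokens(50, 200))   # 255000
```

Under these assumptions, five times as many turns costs roughly 23 times as many input tokens, because the total grows quadratically with the number of turns.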

The request flow, as a Mermaid flowchart:

flowchart TD
    A["Request assembly"] --> B["Count tokens by component"]
    B --> C{"Within token budget?"}
    C -->|"Yes"| D["Reserve output tokens"]
    D --> E["Send prompt to model"]
    C -->|"No"| F["Trim variable context"]
    F --> F1["Drop lowest-relevance chunks"]
    F --> F2["Summarize older conversation turns"]
    F --> F3["Truncate as last resort"]
    F1 --> G["Recalculate token counts"]
    F2 --> G
    F3 --> G
    G --> C
    G --> H{"Still above hard limit?"}
    H -->|"Yes"| I["Reject request with clear budget error"]
    H -->|"No"| D
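
The trim-and-recount loop in the flowchart could look roughly like the sketch below for retrieved chunks. It assumes each chunk is a dict with a `content` string and a numeric `score` (higher means more relevant); that schema, the `fit_chunks` name, and the word-based token estimate are all placeholders for illustration.

```python
def count_tokens(text: str) -> int:
    # Rough estimate (about 4 tokens per 3 words); swap in a real tokenizer.
    return len(text.split()) * 4 // 3


def fit_chunks(chunks: list[dict], budget: int) -> list[dict]:
    """Drop the lowest-scoring chunks one at a time until the rest fit.

    Each chunk is assumed to look like {"content": str, "score": float},
    where a higher score means more relevant (hypothetical schema).
    """
    kept = sorted(chunks, key=lambda c: c["score"], reverse=True)
    used = sum(count_tokens(c["content"]) for c in kept)
    while kept and used > budget:
        dropped = kept.pop()                        # lowest score is last
        used -= count_tokens(dropped["content"])    # recalculate the count
    return kept


chunks = [
    {"content": "alpha " * 30, "score": 0.9},
    {"content": "beta " * 30, "score": 0.4},
    {"content": "gamma " * 30, "score": 0.7},
]
print([c["score"] for c in fit_chunks(chunks, budget=80)])   # [0.9, 0.7]
```

Each sample chunk is estimated at 40 tokens, so only the two most relevant fit in an 80-token budget.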

The pattern works in five steps:

  1. Define per-component budgets: Allocate the total context window across fixed components (system prompt, few-shot examples) and variable components (retrieved context, conversation history, user input).
  2. Measure token counts: Count tokens for each component before assembling the prompt. Use the model’s tokenizer for accurate counts.
  3. Apply trimming strategies: When a component exceeds its budget:
    • Retrieved context: Drop the lowest-relevance chunks until the budget is met.
    • Conversation history: Summarize older turns or drop them from oldest to newest.
    • User input: Truncate only as a last resort (rare — user inputs are usually short).
  4. Reserve output tokens: Always reserve a portion of the context window for the model’s output. A common mistake is filling the entire window with input, leaving no room for a meaningful response.
  5. Enforce hard limits: If trimming cannot bring the request within budget, reject the request with a clear error rather than sending a degraded prompt.
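
Steps 4 and 5 can be sketched as a single check. This is a minimal illustration rather than a reference implementation: `BudgetExceededError`, `input_allowance`, and `enforce_hard_limit` are hypothetical names, and the 8,192-token window is an assumed figure.

```python
class BudgetExceededError(Exception):
    """Raised when a request cannot fit even after trimming (hypothetical)."""


def input_allowance(context_window: int, output_reserve: int) -> int:
    """Tokens left for input once the output reserve is set aside."""
    if output_reserve >= context_window:
        raise ValueError("output reserve must be smaller than the context window")
    return context_window - output_reserve


def enforce_hard_limit(input_tokens: int, context_window: int, output_reserve: int) -> int:
    """Return the unused input headroom, or reject the request."""
    allowance = input_allowance(context_window, output_reserve)
    if input_tokens > allowance:
        raise BudgetExceededError(
            f"Input is {input_tokens} tokens but only {allowance} are allowed "
            f"({context_window} window minus {output_reserve} reserved for output)"
        )
    return allowance - input_tokens


# Assumed figures: an 8,192-token window with 1,000 tokens reserved for output.
print(enforce_hard_limit(5000, 8192, 1000))   # 2192
```

Raising a dedicated exception keeps the rejection explicit, so the caller can surface a clear budget error instead of silently sending a degraded prompt.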

Use this pattern when:

  • You process requests with variable-length context (RAG, conversations, document analysis).
  • Cost predictability is important for your business model (per-request pricing, free tier limits).
  • You have experienced cost spikes from pathological inputs or runaway context accumulation.
  • You need to prevent hard failures from exceeding model context windows.

Skip this pattern when:

  • All your requests are short, fixed-format prompts with predictable token counts. Budgeting adds complexity to a system that does not need it.
  • You are doing long-context tasks where the value comes from processing the full input (document summarization of a specific document, code analysis of an entire codebase). Trimming destroys the value of the task.
  • Cost is not a concern and latency optimization (fewer tokens = faster) is not needed. Some research or internal workloads have effectively unlimited budgets.

Tradeoffs to weigh:

  1. Information loss: Trimming context removes potentially relevant information. The system may produce worse answers because it was forced to drop useful context.
  2. Implementation complexity: Accurate token counting requires the model’s specific tokenizer. Different models tokenize differently, so budgets are model-specific.
  3. User experience: In conversation scenarios, summarizing or dropping history changes what the model “remembers.” Users may experience inconsistent context awareness.
  4. Over-budgeting: Setting budgets too conservatively wastes context window capacity. Setting them too loosely defeats the purpose.
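
One way to judge whether a budget is too tight or too loose is to log how much of it each request actually uses. The sketch below is illustrative only: `budget_utilization`, the 90% threshold, and the sample numbers are assumptions, not part of any framework.

```python
def budget_utilization(used_tokens: list[int], budget: int) -> dict:
    """Summarize how fully a component budget is used across requests.

    used_tokens holds the token count the component consumed in each
    request after trimming (hypothetical log data).
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if not used_tokens:
        return {"mean_utilization": 0.0, "at_limit_share": 0.0}
    mean_used = sum(used_tokens) / len(used_tokens)
    # Requests that ended at 90% or more of the budget were probably trimmed.
    at_limit = sum(1 for used in used_tokens if used >= 0.9 * budget)
    return {
        "mean_utilization": mean_used / budget,
        "at_limit_share": at_limit / len(used_tokens),
    }


# Made-up sample: context tokens used by four requests against a 3,000 budget.
stats = budget_utilization([600, 900, 2950, 750], budget=3000)
print(stats)
```

A low mean utilization with few requests near the limit suggests the budget may be larger than needed, while a high share of requests at the limit suggests trimming is frequent and answers may be losing context.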

Example implementation in Python:

from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Per-component token limits for a single request."""

    system_prompt: int
    context: int
    conversation_history: int
    user_query: int
    output_reserve: int

    @property
    def total_input(self) -> int:
        return (
            self.system_prompt
            + self.context
            + self.conversation_history
            + self.user_query
        )

    @property
    def total(self) -> int:
        return self.total_input + self.output_reserve


def count_tokens(text: str) -> int:
    # Rough approximation: about 4 tokens for every 3 words.
    return len(text.split()) * 4 // 3


def trim_context_to_budget(
    chunks: list[dict],
    budget: int,
) -> list[dict]:
    # Assumes chunks are already sorted from most to least relevant,
    # so stopping at the first chunk that does not fit drops the tail.
    trimmed = []
    used = 0
    for chunk in chunks:
        chunk_tokens = count_tokens(chunk["content"])
        if used + chunk_tokens > budget:
            break
        trimmed.append(chunk)
        used += chunk_tokens
    return trimmed


def trim_conversation_history(
    messages: list[dict],
    budget: int,
) -> list[dict]:
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= budget:
        return messages
    # Drop messages from oldest to newest until the rest fit.
    trimmed = list(messages)
    while trimmed and total > budget:
        removed = trimmed.pop(0)
        total -= count_tokens(removed["content"])
    return trimmed


def assemble_prompt(
    system_prompt: str,
    context_chunks: list[dict],
    conversation: list[dict],
    user_query: str,
    budget: TokenBudget,
) -> dict:
    # Fixed components are never trimmed: reject if they exceed their budget.
    system_tokens = count_tokens(system_prompt)
    if system_tokens > budget.system_prompt:
        raise ValueError(
            f"System prompt ({system_tokens} tokens) exceeds budget ({budget.system_prompt})"
        )

    query_tokens = count_tokens(user_query)
    if query_tokens > budget.user_query:
        raise ValueError(
            f"User query ({query_tokens} tokens) exceeds budget ({budget.user_query})"
        )

    # Variable components are trimmed to fit.
    trimmed_context = trim_context_to_budget(context_chunks, budget.context)
    trimmed_history = trim_conversation_history(conversation, budget.conversation_history)

    context_text = "\n\n".join(c["content"] for c in trimmed_context)
    messages = [
        {"role": "system", "content": system_prompt},
    ]
    if context_text:
        messages.append(
            {"role": "system", "content": f"Relevant context:\n{context_text}"}
        )
    messages.extend(trimmed_history)
    messages.append({"role": "user", "content": user_query})

    return {
        "messages": messages,
        # The output reserve doubles as the cap on generated tokens.
        "max_tokens": budget.output_reserve,
        # Metadata records how much was trimmed, for monitoring.
        "metadata": {
            "context_chunks_used": len(trimmed_context),
            "context_chunks_available": len(context_chunks),
            "history_messages_used": len(trimmed_history),
            "history_messages_available": len(conversation),
        },
    }


STANDARD_BUDGET = TokenBudget(
    system_prompt=500,
    context=3000,
    conversation_history=1000,
    user_query=500,
    output_reserve=1000,
)
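
The history trimmer above drops old turns outright. Where losing them hurts, one alternative is to fold them into a summary. The sketch below assumes a caller-supplied `summarize` callable (for example, a call to a cheaper model) and reuses the word-based token estimate; both are placeholders, and `compress_history` is a hypothetical name.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Same rough estimate as the example: about 4 tokens per 3 words.
    return len(text.split()) * 4 // 3


def compress_history(
    messages: list[dict],
    budget: int,
    summarize: Callable[[list[dict]], str],
    keep_recent: int = 2,
) -> list[dict]:
    """Keep the newest turns verbatim and fold older ones into one summary."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(older)}",
    }
    compressed = [summary] + recent

    # If the summary is itself too long, fall back to the recent turns only.
    # The caller is still expected to apply the hard limit afterwards.
    if sum(count_tokens(m["content"]) for m in compressed) > budget:
        return recent
    return compressed


history = [
    {"role": "user", "content": "word " * 300},
    {"role": "assistant", "content": "word " * 300},
    {"role": "user", "content": "short question"},
    {"role": "assistant", "content": "short answer"},
]
result = compress_history(
    history,
    budget=100,
    summarize=lambda msgs: f"{len(msgs)} earlier messages",  # stand-in summarizer
)
print(len(result))   # 3
```

Keeping the most recent turns verbatim preserves the immediate context, at the price of an extra summarization step whose own cost and accuracy need to be weighed.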

For production systems, replace the approximation in count_tokens with the model’s actual tokenizer (tiktoken for OpenAI models, the relevant Hugging Face tokenizer for others).
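
As a hedged sketch of that swap, the version below counts tokens with tiktoken and falls back to the word-based estimate if the library or its encoding data cannot be loaded. `cl100k_base` is one of tiktoken's built-in encodings; whether it matches your model is an assumption to verify, and the silent fallback is a convenience for this example rather than a recommendation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def _get_encoder(encoding_name: str):
    """Return a tiktoken encoder, or None if it cannot be loaded."""
    try:
        import tiktoken
        return tiktoken.get_encoding(encoding_name)
    except Exception:
        # Library missing or encoding data unavailable (for example, offline).
        return None


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken when available, else approximate."""
    encoder = _get_encoder(encoding_name)
    if encoder is None:
        # Fallback: the rough 4-tokens-per-3-words estimate.
        return len(text.split()) * 4 // 3
    # disallowed_special=() treats special-token strings in user text as
    # ordinary text instead of raising an error.
    return len(encoder.encode(text, disallowed_special=()))


print(count_tokens("How many tokens is this sentence?"))
```

In a production path it may be safer to fail loudly when the tokenizer is unavailable, since an approximate count can undercut the hard limit the budget is meant to guarantee.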

| Tool | Type | Notes |
| --- | --- | --- |
| tiktoken | Open-source (OpenAI) | Fast BPE tokenizer for accurate token counting with OpenAI models |
| LangChain token counters | Framework utility | Built-in token counting and context trimming for chains |
| LiteLLM | Open-source proxy | Token counting and budget enforcement at the gateway level |
| Semantic Kernel | Framework (Microsoft) | Token management as part of the prompt orchestration layer |