AI Engineering Patterns

What It Is

Cascading context assembly is a multi-tier retrieval and prompting strategy that starts with the cheapest, smallest context and only escalates to richer context when the model's initial response signals low confidence or insufficient information. Instead of stuffing every request with maximum context, it builds context progressively — spending retrieval cost and context window tokens only when needed.

The Problem It Solves

The default RAG architecture retrieves a fixed number of chunks for every query and stuffs them all into the prompt. This "max context by default" approach has compounding costs:

Token waste on easy queries: "What is your refund policy?" could be answered from a single short chunk. Instead, the system retrieves 10 chunks, pays for 4,000 input tokens, and the LLM ignores 9 of the 10 chunks. Across thousands of requests, this waste is substantial.
Latency floor: More context means more input tokens, which means higher time-to-first-token. Even when the answer is trivially available, the system pays the latency cost of processing unnecessary context.
Retrieval noise: Retrieving more chunks increases the chance of including irrelevant or contradictory information. The LLM must filter signal from noise in the context, which can degrade answer quality — more is not always better.
Fixed cost regardless of difficulty: A one-line factual lookup and a complex multi-document synthesis task consume the same retrieval and context budget. There is no cost differentiation by query complexity.

Cascading context assembly aligns cost with difficulty: simple queries get simple (cheap) context, and only complex queries pay for rich context.
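The cost argument can be checked with a little arithmetic. The sketch below is illustrative only: the per-tier token counts and the fractions of queries that stop at each tier are hypothetical numbers, not measurements. It assumes a query that stops at tier i has already paid for every earlier tier.

```python
def expected_input_tokens(tier_tokens, resolve_fractions):
    """Expected input tokens per query for a cascade.

    tier_tokens[i]       -- input tokens sent at tier i (hypothetical values)
    resolve_fractions[i] -- fraction of queries whose cascade stops at tier i
    A query that stops at tier i has paid for tiers 0..i.
    """
    if abs(sum(resolve_fractions) - 1.0) > 1e-9:
        raise ValueError("resolve_fractions must sum to 1")
    expected = 0.0
    cumulative = 0
    for tokens, fraction in zip(tier_tokens, resolve_fractions):
        cumulative += tokens
        expected += fraction * cumulative
    return expected


tier_tokens = [800, 2400, 4800]   # hypothetical prompt sizes per tier
fixed_cost = tier_tokens[-1]      # "max context by default" baseline

mostly_easy = expected_input_tokens(tier_tokens, [0.7, 0.2, 0.1])
mostly_hard = expected_input_tokens(tier_tokens, [0.1, 0.1, 0.8])
```

With these made-up numbers, the mostly-easy workload averages fewer tokens than the fixed baseline, while the mostly-hard workload averages more, because escalated queries pay for every tier they pass through.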

How It Works

flowchart TD
    A["Query arrives"] --> B["Tier 1: minimal context (1-2 chunks)"]
    B --> C["LLM generates response + confidence signal"]
    C --> D{"Confidence above threshold?"}
    D -->|"Yes"| E["Return response"]
    D -->|"No"| F["Tier 2: expanded context (5-8 chunks)"]
    F --> G["LLM generates response + confidence signal"]
    G --> H{"Confidence above threshold?"}
    H -->|"Yes"| I["Return response"]
    H -->|"No"| J["Tier 3: full context (max chunks + reranking)"]
    J --> K["LLM generates final response"]
    K --> L["Return response"]

Tier 1 — Minimal context: Retrieve the top 1-2 most relevant chunks. Send to the LLM with an instruction to answer if confident, or respond with a structured "insufficient context" signal if not.
Confidence evaluation: Parse the LLM's response for a confidence signal. This can be an explicit confidence score (asked for in the prompt), a refusal phrase ("I don't have enough information"), or log-probability analysis on the response tokens.
Tier 2 (expanded context): If Tier 1 signals low confidence, retrieve additional chunks (5-8 total) using broader search or a different retrieval strategy (for example, switching from pure vector to hybrid search). Re-prompt with the richer context.
Tier 3 — Full context: If Tier 2 still signals low confidence, pull maximum context with reranking, multi-source retrieval, and optionally a stronger model. This is the "spare no expense" tier reserved for genuinely difficult queries.
Response: Return the response from the first tier that achieves sufficient confidence. Most queries resolve at Tier 1 or Tier 2.
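Of the three confidence signals listed above, log-probability analysis is the least obvious to implement. A minimal sketch follows, assuming the LLM API returns one log-probability per generated token. The function name and the thresholds are hypothetical and would need calibration against labeled queries. Note that token likelihood reflects how fluent the model found its own output, which is not the same as factual correctness.

```python
import math


def confidence_from_logprobs(token_logprobs, high=0.9, medium=0.7):
    """Map per-token log-probabilities to a coarse confidence label.

    Uses the geometric mean of token probabilities, exp(mean(logprob)).
    The thresholds are illustrative assumptions, not recommended values.
    """
    if not token_logprobs:
        return "insufficient"
    mean_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    if mean_prob >= high:
        return "high"
    if mean_prob >= medium:
        return "medium"
    return "low"
```

A signal like this can be combined with the explicit confidence tag rather than replacing it, so that a tagged "high" answer with unusually low token likelihood still escalates.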

When to Use It

Your query distribution has a long tail: most queries are simple, but some require deep retrieval. You are paying Tier 3 costs for Tier 1 queries.
Cost reduction is a priority and you have sufficient request volume for the savings to be meaningful.
Latency matters and you want to serve simple queries faster without a blanket latency increase from large contexts.
Your retrieval quality degrades with more chunks (retrieval noise is a measured problem, not a theoretical concern).

When Not to Use It

Most queries genuinely require deep context (legal analysis, research synthesis). If Tier 3 fires on 80%+ of queries, cascading adds latency (multiple LLM calls) without saving cost.
The LLM cannot reliably signal low confidence. If the model confidently hallucinates at Tier 1 instead of admitting insufficient context, cascading produces wrong answers faster. Test confidence calibration before adopting this pattern.
Your retrieval and inference are already cheap (small index, local embedding model, low per-token costs). The extra LLM calls on escalated queries may cost more than the reduced context saves.
You have strict latency SLOs that cannot tolerate the occasional multi-tier cascade. Worst-case latency increases because hard queries now take 2-3 serial LLM calls.

Trade-offs

Confidence calibration — The entire pattern depends on the LLM's ability to say "I don't know" when context is insufficient. Many models default to generating plausible-sounding answers. Prompt engineering and model selection are critical to making this work.
Latency variance — Easy queries are faster, but hard queries are slower (multiple serial LLM calls). P50 improves, P99 may worsen. This variance may be unacceptable for some use cases.
Prompt complexity — Each tier needs a prompt that elicits a usable confidence signal without being so cautious that the model refuses to answer on sufficient context.
Cascading errors — If Tier 1 confidently returns a wrong answer, the cascade stops early with a bad response. The system optimizes for cost at the risk of missing cases where more context would have corrected the answer.
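The latency trade-off can be illustrated with a small calculation. The per-tier latencies and the split of queries across tiers below are hypothetical. The sketch assumes tiers run serially and uses a nearest-rank percentile.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-tier LLM latencies in seconds.
TIER_LATENCY = [0.4, 0.8, 1.5]
# Hypothetical number of queries (out of 100) that stop at each tier.
STOPS_AT_TIER = [70, 20, 10]

cascade = []
for tier, count in enumerate(STOPS_AT_TIER):
    # A query that stops at this tier waited for every earlier tier too.
    total = sum(TIER_LATENCY[: tier + 1])
    cascade.extend([total] * count)

# Baseline: every query goes straight to the full-context call.
baseline = [TIER_LATENCY[-1]] * sum(STOPS_AT_TIER)

p50_cascade, p99_cascade = percentile(cascade, 50), percentile(cascade, 99)
p50_baseline, p99_baseline = percentile(baseline, 50), percentile(baseline, 99)
```

Under these assumptions the cascade's P50 drops to the Tier 1 latency, while its P99 rises to the sum of all three tiers, which is higher than the single full-context call.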

Failure Modes

Over-Confident Early Exit

Trigger: The model reports high confidence at Tier 1 because it has strong priors from training data, even though the domain-specific answer differs from its parametric knowledge.
Symptom: The system serves wrong answers cheaply and confidently. Quality metrics look fine in aggregate because most queries are correctly answered at Tier 1, but a class of domain-specific queries consistently fails.
Mitigation: Validate confidence calibration on a held-out set of queries where the answer requires domain context (not general knowledge). Force a random sample of Tier 1 exits through the full cascade to measure the miss rate.
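The sampling mitigation could be sketched as follows. The record type and function names are hypothetical. Whether a Tier 1 answer agrees with the full-cascade answer is assumed to be judged elsewhere, by a human reviewer or an evaluator, and is not computed here.

```python
import random
from dataclasses import dataclass


@dataclass
class AuditRecord:
    query: str
    tier1_answer: str
    full_answer: str
    agrees: bool  # judged externally; an assumption of this sketch


def select_for_audit(tier1_exit_queries, sample_rate, seed=None):
    """Pick a random sample of Tier 1 exits to re-run through the full cascade."""
    rng = random.Random(seed)
    return [q for q in tier1_exit_queries if rng.random() < sample_rate]


def miss_rate(records):
    """Fraction of audited Tier 1 exits where the full cascade disagreed."""
    if not records:
        return 0.0
    misses = sum(1 for record in records if not record.agrees)
    return misses / len(records)
```

Tracking the miss rate per query category, rather than only in aggregate, makes it easier to spot the class of domain-specific queries described in the symptom.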

Tier Configuration Drift

Trigger: New context sources are added to the system but not wired into the cascade tiers, or tier costs change (embedding model updated, new retrieval index).
Symptom: The cascade skips a relevant context source. Cost assumptions are wrong: a "cheap" tier is now expensive, and the cascade does not save money anymore.
Mitigation: Document the cascade tier configuration as code. Review tier composition and cost assumptions on a schedule (monthly). Alert when per-tier cost-per-query deviates from expected range.
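One way to keep tier configuration as code is sketched below. The field names, source names, and cost ranges are all hypothetical. The two checks correspond to the two symptoms: a tier whose observed cost has drifted, and a context source that no tier retrieves from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSpec:
    name: str
    top_k: int
    strategy: str
    sources: tuple[str, ...]            # context sources wired into this tier
    expected_cost: tuple[float, float]  # (min, max) cost per query, hypothetical units


TIERS = (
    TierSpec("tier1", 2, "vector", ("docs",), (0.0001, 0.0010)),
    TierSpec("tier2", 6, "hybrid", ("docs", "tickets"), (0.0005, 0.0040)),
    TierSpec("tier3", 12, "hybrid_reranked", ("docs", "tickets"), (0.0020, 0.0150)),
)


def cost_alerts(tiers, observed_cost):
    """Return names of tiers whose observed cost per query is missing or out of range."""
    alerts = []
    for tier in tiers:
        low, high = tier.expected_cost
        cost = observed_cost.get(tier.name)
        if cost is None or not (low <= cost <= high):
            alerts.append(tier.name)
    return alerts


def unwired_sources(tiers, known_sources):
    """Return known context sources that no tier retrieves from."""
    wired = {source for tier in tiers for source in tier.sources}
    return sorted(set(known_sources) - wired)
```

Running both checks in CI or on a schedule turns the monthly review into an automated guard; the observed costs would come from whatever metrics system is already in place.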

Serial Latency Stacking

Trigger: A hard query fails confidence checks at every tier, causing sequential LLM calls at Tier 1, Tier 2, and Tier 3.
Symptom: Worst-case latency is the sum of all tier latencies. P99 latency can reach roughly 3x the single-call latency, or more when later tiers process larger prompts. Users on hard queries experience unacceptable wait times.
Mitigation: Set a total cascade timeout. If the budget is exhausted, return the best response so far with a confidence disclaimer rather than continuing to cascade.
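A minimal sketch of the cascade timeout, under stated assumptions: each tier is a hypothetical callable returning (response, is_confident), the most recent response is treated as the best so far, and the budget is checked only between tiers. An in-flight LLM call is not interrupted, so a real deployment would also need per-call timeouts.

```python
import time


def cascade_with_budget(tiers, budget_seconds, clock=time.monotonic):
    """Run tiers in order until one is confident or the time budget runs out.

    tiers: list of zero-argument callables returning (response, is_confident).
    The budget is checked before starting each tier after the first.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    deadline = clock() + budget_seconds
    best = None
    for tier_number, run_tier in enumerate(tiers, start=1):
        # Always run the first tier; afterwards stop if the budget is spent.
        if best is not None and clock() >= deadline:
            return {"response": best, "tier": tier_number - 1, "timed_out": True}
        response, is_confident = run_tier()
        best = response
        if is_confident:
            return {"response": response, "tier": tier_number, "timed_out": False}
    return {"response": best, "tier": len(tiers), "timed_out": False}
```

The caller can use the timed_out flag to attach the confidence disclaimer before returning the response to the user.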

Implementation Example

from dataclasses import dataclass
from enum import Enum


class ConfidenceLevel(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INSUFFICIENT = "insufficient"


@dataclass
class TierResult:
    response: str
    confidence: ConfidenceLevel
    tier: int
    chunks_used: int
    input_tokens: int


TIER_PROMPT_SUFFIX = (
    "\n\nAfter your answer, on a new line, provide your confidence level "
    "as exactly one of: [HIGH_CONFIDENCE], [MEDIUM_CONFIDENCE], [LOW_CONFIDENCE], "
    "[INSUFFICIENT_CONTEXT]. Use [INSUFFICIENT_CONTEXT] if the provided context "
    "does not contain enough information to answer reliably."
)


def parse_confidence(response: str) -> tuple[str, ConfidenceLevel]:
    confidence_map = {
        "[HIGH_CONFIDENCE]": ConfidenceLevel.HIGH,
        "[MEDIUM_CONFIDENCE]": ConfidenceLevel.MEDIUM,
        "[LOW_CONFIDENCE]": ConfidenceLevel.LOW,
        "[INSUFFICIENT_CONTEXT]": ConfidenceLevel.INSUFFICIENT,
    }
    for tag, level in confidence_map.items():
        if tag in response:
            clean_response = response.replace(tag, "").strip()
            return clean_response, level
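    # No tag found: treat the answer as low confidence so the cascade escalates
    # instead of exiting early on a response it could not parse.
    return response.strip(), ConfidenceLevel.LOW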


class CascadingContextAssembler:
    def __init__(
        self,
        retrieve_fn,
        llm_fn,
        tier_configs: list[dict] | None = None,
        confidence_threshold: ConfidenceLevel = ConfidenceLevel.MEDIUM,
    ):
        self._retrieve = retrieve_fn
        self._llm = llm_fn
        self._threshold = confidence_threshold
        self._tier_configs = tier_configs or [
            {"top_k": 2, "strategy": "vector"},
            {"top_k": 6, "strategy": "hybrid"},
            {"top_k": 12, "strategy": "hybrid_reranked"},
        ]

    def _is_sufficient(self, confidence: ConfidenceLevel) -> bool:
        rank = {
            ConfidenceLevel.HIGH: 3,
            ConfidenceLevel.MEDIUM: 2,
            ConfidenceLevel.LOW: 1,
            ConfidenceLevel.INSUFFICIENT: 0,
        }
        return rank[confidence] >= rank[self._threshold]

    def answer(self, query: str) -> TierResult:
        for tier_idx, config in enumerate(self._tier_configs):
            chunks = self._retrieve(
                query,
                top_k=config["top_k"],
                strategy=config["strategy"],
            )

            context = "\n\n---\n\n".join(chunks)
            prompt = (
                f"Answer the question using only the provided context.\n\n"
                f"Context:\n{context}\n\n"
                f"Question: {query}"
                f"{TIER_PROMPT_SUFFIX}"
            )

            raw_response = self._llm(prompt)
            clean_response, confidence = parse_confidence(raw_response)

            is_last_tier = tier_idx == len(self._tier_configs) - 1
            if self._is_sufficient(confidence) or is_last_tier:
                return TierResult(
                    response=clean_response,
                    confidence=confidence,
                    tier=tier_idx + 1,
                    chunks_used=len(chunks),
                    # Rough word-count heuristic, not a real token count;
                    # use the model's tokenizer for accurate accounting.
                    input_tokens=len(prompt.split()) * 2,
                )

        # Not reachable in practice: __init__ falls back to the default tiers
        # when none are given, and the last tier always returns above.
        raise RuntimeError("cascade finished without returning a result")

Tool Landscape

Tool	Type	Notes
LlamaIndex Response Synthesizers	Framework	Supports compact, refine, and tree-summarize modes as escalation tiers
LangChain LCEL	Framework	Compose multi-step chains with conditional branching based on output
Semantic Kernel (Microsoft)	Framework	Planner can dynamically decide how much context to retrieve
Cohere Rerank	API	Useful as the escalation step between Tier 2 and Tier 3
Custom pipeline	DIY	Necessary for tight control over tier boundaries and confidence parsing

Related Patterns

Token Budget Pattern — Token budgets set the hard ceiling; cascading context decides how much of that budget to use per request.
Model Router Pattern — Can be combined: Tier 1 uses a cheap model with minimal context; Tier 3 uses a powerful model with full context.
Hybrid Search — Tier escalation can switch retrieval strategies (vector-only at Tier 1, hybrid at Tier 2).
Semantic Caching — Cache responses from all tiers. A Tier 1 cache hit avoids even the minimal retrieval cost.

Cascading Context Assembly