# Model Router
## What It Is

A model router classifies incoming queries by complexity and routes them to the appropriate model tier. Simple queries go to fast, cheap models. Complex queries go to powerful, expensive models. The router itself can be a lightweight classifier, an LLM judge, or a rule-based system.
## The Problem It Solves

Most production workloads have a highly skewed complexity distribution. Typically 70-90% of queries are simple enough for a small, fast model, but teams route everything to a single powerful (expensive) model because they lack a mechanism to differentiate. This results in 5-20x higher costs than necessary.
## How It Works

```
Query → Router (classify complexity)
                │
        ┌───────┼───────┐
        │       │       │
     Simple   Medium  Complex
        │       │       │
    GPT-4o    Claude  Claude
     mini     Sonnet   Opus
        │       │       │
        └───────┴───────┘
                │
            Response
```

- A query arrives at the router.
- The router classifies the query into a complexity tier using one of:
  - Rule-based: Query length, keyword presence, conversation turn count.
  - Classifier: A lightweight ML model trained on labeled query-complexity pairs.
  - LLM judge: A cheap model scores the query complexity before routing (adds latency but is more accurate).
- Based on the tier, the query is sent to the corresponding model.
- Quality monitoring compares outputs across tiers to validate routing accuracy.
- Misrouted queries (queries classified as simple that actually needed the powerful model) are logged and used to retrain the router.
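The LLM-judge option from the list above can be sketched as follows. This is a minimal sketch, not a specific provider's API: `judge` is a hypothetical callable that sends one prompt to a cheap model and returns its raw text reply, and the prompt wording is illustrative.

```python
# Illustrative LLM-judge router. `judge` is a hypothetical stand-in for a
# call to a cheap model that returns the model's text reply as a string.
JUDGE_PROMPT = (
    "Rate the complexity of the following query as exactly one word: "
    "simple, medium, or complex.\n\nQuery: {query}"
)


def judge_route(query: str, judge) -> str:
    """Ask a cheap model to score complexity, then map its reply to a tier."""
    reply = judge(JUDGE_PROMPT.format(query=query))
    # Take the first word of the reply and strip trailing punctuation.
    verdict = reply.strip().lower().split()[0].rstrip(".,!")
    if verdict in ("simple", "medium", "complex"):
        return verdict
    return "medium"  # fall back to the middle tier on an unparseable reply
```

Parsing defensively matters here: judge models often reply with extra words or punctuation, and an unparseable verdict should degrade to a safe middle tier rather than fail the request.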
## When to Use It

- Your workload has a mix of simple and complex queries and you are paying a single-model price for all of them.
- You have measurable quality criteria that let you verify whether a cheaper model’s output is acceptable.
- Cost reduction of 50%+ would meaningfully change the economics of your product.
- You can tolerate occasional quality misses on misrouted queries (with fallback to a better model).
## When NOT to Use It

- All your queries are roughly the same complexity (e.g., a specialized code generation tool where every query is complex). The router adds latency with no cost savings.
- You cannot measure output quality automatically. Without quality feedback, you cannot validate that cheaper models are performing adequately.
- Your query volume is low enough that the cost difference is negligible. If you spend less than a few hundred dollars per month on inference, the engineering effort of maintaining a router is not justified.
- Latency is more critical than cost, and the routing decision itself adds unacceptable overhead.
## Trade-offs

- Routing accuracy vs. cost — Misrouting a complex query to a cheap model produces bad output. Misrouting a simple query to an expensive model wastes money. Finding the balance requires ongoing calibration.
- Added latency — Any classification step adds latency before the actual inference. Rule-based routers add microseconds; LLM-judge routers add hundreds of milliseconds.
- Maintenance burden — As models evolve and new tiers appear, the router needs updating. Model capabilities change with each release.
- Quality monitoring dependency — The router is only as good as your ability to detect when it routes incorrectly. Requires investment in evaluation and feedback loops.
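The feedback loop these trade-offs depend on can start very simply: tally logged routing outcomes per tier and flag tiers that drift past an error budget. A sketch, assuming quality monitoring emits `(tier, was_misrouted)` pairs; the 5% threshold is an arbitrary example.

```python
from collections import Counter


def misroute_report(log, threshold=0.05):
    """Summarize routing accuracy per tier from quality-monitoring logs.

    `log` is a list of (tier, was_misrouted) pairs. A tier whose misroute
    rate exceeds `threshold` is flagged for recalibration.
    """
    totals, misses = Counter(), Counter()
    for tier, was_misrouted in log:
        totals[tier] += 1
        if was_misrouted:
            misses[tier] += 1
    return {
        tier: {
            "rate": misses[tier] / totals[tier],
            "needs_recalibration": misses[tier] / totals[tier] > threshold,
        }
        for tier in totals
    }
```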
## Implementation Example

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


@dataclass
class RouteResult:
    tier: Tier
    model: str
    reason: str


TIER_MODELS = {
    Tier.SIMPLE: "gpt-4o-mini",
    Tier.MEDIUM: "claude-3-5-sonnet-20241022",
    Tier.COMPLEX: "claude-3-opus-20240229",
}

COMPLEXITY_KEYWORDS = {
    "analyze", "compare", "synthesize", "evaluate",
    "multi-step", "trade-off", "architecture", "design",
}


def classify_query(query: str, has_context: bool = False) -> RouteResult:
    words = query.lower().split()
    word_count = len(words)
    has_complexity_keywords = bool(set(words) & COMPLEXITY_KEYWORDS)

    if word_count < 20 and not has_complexity_keywords and not has_context:
        tier = Tier.SIMPLE
        reason = "Short query without complexity indicators"
    elif word_count > 100 or (has_complexity_keywords and has_context):
        tier = Tier.COMPLEX
        reason = "Long query, or complexity indicators with context"
    else:
        tier = Tier.MEDIUM
        reason = "Moderate complexity"

    return RouteResult(
        tier=tier,
        model=TIER_MODELS[tier],
        reason=reason,
    )


async def routed_completion(gateway, query: str, messages: list[dict]) -> dict:
    route = classify_query(query)
    return await gateway.chat_completion(
        model=route.model,
        messages=messages,
    )
```

For production systems, replace the rule-based classifier with a trained lightweight model (e.g., a fine-tuned BERT classifier or a logistic regression over query features) and add quality-based feedback to continuously improve routing decisions.
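The logistic-regression variant mentioned above can be sketched in a few lines. Everything here is illustrative: the weights are made up rather than trained, and the reduced keyword set stands in for real learned features.

```python
import math

# Reduced keyword set for illustration; a real classifier would learn
# features from labeled query-complexity pairs.
KEYWORDS = {"analyze", "compare", "architecture"}

WEIGHTS = [2.0, 1.5, 1.0]  # illustrative, not trained
BIAS = -1.2


def features(query: str) -> list[float]:
    """Extract simple numeric features from a query."""
    words = query.lower().split()
    return [
        len(words) / 100.0,                          # normalized length
        float(any(w in KEYWORDS for w in words)),    # keyword hit
        query.count("?") / 3.0,                      # multi-question signal
    ]


def p_complex(query: str) -> float:
    """Logistic regression score: probability the query is complex."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(query)))
    return 1.0 / (1.0 + math.exp(-z))
```

The score can then be thresholded into tiers (e.g., below 0.3 simple, above 0.7 complex), and those thresholds become the knobs the feedback loop recalibrates as misroutes are logged.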
## Tool Landscape

| Tool | Type | Notes |
|---|---|---|
| Martian | Managed router | ML-based model router with automatic quality monitoring |
| Unify AI | Managed router | Routes across 100+ models based on quality/cost/latency targets |
| RouteLLM (Berkeley) | Open-source | Research framework for training LLM routers |
| Portkey | Gateway with routing | Conditional routing rules at the gateway level |
| Custom classifier | DIY | Maximum control, requires labeled data and evaluation infrastructure |
## Related Patterns

- LLM Gateway Pattern — The gateway executes the routing decision. Router logic often lives inside the gateway.
- Semantic Caching — Check the cache before routing. Cached responses bypass model selection entirely.
- Tiered Model Strategy — The broader cost strategy that model routing implements.
- Cost Attribution Pattern — Measures cost savings from routing to validate the approach.
- Quality Drift Detection — Detects when routing decisions degrade output quality over time.