Inference & Serving
Patterns for controlling how requests reach LLM providers and how responses come back. This pillar covers the infrastructure layer between your application code and model APIs — routing, caching, failover, and cost control at the inference boundary.
Patterns
LLM Gateway: Centralized proxy for API key management, rate limiting, logging, and multi-provider routing.
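The gateway idea can be sketched as a small in-process class; this is an illustrative sketch only (the provider callables, key format, and rate-limit policy are assumptions, not any real gateway's API):

```python
import time
import logging

class LLMGateway:
    """Toy gateway: key checks, per-key rate limiting, logging, provider routing."""

    def __init__(self, providers, rate_limit_per_min=60):
        self.providers = providers      # provider name -> callable(prompt) -> response
        self.keys = {}                  # api_key -> timestamps of recent requests
        self.rate_limit = rate_limit_per_min
        self.log = logging.getLogger("gateway")

    def register_key(self, api_key):
        self.keys[api_key] = []

    def complete(self, api_key, provider, prompt):
        # Authentication: only registered keys may pass.
        if api_key not in self.keys:
            raise PermissionError("unknown API key")
        # Sliding-window rate limit: count requests in the last 60 seconds.
        now = time.time()
        recent = [t for t in self.keys[api_key] if now - t < 60]
        if len(recent) >= self.rate_limit:
            raise RuntimeError("rate limit exceeded")
        recent.append(now)
        self.keys[api_key] = recent
        # Centralized logging before routing to the chosen provider.
        self.log.info("key=%s provider=%s prompt_chars=%d",
                      api_key[:4], provider, len(prompt))
        return self.providers[provider](prompt)

# Usage with a stand-in provider (a real deployment would wrap vendor SDKs):
providers = {"echo": lambda prompt: "echo: " + prompt}
gw = LLMGateway(providers, rate_limit_per_min=5)
gw.register_key("demo-key")
print(gw.complete("demo-key", "echo", "hello"))  # echo: hello
```

In production this sits behind the network boundary as a standalone proxy, so application code holds only a gateway key and never touches provider credentials directly.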
Model Router: Route simple queries to cheap, fast models and complex ones to more capable models, for an 85-99% cost reduction.
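A minimal sketch of complexity-based routing, using a crude keyword-and-length heuristic (a real router would typically use a small classifier model; the model names here are placeholders, not real provider IDs):

```python
def estimate_complexity(prompt: str) -> int:
    """Score 0-3: higher means the task likely needs a stronger model."""
    score = 0
    if len(prompt) > 500:          # long prompts often carry harder tasks
        score += 1
    if "```" in prompt:            # embedded code suggests a coding task
        score += 1
    reasoning_cues = ("prove", "analyze", "step by step", "debug")
    if any(kw in prompt.lower() for kw in reasoning_cues):
        score += 1
    return score

def route(prompt: str,
          cheap_model: str = "small-fast-model",
          strong_model: str = "large-capable-model") -> str:
    """Pick a model tier based on the complexity estimate."""
    return strong_model if estimate_complexity(prompt) >= 2 else cheap_model

print(route("What is the capital of France?"))        # small-fast-model
print(route("Debug this step by step:\n```\nx=1\n```"))  # large-capable-model
```

The cost savings come from the fact that most traffic in many workloads is simple enough for the cheap tier, so the strong model is invoked only for the minority of hard queries.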
Semantic Caching: Cache LLM responses by meaning rather than exact text match, serving semantically similar queries from cache for a 40-60% cost reduction.
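The lookup-by-meaning step can be sketched with an embedding plus cosine similarity. The character-bigram "embedding" below is a toy stand-in for a real embedding model, and the 0.85 threshold is an illustrative choice, not a recommended value:

```python
import math

def embed(text: str) -> dict:
    # Toy embedding: character-bigram counts. A real cache would call
    # an embedding model and store dense vectors instead.
    vec = {}
    t = text.lower()
    for i in range(len(t) - 1):
        bigram = t[i:i + 2]
        vec[bigram] = vec.get(bigram, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def get(self, query: str):
        # Linear scan for the nearest cached query; production systems
        # use a vector index (ANN search) instead.
        qv = embed(query)
        best, best_sim = None, 0.0
        for ev, resp in self.entries:
            sim = cosine(qv, ev)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= self.threshold else None

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("what's the capital of France"))  # Paris (similar enough)
print(cache.get("How do I bake bread?"))          # None (cache miss)
```

On a miss, the application calls the model and writes the new (query, response) pair back with `put`, so rephrasings of popular queries stop reaching the provider at all.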