LLM Gateway
What It Is
Section titled “What It Is”An LLM Gateway is a centralized proxy layer that sits between your application code and LLM provider APIs. All inference requests flow through the gateway, which handles authentication, routing, rate limiting, logging, cost tracking, and failover. It is the single control plane for every model interaction in your system.
The Problem It Solves
Section titled “The Problem It Solves”Without a gateway, every service that calls an LLM embeds its own API keys, retry logic, error handling, and logging. This creates several failure modes:
- API keys scattered across services, impossible to rotate quickly.
- No centralized view of spend, latency, or error rates across providers.
- Each service implements its own retry and fallback logic (or none at all).
- No way to enforce rate limits or spend caps across the organization.
- Provider switches require code changes in every calling service.
How It Works
Section titled “How It Works”┌──────────────┐ ┌──────────────┐ ┌──────────────────┐│ Service A │────▶│ │────▶│ OpenAI API │├──────────────┤ │ │ ├──────────────────┤│ Service B │────▶│ LLM Gateway │────▶│ Anthropic API │├──────────────┤ │ │ ├──────────────────┤│ Service C │────▶│ │────▶│ Self-hosted │└──────────────┘ └──────────────┘ └──────────────────┘ │ ┌─────┴─────┐ │ Logging │ │ Metrics │ │ Budgets │ └───────────┘- Application services send inference requests to the gateway using a unified API format.
- The gateway resolves which provider and model to use based on routing rules.
- The gateway injects the correct API key and transforms the request to the provider’s format.
- The response is logged, metered, and returned to the calling service in a normalized format.
- On failure, the gateway applies retry logic and fallback rules before returning an error.
When to Use It
Section titled “When to Use It”- You call LLMs from more than one service or team.
- You use (or plan to use) more than one LLM provider.
- You need centralized cost tracking and spend limits.
- You need consistent logging and tracing across all model calls.
- You want to swap providers or models without changing application code.
When NOT to Use It
Section titled “When NOT to Use It”- You have a single service making occasional LLM calls with one provider. The overhead of running a gateway is not justified until you have multiple callers or need to switch providers.
- Your application is a prototype or proof-of-concept. Direct API calls are fine for validation. Add the gateway when you move to production.
- Latency is so critical that even the small overhead of an extra network hop (~1-5ms) is unacceptable. This is rare but relevant for ultra-low-latency serving paths.
Trade-offs
Section titled “Trade-offs”- Added latency — Every request adds a network hop. Typically 1-5ms, but can matter at the margins for streaming-first applications.
- Single point of failure — The gateway must be highly available. A gateway outage takes down all LLM calls. Requires redundancy planning.
- Operational complexity — Another service to deploy, monitor, and maintain. Not free for small teams.
- Feature lag — New provider features (streaming modes, tool calling changes) require gateway updates before applications can use them.
Implementation Example
Section titled “Implementation Example”import hashlibimport hmacimport timefrom dataclasses import dataclass, field
import httpx
@dataclassclass ProviderConfig: name: str base_url: str api_key: str models: list[str] timeout: float = 30.0 max_retries: int = 2
@dataclassclass GatewayMetrics: requests: int = 0 failures: int = 0 total_tokens: int = 0 total_latency_ms: float = 0.0
class LLMGateway: def __init__(self, providers: list[ProviderConfig]): self._providers = {p.name: p for p in providers} self._model_to_provider: dict[str, str] = {} self._metrics: dict[str, GatewayMetrics] = {} for p in providers: self._metrics[p.name] = GatewayMetrics() for model in p.models: self._model_to_provider[model] = p.name
def _resolve_provider(self, model: str) -> ProviderConfig: provider_name = self._model_to_provider.get(model) if not provider_name: raise ValueError(f"No provider configured for model: {model}") return self._providers[provider_name]
async def chat_completion( self, model: str, messages: list[dict], temperature: float = 0.7, max_tokens: int = 1024, ) -> dict: provider = self._resolve_provider(model) metrics = self._metrics[provider.name] metrics.requests += 1
start = time.monotonic() async with httpx.AsyncClient(timeout=provider.timeout) as client: for attempt in range(provider.max_retries + 1): try: response = await client.post( f"{provider.base_url}/chat/completions", headers={ "Authorization": f"Bearer {provider.api_key}", "Content-Type": "application/json", }, json={ "model": model, "messages": messages, "temperature": temperature, "max_tokens": max_tokens, }, ) response.raise_for_status() result = response.json() elapsed = (time.monotonic() - start) * 1000 metrics.total_latency_ms += elapsed usage = result.get("usage", {}) metrics.total_tokens += usage.get("total_tokens", 0) return result except httpx.HTTPStatusError: if attempt == provider.max_retries: metrics.failures += 1 raise
raise RuntimeError("Unreachable")
def get_metrics(self) -> dict[str, GatewayMetrics]: return dict(self._metrics)Tool Landscape
Section titled “Tool Landscape”| Tool | Type | Notes |
|---|---|---|
| LiteLLM | Open-source proxy | Supports 100+ providers, OpenAI-compatible API |
| Portkey | Managed gateway | Built-in caching, fallback, and observability |
| Helicone | Logging-focused proxy | Strong on analytics and cost tracking |
| Kong AI Gateway | Enterprise gateway | Extends existing Kong infrastructure for LLM routing |
| Custom (nginx/Envoy) | DIY | Maximum control, highest operational burden |
Related Patterns
Section titled “Related Patterns”- Model Router Pattern — Routes to different models by complexity. Often implemented inside the gateway.
- Fallback Chain — Provider failover logic. The gateway is the natural place to implement it.
- Semantic Caching — Cache layer that integrates at the gateway level.
- Cost Attribution Pattern — The gateway is the source of truth for cost data.
- Span-Level Tracing Pattern — The gateway generates the root span for model interactions.