AI Engineering Patterns

What It Is

An LLM Gateway is a centralized proxy layer that sits between your application code and LLM provider APIs. All inference requests flow through the gateway, which handles authentication, routing, rate limiting, logging, cost tracking, and failover. It is the single control plane for every model interaction in your system.

The Problem It Solves

Without a gateway, every service that calls an LLM embeds its own API keys, retry logic, error handling, and logging. This creates several failure modes:

API keys scattered across services, impossible to rotate quickly.
No centralized view of spend, latency, or error rates across providers.
Each service implements its own retry and fallback logic (or none at all).
No way to enforce rate limits or spend caps across the organization.
Provider switches require code changes in every calling service.

How It Works

flowchart LR
    subgraph S["Application Services"]
      A["Service A"]
      B["Service B"]
      C["Service C"]
    end

    subgraph G["LLM Gateway"]
      D["Auth + policy checks"]
      E["Routing + retries + failover"]
      F["Request + response normalization"]
    end

    subgraph P["Providers"]
      P1["OpenAI API"]
      P2["Anthropic API"]
      P3["Self-hosted model API"]
    end

    A --> D
    B --> D
    C --> D
    D --> E --> F
    F --> P1
    F --> P2
    F --> P3
    F --> M["Logs, metrics, and cost budgets"]

Application services send inference requests to the gateway using a unified API format.
The gateway resolves which provider and model to use based on routing rules.
The gateway injects the correct API key and transforms the request to the provider's format.
The response is logged, metered, and returned to the calling service in a normalized format.
On failure, the gateway applies retry logic and fallback rules before returning an error.
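
The routing and request-transformation steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `UnifiedRequest`, `ROUTES`, `resolve_route`, `to_anthropic_request`, and the model IDs are hypothetical names introduced here. The header names and the top-level `system` field reflect Anthropic's Messages API as commonly documented, so verify them against the provider's current reference before relying on them.

```python
from dataclasses import dataclass


@dataclass
class UnifiedRequest:
    """Hypothetical gateway-side request shape (OpenAI-style messages)."""
    model: str
    messages: list[dict]
    max_tokens: int = 1024
    temperature: float = 0.7


# Hypothetical routing table: model alias -> (provider name, provider model ID).
ROUTES = {
    "fast": ("openai", "example-small-model"),
    "smart": ("anthropic", "example-large-model"),
}


def resolve_route(alias: str) -> tuple[str, str]:
    """Look up which provider and concrete model serve a model alias."""
    try:
        return ROUTES[alias]
    except KeyError:
        raise ValueError(f"No route configured for model alias: {alias}") from None


def to_anthropic_request(req: UnifiedRequest, provider_model: str, api_key: str) -> dict:
    """Translate a unified request into headers and body for a Messages-style API.

    System messages move to a top-level "system" field; all other
    messages are passed through unchanged.
    """
    system_parts = [m["content"] for m in req.messages if m["role"] == "system"]
    chat = [m for m in req.messages if m["role"] != "system"]
    body = {
        "model": provider_model,
        "max_tokens": req.max_tokens,
        "temperature": req.temperature,
        "messages": chat,
    }
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    headers = {
        "x-api-key": api_key,  # injected by the gateway, never sent by callers
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    return {"headers": headers, "json": body}
```

Callers only ever see the alias ("fast", "smart"), so changing the entry in `ROUTES` is enough to move traffic to another provider.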

When to Use It

You call LLMs from more than one service or team.
You use (or plan to use) more than one LLM provider.
You need centralized cost tracking and spend limits.
You need consistent logging and tracing across all model calls.
You want to swap providers or models without changing application code.

When Not to Use It

You have a single service making occasional LLM calls with one provider. The overhead of running a gateway is not justified until you have multiple callers or need to switch providers.
Your application is a prototype or proof-of-concept. Direct API calls are fine for validation. Add the gateway when you move to production.
Latency is so critical that even the small overhead of an extra network hop (~1-5ms) is unacceptable. This is rare but relevant for ultra-low-latency serving paths.

Trade-offs

Added latency — Every request adds a network hop. Typically 1-5ms, but can matter at the margins for streaming-first applications.
Single point of failure — The gateway must be highly available. A gateway outage takes down all LLM calls. Requires redundancy planning.
Operational complexity — Another service to deploy, monitor, and maintain. Not free for small teams.
Feature lag — New provider features (streaming modes, tool calling changes) require gateway updates before applications can use them.

Failure Modes

Gateway Becomes the Bottleneck

Trigger: The gateway is deployed as a single instance or with insufficient scaling, and traffic exceeds its capacity.
Symptom: All LLM calls across all services experience increased latency or timeouts simultaneously, even though the underlying providers are healthy.
Mitigation: Deploy the gateway with horizontal auto-scaling. Add health checks that include throughput metrics, not just liveness. Load-test the gateway independently from providers.
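
A health check that reflects load rather than liveness might look like the sketch below. `ReadinessProbe` and its thresholds are hypothetical. The sketch also assumes the gateway can measure its own overhead, meaning time spent inside the gateway excluding the provider call, so that a slow provider does not mark a healthy gateway as unready.

```python
from collections import deque


class ReadinessProbe:
    """Readiness signal based on load, not just process liveness."""

    def __init__(self, max_in_flight: int, max_avg_overhead_ms: float, window: int = 100):
        self.max_in_flight = max_in_flight
        self.max_avg_overhead_ms = max_avg_overhead_ms
        self.in_flight = 0
        # Rolling window of gateway-added time per request, in milliseconds.
        self._overheads: deque[float] = deque(maxlen=window)

    def request_started(self) -> None:
        self.in_flight += 1

    def request_finished(self, overhead_ms: float) -> None:
        self.in_flight -= 1
        self._overheads.append(overhead_ms)

    def is_ready(self) -> bool:
        """Return False when the instance is saturated or slowing down."""
        if self.in_flight >= self.max_in_flight:
            return False
        if self._overheads:
            avg = sum(self._overheads) / len(self._overheads)
            if avg > self.max_avg_overhead_ms:
                return False
        return True
```

Wiring `is_ready()` into the load balancer's readiness endpoint lets an overloaded instance shed traffic before callers start timing out.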

Silent Credential Rotation Failure

Trigger: An API key is rotated at one provider, but the gateway keeps using the cached, now-revoked credential.
Symptom: Requests to that provider start failing with 401/403 errors while other providers work fine. Teams blame the provider before checking the gateway.
Mitigation: Validate the new credential as part of rotation, invalidate the cached one, and emit a dedicated alert for authentication failures that is distinct from rate-limit or model errors.
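
One way to keep authentication failures from being lost among other errors is to classify upstream status codes before alerting. The sketch below is illustrative only: `FailureKind`, `classify_failure`, `alert_name`, and the alert naming scheme are assumptions, not part of any particular gateway product.

```python
from enum import Enum


class FailureKind(Enum):
    AUTH = "auth"
    RATE_LIMIT = "rate_limit"
    PROVIDER_ERROR = "provider_error"
    CLIENT_ERROR = "client_error"


def classify_failure(status_code: int) -> FailureKind:
    """Map an upstream HTTP error status to a coarse failure category."""
    if status_code in (401, 403):
        return FailureKind.AUTH
    if status_code == 429:
        return FailureKind.RATE_LIMIT
    if status_code >= 500:
        return FailureKind.PROVIDER_ERROR
    return FailureKind.CLIENT_ERROR


def alert_name(provider: str, status_code: int) -> str:
    """Build a per-provider, per-category alert key (hypothetical naming scheme)."""
    return f"llm_gateway.{provider}.{classify_failure(status_code).value}"
```

An alert keyed on the `auth` category for a single provider points directly at credentials, which shortens the time spent suspecting the provider.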

Logging Amplifies Costs

Trigger: Verbose logging (full prompts and completions) is enabled without sampling or size limits.
Symptom: Log storage costs approach or exceed LLM inference costs. Log ingestion pipelines lag, causing alert delays.
Mitigation: Log prompt/completion hashes by default; store full payloads only for sampled requests or flagged interactions. Set retention policies.
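
The hash-by-default approach can be sketched as below. The function names, the 1% default sample rate, and the 4000-character cap are hypothetical choices for illustration. Sampling is derived from a hash of the request ID, so the decision is deterministic and repeatable for a given request.

```python
import hashlib


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same request ID always gets the same decision."""
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(_sha256(request_id)[:8], 16) / 0x100000000
    return bucket < sample_rate


def build_log_record(
    request_id: str,
    prompt: str,
    completion: str,
    sample_rate: float = 0.01,
    flagged: bool = False,
    max_chars: int = 4000,
) -> dict:
    """Always log hashes and sizes; attach truncated payloads only when sampled or flagged."""
    record = {
        "request_id": request_id,
        "prompt_sha256": _sha256(prompt),
        "completion_sha256": _sha256(completion),
        "prompt_chars": len(prompt),
        "completion_chars": len(completion),
    }
    if flagged or should_sample(request_id, sample_rate):
        record["prompt"] = prompt[:max_chars]
        record["completion"] = completion[:max_chars]
    return record
```

Hashes still allow deduplication and exact-match lookups, while the size cap bounds the worst-case cost of any single sampled record.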

Implementation Example

import asyncio
import time
from dataclasses import dataclass

import httpx

# Status codes worth retrying: rate limiting and transient server errors.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


@dataclass
class ProviderConfig:
    name: str
    base_url: str
    api_key: str
    models: list[str]
    timeout: float = 30.0
    max_retries: int = 2


@dataclass
class GatewayMetrics:
    requests: int = 0
    failures: int = 0
    total_tokens: int = 0
    total_latency_ms: float = 0.0


class LLMGateway:
    def __init__(self, providers: list[ProviderConfig]):
        self._providers = {p.name: p for p in providers}
        self._model_to_provider: dict[str, str] = {}
        self._metrics: dict[str, GatewayMetrics] = {}
        for p in providers:
            self._metrics[p.name] = GatewayMetrics()
            for model in p.models:
                self._model_to_provider[model] = p.name

    def _resolve_provider(self, model: str) -> ProviderConfig:
        provider_name = self._model_to_provider.get(model)
        if not provider_name:
            raise ValueError(f"No provider configured for model: {model}")
        return self._providers[provider_name]

    async def chat_completion(
        self,
        model: str,
        messages: list[dict],
        temperature: float = 0.7,
        max_tokens: int = 1024,
    ) -> dict:
        provider = self._resolve_provider(model)
        metrics = self._metrics[provider.name]
        metrics.requests += 1

        start = time.monotonic()
        # A per-call client keeps the example short; a long-lived shared
        # client would reuse connections across requests.
        async with httpx.AsyncClient(timeout=provider.timeout) as client:
            for attempt in range(provider.max_retries + 1):
                try:
                    # Assumes the provider exposes an OpenAI-compatible
                    # /chat/completions endpoint with Bearer auth; providers
                    # with a different native API need an adapter here.
                    response = await client.post(
                        f"{provider.base_url}/chat/completions",
                        headers={
                            "Authorization": f"Bearer {provider.api_key}",
                            "Content-Type": "application/json",
                        },
                        json={
                            "model": model,
                            "messages": messages,
                            "temperature": temperature,
                            "max_tokens": max_tokens,
                        },
                    )
                    response.raise_for_status()
                    result = response.json()
                    # Latency is recorded for successful calls only.
                    elapsed = (time.monotonic() - start) * 1000
                    metrics.total_latency_ms += elapsed
                    usage = result.get("usage", {})
                    metrics.total_tokens += usage.get("total_tokens", 0)
                    return result
                except httpx.HTTPStatusError as exc:
                    # Do not retry errors that will fail again (400, 401, 404, ...).
                    retryable = exc.response.status_code in RETRYABLE_STATUS
                    if not retryable or attempt == provider.max_retries:
                        metrics.failures += 1
                        raise
                except httpx.TransportError:
                    # Timeouts and connection errors never produce a status code.
                    if attempt == provider.max_retries:
                        metrics.failures += 1
                        raise
                # Exponential backoff before the next attempt: 0.5s, 1s, 2s, ...
                await asyncio.sleep(0.5 * 2**attempt)

        raise RuntimeError("Unreachable")

    def get_metrics(self) -> dict[str, GatewayMetrics]:
        return dict(self._metrics)

Tool Landscape

Tool	Type	Notes
LiteLLM	Open-source proxy	Supports 100+ providers, OpenAI-compatible API
Portkey	Managed gateway	Built-in caching, fallback, and observability
Helicone	Logging-focused proxy	Strong on analytics and cost tracking
Kong AI Gateway	Enterprise gateway	Extends existing Kong infrastructure for LLM routing
Custom (nginx/Envoy)	DIY	Maximum control, highest operational burden

Related Patterns

Model Router Pattern — Routes to different models by complexity. Often implemented inside the gateway.
Fallback Chain — Provider failover logic. The gateway is the natural place to implement it.
Semantic Caching — Cache layer that integrates at the gateway level.
Cost Attribution Pattern — The gateway is the source of truth for cost data.
Span-Level Tracing Pattern — The gateway generates the root span for model interactions.