
LLM Gateway

An LLM Gateway is a centralized proxy layer that sits between your application code and LLM provider APIs. All inference requests flow through the gateway, which handles authentication, routing, rate limiting, logging, cost tracking, and failover. It is the single control plane for every model interaction in your system.

Without a gateway, every service that calls an LLM embeds its own API keys, retry logic, error handling, and logging. This creates several failure modes:

  • API keys scattered across services, impossible to rotate quickly.
  • No centralized view of spend, latency, or error rates across providers.
  • Each service implements its own retry and fallback logic (or none at all).
  • No way to enforce rate limits or spend caps across the organization.
  • Provider switches require code changes in every calling service.
A gateway consolidates all of these concerns into a single layer:

┌──────────────┐       ┌──────────────┐       ┌──────────────────┐
│  Service A   │──────▶│              │──────▶│    OpenAI API    │
├──────────────┤       │              │       ├──────────────────┤
│  Service B   │──────▶│ LLM Gateway  │──────▶│  Anthropic API   │
├──────────────┤       │              │       ├──────────────────┤
│  Service C   │──────▶│              │──────▶│   Self-hosted    │
└──────────────┘       └──────┬───────┘       └──────────────────┘
                              │
                       ┌──────┴──────┐
                       │   Logging   │
                       │   Metrics   │
                       │   Budgets   │
                       └─────────────┘
A typical request flows through the gateway in five steps:

  1. Application services send inference requests to the gateway using a unified API format.
  2. The gateway resolves which provider and model to use based on routing rules.
  3. The gateway injects the correct API key and transforms the request to the provider’s format.
  4. The response is logged, metered, and returned to the calling service in a normalized format.
  5. On failure, the gateway applies retry logic and fallback rules before returning an error.
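Step 3 is where provider differences are absorbed. A minimal sketch of that translation, using the public request shapes of the OpenAI Chat Completions API and the Anthropic Messages API (both simplified here for illustration):

```python
# Sketch of step 3: translating one unified chat request into
# per-provider payloads. Field names follow the public OpenAI and
# Anthropic APIs, but the shapes are simplified.

def to_openai(model: str, messages: list[dict], max_tokens: int) -> dict:
    # OpenAI accepts system messages inline in the messages array.
    return {"model": model, "messages": messages, "max_tokens": max_tokens}

def to_anthropic(model: str, messages: list[dict], max_tokens: int) -> dict:
    # Anthropic's Messages API takes the system prompt as a top-level
    # field and requires max_tokens.
    system = " ".join(m["content"] for m in messages if m["role"] == "system")
    chat = [m for m in messages if m["role"] != "system"]
    payload = {"model": model, "messages": chat, "max_tokens": max_tokens}
    if system:
        payload["system"] = system
    return payload

msgs = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi"},
]
print(to_anthropic("claude-sonnet", msgs, 256)["system"])        # Be concise.
print(len(to_anthropic("claude-sonnet", msgs, 256)["messages"]))  # 1
```

Because the calling service only ever speaks the unified format, swapping providers is a routing change inside the gateway, not a code change in the caller.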
A gateway is worth running when:

  • You call LLMs from more than one service or team.
  • You use (or plan to use) more than one LLM provider.
  • You need centralized cost tracking and spend limits.
  • You need consistent logging and tracing across all model calls.
  • You want to swap providers or models without changing application code.

It is usually not worth running when:

  • You have a single service making occasional LLM calls with one provider. The overhead of running a gateway is not justified until you have multiple callers or need to switch providers.
  • Your application is a prototype or proof-of-concept. Direct API calls are fine for validation. Add the gateway when you move to production.
  • Latency is so critical that even the small overhead of an extra network hop (~1-5ms) is unacceptable. This is rare but relevant for ultra-low-latency serving paths.
Running a gateway carries four costs:

  1. Added latency — Every request adds a network hop. Typically 1-5ms, but it can matter at the margins for streaming-first applications.
  2. Single point of failure — The gateway must be highly available. A gateway outage takes down all LLM calls, so it requires redundancy planning.
  3. Operational complexity — Another service to deploy, monitor, and maintain. Not free for small teams.
  4. Feature lag — New provider features (streaming modes, tool calling changes) require gateway updates before applications can use them.
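One of the controls only a gateway can enforce globally is a spend cap. A minimal per-team budget sketch (the team name, cap, and per-token price below are invented for illustration):

```python
# A minimal per-team spend cap, one of the controls a gateway
# centralizes. The cap and pricing are made-up example values.
from dataclasses import dataclass, field

@dataclass
class BudgetTracker:
    caps_usd: dict[str, float]                          # team -> monthly cap
    spent_usd: dict[str, float] = field(default_factory=dict)

    def charge(self, team: str, tokens: int, usd_per_1k_tokens: float) -> bool:
        """Record the spend for a request; return False once the
        request would push the team past its cap."""
        cost = tokens / 1000 * usd_per_1k_tokens
        current = self.spent_usd.get(team, 0.0)
        if current + cost > self.caps_usd.get(team, 0.0):
            return False
        self.spent_usd[team] = current + cost
        return True

tracker = BudgetTracker(caps_usd={"search-team": 1.00})
print(tracker.charge("search-team", 50_000, 0.01))   # True  ($0.50 spent)
print(tracker.charge("search-team", 60_000, 0.01))   # False (would exceed $1)
```

In a real deployment this state would live in shared storage (e.g. Redis) so that every gateway replica sees the same running total.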
A minimal gateway can be sketched in Python. The version below assumes every provider exposes an OpenAI-compatible /chat/completions endpoint:

import time
from dataclasses import dataclass

import httpx


@dataclass
class ProviderConfig:
    name: str
    base_url: str
    api_key: str
    models: list[str]          # models this provider serves
    timeout: float = 30.0
    max_retries: int = 2


@dataclass
class GatewayMetrics:
    requests: int = 0
    failures: int = 0
    total_tokens: int = 0
    total_latency_ms: float = 0.0


class LLMGateway:
    def __init__(self, providers: list[ProviderConfig]):
        self._providers = {p.name: p for p in providers}
        self._model_to_provider: dict[str, str] = {}
        self._metrics: dict[str, GatewayMetrics] = {}
        for p in providers:
            self._metrics[p.name] = GatewayMetrics()
            for model in p.models:
                self._model_to_provider[model] = p.name

    def _resolve_provider(self, model: str) -> ProviderConfig:
        # Routing rule: a static model -> provider table.
        provider_name = self._model_to_provider.get(model)
        if not provider_name:
            raise ValueError(f"No provider configured for model: {model}")
        return self._providers[provider_name]

    async def chat_completion(
        self,
        model: str,
        messages: list[dict],
        temperature: float = 0.7,
        max_tokens: int = 1024,
    ) -> dict:
        provider = self._resolve_provider(model)
        metrics = self._metrics[provider.name]
        metrics.requests += 1
        start = time.monotonic()
        async with httpx.AsyncClient(timeout=provider.timeout) as client:
            for attempt in range(provider.max_retries + 1):
                try:
                    response = await client.post(
                        f"{provider.base_url}/chat/completions",
                        headers={
                            # The caller never sees the key; the gateway
                            # injects it.
                            "Authorization": f"Bearer {provider.api_key}",
                            "Content-Type": "application/json",
                        },
                        json={
                            "model": model,
                            "messages": messages,
                            "temperature": temperature,
                            "max_tokens": max_tokens,
                        },
                    )
                    response.raise_for_status()
                    result = response.json()
                    # Meter latency and token usage per provider.
                    elapsed = (time.monotonic() - start) * 1000
                    metrics.total_latency_ms += elapsed
                    usage = result.get("usage", {})
                    metrics.total_tokens += usage.get("total_tokens", 0)
                    return result
                except httpx.HTTPStatusError as exc:
                    # Retrying 4xx errors other than 429 is pointless.
                    status = exc.response.status_code
                    if (status < 500 and status != 429) or attempt == provider.max_retries:
                        metrics.failures += 1
                        raise
        raise RuntimeError("Unreachable")

    def get_metrics(self) -> dict[str, GatewayMetrics]:
        return dict(self._metrics)
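The sketch above routes with a static model-to-provider table; step 5 of the lifecycle (fallback rules) can be layered on top of it. A self-contained sketch, with model and provider names invented for illustration:

```python
# Fallback routing layered on static model -> provider resolution.
# All model and provider names here are made up.

MODEL_TO_PROVIDER = {
    "gpt-4o": "openai",
    "claude-sonnet": "anthropic",
    "llama-8b": "selfhosted",
}
# Ordered fallback chain per primary model.
FALLBACKS = {"gpt-4o": ["claude-sonnet", "llama-8b"]}

def resolve(model: str, healthy: set[str]) -> tuple[str, str]:
    """Return (model, provider), walking the fallback chain past
    unhealthy providers; raise if every option is down."""
    for candidate in [model, *FALLBACKS.get(model, [])]:
        provider = MODEL_TO_PROVIDER[candidate]
        if provider in healthy:
            return candidate, provider
    raise RuntimeError(f"No healthy provider for {model}")

# With OpenAI marked unhealthy, gpt-4o traffic fails over to Anthropic.
print(resolve("gpt-4o", healthy={"anthropic", "selfhosted"}))
# ('claude-sonnet', 'anthropic')
```

Health state would normally come from a circuit breaker fed by the failure counts the gateway already collects, so failover is automatic rather than manually configured per incident.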
Tool                   Type                    Notes
LiteLLM                Open-source proxy       Supports 100+ providers, OpenAI-compatible API
Portkey                Managed gateway         Built-in caching, fallback, and observability
Helicone               Logging-focused proxy   Strong on analytics and cost tracking
Kong AI Gateway        Enterprise gateway      Extends existing Kong infrastructure for LLM routing
Custom (nginx/Envoy)   DIY                     Maximum control, highest operational burden