Circuit Breaker for LLMs
What It Is
Section titled “What It Is”A circuit breaker for LLMs monitors the failure rate, latency, and quality of LLM provider responses. When degradation exceeds a threshold, the circuit “opens” and stops sending requests to the failing provider, redirecting traffic to a fallback. After a cooldown period, the circuit enters a “half-open” state to probe whether the provider has recovered.
The Problem It Solves
Section titled “The Problem It Solves”LLM providers degrade in ways that traditional health checks miss. A provider can return HTTP 200 while delivering:
- Responses 5x slower than normal.
- Empty or truncated completions.
- Content filter rejections on legitimate queries.
- Repetitive, low-quality outputs due to backend issues.
Without a circuit breaker, your system continues sending requests to a degraded provider, accumulating user-facing failures until someone notices manually. Standard HTTP retry logic makes this worse by adding load to an already struggling provider.
How It Works
Section titled “How It Works” ┌──────────┐ │ CLOSED │ Normal operation, monitoring failures └─────┬────┘ │ failure rate > threshold ▼ ┌──────────┐ │ OPEN │ All requests go to fallback └─────┬────┘ │ cooldown expires ▼ ┌──────────┐ │HALF-OPEN │ Send probe requests to original provider └─────┬────┘ │ ┌───────┴───────┐ │ │ Probes pass Probes fail │ │ ▼ ▼ CLOSED OPEN- Closed state: Requests flow normally. The breaker tracks failures (errors, timeouts, latency threshold violations) in a sliding window.
- Trip condition: When failures in the window exceed the threshold (e.g., 50% failure rate over last 20 requests), the circuit opens.
- Open state: All requests are immediately redirected to the fallback provider. No requests reach the failing provider.
- Cooldown: After a configured cooldown period (e.g., 30-60 seconds), the circuit moves to half-open.
- Half-open state: A limited number of probe requests are sent to the original provider. If they succeed, the circuit closes. If they fail, it re-opens.
For LLMs specifically, the failure definition must include latency breaches and quality degradation, not just HTTP errors.
When to Use It
Section titled “When to Use It”- Your system has at least one fallback provider or model.
- Provider outages or degradation have caused production incidents.
- You need to protect user experience during provider issues.
- Your system makes enough requests to detect degradation statistically (at least tens of requests per minute).
When NOT to Use It
Section titled “When NOT to Use It”- You have only one LLM provider with no fallback. A circuit breaker without a fallback just fails faster — useful for protecting downstream systems from cascading failures, but not for maintaining service.
- Your request volume is too low to detect degradation reliably (fewer than 10 requests per minute). The sliding window will produce noisy signals.
- All providers share the same backend (e.g., two products both calling the same underlying model). Failing over will not help.
Trade-offs
Section titled “Trade-offs”- Flash trips — A short burst of errors can trip the circuit unnecessarily, routing traffic away from a healthy provider. Tuning the window size and threshold is critical.
- Fallback quality — The fallback model is typically weaker or more expensive. Unnecessary trips degrade quality or increase cost.
- Recovery delay — The cooldown period means you continue using the fallback even after the provider recovers. Shorter cooldowns risk re-tripping on transient issues.
- Complexity — Adds state management to your inference path. In distributed systems, circuit state must be shared across instances or each instance manages its own state.
Implementation Example
Section titled “Implementation Example”import timefrom collections import dequefrom dataclasses import dataclass, fieldfrom enum import Enumfrom threading import Lock
class CircuitState(Enum): CLOSED = "closed" OPEN = "open" HALF_OPEN = "half-open"
@dataclassclass CircuitBreaker: failure_threshold: float = 0.5 window_size: int = 20 cooldown_seconds: float = 30.0 half_open_max_probes: int = 3
_state: CircuitState = field(default=CircuitState.CLOSED, init=False) _results: deque = field(default_factory=deque, init=False) _opened_at: float = field(default=0.0, init=False) _half_open_successes: int = field(default=0, init=False) _lock: Lock = field(default_factory=Lock, init=False)
@property def state(self) -> CircuitState: with self._lock: if self._state == CircuitState.OPEN: if time.monotonic() - self._opened_at >= self.cooldown_seconds: self._state = CircuitState.HALF_OPEN self._half_open_successes = 0 return self._state
def should_allow_request(self) -> bool: current = self.state if current == CircuitState.CLOSED: return True if current == CircuitState.HALF_OPEN: return True return False
def record_success(self) -> None: with self._lock: self._results.append(True) if len(self._results) > self.window_size: self._results.popleft()
if self._state == CircuitState.HALF_OPEN: self._half_open_successes += 1 if self._half_open_successes >= self.half_open_max_probes: self._state = CircuitState.CLOSED self._results.clear()
def record_failure(self) -> None: with self._lock: self._results.append(False) if len(self._results) > self.window_size: self._results.popleft()
if self._state == CircuitState.HALF_OPEN: self._state = CircuitState.OPEN self._opened_at = time.monotonic() return
if len(self._results) >= self.window_size: failure_rate = 1 - (sum(self._results) / len(self._results)) if failure_rate >= self.failure_threshold: self._state = CircuitState.OPEN self._opened_at = time.monotonic()
def record_latency_violation(self) -> None: self.record_failure()
async def resilient_completion( breaker: CircuitBreaker, primary_fn, fallback_fn, messages: list[dict], latency_budget_ms: float = 5000.0,) -> dict: if breaker.should_allow_request(): start = time.monotonic() try: result = await primary_fn(messages=messages) elapsed_ms = (time.monotonic() - start) * 1000 if elapsed_ms > latency_budget_ms: breaker.record_latency_violation() else: breaker.record_success() return result except Exception: breaker.record_failure()
return await fallback_fn(messages=messages)Tool Landscape
Section titled “Tool Landscape”| Tool | Type | Notes |
|---|---|---|
| LiteLLM | Open-source proxy | Built-in circuit breaker with configurable thresholds per provider |
| Portkey | Managed gateway | Automatic failover with circuit breaker semantics |
| Resilience4j | Java library | Generic circuit breaker, adaptable for LLM calls |
| Polly (.NET) | .NET library | Circuit breaker policies for .NET applications |
| Custom implementation | DIY | Necessary when latency and quality (not just errors) define failures |
Related Patterns
Section titled “Related Patterns”- Fallback Chain — The circuit breaker decides when to activate the fallback chain.
- Graceful Degradation Pattern — Fallback behavior when the circuit is open.
- Health Check Pattern — Health checks complement circuit breakers with proactive detection.
- LLM Gateway Pattern — The gateway is the natural place to implement circuit breaker logic.
- SLO Pattern for AI — SLOs define the thresholds that trigger circuit breaker trips.