Glossary

Agent — An AI system that can autonomously plan and execute multi-step tasks using tools and external APIs. Distinguished from simple LLM calls by the ability to take actions beyond text generation.

Batch Inference — Processing multiple inference requests together rather than individually. Trades latency for throughput and cost efficiency.

BM25 — A probabilistic keyword-based ranking function used in sparse retrieval. Relies on term frequency and document length normalization rather than learned embeddings.
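
A minimal, stdlib-only sketch of the Okapi BM25 formula over pre-tokenized documents; `k1` and `b` are the standard free parameters (the values below are common defaults):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms with Okapi BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N         # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)    # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc.count(term)                        # term frequency in this doc
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Note the length normalization in the denominator: long documents need more term occurrences to reach the same score.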

Canary Deployment — Releasing a new model or prompt version to a small subset of traffic before full rollout. Limits blast radius of quality regressions.

Chain-of-Thought (CoT) — Prompting strategy that asks the model to show intermediate reasoning steps. Improves accuracy on complex tasks at the cost of additional output tokens.

Chunking — Splitting documents into smaller segments for embedding and retrieval. Chunk size and strategy directly affect retrieval quality.
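
A minimal fixed-size chunker with overlap; the sizes here are illustrative, and production chunkers often split on sentence or section boundaries instead:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping fixed-size character chunks.

    Overlap preserves context that would otherwise be cut at chunk edges.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```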

Circuit Breaker — A pattern that monitors failure rates and temporarily stops sending requests to a degraded service. Prevents cascade failures and allows recovery time.
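
A minimal sketch of the pattern; the `max_failures` and `reset_after` values are illustrative:

```python
import time

class CircuitBreaker:
    """Open the circuit after consecutive failures; retry after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: request short-circuited")
            self.opened_at = None  # half-open: allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

While open, requests fail fast instead of queuing behind a degraded dependency.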

Context Window — The maximum number of tokens a model can process in a single request, including both input and output tokens.

Data Contract — A formal agreement between data producers and consumers specifying schema, quality thresholds, and SLAs. Prevents upstream changes from silently breaking downstream systems.

Dense Retrieval — Using learned vector embeddings to find semantically similar documents. Captures meaning rather than exact keyword matches.

Distribution Shift — When the statistical properties of input data in production differ from what was expected during development or training.

Embedding — A fixed-size vector representation of text (or other data) in a continuous vector space. Similar meanings map to nearby vectors.

Eval Dataset — A curated set of input-output pairs used to measure model or system quality. Distinct from training data.

Fallback Chain — An ordered list of alternative providers or models to try when the primary option fails or degrades.
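
A minimal sketch, assuming each provider is a `(name, callable)` pair; real chains also apply per-provider timeouts and circuit breakers:

```python
def call_with_fallback(providers, prompt):
    """Try each provider in order; return the first successful response."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append((name, exc))  # record and move to the next provider
    raise RuntimeError(f"all providers failed: {errors}")
```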

Feature Store — A centralized repository for storing, versioning, and serving ML features. Ensures consistency between training and serving.

Fine-Tuning — Continuing the training of a pre-trained model on domain-specific data to improve performance on targeted tasks.

Graceful Degradation — Returning a reduced-quality but still useful response when the full system is unavailable, rather than returning an error.

GraphRAG — Retrieval-augmented generation that uses knowledge graphs instead of (or alongside) vector stores. Handles relational and multi-hop queries.

Guardrails — Input and output validation layers that enforce safety, quality, and compliance constraints on LLM interactions.

Hallucination — When a model generates content that is factually incorrect, fabricated, or not grounded in the provided context.

Human-in-the-Loop (HITL) — A design pattern where certain automated decisions require human approval before execution. Used for high-risk or high-stakes actions.

Hybrid Search — Combining dense (vector) and sparse (keyword) retrieval methods and merging their results. Often outperforms either method alone, especially on queries that mix exact identifiers with natural-language phrasing.
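
One common way to merge the two ranked result lists is reciprocal rank fusion (RRF); a minimal sketch, with the conventional `k=60` constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of both lists rank highest; RRF needs no score normalization across the two retrievers.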

Inference — The process of running a trained model on new inputs to produce outputs. In production AI, this typically means making API calls to LLM providers.

Inference-Time Compute — Allocating additional computation (more reasoning steps, multiple samples) at inference time to improve quality on hard queries.

Lakehouse — A data architecture combining data lake flexibility with data warehouse reliability. Provides ACID transactions, versioning, and schema enforcement.

Lineage — The complete tracking of data from raw source through transformations, training, and inference. Required for compliance and debugging.

LLM Gateway — A centralized proxy layer between applications and LLM providers that handles routing, authentication, rate limiting, logging, and failover.

Model Card — Standardized documentation of a model’s capabilities, limitations, intended uses, training data, and known failure modes.

Model Router — A system that routes inference requests to different models based on query complexity, cost targets, or latency requirements.
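
A toy sketch of routing on a crude complexity heuristic; the model names, token estimate, and thresholds are all hypothetical:

```python
def route(query, cheap_model="small-model", strong_model="large-model",
          max_cheap_tokens=100):
    """Send long or multi-question queries to the stronger model."""
    approx_tokens = len(query.split()) * 4 // 3   # rough words-to-tokens estimate
    complex_query = approx_tokens > max_cheap_tokens or query.count("?") > 1
    return strong_model if complex_query else cheap_model
```

Production routers typically use a trained classifier or a cheap LLM call rather than string heuristics.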

Observability — The ability to understand the internal state of a system from its external outputs. For AI systems, this includes tracing, metrics, logging, and quality monitoring.

Pagefind — A static site search library that indexes content at build time. Used by this site for full-text search without external services.

PII (Personally Identifiable Information) — Any data that could identify a specific individual. Must be detected and handled before entering prompts or logs.

Policy-as-Code — Encoding compliance rules as machine-checkable assertions that run automatically in CI/CD pipelines.

Prompt Injection — An attack where malicious instructions are embedded in user input to manipulate model behavior. The LLM equivalent of SQL injection.

Prompt Regression — When changes to prompts, system instructions, or models cause quality to degrade on previously passing test cases.

RAG (Retrieval-Augmented Generation) — A pattern that retrieves relevant context from external sources and includes it in the prompt before generation. Grounds model outputs in actual data.

Reranking — A two-stage retrieval approach where a first-pass retriever returns candidates and a second-pass model scores and reorders them for relevance.

Semantic Caching — Caching LLM responses indexed by the semantic meaning of queries rather than exact string matches. Serves similar (not just identical) queries from cache.
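
A minimal in-memory sketch, assuming an `embed` callable that maps text to a vector; real systems use a vector index and a tuned similarity threshold:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached response when a stored query embedding is close enough."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def get(self, query):
        qv = self.embed(query)
        for ev, response in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return response
        return None                 # cache miss: caller hits the LLM

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```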

Shadow Mode — Running a new model in parallel on live traffic without serving its outputs to users. Used to compare quality before switching.

SLO (Service Level Objective) — A target value for a service metric (latency, quality, cost) that defines acceptable performance. An internal engineering target, typically stricter than the contractual SLA built on top of it.

Span — A single unit of work within a trace. In AI systems, spans typically represent retrieval, prompt construction, inference, and postprocessing steps.

Sparse Retrieval — Keyword-based retrieval methods (like BM25) that match on exact terms rather than learned representations.

Structured Output — Constraining model output to match a predefined schema (JSON, XML) using tools like Pydantic or Zod. Sharply reduces parsing failures from free-form text.
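
A stdlib-only sketch of the validation side of the idea (real systems typically use Pydantic or Zod, as noted above); `TICKET_SCHEMA` and its fields are hypothetical:

```python
import json

# Hypothetical schema: field name -> required Python type.
TICKET_SCHEMA = {"title": str, "priority": int, "tags": list}

def parse_structured(raw, schema):
    """Parse a model's JSON reply and enforce a simple field/type schema."""
    data = json.loads(raw)
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} should be {expected.__name__}")
    return data
```

Failed validation becomes a typed error the caller can retry on, instead of malformed data flowing downstream.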

TTFT (Time to First Token) — The latency between sending a request and receiving the first token of the response. Critical for perceived responsiveness in streaming applications.

Token — The fundamental unit of text that LLMs process. A token is roughly 3/4 of a word in English. Costs are typically measured per token.

Token Budget — A hard limit on the number of input and/or output tokens per request. Prevents unbounded costs from long contexts or verbose outputs.
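
A minimal sketch that enforces an input budget by dropping the oldest conversation turns, assuming a `count_tokens` callable (real systems use the provider's tokenizer):

```python
def enforce_token_budget(messages, count_tokens, max_input_tokens=4000):
    """Drop oldest messages until the conversation fits the input budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_input_tokens:
        kept.pop(0)   # drop the oldest turn first
    return kept
```

Other strategies include summarizing dropped turns or always pinning the system prompt before trimming.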

Train/Serve Skew — Differences between the data or features used during model training and those available during serving. A common source of production quality issues.

Vector Store — A database optimized for storing and querying high-dimensional vectors (embeddings). The storage layer for dense retrieval in RAG systems.