Glossary
Agent — An AI system that can autonomously plan and execute multi-step tasks using tools and external APIs. Distinguished from simple LLM calls by the ability to take actions beyond text generation.
Batch Inference — Processing multiple inference requests together rather than individually. Trades latency for throughput and cost efficiency.
BM25 — A probabilistic keyword-based ranking function used in sparse retrieval. Relies on term frequency and document length normalization rather than learned embeddings.
Canary Deployment — Releasing a new model or prompt version to a small subset of traffic before full rollout. Limits blast radius of quality regressions.
Chain-of-Thought (CoT) — Prompting strategy that asks the model to show intermediate reasoning steps. Improves accuracy on complex tasks at the cost of additional output tokens.
Chunking — Splitting documents into smaller segments for embedding and retrieval. Chunk size and strategy directly affect retrieval quality.
Circuit Breaker — A pattern that monitors failure rates and temporarily stops sending requests to a degraded service. Prevents cascade failures and allows recovery time.
Context Window — The maximum number of tokens a model can process in a single request, including both input and output tokens.
Data Contract — A formal agreement between data producers and consumers specifying schema, quality thresholds, and SLAs. Prevents upstream changes from silently breaking downstream systems.
Dense Retrieval — Using learned vector embeddings to find semantically similar documents. Captures meaning rather than exact keyword matches.
Distribution Shift — When the statistical properties of input data in production differ from what was expected during development or training.
Embedding — A fixed-size vector representation of text (or other data) in a continuous vector space. Similar meanings map to nearby vectors.
Eval Dataset — A curated set of input-output pairs used to measure model or system quality. Distinct from training data.
Fallback Chain — An ordered list of alternative providers or models to try when the primary option fails or degrades.
Feature Store — A centralized repository for storing, versioning, and serving ML features. Ensures consistency between training and serving.
Fine-Tuning — Continuing the training of a pre-trained model on domain-specific data to improve performance on targeted tasks.
Graceful Degradation — Returning a reduced-quality but still useful response when the full system is unavailable, rather than returning an error.
GraphRAG — Retrieval-augmented generation that uses knowledge graphs instead of (or alongside) vector stores. Handles relational and multi-hop queries.
Guardrails — Input and output validation layers that enforce safety, quality, and compliance constraints on LLM interactions.
Hallucination — When a model generates content that is factually incorrect, fabricated, or not grounded in the provided context.
Human-in-the-Loop (HITL) — A design pattern where certain automated decisions require human approval before execution. Used for high-risk or high-stakes actions.
Hybrid Search — Combining dense (vector) and sparse (keyword) retrieval methods and merging their results. Consistently outperforms either method alone.
Inference — The process of running a trained model on new inputs to produce outputs. In production AI, this typically means making API calls to LLM providers.
Inference-Time Compute — Allocating additional computation (more reasoning steps, multiple samples) at inference time to improve quality on hard queries.
Lakehouse — A data architecture combining data lake flexibility with data warehouse reliability. Provides ACID transactions, versioning, and schema enforcement.
Lineage — The complete tracking of data from raw source through transformations, training, and inference. Required for compliance and debugging.
LLM Gateway — A centralized proxy layer between applications and LLM providers that handles routing, authentication, rate limiting, logging, and failover.
Model Card — Standardized documentation of a model’s capabilities, limitations, intended uses, training data, and known failure modes.
Model Router — A system that routes inference requests to different models based on query complexity, cost targets, or latency requirements.
Observability — The ability to understand the internal state of a system from its external outputs. For AI systems, this includes tracing, metrics, logging, and quality monitoring.
Pagefind — A static site search library that indexes content at build time. Used by this site for full-text search without external services.
PII (Personally Identifiable Information) — Any data that could identify a specific individual. Must be detected and handled before entering prompts or logs.
Policy-as-Code — Encoding compliance rules as machine-checkable assertions that run automatically in CI/CD pipelines.
Prompt Injection — An attack where malicious instructions are embedded in user input to manipulate model behavior. The LLM equivalent of SQL injection.
Prompt Regression — When changes to prompts, system instructions, or models cause quality to degrade on previously passing test cases.
RAG (Retrieval-Augmented Generation) — A pattern that retrieves relevant context from external sources and includes it in the prompt before generation. Grounds model outputs in actual data.
Reranking — A two-stage retrieval approach where a first-pass retriever returns candidates and a second-pass model scores and reorders them for relevance.
Semantic Caching — Caching LLM responses indexed by the semantic meaning of queries rather than exact string matches. Serves similar (not just identical) queries from cache.
Shadow Mode — Running a new model in parallel on live traffic without serving its outputs to users. Used to compare quality before switching.
SLO (Service Level Objective) — A target value for a service metric (latency, quality, cost) that defines acceptable performance. More specific than an SLA.
Span — A single unit of work within a trace. In AI systems, spans typically represent retrieval, prompt construction, inference, and postprocessing steps.
Sparse Retrieval — Keyword-based retrieval methods (like BM25) that match on exact terms rather than learned representations.
Structured Output — Constraining model output to match a predefined schema (JSON, XML) using tools like Pydantic or Zod. Eliminates parsing failures.
TTFT (Time to First Token) — The latency between sending a request and receiving the first token of the response. Critical for perceived responsiveness in streaming applications.
Token — The fundamental unit of text that LLMs process. A token is roughly 3/4 of a word in English. Costs are typically measured per token.
Token Budget — A hard limit on the number of input and/or output tokens per request. Prevents unbounded costs from long contexts or verbose outputs.
Train/Serve Skew — Differences between the data or features used during model training and those available during serving. A common source of production quality issues.
Vector Store — A database optimized for storing and querying high-dimensional vectors (embeddings). The storage layer for dense retrieval in RAG systems.