Data Contract Pattern
What It Is
Section titled “What It Is”A data contract is a formal, versioned agreement between a data producer and its consumers. It specifies the schema, data types, quality thresholds, freshness guarantees, and update frequency of a data source. Contracts are enforced as code in CI/CD pipelines, not as documentation that drifts out of sync.
The Problem It Solves
Section titled “The Problem It Solves”AI systems depend on data pipelines owned by different teams. Without contracts, a team can rename a column, change a data type, or stop populating a field — and the first sign of a problem is degraded model quality in production. Silent schema drift is one of the most common and hardest-to-debug failure modes in production AI systems.
Specific failure scenarios:
- An upstream team renames
user_idtocustomer_id. The feature pipeline breaks silently. - A field that was always populated starts arriving as null 5% of the time. Model predictions degrade gradually.
- An upstream pipeline switches from daily to weekly updates. The freshness assumption in your RAG pipeline breaks.
- A categorical field gains new values the model was never trained on. Predictions become unreliable for those categories.
How It Works
Section titled “How It Works”┌──────────────┐ Contract ┌──────────────┐│ Data Producer │───(schema +───│ Data Consumer ││ (Team A) │ quality + │ (AI Team) │└──────┬───────┘ SLA) └──────┬───────┘ │ │ ▼ ▼ Producer CI: Consumer CI: Validate output Validate input against contract against contract- Producer and consumer teams agree on a contract definition (schema, quality rules, SLAs).
- The contract is stored as a versioned artifact (YAML, JSON, or Protobuf) in a shared repository or registry.
- The producer’s CI pipeline validates outgoing data against the contract before publishing.
- The consumer’s pipeline validates incoming data against the contract before ingestion.
- Contract violations trigger alerts and block bad data from propagating downstream.
- Contract changes follow a versioning and deprecation protocol — breaking changes require a migration path.
When to Use It
Section titled “When to Use It”- Your AI system consumes data produced by other teams or external systems.
- You have experienced production incidents caused by upstream data changes.
- You need to enforce freshness, completeness, or type guarantees on data feeding models or retrieval systems.
- Multiple consumers depend on the same data source and need a shared quality bar.
When NOT to Use It
Section titled “When NOT to Use It”- You own both the producer and consumer pipelines and they change together. In this case, schema enforcement in your own codebase is sufficient — a formal contract adds ceremony without value.
- Your data sources are ephemeral or experimental. Contracts assume stability. If the schema changes weekly during rapid iteration, contracts become a bottleneck.
- The data is consumed only by humans for exploration (dashboards, ad-hoc queries). Contracts are for machine consumers that break silently.
Trade-offs
Section titled “Trade-offs”- Coordination overhead — Establishing contracts requires cross-team agreement. This is valuable but time-consuming, especially in organizations without strong data governance.
- Rigidity — Contracts resist change by design. Legitimate schema evolution becomes a multi-step process instead of a quick update.
- False sense of security — A contract that validates schema but not data distribution gives confidence without catching the most common issues (value drift, null rate changes).
- Adoption challenge — Contracts only work if producers actually run the validations. Without organizational commitment, they become ignored documentation.
Implementation Example
Section titled “Implementation Example”from dataclasses import dataclass
import yaml
@dataclassclass FieldContract: name: str dtype: str nullable: bool = False min_value: float | None = None max_value: float | None = None allowed_values: list[str] | None = None max_null_rate: float = 0.0
@dataclassclass DataContract: name: str version: str owner: str fields: list[FieldContract] freshness_hours: int = 24 min_row_count: int = 0
@classmethod def from_yaml(cls, path: str) -> "DataContract": with open(path) as f: raw = yaml.safe_load(f) fields = [FieldContract(**f) for f in raw["fields"]] return cls( name=raw["name"], version=raw["version"], owner=raw["owner"], fields=fields, freshness_hours=raw.get("freshness_hours", 24), min_row_count=raw.get("min_row_count", 0), )
def validate_dataframe(df, contract: DataContract) -> list[str]: violations = []
if len(df) < contract.min_row_count: violations.append( f"Row count {len(df)} below minimum {contract.min_row_count}" )
for field in contract.fields: if field.name not in df.columns: violations.append(f"Missing required field: {field.name}") continue
col = df[field.name] null_rate = col.isnull().mean()
if not field.nullable and null_rate > 0: violations.append(f"{field.name}: contains nulls but is not nullable") elif null_rate > field.max_null_rate: violations.append( f"{field.name}: null rate {null_rate:.2%} exceeds max {field.max_null_rate:.2%}" )
if field.min_value is not None: below = (col.dropna() < field.min_value).sum() if below > 0: violations.append( f"{field.name}: {below} values below minimum {field.min_value}" )
if field.allowed_values is not None: invalid = set(col.dropna().unique()) - set(field.allowed_values) if invalid: violations.append( f"{field.name}: unexpected values {invalid}" )
return violationsContract definition (YAML):
name: user_featuresversion: "2.1"owner: data-platform-teamfreshness_hours: 12min_row_count: 10000fields: - name: user_id dtype: string nullable: false - name: signup_days_ago dtype: int nullable: false min_value: 0 - name: plan_type dtype: string nullable: false allowed_values: ["free", "pro", "enterprise"] - name: monthly_api_calls dtype: float nullable: true max_null_rate: 0.05 min_value: 0Tool Landscape
Section titled “Tool Landscape”| Tool | Type | Notes |
|---|---|---|
| Soda | Data quality platform | Contract-style checks with YAML definitions and CI integration |
| Great Expectations | Open-source library | Expectation-based data validation, widely adopted |
| dbt contracts | Built into dbt | Schema contracts enforced at the transformation layer |
| Dataform | Google Cloud | Schema assertions in the transformation pipeline |
| Protobuf/Avro | Schema registries | Strong typing for streaming data with schema evolution rules |
Related Patterns
Section titled “Related Patterns”- Feature Store Pattern — Feature stores consume data that should be covered by contracts.
- Data Observability Pattern — Observability detects contract violations at runtime. Contracts define what “correct” means.
- Training Data Pipeline Pattern — Training pipelines need contracts on their input data sources.
- Eval Dataset Management — Eval datasets are themselves data artifacts that benefit from contract enforcement.