Data contracts: the missing layer between teams

A data engineer at a logistics company once described their situation to us as "playing Jenga with production pipelines." Every few weeks, an upstream team would change a column name, add a nullable field, or drop an event type — without telling anyone downstream. Pipelines would silently start producing wrong numbers, or break entirely. The operations dashboard that warehouse managers relied on would show stale data for hours before anyone noticed.

This is a data contract problem. Or more precisely, it is the absence of data contracts.

What is a data contract?

A data contract is a formal agreement between a data producer and a data consumer that defines what the producer will deliver, in what format, on what schedule, and with what quality guarantees. It is a service-level agreement for data.

At a minimum, a data contract specifies:

Schema — field names, types, and nullability
Semantics — what each field means and any known edge cases
Freshness SLA — how frequently data is updated and the maximum acceptable latency
Completeness expectations — what percentage of expected records should be present
Versioning policy — how the producer will communicate breaking changes and the notice period

The contract is owned by the producer and consumed by downstream teams. When a producer wants to make a breaking change, they are responsible for notifying consumers in advance and for either maintaining backwards compatibility during a transition period or negotiating a migration path.

Why most teams don't have them

Data contracts require coordination across team boundaries. The application team producing events, the data engineering team ingesting them, and the analytics team consuming the output all need to agree. In most organisations, these teams have different priorities, different planning cycles, and often different tooling. Nobody owns the interface between them.

The result is implicit contracts — undocumented assumptions about what upstream systems will deliver. Implicit contracts work fine until someone upstream makes a change that breaks them. And because the contract was never written down, nobody knows it was broken until a downstream dashboard shows wrong numbers.

An implicit contract is just a future incident waiting to happen. Making it explicit costs an afternoon. The first incident it prevents will save days.

A minimal contract format

You do not need a specialised platform to start. A data contract can begin as a YAML file in a shared repository:

contract:
  name: order_events
  version: "2.1.0"
  owner: platform-team@company.com
  consumers:
    - data-engineering@company.com
    - analytics@company.com

sla:
  freshness_minutes: 5
  completeness_threshold: 0.995

schema:
  - name: event_id
    type: string
    nullable: false
    description: Unique identifier for the event. UUID v4.
  - name: order_id
    type: string
    nullable: false
    description: Internal order identifier. References orders table.
  - name: event_type
    type: string
    nullable: false
    description: >
      One of: order_placed, payment_confirmed, shipped, delivered, cancelled.
      No other values will be emitted without a version bump.
  - name: occurred_at
    type: timestamp
    nullable: false
    description: UTC timestamp of when the event occurred on the source system.
  - name: customer_id
    type: string
    nullable: true
    description: Present for authenticated sessions only. Null for guest orders.

versioning:
  breaking_change_notice_days: 14
  deprecation_policy: >
    Fields are deprecated with a 30-day notice before removal.
    Deprecated fields are marked in the schema with deprecated: true.

This YAML file lives in version control alongside the application code. Changes to the contract trigger a review process. Consumers subscribe to change notifications. The contract becomes part of the PR review for any upstream change that touches the data interface.
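To make the sla block concrete, here is a rough Python sketch of how its two thresholds might be evaluated. The function names are hypothetical, and how the expected record count is obtained (for example, from a count on the source system) is an assumption; the contract itself only fixes the two numbers.

```python
from datetime import datetime, timedelta, timezone

# Values copied by hand from the sla block of the order_events contract.
FRESHNESS_MINUTES = 5
COMPLETENESS_THRESHOLD = 0.995


def is_fresh(latest_occurred_at: datetime, now: datetime) -> bool:
    """True if the newest event is no older than the freshness SLA."""
    return now - latest_occurred_at <= timedelta(minutes=FRESHNESS_MINUTES)


def is_complete(received: int, expected: int) -> bool:
    """True if the share of expected records that arrived meets the threshold."""
    if expected == 0:
        return True  # nothing was expected, so nothing is missing
    return received / expected >= COMPLETENESS_THRESHOLD


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(minutes=3), now))      # newest event is 3 minutes old
print(is_fresh(now - timedelta(minutes=9), now))      # newest event is 9 minutes old
print(is_complete(received=9_960, expected=10_000))   # 99.6% arrived
print(is_complete(received=9_900, expected=10_000))   # 99.0% arrived
```

A check like this could run on a schedule next to the pipeline, so that an SLA breach is reported to the producer rather than discovered by a dashboard user.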

Validation: making contracts enforceable

A contract that is not enforced is just documentation. The second layer is validation — automatically checking that incoming data actually matches the contract.

In a dbt pipeline, this looks like schema tests on the staging models that ingest raw data:

models:
  - name: stg_order_events
    description: Staged order events from the platform contract v2.1
    columns:
      - name: event_id
        tests:
          - not_null
          - unique
      - name: event_type
        tests:
          - not_null
          - accepted_values:
              values: ['order_placed', 'payment_confirmed',
                       'shipped', 'delivered', 'cancelled']
      - name: occurred_at
        tests:
          - not_null
      - name: customer_id
        tests: []  # nullable per contract

These tests run whenever the pipeline invokes dbt test or dbt build. If an upstream team adds a new event_type value without updating the contract and notifying consumers, the accepted_values test fails. With dbt build, models downstream of the failing test are skipped, and the failed run can trigger an alert through your orchestrator. The failure is immediate and visible, not silent and discovered three days later when a manager asks why the dashboard numbers dropped.
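The same checks can be expressed outside dbt, for example in an ingestion job that runs before data reaches the warehouse. The sketch below is a hypothetical plain-Python validator, not part of dbt or any library. SCHEMA mirrors a subset of the YAML contract by hand (occurred_at is left out for brevity), and validate_record is an illustrative name.

```python
# Hand-copied from the order_events contract; in practice this would be
# loaded from the YAML file rather than duplicated.
SCHEMA = {
    "event_id": {"type": str, "nullable": False},
    "order_id": {"type": str, "nullable": False},
    "event_type": {"type": str, "nullable": False},
    "customer_id": {"type": str, "nullable": True},
}
ALLOWED_EVENT_TYPES = {
    "order_placed", "payment_confirmed", "shipped", "delivered", "cancelled",
}


def validate_record(record: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = []
    for name, rule in SCHEMA.items():
        value = record.get(name)
        if value is None:
            # A missing key and an explicit null are treated the same way.
            if not rule["nullable"]:
                errors.append(f"{name}: missing or null")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
    event_type = record.get("event_type")
    if isinstance(event_type, str) and event_type not in ALLOWED_EVENT_TYPES:
        errors.append(f"event_type: unexpected value {event_type!r}")
    return errors


good = {"event_id": "e1", "order_id": "o1", "event_type": "shipped", "customer_id": None}
bad = {"event_id": "e2", "order_id": None, "event_type": "refunded"}
print(validate_record(good))
print(validate_record(bad))
```

Returning a list of violations rather than raising on the first one is a design choice: it lets a single report show everything that is wrong with a record.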

Handling breaking changes

The most common failure mode we see is a producer team treating a change as "non-breaking" (adding a new nullable field, for example) when consumers have built logic that assumes the field does not exist, such as positional column mappings or strict schema checks. True backwards compatibility is harder than it looks.
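To see how an added nullable field can break a consumer, consider this small Python sketch. Both functions are hypothetical consumer code, and the new field name channel is invented for illustration.

```python
def parse_positional(row: tuple) -> dict:
    """Fragile: assumes the row has exactly three columns, in this order."""
    event_id, order_id, event_type = row  # raises ValueError if a column is added
    return {"event_id": event_id, "order_id": order_id, "event_type": event_type}


def parse_by_name(row: dict) -> dict:
    """Tolerant: picks the fields it needs and ignores anything extra."""
    wanted = ("event_id", "order_id", "event_type")
    return {name: row[name] for name in wanted}


old_row = ("e1", "o1", "shipped")
new_row = ("e1", "o1", "shipped", None)  # producer added a nullable column

parse_positional(old_row)  # works with the old shape
try:
    parse_positional(new_row)
except ValueError as exc:
    print("positional consumer broke:", exc)

named_row = {"event_id": "e1", "order_id": "o1", "event_type": "shipped", "channel": None}
print(parse_by_name(named_row))  # the extra field is ignored
```

The producer's change is identical in both cases. Whether it is breaking depends on how the consumer reads the data, which is why the producer cannot decide this alone.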

Our recommended approach for breaking changes:

Version the contract — bump the major version for any breaking change
Notify consumers with enough lead time — two weeks minimum, four weeks for complex migrations
Run old and new versions in parallel — maintain the old schema alongside the new one during the transition window
Provide a migration guide — document exactly what consumers need to change and by when
Deprecate, then remove — fields should be marked deprecated before they are dropped, never removed without warning
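The first step, deciding whether a change needs a major version bump, can be partly automated in CI. The following Python sketch compares two versions of a schema and suggests a semantic-version bump. The policy it encodes is an assumption for illustration: a removed field, a changed type, or a field becoming nullable counts as breaking, and an added field counts as minor. As noted above, even an added field can break some consumers, so a check like this supports human review and does not replace it.

```python
def required_bump(old: dict, new: dict) -> str:
    """Suggest 'major', 'minor' or 'patch' for a schema change.

    Each schema maps field name -> {"type": str, "nullable": bool}.
    """
    for name, old_field in old.items():
        new_field = new.get(name)
        if new_field is None:
            return "major"  # field removed
        if new_field["type"] != old_field["type"]:
            return "major"  # type changed
        if new_field["nullable"] and not old_field["nullable"]:
            return "major"  # consumers may not handle nulls
    if set(new) - set(old):
        return "minor"  # only additions
    return "patch"


def bump(version: str, level: str) -> str:
    """Apply a bump level to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


current = {
    "event_id": {"type": "string", "nullable": False},
    "customer_id": {"type": "string", "nullable": True},
}
without_customer = {"event_id": {"type": "string", "nullable": False}}
with_channel = {**current, "channel": {"type": "string", "nullable": True}}

print(bump("2.1.0", required_bump(current, without_customer)))  # 3.0.0
print(bump("2.1.0", required_bump(current, with_channel)))      # 2.2.0
```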

Starting small without a platform

You do not need a contract platform or a schema registry to start. Begin with the three or four upstream sources that cause the most downstream pain. Write a YAML contract for each. Add it to the producer team's repository and require a review from a downstream consumer for any changes to it. Add validation tests to your staging models.

That alone, YAML files plus dbt tests, catches the most common category of silent pipeline failures: unannounced changes to schemas and allowed values. Once you have the habit and the process, you can graduate to tooling like Confluent Schema Registry, Great Expectations, or a commercial data observability platform if the scale justifies it.

The principle is simple: make the interface explicit, test it continuously, and require coordination for breaking changes. The tool is secondary to the habit.
