2026-02-19

Data Governance That Actually Works: A Practical Framework

Why most governance fails and how to build one that sticks

Let me start with an uncomfortable truth: most data governance programs fail. Not because the ideas are wrong, but because the implementation is wrong. I've watched organizations spend six figures on governance tools, hire dedicated governance teams, produce hundred-page policy documents — and end up exactly where they started, with nobody following the rules and data quality still a mess.

After 10+ years of building data platforms and implementing governance across companies of different sizes, I've come to a clear conclusion: governance that works is governance that's embedded in the engineering workflow, not bolted on top of it. It has to be automated, incremental, and deliver visible value to the people who actually touch the data every day.

This article lays out a practical 5-pillar framework that I've used successfully in production. No theory for its own sake — just patterns that survive contact with reality.

Why Most Data Governance Fails

Before we build something better, we need to understand why the standard approach keeps failing. I see the same failure modes everywhere:

The Top-Down Trap

A common pattern: a compliance event (GDPR audit, data breach, regulatory inquiry) triggers a panic. Leadership mandates "data governance." A committee is formed. Consultants are hired. A 200-page governance framework document is produced. It gets emailed to engineering. Engineering ignores it.

This fails because it treats governance as a policy problem when it's actually an engineering problem. You can't govern data with documents — you govern data with code, automation, and tooling that makes the right thing the easy thing.

The Tool-First Trap

Another common pattern: someone buys a shiny data catalog tool, spends months configuring it, populates it with metadata, and declares governance "done." Six months later, the catalog is stale, nobody updates it, and it becomes another piece of shelfware. Tools are necessary but not sufficient. A catalog without ownership, process, and enforcement is just an expensive wiki that nobody reads.

The Boil-the-Ocean Trap

Attempting to govern everything at once is a guaranteed way to govern nothing. I've seen teams try to classify every column in every table across a 2,000-table warehouse in a single sprint. They burn out, deliver incomplete work, and the whole initiative loses credibility. Governance has to be incremental and prioritized.

Why governance programs fail vs. what actually works

Common anti-patterns (result: abandoned within 12 months):
  • 200-page policy documents: nobody reads them, nobody follows them
  • Governance committee meetings: all talk, no automated enforcement
  • Tool-first approach: shiny catalog, stale in 3 months
  • Boil-the-ocean classification: tag everything now, maintain nothing later
  • Governance as gatekeeper: engineers route around it, not through it

What actually works (result: adopted, maintained, valued):
  • Governance as code (CI/CD): automated checks in every PR
  • Clear ownership model: every dataset has an accountable team
  • Incremental rollout: start with critical assets, expand outward
  • Quality checks in pipelines: dbt tests, Great Expectations, Soda
  • Governance enables, not blocks: self-serve access with guardrails

The 5-Pillar Framework

After iterating through multiple governance implementations, I've distilled the approach that works into five pillars. Each one is necessary, and they reinforce each other — but you don't have to implement them all at once. In fact, you shouldn't. Start with Pillar 1, get it working, then layer on the rest.

The 5-Pillar Practical Governance Framework
Practical data governance: embedded in the engineering workflow, automated, incremental.

  1. Ownership: every dataset has a team (CODEOWNERS, dbt meta.owner, SLA contracts)
  2. Data Catalog: discoverable and documented (dbt docs, DataHub / Atlan, auto-generated)
  3. Quality Checks: automated validation (dbt tests, Great Expectations, Soda / Elementary)
  4. Lineage: know where data flows (dbt lineage, OpenLineage, column-level)
  5. PII & Access: protect sensitive data (column masking, row-level security, tag-based policies)

Pillar 1: Ownership — The Foundation of Everything

This is the single most important pillar, and it's the one most governance programs skip or get wrong. Without clear ownership, everything else falls apart. Quality checks with no owner become noise. Catalog entries with no maintainer become stale. Lineage with no accountability becomes a pretty graph nobody acts on.

What ownership means in practice

Every table, model, and dataset in your warehouse should have a clearly identified owning team. Not a person — a team. People leave, change roles, go on vacation. Teams persist. Ownership means:

  • Accountable for data quality — when this dataset breaks, this team gets paged.
  • Responsible for documentation — descriptions, business definitions, known caveats.
  • Authority over schema changes — no one changes this model without the owning team's review.
  • SLA commitment — this data is guaranteed fresh by X time, with Y% completeness.

Implementation pattern: dbt meta + CODEOWNERS

The best implementation I've found combines dbt's meta configuration with GitHub's CODEOWNERS file. This makes ownership both machine-readable and enforceable:

# dbt_project.yml - Set default ownership per directory
models:
  my_project:
    staging:
      +meta:
        owner: "data-platform-team"
        sla: "daily by 06:00 UTC"
    marts:
      finance:
        +meta:
          owner: "finance-analytics"
          sla: "daily by 08:00 UTC"
          pii_certified: true
      marketing:
        +meta:
          owner: "marketing-analytics"
          sla: "daily by 09:00 UTC"

# CODEOWNERS - Enforce review by owning team
/models/staging/           @data-platform-team
/models/marts/finance/     @finance-analytics
/models/marts/marketing/   @marketing-analytics

This pattern has a critical advantage: ownership is enforced through code review. If someone on the marketing team tries to modify a finance model, GitHub won't let them merge without finance-analytics approval. No governance committee needed — the Git workflow enforces it automatically.

Ownership CI check

You can go further by adding a CI check that ensures every new model has an owner defined:

# scripts/check_ownership.py
import sys
from pathlib import Path

import yaml


def _find_owner(config, model_name):
    """Return True if the named model declares a non-empty meta.owner in this config."""
    for model in (config or {}).get("models", []):
        if model.get("name") == model_name:
            meta = model.get("meta") or model.get("config", {}).get("meta") or {}
            if meta.get("owner"):
                return True
    return False


def check_model_ownership():
    """Fail CI if any model lacks an owner in its meta config."""
    models_dir = Path("models")
    missing_owners = []

    for sql_file in models_dir.rglob("*.sql"):
        # Ownership may live in a per-model YAML or a shared schema.yml
        candidates = [sql_file.with_suffix(".yml"), sql_file.parent / "schema.yml"]

        has_owner = False
        for check_file in candidates:
            if check_file.exists():
                with open(check_file) as f:
                    config = yaml.safe_load(f)
                if _find_owner(config, sql_file.stem):
                    has_owner = True
                    break

        if not has_owner:
            missing_owners.append(str(sql_file))

    if missing_owners:
        print("Models missing owner in meta config:")
        for m in missing_owners:
            print(f"  - {m}")
        sys.exit(1)

    print(f"All {len(list(models_dir.rglob('*.sql')))} models have owners.")
    sys.exit(0)


if __name__ == "__main__":
    check_model_ownership()

Pillar 2: Data Catalog — Making Data Discoverable

A data catalog answers the question every analyst and engineer asks daily: "Does this data exist, and what does it mean?" The key insight: the best catalog is the one that populates itself.

dbt as your core catalog

If you're using dbt (and in 2026, you should be), you already have the foundation of a data catalog. dbt's schema YAML files capture descriptions, column definitions, tests, and metadata. The trick is making documentation a non-negotiable part of the development process, not an afterthought.

# models/marts/finance/schema.yml
version: 2

models:
  - name: fct_revenue
    description: >
      Daily revenue fact table. Grain: one row per order per day.
      Source: Stripe payment events joined with internal order system.
      Known caveat: refunds appear as negative amounts with a 48hr lag.
    meta:
      owner: "finance-analytics"
      tier: "gold"
      pii: false
      sla: "daily by 08:00 UTC"
    columns:
      - name: order_id
        description: "Unique order identifier from the orders system"
        tests:
          - unique
          - not_null
      - name: revenue_amount
        description: "Net revenue in USD cents. Negative values = refunds."
        tests:
          - not_null
      - name: customer_id
        description: "FK to dim_customers. Nullable for anonymous orders."
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id

Enforcing documentation in CI

Make it impossible to merge undocumented models. A simple dbt test or CI script can enforce this:

# In CI pipeline - require descriptions for all gold-tier models
dbt run-operation check_model_descriptions --args '{tier: gold}'

# Or flag undocumented models with a custom on-run-end macro
# dbt_project.yml
on-run-end:
  - "{{ log_undocumented_models() }}"
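The same rule can run as a plain Python script in CI. A minimal sketch of the check's core logic — the tier-exemption rule is an assumption, and the function is shown against an inline dict rather than a file read (in CI you would feed it the result of yaml.safe_load on each schema file):

```python
# scripts/check_descriptions.py (sketch): operates on a parsed schema.yml dict

def models_missing_descriptions(config, required_tier="gold"):
    """Return names of models in the required tier that lack a description."""
    missing = []
    for model in (config or {}).get("models", []):
        meta = model.get("meta", {})
        if meta.get("tier") == required_tier and not model.get("description"):
            missing.append(model["name"])
    return missing


schema = {
    "models": [
        {"name": "fct_revenue", "description": "Daily revenue fact table.",
         "meta": {"tier": "gold"}},
        {"name": "fct_orders", "meta": {"tier": "gold"}},    # gold but undocumented
        {"name": "stg_events", "meta": {"tier": "bronze"}},  # bronze: exempt here
    ]
}

print(models_missing_descriptions(schema))  # ['fct_orders']
```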

For organizations that need more than dbt docs, tools like DataHub, Atlan, or OpenMetadata can aggregate metadata from dbt, your warehouse, BI tools, and orchestrator into a unified search experience. But start with dbt docs — it's free, version-controlled, and already in your workflow.

Data tiering

Not all data deserves the same level of governance. I use a three-tier system that's simple enough to actually follow:

  • Gold (Certified) — Production-ready, documented, tested, owned, SLA-backed. Powers dashboards and business decisions. Full governance applies.
  • Silver (Curated) — Clean, tested, documented, but may not have an SLA. Used by analytics teams for exploration and ad-hoc work.
  • Bronze (Raw/Staging) — Ingested data with minimal transformation. Documented enough to know what it is, but no quality guarantees beyond schema compliance.

This tiering prevents the boil-the-ocean problem. You apply full governance to Gold, lighter governance to Silver, and minimal governance to Bronze. Start with your most critical 20 tables, get them to Gold, then expand.
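The tiering itself can be enforced in code. A sketch of a CI-side check — the specific per-tier requirements below are assumptions, not a standard:

```python
# Governance requirements per tier: stricter tiers demand more metadata.
TIER_REQUIREMENTS = {
    "gold":   ["owner", "description", "sla", "tests"],
    "silver": ["owner", "description", "tests"],
    "bronze": ["owner"],
}


def missing_requirements(model_meta):
    """List the governance fields a model lacks for its declared tier."""
    tier = model_meta.get("tier", "bronze")  # untagged models default to bronze
    required = TIER_REQUIREMENTS.get(tier, [])
    return [field for field in required if not model_meta.get(field)]


gold_model = {"tier": "gold", "owner": "finance-analytics",
              "description": "Daily revenue facts"}
print(missing_requirements(gold_model))    # ['sla', 'tests'] -> fail CI

bronze_model = {"tier": "bronze", "owner": "data-platform-team"}
print(missing_requirements(bronze_model))  # [] -> pass
```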

Pillar 3: Quality Checks — Trust Through Automation

Data quality is where governance becomes tangible. Every pipeline run should validate that the data meets expectations — not through manual review, but through automated checks that run in your pipeline and block bad data from reaching consumers.

Layered testing strategy

I use a layered approach that catches different types of quality issues at different stages:

# Layer 1: Schema tests (dbt built-in) - Run on every build
# Catch structural issues immediately
columns:
  - name: user_id
    tests:
      - unique
      - not_null
  - name: status
    tests:
      - accepted_values:
          values: ['active', 'churned', 'trial']

# Layer 2: Custom data tests (dbt) - Run on every build
# Catch business logic violations
tests:
  - name: assert_revenue_not_negative
    description: "Revenue should never be negative after refund netting"
    sql: |
      select order_id, net_revenue
      from {{ ref('fct_daily_revenue') }}
      where net_revenue < -10000  -- Allow small negatives from rounding

# Layer 3: Anomaly detection (Elementary / Great Expectations)
# Catch distribution shifts and volume anomalies
  - name: volume_anomaly
    description: "Alert if row count deviates >3 std from 30-day average"
    config:
      severity: warn
      alert_channel: "#data-quality-alerts"

Great Expectations for complex validation

When you need validation that goes beyond what dbt tests can express — cross-table comparisons, statistical tests, or multi-column business rules — Great Expectations is the tool I reach for:

# Great Expectations (1.x fluent API) suite for finance data
import great_expectations as gx

context = gx.get_context()

# Define expectations for the revenue dataset
suite = context.suites.add(gx.ExpectationSuite(name="fct_revenue_suite"))

# Row count should be within expected range
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=10000,
        max_value=500000,
    )
)

# Revenue should match source within tolerance
suite.add_expectation(
    gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
        column_A="gross_revenue",
        column_B="net_revenue",
        or_equal=True,
    )
)

# No duplicate orders within a single day
suite.add_expectation(
    gx.expectations.ExpectCompoundColumnsToBeUnique(
        column_list=["order_id", "revenue_date"],
    )
)

Quality gates that block bad data

Tests are only useful if they have consequences. I implement a two-tier severity system:

  • Error-severity tests block the pipeline. If order_id has nulls, we stop. Bad data does not flow downstream.
  • Warn-severity tests alert but don't block. If row count is 15% lower than yesterday, the team gets a Slack notification to investigate, but the pipeline continues. This prevents alert fatigue while still catching anomalies.

The key is being intentional about which tier each test belongs to. Over time, your error-tier tests become your data contract — the guarantees you make to downstream consumers.
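One way to wire the two tiers into CI is to inspect the test results after the build and exit non-zero only on error-severity failures. A sketch — the result dicts below are simplified stand-ins for dbt's actual run_results.json artifact:

```python
def gate(run_results):
    """Block (return 1) only on error-severity failures; warn-severity just reports."""
    blocking, warnings = [], []
    for result in run_results["results"]:
        if result["status"] == "fail":
            severity = result.get("severity", "error")
            (blocking if severity == "error" else warnings).append(result["unique_id"])
    for w in warnings:
        print(f"WARN (non-blocking): {w}")
    for b in blocking:
        print(f"ERROR (blocking): {b}")
    return 1 if blocking else 0


# Simplified example artifact
results = {"results": [
    {"unique_id": "test.not_null_order_id", "status": "fail", "severity": "error"},
    {"unique_id": "test.row_count_anomaly", "status": "fail", "severity": "warn"},
    {"unique_id": "test.unique_order_id", "status": "pass"},
]}
print(gate(results))  # 1 -> pipeline blocked by the null order_id
```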

Pillar 4: Lineage — Understanding Data Flow

Lineage answers two critical questions: "Where did this data come from?" and "What breaks if I change this?" Without lineage, you're making changes in the dark, and every schema change is a gamble.

dbt lineage as your foundation

If you're using dbt, you get table-level lineage for free through ref() and source() declarations. This is already more lineage than most organizations have. But the real power comes from extending it:

# dbt exposes lineage through the manifest
# Use it to answer: "What depends on this source?"

# Query the dbt manifest to find all downstream models
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

def find_downstream(model_name, manifest):
    """Find all models that depend on a given model."""
    downstream = []
    target_id = f"model.my_project.{model_name}"

    for node_id, node in manifest["nodes"].items():
        if target_id in node.get("depends_on", {}).get("nodes", []):
            downstream.append(node["name"])

    return downstream

# Before changing stg_customers, check the blast radius
affected = find_downstream("stg_customers", manifest)
print(f"Changing stg_customers affects: {affected}")
# Output: ['dim_customers', 'fct_orders', 'fct_revenue', 'ml_churn_features']

Column-level lineage

Table-level lineage tells you which tables depend on which other tables. Column-level lineage tells you which columns flow where — critical for PII tracking and impact analysis. Tools like SQLGlot (open source), dbt Explorer, or catalog tools like Atlan can provide this:

# Using sqlglot to parse column-level lineage from SQL
import sqlglot
from sqlglot.lineage import lineage

# Parse a dbt model's SQL to trace column origins
sql = """
SELECT
    o.order_id,
    c.email,           -- PII! Where does this come from?
    o.amount * fx.rate AS amount_usd
FROM stg_orders o
JOIN stg_customers c ON o.customer_id = c.customer_id
JOIN stg_fx_rates fx ON o.currency = fx.currency
"""

# Trace where the 'email' column originates
node = lineage("email", sql, dialect="snowflake")
for n in node.walk():
    print(n.name)
# Walking the lineage graph traces email back through the join to stg_customers.email

Column-level lineage is especially powerful for PII governance — if you know that stg_customers.email flows into fct_orders.email, dim_customers.email, and ml_features.user_email, you can automatically apply masking policies everywhere it appears.
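Once you have column-level edges, propagating a PII tag downstream is a simple graph traversal. A sketch — the edge data is illustrative, matching the example columns above:

```python
from collections import deque

# column -> columns it flows into (illustrative edges)
EDGES = {
    "stg_customers.email": ["dim_customers.email", "fct_orders.email"],
    "fct_orders.email": ["ml_features.user_email"],
}


def propagate_pii(source):
    """Return every downstream column a tagged source column flows into (BFS)."""
    tagged, queue = set(), deque([source])
    while queue:
        col = queue.popleft()
        for downstream in EDGES.get(col, []):
            if downstream not in tagged:
                tagged.add(downstream)
                queue.append(downstream)
    return sorted(tagged)


print(propagate_pii("stg_customers.email"))
# ['dim_customers.email', 'fct_orders.email', 'ml_features.user_email']
```

Every column in the result is a candidate for the same masking policy as the source.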

OpenLineage for cross-system lineage

When your data flows across systems — from Kafka to Spark to dbt to a BI tool — you need lineage that spans those boundaries. OpenLineage is the open standard here, with integrations for Dagster, Airflow, Spark, and dbt. It captures lineage events as your pipelines run and stores them in a central catalog like Marquez or DataHub.

Pillar 5: PII and Access Control — Protecting Sensitive Data

This is where governance meets compliance, and where getting it wrong has real consequences. The goal is simple to state and hard to implement: sensitive data should be accessible only to those who need it, and automatically protected everywhere it appears.

Tag-based access policies

The pattern that scales best is tag-based access control. Instead of manually granting permissions table by table, you tag columns with their sensitivity level and let the warehouse enforce policies based on tags:

-- Snowflake: tag-based masking policies
-- Create a PII tag
CREATE TAG IF NOT EXISTS pii_type
  ALLOWED_VALUES 'email', 'phone', 'ssn', 'address', 'name';

-- Create a masking policy
CREATE MASKING POLICY mask_pii_email AS (val STRING)
  RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN', 'FINANCE_ADMIN')
      THEN val
    ELSE '***@' || SPLIT_PART(val, '@', 2)
  END;

-- Associate the policy with the tag. One policy per data type applies to every
-- tagged column; branch on SYSTEM$GET_TAG_ON_CURRENT_COLUMN('pii_type') inside
-- the policy body if different tag values need different masking.
ALTER TAG pii_type SET MASKING POLICY mask_pii_email;

-- Now tag columns — the policy follows automatically
ALTER TABLE dim_customers MODIFY COLUMN email
  SET TAG pii_type = 'email';
ALTER TABLE fct_orders MODIFY COLUMN customer_email
  SET TAG pii_type = 'email';

-- Any new table with this tag automatically gets masked

dbt-driven PII tagging

You can combine this with dbt to make PII tagging part of your development workflow:

# models/marts/schema.yml
models:
  - name: dim_customers
    columns:
      - name: email
        description: "Customer email address"
        meta:
          pii: "email"
          masking_policy: "mask_pii_email"
      - name: phone
        description: "Customer phone number"
        meta:
          pii: "phone"
          masking_policy: "mask_pii_phone"
      - name: customer_segment
        description: "Business segment classification"
        meta:
          pii: false
# dbt post-hook to apply tags automatically
# macros/apply_pii_tags.sql
{% macro apply_pii_tags() %}
  {# model.columns is a dict of column configs keyed by column name #}
  {% for name, col in model.columns.items() if col.meta.get('pii') %}
    ALTER TABLE {{ this }} MODIFY COLUMN {{ name }}
      SET TAG pii_type = '{{ col.meta.pii }}';
  {% endfor %}
{% endmacro %}

This closes the loop: developers declare PII in the schema YAML, dbt applies the tags on build, and the warehouse enforces masking automatically. PII protection follows the data wherever it goes, without manual intervention.
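The statements the macro emits are easy to preview outside dbt. A plain-Python sketch that generates the same ALTER statements from the schema metadata above (the function name is ours, not a dbt API):

```python
def pii_tag_statements(table, columns):
    """Generate Snowflake ALTER statements for every column with a PII meta tag."""
    statements = []
    for col in columns:
        pii = col.get("meta", {}).get("pii")
        if pii:  # columns explicitly marked pii: false are falsy and skipped
            statements.append(
                f"ALTER TABLE {table} MODIFY COLUMN {col['name']} "
                f"SET TAG pii_type = '{pii}';"
            )
    return statements


columns = [
    {"name": "email", "meta": {"pii": "email"}},
    {"name": "phone", "meta": {"pii": "phone"}},
    {"name": "customer_segment", "meta": {"pii": False}},
]
for stmt in pii_tag_statements("dim_customers", columns):
    print(stmt)
# ALTER TABLE dim_customers MODIFY COLUMN email SET TAG pii_type = 'email';
# ALTER TABLE dim_customers MODIFY COLUMN phone SET TAG pii_type = 'phone';
```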

Row-level security for multi-tenant data

For organizations where different teams should see different slices of data, row-level security (RLS) policies provide fine-grained access:

-- Snowflake: Regional data access
CREATE ROW ACCESS POLICY region_access AS (region_code VARCHAR)
  RETURNS BOOLEAN ->
  CASE
    WHEN CURRENT_ROLE() = 'GLOBAL_ADMIN' THEN TRUE
    WHEN CURRENT_ROLE() = 'EU_ANALYTICS' AND region_code = 'EU' THEN TRUE
    WHEN CURRENT_ROLE() = 'US_ANALYTICS' AND region_code = 'US' THEN TRUE
    ELSE FALSE
  END;

ALTER TABLE fct_revenue ADD ROW ACCESS POLICY region_access
  ON (region_code);

Getting Buy-In: Engineers AND Business Stakeholders

Even the best technical framework fails without adoption. Getting buy-in requires different strategies for different audiences.

For engineers: make it invisible

Engineers resist governance when it adds friction to their workflow. The answer is to embed governance into tools they already use:

  • CI/CD checks — governance rules run automatically on every PR. No extra steps.
  • Pre-commit hooks — catch undocumented models and missing ownership before the PR is even created.
  • dbt integration — tests, documentation, and metadata live in the same files as the models. Governance is just part of writing a model.
  • Templates and scaffolding — provide cookiecutter templates or a CLI that generates new models with ownership, tests, and documentation pre-populated. Make the governed path the path of least resistance.

# Example: pre-commit hook for governance checks
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: check-model-ownership
        name: Check model ownership
        entry: python scripts/check_ownership.py
language: system
        files: '\.sql$'
        pass_filenames: false

      - id: check-model-description
        name: Check model description
        entry: python scripts/check_descriptions.py
language: system
        files: '\.yml$'
        pass_filenames: false

For business stakeholders: show the value

Business teams don't care about dbt tests or CODEOWNERS files. They care about trusting their dashboards and making decisions confidently. Frame governance in their language:

  • "This dashboard has a data quality SLA" — instead of "we run dbt tests," tell them "this revenue number is validated every morning by 8 AM, and you'll be notified if anything looks wrong."
  • "You can find any dataset yourself" — instead of "we have a data catalog," tell them "you can search for any data asset, see who owns it, what it means, and how fresh it is, without asking an engineer."
  • "Your PII is protected automatically" — instead of "we have masking policies," tell them "customer email addresses are automatically hidden from anyone who doesn't need to see them."
  • Data quality scorecards — create a simple dashboard showing quality metrics over time. Business stakeholders love seeing that "98.5% of data quality checks passed this week" trending upward.
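The scorecard headline number itself is trivial to compute from test results. A sketch with illustrative counts:

```python
def quality_score(results):
    """Percent of quality checks that passed, for a stakeholder-facing scorecard."""
    passed = sum(1 for r in results if r == "pass")
    return round(100 * passed / len(results), 1)


# Illustrative week: 197 of 200 checks passed
week_results = ["pass"] * 197 + ["fail"] * 3
print(f"{quality_score(week_results)}% of data quality checks passed this week")
```

Tracked weekly, this is the single number business stakeholders will actually remember.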

The executive pitch

For leadership, frame governance as risk reduction and velocity improvement, not compliance overhead:

  • Risk — "Without governance, a GDPR violation costs up to 4% of global revenue. Our framework automatically tracks and protects PII across all pipelines."
  • Velocity — "Engineers currently spend ~20% of their time figuring out what data exists and whether they can trust it. A governed catalog and quality framework cuts that to near zero."
  • Trust — "Last quarter we had 3 incidents where dashboards showed wrong numbers. With automated quality gates, bad data never reaches dashboards."

Pitfalls and Anti-Patterns

Even with a solid framework, there are traps to avoid. Here are the ones I've seen most often:

1. Governance as a separate team

Creating a dedicated "data governance team" that exists apart from engineering is almost always a mistake. Governance works when it's owned by the people who build and maintain the data. A central governance team becomes a bottleneck, a bureaucracy, and eventually irrelevant because they don't have context on the actual data.

Better pattern: A small governance enablement team (1-2 people) that builds tooling, templates, and CI checks — then distributes governance responsibility to domain teams through ownership. Think of them as the team that builds the guardrails, not the team that stands at the gate.

2. Over-classifying data

I've seen teams create 15 different PII classification levels with a 30-page taxonomy document. Nobody can remember them all, so nobody applies them correctly. Three levels is enough for most organizations: Public, Internal, Restricted (PII/Sensitive). If you truly need more granularity, add sub-categories within those three tiers — but the top level should be instantly understandable.

3. Testing everything with equal rigor

Not every table needs 50 tests. Apply the data tiering from Pillar 2: Gold tables get comprehensive tests and strict quality gates. Silver tables get basic schema tests and anomaly detection. Bronze tables get not_null on primary keys and unique checks, and that's it. Over-testing creates alert fatigue, which is worse than no testing because people start ignoring all alerts.

4. Forgetting about the catalog's freshness

A stale catalog is worse than no catalog — it actively misleads people. If your catalog isn't auto-populated from dbt, your orchestrator, and your warehouse metadata, it will go stale. Never rely on humans to manually update catalog entries. Automate or accept staleness.

5. Making governance a one-time project

Governance isn't a project with an end date. It's an ongoing practice, like testing or code review. Budget for maintenance: new models need to be classified, ownership needs to be updated when teams reorg, and quality checks need to be tuned as data evolves. Treat governance work as part of sprint planning, not a separate initiative.

Realistic implementation timeline: governance that sticks

  • Month 1 (Phase 1): ownership + CODEOWNERS
  • Month 2 (Phase 2): quality checks for Gold tables
  • Month 3 (Phase 3): catalog + CI documentation enforcement
  • Months 4-5 (Phase 4): PII tagging, lineage, masking
  • Month 6+ (ongoing): expand coverage, tune, maintain

Start small, prove value, expand incrementally.

Putting It All Together

Here's what a well-governed data pipeline looks like when all five pillars are in place:

  1. A developer creates a new dbt model. They add it with an owner, description, column definitions, and tests — using a scaffold template that pre-populates these fields.
  2. They open a PR. CI automatically runs: ownership check (pass), documentation check (pass), dbt tests (pass). CODEOWNERS requires the owning team's approval.
  3. The model is deployed. dbt builds the table, applies PII tags via post-hooks, and the catalog is updated automatically from the manifest.
  4. The pipeline runs daily. Quality checks validate the output. If error-severity tests fail, the pipeline stops and the owning team is alerted. Warn-severity anomalies go to a monitoring channel.
  5. A business analyst discovers the table in the data catalog, sees its documentation, owner, freshness, and quality score. They query it confidently, knowing PII columns are automatically masked for their role.
  6. Three months later, a schema change is needed. The engineer checks lineage to understand the blast radius, coordinates with downstream owners, and makes the change with confidence.

This is governance that works — not because it's comprehensive, but because it's embedded. It doesn't ask people to change their workflow; it makes their existing workflow produce governed outputs.

Conclusion

Data governance has a reputation problem. Too many organizations associate it with bureaucracy, committees, and policy documents that nobody reads. But governance done right is the opposite: it's automation, tooling, and engineering practices that make your data platform more trustworthy, more discoverable, and more secure — without slowing anyone down.

The 5-pillar framework I've outlined here — Ownership, Catalog, Quality, Lineage, and PII/Access — isn't theoretical. It's the distillation of what I've seen actually work across multiple organizations and data platforms. The key principles are consistent:

  • Automate everything. If a governance rule isn't enforced by code, it's a suggestion, not a rule.
  • Start small and prove value. Govern your top 20 Gold tables first. Expand only after you've demonstrated value.
  • Embed in the workflow. Governance that lives in CI, dbt config, and pre-commit hooks gets adopted. Governance that lives in wiki pages gets ignored.
  • Make the governed path the easy path. Templates, scaffolding, and automation should make doing the right thing easier than doing the wrong thing.
  • Measure and communicate. Track quality scores, catalog coverage, and incident rates. Share them with stakeholders who care about trust in data.

If you're building a data platform and governance feels like a burden, you're doing it wrong. When done right, governance is what makes your platform actually usable — the difference between a data warehouse people trust and a data swamp people fear. Start with ownership, add quality checks, build from there. Your future self will thank you.