Let me start with an uncomfortable truth: most data governance programs fail. Not because the ideas are wrong, but because the implementation is wrong. I've watched organizations spend six figures on governance tools, hire dedicated governance teams, produce hundred-page policy documents — and end up exactly where they started, with nobody following the rules and data quality still a mess.
After 10+ years of building data platforms and implementing governance across companies of different sizes, I've come to a clear conclusion: governance that works is governance that's embedded in the engineering workflow, not bolted on top of it. It has to be automated, incremental, and deliver visible value to the people who actually touch the data every day.
This article lays out a practical 5-pillar framework that I've used successfully in production. No theory for its own sake — just patterns that survive contact with reality.
Why Most Data Governance Fails
Before we build something better, we need to understand why the standard approach keeps failing. I see the same failure modes everywhere:
The Top-Down Trap
A common pattern: a compliance event (GDPR audit, data breach, regulatory inquiry) triggers a panic. Leadership mandates "data governance." A committee is formed. Consultants are hired. A 200-page governance framework document is produced. It gets emailed to engineering. Engineering ignores it.
This fails because it treats governance as a policy problem when it's actually an engineering problem. You can't govern data with documents — you govern data with code, automation, and tooling that makes the right thing the easy thing.
The Tool-First Trap
Another common pattern: someone buys a shiny data catalog tool, spends months configuring it, populates it with metadata, and declares governance "done." Six months later, the catalog is stale, nobody updates it, and it becomes another piece of shelfware. Tools are necessary but not sufficient. A catalog without ownership, process, and enforcement is just an expensive wiki that nobody reads.
The Boil-the-Ocean Trap
Attempting to govern everything at once is a guaranteed way to govern nothing. I've seen teams try to classify every column in every table across a 2,000-table warehouse in a single sprint. They burn out, deliver incomplete work, and the whole initiative loses credibility. Governance has to be incremental and prioritized.
The 5-Pillar Framework
After iterating through multiple governance implementations, I've distilled the approach that works into five pillars. Each one is necessary, and they reinforce each other — but you don't have to implement them all at once. In fact, you shouldn't. Start with Pillar 1, get it working, then layer on the rest.
Pillar 1: Ownership — The Foundation of Everything
This is the single most important pillar, and it's the one most governance programs skip or get wrong. Without clear ownership, everything else falls apart. Quality checks with no owner become noise. Catalog entries with no maintainer become stale. Lineage with no accountability becomes a pretty graph nobody acts on.
What ownership means in practice
Every table, model, and dataset in your warehouse should have a clearly identified owning team. Not a person — a team. People leave, change roles, go on vacation. Teams persist. Ownership means:
- Accountable for data quality — when this dataset breaks, this team gets paged.
- Responsible for documentation — descriptions, business definitions, known caveats.
- Authority over schema changes — no one changes this model without the owning team's review.
- SLA commitment — this data is guaranteed fresh by X time, with Y% completeness.
Implementation pattern: dbt meta + CODEOWNERS
The best implementation I've found combines dbt's meta configuration with GitHub's CODEOWNERS file. This makes ownership both machine-readable and enforceable:
```yaml
# dbt_project.yml — set default ownership per directory
models:
  my_project:
    staging:
      +meta:
        owner: "data-platform-team"
        sla: "daily by 06:00 UTC"
    marts:
      finance:
        +meta:
          owner: "finance-analytics"
          sla: "daily by 08:00 UTC"
          pii_certified: true
      marketing:
        +meta:
          owner: "marketing-analytics"
          sla: "daily by 09:00 UTC"
```
```text
# CODEOWNERS — enforce review by the owning team
/models/staging/        @data-platform-team
/models/marts/finance/  @finance-analytics
/models/marts/marketing/ @marketing-analytics
```
This pattern has a critical advantage: ownership is enforced through code review. If someone on the marketing team tries to modify a finance model, GitHub won't let them merge without finance-analytics approval. No governance committee needed — the Git workflow enforces it automatically.
Ownership CI check
You can go further by adding a CI check that ensures every new model has an owner defined:
```python
# scripts/check_ownership.py
import sys
from pathlib import Path

import yaml


def _find_owner(config, model_name):
    """Return the owner declared in a schema YAML's meta for a model, if any."""
    for model in (config or {}).get("models", []):
        if model.get("name") == model_name:
            return model.get("meta", {}).get("owner")
    return None


def check_model_ownership():
    """Fail CI if any model lacks an owner in its meta config."""
    models_dir = Path("models")
    missing_owners = []
    for sql_file in models_dir.rglob("*.sql"):
        yml_file = sql_file.with_suffix(".yml")
        schema_file = sql_file.parent / "schema.yml"
        has_owner = False
        for check_file in (yml_file, schema_file):
            if check_file.exists():
                with open(check_file) as f:
                    config = yaml.safe_load(f)
                # Check if an owner is defined in the model's meta
                if _find_owner(config, sql_file.stem):
                    has_owner = True
                    break
        if not has_owner:
            missing_owners.append(str(sql_file))
    if missing_owners:
        print("Models missing owner in meta config:")
        for m in missing_owners:
            print(f"  - {m}")
        sys.exit(1)
    print(f"All {len(list(models_dir.rglob('*.sql')))} models have owners.")
    sys.exit(0)


if __name__ == "__main__":
    check_model_ownership()
```
Pillar 2: Data Catalog — Making Data Discoverable
A data catalog answers the question every analyst and engineer asks daily: "Does this data exist, and what does it mean?" But the key insight is: the best catalog is the one that populates itself.
dbt as your core catalog
If you're using dbt (and in 2026, you should be), you already have the foundation of a data catalog. dbt's schema YAML files capture descriptions, column definitions, tests, and metadata. The trick is making documentation a non-negotiable part of the development process, not an afterthought.
```yaml
# models/marts/finance/schema.yml
version: 2

models:
  - name: fct_revenue
    description: >
      Daily revenue fact table. Grain: one row per order per day.
      Source: Stripe payment events joined with internal order system.
      Known caveat: refunds appear as negative amounts with a 48hr lag.
    meta:
      owner: "finance-analytics"
      tier: "gold"
      pii: false
      sla: "daily by 08:00 UTC"
    columns:
      - name: order_id
        description: "Unique order identifier from the orders system"
        tests:
          - unique
          - not_null
      - name: revenue_amount
        description: "Net revenue in USD cents. Negative values = refunds."
        tests:
          - not_null
      - name: customer_id
        description: "FK to dim_customers. Nullable for anonymous orders."
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```
Enforcing documentation in CI
Make it impossible to merge undocumented models. A simple dbt test or CI script can enforce this:
```bash
# In CI pipeline — require descriptions for all gold-tier models
# (check_model_descriptions is a custom macro in your project)
dbt run-operation check_model_descriptions --args '{tier: gold}'
```

Or log undocumented models after every run:

```yaml
# dbt_project.yml
# (log_undocumented_models is likewise a custom macro, not a dbt built-in)
on-run-end:
  - "{{ log_undocumented_models() }}"
```
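For illustration, here is a minimal sketch of what such a description check could look like in plain Python. The function name and data shapes are hypothetical — a real check would first parse your schema YAML files into this structure:

```python
# Sketch: fail CI when a gold-tier model lacks a description.
# Model metadata is a plain list of dicts here; a real check would
# parse your schema.yml files first.
def undocumented_gold_models(models: list[dict]) -> list[str]:
    """Names of gold-tier models with empty or missing descriptions."""
    return sorted(
        m["name"] for m in models
        if m.get("meta", {}).get("tier") == "gold" and not m.get("description")
    )

models = [
    {"name": "fct_revenue", "description": "Daily revenue fact", "meta": {"tier": "gold"}},
    {"name": "fct_orders", "description": "", "meta": {"tier": "gold"}},
    {"name": "stg_events", "description": "", "meta": {"tier": "bronze"}},
]
print(undocumented_gold_models(models))  # ['fct_orders']
```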
For organizations that need more than dbt docs, tools like DataHub, Atlan, or OpenMetadata can aggregate metadata from dbt, your warehouse, BI tools, and orchestrator into a unified search experience. But start with dbt docs — it's free, version-controlled, and already in your workflow.
Data tiering
Not all data deserves the same level of governance. I use a three-tier system that's simple enough to actually follow:
- Gold (Certified) — Production-ready, documented, tested, owned, SLA-backed. Powers dashboards and business decisions. Full governance applies.
- Silver (Curated) — Clean, tested, documented, but may not have an SLA. Used by analytics teams for exploration and ad-hoc work.
- Bronze (Raw/Staging) — Ingested data with minimal transformation. Documented enough to know what it is, but no quality guarantees beyond schema compliance.
This tiering prevents the boil-the-ocean problem. You apply full governance to Gold, lighter governance to Silver, and minimal governance to Bronze. Start with your most critical 20 tables, get them to Gold, then expand.
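One way to make the tiering operational is a per-tier requirements check in CI. A minimal sketch, assuming models expose a `tier` field in their meta config (all names here are illustrative):

```python
# Sketch: per-tier governance requirements as a CI policy check.
# Field names mirror the meta config used elsewhere in this article.
TIER_REQUIREMENTS = {
    "gold": {"owner", "description", "tests", "sla"},
    "silver": {"owner", "description", "tests"},
    "bronze": {"owner"},
}

def missing_requirements(model: dict) -> set[str]:
    """Return the governance fields a model is missing for its tier."""
    meta = model.get("meta", {})
    tier = meta.get("tier", "bronze")
    present = {
        field for field in ("owner", "description", "tests", "sla")
        if model.get(field) or meta.get(field)
    }
    return TIER_REQUIREMENTS.get(tier, set()) - present

# Example: a gold model that has everything except an SLA
model = {
    "name": "fct_revenue",
    "description": "Daily revenue fact",
    "tests": ["unique", "not_null"],
    "meta": {"tier": "gold", "owner": "finance-analytics"},
}
print(missing_requirements(model))  # {'sla'}
```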
Pillar 3: Quality Checks — Trust Through Automation
Data quality is where governance becomes tangible. Every pipeline run should validate that the data meets expectations — not through manual review, but through automated checks that run in your pipeline and block bad data from reaching consumers.
Layered testing strategy
I use a layered approach that catches different types of quality issues at different stages:
```yaml
# Layer 1: Schema tests (dbt built-in) — run on every build.
# Catch structural issues immediately.
columns:
  - name: user_id
    tests:
      - unique
      - not_null
  - name: status
    tests:
      - accepted_values:
          values: ['active', 'churned', 'trial']
```

```sql
-- Layer 2: Singular data test (tests/assert_revenue_not_negative.sql)
-- — run on every build. Catch business logic violations: revenue should
-- never be negative after refund netting. The test fails if rows return.
select order_id, net_revenue
from {{ ref('fct_daily_revenue') }}
where net_revenue < -10000  -- allow small negatives from rounding
```

```yaml
# Layer 3: Anomaly detection (Elementary / Great Expectations) —
# catch distribution shifts and volume anomalies. Config is illustrative.
- name: volume_anomaly
  description: "Alert if row count deviates >3 std from 30-day average"
  config:
    severity: warn
    alert_channel: "#data-quality-alerts"
```
Great Expectations for complex validation
When you need validation that goes beyond what dbt tests can express — cross-table comparisons, statistical tests, or multi-column business rules — Great Expectations is the tool I reach for:
```python
# Great Expectations suite for finance data (GX 1.x-style API)
import great_expectations as gx

context = gx.get_context()

# Define expectations for the revenue dataset
suite = context.suites.add(gx.ExpectationSuite(name="fct_revenue_suite"))

# Row count should be within the expected range
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=10000,
        max_value=500000,
    )
)

# Gross revenue should never be less than net revenue
suite.add_expectation(
    gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
        column_A="gross_revenue",
        column_B="net_revenue",
        or_equal=True,
    )
)

# No duplicate orders within a single day
suite.add_expectation(
    gx.expectations.ExpectCompoundColumnsToBeUnique(
        column_list=["order_id", "revenue_date"],
    )
)
```
Quality gates that block bad data
Tests are only useful if they have consequences. I implement a two-tier severity system:
- Error-severity tests block the pipeline. If order_id has nulls, we stop. Bad data does not flow downstream.
- Warn-severity tests alert but don't block. If row count is 15% lower than yesterday, the team gets a Slack notification to investigate, but the pipeline continues. This prevents alert fatigue while still catching anomalies.
The key is being intentional about which tier each test belongs to. Over time, your error-tier tests become your data contract — the guarantees you make to downstream consumers.
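The two-tier gate itself is simple to implement. A sketch of the routing logic, with illustrative names (the notification hand-off is left as a placeholder):

```python
# Sketch: a two-tier quality gate — error failures block the pipeline,
# warn failures only notify. Names are illustrative.
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool
    severity: str  # "error" or "warn"

def evaluate_gate(results: list[TestResult]) -> tuple[bool, list[str]]:
    """Return (pipeline_may_proceed, warn_failures_to_notify)."""
    errors = [r.name for r in results if not r.passed and r.severity == "error"]
    warns = [r.name for r in results if not r.passed and r.severity == "warn"]
    return (not errors, warns)

results = [
    TestResult("order_id_not_null", passed=True, severity="error"),
    TestResult("row_count_vs_yesterday", passed=False, severity="warn"),
]
proceed, to_notify = evaluate_gate(results)
print(proceed, to_notify)  # True ['row_count_vs_yesterday']
```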
Pillar 4: Lineage — Understanding Data Flow
Lineage answers two critical questions: "Where did this data come from?" and "What breaks if I change this?" Without lineage, you're making changes in the dark, and every schema change is a gamble.
dbt lineage as your foundation
If you're using dbt, you get table-level lineage for free through ref() and source() declarations. This is already more lineage than most organizations have. But the real power comes from extending it:
```python
# dbt exposes lineage through the manifest.
# Use it to answer: "What depends on this source?"
import json

# Query the dbt manifest to find all downstream models
with open("target/manifest.json") as f:
    manifest = json.load(f)

def find_downstream(model_name, manifest):
    """Find all models that directly depend on a given model."""
    downstream = []
    target_id = f"model.my_project.{model_name}"
    for node_id, node in manifest["nodes"].items():
        if target_id in node.get("depends_on", {}).get("nodes", []):
            downstream.append(node["name"])
    return downstream

# Before changing stg_customers, check the blast radius
affected = find_downstream("stg_customers", manifest)
print(f"Changing stg_customers affects: {affected}")
# Output: ['dim_customers', 'fct_orders', 'fct_revenue', 'ml_churn_features']
```
Column-level lineage
Table-level lineage tells you which tables depend on which other tables. Column-level lineage tells you which columns flow where — critical for PII tracking and impact analysis. Tools like SQLGlot (open source), dbt Explorer, or catalog tools like Atlan can provide this:
```python
# Using sqlglot to parse column-level lineage from SQL
from sqlglot.lineage import lineage

# Parse a dbt model's SQL to trace column origins
sql = """
SELECT
    o.order_id,
    c.email,  -- PII! Where does this come from?
    o.amount * fx.rate AS amount_usd
FROM stg_orders o
JOIN stg_customers c ON o.customer_id = c.customer_id
JOIN stg_fx_rates fx ON o.currency = fx.currency
"""

# Trace where the 'email' column originates. Walking the returned
# lineage node leads back to its source: stg_customers.email
node = lineage("email", sql, dialect="snowflake")
```
Column-level lineage is especially powerful for PII governance — if you know that stg_customers.email flows into fct_orders.email, dim_customers.email, and ml_features.user_email, you can automatically apply masking policies everywhere it appears.
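Mechanically, that propagation is just a graph traversal over column-level lineage edges. A minimal sketch with hand-written edge data — in practice the edges would come from sqlglot parsing or your catalog's lineage API:

```python
# Sketch: propagate a masking requirement along column-level lineage.
# The edge data is hand-written for illustration.
from collections import deque

LINEAGE = {  # source column -> columns it flows into
    "stg_customers.email": ["dim_customers.email", "fct_orders.email"],
    "fct_orders.email": ["ml_features.user_email"],
}

def columns_to_mask(pii_column: str) -> set[str]:
    """Breadth-first search over the lineage graph from a PII source."""
    seen: set[str] = set()
    queue = deque([pii_column])
    while queue:
        for downstream in LINEAGE.get(queue.popleft(), []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen

print(sorted(columns_to_mask("stg_customers.email")))
# ['dim_customers.email', 'fct_orders.email', 'ml_features.user_email']
```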
OpenLineage for cross-system lineage
When your data flows across systems — from Kafka to Spark to dbt to a BI tool — you need lineage that spans those boundaries. OpenLineage is the open standard here, with integrations for Dagster, Airflow, Spark, and dbt. It captures lineage events as your pipelines run and stores them in a central catalog like Marquez or DataHub.
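For a feel of what gets captured, here is a sketch of the shape of an OpenLineage run event, built as a plain dict. Field names follow the OpenLineage spec; all values are illustrative, and in practice a client integration emits these for you as pipelines run:

```python
# Sketch: the shape of an OpenLineage run event, as a plain dict.
# Field names follow the OpenLineage spec; all values are illustrative.
import json
import uuid
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "dbt", "name": "my_project.fct_revenue"},
    "inputs": [{"namespace": "snowflake", "name": "analytics.stg_orders"}],
    "outputs": [{"namespace": "snowflake", "name": "analytics.fct_revenue"}],
    "producer": "https://example.com/my-pipeline",  # identifies the emitter
}
print(json.dumps(event, indent=2))
```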
Pillar 5: PII and Access Control — Protecting Sensitive Data
This is where governance meets compliance, and where getting it wrong has real consequences. The goal is simple to state and hard to implement: sensitive data should be accessible only to those who need it, and automatically protected everywhere it appears.
Tag-based access policies
The pattern that scales best is tag-based access control. Instead of manually granting permissions table by table, you tag columns with their sensitivity level and let the warehouse enforce policies based on tags:
```sql
-- Snowflake: tag-based masking policies

-- Create a PII tag
CREATE TAG IF NOT EXISTS pii_type
  ALLOWED_VALUES 'email', 'phone', 'ssn', 'address', 'name';

-- Create a masking policy. Snowflake allows one policy per data type
-- per tag, so the policy branches on the tag value it finds.
CREATE MASKING POLICY mask_pii_email AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN', 'FINANCE_ADMIN')
      THEN val
    WHEN SYSTEM$GET_TAG_ON_CURRENT_COLUMN('pii_type') = 'email'
      THEN '***@' || SPLIT_PART(val, '@', 2)
    ELSE '***MASKED***'
  END;

-- Associate the policy with the tag — it now applies to every
-- STRING column carrying the pii_type tag
ALTER TAG pii_type SET MASKING POLICY mask_pii_email;

-- Now tag columns — the policy follows automatically
ALTER TABLE dim_customers MODIFY COLUMN email
  SET TAG pii_type = 'email';

ALTER TABLE fct_orders MODIFY COLUMN customer_email
  SET TAG pii_type = 'email';

-- Any new column with this tag automatically gets masked
```
dbt-driven PII tagging
You can combine this with dbt to make PII tagging part of your development workflow:
```yaml
# models/marts/schema.yml
models:
  - name: dim_customers
    columns:
      - name: email
        description: "Customer email address"
        meta:
          pii: "email"
          masking_policy: "mask_pii_email"
      - name: phone
        description: "Customer phone number"
        meta:
          pii: "phone"
          masking_policy: "mask_pii_phone"
      - name: customer_segment
        description: "Business segment classification"
        meta:
          pii: false
```
```sql
-- macros/apply_pii_tags.sql
-- Run as a post-hook: reads PII declarations from the model's schema
-- YAML and applies the matching Snowflake tag to each flagged column.
{% macro apply_pii_tags() %}
  {% if execute %}
    {% for name, col in model.columns.items() if col.meta.get('pii') %}
      {% do run_query(
        "ALTER TABLE " ~ this ~ " MODIFY COLUMN " ~ name ~
        " SET TAG pii_type = '" ~ col.meta.pii ~ "'"
      ) %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```
This closes the loop: developers declare PII in the schema YAML, dbt applies the tags on build, and the warehouse enforces masking automatically. PII protection follows the data wherever it goes, without manual intervention.
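Outside of dbt, the same loop can be sketched in a few lines of Python that turn PII declarations into tag statements. The parsed schema is inlined as a dict here for brevity; the tag name matches the pii_type tag from the masking-policy example:

```python
# Sketch: generate Snowflake tag statements from parsed schema metadata.
# The schema dict mirrors the YAML above, inlined for brevity.
SCHEMA = {
    "models": [
        {
            "name": "dim_customers",
            "columns": [
                {"name": "email", "meta": {"pii": "email"}},
                {"name": "customer_segment", "meta": {"pii": False}},
            ],
        }
    ]
}

def tag_statements(schema: dict) -> list[str]:
    """One ALTER TABLE ... SET TAG statement per PII-flagged column."""
    stmts = []
    for model in schema.get("models", []):
        for col in model.get("columns", []):
            pii = (col.get("meta") or {}).get("pii")
            if pii:  # skips pii: false and untagged columns
                stmts.append(
                    f"ALTER TABLE {model['name']} MODIFY COLUMN {col['name']} "
                    f"SET TAG pii_type = '{pii}';"
                )
    return stmts

for stmt in tag_statements(SCHEMA):
    print(stmt)
```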
Row-level security for multi-tenant data
For organizations where different teams should see different slices of data, row-level security (RLS) policies provide fine-grained access:
```sql
-- Snowflake: regional data access
CREATE ROW ACCESS POLICY region_access AS (region_code VARCHAR)
RETURNS BOOLEAN ->
  CASE
    WHEN CURRENT_ROLE() = 'GLOBAL_ADMIN' THEN TRUE
    WHEN CURRENT_ROLE() = 'EU_ANALYTICS' AND region_code = 'EU' THEN TRUE
    WHEN CURRENT_ROLE() = 'US_ANALYTICS' AND region_code = 'US' THEN TRUE
    ELSE FALSE
  END;

ALTER TABLE fct_revenue ADD ROW ACCESS POLICY region_access
  ON (region_code);
```
Getting Buy-In: Engineers AND Business Stakeholders
Even the best technical framework fails without adoption. Getting buy-in requires different strategies for different audiences.
For engineers: make it invisible
Engineers resist governance when it adds friction to their workflow. The answer is to embed governance into tools they already use:
- CI/CD checks — governance rules run automatically on every PR. No extra steps.
- Pre-commit hooks — catch undocumented models and missing ownership before the PR is even created.
- dbt integration — tests, documentation, and metadata live in the same files as the models. Governance is just part of writing a model.
- Templates and scaffolding — provide cookiecutter templates or a CLI that generates new models with ownership, tests, and documentation pre-populated. Make the governed path the path of least resistance.
```yaml
# .pre-commit-config.yaml — example pre-commit hooks for governance checks
repos:
  - repo: local
    hooks:
      - id: check-model-ownership
        name: Check model ownership
        entry: python scripts/check_ownership.py
        language: python
        additional_dependencies: ["pyyaml"]
        files: '\.sql$'
        pass_filenames: false
      - id: check-model-description
        name: Check model description
        entry: python scripts/check_descriptions.py
        language: python
        additional_dependencies: ["pyyaml"]
        files: '\.yml$'
        pass_filenames: false
```
For business stakeholders: show the value
Business teams don't care about dbt tests or CODEOWNERS files. They care about trusting their dashboards and making decisions confidently. Frame governance in their language:
- "This dashboard has a data quality SLA" — instead of "we run dbt tests," tell them "this revenue number is validated every morning by 8 AM, and you'll be notified if anything looks wrong."
- "You can find any dataset yourself" — instead of "we have a data catalog," tell them "you can search for any data asset, see who owns it, what it means, and how fresh it is, without asking an engineer."
- "Your PII is protected automatically" — instead of "we have masking policies," tell them "customer email addresses are automatically hidden from anyone who doesn't need to see them."
- Data quality scorecards — create a simple dashboard showing quality metrics over time. Business stakeholders love seeing that "98.5% of data quality checks passed this week" trending upward.
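The scorecard metric itself is trivial to compute, which is part of its appeal. A sketch:

```python
# Sketch: the weekly pass-rate number behind a quality scorecard.
def pass_rate(results: list[bool]) -> float:
    """Percentage of checks that passed, rounded to one decimal."""
    return round(100 * sum(results) / len(results), 1) if results else 0.0

# 197 of 200 checks passed this week
weekly_results = [True] * 197 + [False] * 3
print(f"{pass_rate(weekly_results)}% of data quality checks passed")  # 98.5%
```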
The executive pitch
For leadership, frame governance as risk reduction and velocity improvement, not compliance overhead:
- Risk — "Without governance, a GDPR violation costs up to 4% of global revenue. Our framework automatically tracks and protects PII across all pipelines."
- Velocity — "Engineers currently spend ~20% of their time figuring out what data exists and whether they can trust it. A governed catalog and quality framework cuts that to near zero."
- Trust — "Last quarter we had 3 incidents where dashboards showed wrong numbers. With automated quality gates, bad data never reaches dashboards."
Pitfalls and Anti-Patterns
Even with a solid framework, there are traps to avoid. Here are the ones I've seen most often:
1. Governance as a separate team
Creating a dedicated "data governance team" that exists apart from engineering is almost always a mistake. Governance works when it's owned by the people who build and maintain the data. A central governance team becomes a bottleneck, a bureaucracy, and eventually irrelevant because they don't have context on the actual data.
Better pattern: A small governance enablement team (1-2 people) that builds tooling, templates, and CI checks — then distributes governance responsibility to domain teams through ownership. Think of them as the team that builds the guardrails, not the team that stands at the gate.
2. Over-classifying data
I've seen teams create 15 different PII classification levels with a 30-page taxonomy document. Nobody can remember them all, so nobody applies them correctly. Three levels is enough for most organizations: Public, Internal, Restricted (PII/Sensitive). If you truly need more granularity, add sub-categories within those three tiers — but the top level should be instantly understandable.
3. Testing everything with equal rigor
Not every table needs 50 tests. Apply the data tiering from Pillar 2: Gold tables get comprehensive tests and strict quality gates. Silver tables get basic schema tests and anomaly detection. Bronze tables get not_null on primary keys and unique checks, and that's it. Over-testing creates alert fatigue, which is worse than no testing because people start ignoring all alerts.
4. Forgetting about the catalog's freshness
A stale catalog is worse than no catalog — it actively misleads people. If your catalog isn't auto-populated from dbt, your orchestrator, and your warehouse metadata, it will go stale. Never rely on humans to manually update catalog entries. Automate or accept staleness.
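A freshness check can be as simple as comparing catalog timestamps against the last pipeline run. A sketch with illustrative timestamps — real ones would come from your catalog's API and your orchestrator's run metadata:

```python
# Sketch: flag catalog entries whose metadata predates the latest
# pipeline run. Timestamps here are illustrative.
from datetime import datetime, timedelta, timezone

def stale_entries(catalog: dict, last_run: datetime) -> list:
    """Names of entries not refreshed since the most recent run."""
    return sorted(name for name, updated in catalog.items() if updated < last_run)

last_run = datetime(2026, 1, 10, 8, 0, tzinfo=timezone.utc)
catalog = {
    "fct_revenue": last_run + timedelta(minutes=5),  # refreshed after the run
    "dim_customers": last_run - timedelta(days=45),  # stale
}
print(stale_entries(catalog, last_run))  # ['dim_customers']
```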
5. Making governance a one-time project
Governance isn't a project with an end date. It's an ongoing practice, like testing or code review. Budget for maintenance: new models need to be classified, ownership needs to be updated when teams reorg, and quality checks need to be tuned as data evolves. Treat governance work as part of sprint planning, not a separate initiative.
Putting It All Together
Here's what a well-governed data pipeline looks like when all five pillars are in place:
- A developer creates a new dbt model. They add it with an owner, description, column definitions, and tests — using a scaffold template that pre-populates these fields.
- They open a PR. CI automatically runs: ownership check (pass), documentation check (pass), dbt tests (pass). CODEOWNERS requires the owning team's approval.
- The model is deployed. dbt builds the table, applies PII tags via post-hooks, and the catalog is updated automatically from the manifest.
- The pipeline runs daily. Quality checks validate the output. If error-severity tests fail, the pipeline stops and the owning team is alerted. Warn-severity anomalies go to a monitoring channel.
- A business analyst discovers the table in the data catalog, sees its documentation, owner, freshness, and quality score. They query it confidently, knowing PII columns are automatically masked for their role.
- Three months later, a schema change is needed. The engineer checks lineage to understand the blast radius, coordinates with downstream owners, and makes the change with confidence.
This is governance that works — not because it's comprehensive, but because it's embedded. It doesn't ask people to change their workflow; it makes their existing workflow produce governed outputs.
Conclusion
Data governance has a reputation problem. Too many organizations associate it with bureaucracy, committees, and policy documents that nobody reads. But governance done right is the opposite: it's automation, tooling, and engineering practices that make your data platform more trustworthy, more discoverable, and more secure — without slowing anyone down.
The 5-pillar framework I've outlined here — Ownership, Catalog, Quality, Lineage, and PII/Access — isn't theoretical. It's the distillation of what I've seen actually work across multiple organizations and data platforms. The key principles are consistent:
- Automate everything. If a governance rule isn't enforced by code, it's a suggestion, not a rule.
- Start small and prove value. Govern your top 20 Gold tables first. Expand only after you've demonstrated value.
- Embed in the workflow. Governance that lives in CI, dbt config, and pre-commit hooks gets adopted. Governance that lives in wiki pages gets ignored.
- Make the governed path the easy path. Templates, scaffolding, and automation should make doing the right thing easier than doing the wrong thing.
- Measure and communicate. Track quality scores, catalog coverage, and incident rates. Share them with stakeholders who care about trust in data.
If you're building a data platform and governance feels like a burden, you're doing it wrong. When done right, governance is what makes your platform actually usable — the difference between a data warehouse people trust and a data swamp people fear. Start with ownership, add quality checks, build from there. Your future self will thank you.