2026-02-19

How to Build a Data Warehouse from Scratch on AWS

A step-by-step guide from someone who has done it multiple times

Building a data warehouse from scratch is one of the most consequential projects a data team will undertake. Get it right, and you create a foundation that accelerates every analytics and ML initiative for years. Get it wrong, and you end up with a brittle, expensive system that nobody trusts and everyone routes around.

I have built data warehouses from the ground up at multiple companies -- from early-stage startups to enterprises processing billions of events daily -- and every build on AWS follows a remarkably similar pattern. This is the playbook I have refined over 10+ years and multiple warehouse builds. It is opinionated, practical, and based on real production experience, not theory.

Let me be clear about one thing upfront: a data warehouse is not a technology choice. It is an organizational commitment. Picking Redshift or Snowflake is maybe 10% of the work. The other 90% is understanding what your business actually needs, modeling the data correctly, building reliable pipelines, and establishing governance that scales. This guide covers all of it.

End-to-end AWS data warehouse architecture
[Figure: sources (PostgreSQL OLTP, MongoDB, REST APIs, Kafka streams, SaaS such as Salesforce) feed ingestion (AWS Glue batch ETL/CDC, Kinesis Firehose streaming delivery, DMS database migration) into storage (S3 data lake with raw/Parquet/Iceberg, Redshift Serverless/RA3, Glue Catalog for metadata and schema), then transformation (dbt SQL transforms, Dagster orchestration, Great Expectations data quality) and serving (BI via Tableau/Looker, ML via SageMaker, reverse ETL). A Dagster/Airflow orchestration layer handles scheduling, monitoring, alerting, and lineage tracking; IAM + Lake Formation and CloudWatch monitoring span all layers.]

Step 1: Requirements Gathering -- The Most Underrated Phase

Every warehouse I have seen fail -- and I have seen several -- failed because the team skipped or rushed requirements gathering. They jumped straight to picking a database engine, started loading data, and then discovered six months later that the schema did not support the questions the business actually needed to answer.

Here is the framework I use. It takes one to two weeks, but it saves months of rework.

Identify Your Stakeholders and Their Questions

You are not building a warehouse for the data team. You are building it for the people who will consume the data: product managers, finance, marketing, executives, ML engineers. Sit down with each group and ask three questions:

  • What decisions do you make today that would be better with data? This surfaces the actual business questions. Not "I want a dashboard" -- what decision does the dashboard support?
  • What data do you currently use, and where does it come from? This reveals existing data flows, spreadsheet-based processes, and shadow IT data systems.
  • What questions can you not answer today? This is where the real value lives. The gap between what they know and what they need to know defines your priority.

Audit Your Source Systems

Before you design a single table, you need a complete inventory of every source system. For each one, document:

  • Data volume and velocity -- how much data, how often does it change?
  • Access method -- REST API, database replication, file export, streaming?
  • Data quality -- is the source system clean or a mess? Are there nulls, duplicates, schema drift?
  • Change data capture feasibility -- can you get incremental changes, or do you need full refreshes?
  • SLA requirements -- how fresh does this data need to be in the warehouse? Real-time? Hourly? Daily?

I keep this in a simple spreadsheet that becomes the source-of-truth document for the entire project. It sounds boring. It is. But it prevents the "oh, we forgot about that system" surprise that derails warehouse projects.
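
The inventory can live as structured data just as easily as a spreadsheet, which makes it checkable in CI. A sketch of what it might capture — the field names and example rows here are hypothetical, not from any real project:

```python
# Hypothetical source-system inventory mirroring the audit spreadsheet.
# Systems, fields, and values are illustrative assumptions.
SOURCE_INVENTORY = [
    {
        "system": "erp_postgres",
        "access_method": "dms_cdc",        # API, replication, file, stream
        "daily_volume_rows": 2_000_000,
        "supports_incremental": True,
        "freshness_sla_hours": 1,
        "known_quality_issues": ["schema drift on orders.status"],
    },
    {
        "system": "salesforce",
        "access_method": "rest_api",
        "daily_volume_rows": 50_000,
        "supports_incremental": True,
        "freshness_sla_hours": 24,
        "known_quality_issues": ["duplicate accounts"],
    },
]

REQUIRED_FIELDS = {
    "system", "access_method", "daily_volume_rows",
    "supports_incremental", "freshness_sla_hours", "known_quality_issues",
}

def validate_inventory(inventory):
    """Return the names of systems missing any required audit field."""
    return [
        row.get("system", "<unnamed>")
        for row in inventory
        if not REQUIRED_FIELDS.issubset(row)
    ]
```

A check like `validate_inventory` run in CI is what keeps the "oh, we forgot about that system" surprise from creeping back in as sources are added.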

Define Your SLAs Upfront

This is the one most teams skip, and it bites them hard later. For each major data consumer, agree on:

  • Data freshness -- "marketing dashboard refreshes by 7 AM ET" is a real SLA. "As fast as possible" is not.
  • Availability -- can the warehouse be down for maintenance on Sunday mornings?
  • Query performance -- executive dashboards need sub-second response. Ad hoc analytics can tolerate 30 seconds.
  • Data retention -- how far back does historical data need to go? This has massive cost implications on Redshift.
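
A freshness SLA like "refreshes by 7 AM ET" is concrete enough to check mechanically. A minimal sketch — the consumer names and values below are illustrative:

```python
from datetime import datetime, timedelta

# Hypothetical SLA table: consumer -> daily deadline + allowed staleness.
FRESHNESS_SLAS = {
    "marketing_dashboard": {"ready_by_hour_et": 7,
                            "max_staleness": timedelta(hours=24)},
    "exec_dashboard": {"ready_by_hour_et": 6,
                       "max_staleness": timedelta(hours=24)},
}

def sla_met(consumer, last_refresh, now):
    """True if the consumer's data is fresh enough, judged at `now`
    (both datetimes assumed to be in ET). Sketch, not production code."""
    sla = FRESHNESS_SLAS[consumer]
    deadline = now.replace(hour=sla["ready_by_hour_et"],
                           minute=0, second=0, microsecond=0)
    # Before today's deadline, yesterday's refresh is still acceptable.
    if now < deadline:
        return True
    return (now - last_refresh) <= sla["max_staleness"]
```

Wiring a check like this into orchestration is what turns "as fast as possible" into a pageable contract.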

Step 2: Data Modeling -- Get This Right or Pay Forever

Data modeling is the single most important technical decision in a warehouse build. I have seen teams treat it as an afterthought -- "we'll just load the raw data and figure out the schema later." That approach creates a data swamp, not a warehouse.

My Recommended Approach: Dimensional Modeling with a Raw Layer

After experimenting with multiple approaches -- Data Vault 2.0, pure star schemas, One Big Table, and hybrid approaches -- I have settled on a layered architecture that works well for most teams:

-- Layer 1: Raw (Bronze)
-- Exact copy of source data, append-only, partitioned by load date
-- Schema: raw.{source_system}.{table_name}
-- Example: raw.salesforce.opportunities

-- Layer 2: Staging (Silver)
-- Cleaned, typed, deduplicated, standardized
-- Schema: staging.{domain}.{entity}
-- Example: staging.sales.opportunities

-- Layer 3: Marts (Gold)
-- Business-oriented dimensional models
-- Schema: marts.{department}.{model_name}
-- Example: marts.finance.monthly_revenue

-- Layer 4: Aggregates / Reporting
-- Pre-computed aggregates for dashboard performance
-- Schema: reporting.{dashboard_name}.{metric}
-- Example: reporting.exec_dashboard.daily_kpis

This layered approach -- sometimes called the medallion architecture -- gives you several critical advantages:

  • Raw data is immutable. If your transformation logic has a bug (and it will), you can always reprocess from raw. I have saved entire projects because we kept the raw layer intact.
  • Each layer has a clear contract. Raw is "whatever the source sends." Staging is "cleaned and typed." Marts are "business-ready." This makes debugging straightforward -- you know exactly where to look when numbers don't match.
  • Performance optimization is targeted. Sort keys, distribution keys, and materialized views go on the marts and aggregates layers. The raw layer can be cheap Parquet on S3.

Dimensional Modeling Principles

For the marts layer, I use classic Kimball-style dimensional modeling because it works. It has worked for 30 years, and the basics have not changed:

  • Fact tables record business events: orders placed, sessions started, invoices sent. They are narrow (mostly foreign keys and measures) and long (millions to billions of rows).
  • Dimension tables describe the entities involved: customers, products, dates, locations. They are wide (many descriptive attributes) and relatively short.
  • Slowly Changing Dimensions (SCDs) track how dimensions change over time. Type 2 SCDs (new row per change with effective dates) are the default. Type 1 (overwrite) only when you genuinely don't care about history.
-- Example: A fact table for order events
CREATE TABLE marts.sales.fact_orders (
    order_key         BIGINT IDENTITY(1,1),
    order_id          VARCHAR(50) NOT NULL,
    customer_key      BIGINT REFERENCES marts.sales.dim_customers(customer_key),
    product_key       BIGINT REFERENCES marts.sales.dim_products(product_key),
    date_key          INTEGER REFERENCES marts.common.dim_date(date_key),
    order_status      VARCHAR(20),
    quantity          INTEGER,
    unit_price        DECIMAL(12,2),
    discount_amount   DECIMAL(12,2),
    total_amount      DECIMAL(12,2),
    _loaded_at        TIMESTAMP DEFAULT GETDATE(),
    _source_system    VARCHAR(50)
)
DISTKEY(customer_key)
SORTKEY(date_key, customer_key);

-- Example: A Type 2 SCD for customers
CREATE TABLE marts.sales.dim_customers (
    customer_key      BIGINT IDENTITY(1,1),
    customer_id       VARCHAR(50) NOT NULL,    -- natural key
    customer_name     VARCHAR(200),
    email             VARCHAR(200),
    segment           VARCHAR(50),
    region            VARCHAR(50),
    account_manager   VARCHAR(100),
    is_current        BOOLEAN DEFAULT TRUE,
    effective_from    TIMESTAMP NOT NULL,
    effective_to      TIMESTAMP DEFAULT '9999-12-31',
    _loaded_at        TIMESTAMP DEFAULT GETDATE()
)
DISTKEY(customer_key)
SORTKEY(is_current, customer_id);
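
The Type 2 mechanics behind that dim_customers table can be sketched in a few lines of Python. This is a simplified, dict-based illustration of the merge logic, not a production implementation — in practice this would be a dbt snapshot or a MERGE statement:

```python
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)
TRACKED_COLS = ("customer_name", "email", "segment", "region", "account_manager")

def scd2_merge(dim_rows, incoming_rows, now):
    """Type 2 merge over plain dict rows: expire the current version and
    append a new one whenever a tracked attribute changed. Sketch only."""
    out = [dict(r) for r in dim_rows]
    for src in incoming_rows:
        current = next(
            (r for r in out
             if r["customer_id"] == src["customer_id"] and r["is_current"]),
            None,
        )
        if current and all(current[c] == src[c] for c in TRACKED_COLS):
            continue  # unchanged -- keep the existing current row
        if current:
            # Close out the old version with an effective_to timestamp.
            current["is_current"] = False
            current["effective_to"] = now
        out.append({
            "customer_id": src["customer_id"],
            **{c: src[c] for c in TRACKED_COLS},
            "is_current": True,
            "effective_from": now,
            "effective_to": HIGH_DATE,
        })
    return out
```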

A note on Data Vault: I have implemented Data Vault 2.0 at enterprise scale and it has real strengths -- particularly for auditability and handling many source systems feeding the same business entities. But for most teams, it is overkill. The hub-link-satellite pattern adds significant complexity that only pays off if you have 10+ source systems, strict regulatory requirements, or a large modeling team. Start with dimensional modeling. Move to Data Vault when you hit its specific pain points, not before.

Step 3: Setting Up Redshift -- The Engine Room

For AWS-native warehouses, Redshift remains my default choice. I know Snowflake has mind share, and it is a great product -- but if you are already deep in the AWS ecosystem, Redshift Serverless has closed the gap dramatically and the cost advantage is real.

Redshift Serverless vs. Provisioned

For new projects, I start with Redshift Serverless every time. Here is why:

  • No cluster management. No node types, no resizing, no WLM queue configuration. You set a base capacity in RPUs and Redshift scales automatically.
  • Pay for what you use. Idle time costs nothing. For teams with variable workloads -- heavy during business hours, quiet at night -- this is significantly cheaper.
  • Automatic scaling. Burst workloads (like end-of-month reporting) just work. No manual concurrency scaling or queue management.
  • Same engine. It is the same Redshift engine under the hood. Your SQL, your COPY commands, your dbt models -- all identical.

The exception: if you have consistent, predictable, high-throughput workloads running 20+ hours a day, provisioned RA3 nodes with reserved instances will be cheaper. But start serverless and migrate when you have the usage data to justify it.
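
The break-even arithmetic is worth running with your own numbers. This sketch uses illustrative prices — check current AWS pricing for your region before deciding:

```python
def monthly_cost_serverless(rpus, active_hours_per_day, price_per_rpu_hour=0.375):
    """Redshift Serverless: pay only for active RPU-hours.
    price_per_rpu_hour is an illustrative figure, not current AWS pricing."""
    return rpus * active_hours_per_day * 30 * price_per_rpu_hour

def monthly_cost_provisioned(nodes, price_per_node_hour=3.26):
    """Provisioned RA3 cluster running 24/7 on-demand.
    price_per_node_hour is illustrative; reserved instances cost less."""
    return nodes * 24 * 30 * price_per_node_hour

# A bursty team: 32 RPUs active ~6 hours/day vs. an always-on 2-node cluster.
bursty = monthly_cost_serverless(32, 6)        # serverless wins comfortably
always_on = monthly_cost_provisioned(2)
```

At six active hours a day serverless is roughly half the cost in this sketch; push the active window past 20 hours and the provisioned cluster pulls ahead, which is exactly the crossover described above.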

Critical Configuration

-- Create a namespace and workgroup (Redshift Serverless)
-- Base capacity: 32 RPUs for most teams (scales up automatically)
-- Enhanced VPC routing: Always on (keeps data traffic in your VPC)

-- Essential schema setup
CREATE SCHEMA raw;
CREATE SCHEMA staging;
CREATE SCHEMA marts;
CREATE SCHEMA reporting;
CREATE SCHEMA sandbox;  -- for ad hoc analyst queries

-- Create groups with appropriate permissions
CREATE GROUP data_engineers;
CREATE GROUP analysts;
CREATE GROUP bi_tools;

-- Engineers get full access to raw + staging, read on marts
GRANT ALL ON SCHEMA raw TO GROUP data_engineers;
GRANT ALL ON SCHEMA staging TO GROUP data_engineers;
GRANT USAGE ON SCHEMA marts TO GROUP data_engineers;
GRANT SELECT ON ALL TABLES IN SCHEMA marts TO GROUP data_engineers;

-- Analysts get read access to marts + sandbox write
GRANT USAGE ON SCHEMA marts TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA marts TO GROUP analysts;
GRANT ALL ON SCHEMA sandbox TO GROUP analysts;

-- BI tools get read-only on marts + reporting
GRANT USAGE ON SCHEMA marts TO GROUP bi_tools;
GRANT USAGE ON SCHEMA reporting TO GROUP bi_tools;
GRANT SELECT ON ALL TABLES IN SCHEMA marts TO GROUP bi_tools;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO GROUP bi_tools;

Distribution and Sort Keys -- The Performance Multiplier

This is where most teams leave massive performance gains on the table. Redshift distributes data across slices, and if your distribution strategy is wrong, every join causes a data redistribution (broadcast or hash) that can slow queries by 10-100x.

My rules of thumb:

  • Fact tables: DISTKEY on the most frequently joined column (usually the primary dimension key -- often customer_key or account_key).
  • Large dimension tables: DISTKEY on the primary key so joins with fact tables are co-located.
  • Small dimension tables: DISTSTYLE ALL. Redshift replicates the entire table to every slice. If it fits in memory, every join is local.
  • Sort keys: Put your most common WHERE clause columns first. Almost always includes a date column. Compound sort keys for predictable query patterns; interleaved for ad hoc analysis.

A hard-won lesson: do not over-optimize distribution keys on day one. Load your data, run your actual queries for two weeks, then check STL_ALERT_EVENT_LOG and SVL_QUERY_SUMMARY for redistribution warnings. Let the real query patterns guide your optimization, not your assumptions.
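
That two-week check can be scripted. The SQL below queries a real Redshift system table, but the alert substrings and the filtering helper are illustrative — verify them against the actual alert text your cluster emits:

```python
# Surface recent planner alerts from Redshift's system table.
ALERT_SQL = """
    SELECT query, event, solution, COUNT(*) AS occurrences
    FROM stl_alert_event_log
    WHERE event_time > DATEADD(day, -14, GETDATE())
    GROUP BY query, event, solution
    ORDER BY occurrences DESC;
"""

# Substrings suggesting a join forced data movement. Based on typical
# Redshift alert wording -- an assumption to verify on your cluster.
REDISTRIBUTION_MARKERS = (
    "Distributed a large number of rows",
    "Broadcasted a large number of rows",
)

def redistribution_alerts(rows):
    """Filter alert rows (dicts with an 'event' key) down to the ones
    that point at a likely DISTKEY problem."""
    return [r for r in rows
            if any(m in r["event"] for m in REDISTRIBUTION_MARKERS)]
```

Run the query after two weeks of real traffic, feed the rows through the filter, and let the queries that actually redistribute data pick your distribution keys.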

Step 4: Building the ETL with AWS Glue

Ingestion is where most warehouse projects spend 60-70% of their engineering time. Getting data from source systems into your raw layer reliably, efficiently, and with proper error handling is harder than it looks.

Why AWS Glue

For AWS-native stacks, Glue is my default ingestion engine. It is not the prettiest tool, but it has critical advantages:

  • Serverless. No infrastructure to manage. No EMR clusters to size, no Spark configurations to tune.
  • Native S3 and Redshift integration. Glue crawlers populate the Glue Data Catalog automatically. The Redshift connector handles COPY commands efficiently.
  • PySpark or Python shell. For complex transformations, PySpark. For simple API extractions, Python shell jobs are faster and cheaper.
  • Built-in bookmarking. Glue's job bookmarking tracks what data has been processed, enabling incremental loads without custom state management.

Ingestion Patterns

# Pattern 1: Full load for small/static tables (Glue Python shell)
import boto3
import requests
import pandas as pd
import awswrangler as wr
from datetime import datetime

# connection_string, API_BASE, today (the load date), last_run_timestamp,
# get_secret(), and apply_cdc_merge() are assumed defined elsewhere in the job.

def full_load_reference_table():
    """Full load for small reference tables -- run daily."""
    df = pd.read_sql(
        "SELECT * FROM public.product_categories",
        connection_string
    )
    # Write to S3 raw layer as Parquet, partitioned by load date
    wr.s3.to_parquet(
        df=df,
        path="s3://data-lake-raw/erp/product_categories/",
        dataset=True,
        partition_cols=["_load_date"],
        database="raw",
        table="erp_product_categories",
        mode="overwrite_partitions"
    )

# Pattern 2: CDC with DMS + Glue for high-volume OLTP tables
# DMS streams changes to S3 as Parquet
# Glue job merges changes into the staging layer
def merge_cdc_changes():
    """Process CDC changes from DMS into staging layer."""
    # Read new CDC files since last bookmark
    cdc_df = wr.s3.read_parquet(
        path="s3://data-lake-raw/cdc/orders/",
        last_modified_begin=last_run_timestamp
    )

    # Apply CDC operations (INSERT, UPDATE, DELETE)
    existing_df = wr.s3.read_parquet(
        path="s3://data-lake-staging/sales/orders/"
    )

    merged_df = apply_cdc_merge(existing_df, cdc_df)

    wr.s3.to_parquet(
        df=merged_df,
        path="s3://data-lake-staging/sales/orders/",
        dataset=True,
        mode="overwrite"
    )

# Pattern 3: API extraction with pagination and error handling
def extract_from_api():
    """Extract data from paginated REST API with retry logic."""
    all_records = []
    page = 1

    while True:
        response = requests.get(
            f"{API_BASE}/events?page={page}&per_page=1000",
            headers={"Authorization": f"Bearer {get_secret('api_token')}"},
            timeout=30
        )
        response.raise_for_status()
        data = response.json()

        if not data["results"]:
            break

        all_records.extend(data["results"])
        page += 1

    df = pd.DataFrame(all_records)
    df["_extracted_at"] = datetime.utcnow().isoformat()

    wr.s3.to_parquet(
        df=df,
        path=f"s3://data-lake-raw/api/events/_load_date={today}/",
        dataset=True
    )
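
Pattern 2 calls an apply_cdc_merge helper that is not shown above. One way it might look, as a simplified dict-based sketch — a real Glue job would do this in Spark or pandas, and this version assumes DMS-style Op codes (I/U/D) on each change record:

```python
def apply_cdc_merge(existing_rows, cdc_rows, key="order_id", op_col="Op"):
    """Collapse DMS-style CDC records (Op = I/U/D) onto the existing
    snapshot. Assumes cdc_rows arrive oldest-to-newest, as you would get
    by sorting on the DMS change timestamp. Sketch, not production code."""
    state = {r[key]: dict(r) for r in existing_rows}
    for change in cdc_rows:
        k = change[key]
        if change[op_col] == "D":
            state.pop(k, None)          # delete: drop the row if present
        else:                           # "I" and "U" both upsert latest image
            state[k] = {c: v for c, v in change.items() if c != op_col}
    return list(state.values())
```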

Critical principle: your raw layer should be an exact, immutable copy of source data. Do not clean, filter, or transform during ingestion. The only additions should be metadata columns: _loaded_at, _source_system, _load_date. Transformation belongs in the staging and marts layers, managed by dbt.
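
That convention is easy to enforce with a small helper. The column names match the text; the function itself is an illustration:

```python
from datetime import datetime, timezone

def add_ingestion_metadata(records, source_system):
    """Stamp each raw record with the only columns ingestion may add:
    _loaded_at, _source_system, _load_date. Source fields are untouched."""
    now = datetime.now(timezone.utc)
    stamped = []
    for r in records:
        row = dict(r)  # never mutate the source payload
        row["_loaded_at"] = now.isoformat()
        row["_source_system"] = source_system
        row["_load_date"] = now.date().isoformat()
        stamped.append(row)
    return stamped
```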

Layered data flow: sources through raw, staging, and marts
[Figure: source OLTP/API data flows into RAW (Bronze, S3 Parquet: raw.erp.orders, raw.crm.accounts, raw.api.events — immutable, append-only), then STAGING (Silver, Redshift + dbt: stg_orders, stg_accounts, stg_events — cleaned, typed, deduped), then MARTS (Gold, Redshift + dbt: fact_orders, dim_customers, dim_products — business-ready models), consumed by BI, ML, and apps. Each layer has clear ownership, contracts, and quality gates.]

Step 5: Transformation with dbt -- The Modern Standard

Once data lands in your raw layer, dbt takes over for transformation. If you are not using dbt for your warehouse transformations in 2026, you are making your life harder than it needs to be. It is the single best tool for managing SQL-based transformation logic.

dbt Project Structure

I organize every dbt project the same way. This structure has survived projects with 500+ models:

dbt_warehouse/
├── models/
│   ├── staging/          # 1:1 with source tables
│   │   ├── erp/
│   │   │   ├── _erp__sources.yml
│   │   │   ├── _erp__models.yml
│   │   │   ├── stg_erp__orders.sql
│   │   │   └── stg_erp__products.sql
│   │   └── crm/
│   │       ├── _crm__sources.yml
│   │       ├── stg_crm__accounts.sql
│   │       └── stg_crm__contacts.sql
│   ├── intermediate/     # Complex joins / business logic
│   │   └── int_order_items_enriched.sql
│   ├── marts/            # Business-facing models
│   │   ├── finance/
│   │   │   ├── fact_revenue.sql
│   │   │   └── dim_cost_centers.sql
│   │   └── sales/
│   │       ├── fact_orders.sql
│   │       ├── dim_customers.sql
│   │       └── dim_products.sql
│   └── reporting/        # Pre-aggregated for dashboards
│       └── rpt_daily_revenue_summary.sql
├── tests/
│   └── generic/
│       └── test_revenue_positive.sql
├── macros/
│   ├── generate_schema_name.sql
│   └── incremental_helpers.sql
└── dbt_project.yml

Key dbt Patterns I Use Everywhere

-- Staging model: stg_erp__orders.sql
-- Convention: rename columns, cast types, filter deleted records

WITH source AS (
    SELECT * FROM {{ source('erp', 'orders') }}
),

renamed AS (
    SELECT
        order_id::VARCHAR(50)         AS order_id,
        customer_id::VARCHAR(50)      AS customer_id,
        product_id::VARCHAR(50)       AS product_id,
        order_date::DATE              AS order_date,
        quantity::INTEGER             AS quantity,
        unit_price::DECIMAL(12,2)     AS unit_price,
        (quantity * unit_price)::DECIMAL(12,2) AS line_total,
        status::VARCHAR(20)           AS order_status,
        _loaded_at                    AS _loaded_at
    FROM source
    WHERE _is_deleted = FALSE  -- soft deletes from CDC
)

SELECT * FROM renamed

-- Incremental model: fact_orders.sql
-- Only process new/changed records since last run

{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        dist='customer_key',
        sort=['date_key', 'customer_key']
    )
}}

WITH orders AS (
    SELECT * FROM {{ ref('stg_erp__orders') }}
    {% if is_incremental() %}
    WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
    {% endif %}
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
    WHERE is_current = TRUE
),

date_dim AS (
    SELECT * FROM {{ ref('dim_date') }}
)

SELECT
    o.order_id,
    c.customer_key,
    d.date_key,
    o.order_status,
    o.quantity,
    o.unit_price,
    o.line_total     AS total_amount,
    o._loaded_at
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
LEFT JOIN date_dim d ON o.order_date = d.calendar_date

Testing That Actually Matters

dbt's testing framework is powerful, but I see teams either write zero tests or write hundreds of trivial ones. Here is what actually catches production issues:

# In _sales__models.yml
models:
  - name: fact_orders
    description: "One row per order line item"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_key
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_key
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000  # flag outliers
      - name: date_key
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 20200101
              max_value: 20261231  # no future dates

The tests that save you at 3 AM are: uniqueness on primary keys (catches duplicate ingestion), referential integrity (catches join key mismatches), and range checks on measures (catches unit conversion errors and data type overflows).

Step 6: Orchestration -- Making It All Run Reliably

Orchestration is the connective tissue of your warehouse. It is what turns a collection of scripts and SQL models into a reliable, observable data platform. I have written extensively about why I prefer Dagster over Airflow for dbt-centric stacks, but the principles apply regardless of which tool you choose.

The Pipeline DAG

A typical warehouse refresh pipeline follows this pattern:

  1. Extract: Pull data from source systems into S3 raw layer (Glue jobs, API extractors)
  2. Load: Copy raw data from S3 into Redshift raw schema (COPY commands or Spectrum)
  3. Stage: Run dbt staging models (clean, type, deduplicate)
  4. Transform: Run dbt intermediate and mart models (business logic, joins, aggregations)
  5. Test: Run dbt tests and data quality checks
  6. Notify: Alert on failures, publish freshness metrics

# Dagster example: asset-based warehouse pipeline
import boto3
from dagster import asset, AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# wait_for_glue_job, dbt_manifest_path, and tableau_client are assumed
# to be defined and configured elsewhere in the project.

@asset(group_name="ingestion")
def raw_orders(context: AssetExecutionContext):
    """Extract orders from source ERP into S3 raw layer."""
    glue_client = boto3.client("glue")
    response = glue_client.start_job_run(
        JobName="extract-erp-orders",
        Arguments={"--load_date": context.partition_key}
    )
    # Wait for completion and check status
    wait_for_glue_job(glue_client, response["JobRunId"])

@asset(group_name="ingestion")
def raw_customers(context: AssetExecutionContext):
    """Extract customer data from CRM into S3 raw layer."""
    glue_client = boto3.client("glue")
    response = glue_client.start_job_run(
        JobName="extract-crm-customers",
        Arguments={"--load_date": context.partition_key}
    )
    wait_for_glue_job(glue_client, response["JobRunId"])

# dbt models are automatically loaded as Dagster assets
@dbt_assets(manifest=dbt_manifest_path)
def dbt_warehouse_models(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

# Downstream: trigger BI refresh after marts are built
@asset(
    deps=["fact_orders", "dim_customers"],
    group_name="serving"
)
def refresh_tableau_extracts():
    """Trigger Tableau extract refresh after mart models complete."""
    tableau_client.refresh_workbook("Sales Dashboard")
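
One way the wait_for_glue_job helper used above might be implemented — a sketch; note that boto3's get_job_run also requires the job name, so in practice the asset code would pass it through as well:

```python
import time

def wait_for_glue_job(glue_client, job_name, run_id,
                      poll_seconds=30, timeout_seconds=3600):
    """Poll a Glue job run until it reaches a terminal state.
    Raises on failure so the orchestrator marks the asset failed."""
    terminal = {"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"}
    waited = 0
    while waited < timeout_seconds:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in terminal:
            if state != "SUCCEEDED":
                raise RuntimeError(f"Glue job {job_name} ended in {state}")
            return state
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"Glue job {job_name} still running after {timeout_seconds}s")
```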

Error Handling and Alerting

The difference between a toy pipeline and a production pipeline is error handling. Every orchestration setup needs:

  • Retry logic with exponential backoff -- source APIs go down, Glue jobs hit transient errors. Retry 3 times with increasing delays before alerting.
  • Dead letter handling -- when a source system sends malformed data, don't fail the entire pipeline. Log the bad records, process what you can, and alert.
  • SLA monitoring -- if the pipeline hasn't completed by 7 AM, page someone. Don't wait for a stakeholder to notice stale dashboards.
  • Data freshness checks -- after each run, validate that the max timestamp in your mart tables matches expectations. A pipeline can "succeed" while loading zero rows.
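
The first bullet — retries with exponential backoff — generalizes into a small decorator. A sketch with illustrative defaults:

```python
import time
import random

def with_retries(max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Retry a flaky call with exponential backoff plus jitter.
    After max_attempts failures, re-raise so alerting can fire."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    # 2s, 4s, 8s... plus up to 1s of jitter
                    sleep(base_delay * 2 ** (attempt - 1) + random.random())
        return wrapper
    return decorator
```

Decorating an API extractor with `@with_retries(max_attempts=3)` absorbs transient failures; a genuinely broken source still surfaces after the final attempt, which is exactly when you want the alert.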

Step 7: Governance -- The Boring Work That Makes Everything Else Possible

Governance is the part nobody wants to do and everybody needs. Without it, your warehouse becomes a dumping ground that nobody trusts.

Access Control

On AWS, you need layered access control:

  • IAM policies control who can access AWS resources (S3 buckets, Glue jobs, Redshift clusters).
  • Redshift RBAC controls who can query what within the warehouse. Use role-based access with least privilege.
  • Lake Formation provides column-level and row-level security for the data lake layer. Essential if you have PII or financial data.
  • VPC and networking -- Redshift enhanced VPC routing ensures data never traverses the public internet. Private subnets for Glue jobs.

Data Quality Framework

dbt tests are your first line of defense, but you need a broader data quality framework:

# Great Expectations suite for the orders pipeline
expectation_suite = {
    "expectation_suite_name": "orders_quality",
    "expectations": [
        # Volume check: detect anomalous drops
        {
            "expectation_type": "expect_table_row_count_to_be_between",
            "kwargs": {"min_value": 1000, "max_value": 500000}
        },
        # Freshness check: no stale data
        {
            "expectation_type": "expect_column_max_to_be_between",
            "kwargs": {
                "column": "order_date",
                "min_value": "{{ yesterday }}",
                "max_value": "{{ today }}"
            }
        },
        # Distribution check: detect data drift
        {
            "expectation_type": "expect_column_mean_to_be_between",
            "kwargs": {
                "column": "total_amount",
                "min_value": 50.0,
                "max_value": 500.0
            }
        }
    ]
}

Documentation and Lineage

A warehouse without documentation is a liability. Here is what I mandate on every project:

  • dbt docs -- every model and column has a description. Run dbt docs generate and host the site. This is non-negotiable.
  • Lineage tracking -- dbt provides model-level lineage. Dagster extends this to include ingestion and downstream consumers. At minimum, you should be able to answer "where does this number come from?" for any metric in any dashboard.
  • Data dictionary -- a central document defining business terms. What is "revenue"? What is an "active user"? Get the business to agree on definitions before encoding them in SQL.
  • Runbook -- operational documentation for when things break. What to do when a source system is down, when Redshift is slow, when a dbt model fails. Write this on day one, not after the first incident.
Data governance layers: security, quality, and observability
[Figure: the warehouse (Redshift + S3 data lake) is wrapped by a security layer (IAM, Lake Formation, VPC, RBAC, encryption at rest), data quality (dbt tests, Great Expectations, volume anomaly detection, freshness monitoring), observability (Dagster asset lineage, CloudWatch metrics, query performance logs, cost monitoring), and documentation (dbt docs, data dictionary, runbooks, SLA definitions).]

Common Mistakes I Have Made (So You Don't Have To)

After building warehouses at multiple companies, here are the mistakes that keep recurring -- including ones I made myself early in my career:

1. Starting with the Technology Instead of the Requirements

The worst warehouse I ever worked on was one where the team picked Redshift, loaded every table from every source system, and then asked "so what do we do with this?" Six months and $200K in Redshift costs later, nobody was using it because it didn't answer any actual business questions. Always start with the business questions and work backward to the technical requirements.

2. Skipping the Raw Layer

Early in my career, I built pipelines that transformed data during ingestion. When we discovered a bug in the transformation logic (a currency conversion that was using stale exchange rates), we had no way to reprocess -- the raw data was gone. Always keep an immutable raw layer. S3 storage is cheap. Re-collecting source data is expensive or impossible.

3. Not Investing in Incremental Models Early

Full table refreshes work fine when your fact table has 100K rows. They become ruinously expensive when it has 500M rows. Design your dbt models with incremental materialization from the start. Yes, it is more complex. Yes, it is worth it. The migration from table to incremental materialization in a running production system is painful.

4. Ignoring Redshift-Specific Optimizations

Redshift is not Postgres. It looks like Postgres, it speaks the Postgres wire protocol, but its query optimizer makes completely different decisions. Specifically:

  • Distribution keys matter enormously. A bad DISTKEY can make a simple JOIN take 100x longer because it forces a full data redistribution.
  • VACUUM and ANALYZE are not optional. Redshift doesn't auto-vacuum aggressively enough for high-write workloads. Schedule VACUUM SORT and ANALYZE after major loads.
  • Don't use SELECT *. Redshift is columnar. Selecting all columns reads all columns from disk. Be explicit about what you need.
  • Use Redshift Spectrum for cold data. Historical data that is rarely queried should live in S3 as Parquet, accessible via external tables. Move only hot data into native Redshift tables.
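
The VACUUM/ANALYZE point lends itself to automation. This helper only builds the statement strings (the table name in the test usage is from the examples earlier in the article); a post-load job would execute them against the cluster:

```python
def post_load_maintenance_sql(tables):
    """Generate maintenance statements to run after large loads.
    VACUUM SORT ONLY re-sorts rows without reclaiming space (faster
    than a full VACUUM); ANALYZE refreshes planner statistics."""
    statements = []
    for table in tables:
        statements.append(f"VACUUM SORT ONLY {table};")
        statements.append(f"ANALYZE {table};")
    return statements
```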

5. Treating the Warehouse as a One-Time Project

A data warehouse is a living system. Source schemas change. New data sources appear. Business definitions evolve. Teams need new metrics. If you treat the warehouse as a project with an end date, it will calcify and die. Budget ongoing engineering capacity for the warehouse -- at least 20% of one senior engineer's time for maintenance, schema evolution, and performance tuning.

The Timeline: What to Expect

For a team of two to three data engineers building a warehouse from scratch on AWS, here is a realistic timeline:

  • Weeks 1-2: Requirements gathering, stakeholder interviews, source system audit
  • Weeks 3-4: Data modeling (dimensional model design, schema definitions, dbt project skeleton)
  • Weeks 5-7: Infrastructure setup (Redshift, S3, Glue, IAM, VPC, Dagster deployment)
  • Weeks 8-12: Pipeline development (Glue ingestion jobs, dbt staging + mart models, tests)
  • Weeks 13-14: Integration testing, performance tuning, sort/dist key optimization
  • Weeks 15-16: BI tool integration, analyst onboarding, documentation, runbooks

That is roughly four months from kickoff to production, assuming reasonably well-defined source systems and a team that has done this before. Add a month if it is the team's first warehouse. Add two months if you have more than 10 source systems or complex regulatory requirements.

Conclusion

Building a data warehouse from scratch is a significant undertaking, but it does not have to be mysterious. The pattern is well-established: understand your business requirements, model the data properly, build reliable ingestion, transform with dbt, orchestrate with a modern tool like Dagster, and wrap the whole thing in governance that scales.

The technology choices matter less than most people think. What matters is the discipline: immutable raw layers, clear layer contracts, incremental models, comprehensive testing, and genuine documentation. Get those right and your warehouse will be a trusted, valued asset for years. Skip them, and no amount of Redshift Serverless auto-scaling will save you.

The best warehouse is one that people actually use. Start with the questions the business needs answered, build the simplest thing that answers them reliably, and iterate from there. Ten years and multiple warehouse builds later, that is still the best advice I have.