Data Warehouse Design: Performance Guide
Bad warehouse design kills performance. We redesigned warehouses at 50+ companies. Your schema choice affects query speed, costs, and how easily you can add new features. This guide covers dimensional modeling, star vs. snowflake schemas, and optimization strategies.
Start with Dimensional Modeling
Most warehouses use dimensional modeling. It separates data into two types:
Fact tables: Business events (sales, orders, clicks)
Dimension tables: Descriptive details (customers, products, dates)
This structure gives you:
- Fast queries through simple, predictable joins
- Clear business meaning
- Easy changes when requirements shift
Schema Patterns
Star Schema
Structure:
- One central fact table
- Multiple dimension tables connected directly
Example:
Sales Fact
├── Customer Dimension
├── Product Dimension
├── Date Dimension
└── Store Dimension
Pros:
- Simple to understand
- Fast queries (fewer joins)
- Easy to implement
Cons:
- Some denormalization (data redundancy)
- Larger dimension tables
Best for:
- Most analytics use cases
- Medium to large datasets
- Teams new to dimensional modeling
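A star schema can be sketched in a few tables. Here is a minimal, runnable version using SQLite (table and column names are illustrative, not from any specific system; a real warehouse would add its own types and partitioning clauses):

```python
import sqlite3

# One central fact table, dimensions joined directly (star schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key
    customer_id  TEXT NOT NULL,         -- natural key from the source system
    segment      TEXT
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,      -- e.g. 20240115
    full_date TEXT,
    year      INTEGER
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    revenue      REAL
);
INSERT INTO dim_customer VALUES (1, 'C100', 'enterprise');
INSERT INTO dim_date     VALUES (20240115, '2024-01-15', 2024);
INSERT INTO fact_sales   VALUES (1, 20240115, 2, 100.0),
                                (1, 20240115, 1,  50.0);
""")

# A typical star query: one fact table, one direct join per dimension.
rows = conn.execute("""
    SELECT d.year, c.segment, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d     ON d.date_key     = f.date_key
    GROUP BY d.year, c.segment
""").fetchall()
```

Note that every join is a single hop from the fact table, which is what keeps star-schema queries fast and easy to reason about.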
Snowflake Schema
Structure:
- One central fact table
- Normalized dimension tables (dimensions have sub-dimensions)
Example:
Sales Fact
├── Customer Dimension
│ ├── Geography Dimension
│ └── Demographics Dimension
├── Product Dimension
│ ├── Category Dimension
│ └── Brand Dimension
└── Date Dimension
Pros:
- No data redundancy
- Easier to maintain dimension changes
- Lower storage requirements
Cons:
- More joins (slower queries)
- More complex to understand
- Harder to implement
Best for:
- Very large datasets where storage matters
- Highly normalized source systems
- Advanced teams comfortable with complexity
Our recommendation: Start with star schema. Normalize to snowflake only if you have specific storage or maintenance requirements.
Fact Table Design
Granularity
Choose the right granularity (level of detail):
Transaction-level:
- One row per transaction
- Most detailed, most flexible
- Largest storage requirements
Aggregated:
- One row per time period + dimensions
- Smaller storage, faster queries
- Less flexible
Recommendation: Store at the lowest granularity you might need. Aggregate for specific use cases, but keep detailed data available.
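The "store detailed, aggregate on demand" idea can be sketched in a few lines. Field names here are made up for illustration:

```python
from collections import defaultdict

# Transaction-level rows: one row per sale (most detailed grain).
transactions = [
    {"date": "2024-01-15", "store": "S1", "revenue": 40.0},
    {"date": "2024-01-15", "store": "S1", "revenue": 60.0},
    {"date": "2024-01-16", "store": "S1", "revenue": 25.0},
]

# Derive an aggregated grain (one row per day + store) when needed.
daily = defaultdict(float)
for t in transactions:
    daily[(t["date"], t["store"])] += t["revenue"]
```

You can always roll detailed facts up to a coarser grain like this, but you can never recover detail from an aggregate, which is why the lowest useful grain should be the one you store.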
Measures and Metrics
Additive Measures:
- Can be summed across any dimension
- Examples: revenue, quantity, cost
Semi-additive Measures:
- Can be summed across some dimensions, not others
- Example: balances (sum across accounts, not across time)
Non-additive Measures:
- Cannot be summed
- Example: ratios, percentages, averages
Design rule: Store additive and semi-additive measures in fact tables. Calculate non-additive measures in queries or views.
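Semi-additivity is easiest to see with balances. A small sketch (account names and values are illustrative):

```python
# Daily balance snapshots per account.
balances = {
    ("2024-01-01", "checking"): 100.0,
    ("2024-01-01", "savings"):  200.0,
    ("2024-01-02", "checking"): 150.0,
    ("2024-01-02", "savings"):  200.0,
}

# Valid: sum across accounts for a single day.
total_jan2 = sum(v for (day, _), v in balances.items()
                 if day == "2024-01-02")

# Invalid: summing across time double-counts the same money.
naive_sum = sum(balances.values())  # meaningless total

# The correct "period total" is the balance on the last day, not a sum.
end_of_period = total_jan2
```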
Surrogate Keys
Use synthetic keys (auto-incrementing IDs) instead of natural keys:
Why:
- Handle changes to natural keys
- Improve join performance
- Separate business logic from technical keys
Implementation:
- Fact tables: Composite key (surrogate keys from all dimensions)
- Dimension tables: Surrogate key + natural key + change tracking
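A dimension carrying a surrogate key, natural key, and change-tracking columns might look like this (a SQLite sketch with illustrative names; warehouse-native DDL would differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    product_id  TEXT NOT NULL,                      -- natural key
    name        TEXT,
    valid_from  TEXT NOT NULL,                      -- change tracking
    valid_to    TEXT,                               -- NULL = still current
    is_current  INTEGER NOT NULL DEFAULT 1
)
""")
conn.execute("INSERT INTO dim_product (product_id, name, valid_from) "
             "VALUES ('P42', 'Widget', '2024-01-01')")

# Facts look up and join on the surrogate key, never the natural key.
key = conn.execute(
    "SELECT product_key FROM dim_product "
    "WHERE product_id = 'P42' AND is_current = 1").fetchone()[0]
```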
Dimension Table Design
Slowly Changing Dimensions (SCD)
Dimensions change over time. Choose the right strategy:
Type 1: Overwrite
- New value replaces old value
- No history maintained
Use when: Historical accuracy doesn't matter
Type 2: Add New Row
- New row with new surrogate key
- Old row kept with end date
- Complete history maintained
Use when: Historical accuracy is critical
Type 3: Add New Column
- Keep current and previous value
- Limited history (only one previous)
Use when: You need to track one previous value
Most common: Type 2 for customer/product dimensions, Type 1 for reference data.
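The Type 2 pattern above (close the current row, insert a new one with a fresh surrogate key) can be sketched as plain Python; field names are illustrative:

```python
# Current dimension contents: one row per customer version.
dim = [
    {"key": 1, "customer_id": "C100", "city": "Boston",
     "valid_from": "2023-01-01", "valid_to": None, "current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """SCD Type 2: end-date the current row, append a new current row."""
    next_key = max(r["key"] for r in dim) + 1
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["valid_to"] = change_date  # close out the old version
            row["current"] = False
    dim.append({"key": next_key, "customer_id": customer_id,
                "city": new_city, "valid_from": change_date,
                "valid_to": None, "current": True})

apply_scd2(dim, "C100", "Denver", "2024-06-01")
```

Because the old row keeps its surrogate key, historical facts still join to the attributes that were true when they occurred.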
Degenerate Dimensions
Some transaction attributes don't belong in dimension tables:
Examples:
- Order number
- Invoice number
- Transaction ID
Store in: Fact table itself (as degenerate dimensions)
Why:
- High cardinality (too many distinct values)
- Only used for filtering, not grouping
- No descriptive attributes
Performance Optimization
Partitioning
Partition large fact tables:
Common strategies:
- Date partitioning: By day/month/year
- Range partitioning: By value ranges
- Hash partitioning: For even distribution
Best practice: Partition by the dimension you filter on most often (usually date).
Clustering
Group related data together:
Columnar warehouses (Snowflake, BigQuery) manage physical data layout automatically, but you can improve on the defaults:
Snowflake:
- Cluster by dimensions used in WHERE clauses
- Limit to 4-5 columns
BigQuery:
- Partition by date
- Cluster by up to 4 dimensions
Indexing
Traditional warehouses (Redshift, Postgres):
- Index foreign keys in fact tables
- Index dimension natural keys
- Consider bitmap indexes for low-cardinality columns
Modern warehouses (Snowflake, BigQuery):
- No user-managed indexes; metadata-based pruning is automatic
- Focus on partitioning and clustering
Materialized Views
Pre-aggregate common queries:
Use for:
- Frequently accessed aggregations
- Complex calculations
- Cross-table joins
Trade-offs:
- Faster queries
- Storage cost
- Maintenance overhead (refresh schedules)
Best practice: Materialize aggregations at daily, weekly, and monthly grains to cover common reporting periods.
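SQLite has no native materialized views, but the idea is easy to sketch as an aggregate table plus a refresh step (most warehouses offer `CREATE MATERIALIZED VIEW` instead; names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (date_key INTEGER, revenue REAL);
INSERT INTO fact_sales VALUES (20240115, 100.0), (20240115, 50.0),
                              (20240116, 25.0);
CREATE TABLE daily_sales (date_key INTEGER PRIMARY KEY, revenue REAL);
""")

def refresh_daily_sales(conn):
    """Full rebuild; production refreshes are usually incremental."""
    conn.executescript("""
        DELETE FROM daily_sales;
        INSERT INTO daily_sales
        SELECT date_key, SUM(revenue) FROM fact_sales GROUP BY date_key;
    """)

refresh_daily_sales(conn)
rows = conn.execute("SELECT * FROM daily_sales ORDER BY date_key").fetchall()
```

Queries against `daily_sales` skip the transaction-level scan entirely, which is the whole benefit; the cost is keeping the refresh schedule honest.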
Design Patterns
Conformed Dimensions
Use the same dimension tables across fact tables:
Example:
- Same Date dimension for Sales and Inventory facts
- Same Customer dimension for Orders and Support facts
Benefits:
- Consistent reporting
- Easier cross-functional analysis
- Reduced maintenance
Factless Fact Tables
Track events without measures:
Use cases:
- Student attendance (student, date, class)
- Event tracking (user, event_type, timestamp)
- Status changes (entity, status, timestamp)
Implementation: Fact table with only dimension keys, no measures.
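A factless fact table in DDL form, using the attendance example above (SQLite sketch, illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_attendance (
    student_key INTEGER,   -- only dimension keys, no measures
    class_key   INTEGER,
    date_key    INTEGER
);
INSERT INTO fact_attendance VALUES (1, 10, 20240115), (2, 10, 20240115),
                                   (1, 10, 20240116);
""")

# Analysis is done by counting rows, since there is nothing to sum.
attended = conn.execute(
    "SELECT COUNT(*) FROM fact_attendance "
    "WHERE class_key = 10 AND date_key = 20240115").fetchone()[0]
```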
Aggregate Fact Tables
Pre-summarize detailed facts:
Example:
- Daily sales (from transaction-level facts)
- Monthly customer metrics (from daily facts)
Benefits:
- Faster queries for common aggregations
- Lower compute costs
Trade-off:
- Storage overhead
- Need to refresh aggregations
Common Mistakes
Over-Normalization
Normalizing too much hurts performance.
Solution: Denormalize dimensions for query performance. Storage is cheap, query time is expensive.
Too Much Granularity
Storing more detail than you'll ever use.
Solution: Start with business requirements. Add detail only if needed.
Ignoring Query Patterns
Designing without understanding how data will be queried.
Solution: Review common queries. Design schema to speed up those queries.
Poor Partitioning
Partitioning by wrong columns or too many partitions.
Solution: Partition by most common filter (usually date). Limit partition count (aim for <1000 partitions).
No Documentation
Schema becomes incomprehensible over time.
Solution: Document business logic, transformations, and data lineage.
Evolution and Maintenance
Adding New Dimensions
Process:
- Create new dimension table
- Add foreign key to fact table
- Populate dimension with historical data
- Update ETL to populate going forward
Handling Schema Changes
New measures:
- Add columns to fact table
- Backfill if needed
- Update queries gradually
New dimensions:
- Create dimension table
- Add to fact table
- Populate and maintain
Data Quality
Enforce at load time:
- Foreign key constraints (where supported)
- Not null constraints for required fields
- Data type validation
Monitor continuously:
- Referential integrity checks
- Completeness checks
- Distribution monitoring
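A referential-integrity check of the kind listed above can be as simple as scanning for fact rows whose keys have no matching dimension row (data here is illustrative):

```python
# Keys currently present in the customer dimension.
dim_customer_keys = {1, 2, 3}

fact_rows = [
    {"customer_key": 1, "revenue": 10.0},
    {"customer_key": 9, "revenue": 5.0},   # orphan: no such customer
]

# Orphaned fact rows break joins silently, so surface them explicitly.
orphans = [r for r in fact_rows
           if r["customer_key"] not in dim_customer_keys]
# In production this would feed an alert, not an assertion.
```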
Tools and Best Practices
dbt for Modeling
Use dbt to define models as code:
Benefits:
- Version control
- Testing
- Documentation
- Reusability
Pattern:
- Staging models: Clean source data
- Intermediate models: Transformations
- Mart models: Business-ready datasets
Documentation
Document everything:
- Business definitions
- Data sources
- Transformations
- Known issues
- Usage examples
Conclusion
Good warehouse design balances performance, flexibility, and maintainability. Start with star schema, partition wisely, and improve based on actual query patterns. Remember: perfect is the enemy of good. Design for your current needs, but plan for evolution.
The best warehouse design is one that:
- Answers business questions quickly
- Adapts to changing requirements
- Is understandable by your team
- Can scale with your data
Focus on these principles, and you'll build warehouses that serve your organization well for years to come.