Data Warehouse Design: Performance Guide
Bad warehouse design kills performance. We redesigned warehouses at 50+ companies. Your schema choice affects query speed, costs, and how easily you can add new features. This guide covers dimensional modeling, star vs. snowflake schemas, and optimization strategies.
Start with Dimensional Modeling
Most warehouses use dimensional modeling. It separates data into two types:
Fact tables: Business events (sales, orders, clicks)
Dimension tables: Descriptive details (customers, products, dates)
This structure gives you:
- Fast queries through simple, predictable joins
- Clear business meaning
- Easy changes when requirements shift
Schema Patterns
Star Schema
Structure:
- One central fact table
- Multiple dimension tables connected directly
Example:
Sales Fact
├── Customer Dimension
├── Product Dimension
├── Date Dimension
└── Store Dimension
Pros:
- Simple to understand
- Fast queries (fewer joins)
- Easy to implement
Cons:
- Some denormalization (data redundancy)
- Larger dimension tables
Best for:
- Most analytics use cases
- Medium to large datasets
- Teams new to dimensional modeling
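A star schema can be sketched in a few tables. Here is a minimal, runnable version using SQLite (table and column names are illustrative, not from any specific system; a real warehouse would add its own types and partitioning clauses):

```python
import sqlite3

# One central fact table, dimensions joined directly (star schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key
    customer_id  TEXT NOT NULL,         -- natural key from the source system
    segment      TEXT
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,      -- e.g. 20240115
    full_date TEXT,
    year      INTEGER
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    revenue      REAL
);
INSERT INTO dim_customer VALUES (1, 'C100', 'enterprise');
INSERT INTO dim_date     VALUES (20240115, '2024-01-15', 2024);
INSERT INTO fact_sales   VALUES (1, 20240115, 2, 100.0),
                                (1, 20240115, 1,  50.0);
""")

# A typical star query: one fact table, one direct join per dimension.
rows = conn.execute("""
    SELECT d.year, c.segment, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d     ON d.date_key     = f.date_key
    GROUP BY d.year, c.segment
""").fetchall()
```

Note that every join is a single hop from the fact table, which is what keeps star-schema queries fast and easy to reason about.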
Snowflake Schema
Structure:
- One central fact table
- Normalized dimension tables (dimensions have sub-dimensions)
Example:
Sales Fact
├── Customer Dimension
│ ├── Geography Dimension
│ └── Demographics Dimension
├── Product Dimension
│ ├── Category Dimension
│ └── Brand Dimension
└── Date Dimension
Pros:
- No data redundancy
- Easier to maintain dimension changes
- Lower storage requirements
Cons:
- More joins (slower queries)
- More complex to understand
- Harder to implement
Best for:
- Very large datasets where storage matters
- Highly normalized source systems
- Advanced teams comfortable with complexity
Our recommendation: Start with star schema. Normalize to snowflake only if you have specific storage or maintenance requirements.
Fact Table Design
Granularity
Choose the right granularity (level of detail):
Transaction-level:
- One row per transaction
- Most detailed, most flexible
- Largest storage requirements
Aggregated:
- One row per time period + dimensions
- Smaller storage, faster queries
- Less flexible
Recommendation: Store at the lowest granularity you might need. Aggregate for specific use cases, but keep detailed data available.
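The "store detailed, aggregate on demand" idea can be sketched in a few lines. Field names here are made up for illustration:

```python
from collections import defaultdict

# Transaction-level rows: one row per sale (most detailed grain).
transactions = [
    {"date": "2024-01-15", "store": "S1", "revenue": 40.0},
    {"date": "2024-01-15", "store": "S1", "revenue": 60.0},
    {"date": "2024-01-16", "store": "S1", "revenue": 25.0},
]

# Derive an aggregated grain (one row per day + store) when needed.
daily = defaultdict(float)
for t in transactions:
    daily[(t["date"], t["store"])] += t["revenue"]
```

You can always roll detailed facts up to a coarser grain like this, but you can never recover detail from an aggregate, which is why the lowest useful grain should be the one you store.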
Measures and Metrics
Additive Measures:
- Can be summed across any dimension
- Examples: revenue, quantity, cost
Semi-additive Measures:
- Can be summed across some dimensions, not others
- Example: balances (sum across accounts, not across time)
Non-additive Measures:
- Cannot be summed
- Example: ratios, percentages, averages
Design rule: Store additive and semi-additive measures in fact tables. Calculate non-additive measures in queries or views.
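Semi-additivity is easiest to see with balances. A small sketch (account names and values are illustrative):

```python
# Daily balance snapshots per account.
balances = {
    ("2024-01-01", "checking"): 100.0,
    ("2024-01-01", "savings"):  200.0,
    ("2024-01-02", "checking"): 150.0,
    ("2024-01-02", "savings"):  200.0,
}

# Valid: sum across accounts for a single day.
total_jan2 = sum(v for (day, _), v in balances.items()
                 if day == "2024-01-02")

# Invalid: summing across time double-counts the same money.
naive_sum = sum(balances.values())  # meaningless total

# The correct "period total" is the balance on the last day, not a sum.
end_of_period = total_jan2
```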
Surrogate Keys
Use synthetic keys (auto-incrementing IDs) instead of natural keys:
Why:
- Handle changes to natural keys
- Improve join performance
- Separate business logic from technical keys
Implementation:
- Fact tables: Composite key (surrogate keys from all dimensions)
- Dimension tables: Surrogate key + natural key + change tracking
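A dimension carrying a surrogate key, natural key, and change-tracking columns might look like this (a SQLite sketch with illustrative names; warehouse-native DDL would differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    product_id  TEXT NOT NULL,                      -- natural key
    name        TEXT,
    valid_from  TEXT NOT NULL,                      -- change tracking
    valid_to    TEXT,                               -- NULL = still current
    is_current  INTEGER NOT NULL DEFAULT 1
)
""")
conn.execute("INSERT INTO dim_product (product_id, name, valid_from) "
             "VALUES ('P42', 'Widget', '2024-01-01')")

# Facts look up and join on the surrogate key, never the natural key.
key = conn.execute(
    "SELECT product_key FROM dim_product "
    "WHERE product_id = 'P42' AND is_current = 1").fetchone()[0]
```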
Dimension Table Design
Slowly Changing Dimensions (SCD)
Dimensions change over time. Choose the right strategy:
Type 1: Overwrite
- New value replaces old value
- No history maintained
Use when: Historical accuracy doesn't matter
Type 2: Add New Row
- New row with new surrogate key
- Old row kept with end date
- Complete history maintained
Use when: Historical accuracy is critical
Type 3: Add New Column
- Keep current and previous value
- Limited history (only one previous)
Use when: You need to track one previous value
Most common: Type 2 for customer/product dimensions, Type 1 for reference data.
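The Type 2 pattern above (close the current row, insert a new one with a fresh surrogate key) can be sketched as plain Python; field names are illustrative:

```python
# Current dimension contents: one row per customer version.
dim = [
    {"key": 1, "customer_id": "C100", "city": "Boston",
     "valid_from": "2023-01-01", "valid_to": None, "current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """SCD Type 2: end-date the current row, append a new current row."""
    next_key = max(r["key"] for r in dim) + 1
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["valid_to"] = change_date  # close out the old version
            row["current"] = False
    dim.append({"key": next_key, "customer_id": customer_id,
                "city": new_city, "valid_from": change_date,
                "valid_to": None, "current": True})

apply_scd2(dim, "C100", "Denver", "2024-06-01")
```

Because the old row keeps its surrogate key, historical facts still join to the attributes that were true when they occurred.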
Degenerate Dimensions
Some transaction attributes don't belong in dimension tables:
Examples:
- Order number
- Invoice number
- Transaction ID
Store in: Fact table itself (as degenerate dimensions)
Why:
- High cardinality (too many distinct values)
- Only used for filtering, not grouping
- No descriptive attributes
Performance Optimization
Partitioning
Partition large fact tables:
Common strategies:
- Date partitioning: By day/month/year
- Range partitioning: By value ranges
- Hash partitioning: For even distribution
Best practice: Partition by the dimension you filter on most often (usually date).
Clustering
Group related data together:
Columnar warehouses (Snowflake, BigQuery) manage physical data layout automatically, but you can improve on the defaults:
Snowflake:
- Cluster by dimensions used in WHERE clauses
- Limit to 4-5 columns
BigQuery:
- Partition by date
- Cluster by up to 4 dimensions
Indexing
Traditional warehouses (Redshift, Postgres):
- Index foreign keys in fact tables
- Index dimension natural keys
- Consider bitmap indexes for low-cardinality columns
Modern warehouses (Snowflake, BigQuery):
- No user-managed indexes; metadata-based pruning is automatic
- Focus on partitioning and clustering
Materialized Views
Pre-aggregate common queries:
Use for:
- Frequently accessed aggregations
- Complex calculations
- Cross-table joins
Trade-offs:
- Faster queries
- Storage cost
- Maintenance overhead (refresh schedules)
Best practice: Materialize aggregations at daily, weekly, and monthly grains to cover common reporting periods.
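SQLite has no native materialized views, but the idea is easy to sketch as an aggregate table plus a refresh step (most warehouses offer `CREATE MATERIALIZED VIEW` instead; names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (date_key INTEGER, revenue REAL);
INSERT INTO fact_sales VALUES (20240115, 100.0), (20240115, 50.0),
                              (20240116, 25.0);
CREATE TABLE daily_sales (date_key INTEGER PRIMARY KEY, revenue REAL);
""")

def refresh_daily_sales(conn):
    """Full rebuild; production refreshes are usually incremental."""
    conn.executescript("""
        DELETE FROM daily_sales;
        INSERT INTO daily_sales
        SELECT date_key, SUM(revenue) FROM fact_sales GROUP BY date_key;
    """)

refresh_daily_sales(conn)
rows = conn.execute("SELECT * FROM daily_sales ORDER BY date_key").fetchall()
```

Queries against `daily_sales` skip the transaction-level scan entirely, which is the whole benefit; the cost is keeping the refresh schedule honest.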
Design Patterns
Conformed Dimensions
Use the same dimension tables across fact tables:
Example:
- Same Date dimension for Sales and Inventory facts
- Same Customer dimension for Orders and Support facts
Benefits:
- Consistent reporting
- Easier cross-functional analysis
- Reduced maintenance
Factless Fact Tables
Track events without measures:
Use cases:
- Student attendance (student, date, class)
- Event tracking (user, event_type, timestamp)
- Status changes (entity, status, timestamp)
Implementation: Fact table with only dimension keys, no measures.
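A factless fact table in DDL form, using the attendance example above (SQLite sketch, illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_attendance (
    student_key INTEGER,   -- only dimension keys, no measures
    class_key   INTEGER,
    date_key    INTEGER
);
INSERT INTO fact_attendance VALUES (1, 10, 20240115), (2, 10, 20240115),
                                   (1, 10, 20240116);
""")

# Analysis is done by counting rows, since there is nothing to sum.
attended = conn.execute(
    "SELECT COUNT(*) FROM fact_attendance "
    "WHERE class_key = 10 AND date_key = 20240115").fetchone()[0]
```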
Aggregate Fact Tables
Pre-summarize detailed facts:
Example:
- Daily sales (from transaction-level facts)
- Monthly customer metrics (from daily facts)
Benefits:
- Faster queries for common aggregations
- Lower compute costs
Trade-off:
- Storage overhead
- Need to refresh aggregations
Common Mistakes
Over-Normalization
Normalizing too much hurts performance.
Solution: Denormalize dimensions for query performance. Storage is cheap, query time is expensive.
Too Much Granularity
Storing more detail than you'll ever use.
Solution: Start with business requirements. Add detail only if needed.
Ignoring Query Patterns
Designing without understanding how data will be queried.
Solution: Review common queries. Design schema to speed up those queries.
Poor Partitioning
Partitioning by wrong columns or too many partitions.
Solution: Partition by most common filter (usually date). Limit partition count (aim for <1000 partitions).
No Documentation
Schema becomes incomprehensible over time.
Solution: Document business logic, transformations, and data lineage.
Evolution and Maintenance
Adding New Dimensions
Process:
- Create new dimension table
- Add foreign key to fact table
- Populate dimension with historical data
- Update ETL to populate going forward
Handling Schema Changes
New measures:
- Add columns to fact table
- Backfill if needed
- Update queries gradually
New dimensions:
- Create dimension table
- Add to fact table
- Populate and maintain
Data Quality
Enforce at load time:
- Foreign key constraints (where supported)
- Not null constraints for required fields
- Data type validation
Monitor continuously:
- Referential integrity checks
- Completeness checks
- Distribution monitoring
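A referential-integrity check of the kind listed above can be as simple as scanning for fact rows whose keys have no matching dimension row (data here is illustrative):

```python
# Keys currently present in the customer dimension.
dim_customer_keys = {1, 2, 3}

fact_rows = [
    {"customer_key": 1, "revenue": 10.0},
    {"customer_key": 9, "revenue": 5.0},   # orphan: no such customer
]

# Orphaned fact rows break joins silently, so surface them explicitly.
orphans = [r for r in fact_rows
           if r["customer_key"] not in dim_customer_keys]
# In production this would feed an alert, not an assertion.
```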
Tools and Best Practices
dbt for Modeling
Use dbt to define models as code:
Benefits:
- Version control
- Testing
- Documentation
- Reusability
Pattern:
- Staging models: Clean source data
- Intermediate models: Transformations
- Mart models: Business-ready datasets
Documentation
Document everything:
- Business definitions
- Data sources
- Transformations
- Known issues
- Usage examples
Conclusion
Good warehouse design balances performance, flexibility, and maintainability. Start with star schema, partition wisely, and improve based on actual query patterns. Remember: perfect is the enemy of good. Design for your current needs, but plan for evolution.
The best warehouse design is one that:
- Answers business questions quickly
- Adapts to changing requirements
- Is understandable by your team
- Can scale with your data
Focus on these principles, and you'll build warehouses that serve your organization well for years to come.