Data Quality Checks That Actually Work
Bad data costs companies millions. Yet most teams waste time checking meaningless things. We've implemented data quality checks at 50+ companies. Here's what actually matters.
Stop Checking the Wrong Things
Teams usually start with basic validation:
- Is this field empty?
- Does it match a pattern?
- Is it within expected ranges?
These catch obvious problems but miss the issues that actually break your business.
What Actually Matters
Business-Critical Validations
Start with checks that affect business outcomes; a sketch of two such checks follows the lists below:
Financial Data:
- Balance checks: Do debits equal credits?
- Reconciliation: Do aggregated values match source totals?
- Anomaly detection: Are transaction patterns normal?
Customer Data:
- Uniqueness: Are customer IDs actually unique?
- Completeness: Do we have required fields for key operations?
- Freshness: Is customer status up-to-date?
Product Data:
- Relationships: Do product hierarchies make sense?
- Consistency: Are prices within expected ranges?
- Availability: Can orders be fulfilled with current inventory?
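To make this concrete, here is a minimal sketch of a balance check and a uniqueness check using pandas. The column names (debit, credit, customer_id) are illustrative assumptions, not a prescribed schema.

import pandas as pd

def ledger_balances(ledger: pd.DataFrame, tolerance: float = 0.01) -> bool:
    # Debits and credits should net out within a small rounding tolerance.
    return abs(ledger["debit"].sum() - ledger["credit"].sum()) <= tolerance

def customer_ids_unique(customers: pd.DataFrame) -> bool:
    # Duplicate customer IDs usually point to a bad join or a double load upstream.
    return not customers["customer_id"].duplicated().any()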
Statistical Anomalies
Instead of hard limits, use statistical methods:
Historical Comparisons:
- Is today's volume within 2 standard deviations of historical average?
- Are value distributions similar to previous periods?
- Do ratios match expected patterns?
Example:

# Bad: a hard limit that has to be re-tuned whenever volumes grow
if order_count > 10000:
    raise ValueError("Too many orders")

# Good: a statistical check against history
# (historical_orders is assumed to be a pandas Series of daily order counts)
mean = historical_orders.mean()
std = historical_orders.std()
if abs(order_count - mean) > 3 * std:
    alert("Unusual order volume detected")  # alert() is whatever notification hook you use
Cross-System Consistency
Data quality issues often show up as inconsistencies across systems; see the reconciliation sketch after this list:
- Do customer counts match in CRM and billing system?
- Do revenue totals match in transactional DB and analytics warehouse?
- Are product mappings consistent across sources?
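As an illustration, here is a minimal count reconciliation between two systems. It assumes sqlite3-style connections (where conn.execute() works directly) and illustrative table names.

def reconcile_customer_counts(crm_conn, billing_conn, tolerance: int = 0) -> bool:
    # Compare distinct customer counts between the CRM and the billing system.
    crm = crm_conn.execute("SELECT COUNT(DISTINCT customer_id) FROM customers").fetchone()[0]
    billing = billing_conn.execute("SELECT COUNT(DISTINCT customer_id) FROM invoices").fetchone()[0]
    if abs(crm - billing) > tolerance:
        print(f"Customer count mismatch: CRM={crm}, billing={billing}")
        return False
    return True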
Implementation Strategies
Schema Validation
Catch structural issues early; a minimal example follows the lists in this subsection:
Tools:
- JSON Schema for API responses
- Great Expectations for comprehensive validation
- Custom validators for domain-specific rules
When to use:
- On data ingestion
- After transformations that change structure
- Before loading to final tables
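For example, a minimal ingestion-time check with the jsonschema package might look like this; the schema and field names are illustrative, not a real contract.

from jsonschema import ValidationError, validate

ORDER_SCHEMA = {
    "type": "object",
    "required": ["order_id", "customer_id", "quantity"],
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "quantity": {"type": "integer", "minimum": 1},
    },
}

def is_valid_order(record: dict) -> bool:
    # Reject records that break the structural contract before they reach transformations.
    try:
        validate(instance=record, schema=ORDER_SCHEMA)
        return True
    except ValidationError as exc:
        print(f"Schema violation: {exc.message}")
        return False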
Business Rule Validation
Enforce domain-specific rules; a small sketch follows the lists below:
Examples:
- "Orders cannot have negative quantities"
- "Users must have at least one contact method"
- "Subscription end dates must be after start dates"
Implementation:
- dbt tests for SQL-based rules
- Python validators for complex logic
- Custom checks in transformation pipelines
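As a sketch, the example rules above translate directly into small Python validators; the function and field names here are illustrative.

from datetime import date

def check_order_quantity(quantity: int) -> list[str]:
    # "Orders cannot have negative quantities"
    return [] if quantity >= 0 else ["order quantity is negative"]

def check_subscription_dates(start: date, end: date) -> list[str]:
    # "Subscription end dates must be after start dates"
    return [] if end > start else ["subscription ends before it starts"]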
Statistical Monitoring
Track data characteristics over time; a profiling sketch follows the lists below:
Metrics to monitor:
- Record counts (absolute and by dimensions)
- Null percentages
- Value distributions
- Uniqueness rates
- Freshness (time since last update)
Tools:
- Great Expectations for expectation suites
- Custom dashboards for monitoring
- Anomaly detection algorithms
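A minimal profiling pass over a pandas DataFrame can cover most of these metrics. It assumes a tz-naive updated_at timestamp column, which is an illustrative name.

import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    return {
        "record_count": len(df),
        "null_pct": df.isna().mean().to_dict(),      # share of nulls per column
        "distinct_values": df.nunique().to_dict(),   # uniqueness per column
        "hours_since_update": (pd.Timestamp.now() - df["updated_at"].max()).total_seconds() / 3600,
    }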
Cross-System Validation
Verify consistency across sources, as sketched after this list:
Common checks:
- Reconciliation reports
- Aggregation comparisons
- Join completeness checks
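For instance, a join completeness check can flag orders whose customer_id has no match in the customer table; the DataFrames and column names here are illustrative.

import pandas as pd

def orphaned_orders(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    # Left join with an indicator column; "left_only" rows have no matching customer.
    merged = orders.merge(customers[["customer_id"]], on="customer_id", how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"]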
Building a Data Quality Framework
Identify Critical Data
Not all data needs the same level of validation. Prioritize:
- Business-critical: Revenue, customer data, financials
- Decision-support: Analytics datasets, reporting tables
- Reference data: Lookups, configurations
Define Quality Dimensions
For each critical dataset, define the following dimensions; a configuration sketch follows this list:
- Completeness: Are required fields populated?
- Accuracy: Do values reflect reality?
- Consistency: Are values consistent across sources?
- Timeliness: Is data fresh enough?
- Validity: Do values conform to expected formats?
- Uniqueness: Are identifiers actually unique?
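One lightweight way to capture these decisions is a plain configuration dict per dataset, as in this sketch; the dataset and column names are illustrative.

QUALITY_RULES = {
    "orders": {
        "completeness": ["order_id", "customer_id", "quantity"],   # required fields
        "uniqueness": ["order_id"],
        "validity": {"quantity": {"min": 0}},
        "timeliness_max_age_hours": 24,
    },
}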
Implement Checks Incrementally
Start with critical datasets and expand:
Week 1: Core financial data
Week 2: Customer data
Week 3: Product data
Week 4: Expand to analytics tables
Establish SLAs
Set clear expectations; an enforcement sketch follows this list:
- Critical data: 99.9% quality threshold
- Important data: 99% quality threshold
- Supporting data: 95% quality threshold
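Enforcing these tiers can be as simple as comparing a pass rate against a per-tier threshold; the tier names here mirror the list above.

SLA_THRESHOLDS = {"critical": 0.999, "important": 0.99, "supporting": 0.95}

def meets_sla(passed_checks: int, total_checks: int, tier: str) -> bool:
    return (passed_checks / total_checks) >= SLA_THRESHOLDS[tier]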
Automate Responses
Don't just detect issues; fix them automatically when possible. A routing sketch follows this list:
- Auto-retry: For transient failures
- Auto-deduplicate: For known duplicate patterns
- Auto-enrich: For missing reference data
- Alert: For issues requiring human intervention
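A minimal routing sketch, with print stubs standing in for the real handlers; the issue types and handler names are illustrative.

def auto_retry(payload): print(f"retrying: {payload}")
def auto_dedup(payload): print(f"deduplicating: {payload}")
def auto_enrich(payload): print(f"enriching from reference data: {payload}")
def page_human(payload): print(f"alerting on-call: {payload}")

HANDLERS = {
    "transient_failure": auto_retry,
    "duplicate_records": auto_dedup,
    "missing_reference": auto_enrich,
}

def handle_issue(issue_type: str, payload: dict) -> None:
    # Route known issue types to automated fixes; everything else goes to a human.
    HANDLERS.get(issue_type, page_human)(payload)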
Common Pitfalls
Too Many Checks
Every check has a cost:
- Compute resources
- Maintenance overhead
- Alert fatigue
Solution: Focus on checks that matter. Remove checks that never catch issues.
Ignoring Historical Context
A value might seem wrong in isolation but be normal historically.
Solution: Compare against historical patterns, not just absolute thresholds.
Not Acting on Failures
What's the point of detecting issues if you don't fix them?
Solution: Have clear runbooks for each type of quality issue.
Perfect is the Enemy of Good
Don't wait for perfect quality before using data.
Solution: Accept reasonable quality levels, monitor continuously, improve incrementally.
Tools and Technologies
Great Expectations
Comprehensive data quality framework:
Pros:
- Extensive library of expectations
- Good documentation
- Active community
Cons:
- Can be complex for simple use cases
- Requires infrastructure setup
dbt Tests
Simple, SQL-based tests:
Pros:
- Integrated with transformations
- Easy to write and maintain
- Version-controlled with code
Cons:
- Limited to SQL-based checks
- Limited support for statistical analysis
Custom Solutions
Sometimes you need domain-specific validation:
When to build:
- Complex business rules
- Proprietary data formats
- Specific performance requirements
Measuring Success
Track these metrics; a small computation sketch follows the list:
- Quality Score: Percentage of checks passing
- Time to Detection: How quickly issues are caught
- Time to Resolution: How quickly issues are fixed
- False Positive Rate: How often alerts are noise
- Business Impact: Reduction in bad decisions due to data issues
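Two of these can be computed directly from check results, as in this sketch; the results structure is an assumption for illustration.

def quality_score(results: list[dict]) -> float:
    # results: list of {"passed": bool, "false_positive": bool} entries
    return sum(r["passed"] for r in results) / len(results)

def false_positive_rate(results: list[dict]) -> float:
    alerts = [r for r in results if not r["passed"]]
    return sum(r["false_positive"] for r in alerts) / len(alerts) if alerts else 0.0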
Conclusion
Data quality isn't about perfection—it's about trust. Implement checks that matter, monitor continuously, and improve incrementally. Start simple, expand based on actual issues, and always keep business impact in mind.
A single quality check that catches a $10,000 error is worth more than 100 checks that catch nothing. Focus on what matters.