Data Fabric Architecture: Unifying Your Data
How to build a data fabric that connects siloed systems. Architecture patterns, implementation strategies, and lessons from production deployments.
Your data lives everywhere. Customer data in Salesforce. Product data in PostgreSQL. Analytics in Snowflake. Logs in Elasticsearch. Marketing data in Google Analytics.
Each system works fine on its own. But when you need to answer questions that span multiple systems, you're stuck. You can't join data across silos. You duplicate data everywhere. Governance becomes impossible.
Data fabric architecture solves this. It creates a unified layer that connects all your data sources without requiring you to move everything into one place.
After building data fabrics for dozens of companies, we've seen teams cut data integration time by 60% and reduce data duplication by 40%. Here's how they did it.
What is Data Fabric?
Data fabric is an architecture pattern that provides unified access to data across multiple systems, locations, and formats.
Key characteristics:
- Connects data without moving it
- Provides a single interface to query multiple sources
- Maintains data governance and security
- Supports both structured and unstructured data
- Works with on-premises and cloud systems
Think of it as a virtual layer that sits on top of your existing systems, making them look like one unified database.
Why Data Fabric Matters
The problem with data silos:
Scenario: You want to analyze customer lifetime value.
Without data fabric:
- Export customer data from CRM
- Export order data from e-commerce system
- Export support ticket data from helpdesk
- Load everything into a data warehouse
- Join tables, clean data, build reports
- Repeat next month when data changes
With data fabric:
- Write one query that joins across systems
- Get results in seconds
- Data stays in source systems
Benefits:
- Faster time to insight (no ETL delays)
- Reduced data duplication
- Single source of truth
- Better governance (one place to manage access)
- Lower costs (less storage, less processing)
Data Fabric vs Data Warehouse vs Data Lake
These aren't competing approaches. They solve different problems.
Data Warehouse:
- Stores data in one place
- Optimized for analytics
- Requires ETL to load data
- Best for: Structured data, regular reporting
Data Lake:
- Stores raw data in one place
- Flexible schema
- Requires processing to use
- Best for: Unstructured data, exploration
Data Fabric:
- Connects data where it lives
- No central storage required
- Query across systems
- Best for: Real-time access, reducing duplication
Many companies use all three: data fabric for real-time queries, data warehouse for analytics, data lake for raw storage.
Core Components
1. Data Virtualization Layer
Provides a unified query interface. When you query, it:
- Determines which systems contain the data
- Translates your query to each system's language
- Combines results from multiple sources
- Returns unified results
class DataFabricQueryEngine:
    def __init__(self, connectors):
        self.connectors = connectors  # Maps data sources to connectors
        self.query_planner = QueryPlanner()
        self.result_merger = ResultMerger()

    def execute_query(self, query):
        # Parse query
        parsed = self.parse_query(query)

        # Plan execution (which systems to query)
        execution_plan = self.query_planner.plan(parsed)

        # Execute each plan step against its source
        # (a parallel version appears under Query Optimization below)
        results = []
        for step in execution_plan:
            connector = self.connectors[step.source]
            result = connector.execute(step.query)
            results.append(result)

        # Merge results
        merged = self.result_merger.merge(results, parsed.join_conditions)
        return merged
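Wiring the engine up might look like the sketch below. The connection strings and the `@source` query syntax are illustrative, and DatabaseConnector is the connector class defined later in this section.

connectors = {
    'crm': DatabaseConnector('postgresql://crm-db/crm', 'postgresql'),
    'ecommerce': DatabaseConnector('mysql://shop-db/shop', 'mysql'),
}
engine = DataFabricQueryEngine(connectors)

# One fabric query spanning both systems (syntax shown in Pattern 1 below)
result = engine.execute_query("""
    SELECT c.name, o.total
    FROM customers@crm c
    JOIN orders@ecommerce o ON c.id = o.customer_id
""")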
2. Metadata Catalog
Tracks what data exists where. Think of it as a searchable index of all your data.
What it stores:
- Data sources and their schemas
- Data lineage (where data came from, how it was transformed)
- Data quality metrics
- Access policies and permissions
- Business glossaries and definitions
class MetadataCatalog:
    def __init__(self):
        self.sources = {}  # source_id -> source_info
        self.schemas = {}  # source_id -> schema
        self.lineage = {}  # data_id -> lineage_info

    def register_source(self, source_id, source_info, schema):
        self.sources[source_id] = source_info
        self.schemas[source_id] = schema

    def search(self, query):
        """Search for data by name, type, or tags"""
        results = []
        for source_id, schema in self.schemas.items():
            if self.matches(query, schema):
                results.append({
                    'source_id': source_id,
                    'schema': schema,
                    'location': self.sources[source_id]['location']
                })
        return results

    def get_lineage(self, data_id):
        """Get data lineage: where data came from, transformations applied"""
        return self.lineage.get(data_id, {})
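A brief usage sketch, with illustrative source names, locations, and schema entries:

catalog = MetadataCatalog()

# Register a source with a minimal schema description
catalog.register_source(
    source_id='crm',
    source_info={'type': 'postgresql', 'location': 'crm-db.internal:5432'},
    schema={'customers': ['id', 'name', 'email', 'created_at']}
)

# Find every source that exposes customer data
for match in catalog.search('customers'):
    print(match['source_id'], match['location'])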
3. Connectors
Connect to different data sources. Each connector knows how to:
- Authenticate to the source system
- Translate queries to the source's query language
- Handle data type conversions
- Manage connection pooling
class DatabaseConnector:
    def __init__(self, connection_string, source_type):
        self.connection_string = connection_string
        self.source_type = source_type
        self.pool = self.create_connection_pool()

    def execute(self, query):
        """Execute query and return results"""
        connection = self.pool.get_connection()
        try:
            cursor = connection.cursor()
            cursor.execute(self.translate_query(query))
            results = cursor.fetchall()
            return self.format_results(results)
        finally:
            self.pool.return_connection(connection)

    def translate_query(self, query):
        """Translate fabric query to source-specific SQL"""
        # Handle dialect differences
        if self.source_type == 'postgresql':
            return self.translate_to_postgres(query)
        elif self.source_type == 'mysql':
            return self.translate_to_mysql(query)
        # ...
4. Security and Governance Layer
Manages access control, data masking, and compliance.
class SecurityLayer:
    def __init__(self, policies):
        self.policies = policies

    def check_access(self, user, data_source, operation):
        """Check if user can access data source"""
        for policy in self.policies:
            if policy.matches(user, data_source, operation):
                return policy.allows()
        return False

    def apply_masking(self, data, user, data_source):
        """Apply data masking based on user permissions"""
        if not self.check_access(user, data_source, 'read'):
            return self.mask_sensitive_fields(data)
        return data
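The policy objects are left abstract above; a minimal stand-in, assuming users carry a groups attribute from your auth layer, might look like this:

class GroupPolicy:
    """Allow a set of operations on one source for one user group."""
    def __init__(self, group, data_source, operations):
        self.group = group
        self.data_source = data_source
        self.operations = operations

    def matches(self, user, data_source, operation):
        return (self.group in user.groups
                and self.data_source == data_source
                and operation in self.operations)

    def allows(self):
        return True

security = SecurityLayer([
    GroupPolicy('analysts', 'crm', ['read']),
    GroupPolicy('admins', 'crm', ['read', 'write']),
])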
Architecture Patterns
Pattern 1: Query Federation
Distribute queries across multiple sources and combine results.
Use when:
- Data stays in source systems
- Real-time access needed
- Sources have good query performance
Example:
-- Query spans multiple systems
SELECT
    c.name,
    o.total,
    s.ticket_count
FROM customers@crm c
JOIN orders@ecommerce o ON c.id = o.customer_id
JOIN support_stats@support s ON c.id = s.customer_id
WHERE c.created_at > '2025-01-01'
The fabric engine (decomposition sketched after this list):
- Queries CRM for customer data
- Queries e-commerce for orders
- Queries support system for ticket counts
- Joins results in memory
- Returns unified result
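Concretely, the planner might split that federated query into per-source queries like these; the exact plan depends on the engine, so treat this as an illustration only:

-- Sent to the CRM (note the filter pushed down to the source):
SELECT id, name FROM customers WHERE created_at > '2025-01-01'

-- Sent to the e-commerce system:
SELECT customer_id, total FROM orders

-- Sent to the support system:
SELECT customer_id, ticket_count FROM support_stats

-- The three result sets are then joined on customer id in the
-- fabric engine's memory.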
Pattern 2: Cached Views
Pre-compute common queries and cache results.
Use when:
- Same queries run frequently
- Source systems are slow
- Data doesn't change often
from datetime import datetime, timedelta

class CachedView:
    def __init__(self, query, ttl_minutes=60):
        self.query = query
        self.ttl = timedelta(minutes=ttl_minutes)
        self.cache = None
        self.last_refresh = None

    def get_data(self):
        if self.needs_refresh():
            self.refresh()
        return self.cache

    def needs_refresh(self):
        if self.cache is None:
            return True
        if self.last_refresh is None:
            return True
        return datetime.now() - self.last_refresh > self.ttl

    def refresh(self):
        # Execute query against the fabric engine
        self.cache = data_fabric.execute(self.query)
        self.last_refresh = datetime.now()
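Usage is then a one-liner per view; the query string here is illustrative:

# Refresh at most once an hour, no matter how often dashboards ask
daily_revenue = CachedView(
    "SELECT order_date, SUM(total) FROM orders@ecommerce GROUP BY order_date",
    ttl_minutes=60
)

data = daily_revenue.get_data()  # first call queries the fabric
data = daily_revenue.get_data()  # later calls within the TTL hit the cache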
Pattern 3: Materialized Aggregations
Pre-compute aggregations and store them.
Use when:
- Aggregations are expensive
- Same aggregations queried repeatedly
- Incremental updates possible
class MaterializedAggregation:
    def __init__(self, aggregation_query, update_schedule):
        # Stored under a distinct name so it doesn't shadow the query() method
        self.aggregation_query = aggregation_query
        self.schedule = update_schedule
        self.storage = AggregationStorage()

    def update(self):
        """Compute aggregation and store result"""
        result = data_fabric.execute(self.aggregation_query)
        self.storage.save(result)

    def query(self, filters):
        """Query pre-computed aggregation"""
        return self.storage.query(filters)
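In use, the update would typically be driven by a scheduler; the cron expression and query below are assumptions for illustration:

clv_by_segment = MaterializedAggregation(
    "SELECT segment, AVG(lifetime_value) FROM customers@crm GROUP BY segment",
    update_schedule="0 2 * * *"  # illustrative: recompute nightly at 02:00
)

clv_by_segment.update()  # invoked by your scheduler on the schedule above
enterprise = clv_by_segment.query({'segment': 'enterprise'})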
Implementation Strategy
Phase 1: Start with metadata
Before building connectors, catalog what you have; a starter sketch follows this checklist.
- Inventory all data sources
- Document schemas and relationships
- Identify common queries
- Map data lineage
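Even a plain dictionary gets Phase 1 moving before any tooling exists; the entries below are illustrative:

# A minimal source inventory: enough to seed the metadata catalog later
data_inventory = {
    'crm': {
        'type': 'postgresql',
        'owner': 'sales-ops',
        'key_tables': ['customers', 'contacts'],
        'joins_on': 'customer_id',
    },
    'ecommerce': {
        'type': 'mysql',
        'owner': 'platform',
        'key_tables': ['orders', 'order_items'],
        'joins_on': 'customer_id',
    },
}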
Phase 2: Build core connectors
Start with your most important systems.
- Choose 2-3 critical data sources
- Build connectors for each
- Test query performance
- Iterate based on feedback
Phase 3: Add virtualization layer
Enable cross-system queries.
- Build query planner
- Implement result merging
- Add query optimization
- Test with real queries
Phase 4: Expand and improve
Add more sources and improve performance.
- Add remaining data sources
- Implement caching
- Add materialized views
- Monitor and tune
Query Optimization
Cross-system queries can be slow. Optimize them.
Pushdown optimization:
Push filters and aggregations to source systems instead of processing in memory.
def optimize_query(query):
    """Push operations down to sources when possible"""
    # Original query:
    #   SELECT * FROM customers WHERE age > 30
    # Instead of:
    #   1. SELECT * FROM customers (get all data)
    #   2. Filter in memory WHERE age > 30
    # Do:
    #   1. SELECT * FROM customers WHERE age > 30 (filter at source)
    optimized = query.copy()

    # Push WHERE clauses to sources
    for source in query.sources:
        source_filters = extract_filters_for_source(query.filters, source)
        optimized.add_source_filter(source, source_filters)

    # Push aggregations when possible
    if can_push_aggregation(query):
        optimized.push_aggregation_to_source()

    return optimized
Parallel execution:
Query multiple sources simultaneously.
import asyncio
async def execute_parallel(execution_plan):
    """Execute independent queries in parallel"""
    tasks = []
    results = []
    for step in execution_plan:
        if step.can_run_parallel:
            tasks.append(asyncio.create_task(execute_step(step)))
        else:
            # Dependent steps run sequentially, in plan order
            results.append(await execute_step(step))
    results.extend(await asyncio.gather(*tasks))
    return results
Result caching:
Cache frequently accessed data.
import time
from hashlib import md5

class QueryCache:
    def __init__(self, ttl_seconds=300):
        self.cache = {}
        self.ttl = ttl_seconds

    def get(self, query):
        cache_key = self.generate_key(query)
        if cache_key in self.cache:
            result, timestamp = self.cache[cache_key]
            if time.time() - timestamp < self.ttl:
                return result
        return None

    def set(self, query, result):
        cache_key = self.generate_key(query)
        self.cache[cache_key] = (result, time.time())

    def generate_key(self, query):
        """Generate cache key from the query string"""
        query_string = str(query)
        return md5(query_string.encode()).hexdigest()
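In practice the cache wraps fabric execution, checking for a fresh entry before querying; data_fabric here is the engine instance sketched earlier:

cache = QueryCache(ttl_seconds=300)

def execute_with_cache(query):
    """Serve from cache when fresh; otherwise query the fabric and store."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = data_fabric.execute(query)
    cache.set(query, result)
    return result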
Data Governance in Fabric
Data fabric makes governance easier by centralizing it.
Access control:
class AccessControl:
    def __init__(self):
        self.policies = []

    def add_policy(self, user_group, data_source, permissions):
        """Define who can access what"""
        self.policies.append({
            'user_group': user_group,
            'data_source': data_source,
            'permissions': permissions  # ['read', 'write', 'delete']
        })

    def check_permission(self, user, data_source, operation):
        """Check if user has permission"""
        user_groups = self.get_user_groups(user)
        for policy in self.policies:
            if (policy['user_group'] in user_groups and
                    policy['data_source'] == data_source and
                    operation in policy['permissions']):
                return True
        return False
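Policies then read as plain data; the group names and enforcement point below are illustrative:

access = AccessControl()
access.add_policy('analysts', 'crm', ['read'])
access.add_policy('data_engineers', 'crm', ['read', 'write'])

# Deny by default: no matching policy means no access
def run_query(user, source, query):
    if not access.check_permission(user, source, 'read'):
        raise PermissionError(f"{user} cannot read {source}")
    return data_fabric.execute(query)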
Data lineage tracking:
Track where data comes from and how it flows.
from datetime import datetime

class LineageTracker:
    def __init__(self):
        self.lineage_graph = {}

    def track_query(self, query, sources, result):
        """Track data lineage for a query"""
        query_id = self.generate_query_id(query)
        self.lineage_graph[query_id] = {
            'query': query,
            'sources': sources,
            'result_location': result.location,
            'timestamp': datetime.now(),
            'transformations': self.extract_transformations(query)
        }

    def get_lineage(self, data_id):
        """Get full lineage: walk parent links back toward the sources"""
        lineage = []
        current = data_id
        while current:
            if current in self.lineage_graph:
                entry = self.lineage_graph[current]
                lineage.append(entry)
                current = entry.get('parent')
            else:
                break
        return lineage
Common Challenges
Performance:
Cross-system queries are slower than single-system queries. Solutions:
- Use caching aggressively
- Push operations to sources
- Materialize common queries
- Accept that some queries will be slower
Data consistency:
Data in different systems might be inconsistent. Solutions (a staleness check is sketched after this list):
- Document known inconsistencies
- Use timestamps to identify stale data
- Implement reconciliation processes
- Set expectations with users
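For the timestamp approach, a minimal staleness check might look like the sketch below; the last-updated mapping and the 24-hour threshold are assumptions, not a standard:

from datetime import datetime, timedelta

def find_stale_sources(last_updated_by_source, max_age_hours=24):
    """Flag sources whose last sync is older than the threshold.

    last_updated_by_source maps source_id -> last-sync datetime,
    e.g. pulled from the metadata catalog.
    """
    threshold = datetime.now() - timedelta(hours=max_age_hours)
    return [
        source_id
        for source_id, last_updated in last_updated_by_source.items()
        if last_updated < threshold
    ]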
Complexity:
Data fabric adds another layer to your stack. Solutions:
- Start simple, add complexity gradually
- Document everything
- Provide good tooling for users
- Monitor and alert on issues
Real-World Example: Multi-System Analytics
A retail company had data in:
- PostgreSQL (product catalog)
- MongoDB (customer profiles)
- Snowflake (sales transactions)
- Elasticsearch (website logs)
Challenge: Analysts needed to join data across systems for customer journey analysis. The existing process took days and required manual data exports.
Solution: Built a data fabric that:
- Connected all four systems
- Provided SQL interface for cross-system queries
- Cached common queries
- Tracked data lineage
Results:
- Query time reduced from days to minutes
- Eliminated manual data exports
- Improved data consistency
- Enabled real-time analytics
Tools and Technologies
Commercial solutions:
- Denodo: Data virtualization platform
- Informatica: Enterprise data fabric
- Talend: Data integration and fabric
- Starburst: Commercial distribution of the Trino query engine
Open source:
- Apache Drill: SQL query engine
- Presto/Trino: Distributed SQL engine
- Apache Calcite: Query planning framework
- DataHub: Metadata catalog
Build your own:
- Use existing query engines (Presto, Drill)
- Build connectors for your systems
- Add metadata catalog
- Implement security layer
Getting Started
Start small. Don't try to connect everything at once.
Week 1-2: Catalog your data
- List all data sources
- Document schemas
- Identify relationships
- Map common queries
Week 3-4: Build first connector
- Choose one important system
- Build connector
- Test query performance
- Get feedback
Week 5-6: Add virtualization
- Build query planner
- Enable cross-system queries
- Add caching
- Monitor performance
Week 7+: Expand
- Add more connectors
- Optimize queries
- Add governance features
- Scale based on usage
The Bottom Line
Data fabric architecture connects your siloed systems without requiring you to move everything. It provides unified access while keeping data where it belongs.
Start by cataloging what you have. Then build connectors for your most important systems. Enable cross-system queries. Expand gradually.
The companies seeing 60% faster integration didn't build perfect systems on day one. They started with one connection, made it work, then added more.
Remember: Data fabric is about making existing systems work together, not replacing them.