Building RAG Systems with Airbyte: Complete Guide
You're building a RAG (Retrieval-Augmented Generation) system. You need to feed your vector database with data from multiple sources—documents, databases, APIs. You need it synced regularly, reliably, and at scale.
We've built RAG systems for dozens of companies. Here's how we use Airbyte to power the data pipeline that feeds these systems.
What is RAG and Why It Matters
RAG combines retrieval (searching a knowledge base) with generation (LLM creating responses). Instead of training a model on your data, you:
- Store your data in a vector database
- Convert queries to embeddings
- Search for relevant chunks
- Pass those chunks to an LLM as context
- Generate answers based on retrieved context
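The loop above can be sketched end-to-end in a few lines. This is a minimal, self-contained illustration: the bag-of-words `embed` function stands in for a real embedding model, and the knowledge base, prompt format, and function names are all assumptions for the sketch, not a production implementation.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Tiny stand-in for a vector database
knowledge_base = [
    "Airbyte syncs data from sources into a warehouse.",
    "Pinecone stores embeddings for similarity search.",
]

def retrieve(query, top_k=1):
    # Convert the query to an embedding and rank chunks by similarity
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks):
    # Pass retrieved chunks to the LLM as grounding context
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = retrieve("what does Airbyte sync?")
print(build_prompt("what does Airbyte sync?", chunks))
```

In production, `embed` is an embedding API call and `knowledge_base` is a vector index, but the retrieve-then-generate shape stays the same.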
Why RAG works:
- No model training needed
- Can update knowledge base without retraining
- More accurate than an LLM alone (answers are grounded in your data)
- Cost-effective (smaller models work well)
The challenge: Your data lives everywhere—PDFs, databases, APIs, SaaS tools. You need to sync all of it into your vector database, keep it updated, and handle scale.
That's where Airbyte comes in.
The RAG Data Pipeline Architecture
Here's how we structure RAG systems with Airbyte:
Data Flow:
- Sources: Documents, databases, APIs, SaaS tools
- Airbyte: Syncs data to staging area (warehouse or storage)
- Processing: Chunk documents, generate embeddings
- Vector Database: Store embeddings (Pinecone, Weaviate, etc.)
- RAG Application: Query vector DB, retrieve context, generate answers
Why Airbyte fits:
- Syncs from 600+ sources
- Handles unstructured data (PDFs, documents)
- Real-time or batch sync options
- Scales to millions of documents
- Open-source (you control the pipeline)
Setting Up Your Data Sources
RAG systems need diverse data. Here's what we typically sync:
1. Document Sources
File Storage (S3, GCS, Azure):
- PDFs, Word docs, markdown files
- Airbyte syncs file metadata and content
- Can trigger on new file uploads
Example setup:
- Source: S3 bucket with documents
- Destination: Data warehouse (Snowflake, BigQuery)
- Schedule: Hourly sync for new documents
What you get:
- File paths, metadata, content
- Ready for chunking and embedding
2. Database Sources
PostgreSQL, MySQL, MongoDB:
- Product documentation
- Knowledge base articles
- Support tickets
- User-generated content
Example setup:
- Source: PostgreSQL knowledge base
- Destination: Data warehouse
- Sync mode: Incremental (only new/updated rows)
- Schedule: Every 15 minutes
What you get:
- Structured content ready for embedding
- Change tracking (only process new data)
3. SaaS Application Sources
Confluence, Notion, Slack:
- Internal documentation
- Team knowledge
- Customer conversations
Example setup:
- Source: Confluence API
- Destination: Data warehouse
- Sync mode: Incremental
- Schedule: Daily
What you get:
- Unified knowledge base from multiple tools
- Automatic updates as content changes
4. API Sources
Custom APIs, REST endpoints:
- Proprietary systems
- Third-party data
- Real-time updates
Example setup:
- Source: Custom REST API connector (built with Airbyte CDK)
- Destination: Data warehouse
- Schedule: Real-time or frequent batches
What you get:
- Any data source becomes available
- Custom connectors built in hours
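Once connectors are configured, syncs can also be triggered programmatically rather than on a schedule. A hedged sketch against Airbyte's OSS config API follows; the host, port, and endpoint path are assumptions based on a local open-source deployment, so check your deployment's API documentation before relying on them.

```python
import json
from urllib import request

# Assumption: a local Airbyte OSS deployment on the default port
AIRBYTE_URL = "http://localhost:8000/api/v1"

def sync_payload(connection_id):
    """Request body for a manual connection sync."""
    return {"connectionId": connection_id}

def trigger_sync(connection_id):
    # POST /connections/sync starts a sync job for one connection
    req = request.Request(
        f"{AIRBYTE_URL}/connections/sync",
        data=json.dumps(sync_payload(connection_id)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return request.urlopen(req)
```

This pattern is useful for event-driven pipelines: a webhook or orchestrator calls `trigger_sync` instead of waiting for the next scheduled run.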
Processing Pipeline: From Raw Data to Embeddings
Once Airbyte syncs data to your warehouse, you need to process it:
Step 1: Extract and Chunk
Chunking strategy:
- Fixed size: 500-1000 tokens per chunk
- Semantic: Split on paragraph/section boundaries
- Overlap: 50-100 tokens between chunks (preserves context)
Tools we use:
- LangChain (Python)
- LlamaIndex (Python)
- Custom chunking logic
Example chunking:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: with length_function=len these sizes are characters, not tokens
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = splitter.split_text(document_content)
```
Step 2: Generate Embeddings
Embedding models:
- OpenAI: text-embedding-3-large (best quality)
- Cohere: embed-english-v3.0 (good balance)
- Open-source: sentence-transformers (cost-effective)
Batch processing:
- Process chunks in batches (100-1000 at a time)
- Use async/parallel processing
- Cache embeddings (don't regenerate unchanged chunks)
Example embedding:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_embeddings(chunks):
    response = client.embeddings.create(
        input=chunks,
        model="text-embedding-3-large",
    )
    return [item.embedding for item in response.data]
```
Step 3: Load into Vector Database
Vector databases we use:
- Pinecone: Managed, easy to use
- Weaviate: Open-source, flexible
- Qdrant: Self-hosted option
- Chroma: Lightweight, local
Example loading:
```python
from pinecone import Pinecone

# The older pinecone.init(api_key=..., environment=...) client is deprecated
pc = Pinecone(api_key="your-key")
index = pc.Index("rag-index")

# Upsert embeddings with metadata
vectors = [
    {
        "id": chunk_id,
        "values": embedding,
        "metadata": {
            "source": document_id,
            "chunk_index": i,
            "text": chunk_text,
        },
    }
    for i, (chunk_id, embedding, chunk_text) in enumerate(
        zip(chunk_ids, embeddings, chunks)
    )
]
index.upsert(vectors=vectors)
```
Complete RAG Pipeline with Airbyte
Here's a production-ready setup we've used:
Architecture
Components:
- Airbyte: Syncs data from sources → warehouse
- dbt: Transforms raw data → clean format
- Processing job: Chunks + embeddings (Airflow/Dagster)
- Vector DB: Stores embeddings (Pinecone/Weaviate)
- RAG API: Serves queries (FastAPI/Flask)
Implementation Steps
1. Set up Airbyte connectors:
- S3 → Snowflake (documents)
- PostgreSQL → Snowflake (knowledge base)
- Confluence → Snowflake (docs)
- Custom API → Snowflake (proprietary data)
2. Create dbt models:
- Clean and normalize data
- Prepare for chunking
- Track changes (only process new/updated)
3. Build processing pipeline:
- Triggered by new data in warehouse
- Chunks documents
- Generates embeddings
- Upserts to vector DB
4. Build RAG application:
- Query interface
- Embedding search
- LLM generation
- Response formatting
Real-World Example
Company: B2B SaaS with 50K documents
Setup:
- 3 S3 buckets (product docs, support articles, marketing content)
- PostgreSQL knowledge base
- Confluence workspace
- Custom API for product updates
Airbyte configuration:
- 4 connectors syncing to Snowflake
- Hourly incremental syncs
- ~500K new documents/month
Processing:
- dbt models clean and prepare data
- Airflow DAG runs every hour
- Processes new documents only
- Generates embeddings (OpenAI)
- Upserts to Pinecone
Results:
- 2M chunks in vector DB
- Query latency: <200ms
- 95% accuracy on internal queries
- Cost: $2K/month (Airbyte + processing)
Best Practices
1. Incremental Processing
Don't reprocess everything:
- Track processed documents
- Only chunk/embed new or updated content
- Use change detection (Airbyte provides this)
Implementation:
- Store document hashes in warehouse
- Compare before processing
- Skip unchanged documents
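The hash-compare-skip flow can be sketched in a few lines. Here an in-memory dict stands in for the hash table you would keep in the warehouse; that substitution is an assumption for illustration.

```python
import hashlib

processed = {}  # doc_id -> content hash (a warehouse table in production)

def needs_processing(doc_id, content):
    h = hashlib.sha256(content.encode()).hexdigest()
    if processed.get(doc_id) == h:
        return False  # unchanged since last run: skip chunking/embedding
    processed[doc_id] = h
    return True
```

Run this check before chunking: only documents whose hash changed proceed to the embedding step.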
2. Metadata Strategy
Store rich metadata:
- Source document ID
- Chunk position
- Timestamp
- Source type
- User permissions (if applicable)
Why it matters:
- Filter search results
- Trace back to source
- Handle permissions
- Debug issues
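A minimal sketch of metadata-based filtering over search results: the field names (`source_type`, `allowed_roles`) and the sample records are assumptions for illustration, and in practice most vector databases let you push these filters into the query itself.

```python
# Hypothetical search results, each carrying the metadata stored at upsert time
results = [
    {"text": "Reset your password from the login page.",
     "metadata": {"source": "kb-42", "source_type": "support", "allowed_roles": ["user"]}},
    {"text": "Q3 pricing strategy draft.",
     "metadata": {"source": "doc-7", "source_type": "internal", "allowed_roles": ["staff"]}},
]

def filter_results(results, source_type=None, role=None):
    out = []
    for r in results:
        m = r["metadata"]
        if source_type and m["source_type"] != source_type:
            continue  # wrong source type
        if role and role not in m["allowed_roles"]:
            continue  # caller lacks permission for this chunk
        out.append(r)
    return out
```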
3. Chunking Strategy
Match chunking to content:
- Technical docs: Larger chunks (1000+ tokens)
- Conversational: Smaller chunks (500 tokens)
- Code: Split by function/class
- Tables: Keep as single chunks
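The mapping above can be expressed as a small lookup that feeds splitter parameters. The specific numbers and the fallback defaults are assumptions following the guidelines in this list; tune them per corpus.

```python
# Per-content-type chunking parameters (illustrative values)
CHUNK_PARAMS = {
    "technical": {"chunk_size": 1200, "chunk_overlap": 150},
    "conversational": {"chunk_size": 500, "chunk_overlap": 50},
}

def params_for(content_type):
    # Fall back to middle-of-the-road defaults for unknown types
    return CHUNK_PARAMS.get(content_type, {"chunk_size": 800, "chunk_overlap": 100})
```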
Test different strategies:
- Measure retrieval quality
- A/B test chunk sizes
- Tune for your use case
4. Embedding Quality
Choose the right model:
- Quality: OpenAI text-embedding-3-large
- Cost: sentence-transformers
- Speed: Cohere embed-english-v3.0
Fine-tuning:
- Consider fine-tuning for domain-specific content
- Can improve retrieval by 10-20%
5. Vector Database Selection
Choose based on needs:
- Pinecone: Easiest, managed, good performance
- Weaviate: Open-source, flexible, self-hosted
- Qdrant: Self-hosted, good performance
- Chroma: Lightweight, local development
Considerations:
- Scale (millions of vectors?)
- Latency requirements
- Budget (managed vs self-hosted)
- Multi-tenancy needs
Common Challenges and Solutions
Challenge 1: Large Document Sets
Problem: Processing millions of documents is slow and expensive.
Solutions:
- Incremental processing (only new/updated)
- Parallel processing (multiple workers)
- Batch embeddings (process in groups)
- Cache aggressively (don't re-embed unchanged)
Example:
- 1M documents: Process in batches of 10K
- Use 10 workers in parallel
- Cache embeddings (save 80% compute)
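The batch-plus-workers pattern can be sketched with the standard library. `process_batch` is a stand-in for the real chunk/embed/upsert work on one batch; the batch size and worker count mirror the numbers above but are tunables, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    """Stand-in for chunking, embedding, and upserting one batch."""
    return len(batch)  # report how many documents were handled

def process_all(documents, batch_size=10_000, workers=10):
    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
    # Fan batches out across a pool of workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_batch, batches))
```

Threads suit I/O-bound work like API calls; for CPU-heavy chunking, a process pool is the usual swap.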
Challenge 2: Real-Time Updates
Problem: Need vector DB updated within minutes of source changes.
Solutions:
- Frequent Airbyte syncs (every 5-15 minutes)
- Stream processing (Kafka + Airbyte)
- Event-driven architecture (webhooks trigger processing)
Example:
- Airbyte syncs every 10 minutes
- Processing job triggered on new data
- Vector DB updated within 15 minutes
Challenge 3: Multi-Source Coordination
Problem: Data from 10+ sources, need unified knowledge base.
Solutions:
- Airbyte syncs all to single warehouse
- dbt models unify schemas
- Single processing pipeline handles all sources
- Metadata tracks source for filtering
Example:
- 12 Airbyte connectors → Snowflake
- dbt models create unified schema
- One processing pipeline
- Source metadata in vector DB
Challenge 4: Cost Management
Problem: Embedding generation is expensive at scale.
Solutions:
- Use cheaper models where quality allows
- Cache embeddings (don't regenerate)
- Incremental processing (only new data)
- Batch processing (better API rates)
Example:
- OpenAI for high-value content
- sentence-transformers for bulk content
- Cache saves 70% of embedding costs
- Incremental processing saves 90% compute
Performance Tips
Sync Performance
Getting better performance from Airbyte:
- Use incremental syncs (not full refresh)
- Run connectors in parallel (sync multiple sources)
- Balance sync schedules (freshness vs cost)
Example:
- 10 connectors syncing in parallel
- Incremental syncs (10x faster than full)
- Hourly schedule (good balance)
Processing Performance
Chunking and embedding:
- Parallel processing (multiple workers)
- Batch API calls (embedding APIs)
- Async processing (don't block on I/O)
Example:
- 20 workers processing chunks
- Batch 100 chunks per API call
- 10x faster than sequential
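The async-plus-batching idea can be sketched with `asyncio`. The `embed_batch` coroutine here is a stand-in for a real async embedding API call; the point is that all batch calls are issued concurrently instead of one at a time.

```python
import asyncio

async def embed_batch(batch):
    """Stand-in for an async embedding API call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [[float(len(c))] for c in batch]

async def embed_all(chunks, batch_size=100):
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    # Issue every batch call concurrently rather than sequentially
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    return [v for batch in results for v in batch]

vectors = asyncio.run(embed_all(["a", "bb", "ccc"], batch_size=2))
```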
Query Performance
Vector search:
- Better indexes (HNSW, IVF)
- Approximate search (faster, slightly less accurate)
- Caching frequent queries
Example:
- HNSW index in Pinecone
- Top-10 search (not exhaustive)
- Query cache for common questions
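Caching frequent queries can be as simple as memoizing the answer path. This sketch uses `functools.lru_cache` and a call counter to show the effect; the `answer` function body is a placeholder for the real embed-search-generate pipeline.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def answer(query):
    # Placeholder for the expensive path: embed query, search vector DB, call LLM
    answer.calls += 1
    return f"answer for: {query}"

answer.calls = 0  # track how often the expensive path actually runs
```

Exact-string caching only helps with repeated identical questions; semantic caching (matching on query embeddings) is the heavier-weight alternative.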
Monitoring and Maintenance
What to Monitor
Airbyte syncs:
- Success rate
- Sync duration
- Data volume
- Error rates
Processing pipeline:
- Documents processed
- Embeddings generated
- Vector DB updates
- Processing time
RAG application:
- Query latency
- Retrieval quality
- User satisfaction
- Error rates
Alerting
Set up alerts for:
- Failed Airbyte syncs
- Processing pipeline failures
- High query latency
- Vector DB errors
Maintenance Tasks
Regular maintenance:
- Review and update chunking strategy
- Tune embedding model selection
- Clean up old/irrelevant documents
- Monitor costs and cut waste
Cost Breakdown
Typical RAG system costs:
Airbyte:
- Self-hosted: Infrastructure (~$500/month)
- Pro: Capacity-based pricing (contact sales for exact rates)
Processing:
- Embedding generation: $0.10-0.50 per 1K documents (OpenAI)
- Compute: $200-500/month (batch jobs)
Vector Database:
- Pinecone: $70-500/month (based on vectors)
- Weaviate self-hosted: Infrastructure costs
Total:
- Small system (100K documents): $500-1,000/month
- Medium system (1M documents): $2,000-5,000/month
- Large system (10M documents): $10,000-20,000/month
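The line items above can be combined into a rough monthly estimate. All the defaults below are midpoints of the ranges in this section and are assumptions for illustration, not quotes.

```python
def monthly_cost(new_docs_per_month,
                 embed_cost_per_1k=0.30,  # midpoint of the $0.10-0.50/1K range
                 compute=350,             # midpoint of the $200-500 batch-job range
                 vector_db=300,           # managed vector DB estimate
                 airbyte=500):            # self-hosted infrastructure estimate
    embedding = new_docs_per_month / 1000 * embed_cost_per_1k
    return embedding + compute + vector_db + airbyte
```

With the ~500K new documents/month from the example above, this lands around $1,300/month before query-time LLM costs, which this sketch deliberately leaves out.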
Next Steps
- Identify your data sources: What needs to be in your RAG system?
- Set up Airbyte connectors: Sync sources to your warehouse
- Design chunking strategy: How will you split documents?
- Choose embedding model: Balance quality and cost
- Select vector database: Based on scale and requirements
- Build processing pipeline: Chunk → Embed → Load
- Create RAG application: Query interface and generation
- Test and iterate: Measure quality, improve
Conclusion
Airbyte is the foundation of production RAG systems. It handles the hard part—syncing data from everywhere into one place. Then you process it, embed it, and serve it.
We've built RAG systems processing millions of documents using this architecture. Airbyte's flexibility, scale, and open-source nature make it ideal for feeding vector databases.
The key is incremental processing, smart chunking, and the right embedding model. Get those right, and your RAG system will deliver accurate, up-to-date answers.
Start with a few sources, prove the concept, then scale. Airbyte makes it easy to add more sources as you grow.