Building RAG Systems with Airbyte: Complete Guide
You're building a RAG (Retrieval-Augmented Generation) system. You need to feed your vector database with data from multiple sources—documents, databases, APIs. You need it synced regularly, reliably, and at scale.
We've built RAG systems for dozens of companies. Here's how we use Airbyte to power the data pipeline that feeds these systems.
What is RAG and Why It Matters
RAG combines retrieval (searching a knowledge base) with generation (LLM creating responses). Instead of training a model on your data, you:
- Store your data in a vector database
- Convert queries to embeddings
- Search for relevant chunks
- Pass those chunks to an LLM as context
- Generate answers based on retrieved context
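The loop above can be sketched end-to-end in a few lines. This is a minimal, self-contained illustration: the bag-of-words `embed` function stands in for a real embedding model, and the knowledge base, prompt format, and function names are all assumptions for the sketch, not a production implementation.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Tiny stand-in for a vector database
knowledge_base = [
    "Airbyte syncs data from sources into a warehouse.",
    "Pinecone stores embeddings for similarity search.",
]

def retrieve(query, top_k=1):
    # Convert the query to an embedding and rank chunks by similarity
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks):
    # Pass retrieved chunks to the LLM as grounding context
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = retrieve("what does Airbyte sync?")
print(build_prompt("what does Airbyte sync?", chunks))
```

In production, `embed` is an embedding API call and `knowledge_base` is a vector index, but the retrieve-then-generate shape stays the same.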
Why RAG works:
- No model training needed
- Can update knowledge base without retraining
- More accurate than an LLM alone (answers are grounded in your data)
- Cost-effective (smaller models work well)
The challenge: Your data lives everywhere—PDFs, databases, APIs, SaaS tools. You need to sync all of it into your vector database, keep it updated, and handle scale.
That's where Airbyte comes in.
The RAG Data Pipeline Architecture
Here's how we structure RAG systems with Airbyte:
Data Flow:
- Sources: Documents, databases, APIs, SaaS tools
- Airbyte: Syncs data to staging area (warehouse or storage)
- Processing: Chunk documents, generate embeddings
- Vector Database: Store embeddings (Pinecone, Weaviate, etc.)
- RAG Application: Query vector DB, retrieve context, generate answers
Why Airbyte fits:
- Syncs from 600+ sources
- Handles unstructured data (PDFs, documents)
- Real-time or batch sync options
- Scales to millions of documents
- Open-source (you control the pipeline)
Setting Up Your Data Sources
RAG systems need diverse data. Here's what we typically sync:
1. Document Sources
File Storage (S3, GCS, Azure):
- PDFs, Word docs, markdown files
- Airbyte syncs file metadata and content
- Can trigger on new file uploads
Example setup:
- Source: S3 bucket with documents
- Destination: Data warehouse (Snowflake, BigQuery)
- Schedule: Hourly sync for new documents
What you get:
- File paths, metadata, content
- Ready for chunking and embedding
2. Database Sources
PostgreSQL, MySQL, MongoDB:
- Product documentation
- Knowledge base articles
- Support tickets
- User-generated content
Example setup:
- Source: PostgreSQL knowledge base
- Destination: Data warehouse
- Sync mode: Incremental (only new/updated rows)
- Schedule: Every 15 minutes
What you get:
- Structured content ready for embedding
- Change tracking (only process new data)
3. SaaS Application Sources
Confluence, Notion, Slack:
- Internal documentation
- Team knowledge
- Customer conversations
Example setup:
- Source: Confluence API
- Destination: Data warehouse
- Sync mode: Incremental
- Schedule: Daily
What you get:
- Unified knowledge base from multiple tools
- Automatic updates as content changes
4. API Sources
Custom APIs, REST endpoints:
- Proprietary systems
- Third-party data
- Real-time updates
Example setup:
- Source: Custom REST API connector (built with Airbyte CDK)
- Destination: Data warehouse
- Schedule: Real-time or frequent batches
What you get:
- Any data source becomes available
- Custom connectors built in hours
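Once connectors are configured, syncs can also be triggered programmatically rather than on a schedule. A hedged sketch against Airbyte's OSS config API follows; the host, port, and endpoint path are assumptions based on a local open-source deployment, so check your deployment's API documentation before relying on them.

```python
import json
from urllib import request

# Assumption: a local Airbyte OSS deployment on the default port
AIRBYTE_URL = "http://localhost:8000/api/v1"

def sync_payload(connection_id):
    """Request body for a manual connection sync."""
    return {"connectionId": connection_id}

def trigger_sync(connection_id):
    # POST /connections/sync starts a sync job for one connection
    req = request.Request(
        f"{AIRBYTE_URL}/connections/sync",
        data=json.dumps(sync_payload(connection_id)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return request.urlopen(req)
```

This pattern is useful for event-driven pipelines: a webhook or orchestrator calls `trigger_sync` instead of waiting for the next scheduled run.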
Processing Pipeline: From Raw Data to Embeddings
Once Airbyte syncs data to your warehouse, you need to process it:
Step 1: Extract and Chunk
Chunking strategy:
- Fixed size: 500-1000 tokens per chunk
- Semantic: Split on paragraph/section boundaries
- Overlap: 50-100 tokens between chunks (preserves context)
Tools we use:
- LangChain (Python)
- LlamaIndex (Python)
- Custom chunking logic
Example chunking:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: with length_function=len these sizes are characters, not tokens
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = splitter.split_text(document_content)
```
Step 2: Generate Embeddings
Embedding models:
- OpenAI: text-embedding-3-large (best quality)
- Cohere: embed-english-v3.0 (good balance)
- Open-source: sentence-transformers (cost-effective)
Batch processing:
- Process chunks in batches (100-1000 at a time)
- Use async/parallel processing
- Cache embeddings (don't regenerate unchanged chunks)
Example embedding:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_embeddings(chunks):
    response = client.embeddings.create(
        input=chunks,
        model="text-embedding-3-large",
    )
    return [item.embedding for item in response.data]
```
Step 3: Load into Vector Database
Vector databases we use:
- Pinecone: Managed, easy to use
- Weaviate: Open-source, flexible
- Qdrant: Self-hosted option
- Chroma: Lightweight, local
Example loading:
```python
from pinecone import Pinecone

# The older pinecone.init(api_key=..., environment=...) client is deprecated
pc = Pinecone(api_key="your-key")
index = pc.Index("rag-index")

# Upsert embeddings with metadata
vectors = [
    {
        "id": chunk_id,
        "values": embedding,
        "metadata": {
            "source": document_id,
            "chunk_index": i,
            "text": chunk_text,
        },
    }
    for i, (chunk_id, embedding, chunk_text) in enumerate(
        zip(chunk_ids, embeddings, chunks)
    )
]
index.upsert(vectors=vectors)
```
Complete RAG Pipeline with Airbyte
Here's a production-ready setup we've used:
Architecture
Components:
- Airbyte: Syncs data from sources → warehouse
- dbt: Transforms raw data → clean format
- Processing job: Chunks + embeddings (Airflow/Dagster)
- Vector DB: Stores embeddings (Pinecone/Weaviate)
- RAG API: Serves queries (FastAPI/Flask)
Implementation Steps
1. Set up Airbyte connectors:
- S3 → Snowflake (documents)
- PostgreSQL → Snowflake (knowledge base)
- Confluence → Snowflake (docs)
- Custom API → Snowflake (proprietary data)
2. Create dbt models:
- Clean and normalize data
- Prepare for chunking
- Track changes (only process new/updated)
3. Build processing pipeline:
- Triggered by new data in warehouse
- Chunks documents
- Generates embeddings
- Upserts to vector DB
4. Build RAG application:
- Query interface
- Embedding search
- LLM generation
- Response formatting
Real-World Example
Company: B2B SaaS with 50K documents
Setup:
- 3 S3 buckets (product docs, support articles, marketing content)
- PostgreSQL knowledge base
- Confluence workspace
- Custom API for product updates
Airbyte configuration:
- 4 connectors syncing to Snowflake
- Hourly incremental syncs
- ~500K new documents/month
Processing:
- dbt models clean and prepare data
- Airflow DAG runs every hour
- Processes new documents only
- Generates embeddings (OpenAI)
- Upserts to Pinecone
Results:
- 2M chunks in vector DB
- Query latency: <200ms
- 95% accuracy on internal queries
- Cost: $2K/month (Airbyte + processing)
Best Practices
1. Incremental Processing
Don't reprocess everything:
- Track processed documents
- Only chunk/embed new or updated content
- Use change detection (Airbyte provides this)
Implementation:
- Store document hashes in warehouse
- Compare before processing
- Skip unchanged documents
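The hash-compare-skip flow can be sketched in a few lines. Here an in-memory dict stands in for the hash table you would keep in the warehouse; that substitution is an assumption for illustration.

```python
import hashlib

processed = {}  # doc_id -> content hash (a warehouse table in production)

def needs_processing(doc_id, content):
    h = hashlib.sha256(content.encode()).hexdigest()
    if processed.get(doc_id) == h:
        return False  # unchanged since last run: skip chunking/embedding
    processed[doc_id] = h
    return True
```

Run this check before chunking: only documents whose hash changed proceed to the embedding step.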
2. Metadata Strategy
Store rich metadata:
- Source document ID
- Chunk position
- Timestamp
- Source type
- User permissions (if applicable)
Why it matters:
- Filter search results
- Trace back to source
- Handle permissions
- Debug issues
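A minimal sketch of metadata-based filtering over search results: the field names (`source_type`, `allowed_roles`) and the sample records are assumptions for illustration, and in practice most vector databases let you push these filters into the query itself.

```python
# Hypothetical search results, each carrying the metadata stored at upsert time
results = [
    {"text": "Reset your password from the login page.",
     "metadata": {"source": "kb-42", "source_type": "support", "allowed_roles": ["user"]}},
    {"text": "Q3 pricing strategy draft.",
     "metadata": {"source": "doc-7", "source_type": "internal", "allowed_roles": ["staff"]}},
]

def filter_results(results, source_type=None, role=None):
    out = []
    for r in results:
        m = r["metadata"]
        if source_type and m["source_type"] != source_type:
            continue  # wrong source type
        if role and role not in m["allowed_roles"]:
            continue  # caller lacks permission for this chunk
        out.append(r)
    return out
```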
3. Chunking Strategy
Match chunking to content:
- Technical docs: Larger chunks (1000+ tokens)
- Conversational: Smaller chunks (500 tokens)
- Code: Split by function/class
- Tables: Keep as single chunks
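The mapping above can be expressed as a small lookup that feeds splitter parameters. The specific numbers and the fallback defaults are assumptions following the guidelines in this list; tune them per corpus.

```python
# Per-content-type chunking parameters (illustrative values)
CHUNK_PARAMS = {
    "technical": {"chunk_size": 1200, "chunk_overlap": 150},
    "conversational": {"chunk_size": 500, "chunk_overlap": 50},
}

def params_for(content_type):
    # Fall back to middle-of-the-road defaults for unknown types
    return CHUNK_PARAMS.get(content_type, {"chunk_size": 800, "chunk_overlap": 100})
```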
Test different strategies:
- Measure retrieval quality
- A/B test chunk sizes
- Tune for your use case
4. Embedding Quality
Choose the right model:
- Quality: OpenAI text-embedding-3-large
- Cost: sentence-transformers
- Speed: Cohere embed-english-v3.0
Fine-tuning:
- Consider fine-tuning for domain-specific content
- Can improve retrieval by 10-20%
5. Vector Database Selection
Choose based on needs:
- Pinecone: Easiest, managed, good performance
- Weaviate: Open-source, flexible, self-hosted
- Qdrant: Self-hosted, good performance
- Chroma: Lightweight, local development
Considerations:
- Scale (millions of vectors?)
- Latency requirements
- Budget (managed vs self-hosted)
- Multi-tenancy needs
Common Challenges and Solutions
Challenge 1: Large Document Sets
Problem: Processing millions of documents is slow and expensive.
Solutions:
- Incremental processing (only new/updated)
- Parallel processing (multiple workers)
- Batch embeddings (process in groups)
- Cache aggressively (don't re-embed unchanged)
Example:
- 1M documents: Process in batches of 10K
- Use 10 workers in parallel
- Cache embeddings (save 80% compute)
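The batch-plus-workers pattern can be sketched with the standard library. `process_batch` is a stand-in for the real chunk/embed/upsert work on one batch; the batch size and worker count mirror the numbers above but are tunables, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    """Stand-in for chunking, embedding, and upserting one batch."""
    return len(batch)  # report how many documents were handled

def process_all(documents, batch_size=10_000, workers=10):
    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
    # Fan batches out across a pool of workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_batch, batches))
```

Threads suit I/O-bound work like API calls; for CPU-heavy chunking, a process pool is the usual swap.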
Challenge 2: Real-Time Updates
Problem: Need vector DB updated within minutes of source changes.
Solutions:
- Frequent Airbyte syncs (every 5-15 minutes)
- Stream processing (Kafka + Airbyte)
- Event-driven architecture (webhooks trigger processing)
Example:
- Airbyte syncs every 10 minutes
- Processing job triggered on new data
- Vector DB updated within 15 minutes
Challenge 3: Multi-Source Coordination
Problem: Data from 10+ sources, need unified knowledge base.
Solutions:
- Airbyte syncs all to single warehouse
- dbt models unify schemas
- Single processing pipeline handles all sources
- Metadata tracks source for filtering
Example:
- 12 Airbyte connectors → Snowflake
- dbt models create unified schema
- One processing pipeline
- Source metadata in vector DB
Challenge 4: Cost Management
Problem: Embedding generation is expensive at scale.
Solutions:
- Use cheaper models where quality allows
- Cache embeddings (don't regenerate)
- Incremental processing (only new data)
- Batch processing (better API rates)
Example:
- OpenAI for high-value content
- sentence-transformers for bulk content
- Cache saves 70% of embedding costs
- Incremental processing saves 90% compute
Performance Tips
Sync Performance
Getting better performance from Airbyte:
- Use incremental syncs (not full refresh)
- Run connectors in parallel (sync multiple sources)
- Balance sync schedules (freshness vs cost)
Example:
- 10 connectors syncing in parallel
- Incremental syncs (10x faster than full)
- Hourly schedule (good balance)
Processing Performance
Chunking and embedding:
- Parallel processing (multiple workers)
- Batch API calls (embedding APIs)
- Async processing (don't block on I/O)
Example:
- 20 workers processing chunks
- Batch 100 chunks per API call
- 10x faster than sequential
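The async-plus-batching idea can be sketched with `asyncio`. The `embed_batch` coroutine here is a stand-in for a real async embedding API call; the point is that all batch calls are issued concurrently instead of one at a time.

```python
import asyncio

async def embed_batch(batch):
    """Stand-in for an async embedding API call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [[float(len(c))] for c in batch]

async def embed_all(chunks, batch_size=100):
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    # Issue every batch call concurrently rather than sequentially
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    return [v for batch in results for v in batch]

vectors = asyncio.run(embed_all(["a", "bb", "ccc"], batch_size=2))
```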
Query Performance
Vector search:
- Better indexes (HNSW, IVF)
- Approximate search (faster, slightly less accurate)
- Caching frequent queries
Example:
- HNSW index in Pinecone
- Top-10 search (not exhaustive)
- Query cache for common questions
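Caching frequent queries can be as simple as memoizing the answer path. This sketch uses `functools.lru_cache` and a call counter to show the effect; the `answer` function body is a placeholder for the real embed-search-generate pipeline.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def answer(query):
    # Placeholder for the expensive path: embed query, search vector DB, call LLM
    answer.calls += 1
    return f"answer for: {query}"

answer.calls = 0  # track how often the expensive path actually runs
```

Exact-string caching only helps with repeated identical questions; semantic caching (matching on query embeddings) is the heavier-weight alternative.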
Monitoring and Maintenance
What to Monitor
Airbyte syncs:
- Success rate
- Sync duration
- Data volume
- Error rates
Processing pipeline:
- Documents processed
- Embeddings generated
- Vector DB updates
- Processing time
RAG application:
- Query latency
- Retrieval quality
- User satisfaction
- Error rates
Alerting
Set up alerts for:
- Failed Airbyte syncs
- Processing pipeline failures
- High query latency
- Vector DB errors
Maintenance Tasks
Regular maintenance:
- Review and update chunking strategy
- Tune embedding model selection
- Clean up old/irrelevant documents
- Monitor costs and cut waste
Cost Breakdown
Typical RAG system costs:
Airbyte:
- Self-hosted: Infrastructure (~$500/month)
- Pro: Capacity-based pricing (contact sales for exact rates)
Processing:
- Embedding generation: $0.10-0.50 per 1K documents (OpenAI)
- Compute: $200-500/month (batch jobs)
Vector Database:
- Pinecone: $70-500/month (based on vectors)
- Weaviate self-hosted: Infrastructure costs
Total:
- Small system (100K documents): $500-1,000/month
- Medium system (1M documents): $2,000-5,000/month
- Large system (10M documents): $10,000-20,000/month
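The line items above can be combined into a rough monthly estimate. All the defaults below are midpoints of the ranges in this section and are assumptions for illustration, not quotes.

```python
def monthly_cost(new_docs_per_month,
                 embed_cost_per_1k=0.30,  # midpoint of the $0.10-0.50/1K range
                 compute=350,             # midpoint of the $200-500 batch-job range
                 vector_db=300,           # managed vector DB estimate
                 airbyte=500):            # self-hosted infrastructure estimate
    embedding = new_docs_per_month / 1000 * embed_cost_per_1k
    return embedding + compute + vector_db + airbyte
```

With the ~500K new documents/month from the example above, this lands around $1,300/month before query-time LLM costs, which this sketch deliberately leaves out.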
Next Steps
- Identify your data sources: What needs to be in your RAG system?
- Set up Airbyte connectors: Sync sources to your warehouse
- Design chunking strategy: How will you split documents?
- Choose embedding model: Balance quality and cost
- Select vector database: Based on scale and requirements
- Build processing pipeline: Chunk → Embed → Load
- Create RAG application: Query interface and generation
- Test and iterate: Measure quality, improve
Conclusion
Airbyte is the foundation of production RAG systems. It handles the hard part—syncing data from everywhere into one place. Then you process it, embed it, and serve it.
We've built RAG systems processing millions of documents using this architecture. Airbyte's flexibility, scale, and open-source nature make it ideal for feeding vector databases.
The key is incremental processing, smart chunking, and the right embedding model. Get those right, and your RAG system will deliver accurate, up-to-date answers.
Start with a few sources, prove the concept, then scale. Airbyte makes it easy to add more sources as you grow.