© 2026 N9ine Data Analytics. All rights reserved.

Data Engineering · 10 min read · November 8, 2025

Building RAG Systems with Airbyte: Complete Guide

Learn how to build RAG (Retrieval-Augmented Generation) systems using Airbyte to sync data into vector databases. Step-by-step guide with real examples.

You're building a RAG (Retrieval-Augmented Generation) system. You need to feed your vector database with data from multiple sources—documents, databases, APIs. You need it synced regularly, reliably, and at scale.

We've built RAG systems for dozens of companies. Here's how we use Airbyte to power the data pipeline that feeds these systems.

What is RAG and Why It Matters

RAG combines retrieval (searching a knowledge base) with generation (an LLM composing a response). Instead of training a model on your data, you:

  1. Store your data in a vector database
  2. Convert queries to embeddings
  3. Search for relevant chunks
  4. Pass those chunks to an LLM as context
  5. Generate answers based on retrieved context
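The query-time path (steps 2-5; step 1 happens at ingestion) can be sketched as one function. Here `embed`, `search`, and `generate` are placeholder callables standing in for whatever embedding API, vector database, and LLM client you use, not a specific library:

```python
def answer_query(query, embed, search, generate, top_k=5):
    """Minimal RAG query path: embed the query, retrieve chunks, generate.

    `embed`, `search`, and `generate` are caller-supplied callables
    standing in for an embedding API, a vector-DB query, and an LLM call.
    """
    query_vec = embed(query)            # convert the query to an embedding
    chunks = search(query_vec, top_k)   # retrieve the most relevant chunks
    context = "\n\n".join(chunks)       # retrieved chunks become LLM context
    prompt = (
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)             # answer grounded in retrieved context
```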

Why RAG works:

  • No model training needed
  • Can update knowledge base without retraining
  • More accurate than a pure LLM (answers are grounded in your data)
  • Cost-effective (smaller models work well)

The challenge: Your data lives everywhere—PDFs, databases, APIs, SaaS tools. You need to sync all of it into your vector database, keep it updated, and handle scale.

That's where Airbyte comes in.

The RAG Data Pipeline Architecture

Here's how we structure RAG systems with Airbyte:

Data Flow:

  1. Sources: Documents, databases, APIs, SaaS tools
  2. Airbyte: Syncs data to staging area (warehouse or storage)
  3. Processing: Chunk documents, generate embeddings
  4. Vector Database: Store embeddings (Pinecone, Weaviate, etc.)
  5. RAG Application: Query vector DB, retrieve context, generate answers

Why Airbyte fits:

  • Syncs from 600+ sources
  • Handles unstructured data (PDFs, documents)
  • Real-time or batch sync options
  • Scales to millions of documents
  • Open-source (you control the pipeline)

Setting Up Your Data Sources

RAG systems need diverse data. Here's what we typically sync:

1. Document Sources

File Storage (S3, GCS, Azure):

  • PDFs, Word docs, markdown files
  • Airbyte syncs file metadata and content
  • Can trigger on new file uploads

Example setup:

  • Source: S3 bucket with documents
  • Destination: Data warehouse (Snowflake, BigQuery)
  • Schedule: Hourly sync for new documents

What you get:

  • File paths, metadata, content
  • Ready for chunking and embedding

2. Database Sources

PostgreSQL, MySQL, MongoDB:

  • Product documentation
  • Knowledge base articles
  • Support tickets
  • User-generated content

Example setup:

  • Source: PostgreSQL knowledge base
  • Destination: Data warehouse
  • Sync mode: Incremental (only new/updated rows)
  • Schedule: Every 15 minutes

What you get:

  • Structured content ready for embedding
  • Change tracking (only process new data)

3. SaaS Application Sources

Confluence, Notion, Slack:

  • Internal documentation
  • Team knowledge
  • Customer conversations

Example setup:

  • Source: Confluence API
  • Destination: Data warehouse
  • Sync mode: Incremental
  • Schedule: Daily

What you get:

  • Unified knowledge base from multiple tools
  • Automatic updates as content changes

4. API Sources

Custom APIs, REST endpoints:

  • Proprietary systems
  • Third-party data
  • Real-time updates

Example setup:

  • Source: Custom REST API connector (built with Airbyte CDK)
  • Destination: Data warehouse
  • Schedule: Real-time or frequent batches

What you get:

  • Any data source becomes available
  • Custom connectors built in hours

Processing Pipeline: From Raw Data to Embeddings

Once Airbyte syncs data to your warehouse, you need to process it:

Step 1: Extract and Chunk

Chunking strategy:

  • Fixed size: 500-1000 tokens per chunk
  • Semantic: Split on paragraph/section boundaries
  • Overlap: 50-100 tokens between chunks (preserves context)

Tools we use:

  • LangChain (Python)
  • LlamaIndex (Python)
  • Custom chunking logic

Example chunking:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # size in characters (len is the length function)
    chunk_overlap=200,     # overlap preserves context across boundaries
    length_function=len,
)

chunks = splitter.split_text(document_content)

Step 2: Generate Embeddings

Embedding models:

  • OpenAI: text-embedding-3-large (best quality)
  • Cohere: embed-english-v3.0 (good balance)
  • Open-source: sentence-transformers (cost-effective)

Batch processing:

  • Process chunks in batches (100-1000 at a time)
  • Use async/parallel processing
  • Cache embeddings (don't regenerate unchanged chunks)

Example embedding:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_embeddings(chunks):
    # One API call can embed a whole batch of chunks
    response = client.embeddings.create(
        input=chunks,
        model="text-embedding-3-large",
    )
    return [item.embedding for item in response.data]

Step 3: Load into Vector Database

Vector databases we use:

  • Pinecone: Managed, easy to use
  • Weaviate: Open-source, flexible
  • Qdrant: Self-hosted option
  • Chroma: Lightweight, local

Example loading:

from pinecone import Pinecone

pc = Pinecone(api_key="your-key")
index = pc.Index("rag-index")

# Upsert embeddings with metadata
vectors = [
    {
        "id": chunk_id,
        "values": embedding,
        "metadata": {
            "source": document_id,
            "chunk_index": i,
            "text": chunk_text
        }
    }
    for i, (chunk_id, embedding, chunk_text) in enumerate(zip(chunk_ids, embeddings, chunks))
]

index.upsert(vectors=vectors)

Complete RAG Pipeline with Airbyte

Here's a production-ready setup we've used:

Architecture

Components:

  1. Airbyte: Syncs data from sources → warehouse
  2. dbt: Transforms raw data → clean format
  3. Processing job: Chunks + embeddings (Airflow/Dagster)
  4. Vector DB: Stores embeddings (Pinecone/Weaviate)
  5. RAG API: Serves queries (FastAPI/Flask)

Implementation Steps

1. Set up Airbyte connectors:

  • S3 → Snowflake (documents)
  • PostgreSQL → Snowflake (knowledge base)
  • Confluence → Snowflake (docs)
  • Custom API → Snowflake (proprietary data)

2. Create dbt models:

  • Clean and normalize data
  • Prepare for chunking
  • Track changes (only process new/updated)

3. Build processing pipeline:

  • Triggered by new data in warehouse
  • Chunks documents
  • Generates embeddings
  • Upserts to vector DB

4. Build RAG application:

  • Query interface
  • Embedding search
  • LLM generation
  • Response formatting

Real-World Example

Company: B2B SaaS with 50K documents

Setup:

  • 3 S3 buckets (product docs, support articles, marketing content)
  • PostgreSQL knowledge base
  • Confluence workspace
  • Custom API for product updates

Airbyte configuration:

  • 4 connectors syncing to Snowflake
  • Hourly incremental syncs
  • ~500K new documents/month

Processing:

  • dbt models clean and prepare data
  • Airflow DAG runs every hour
  • Processes new documents only
  • Generates embeddings (OpenAI)
  • Upserts to Pinecone

Results:

  • 2M chunks in vector DB
  • Query latency: <200ms
  • 95% accuracy on internal queries
  • Cost: $2K/month (Airbyte + processing)

Best Practices

1. Incremental Processing

Don't reprocess everything:

  • Track processed documents
  • Only chunk/embed new or updated content
  • Use change detection (Airbyte provides this)

Implementation:

  • Store document hashes in warehouse
  • Compare before processing
  • Skip unchanged documents
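A minimal sketch of that hash comparison, assuming document content is available as strings and `seen_hashes` is loaded from (and persisted back to) your warehouse:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_documents(docs, seen_hashes):
    """Return IDs of new or updated documents and refresh the hash store.

    `docs` maps document ID -> current content; `seen_hashes` maps
    document ID -> hash recorded after the previous run.
    """
    changed = []
    for doc_id, text in docs.items():
        h = content_hash(text)
        if seen_hashes.get(doc_id) != h:   # new document or content changed
            changed.append(doc_id)
            seen_hashes[doc_id] = h        # record for the next run
    return changed
```

Only the returned IDs go on to chunking and embedding; everything else is skipped.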

2. Metadata Strategy

Store rich metadata:

  • Source document ID
  • Chunk position
  • Timestamp
  • Source type
  • User permissions (if applicable)

Why it matters:

  • Filter search results
  • Trace back to source
  • Handle permissions
  • Debug issues
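With that metadata stored, filtered retrieval becomes a query-time concern. A small helper like the hypothetical `build_filter` below assembles a Pinecone-style metadata filter (Pinecone uses MongoDB-like operators such as `$eq` and `$gte`); the field names are illustrative:

```python
def build_filter(source_type=None, updated_after=None):
    """Assemble a metadata filter dict for vector search.

    Both arguments are optional; an empty dict means no filtering.
    `updated_after` is assumed to be a numeric timestamp in metadata.
    """
    clauses = {}
    if source_type is not None:
        clauses["source_type"] = {"$eq": source_type}
    if updated_after is not None:
        clauses["updated_at"] = {"$gte": updated_after}
    return clauses

# Usage with a Pinecone index (hypothetical `index` and `query_vec`):
# index.query(vector=query_vec, top_k=10,
#             filter=build_filter(source_type="confluence"),
#             include_metadata=True)
```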

3. Chunking Strategy

Match chunking to content:

  • Technical docs: Larger chunks (1000+ tokens)
  • Conversational: Smaller chunks (500 tokens)
  • Code: Split by function/class
  • Tables: Keep as single chunks

Test different strategies:

  • Measure retrieval quality
  • A/B test chunk sizes
  • Tune for your use case
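One way to encode the guidance above is a lookup with a fallback default. The numbers here are illustrative starting points, not tuned values; measure retrieval quality on your own content before settling on them:

```python
# Illustrative starting points per content type (sizes in tokens).
CHUNK_PARAMS = {
    "technical_doc": {"chunk_size": 1200, "chunk_overlap": 200},
    "conversation":  {"chunk_size": 500,  "chunk_overlap": 50},
    "code":          {"chunk_size": 800,  "chunk_overlap": 0},  # split by function/class first
}

DEFAULT_PARAMS = {"chunk_size": 1000, "chunk_overlap": 100}

def chunk_params(content_type):
    """Pick chunking parameters for a content type, with a safe default."""
    return CHUNK_PARAMS.get(content_type, DEFAULT_PARAMS)
```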

4. Embedding Quality

Choose the right model:

  • Quality: OpenAI text-embedding-3-large
  • Cost: sentence-transformers
  • Speed: Cohere embed-english-v3.0

Fine-tuning:

  • Consider fine-tuning for domain-specific content
  • Can improve retrieval by 10-20%

5. Vector Database Selection

Choose based on needs:

  • Pinecone: Easiest, managed, good performance
  • Weaviate: Open-source, flexible, self-hosted
  • Qdrant: Self-hosted, good performance
  • Chroma: Lightweight, local development

Considerations:

  • Scale (millions of vectors?)
  • Latency requirements
  • Budget (managed vs self-hosted)
  • Multi-tenancy needs

Common Challenges and Solutions

Challenge 1: Large Document Sets

Problem: Processing millions of documents is slow and expensive.

Solutions:

  • Incremental processing (only new/updated)
  • Parallel processing (multiple workers)
  • Batch embeddings (process in groups)
  • Cache aggressively (don't re-embed unchanged)

Example:

  • 1M documents: Process in batches of 10K
  • Use 10 workers in parallel
  • Cache embeddings (save 80% compute)
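The batch-plus-workers pattern can be sketched with only the standard library; `embed_batch` is a caller-supplied stand-in for one embedding API call over a list of chunks:

```python
from concurrent.futures import ThreadPoolExecutor

def batched(items, size):
    """Split a list into successive fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_all(chunks, embed_batch, batch_size=100, workers=10):
    """Embed chunks in parallel batches, preserving input order.

    `embed_batch` is a caller-supplied callable taking a list of chunks
    and returning a list of vectors (i.e. one embedding API request).
    """
    batches = batched(chunks, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)  # order matches `batches`
    return [vec for batch in results for vec in batch]
```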

Challenge 2: Real-Time Updates

Problem: Need vector DB updated within minutes of source changes.

Solutions:

  • Frequent Airbyte syncs (every 5-15 minutes)
  • Stream processing (Kafka + Airbyte)
  • Event-driven architecture (webhooks trigger processing)

Example:

  • Airbyte syncs every 10 minutes
  • Processing job triggered on new data
  • Vector DB updated within 15 minutes

Challenge 3: Multi-Source Coordination

Problem: Data from 10+ sources, need unified knowledge base.

Solutions:

  • Airbyte syncs all to single warehouse
  • dbt models unify schemas
  • Single processing pipeline handles all sources
  • Metadata tracks source for filtering

Example:

  • 12 Airbyte connectors → Snowflake
  • dbt models create unified schema
  • One processing pipeline
  • Source metadata in vector DB

Challenge 4: Cost Management

Problem: Embedding generation is expensive at scale.

Solutions:

  • Use cheaper models where quality allows
  • Cache embeddings (don't regenerate)
  • Incremental processing (only new data)
  • Batch processing (better API rates)

Example:

  • OpenAI for high-value content
  • sentence-transformers for bulk content
  • Cache saves 70% of embedding costs
  • Incremental processing saves 90% compute

Performance Tips

Sync Performance

Getting better performance from Airbyte:

  • Use incremental syncs (not full refresh)
  • Run connectors in parallel (sync multiple sources)
  • Balance sync schedules (freshness vs cost)

Example:

  • 10 connectors syncing in parallel
  • Incremental syncs (10x faster than full)
  • Hourly schedule (good balance)

Processing Performance

Chunking and embedding:

  • Parallel processing (multiple workers)
  • Batch API calls (embedding APIs)
  • Async processing (don't block on I/O)

Example:

  • 20 workers processing chunks
  • Batch 100 chunks per API call
  • 10x faster than sequential

Query Performance

Vector search:

  • Better indexes (HNSW, IVF)
  • Approximate search (faster, slightly less accurate)
  • Caching frequent queries

Example:

  • HNSW index in Pinecone
  • Top-10 search (not exhaustive)
  • Query cache for common questions
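For the query cache, an in-process LRU around the expensive path (embed the query, hit the vector DB) is often enough as a first step; the `search` callable here is a placeholder for that path:

```python
from functools import lru_cache

def make_cached_search(search, maxsize=1024):
    """Wrap a query -> results function with an in-process LRU cache.

    Results are frozen into a tuple so cached values are immutable.
    Note: entries go stale as the index updates; keep `maxsize` modest
    or add TTL logic if freshness matters.
    """
    @lru_cache(maxsize=maxsize)
    def cached(query):
        return tuple(search(query))
    return cached
```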

Monitoring and Maintenance

What to Monitor

Airbyte syncs:

  • Success rate
  • Sync duration
  • Data volume
  • Error rates

Processing pipeline:

  • Documents processed
  • Embeddings generated
  • Vector DB updates
  • Processing time

RAG application:

  • Query latency
  • Retrieval quality
  • User satisfaction
  • Error rates

Alerting

Set up alerts for:

  • Failed Airbyte syncs
  • Processing pipeline failures
  • High query latency
  • Vector DB errors

Maintenance Tasks

Regular maintenance:

  • Review and update chunking strategy
  • Tune embedding model selection
  • Clean up old/irrelevant documents
  • Monitor costs and cut waste

Cost Breakdown

Typical RAG system costs:

Airbyte:

  • Self-hosted: Infrastructure (~$500/month)
  • Pro: Capacity-based pricing (contact sales for exact rates)

Processing:

  • Embedding generation: $0.10-0.50 per 1K documents (OpenAI)
  • Compute: $200-500/month (batch jobs)

Vector Database:

  • Pinecone: $70-500/month (based on vectors)
  • Weaviate self-hosted: Infrastructure costs

Total:

  • Small system (100K documents): $500-1,000/month
  • Medium system (1M documents): $2,000-5,000/month
  • Large system (10M documents): $10,000-20,000/month

Next Steps

  1. Identify your data sources: What needs to be in your RAG system?
  2. Set up Airbyte connectors: Sync sources to your warehouse
  3. Design chunking strategy: How will you split documents?
  4. Choose embedding model: Balance quality and cost
  5. Select vector database: Based on scale and requirements
  6. Build processing pipeline: Chunk → Embed → Load
  7. Create RAG application: Query interface and generation
  8. Test and iterate: Measure quality, improve

Conclusion

Airbyte is the foundation of production RAG systems. It handles the hard part—syncing data from everywhere into one place. Then you process it, embed it, and serve it.

We've built RAG systems processing millions of documents using this architecture. Airbyte's flexibility, scale, and open-source nature make it ideal for feeding vector databases.

The key is incremental processing, smart chunking, and the right embedding model. Get those right, and your RAG system will deliver accurate, up-to-date answers.

Start with a few sources, prove the concept, then scale. Airbyte makes it easy to add more sources as you grow.
