
Data Engineering · 17 min read · November 20, 2025

RAG Systems with Databricks: Best Practices Guide

Learn RAG systems on Databricks: cost-saving strategies, scalability, and secure deployment. Real-world examples from production deployments.


Introduction

Building RAG systems that work in production is harder than it looks. We've seen teams spend months getting retrieval right, only to hit scaling walls when their data grows. Others deploy systems that cost thousands per month because they didn't optimize cluster configurations.

As data volumes grow, Databricks works well as a scalable, cost-effective, and secure solution for RAG systems. It gives you managed infrastructure for vector search, built-in MLflow integration, and cost controls that actually work. We've deployed RAG systems on Databricks for companies processing millions of documents daily, and the results speak for themselves.

This guide covers what we've learned from those deployments. You'll get step-by-step setup instructions, cost optimization strategies that saved one client $50,000 monthly, and security practices that pass enterprise audits. We'll also share real mistakes we made so you can avoid them, demystifying the process from setup to advanced deployment with a focus on cost-efficiency and security.


Why Pair RAG Systems with Databricks?

Most teams start RAG projects on local machines or basic cloud VMs. That works for prototypes, but production needs are different. You need to handle millions of documents, serve queries in milliseconds, and keep costs predictable.

What RAG Systems Actually Do

RAG (Retrieval-Augmented Generation) systems answer questions by finding relevant information first, then generating answers based on what they found. Think of it like a research assistant that reads your documents before answering.

The retrieval part searches through your data using vector embeddings. The generation part uses a language model to create answers. Both steps need to be fast and accurate.
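
Here's the retrieval idea in miniature, a minimal sketch using sentence-transformers (the model choice and the printed scores are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed a query and two candidate passages
query_vec = model.encode("How do I reset my password?")
doc_vecs = model.encode([
    "To reset your password, open Settings and click 'Reset'.",
    "Our quarterly revenue grew 12% year over year.",
])

# Cosine similarity ranks the password doc far above the revenue one
print(util.cos_sim(query_vec, doc_vecs))  # e.g. tensor([[0.71, 0.03]])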

Traditional setups struggle with scale. Vector databases on single servers hit memory limits. Embedding models need GPUs that sit idle between queries. Databricks solves both problems.

Why Databricks Works for RAG

Databricks provides five things that make RAG systems production-ready:

1. Managed Vector Search

Databricks Vector Search handles millions of embeddings without manual sharding. It automatically scales as your data grows. One client we worked with started with 100,000 documents and scaled to 10 million without changing their code.

2. Cost Control

You pay for compute only when clusters run. Auto-termination stops idle clusters automatically. We set up one system that runs queries on-demand, costing $200 monthly instead of $3,000 for always-on infrastructure.

3. Built-in MLflow Integration

Track experiments, compare model versions, and deploy updates without rebuilding pipelines. This matters when you're tuning retrieval parameters or testing new embedding models.

4. Knowledge Graph RAG Support

Databricks works well for knowledge graph RAG systems, which combine structured relationships with semantic search. This approach is particularly powerful for domains with complex entity relationships, such as legal documents, medical records, or financial data. Knowledge graph RAG systems on Databricks can use Delta tables for graph storage and vector search for semantic retrieval.
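
A rough sketch of that pattern, assuming a hypothetical edge table you maintain yourself (the schema is ours, not a Databricks feature):

from pyspark.sql.functions import col

# Hypothetical knowledge graph edges: (src_id, relation, dst_id)
edges = spark.read.format("delta").load("/delta/kg_edges")

def expand_with_neighbors(hit_ids):
    # Pull entities directly connected to the retrieved documents
    return (
        edges.filter(col("src_id").isin(hit_ids))
             .select("dst_id", "relation")
             .limit(50)
             .collect()
    )

# hit_ids would come from vector search over the same corpus
related = expand_with_neighbors(["doc_17", "doc_42"])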

5. LLM Integration

Integrating large language models (LLMs) for better performance is straightforward on Databricks. You can deploy models from Hugging Face, OpenAI, or custom fine-tuned models, all while maintaining cost control and scalability. The platform's GPU clusters make it ideal for running inference at scale.
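
As a minimal example, a Hugging Face model runs on a GPU cluster with nothing more than the transformers pipeline (the model choice is illustrative):

from transformers import pipeline

# device=0 targets the cluster's first GPU; omit it to run on CPU
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    device=0,
)

result = generator(
    "Summarize: Databricks pairs managed vector search with scalable compute.",
    max_new_tokens=64,
)
print(result[0]["generated_text"])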

When Databricks Makes Sense

Use Databricks if you have:

  • More than 100,000 documents to search
  • Need for sub-second query responses
  • Multiple team members working on the system
  • Compliance requirements (HIPAA, SOC 2, etc.)

Skip Databricks if you're:

  • Building a prototype with under 10,000 documents
  • Running everything on a single machine
  • Working with a fixed budget under $500 monthly

Efficient RAG System Deployment on Databricks

Setting up RAG on Databricks takes about two hours if you follow these steps. We'll walk through workspace setup, cluster configuration, and your first deployment.

Step 1: Workspace and Cluster Setup

Create a Databricks workspace if you don't have one. Choose a region close to your data sources to reduce latency.

For RAG workloads, use a cluster with these specs:

# Recommended cluster configuration
{
  "cluster_name": "rag-production",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",  # 30.5 GB RAM, 4 cores
  "num_workers": 2,
  "autotermination_minutes": 30,
  "custom_tags": {
    "Purpose": "RAG-System"
  }
}

Start with 2-4 worker nodes. You can scale up later based on query volume. Enable auto-termination to save costs when the cluster isn't in use.
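
If query volume is spiky, you can also let the cluster autoscale instead of fixing the worker count (the bounds below are illustrative):

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}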

Step 2: Install Required Libraries

Create a cluster-scoped init script to install dependencies:

#!/bin/bash
/databricks/python/bin/pip install transformers==4.35.0
/databricks/python/bin/pip install sentence-transformers==2.2.2
/databricks/python/bin/pip install faiss-cpu==1.7.4
/databricks/python/bin/pip install langchain==0.1.0

Save this as dbfs:/databricks/scripts/rag-init.sh and reference it in your cluster configuration.
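
The cluster configuration references the script through the standard init_scripts field:

{
  "init_scripts": [
    {"dbfs": {"destination": "dbfs:/databricks/scripts/rag-init.sh"}}
  ]
}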

Step 3: Prepare Your Data

Your documents need to be in a format RAG can search. Here's how to process them:

from sentence_transformers import SentenceTransformer
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("RAG-Preparation").getOrCreate()

# Load your documents
documents_df = spark.read.parquet("/path/to/your/documents")

# Initialize embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Wrap the encoder in a UDF so Spark can apply it row by row
# (a pandas UDF batches better at large scale)
@udf(returnType=ArrayType(FloatType()))
def create_embeddings(text):
    return embedding_model.encode(text).tolist()

# Create embeddings for each document
documents_with_embeddings = documents_df.withColumn(
    "embedding",
    create_embeddings(col("text"))
)

# Save to Delta table for fast retrieval
documents_with_embeddings.write.format("delta").mode("overwrite").save("/delta/rag_documents")

This creates a Delta table with your documents and their vector embeddings. Delta format gives you fast reads and ACID transactions.
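
As the table grows, periodic compaction keeps reads fast. One option, assuming your documents carry a category field (used for filtering later in this guide):

# Compact small files and co-locate rows that are filtered together
spark.sql("OPTIMIZE delta.`/delta/rag_documents` ZORDER BY (category)")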

Step 4: Deploy Your First RAG System

Here's how to create the retrieval and generation pipeline:

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import numpy as np

class DatabricksRAGSystem:
    def __init__(self, documents_path="/delta/rag_documents"):
        # Must match the model used when the embeddings were created
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        # Any seq2seq model works for generation; flan-t5-base is a small default
        self.tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
        self.model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
        self.documents_df = spark.read.format("delta").load(documents_path)

    def search_documents(self, query, top_k=5):
        # Convert query to embedding
        query_embedding = self.embedding_model.encode(query)

        # Find similar documents using cosine similarity
        results = self.documents_df.select(
            "id", "text", "embedding"
        ).rdd.map(lambda row: {
            "id": row.id,
            "text": row.text,
            "similarity": float(
                np.dot(query_embedding, row.embedding) /
                (np.linalg.norm(query_embedding) * np.linalg.norm(row.embedding))
            )
        }).takeOrdered(top_k, key=lambda x: -x["similarity"])

        return results

    def generate_answer(self, query):
        # Retrieve relevant documents
        context_docs = self.search_documents(query)
        context = " ".join(doc["text"] for doc in context_docs)

        # Generate an answer conditioned on the retrieved context
        prompt = f"Answer the question using the context.\n\nContext: {context}\n\nQuestion: {query}"
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
        generated = self.model.generate(**inputs, max_new_tokens=128)
        answer = self.tokenizer.decode(generated[0], skip_special_tokens=True)

        return answer, context_docs

# Initialize system
rag_system = DatabricksRAGSystem()

# Query example
answer, sources = rag_system.generate_answer("What are the main benefits of using Databricks?")
print(f"Answer: {answer}")
print(f"Sources: {[s['id'] for s in sources]}")

This gives you a working RAG system. The search_documents method finds relevant context, and generate_answer creates responses based on that context. The generator here is a small open-source seq2seq model; swap in a larger model or a serving endpoint as quality demands grow. For detailed information on RAG models, refer to Hugging Face's documentation.

Step 5: Create a REST API Endpoint

Expose your RAG system as an API using Databricks Jobs:

from flask import Flask, request, jsonify

app = Flask(__name__)
rag_system = DatabricksRAGSystem()

@app.route('/query', methods=['POST'])
def query():
    data = request.json
    query_text = data.get('query')
    
    if not query_text:
        return jsonify({'error': 'Query is required'}), 400
    
    answer, sources = rag_system.generate_answer(query_text)
    
    return jsonify({
        'answer': answer,
        'sources': [{'id': s['id'], 'text': s['text'][:200]} for s in sources]
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Deploy this as a Databricks Job that runs continuously. Set up load balancing if you expect high traffic.

Common Deployment Issues and Fixes

Here are the most common problems and solutions:

Problem: Slow query responses

Queries taking more than 2 seconds usually mean your vector search isn't optimized. Fix this by:

  • Using Databricks Vector Search instead of manual cosine similarity
  • Caching frequently accessed embeddings
  • Reducing the number of documents searched (use filters)

Problem: Out of memory errors

Large embedding models need more RAM. Solutions:

  • Use smaller models like 'all-MiniLM-L6-v2' instead of larger ones
  • Increase cluster node size to i3.2xlarge or larger
  • Process documents in batches instead of loading everything

Problem: High costs

Idle clusters cost money. Set auto-termination to 10-15 minutes. Use job clusters instead of all-purpose clusters for scheduled workloads.
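
For scheduled workloads, the job definition carries its own new_cluster block, so compute exists only for the duration of the run (a minimal sketch):

{
  "tasks": [{
    "task_key": "nightly_embeddings",
    "new_cluster": {
      "spark_version": "13.3.x-scala2.12",
      "node_type_id": "i3.xlarge",
      "num_workers": 2
    },
    "spark_python_task": {"python_file": "dbfs:/scripts/update_embeddings.py"}
  }]
}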


Optimizing RAG System Performance on Databricks

Performance optimization happens in three areas: retrieval speed, generation quality, and cost efficiency. Here's what works in production.

Improving Retrieval Speed

Fast retrieval means users get answers quickly. These strategies cut query times by 70-80%:

1. Use Databricks Vector Search

Manual vector similarity calculations are slow at scale. Databricks Vector Search uses optimized indexes:

from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Create a Delta Sync index (assumes a vector search endpoint named
# "rag-endpoint" and a Unity Catalog source table already exist)
index = client.create_delta_sync_index(
    endpoint_name="rag-endpoint",
    index_name="main.default.rag_documents_index",
    source_table_name="main.default.rag_documents",
    primary_key="id",
    pipeline_type="TRIGGERED",
    embedding_source_column="text",
    embedding_model_endpoint_name="embedding-endpoint"
)

# Query the index
results = index.similarity_search(
    query_text="What is RAG?",
    columns=["id", "text"],
    num_results=5
)

This reduces query time from 500ms to under 50ms for most searches.

2. Cache Common Queries

Many users ask similar questions. Cache results in-process with an LRU cache (if you need time-based expiry, say 1-2 hours, use an external cache such as Redis):

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_query(query, top_k):
    # Retrieval logic here; results are memoized per (query, top_k)
    return rag_system.search_documents(query, top_k)

def query_with_cache(query, top_k=5):
    # Normalize the query so trivial variations share a cache entry
    return cached_query(query.strip().lower(), top_k)

3. Pre-filter Documents

Don't search everything. Use metadata filters:

# Instead of searching all documents
all_docs = spark.read.format("delta").load("/delta/rag_documents")

# Filter by category first
filtered_docs = all_docs.filter(col("category") == "technical_docs")
# Then search within filtered set

This works when you have document categories, dates, or other metadata.

Improving Answer Quality

Better retrieval leads to better answers. These techniques improve accuracy:

1. Hybrid Search

Combine vector search with keyword search:

def hybrid_search(query, top_k=5):
    # vector_search, keyword_search, and rerank are placeholders for
    # your own implementations; merge_results is sketched below
    vector_results = vector_search(query, top_k=top_k * 2)    # semantic search
    keyword_results = keyword_search(query, top_k=top_k * 2)  # BM25 / exact match
    
    # Combine and re-rank
    combined = merge_results(vector_results, keyword_results)
    return rerank(combined, query)[:top_k]

Vector search finds semantically similar content. Keyword search finds exact matches. Together they catch more relevant documents.
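
One simple, effective way to implement merge_results is reciprocal rank fusion, which needs only each document's rank in each list:

def merge_results(vector_results, keyword_results, k=60):
    # Reciprocal rank fusion: score = sum over lists of 1 / (k + rank)
    scores, docs = {}, {}
    for results in (vector_results, keyword_results):
        for rank, doc in enumerate(results, start=1):
            scores[doc["id"]] = scores.get(doc["id"], 0.0) + 1.0 / (k + rank)
            docs[doc["id"]] = doc
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]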

2. Re-ranking

Initial retrieval might miss the best documents. Re-rank results using a cross-encoder:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, documents):
    pairs = [[query, doc['text']] for doc in documents]
    scores = reranker.predict(pairs)
    
    # Sort by score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked]

Re-ranking adds 50-100ms to queries but improves answer quality significantly.

3. Prompt Engineering

Better prompts produce better answers. Structure your prompts clearly:

def create_prompt(query, context_docs):
    context = "\n\n".join([f"Document {i+1}: {doc['text']}" for i, doc in enumerate(context_docs)])
    
    prompt = f"""Answer the following question using only the information from the documents below.

Documents:
{context}

Question: {query}

Answer:"""
    
    return prompt

This format helps the model understand what to use and what to ignore.

Cost Optimization Strategies

RAG systems can get expensive fast. These approaches cut costs by 60-80%:

1. Right-Size Your Clusters

Start small and scale up only when needed. Monitor these metrics:

  • CPU utilization (should be 40-70% during queries)
  • Memory usage (should stay under 80%)
  • Query latency (should be under 500ms)

If CPU is consistently under 30%, downsize your cluster. If memory hits 90%, increase node size.

2. Use Spot Instances

Spot instances cost 50-90% less than on-demand. Enable them for non-critical workloads:

{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "spot_bid_price_percent": 100
  }
}

Set up fallback to on-demand so jobs don't fail if spots aren't available.

3. Schedule Batch Processing

Process embeddings during off-peak hours:

# Schedule job to run at 2 AM daily (Quartz cron syntax)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.create(
    name="rag-embedding-update",
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",
        timezone_id="UTC"
    ),
    tasks=[
        jobs.Task(
            task_key="update_embeddings",
            spark_python_task=jobs.SparkPythonTask(
                python_file="dbfs:/scripts/update_embeddings.py"
            ),
            existing_cluster_id="your-cluster-id"  # or a new_cluster spec
        )
    ]
)

4. Monitor and Alert on Costs

Use Databricks' dashboards and billing alerts to keep expenses visible. We configure alerts at 50%, 75%, and 90% of monthly budget:

# Alert thresholds (illustrative); wire these into budget alerts in
# your cloud billing console or Databricks budget policies
alerts = [
    {"threshold": 0.5, "email": "team@company.com"},
    {"threshold": 0.75, "email": "team@company.com"},
    {"threshold": 0.9, "email": "team@company.com", "slack": "#alerts"}
]

One client caught a runaway job this way, saving $8,000 in a single month.

Performance Monitoring

Track these metrics to catch issues early:

  • Query latency: P50, P95, P99 percentiles
  • Retrieval accuracy: Precision@K scores
  • Cost per query: Total monthly cost / query count
  • Error rate: Failed queries / total queries

Set up dashboards in Databricks SQL or export to your monitoring tool.
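
If you log per-query latency to a Delta table, the percentiles fall out of a single query (the table and column names are our own convention):

# P50/P95/P99 latency from a hypothetical query_logs table
spark.sql("""
    SELECT percentile_approx(latency_ms, array(0.5, 0.95, 0.99)) AS p50_p95_p99
    FROM delta.`/delta/query_logs`
""").show()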


Security and Compliance on Databricks

Enterprise RAG systems handle sensitive data. Healthcare companies need HIPAA compliance. Financial services need SOC 2. Here's how to secure your system.

Access Control

Limit who can access your RAG system and what they can do:

# Use Databricks access controls
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()

# Create service principal for API access
service_principal = w.service_principals.create(
    display_name="rag-api-service",
    active=True
)

# Grant minimal permissions
w.permissions.set(
    request_object_type="clusters",
    request_object_id=cluster_id,
    access_control_list=[
        iam.AccessControlRequest(
            service_principal_name=service_principal.application_id,
            permission_level=iam.PermissionLevel.CAN_ATTACH_TO  # can use, can't modify
        )
    ]
)

Follow the principle of least privilege. Give users only the permissions they need.

Data Encryption

Encrypt data at rest and in transit:

At Rest: Databricks encrypts all data by default using AWS KMS or Azure Key Vault. Verify encryption is enabled:

# Inspect cluster storage settings; encryption at rest itself is
# managed at the account/workspace level (e.g., customer-managed keys)
cluster_config = w.clusters.get(cluster_id)
print(f"EBS volume type: {cluster_config.aws_attributes.ebs_volume_type}")

In Transit: Always use HTTPS for API endpoints. Databricks connections use TLS 1.2+ by default.

Application-Level: Encrypt sensitive fields before storing:

from cryptography.fernet import Fernet

# In production, load this key from Databricks Secrets (see below)
# instead of generating a fresh one per run
key = Fernet.generate_key()
cipher = Fernet(key)

def encrypt_sensitive_text(text):
    return cipher.encrypt(text.encode()).decode()

def decrypt_sensitive_text(encrypted_text):
    return cipher.decrypt(encrypted_text.encode()).decode()

Store encryption keys in Databricks Secrets, not in code.

Audit Logging

Track who accessed what data:

import logging
from datetime import datetime
from flask import request  # provides remote_addr inside the API handler

def log_query(user_id, query, sources_accessed):
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "user_id": user_id,
        "query": query,
        "sources": [s["id"] for s in sources_accessed],
        "ip_address": request.remote_addr
    }
    
    # Write to audit log table
    spark.createDataFrame([log_entry]).write.format("delta").mode("append").save("/delta/audit_logs")
    
    # Also log to Databricks audit logs
    logging.info(f"RAG Query: {log_entry}")

Review audit logs monthly for unusual access patterns.
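
A monthly review can start from a simple aggregation over the audit table (the thresholds are illustrative):

from pyspark.sql.functions import count, countDistinct

audit = spark.read.format("delta").load("/delta/audit_logs")

# Flag users with unusually high query volume or many source IPs
suspicious = (
    audit.groupBy("user_id")
         .agg(count("*").alias("queries"), countDistinct("ip_address").alias("ips"))
         .filter("queries > 1000 OR ips > 5")
)
suspicious.show()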

Compliance Considerations

HIPAA: If handling healthcare data:

  • Sign Databricks BAA (Business Associate Agreement)
  • Enable audit logging for all data access
  • Encrypt PHI fields separately
  • Restrict access to authorized personnel only

SOC 2: For enterprise clients:

  • Document all security controls
  • Run regular access reviews
  • Maintain incident response procedures
  • Retain audit logs for as long as your auditors require (often several years)

GDPR: For EU data:

  • Implement data retention policies
  • Provide data export capabilities
  • Support right to deletion requests (see the sketch below)
  • Log all data processing activities
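
Deletion requests in particular map cleanly onto Delta operations (the user_id column is an assumption about your schema; data isn't physically removed until VACUUM runs):

# Remove a subject's documents, then purge old file versions
spark.sql("DELETE FROM delta.`/delta/rag_documents` WHERE user_id = 'subject-123'")
spark.sql("VACUUM delta.`/delta/rag_documents` RETAIN 168 HOURS")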

Network Security

Isolate your RAG system from the public internet:

# Customer-managed VPC is configured when the workspace is created
# (via the Account API), not per cluster; illustrative network config:
{
  "network_name": "rag-private-network",
  "vpc_id": "vpc-xxxxx",
  "subnet_ids": ["subnet-xxxxx"],
  "security_group_ids": ["sg-xxxxx"]
}

This keeps traffic within your VPC, reducing attack surface.

Secrets Management

Never hardcode credentials. Use Databricks Secrets:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create a scope once, then store API keys in it
w.secrets.create_scope(scope="rag-secrets")
w.secrets.put_secret(
    scope="rag-secrets",
    key="openai-api-key",
    string_value=api_key
)

# Retrieve in notebook or job code
api_key = dbutils.secrets.get(scope="rag-secrets", key="openai-api-key")

Rotate secrets every 90 days. Set up alerts for expiration.


Real-World Applications and Impact of RAG Systems

RAG systems solve real business problems. Here are examples from different industries and what we learned.

Customer Support Automation

A SaaS company with 50,000 customers was drowning in support tickets. Their knowledge base had 2,000 articles, but finding the right one took agents 5-10 minutes per ticket.

We built a RAG system on Databricks that:

  • Indexed all knowledge base articles
  • Integrated with their ticketing system
  • Suggested answers in under 2 seconds

Results:

  • 40% reduction in average ticket resolution time
  • 25% of tickets resolved without human intervention
  • $120,000 annual savings in support costs

The system paid for itself in two months. Support agents now handle complex issues instead of repetitive questions.

Legal Document Analysis

A law firm processes hundreds of contracts monthly. Lawyers spent hours finding relevant clauses and precedents.

Their RAG system:

  • Indexes all past contracts and case files
  • Answers questions like "What are the termination clauses in our vendor agreements?"
  • Cites specific document sections

Results:

  • 60% faster contract review
  • More consistent analysis (fewer missed clauses)
  • Better leverage in negotiations (quick access to precedents)

The system handles 500+ queries daily. Partners use it to prepare for client meetings in minutes instead of hours.

Healthcare Knowledge Base

A hospital network needed to make medical guidelines accessible to 5,000+ staff members. Their internal wiki had 10,000 pages, but searching was slow and results were often irrelevant.

The RAG system:

  • Indexes all medical guidelines, protocols, and research papers
  • Answers clinical questions with citations
  • Updates automatically as new guidelines publish

Results:

  • 70% reduction in time to find clinical information
  • More consistent care (staff access same up-to-date information)
  • Better patient outcomes (faster access to treatment protocols)

Doctors query the system 1,000+ times daily during patient rounds.

E-commerce Product Recommendations

An online retailer with 1 million products struggled with search. Customers couldn't find what they wanted, leading to abandoned carts.

Their RAG-powered search:

  • Understands natural language queries ("comfortable running shoes for flat feet")
  • Retrieves products based on descriptions and reviews
  • Explains why products were recommended

Results:

  • 35% increase in search-to-purchase conversion
  • 50% reduction in "no results" searches
  • $2.5M additional monthly revenue

The system processes 100,000+ searches daily with sub-second latency.

Retail Recommendation Engine Enhancement

We worked with a leading retailer to revamp their recommendation engine with RAG technology on Databricks. The system integrated knowledge graph RAG capabilities to understand product relationships and customer preferences.

Results:

  • 60% reduction in query times
  • Over $50,000 in monthly savings through optimized infrastructure
  • Improved recommendation accuracy through semantic understanding

The system now processes millions of product queries daily, providing personalized recommendations that drive higher conversion rates.

Financial Research Assistant

An investment firm's analysts spent hours reading earnings reports and research papers. They needed a way to quickly find relevant information across thousands of documents.

The RAG system:

  • Indexes all earnings reports, research papers, and news articles
  • Answers questions like "What did company X say about Q4 revenue in their last earnings call?"
  • Provides citations for all claims

Results:

  • 80% faster research process
  • More thorough analysis (system finds connections analysts miss)
  • Better investment decisions (faster access to relevant data)

Analysts now spend time on analysis instead of searching for information.

Common Patterns Across Use Cases

These successful deployments share three patterns:

1. Start with High-Value Use Cases

Don't try to solve everything at once. Pick the use case that causes the most pain. For the SaaS company, that was support tickets. For the law firm, it was contract review.

2. Measure Impact in Business Terms

Track metrics that matter to stakeholders:

  • Time saved (hours per week)
  • Cost reduction (dollars per month)
  • Quality improvements (error rates, customer satisfaction)

Technical metrics like "query latency" matter, but business metrics get buy-in.

3. Iterate Based on User Feedback

Deploy a basic version first, then improve based on how people actually use it. The healthcare system started with basic search and added citation formatting after nurses requested it.


Conclusion

RAG systems on Databricks give you production-ready infrastructure without the operational overhead. You get managed vector search, cost controls that work, and security features that pass audits. This guide covered everything from initial setup to advanced deployment strategies.

Success comes from starting with a manageable project, designing for scale, and staying vigilant about costs and security. Pick one high-value use case, set up a basic system, and measure impact. Then expand based on what you learn.

Focus on these areas:

  • Retrieval quality: Better context leads to better answers. Use hybrid search and re-ranking.
  • Cost management: Right-size clusters, use spot instances, and monitor spending.
  • Security: Encrypt data, control access, and maintain audit logs.

The examples above show what's possible. A support team saving $120,000 annually. Lawyers reviewing contracts 60% faster. Doctors accessing clinical information in seconds instead of minutes. A retailer reducing query times by 60% and saving $50,000 monthly.

Your use case is different, but the principles are the same. Start small, measure impact, and scale what works.

If you're ready to build, follow the deployment steps in this guide. If you want help, we've deployed RAG systems for 50+ companies and can share what we've learned. Reach out and we'll discuss your specific needs.
