← Back to Blog
AI EngineeringLLM10 min read·April 13, 2026

How to Build a Production-Ready RAG Pipeline in 2026 (Step-by-Step)

Most RAG tutorials show you how to get something working in a Jupyter notebook. This guide shows you how to build something that works in production — with real users, real documents, and real consequences when it gets things wrong.

We've built RAG pipelines for law firms, SaaS companies, eCommerce platforms, and financial services providers. Here's the architecture we use, the mistakes we've made, and the evaluation framework that tells us when something is actually ready to ship. If you're looking for a partner to build this for you, see our Custom AI Agents service.

What Makes a RAG Pipeline “Production-Ready”?

A working prototype answers questions. A production system does all of the following:

Handles documents it was not specifically designed for
Knows when it does not know something — and says so
Returns answers in under 2 seconds at scale
Logs every query and response for monitoring and debugging
Has a feedback loop that improves retrieval over time
Fails gracefully when the LLM or vector DB is temporarily down

The 7-Layer RAG Architecture We Use

01

Layer 01: Document Ingestion

Not just PDF loading. Production ingestion handles PDFs, Word docs, web pages, Notion databases, Google Docs, CSVs, and plain text. Each format needs its own parser. We use: PyMuPDF for PDFs, python-docx for Word, Playwright for web pages, and the Notion API for Notion exports. Each parser includes error handling for malformed files — in production, you will encounter malformed files.

02

Layer 02: Preprocessing & Cleaning

Raw documents are messy. Remove: headers and footers that repeat on every page, tables of contents, page numbers, duplicate content, and boilerplate legal text. Clean: fix encoding issues, normalise whitespace, extract tables separately (tables need different chunking to prose). Do not skip this step — garbage in, garbage out, regardless of how good your retrieval is.

03

Layer 03: Chunking Strategy

This is where most RAG systems fail. Fixed-size chunking (512 tokens) is simple and works for uniform documents. Sentence-based chunking is better for prose but worse for technical content. Semantic chunking groups by meaning — best quality, slowest. Hierarchical chunking creates a parent-child structure — best for retrieval. Our default: semantic chunking with 256-token chunks and 64-token overlap, with parent document retrieval for context injection.

04

Layer 04: Embedding

Model selection matters enormously. text-embedding-3-large (OpenAI) gives the best general performance. text-embedding-3-small is 5x cheaper with roughly a 10% quality drop. BGE-M3 is the best open-source option and is self-hostable. Cohere embed-v3 is best for multilingual use cases. We benchmark every new client use case with at least 3 embedding models before committing to one.

05

Layer 05: Vector Storage

Supabase pgvector is perfect up to 1M vectors and integrates natively with SQL — ideal for most clients. Pinecone is a managed service that scales to billions of vectors at higher cost. Weaviate is strong for hybrid search. Chroma is fine for local development only. Our default: Supabase pgvector for most clients, Pinecone for high-volume production.

06

Layer 06: Retrieval

Pure vector search is not enough. Production retrieval uses hybrid search (vector similarity combined with BM25 keyword search), reranking (a cross-encoder like Cohere Rerank or BGE Reranker to reorder results by relevance), query expansion (generate 3 variants of the user query and retrieve for all), and MMR — Maximal Marginal Relevance — to reduce redundancy in retrieved chunks.

07

Layer 07: Generation & Output

The LLM prompt matters as much as retrieval. Key elements: a system prompt defining the AI's role and limitations, retrieved context with source citations, an explicit instruction to say "I don't know" when context is insufficient, output format specification (JSON, markdown, or plain text), and confidence scoring. Test your prompts against adversarial queries before shipping.

RAG Pipeline Architecture

End-to-end flow from documents to response

DocsIngestChunkEmbedVector DBUser QueryQuery ExpandRetrieveRerankLLMResponse⚠ Where most systems fail

The Evaluation Framework — How We Know It's Ready to Ship

We evaluate every RAG system on 4 metrics using the RAGAS framework. Target: all scores above 0.85 before shipping.

Faithfulness

Does the answer match the retrieved context? Checks for hallucinations.

Target: > 0.85

Answer Relevancy

Does the answer actually address the question asked?

Target: > 0.85

Context Recall

Was the correct information successfully retrieved?

Target: > 0.85

Context Precision

What % of retrieved chunks were actually useful for the answer?

Target: > 0.85

Common RAG Failures and How to Fix Them

FAILURE

Retrieves wrong chunks

FIX

Improve chunking strategy, add metadata filtering, implement hybrid search with BM25.

FAILURE

Hallucinates despite good retrieval

FIX

Strengthen system prompt with explicit grounding instructions, implement faithfulness check as a post-processing step.

FAILURE

Too slow (>3s response)

FIX

Implement async retrieval, reduce chunk count returned, cache frequent queries, switch to a faster embedding model.

FAILURE

Works in testing, fails in production

FIX

Test with real user queries from day one, not synthetic ones. Implement query logging immediately — not after the problem appears.

FAILURE

Retrieves outdated information

FIX

Implement document versioning, add timestamps to metadata, filter by recency when the query is time-sensitive.

A Real Example — Legal Document RAG System

Client

UK law firm with 50,000+ contracts. Associates spending 2 hours per contract review finding relevant precedents.

Architecture decisions

· Hierarchical chunking by clause type

· Custom metadata: contract type, date, jurisdiction, parties

· Hybrid search with clause-type filtering

· BGE Reranker for precision

· Citations required in every response

Results

Review time

2 hours15 minutes

Retrieval accuracy

~70%94%

Associate focus

SearchingAnalysing

The Tech Stack We Recommend in 2026

IngestionLangChain document loaders + custom parsers per format
ChunkingLangChain text splitters + custom semantic chunker
Embeddingtext-embedding-3-large or BGE-M3 (self-hosted)
Vector DBSupabase pgvector (< 1M vectors) · Pinecone (> 1M vectors)
RetrievalLangChain retriever + Cohere Rerank
GenerationGPT-4o or Claude 3.5 Sonnet
EvaluationRAGAS framework — automated nightly runs
MonitoringLangSmith or custom logging in Postgres
DeploymentFastAPI + Docker + AWS ECS

Building a RAG system?

We can audit your existing architecture and identify the specific layer causing retrieval or quality issues — or build the full pipeline from scratch. See our Custom AI Agents service or AI Automation service for more.

Book a Free Architecture Review →
Share this article:LinkedInTwitter / X

Related Articles