We've built RAG pipelines for law firms, SaaS companies, eCommerce platforms, and financial services providers. Here's the architecture we use, the mistakes we've made, and the evaluation framework that tells us when something is actually ready to ship. If you're looking for a partner to build this for you, see our Custom AI Agents service.
What Makes a RAG Pipeline “Production-Ready”?
A working prototype answers questions. A production system does all of the following:
The 7-Layer RAG Architecture We Use
Layer 01: Document Ingestion
Not just PDF loading. Production ingestion handles PDFs, Word docs, web pages, Notion databases, Google Docs, CSVs, and plain text. Each format needs its own parser. We use: PyMuPDF for PDFs, python-docx for Word, Playwright for web pages, and the Notion API for Notion exports. Each parser includes error handling for malformed files — in production, you will encounter malformed files.
Layer 02: Preprocessing & Cleaning
Raw documents are messy. Remove: headers and footers that repeat on every page, tables of contents, page numbers, duplicate content, and boilerplate legal text. Clean: fix encoding issues, normalise whitespace, extract tables separately (tables need different chunking to prose). Do not skip this step — garbage in, garbage out, regardless of how good your retrieval is.
Layer 03: Chunking Strategy
This is where most RAG systems fail. Fixed-size chunking (512 tokens) is simple and works for uniform documents. Sentence-based chunking is better for prose but worse for technical content. Semantic chunking groups by meaning — best quality, slowest. Hierarchical chunking creates a parent-child structure — best for retrieval. Our default: semantic chunking with 256-token chunks and 64-token overlap, with parent document retrieval for context injection.
Layer 04: Embedding
Model selection matters enormously. text-embedding-3-large (OpenAI) gives the best general performance. text-embedding-3-small is 5x cheaper with roughly a 10% quality drop. BGE-M3 is the best open-source option and is self-hostable. Cohere embed-v3 is best for multilingual use cases. We benchmark every new client use case with at least 3 embedding models before committing to one.
Layer 05: Vector Storage
Supabase pgvector is perfect up to 1M vectors and integrates natively with SQL — ideal for most clients. Pinecone is a managed service that scales to billions of vectors at higher cost. Weaviate is strong for hybrid search. Chroma is fine for local development only. Our default: Supabase pgvector for most clients, Pinecone for high-volume production.
Layer 06: Retrieval
Pure vector search is not enough. Production retrieval uses hybrid search (vector similarity combined with BM25 keyword search), reranking (a cross-encoder like Cohere Rerank or BGE Reranker to reorder results by relevance), query expansion (generate 3 variants of the user query and retrieve for all), and MMR — Maximal Marginal Relevance — to reduce redundancy in retrieved chunks.
Layer 07: Generation & Output
The LLM prompt matters as much as retrieval. Key elements: a system prompt defining the AI's role and limitations, retrieved context with source citations, an explicit instruction to say "I don't know" when context is insufficient, output format specification (JSON, markdown, or plain text), and confidence scoring. Test your prompts against adversarial queries before shipping.
RAG Pipeline Architecture
End-to-end flow from documents to response
The Evaluation Framework — How We Know It's Ready to Ship
We evaluate every RAG system on 4 metrics using the RAGAS framework. Target: all scores above 0.85 before shipping.
Faithfulness
Does the answer match the retrieved context? Checks for hallucinations.
Answer Relevancy
Does the answer actually address the question asked?
Context Recall
Was the correct information successfully retrieved?
Context Precision
What % of retrieved chunks were actually useful for the answer?
Common RAG Failures and How to Fix Them
FAILURE
Retrieves wrong chunks
FIX
Improve chunking strategy, add metadata filtering, implement hybrid search with BM25.
FAILURE
Hallucinates despite good retrieval
FIX
Strengthen system prompt with explicit grounding instructions, implement faithfulness check as a post-processing step.
FAILURE
Too slow (>3s response)
FIX
Implement async retrieval, reduce chunk count returned, cache frequent queries, switch to a faster embedding model.
FAILURE
Works in testing, fails in production
FIX
Test with real user queries from day one, not synthetic ones. Implement query logging immediately — not after the problem appears.
FAILURE
Retrieves outdated information
FIX
Implement document versioning, add timestamps to metadata, filter by recency when the query is time-sensitive.
A Real Example — Legal Document RAG System
Client
UK law firm with 50,000+ contracts. Associates spending 2 hours per contract review finding relevant precedents.
Architecture decisions
· Hierarchical chunking by clause type
· Custom metadata: contract type, date, jurisdiction, parties
· Hybrid search with clause-type filtering
· BGE Reranker for precision
· Citations required in every response
Results
Review time
Retrieval accuracy
Associate focus
The Tech Stack We Recommend in 2026
Building a RAG system?
We can audit your existing architecture and identify the specific layer causing retrieval or quality issues — or build the full pipeline from scratch. See our Custom AI Agents service or AI Automation service for more.
Book a Free Architecture Review →