Advanced RAG — Chunking Strategy, Hybrid Search and Reranking in Production
Naive RAG with RecursiveCharacterTextSplitter fails on complex documents. How to implement semantic chunking, hybrid search with BM25, RRF fusion and Cross-Encoder reranking — with full production-ready code and RAGAS evaluation.
Naive RAG looks impressive in demos. Load a PDF, split it with RecursiveCharacterTextSplitter, embed the chunks, query Qdrant, get a reasonable answer. The problems appear in production: 400-page legal documents, technical specifications full of tables and formulas, multilingual knowledge bases with specialised terminology. RAGAS metrics drop below acceptable thresholds and users start reporting: "the model hallucinates", "couldn't find the obvious answer from the document", "response is cut off mid-sentence".
Production RAG failures have three root causes. First, chunking mechanically severs context: a sentence split across two chunks leaves the model without complete information. Second, dense retrieval fails on technical terminology: embeddings poorly distinguish standard and regulation identifiers (ISO 27001, GDPR), model numbers and proper names. Third, there is no reranking: the LLM receives the Top-3 from retrieval without quality assessment, and the "lost in the middle" effect causes it to ignore middle fragments of long contexts.
[Figure: naive RAG vs advanced RAG (demo vs production architecture)]
Advanced RAG is a pipeline, not a one-time architectural decision. Every stage — from document splitting strategy through hybrid search to Cross-Encoder reranking — has a measurable impact on response quality. This article covers each stage with code ready for production deployment.
Chunking — Defining the Unit of Meaning
A chunk isn't just "512 tokens of text" — it's a unit of meaning that the model sees as context for answering. Chunk too small: insufficient context, the model "hallucinates" missing information. Chunk too large: "diluted" context — the answer is buried somewhere in it, but the model can't extract it precisely. The optimal strategy depends on document type and the nature of user queries.
[Figure: chunking strategies compared (4 approaches to splitting documents)]
Fixed-size (RecursiveCharacterTextSplitter from LangChain): fast, simple, good for prototypes. Sentence boundary (spaCy): preserves sentence integrity, better for narrative documents. Semantic chunking: groups sentences by embedding similarity; a topic shift triggers a new chunk. Hierarchical parent-child (small-to-big retrieval in LlamaIndex): small nodes (128 tokens) for retrieval, large parent nodes (512 tokens) for LLM context, which helps avoid truncated responses on long technical documents.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
import spacy
from sentence_transformers import SentenceTransformer
import numpy as np

# ─── 1. Fixed-size chunking (baseline) ───
def fixed_size_chunking(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Note: with the default length function, chunk_size and overlap
    # are measured in characters, not tokens.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " "],
    )
    return splitter.split_text(text)

# ─── 2. Sentence boundary chunking (spaCy) ───
def sentence_boundary_chunking(text: str, max_sentences: int = 5) -> list[str]:
    # Loading the pipeline on every call is slow; load it once in real services.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    return [
        " ".join(sentences[i : i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

# ─── 3. Semantic chunking (cosine distance threshold) ───
def semantic_chunking(text: str, threshold: float = 0.3) -> list[str]:
    model = SentenceTransformer("intfloat/multilingual-e5-large")
    sentences = text.split(". ")
    # Normalised embeddings make the dot product equal to cosine similarity.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks: list[str] = []
    current: list[str] = [sentences[0]]
    for idx in range(1, len(sentences)):
        sim = float(np.dot(embeddings[idx - 1], embeddings[idx]))
        if (1 - sim) > threshold:
            # Cosine distance above the threshold: treat as a topic shift.
            chunks.append(". ".join(current))
            current = [sentences[idx]]
        else:
            current.append(sentences[idx])
    if current:
        chunks.append(". ".join(current))
    return chunks
```
When to use hierarchical parent-child: technical documents with multiple sections (specifications, manuals, legal contracts), FAQ bases, multi-section reports. When fixed-size is sufficient: homogeneous short articles and notes without complex structure.
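The hierarchical parent-child strategy is the only one of the four without a listing, so here is a minimal sketch of the idea. It assumes whitespace-separated words as a rough stand-in for tokens, and the names `ChildChunk`, `parent_child_chunking` and `expand_to_parents` are hypothetical helpers for illustration, not the LlamaIndex API.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index into the list of parent chunks


def split_words(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of at most `size` whitespace words."""
    words = text.split()
    return [" ".join(words[i : i + size]) for i in range(0, len(words), size)]


def parent_child_chunking(
    text: str, parent_size: int = 512, child_size: int = 128
) -> tuple[list[str], list[ChildChunk]]:
    """Return (parents, children); every child remembers its parent."""
    parents = split_words(text, parent_size)
    children = [
        ChildChunk(text=piece, parent_id=pid)
        for pid, parent in enumerate(parents)
        for piece in split_words(parent, child_size)
    ]
    return parents, children


def expand_to_parents(hits: list[ChildChunk], parents: list[str]) -> list[str]:
    """Map retrieved children to their parents, deduplicated, in hit order."""
    seen: set[int] = set()
    context: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(parents[hit.parent_id])
    return context
```

In this sketch only the children would be embedded and indexed; at query time the retrieved children are swapped for their parents before the context is handed to the LLM, so two hits from the same section contribute that section once.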
Hybrid Search — Dense + BM25 + RRF
Dense retrieval (embeddings + cosine similarity) handles semantics and synonyms well, but fails on technical terms: standard and regulation identifiers (ISO 27001, GDPR), model numbers and product codes. BM25 hits exact phrases precisely but doesn't understand paraphrasing. Reciprocal Rank Fusion (RRF) combines the rankings from both methods without weight calibration: for each document it sums 1/(k + rank) over every ranking in which the document appears, where rank is the document's 1-based position and k is a smoothing constant (commonly 60), then sorts by that sum in descending order.
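To make the RRF arithmetic concrete, here is a small worked sketch on two toy rankings. The chunk IDs are hypothetical, and k = 60 is assumed as the smoothing constant.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking a document appears in (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical chunk IDs: dense retrieval puts a paraphrase first,
# BM25 puts the chunk containing the literal code "ISO 27001" first.
dense_ranking = ["chunk_paraphrase", "chunk_iso", "chunk_other"]
bm25_ranking = ["chunk_iso", "chunk_exact_only"]

scores = rrf_scores([dense_ranking, bm25_ranking])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)
```

`chunk_iso` is second in the dense list and first in the BM25 list, so it collects 1/62 + 1/61 and overtakes `chunk_paraphrase`, which only gets 1/61 from a single list. A document that both retrievers agree on beats one that a single retriever ranks first.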
[Figure: hybrid search + Cross-Encoder reranker (from query to Top-5 reranked chunks)]
```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

client = QdrantClient(url="http://localhost:6333")
embedder = SentenceTransformer("intfloat/multilingual-e5-large")
COLLECTION = "knowledge_base"

# ─── Dense retrieval via Qdrant ───
def dense_search(query: str, top_k: int = 20) -> list[tuple]:
    # E5 models expect the "query: " prefix on search queries.
    q_vec = embedder.encode(f"query: {query}", normalize_embeddings=True).tolist()
    hits = client.search(
        collection_name=COLLECTION,
        query_vector=q_vec,
        limit=top_k,
        with_payload=True,
    )
    return [(r.id, r.score, r.payload["text"]) for r in hits]

# ─── BM25 sparse retrieval (in-memory; for scale → Elasticsearch) ───
def bm25_search(query: str, corpus: list[str], top_k: int = 20) -> list[tuple]:
    # The index is rebuilt on every call; build it once and reuse it in production.
    bm25 = BM25Okapi([t.lower().split() for t in corpus])
    scores = bm25.get_scores(query.lower().split())
    idxs = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in idxs]

# ─── Reciprocal Rank Fusion ───
def rrf_fusion(dense: list, sparse: list, k: int = 60) -> list[tuple]:
    # Fusion keys on doc_id, so this only works if Qdrant point IDs are
    # the same integers as the positions of the texts in `corpus`.
    rrf: dict[int, float] = {}
    for rank, (doc_id, *_) in enumerate(dense):
        rrf[doc_id] = rrf.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, (doc_id, *_) in enumerate(sparse):
        rrf[doc_id] = rrf.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(rrf.items(), key=lambda x: x[1], reverse=True)

def hybrid_search(query: str, corpus: list[str], top_k: int = 10) -> list[tuple]:
    return rrf_fusion(dense_search(query), bm25_search(query, corpus))[:top_k]
```
Alternative to in-memory BM25: Qdrant 1.7+ supports native sparse vector search (SPLADE, BM42) without an external engine. For small corpora (<100k documents), in-memory BM25 (rank_bm25) is fully sufficient.
Reranking with Cross-Encoder
Bi-encoder (embeddings) is fast because each document is encoded independently of the query. Cross-encoder analyses the (query, chunk) pair jointly — it's 50–100× slower but significantly more accurate. Strategy: bi-encoder returns Top-20–50 candidates, cross-encoder reranks them and keeps Top-3–5 for the LLM. Latency increases by ~200–400ms on GPU — an acceptable cost given the measurable quality gain.
```python
from sentence_transformers import CrossEncoder
from openai import OpenAI

# Assumes the hybrid search listing is saved as hybrid_search_qdrant.py.
from hybrid_search_qdrant import hybrid_search

# mmarco-mMiniLMv2: multilingual reranker (EN/PL/DE/FR/…), fast (MiniLM backbone)
# alternative: BAAI/bge-reranker-v2-m3 (higher quality, heavier model)
reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    # Score every (query, chunk) pair jointly, then keep the best top_k.
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for i, score in enumerate(scores):
        candidates[i]["rerank_score"] = float(score)
    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)[:top_k]

def rag_answer(query: str, corpus: list[str], llm: OpenAI) -> str:
    fused = hybrid_search(query, corpus, top_k=20)
    # doc_id is used as an index into corpus (see the ID assumption in rrf_fusion).
    candidates = [{"id": doc_id, "text": corpus[doc_id]} for doc_id, _ in fused]
    top_docs = rerank(query, candidates, top_k=5)
    context = "\n---\n".join(d["text"] for d in top_docs)
    resp = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer only based on the provided context. "
                "If the answer is not in the context, say: I don't know.",
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content
```
Key principle: don't increase Top-K for the LLM to "give it more information". The "lost in the middle" problem means models ignore middle fragments of long contexts. 3–5 highly-ranked chunks produce better answers than 15 poorly-sorted ones. Target: Top-20 retrieval → Top-5 after reranking → LLM with ~1500 tokens of context.
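The Top-5 and ~1500-token target can be enforced in code rather than by convention. The sketch below is one way to do it, assuming a crude word-count estimate in place of a real tokenizer; `approx_tokens` and `pack_context` are hypothetical helper names.

```python
def approx_tokens(text: str) -> int:
    """Crude estimate: about 1.3 tokens per whitespace word. Not a real tokenizer."""
    return len(text.split()) * 13 // 10 + 1


def pack_context(
    ranked_chunks: list[str], max_chunks: int = 5, budget: int = 1500
) -> list[str]:
    """Keep chunks in rank order until the chunk cap or the token budget is hit."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks[:max_chunks]:
        cost = approx_tokens(chunk)
        if used + cost > budget:
            # Stop at the first chunk that does not fit, so a lower-ranked
            # chunk never takes the place of a higher-ranked one.
            break
        selected.append(chunk)
        used += cost
    return selected
```

Swapping `approx_tokens` for the tokenizer of the deployed LLM would make the budget exact; the packing logic stays the same.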
Evaluation with RAGAS — Quality Gate Before Deployment
Evaluating RAG without automated metrics is an architectural mistake — you don't know whether a chunking change improved results, you have no gate before deploying new embedding versions. RAGAS measures four dimensions: Context Precision (relevance of retrieved chunks), Context Recall (completeness of information), Faithfulness (whether the answer comes from the context, not the model's parametric knowledge) and Answer Relevancy (whether the answer actually addresses the question asked).
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

# prepare at least 50–100 (question, ground_truth) pairs for reliable results
test_cases = [
    {
        "question": "What are the prerequisites for ISO 27001 certification?",
        "answer": "ISO 27001 requires implementing an ISMS, conducting a risk assessment and an internal audit.",
        "contexts": [
            "ISO 27001 is an international standard for information security management systems (ISMS).",
            "Certification requires documenting security policies, risk assessment and risk treatment plans.",
        ],
        "ground_truth": "ISO 27001 certification requires an ISMS, risk assessment and an internal audit.",
    },
]

dataset = Dataset.from_list(test_cases)
results = evaluate(
    dataset=dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

df = results.to_pandas()
print(df[["context_precision", "context_recall", "faithfulness", "answer_relevancy"]].mean())
```
Production quality thresholds: Context Precision ≥ 0.80, Context Recall ≥ 0.75, Faithfulness ≥ 0.85, Answer Relevancy ≥ 0.80. Treat these values as a CI/CD gate — before merging new embedding versions or chunking strategies, run evaluation on a fixed set of 100 test questions.
| Metric | Naive RAG | Advanced RAG | Action when score < 0.70 |
|---|---|---|---|
| Context Precision | 0.51 | 0.84 | Change chunking strategy or embedding model |
| Context Recall | 0.63 | 0.82 | Increase top-K retrieval or improve index coverage |
| Faithfulness | 0.74 | 0.91 | Strengthen system prompt or change the base LLM |
| Answer Relevancy | 0.68 | 0.87 | Improve query reformulation and system instructions |
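The production thresholds can be wired into CI as a simple pass/fail check. This is a minimal sketch, assuming the mean scores arrive as a plain dictionary (for example, built from the RAGAS results); `failed_metrics` and `quality_gate` are hypothetical names.

```python
# Production thresholds quoted in the text.
THRESHOLDS = {
    "context_precision": 0.80,
    "context_recall": 0.75,
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
}


def failed_metrics(mean_scores: dict[str, float]) -> list[str]:
    """Return the metrics below threshold; a missing metric counts as a failure."""
    return [
        name
        for name, minimum in THRESHOLDS.items()
        if mean_scores.get(name, 0.0) < minimum
    ]


def quality_gate(mean_scores: dict[str, float]) -> int:
    """Return a process exit code: 0 when every metric passes, 1 otherwise."""
    failures = failed_metrics(mean_scores)
    for name in failures:
        print(f"FAIL {name}: {mean_scores.get(name)} < {THRESHOLDS[name]}")
    return 1 if failures else 0


# The two columns of the table, used here as example inputs.
advanced_rag = {"context_precision": 0.84, "context_recall": 0.82,
                "faithfulness": 0.91, "answer_relevancy": 0.87}
naive_rag = {"context_precision": 0.51, "context_recall": 0.63,
             "faithfulness": 0.74, "answer_relevancy": 0.68}

print("advanced exit code:", quality_gate(advanced_rag))
print("naive exit code:", quality_gate(naive_rag))
```

In a CI job the return value would typically be passed to `sys.exit`, so a regression in any single metric blocks the merge.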
Run RAGAS weekly on a random sample of real production queries — not just your synthetic test set. Synthetic metrics can be inflated (questions are "matched" to document content). Collect real (question, answer) pairs from production logs and gradually expand your evaluation set.
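Drawing the weekly sample could look roughly like the sketch below. It assumes production logs are available as a list of dictionaries with a `question` key; that record shape and the function name `sample_production_queries` are assumptions for illustration.

```python
import random
from typing import Optional


def sample_production_queries(
    log_records: list[dict], sample_size: int = 50, seed: Optional[int] = None
) -> list[dict]:
    """Deduplicate by normalised question text, then draw a uniform random sample."""
    unique: dict[str, dict] = {}
    for record in log_records:
        # Lower-case and collapse whitespace so near-identical questions merge.
        key = " ".join(record["question"].lower().split())
        unique.setdefault(key, record)  # keep the first occurrence
    pool = list(unique.values())
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))
```

Deduplicating first keeps a handful of very frequent questions from dominating the sample; passing a fixed `seed` makes a given week's sample reproducible.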
---
I build advanced RAG pipelines for companies that have knowledge bases and need their employees or customers to get precise, documented answers — not hallucinations. From auditing existing systems through chunking, hybrid search and reranking optimisation to production quality monitoring with RAGAS. Get in touch — I start with an analysis of your document base and the first benchmark.
