Do I need to use all three techniques — chunking, hybrid search and reranker?

No. Start with the biggest pain point. Technical documents with specialised terminology? Hybrid search (BM25 + dense) delivers immediate gains. Truncated responses or lack of context? Improve chunking (sentence boundary or hierarchical). Want to squeeze out the last few percentage points once retrieval is stable? Add a reranker. Each layer lifts metrics on its own, and the gains stack: in this article's benchmark, Precision@5 rises from 67% (dense only) to 79% (hybrid RRF) and 84% (hybrid + reranker).

Which embedding model should I choose?

Best options in 2025: **intfloat/multilingual-e5-large** (good quality/speed balance, 1024 dim, free for self-hosting), **BAAI/bge-m3** (highest multilingual quality, hybrid sparse+dense, ~570M parameters), **text-embedding-3-large** from OpenAI (3072 dim by default, reducible via the dimensions parameter; managed, $0.00013/1k tokens if you don't want to self-host). Avoid EN-only models for non-English content: the quality gap is significant (~15 pp on multilingual benchmarks).

How long does indexing 10,000 PDF documents take?

With multilingual-e5-large on GPU (A100): ~20 minutes for 10k documents of ~10 pages each, semantic chunking. On CPU: 3–4× slower. A dense Qdrant index for 500k vectors at 1024 dim occupies ~2 GB RAM. Plan for batch processing with checkpoints — this is not a one-time operation (documents get updated and full reindexing must be possible without downtime).
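The ~2 GB figure follows from simple arithmetic on float32 vectors. The sketch below shows that calculation; the function name is hypothetical, and the result covers only the raw vectors, so the HNSW graph and stored payloads come on top of it.

```python
def dense_index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vector data alone (float32 by default).

    Assumption: index overhead (HNSW links) and payloads are not included.
    """
    return num_vectors * dim * bytes_per_value


raw = dense_index_bytes(500_000, 1024)
print(f"{raw / 1e9:.2f} GB")  # 2.05 GB
```

Passing bytes_per_value=1 gives a rough figure for int8-quantised vectors, which is the usual lever when the index outgrows RAM.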

Qdrant vs Weaviate vs Pinecone — which to choose?

Qdrant: open-source, best performance for self-hosted, native sparse vectors for hybrid search since v1.7, actively maintained. Weaviate: more built-in features (hybrid search out-of-the-box, GraphQL API), slower and more resource-intensive. Pinecone: fully managed, serverless from ~$0.033/1M queries; zero ops, but cost grows with vector count. My default: Qdrant self-hosted up to 5M vectors, Pinecone serverless for an MVP without infrastructure.

How do I automatically detect RAG quality drift in production?

Collect thumbs up/down from users and map them to specific queries. Run RAGAS in online mode on a random 5–10% sample of queries (LLM-as-judge). Monitor average cosine scores from retrieval — a drop below 0.6 signals time for re-evaluation. Alert: when Context Precision drops by >10 pp versus baseline — time to review the embedding model or chunking strategy.
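The two alert rules above can be sketched in a few lines. This is a minimal illustration, not a monitoring stack: the class and function names are hypothetical, the window size of 500 is an assumption, and the 0.6 floor and 10 pp drop are the values quoted in the answer.

```python
from collections import deque
from statistics import mean


class RetrievalScoreMonitor:
    """Rolling mean of top-1 retrieval scores compared against a fixed floor."""

    def __init__(self, window: int = 500, floor: float = 0.6) -> None:
        # Only the most recent `window` scores are kept.
        self.scores = deque(maxlen=window)
        self.floor = floor

    def record(self, top_score: float) -> bool:
        """Store one score; return True when the rolling mean is below the floor."""
        self.scores.append(top_score)
        return mean(self.scores) < self.floor


def precision_regressed(baseline: float, current: float, max_drop_pp: float = 10.0) -> bool:
    """True when Context Precision fell by more than max_drop_pp percentage points."""
    return (baseline - current) * 100 > max_drop_pp
```

In practice record() would be called once per production query, and precision_regressed() once per RAGAS run against the stored baseline.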


2026-06-04 · AI & Automation · 16 min

Advanced RAG — Chunking Strategy, Hybrid Search and Reranking in Production

Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

Naive RAG looks impressive in demos. Load a PDF, split it with RecursiveCharacterTextSplitter, embed the chunks, query Qdrant, get a reasonable answer. The problems appear in production: with 400-page legal documents, technical specifications full of tables and formulas, and multilingual knowledge bases with specialised terminology. RAGAS metrics drop below acceptable thresholds and users start reporting: "the model hallucinates", "couldn't find the obvious answer from the document", "response is cut off mid-sentence".

Naive RAG with RecursiveCharacterTextSplitter fails on complex documents. This article shows how to implement semantic chunking, hybrid search with BM25, RRF fusion and Cross-Encoder reranking, with full production-ready code and RAGAS evaluation.

Production RAG failures have three root causes. First, chunking mechanically severs context: a sentence split across two chunks leaves the model without complete information. Second, dense retrieval fails on technical terminology: embeddings poorly distinguish standards codes (ISO 27001, GDPR), model numbers and proper names. Third, there is no reranking: the LLM receives the Top-3 from retrieval without quality assessment, and the "lost in the middle" effect causes it to ignore the middle fragments of long contexts.

/// NAIVE RAG vs ADVANCED RAG

Demo vs production architecture

Naive RAG (demo)

01 Fixed-size chunking (RecursiveCharacterTextSplitter)

02 Single bi-encoder embedding model

03 Dense retrieval only, Top-3 straight to the LLM

04 No reranking or answer-quality evaluation

Advanced RAG (production)

01 Semantic / hierarchical chunking

02 Multilingual-E5 embeddings

03 Hybrid search: Dense + BM25 + RRF fusion

04 Cross-Encoder reranking (Top-20 → Top-5) + RAGAS gate

RAGAS benchmark (same questions, same document base)

Metric	Naive	Advanced
Context Precision	51%	84%
Context Recall	63%	82%
Faithfulness	74%	91%

Advanced RAG is a pipeline, not a one-time architectural decision. Every stage — from document splitting strategy through hybrid search to Cross-Encoder reranking — has a measurable impact on response quality. This article covers each stage with code ready for production deployment.

Chunking — Defining the Unit of Meaning

A chunk isn't just "512 tokens of text" — it's a unit of meaning that the model sees as context for answering. Chunk too small: insufficient context, the model "hallucinates" missing information. Chunk too large: "diluted" context — the answer is buried somewhere in it, but the model can't extract it precisely. The optimal strategy depends on document type and the nature of user queries.

/// CHUNKING STRATEGIES — COMPARISON

4 approaches to splitting documents

Strategy	Tool	Setting	Pro	Con
Fixed-size	RecursiveCharacterTextSplitter	512 tok., overlap 64	Fast, simple, no dependencies	Cuts sentences in half
Sentence boundary	spaCy / NLTK	Natural sentence boundaries	Preserves sentence integrity	Slower preprocessing
Semantic	sentence-transformers	Cosine distance > 0.3 = new chunk	Best context quality	Requires GPU for indexing
Hierarchical	SmallToLarge (LlamaIndex)	128 tok. retrieval / 512 tok. ctx	Precision + rich LLM context	Complex implementation

Fixed-size (RecursiveCharacterTextSplitter from LangChain): fast, simple, good for prototypes. Sentence boundary (spaCy): preserves sentence integrity, better for narrative documents. Semantic chunking: groups sentences by embedding similarity; a topic shift triggers a new chunk. Hierarchical parent-child (SmallToLarge from LlamaIndex): small nodes (128 tok.) for retrieval, large parent nodes (512 tok.) for LLM context, which avoids truncated responses on long technical documents.

chunking_strategies.py

# Newer LangChain releases expose the same class in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter
import spacy
from sentence_transformers import SentenceTransformer
import numpy as np


# --- 1. Fixed-size chunking (baseline) ---
def fixed_size_chunking(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # chunk_size and chunk_overlap are counted in characters by default, not tokens
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " "],
    )
    return splitter.split_text(text)


# --- 2. Sentence boundary chunking (spaCy) ---
def sentence_boundary_chunking(text: str, max_sentences: int = 5) -> list[str]:
    # The pipeline is loaded on every call; load it once at startup in production
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    # Group consecutive sentences into chunks of at most max_sentences
    return [" ".join(sentences[i : i + max_sentences]) for i in range(0, len(sentences), max_sentences)]


# --- 3. Semantic chunking (cosine distance threshold) ---
def semantic_chunking(text: str, threshold: float = 0.3) -> list[str]:
    # The model is loaded on every call; load it once at startup in production
    model = SentenceTransformer("intfloat/multilingual-e5-large")
    sentences = text.split(". ")
    # Normalised embeddings, so the dot product below is the cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks: list[str] = []
    current: list[str] = [sentences[0]]
    for idx in range(1, len(sentences)):
        sim = float(np.dot(embeddings[idx - 1], embeddings[idx]))
        # A large distance between neighbouring sentences marks a topic shift
        if (1 - sim) > threshold:
            chunks.append(". ".join(current))
            current = [sentences[idx]]
        else:
            current.append(sentences[idx])
    if current:
        chunks.append(". ".join(current))
    return chunks

When to use hierarchical parent-child: technical documents with multiple sections (specifications, manuals, legal contracts), FAQ bases, multi-section reports. When fixed-size is sufficient: homogeneous short articles and notes without complex structure.
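The parent-child idea can be sketched without any framework. This is a simplified illustration, not the LlamaIndex implementation: all names are hypothetical, and whitespace-separated words stand in for tokens, so the 128/512 sizes are only approximate.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index into the list of parent chunks


def window(tokens: list[str], size: int) -> list[list[str]]:
    """Split a token list into consecutive, non-overlapping windows."""
    return [tokens[i : i + size] for i in range(0, len(tokens), size)]


def parent_child_chunks(
    text: str, parent_size: int = 512, child_size: int = 128
) -> tuple[list[str], list[ChildChunk]]:
    """Build large parent chunks and the small child chunks that point to them."""
    parents: list[str] = []
    children: list[ChildChunk] = []
    for parent_tokens in window(text.split(), parent_size):
        parent_id = len(parents)
        parents.append(" ".join(parent_tokens))
        for child_tokens in window(parent_tokens, child_size):
            children.append(ChildChunk(" ".join(child_tokens), parent_id))
    return parents, children


def expand_to_parents(hits: list[ChildChunk], parents: list[str]) -> list[str]:
    """Map retrieved children to their parents, deduplicated, rank order preserved."""
    seen: set[int] = set()
    context: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(parents[hit.parent_id])
    return context
```

Only the children are embedded and indexed. At query time the retrieved children are passed through expand_to_parents(), so the LLM sees the larger surrounding section while retrieval still matches on small, precise units.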

Hybrid Search — Dense + BM25 + RRF

Dense retrieval (embeddings + cosine similarity) handles semantics and synonyms well, but fails on exact technical terms: standards codes (ISO 27001, GDPR), model numbers and product codes. BM25 matches exact phrases precisely but doesn't understand paraphrasing. Reciprocal Rank Fusion (RRF) combines the rankings from both methods without weight calibration: for each document it sums 1/(k + rank) over the rankings in which the document appears, then sorts by that sum in descending order.
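Written out, the RRF score looks as follows. The notation is an assumption made for illustration: R is the set of rankings (here dense and BM25), rank_r(d) is the 1-based position of document d in ranking r, and k is the smoothing constant (60 is the default used in this article's code).

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \operatorname{rank}_r(d)}
```

A document that is missing from a ranking simply contributes nothing for it. For example, with k = 60 a document ranked 1st by dense retrieval and 3rd by BM25 scores 1/61 + 1/63 ≈ 0.0323, while a document ranked 1st by only one method scores 1/61 ≈ 0.0164.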

/// HYBRID SEARCH + CROSS-ENCODER RERANKER

From query to Top-5 reranked chunks

Query › Dense (Qdrant) + BM25 (sparse) › RRF fusion › Reranker (Cross-Encoder) › Top-5 to LLM

Precision@5 (200 technical questions, 50k document base)

Configuration	Precision@5
Dense only	67%
BM25 only	54%
Hybrid RRF	79%
Hybrid + Reranker	84%

hybrid_search_qdrant.py

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

client = QdrantClient(url="http://localhost:6333")
embedder = SentenceTransformer("intfloat/multilingual-e5-large")
COLLECTION = "knowledge_base"


# --- Dense retrieval via Qdrant ---
def dense_search(query: str, top_k: int = 20) -> list[tuple]:
    # E5 models expect the "query: " prefix on search queries
    q_vec = embedder.encode(f"query: {query}", normalize_embeddings=True).tolist()
    hits = client.search(collection_name=COLLECTION, query_vector=q_vec, limit=top_k, with_payload=True)
    return [(r.id, r.score, r.payload["text"]) for r in hits]


# --- BM25 sparse retrieval (in-memory; for scale → Elasticsearch) ---
def bm25_search(query: str, corpus: list[str], top_k: int = 20) -> list[tuple]:
    # The index is rebuilt on every call; build it once and reuse it for a static corpus
    bm25 = BM25Okapi([t.lower().split() for t in corpus])
    scores = bm25.get_scores(query.lower().split())
    idxs = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in idxs]


# --- Reciprocal Rank Fusion ---
# Both rankings must share one ID space: this code assumes that Qdrant point IDs
# are the integer positions of the same texts in `corpus`.
def rrf_fusion(dense: list, sparse: list, k: int = 60) -> list[tuple]:
    rrf: dict[int, float] = {}
    # enumerate() is 0-based, so rank + 1 is the 1-based rank used by RRF
    for rank, (doc_id, *_) in enumerate(dense):
        rrf[doc_id] = rrf.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, (doc_id, *_) in enumerate(sparse):
        rrf[doc_id] = rrf.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(rrf.items(), key=lambda x: x[1], reverse=True)


def hybrid_search(query: str, corpus: list[str], top_k: int = 10) -> list[tuple]:
    return rrf_fusion(dense_search(query), bm25_search(query, corpus))[:top_k]

Alternative for BM25: Qdrant 1.7+ supports native sparse vector search (SPLADE, BM42) without an external engine. For small corpora (<100k documents), in-memory BM25 (rank_bm25) is fully sufficient.

Reranking with Cross-Encoder

Bi-encoder (embeddings) is fast because each document is encoded independently of the query. Cross-encoder analyses the (query, chunk) pair jointly — it's 50–100× slower but significantly more accurate. Strategy: bi-encoder returns Top-20–50 candidates, cross-encoder reranks them and keeps Top-3–5 for the LLM. Latency increases by ~200–400ms on GPU — an acceptable cost given the measurable quality gain.

reranking_pipeline.py

from sentence_transformers import CrossEncoder
from openai import OpenAI

from hybrid_search_qdrant import hybrid_search

# mmarco-mMiniLMv2: multilingual reranker (EN/PL/DE/FR/…), fast (MiniLM backbone)
# alternative: BAAI/bge-reranker-v2-m3 (higher quality, heavier model)
reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")


def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    # The cross-encoder scores each (query, chunk) pair jointly
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for i, score in enumerate(scores):
        candidates[i]["rerank_score"] = float(score)
    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)[:top_k]


def rag_answer(query: str, corpus: list[str], llm: OpenAI) -> str:
    # Top-20 from hybrid retrieval, then Top-5 after reranking
    fused = hybrid_search(query, corpus, top_k=20)
    # doc_id is used as an index into corpus (same ID assumption as in hybrid_search)
    candidates = [{"id": doc_id, "text": corpus[doc_id]} for doc_id, _ in fused]
    top_docs = rerank(query, candidates, top_k=5)
    context = "\n---\n".join(d["text"] for d in top_docs)
    resp = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only based on the provided context. If the answer is not in the context, say: I don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

Key principle: don't increase Top-K for the LLM to "give it more information". The "lost in the middle" problem means models ignore middle fragments of long contexts. 3–5 highly-ranked chunks produce better answers than 15 poorly-sorted ones. Target: Top-20 retrieval → Top-5 after reranking → LLM with ~1500 tokens of context.
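One way to enforce the ~1500-token target is to pack reranked chunks into a fixed budget. The sketch below is an illustration under stated assumptions: pack_context is a hypothetical helper, and whitespace-separated words are used as a rough proxy for tokens, so a real tokenizer would give different counts.

```python
def pack_context(ranked_chunks: list[str], max_tokens: int = 1500, max_chunks: int = 5) -> list[str]:
    """Keep the best-ranked chunks that fit the budget, preserving rank order."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks[:max_chunks]:
        cost = len(chunk.split())  # whitespace words as a rough token proxy
        if used + cost > max_tokens:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += cost
    return selected
```

Stopping at the first chunk that does not fit, rather than skipping it and trying lower-ranked ones, keeps the context strictly in rank order.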

Evaluation with RAGAS — Quality Gate Before Deployment

Evaluating RAG without automated metrics is an architectural mistake — you don't know whether a chunking change improved results, you have no gate before deploying new embedding versions. RAGAS measures four dimensions: Context Precision (relevance of retrieved chunks), Context Recall (completeness of information), Faithfulness (whether the answer comes from the context, not the model's parametric knowledge) and Answer Relevancy (whether the answer actually addresses the question asked).

ragas_eval.py

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

# prepare at least 50–100 (question, ground_truth) pairs for reliable results
test_cases = [
    {
        "question": "What are the prerequisites for ISO 27001 certification?",
        "answer": "ISO 27001 requires implementing an ISMS, conducting a risk assessment and an internal audit.",
        "contexts": [
            "ISO 27001 is an international standard for information security management systems (ISMS).",
            "Certification requires documenting security policies, risk assessment and risk treatment plans.",
        ],
        "ground_truth": "ISO 27001 certification requires an ISMS, risk assessment and an internal audit.",
    },
]

dataset = Dataset.from_list(test_cases)

# The metrics are LLM-judged, so evaluate() needs a configured judge model
results = evaluate(
    dataset=dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

# One row per test case; the column means are the aggregate scores
df = results.to_pandas()
print(df[["context_precision", "context_recall", "faithfulness", "answer_relevancy"]].mean())

Production quality thresholds: Context Precision ≥ 0.80, Context Recall ≥ 0.75, Faithfulness ≥ 0.85, Answer Relevancy ≥ 0.80. Treat these values as a CI/CD gate — before merging new embedding versions or chunking strategies, run evaluation on a fixed set of 100 test questions.
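As a CI/CD gate, the thresholds above reduce to a small check that returns a non-zero exit code on failure. This is a minimal sketch: the function names are hypothetical, and the scores dictionary is assumed to hold the mean value of each RAGAS metric from the evaluation run.

```python
THRESHOLDS = {
    "context_precision": 0.80,
    "context_recall": 0.75,
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
}


def failed_metrics(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> dict[str, float]:
    """Return the metrics that are missing or below their threshold."""
    return {
        name: scores.get(name, 0.0)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }


def gate(scores: dict[str, float]) -> int:
    """Exit code for a CI step: 0 when every metric passes, 1 otherwise."""
    failures = failed_metrics(scores)
    for name, value in failures.items():
        print(f"FAIL {name}: {value:.2f} < {THRESHOLDS[name]:.2f}")
    return 1 if failures else 0
```

In a pipeline step this would end with sys.exit(gate(scores)), so a regression in any single metric blocks the merge. A missing metric counts as a failure rather than passing silently.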

Metric	Naive RAG	Advanced RAG	Action when score < 0.70
Context Precision	0.51	0.84	Change chunking strategy or embedding model
Context Recall	0.63	0.82	Increase top-K retrieval or improve index coverage
Faithfulness	0.74	0.91	Strengthen system prompt or change the base LLM
Answer Relevancy	0.68	0.87	Improve query reformulation and system instructions

Run RAGAS weekly on a random sample of real production queries — not just your synthetic test set. Synthetic metrics can be inflated (questions are "matched" to document content). Collect real (question, answer) pairs from production logs and gradually expand your evaluation set.
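Sampling production queries for evaluation can be done deterministically, so the same query is always in or out of the sample regardless of which service instance handles it. The sketch below is one possible approach, not the author's setup: in_eval_sample and the query_id field are hypothetical, and the 5% rate is the lower end of the 5–10% range mentioned in the FAQ.

```python
import hashlib


def in_eval_sample(query_id: str, rate: float = 0.05) -> bool:
    """Decide from a hash of the ID whether this query joins the evaluation sample."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000)
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rate * 10_000
```

Hashing the ID instead of calling a random number generator makes the sample reproducible, which helps when a flagged query has to be traced back through the logs.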

---

I build advanced RAG pipelines for companies that have knowledge bases and need their employees or customers to get precise, documented answers — not hallucinations. From auditing existing systems through chunking, hybrid search and reranking optimisation to production quality monitoring with RAGAS. Get in touch — I start with an analysis of your document base and the first benchmark.


/// AUTHOR

Paweł Wiszniewski

SEO & GEO Specialist & AI Engineer

SEO/GEO specialist (10 years) and AI engineer (3 years). I build search visibility, AI systems and automations that reduce costs and improve operational efficiency.
