jestbetweenus

The simplest RAG pipeline looks like this: chunk a document, embed the chunks, embed the user's question, find the top-k nearest chunks by cosine similarity, stuff them into a prompt. It works fine in demos. It falls apart on anything real.
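For reference, the baseline described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: `embed` is a hypothetical callable standing in for whatever embedding model you use, and the brute-force scan stands in for a vector index.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def naive_top_k(
    question: str,
    chunks: Sequence[str],
    embed: Callable[[str], Vector],  # hypothetical: wraps your embedding model
    k: int = 3,
) -> List[str]:
    """Embed the question, score every chunk by cosine similarity, keep the best k."""
    q_vec = embed(question)
    scored = [(cosine_similarity(q_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

Everything that follows is a change to one of these steps: what gets embedded, how candidates are found, or how they are ordered.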

Here's what goes wrong and what actually fixes it.

Problem 1: Question embeddings don't look like answer embeddings

"What was the committee's recommendation?" embeds near other questions. It does not embed near "The committee recommended a 90-day review period," which is the passage you're looking for.

Fix: HyDE (Hypothetical Document Embeddings)

Before searching, ask the LLM to write a short passage that would answer the question — as if it were already in the document. Embed that synthetic passage instead of (or alongside) the raw question. The synthetic passage embeds in the same space as real document text, so it lands much closer to the relevant chunks.

Cost: one extra LLM call per query, plus the latency of that call (cheap with a small model such as Haiku). Retrieval improvement: dramatic on anything conversational or analytical.
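A minimal sketch of the HyDE step, under stated assumptions: `generate` and `embed` are hypothetical callables wrapping your LLM and embedding model, the prompt wording is illustrative, and averaging the two vectors is just one way to use the synthetic passage "alongside" the raw question.

```python
from typing import Callable, List, Sequence


def hyde_query_vector(
    question: str,
    generate: Callable[[str], str],           # hypothetical: prompt -> LLM completion
    embed: Callable[[str], Sequence[float]],  # hypothetical: text -> embedding
    include_question: bool = True,
) -> List[float]:
    """Build a search vector from a synthetic answer passage.

    With include_question=True, the passage vector and the question vector
    are averaged (an assumption of this sketch, not the only option).
    """
    prompt = (
        "Write a short passage, in the style of the source document, "
        f"that would answer this question:\n\n{question}"
    )
    passage = generate(prompt)

    vectors = [list(embed(passage))]
    if include_question:
        vectors.append(list(embed(question)))

    dim = len(vectors[0])
    # Element-wise mean of the collected vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

The returned vector replaces the raw question embedding in the similarity search; nothing else in the pipeline has to change.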

Problem 2: Vector search misses exact matches

A contract might mention "Section 12.3(b)" or "Case No. 2023-CV-04471" dozens of times. To an embedding model these identifiers are close to arbitrary token sequences, so they carry little semantic signal and vector search tends to rank the chunks that contain them low. A keyword search would find them instantly.

Fix: Hybrid search with BM25

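Run both a cosine similarity search (vector) and a full-text BM25 search in parallel. Merge the results using Reciprocal Rank Fusion: score each chunk as 1/(60 + rank_vector) + 1/(60 + rank_bm25), where ranks start at 1 and a chunk missing from one list simply contributes nothing for that list. The constant 60 is the conventional default; it keeps a single first-place ranking from dominating the fused score. Chunks that appear in both lists score highest. Chunks found only by keyword get surfaced anyway.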
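The fusion step is small enough to show in full. This sketch assumes each search returns chunk IDs ordered best-first; the function name and the ID values are illustrative.

```python
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,
) -> List[Hashable]:
    """Fuse several best-first rankings into one.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Chunks absent from a list add nothing for it.
    """
    scores: Dict[Hashable, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
```

Called with a vector ranking and a keyword ranking, it returns one merged list. Because only ranks are used, the two searches never need comparable score scales.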

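PostgreSQL's tsvector + ts_rank gives you built-in keyword ranking without a separate search engine. It is not true BM25: ts_rank uses no corpus-wide statistics, so there is no inverse document frequency term. For surfacing exact identifiers it is usually close enough. pgvector handles the vector side. One database, two passes.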

Problem 3: Cosine similarity is a noisy signal

Two chunks can have similar embeddings and completely different relevance to the specific question being asked. The embedding model doesn't know what question will be asked — it just encodes semantic meaning in a general sense.

Fix: Cross-encoder reranking

After retrieval, run a second model that reads the actual text of each chunk and the question together, then scores relevance. This is a cross-encoder: it sees both at once and can reason about fit.

You can't run this on every chunk in the database (too slow). But you can run it on the top 80 retrieved candidates. A fast, cheap LLM such as Haiku can play the cross-encoder role here: give it the question plus a 250-character preview of each candidate and have it select the top 12.

Two-stage pipeline: bi-encoder retrieval for recall (fast, approximate), cross-encoder reranking for precision (slower, accurate). Latency overhead: under one second.
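The second stage can be sketched as follows. The `score` callable is a hypothetical wrapper around whatever model judges relevance (a dedicated cross-encoder or an LLM prompt); the defaults of 250 characters and 12 results mirror the numbers above.

```python
from typing import Callable, List, Sequence


def rerank(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],  # hypothetical: (question, preview) -> relevance
    keep: int = 12,
    preview_chars: int = 250,
) -> List[str]:
    """Score each candidate's preview against the question and keep the best.

    The model sees only the first preview_chars characters of each chunk,
    which bounds the cost per candidate. Ties keep retrieval order.
    """
    scored = [
        (score(question, text[:preview_chars]), index)
        for index, text in enumerate(candidates)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidates[index] for _, index in scored[:keep]]
```

The candidates passed in would be the fused top 80 from the retrieval stage. The full chunk text, not the preview, is what goes into the final prompt.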

Problem 4: Chunk boundaries cut context

An answer might start in the middle of one chunk and continue into the next. If you split naively at fixed character counts, you'll miss it.

Fix: Split at semantic boundaries

Chunk at heading boundaries first. Only split further (at paragraph boundaries) when a section exceeds the embedding token limit. This preserves the natural structure of the document — the boundaries that an author put there for a reason.
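A rough sketch of that splitting rule, with two assumptions called out: the converted document uses Markdown-style `#` headings, and a word count stands in for the embedding model's real token limit. Both function names are illustrative.

```python
import re
from typing import List

# Assumption: headings look like "# Title" through "###### Title".
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_sections(text: str) -> List[str]:
    """Split text into sections, each starting at a heading line."""
    sections: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def chunk_document(text: str, max_words: int = 200) -> List[str]:
    """Keep whole sections when they fit; otherwise pack paragraphs up to max_words.

    Word count is a stand-in for a tokenizer. A single paragraph longer
    than max_words is left whole rather than cut mid-sentence.
    """
    chunks: List[str] = []
    for section in split_sections(text):
        if len(section.split()) <= max_words:
            chunks.append(section)
            continue
        current: List[str] = []
        count = 0
        for para in re.split(r"\n\s*\n", section):
            if not para.strip():
                continue
            n = len(para.split())
            if current and count + n > max_words:
                chunks.append("\n\n".join(current))
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

In practice you would swap the word count for the tokenizer that matches your embedding model, since the two can diverge badly on identifiers and legal citations.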

Also: some converters (including Azure Document Intelligence) generate spurious headings from exhibit labels, page headers, and transcript markers. Detect and absorb these into body text so they don't create useless one-line chunks.
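One way to absorb such headings is to demote them to plain text before chunking. The patterns below are purely hypothetical examples of what a spurious heading might look like (exhibit labels, page headers, Q/A transcript markers), and the sketch assumes Markdown-style `#` headings in the converter output; tune both to your own documents.

```python
import re

# Hypothetical patterns; real converter output will need its own list.
SPURIOUS = re.compile(
    r"^(exhibit\s+\w+|page\s+\d+(\s+of\s+\d+)?|[QA]\.?)\s*:?$",
    re.IGNORECASE,
)
HEADING_LINE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def demote_spurious_headings(text: str) -> str:
    """Strip the heading marker from lines whose title matches SPURIOUS.

    The title text is kept as ordinary body text, so it no longer
    starts a new section when the document is chunked.
    """
    out = []
    for line in text.splitlines():
        match = HEADING_LINE.match(line)
        if match and SPURIOUS.match(match.group(2)):
            out.append(match.group(2))
        else:
            out.append(line)
    return "\n".join(out)
```

Run this before any heading-based splitting so that the demoted lines fold into the surrounding section.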

Problem 5: Embeddings don't know which document they're from

A chunk that says "the plaintiff argues" means something different in a breach of contract case versus a criminal appeal. Without document context, the embedding model encodes the text in isolation.

Fix: Contextual embeddings

Prepend a brief contextual header to each chunk before embedding:

Document: employment_agreement_2024.pdf
Section: Termination Without Cause

The employee may be terminated without cause upon thirty (30) days written notice...

The embedding now encodes both the document identity and section context. Retrieval becomes more precise — especially in multi-document setups where the same legal language appears across many files.
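Building that header is a one-liner. A minimal sketch, with an illustrative function name; the field labels follow the example above.

```python
def with_context(chunk_text: str, document_name: str, section_title: str) -> str:
    """Return the text to embed: a two-line header, a blank line, then the chunk."""
    header = f"Document: {document_name}\nSection: {section_title}"
    return f"{header}\n\n{chunk_text}"
```

A reasonable design choice, though not one the technique requires, is to embed the contextualized text while storing the original chunk separately, so the header influences retrieval without being repeated in the prompt.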

None of these techniques is complicated on its own. The compound effect of all five is substantial. A pipeline with HyDE + hybrid search + reranking + intelligent chunking + contextual embeddings retrieves the right passage on the first try far more often than a naive top-k vector search. That difference is what makes the answers trustworthy.