
Your RAG is probably worse than you think. Three ways to find out.


The first time I shipped a RAG system, I knew it was working because I asked it three questions in the staging UI and the answers looked correct.

Reader, it was not working.

It took two weeks of users complaining before I went back to look. For roughly 30% of queries, the retriever was returning documents that had nothing to do with the question. The model was doing such a polite job of pretending the irrelevant context was relevant that the answers sounded fine. They were just quietly wrong.

This is the dirty secret of RAG: the failure mode is fluent confidence, not gibberish. Your eyeball test will pass. Your users won't.

Below are the three numbers I now look at before declaring a RAG system "working." None of them require an evaluation framework. You can set this up in an afternoon.

1. Retrieval recall at k

Before you ever look at the LLM's output, ask: did we even hand the model the right context?

For each test question in your eval set, you need to know which document(s) should have been retrieved. Then you measure: in the top-k chunks the retriever returned, was the right one in there?

This is recall@k.

If your retriever has 60% recall@5, the LLM only has a fighting chance on 60% of your queries. On the other 40% it's making things up to please you, and any generation-quality metric averaged over all queries will lie to you about how the system is doing.

How to build the eval set: write 50 questions you'd want the system to answer. For each, manually find the document(s) in your corpus that contain the answer. That's it. Fifty, not five thousand. You're trying to catch systemic bugs, not benchmark.
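
Computing the number is then a few lines. A minimal sketch, assuming a `retriever.search(query, k)` call that returns objects with an `.id` attribute (the same placeholder interface as the rig at the end of this post):

# Minimal recall@k over the hand-built eval set.
# `retriever` and `test_cases` are placeholders; swap in whatever your
# vector store and eval file actually look like.
def recall_at_k(test_cases, retriever, k=5):
    hits = 0
    for case in test_cases:
        retrieved_ids = {doc.id for doc in retriever.search(case["query"], k=k)}
        # Count a hit if any expected document made it into the top k.
        if retrieved_ids & set(case["expected_doc_ids"]):
            hits += 1
    return hits / len(test_cases)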

I've never built a RAG system where the first version had recall@5 above 80%. Usually it's around 50–65%. The fix is rarely "switch to a fancier embedding model." It's almost always chunking strategy or query rewriting.

2. Answer faithfulness

Once the right context is in the prompt, the LLM has to actually use it. Often it doesn't.

Faithfulness measures: of the claims the model made in its answer, how many are supported by the retrieved context? Anything not supported is a hallucination, even if it's true in some abstract sense.

The dumb-but-effective way to measure this: take the model's answer, split it into individual claims, and ask another LLM (use a strong one, GPT-4 class) "is this claim supported by this context, yes or no?" Aggregate the yeses.

You don't need a labeled dataset. You're using one model to judge another, which is fine for a directional metric. It catches the ugly cases.
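
Here's a rough sketch of that loop, assuming a `call_judge(prompt)` helper that sends a prompt to your strong judge model and returns its text reply — the helper, the sentence-level claim splitting, and the prompt wording are all placeholders you'd tune:

# Claim-by-claim faithfulness. `call_judge(prompt)` is a placeholder for
# however you call a GPT-4-class judge model and get its text back.
def faithfulness(answer: str, context: str, call_judge) -> float:
    # Naive claim splitting: one sentence per claim. Crude, but enough to start.
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        verdict = call_judge(
            f"Context:\n{context}\n\nClaim: {claim}\n\n"
            "Is the claim supported by the context? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            supported += 1
    return supported / len(claims)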

I've watched faithfulness drop from 92% to 70% after a "minor" prompt change that supposedly just made answers more concise. Concise, but making things up.

3. Failure-mode share

When the system gets a query wrong, why did it get it wrong? You want a histogram, not a single number.

In my experience, RAG failures fall into roughly four buckets:

  • Retrieval miss (correct doc wasn't in the top-k)

  • Hallucination (right context was there, model made things up anyway)

  • Refusal (model said "I don't know" when the answer was clearly in the context)

  • Wrong but plausible (model picked the wrong context chunk to ground in)

Tag every failure in your eval set with one of these. Then you know what to fix first. If 80% of your failures are retrieval misses, no amount of prompt engineering will save you. If 80% are hallucinations on good context, you have a model or temperature problem.
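
The tally itself is trivial once the tags exist. A sketch, assuming you've stored each failure as a dict with a hand-assigned `mode` field matching the four buckets above:

from collections import Counter

FAILURE_MODES = ["retrieval_miss", "hallucination", "refusal", "wrong_but_plausible"]

def failure_share(tagged_failures):
    # tagged_failures: list of dicts like {"query": ..., "mode": "retrieval_miss"}
    counts = Counter(f["mode"] for f in tagged_failures)
    total = sum(counts.values()) or 1
    # Share of failures per bucket -- the histogram that tells you what to fix first.
    return {mode: counts.get(mode, 0) / total for mode in FAILURE_MODES}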

Most teams I've watched skip this step and end up "improving" the wrong thing for weeks.

Putting it together

Your minimum viable eval rig:

# 50 hand-written test cases
test_cases = [
    {
        "query": "What's the dosing for X in patients over 70?",
        "expected_doc_ids": ["protocol-2024-x", "geriatric-supplement-v3"],
        "answer_must_mention": ["reduce by 50%", "monitor renal function"],
    },
    # ... 49 more
]

# For each case. `retriever`, `llm`, `judge_llm`, and `log` are whatever
# thin wrappers you already have around your vector store, generation
# model, judge model, and results store.
for case in test_cases:
    retrieved = retriever.search(case["query"], k=5)
    answer = llm.answer(case["query"], retrieved)

    metrics = {
        "recall@5": any(d.id in case["expected_doc_ids"] for d in retrieved),
        "answer_mentions": all(s in answer for s in case["answer_must_mention"]),
        "faithfulness": judge_llm.is_supported(answer, retrieved),
    }
    log(case, retrieved, answer, metrics)

This is not pretty. It is not a framework. It will not scale to a hundred thousand queries. It will, however, tell you whether your system is broken in five minutes. That's almost always enough.
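
Turning those per-case logs into the headline numbers is one line per metric. A sketch, assuming `results` is the list of `metrics` dicts logged in the loop above:

# Aggregate per-case results into the numbers worth watching.
def summarize(results):
    n = len(results)
    return {
        "recall@5": sum(r["recall@5"] for r in results) / n,
        "faithfulness": sum(r["faithfulness"] for r in results) / n,
        "answer_mentions": sum(r["answer_mentions"] for r in results) / n,
    }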

The actual hard part

Writing the 50 test cases is the hard part. Most teams skip it because it feels unglamorous, ship anyway, and then spend the next quarter doing prompt engineering in the dark.

Don't do that. Spend the afternoon writing the test set. Run the three numbers. Then ship.

