
The 7 RAG Anti-Patterns That Quietly Tank ML System Design Interviews


Retrieval-augmented generation now appears in nearly every machine learning system design loop. The reason is simple: it is the architecture most teams reach for when they put a language model into production, so it is a fair proxy for whether a candidate has built something real. Most candidates can describe the happy path without much trouble: embed the documents, store the vectors, retrieve the top-k by similarity, place them in the prompt, generate an answer. That description carries almost no signal, because everyone gives it.

The distance between an average answer and a strong one lives entirely in the failure modes. An interviewer is listening for whether you know where a RAG system breaks, why it breaks there, and what you would measure to catch it before users do. The seven anti-patterns below are the ones that come up most often in retrieval system design, along with what a strong answer does instead.

1. Treating chunking as an afterthought

Fixed-size character chunking is the default in most tutorials, and it is usually the first thing to break. Splitting a document every 500 characters cuts through sentences, separates a claim from the sentence that qualifies it, and detaches a table from the header that gives its columns meaning. Retrieval then returns fragments that look plausible in isolation and are useless in combination.

Chunk along the structure of the source instead. Split on section and paragraph boundaries for prose, and on function or class boundaries for code. Add overlap between adjacent chunks so a concept that straddles a boundary is not lost. Treat chunk size as a parameter tied to two things: the granularity of the questions users ask, and the maximum input length of the embedding model you are using, since text beyond that limit is typically truncated before it is embedded.
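A minimal sketch of paragraph-aligned chunking with overlap, assuming paragraphs are separated by blank lines. The function name `chunk_paragraphs` and the defaults for `max_chars` and `overlap` are illustrative choices, not recommendations. A real system would more likely count tokens than characters.

```python
import re


def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars.

    The last `overlap` paragraphs of each chunk are repeated at the start of
    the next one. A paragraph is never split, so a single paragraph longer
    than max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        if current and len("\n\n".join(current + [para])) > max_chars:
            chunks.append("\n\n".join(current))
            carried = current[-overlap:] if overlap > 0 else []
            # Keep the overlap only if it still leaves room for the new paragraph.
            if len("\n\n".join(carried + [para])) > max_chars:
                carried = []
            current = carried
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The same packing loop works for code if the split step yields functions or classes instead of paragraphs.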

A strong answer often goes one step further and mentions parent-document retrieval, where the system retrieves small, precise chunks for matching but returns the larger surrounding section to the model for context. The underlying point is that retrieval quality is capped by chunk quality. No reranker recovers information that chunking has already thrown away, and saying that out loud signals you understand the dependency.
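Parent-document retrieval can be sketched as a lookup step after the vector search. The identifiers below (`chunk_to_parent`, `parents`, `expand_to_parents`) are hypothetical. The sketch assumes the index stores, for each small chunk, the id of the section it came from.

```python
def expand_to_parents(
    hit_chunk_ids: list[str],
    chunk_to_parent: dict[str, str],
    parents: dict[str, str],
) -> list[str]:
    """Map matched chunk ids to their parent sections.

    Preserves the ranking order of the hits and returns each parent once,
    even when several of its chunks matched.
    """
    seen: set[str] = set()
    sections: list[str] = []
    for chunk_id in hit_chunk_ids:
        parent_id = chunk_to_parent[chunk_id]
        if parent_id not in seen:
            seen.add(parent_id)
            sections.append(parents[parent_id])
    return sections
```

Deduplicating matters here: two hits from the same section should not spend the context budget twice.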

2. Using a general embedding model on a specialized domain

An embedding model that scores well on general web text can perform poorly on legal contracts, clinical notes, or source code. The reason is that similarity in the model's embedding space does not necessarily correspond to relevance in the domain. Two clauses that a lawyer would treat as opposites can sit close together in a general embedding space because they share vocabulary.

The fix is to stop trusting public leaderboards as a substitute for evaluation. Benchmark two or three candidate models on a labeled slice of your own data before committing to one. When the gap between a general model and the domain is large, consider a domain-adapted or fine-tuned embedding model, trained with contrastive learning on pairs drawn from the target corpus. Code retrieval, long-document retrieval, and multilingual retrieval each tend to favor different models, and treating them as one problem is a common mistake.

This is also where a candidate can show awareness of cost. Higher-dimensional embeddings can improve recall, but storage grows linearly with the dimension and every distance computation gets more expensive, which shows up as query latency. Mentioning that tradeoff, rather than reaching for the largest model by default, is the kind of judgment the round is testing for.
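The storage side of that tradeoff is simple arithmetic. The sketch below assumes raw float32 vectors (4 bytes per value) with no index overhead and no quantization. The corpus size of 10 million chunks and the three dimensions are made-up inputs for illustration.

```python
GIB = 1024 ** 3


def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only: excludes graph/index structures and metadata."""
    return num_vectors * dim * bytes_per_value


for dim in (384, 768, 1536):
    size = index_size_bytes(10_000_000, dim)
    print(f"{dim:>5} dims: {size / GIB:.1f} GiB")
# Prints:
#   384 dims: 14.3 GiB
#   768 dims: 28.6 GiB
#  1536 dims: 57.2 GiB
```

Doubling the dimension doubles the footprint, so the question to ask is whether the measured recall gain on your own data justifies it.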

3. Skipping the reranking stage

First-stage retrieval is fast because it uses a bi-encoder, which embeds the query and each document independently. Document vectors can be computed ahead of time and stored in an approximate nearest neighbor index, so a query costs one embedding call plus an index lookup. That speed comes at a cost. Cosine similarity in embedding space is a rough approximation of relevance, and the raw top-k by vector distance is not the same as the top-k by usefulness.

Strong RAG systems use two stages. The first is cheap, high-recall retrieval that pulls a candidate set of perhaps fifty to a hundred documents. The second is a cross-encoder reranker that scores each candidate against the query jointly, attending to both at once, and reorders them before anything reaches the model. The cross-encoder is far more accurate than the bi-encoder and far too expensive to run across the whole corpus, which is exactly why it belongs on the small candidate set rather than the full index.
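The control flow of the two stages can be sketched without any model at all. In the sketch below, `token_overlap` is a toy stand-in for vector similarity and `phrase_aware` is a toy stand-in for a cross-encoder. Both are hypothetical placeholders, as is the function name `retrieve_then_rerank`. What matters is that the expensive scorer only ever sees the candidate set.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    n_candidates: int = 50,
    k: int = 5,
) -> list[str]:
    # Stage 1: cheap scorer over the whole corpus, tuned for recall.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: expensive scorer over the candidates only, tuned for precision.
    reranked = sorted(
        candidates, key=lambda doc: expensive_score(query, doc), reverse=True
    )
    return reranked[:k]


def token_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: fraction of query tokens present in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def phrase_aware(query: str, doc: str) -> float:
    """Toy second-stage scorer: rewards the query appearing as a phrase."""
    if query.lower() in doc.lower():
        return 1.0
    return 0.5 * token_overlap(query, doc)
```

In a real pipeline the first stage would be an index lookup rather than a full sort, and the second stage would batch the query-document pairs through the reranker.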

Naming this division of labor, high recall in the first stage and high precision in the second, is one of the clearest markers of a senior answer. Candidates who treat vector search as both retrieval and ranking are describing a system that will surface near-misses and rank them confidently.

4. Building it without retrieval metrics

If the only thing a team measures is the quality of the final answer, there is no way to tell whether a bad answer came from bad retrieval or bad generation. The two failure types need different fixes, and a system whose failures you cannot attribute is a system you cannot improve.

Before optimizing the generator, build a small labeled evaluation set of queries paired with the documents that should be retrieved for them, and measure the retriever directly. Precision@k and recall@k tell you how much of what you returned was relevant and how much of what was relevant you returned. A rank-aware metric such as mean reciprocal rank or normalized discounted cumulative gain tells you whether the relevant results are near the top, which matters when only the first few chunks fit in the prompt. Evaluate retrieval and generation as separate stages with separate numbers.
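These metrics are short enough to write by hand. The sketch below assumes binary relevance (a document is either in the labeled set or not) and that `retrieved` is an ordered list of document ids. The function names are illustrative.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(retrieved[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Reporting these per query, not only as averages, makes it easier to find the specific queries where the retriever fails.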

When a candidate cannot say how they would measure the retriever, they are describing a system they will only be able to debug by intuition. A practical detail that lands well is generating part of the evaluation set synthetically — using a model to write plausible questions from known chunks, then having humans verify a sample.

5. Going pure dense and dropping lexical search

Dense vector retrieval is strong at capturing meaning and weak at exact matching. It misses rare tokens, internal identifiers, error codes, product names, and acronyms, and queries built around those are precisely the ones where users expect an exact hit. A user searching for a specific error code does not want the semantically nearest paragraph. They want the paragraph that contains that code.

Hybrid retrieval addresses this by running a dense vector search and a sparse lexical method such as BM25 in parallel, then fusing the two ranked lists, often with reciprocal rank fusion. Dense embeddings and lexical search fail on different inputs, and that complementarity is the entire reason to run both rather than choosing one. A candidate who can name the specific query types where dense retrieval falls short, rather than treating vectors as a universal solution, is demonstrating that they have watched a real system disappoint real users.
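Reciprocal rank fusion needs only the ranked lists, not the raw scores, which is why it is convenient for combining a cosine similarity with a BM25 score. A minimal sketch follows. The constant `k=60` is the value commonly used for RRF and is a tunable assumption here, and the document ids in the usage note are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids: each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For example, fusing a dense ranking `["a", "b", "c"]` with a lexical ranking `["c", "a", "d"]` puts `a` and `c` first, since both retrievers returned them, ahead of `b` and `d`, which only one retriever found.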

6. Designing with no latency budget

Every stage adds latency. Embedding the query, searching the index, reranking the candidates, and generating the answer all take time, and multi-hop retrieval or large retrieved contexts compound it. An answer that optimizes purely for accuracy and never states a latency target is incomplete, because a retrieval system that takes eight seconds to answer is a different product from one that answers in eight hundred milliseconds, regardless of how relevant the results are.

State the budget at the start and allocate it across the stages. Then discuss the levers for staying inside it: caching answers to frequent queries, caching at the semantic level so near-duplicate queries reuse work, using a smaller reranker, capping the amount of retrieved context passed to the model, tuning the nearest neighbor index for a recall and speed point that fits the budget, and running independent stages asynchronously. Treating latency as a first-class constraint rather than an afterthought is one of the strongest signals that a candidate has shipped something rather than only read about it.
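Stating the budget can be as concrete as a table of per-stage allocations that measured latencies are checked against. The stage names and numbers below are hypothetical, chosen only so that they sum to an 800 ms target. A real allocation would come from profiling.

```python
# Hypothetical per-stage allocation in milliseconds (sums to 800).
BUDGET_MS = {
    "embed_query": 30,
    "ann_search": 50,
    "rerank": 120,
    "generate": 600,
}


def over_budget(
    measured_ms: dict[str, float], budget_ms: dict[str, float] = BUDGET_MS
) -> dict[str, float]:
    """Return each stage that exceeded its allocation and by how many ms."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if measured_ms.get(stage, 0) > limit
    }
```

A check like this run against p95 or p99 latencies, not averages, points directly at the stage whose lever needs pulling.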

7. Assuming retrieval prevents hallucination

Retrieving the correct context does not force the model to use it. The model can ignore the retrieved passages, blend them with what it learned during pretraining, or cite the wrong source for a correct fact. Retrieval reduces hallucination. It does not eliminate it, and a design that assumes otherwise has no answer for the case where retrieval returns nothing useful.

Grounding has to be engineered. Constrain the model to answer from the retrieved context and to say when the context does not contain the answer. Attach citations to claims and verify that the cited passage actually supports the claim, rather than trusting the model's own attribution. Measure faithfulness — which captures whether the generated answer is entailed by the retrieved evidence — as a metric distinct from whether the answer is relevant. Let the system abstain when retrieval confidence is low, because a refusal is recoverable and a confident fabrication is not. The failure case worth designing against is the answer that is fluent, well formatted, and wrong.
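The abstention rule can be sketched as a gate in front of the generator. This assumes each retrieved passage carries a relevance score in the range 0 to 1, for instance from a reranker. The threshold `min_score`, the parameter `min_passages`, and the function name are hypothetical and would need calibrating on a labeled evaluation set.

```python
def select_context_or_abstain(
    scored_passages: list[tuple[str, float]],
    min_score: float = 0.5,
    min_passages: int = 1,
) -> dict:
    """Keep passages at or above min_score; abstain if too few survive."""
    supported = [passage for passage, score in scored_passages if score >= min_score]
    if len(supported) < min_passages:
        # Nothing trustworthy to ground on: refuse rather than let the
        # generator answer from pretraining alone.
        return {"abstain": True, "context": []}
    return {"abstain": False, "context": supported}
```

This gate covers only the retrieval side. Checking that a cited passage actually supports a claim is a separate step that runs after generation.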

The pattern underneath all seven

These anti-patterns look like seven separate topics, but they share a single root. Naming the components of a RAG pipeline is table stakes, and an interviewer assumes you can do it. The signal they are actually grading is whether you understand where each component fails, how those failures interact, and what you would measure to find them. A candidate who can walk through a pipeline and point to its weak joints is describing a system they have operated. A candidate who can only describe the happy path is describing one they have read about.

Prepare for the failure modes, not the diagram. The diagram is the easy half, and it is the half everyone else has already covered.

