
The 7 RAG Anti-Patterns That Will Tank Your System Design Interview


RAG — retrieval-augmented generation — is now the most common LLM system design question in senior and staff loops at AI-first companies. It shows up as "design a document Q&A system," "build an internal knowledge assistant," or "walk me through how you'd add search to our existing LLM product."

Most candidates can describe the happy path: chunk the docs, embed them, store them in a vector database, retrieve the top-k at query time, stuff them into context, generate an answer. That's table stakes. It gets you to a "meets the bar" answer at best.
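
For reference, that happy path fits in a couple of dozen lines. The sketch below is deliberately naive: a stubbed-out embed() stands in for a real embedding model, a plain in-memory list stands in for the vector database, and every name in it is illustrative.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stub embedding: a real system calls an embedding model or API here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def chunk(doc: str, size: int = 512) -> list[str]:
    """Naive fixed-size chunking -- the anti-pattern discussed in point 1."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

docs = ["...long internal wiki page...", "...runbook for the payments service..."]
index = [(c, embed(c)) for d in docs for c in chunk(d)]  # the "vector store"

def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Stuff the top-k chunks into context and hand the prompt to the LLM.
context = "\n\n".join(retrieve("How do I roll back a payments deploy?"))
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: ..."
```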

The staff-level question is: where does this break, and what do you do about it?

Here are the seven failure modes that separate a "strong hire" answer from a merely "hire" answer.

1. Treating Chunking as an Afterthought

Chunking strategy has more impact on retrieval quality than almost any other decision in the pipeline — and candidates almost never mention it proactively.

Fixed-size chunking (e.g., 512 tokens every time) will split sentences, destroy context, and degrade retrieval precision. The right answer depends on the document structure: recursive character splitting for unstructured prose, section-aware splitting for technical docs, sentence-level chunking for dense reference material.

What interviewers want to hear: "The chunking strategy is the first place I'd tune, because it directly controls what unit the embedding model is scoring relevance against."
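
To make the contrast concrete, here is a from-scratch sketch of the recursive idea: split on the coarsest separator available and fall back to finer ones only when a piece is still too long. It is not any particular library's splitter, and a section-aware variant would simply put headings first in the separator list.

```python
def recursive_split(text, max_len=512,
                    separators=("\n\n", "\n", ". ", " ")):
    """Recursive character splitting: prefer paragraph breaks, then lines,
    then sentences, then words, so chunks follow the document's structure."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = f"{current}{sep}{piece}" if current else piece
            if len(candidate) <= max_len:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(piece) <= max_len:
                    current = piece
                else:
                    # A single piece is still too long: recurse with finer separators.
                    chunks.extend(recursive_split(piece, max_len, separators))
                    current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator left to split on: hard cut as a last resort.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```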

2. Using the Wrong Embedding Model for the Domain

General-purpose embeddings (OpenAI text-embedding-3-small, BGE, etc.) work reasonably well out of the box. But for domain-specific corpora — legal documents, biomedical literature, internal codebases — they fail in ways that are hard to detect without deliberate evaluation.

Strong candidates know that domain mismatch shows up as high precision@1 on easy queries and catastrophic failures on edge cases. They'd propose domain-specific fine-tuning or a task-specific model where the use case justifies it.
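
One way to surface that gap before shipping is to slice the golden set and measure precision@1 per slice, since aggregate numbers hide the failure. A minimal sketch; embed_fn is a stand-in for whatever model is being evaluated (a sentence-transformers encoder, an embeddings API, or a fine-tuned checkpoint) and is assumed to return unit-normalized vectors.

```python
import numpy as np

def precision_at_1(embed_fn, queries, chunks, gold_idx):
    """Fraction of queries whose top-1 chunk by cosine similarity is the labeled one.

    embed_fn: maps a list of strings to an (n, d) array of unit vectors.
    gold_idx: for each query, the index of its gold chunk in `chunks`.
    """
    q, c = embed_fn(queries), embed_fn(chunks)
    top1 = (q @ c.T).argmax(axis=1)
    return float((top1 == np.asarray(gold_idx)).mean())

# Report per slice, not overall: a domain-mismatched model can look fine on
# paraphrase-style queries and collapse on jargon-heavy edge cases.
# precision_at_1(embed_fn, easy_queries, chunks, easy_gold)
# precision_at_1(embed_fn, hard_queries, chunks, hard_gold)
```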

3. No Re-Ranking Layer

Top-k vector retrieval optimizes for semantic similarity, not answer quality. The most semantically similar chunk is not always the most useful one for generating a correct response. A re-ranking model (cross-encoder, Cohere Rerank, a learned ranker) scores each retrieved chunk against the query more precisely, at the cost of latency.
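
As one concrete wiring, the sketch below over-fetches from the vector store and lets a cross-encoder score each (query, chunk) pair jointly. It assumes the sentence-transformers library and a public MS MARCO checkpoint; Cohere Rerank or a custom learned ranker slots in the same way.

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint; any cross-encoder re-ranking model works here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Score each (query, chunk) pair jointly and keep the best few.

    `candidates` should be over-fetched (e.g. top-50 by vector similarity),
    since the re-ranker can only promote chunks that retrieval surfaced.
    """
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```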

Candidates who skip this layer — and don't acknowledge the tradeoff — signal they've only read about RAG at the surface level.

4. Ignoring Retrieval Evaluation Metrics

"How do you know if your RAG pipeline is working?" is a question that ends many loops early.

The right answer covers: precision@k and recall@k on a golden query set, MRR (mean reciprocal rank), and end-to-end answer quality metrics (ROUGE, BERTScore, or ideally LLM-as-judge on a labeled set). Candidates who can only say "we'd do some manual testing" haven't shipped a production retrieval system.
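
These are cheap to compute once the golden set exists. A minimal sketch over plain Python lists, where retrieved[i] is the ranked list of chunk IDs returned for query i and relevant[i] is its labeled gold set (both names are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Mean fraction of the top-k results that are labeled relevant."""
    return sum(len(set(r[:k]) & gold) / k
               for r, gold in zip(retrieved, relevant)) / len(retrieved)

def recall_at_k(retrieved, relevant, k):
    """Mean fraction of each query's gold chunks that appear in its top-k."""
    return sum(len(set(r[:k]) & gold) / max(len(gold), 1)
               for r, gold in zip(retrieved, relevant)) / len(retrieved)

def mrr(retrieved, relevant):
    """Mean reciprocal rank of the first relevant chunk per query (0 if none)."""
    ranks = (next((1 / (i + 1) for i, doc in enumerate(r) if doc in gold), 0.0)
             for r, gold in zip(retrieved, relevant))
    return sum(ranks) / len(retrieved)
```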

5. No Hybrid Search

Dense retrieval (vector similarity) is excellent at semantic matching and poor at exact keyword matching. BM25 (sparse retrieval) is the opposite. Hybrid search — combining both with a weighted or learned fusion — consistently outperforms either alone, especially for queries that mix semantic intent with specific terminology.
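
Reciprocal rank fusion is one common, low-effort way to combine the two, since it sidesteps normalizing incompatible score scales; a weighted or learned score fusion is the heavier alternative. A sketch assuming each retriever returns a ranked list of chunk IDs:

```python
def reciprocal_rank_fusion(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists (vector top-n and BM25 top-n) by summing
    1 / (k + rank) per document; k=60 is the commonly used default."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion(vector_top_n, bm25_top_n)[:10]
```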

Omitting hybrid search from a design is fine if you acknowledge it's a deliberate tradeoff (latency, complexity) rather than an oversight.

6. No Latency Budget

A production RAG pipeline has multiple latency contributors: embedding the query, retrieval from the vector store, re-ranking, context stuffing, and LLM generation. Staff engineers know these rough numbers and can reason about where to parallelize, cache, or cut.

Candidates who treat the pipeline as a black box — "it should be fast enough" — fail the systems thinking bar. Walk through the latency budget: "embedding is ~10ms, retrieval ~50ms, re-ranking adds another 100ms if synchronous — here's where I'd consider async pre-fetching."
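
A toy sketch of that reasoning, with asyncio sleeps standing in for each stage. The millisecond figures are the ballparks above plus an assumed ~20 ms for a BM25 lookup; the point is that running the dense path (embed, then vector search) and the sparse path concurrently takes the sparse leg off the critical path, leaving the re-ranker as the obvious next target to cache, cut, or pre-fetch.

```python
import asyncio
import time

# Stand-in stages; the sleeps encode rough budgets, not measurements.
async def embed_query(q):
    await asyncio.sleep(0.010)   # ~10 ms
    return [0.0]

async def vector_search(v):
    await asyncio.sleep(0.050)   # ~50 ms
    return ["chunk-a", "chunk-b"]

async def bm25_search(q):
    await asyncio.sleep(0.020)   # assumed ~20 ms
    return ["chunk-b", "chunk-c"]

async def rerank(q, candidates):
    await asyncio.sleep(0.100)   # ~100 ms if synchronous
    return candidates[:3]

async def retrieve(q):
    async def dense_path():
        return await vector_search(await embed_query(q))
    # Dense and sparse paths run concurrently; rerank once over the fused set.
    dense, sparse = await asyncio.gather(dense_path(), bm25_search(q))
    return await rerank(q, list(dict.fromkeys(dense + sparse)))

start = time.perf_counter()
asyncio.run(retrieve("how do refunds work?"))
print(f"{(time.perf_counter() - start) * 1000:.0f} ms")  # ~160 ms vs ~180 ms sequential
```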

7. No Grounding or Hallucination Controls

RAG reduces hallucination by anchoring generation to retrieved context — but it doesn't eliminate it. If the relevant context isn't retrieved (a recall failure), the model will generate an answer anyway, often confidently wrong.

Strong answers address this directly: citation grounding (requiring the model to attribute claims to retrieved chunks), confidence scoring, fallback to "I don't know" when retrieval confidence is low, and human-in-the-loop escalation for high-stakes domains.
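
A minimal sketch of the refusal-plus-citation pattern, assuming the re-ranker exposes a relevance score per chunk and that llm is whatever generation client the system already uses; the 0.35 threshold is purely illustrative and should be tuned against a labeled set.

```python
REFUSAL = "I don't know based on the available documents."

def answer(query: str, retrieved: list[tuple[str, float]], llm,
           min_score: float = 0.35) -> str:
    """Gate generation on retrieval confidence and require citations.

    retrieved: (chunk_text, relevance_score) pairs from the re-ranker.
    llm: callable(prompt) -> str, standing in for the model client.
    """
    confident = [(c, s) for c, s in retrieved if s >= min_score]
    if not confident:
        return REFUSAL  # recall failure: refuse instead of guessing
    sources = "\n\n".join(f"[{i + 1}] {c}" for i, (c, _) in enumerate(confident))
    prompt = (
        "Answer using ONLY the numbered sources below and cite each claim as [n]. "
        f"If the sources are insufficient, reply exactly: {REFUSAL}\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
    return llm(prompt)
```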


Getting all seven right in a 45-minute loop is a high bar. But knowing these failure modes — and being able to state which ones you'd address first given the constraints of the specific problem — is exactly what separates a strong hire signal from a generic architecture walkthrough.

Prep for questions like these with GradientCast — see our plans. Staff-level ML system design walkthroughs and behavioral answers, built by engineers who run these loops every week.
