The Good Ol' "Design a Recommendation Feed" Question: What Separates a Hire from a No-Hire Answer
The request to design a recommendation feed appears in more machine learning system design loops than any other question, for a practical reason. Almost every product that reaches scale ends up ranking something, whether that is a feed, a set of search results, a list of products, or a queue of notifications. The question is a fair proxy for whether a candidate has built and operated a real ranking system, rather than only read about one.
Most candidates can produce the standard architecture within a few minutes. Candidate generation narrows millions of items down to a few hundred, a ranking model scores those few hundred, and the highest-scoring results are served. That description is correct, and it is also the point at which most answers stop being distinguishable from one another. The decisions that move an interview outcome live in the reasoning around the diagram, not in the diagram itself. What follows is a walk through the places where strong answers separate from average ones.
Establish the problem before drawing the architecture
The weakest way to open is to start sketching components immediately. The strongest candidates spend the first few minutes deciding what the system is for. The dimensions worth pinning down are the surface and its objective, the size and turnover of the content pool, the scale of the user base, the latency budget, and what a good recommendation actually means in this product.
These choices are not cosmetic. A short-video feed optimizing for time spent, a marketplace optimizing for completed purchases, and a professional network optimizing for meaningful connections share a high-level shape and almost nothing at the objective level. The feature sets differ, the labels differ, the acceptable latency differs, and the definition of a harmful outcome differs. A candidate who fixes these before designing is solving the right problem. A candidate who skips them is liable to design something generic that answers no specific question well.
A useful habit is to separate the functional requirements (what the system returns and how fast) from the modeling objective (what it is trying to maximize). Conflating the two is a frequent source of muddled answers.
The objective is the hardest part of the problem
Choosing what to optimize is harder than choosing a model, and it is where the most revealing differences appear. Optimizing for raw clicks or watch time looks like the obvious move and is a trap. A model rewarded purely for clicks will learn to surface clickbait, because clickbait earns clicks. A model rewarded purely for watch time will learn to favor content that holds attention regardless of whether the user is glad to have watched it.
The central difficulty in recommendation systems is the gap between the proxy a team can measure cheaply, such as clicks and dwell time, and the outcome it actually wants, such as a user who returns next week and trusts the product. Strong candidates treat that gap as the core of the problem rather than a footnote. They describe multi-objective ranking that combines several signals: positive ones like clicks, completions, and shares, and negative ones like hides, skips, and reports. They bring in explicit signals, such as occasional satisfaction surveys, as a check on cheap implicit signals. They name guardrail metrics, like long-term retention and session quality, that a launch is not allowed to harm even if the primary metric improves.
The framing that lands well is reward hacking. Any single proxy metric, optimized hard enough, will be gamed by the model in ways that diverge from the true goal, so the design has to anticipate that rather than discover it in production. Candidates who never interrogate the objective are the ones most likely to build something that wins this week's metric and loses the user over the quarter.
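The multi-objective combination described in this section can be sketched as a weighted sum of predicted signals. This is a minimal illustration, not a recommended formula: the `Predictions` fields, the weights, and the two example items are all invented for the sketch, and real weights would be tuned against online experiments and guardrail metrics.

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    """Hypothetical per-item model outputs, each a probability in [0, 1]."""
    p_click: float
    p_complete: float
    p_share: float
    p_hide: float
    p_report: float


# Illustrative weights only (assumed, not taken from any real system).
# Negative signals carry negative weights so they pull the score down.
WEIGHTS = {
    "p_click": 1.0,
    "p_complete": 2.0,
    "p_share": 3.0,
    "p_hide": -4.0,
    "p_report": -20.0,
}


def blended_score(pred: Predictions) -> float:
    """Weighted sum of predicted positive and negative signals."""
    return sum(weight * getattr(pred, name) for name, weight in WEIGHTS.items())


# A clickbait-like item: high click probability, but often hidden or reported.
clickbait = Predictions(p_click=0.30, p_complete=0.05, p_share=0.01,
                        p_hide=0.10, p_report=0.02)
# A less clickable item that people finish and rarely hide.
solid = Predictions(p_click=0.15, p_complete=0.12, p_share=0.03,
                    p_hide=0.01, p_report=0.001)

for name, pred in [("clickbait", clickbait), ("solid", solid)]:
    print(name, round(blended_score(pred), 3))
```

Under these made-up numbers the item with the higher click probability ends up with the lower blended score, which is the behavior a clicks-only objective cannot produce.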
Candidate generation and ranking are necessary, not sufficient
The two-stage structure (recall-oriented retrieval followed by precision-oriented ranking) is the part of the answer that has to be correct and earns little credit beyond correctness. Candidate generation usually blends several sources: collaborative-filtering signals, a two-tower embedding model whose item embeddings are searched with approximate nearest neighbor lookup, recent and trending items, and content-based matches. The ranking stage applies a heavier model with a richer feature set to the few hundred survivors. A re-ranking or policy layer on top often handles diversity and business rules.
The practical advice for an interview is to state this cleanly and keep moving. Spending twenty minutes detailing a two-tower retrieval model while never addressing the objective or the evaluation strategy is a common way to spend the budget on the half of the problem that carries the least signal.
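For reference, the whole two-stage flow fits in a few lines, which is part of why it carries so little signal. The sketch below assumes retrieval sources and the ranking model are plain callables; `recommend`, `trending`, `similar`, and the score table are hypothetical stand-ins, not any particular system's API.

```python
import heapq
from typing import Callable, Iterable


def recommend(
    user_id: str,
    sources: Iterable[Callable[[str], list[str]]],
    score: Callable[[str, str], float],
    k: int = 10,
) -> list[str]:
    """Union candidates from several retrieval sources, then rank the survivors."""
    candidates: set[str] = set()
    for source in sources:
        # Recall-oriented stage: each source is cheap and broad.
        candidates.update(source(user_id))
    # Precision-oriented stage: the heavier scorer only sees the deduplicated pool.
    return heapq.nlargest(k, candidates, key=lambda item: score(user_id, item))


# Toy stand-ins for real retrieval sources and a real ranking model.
def trending(user_id: str) -> list[str]:
    return ["a", "b", "c"]


def similar(user_id: str) -> list[str]:
    return ["c", "d"]


toy_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}
top = recommend("u1", [trending, similar], lambda user, item: toy_scores[item], k=3)
print(top)
```

A re-ranking or policy layer would take `top` as its input.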
Position bias is the data trap most candidates miss
Training a ranking model on logged interaction data carries a problem inside it. Items shown at the top of the feed receive more clicks because they were at the top, not necessarily because they were more relevant. A model trained naively on that data learns to reproduce the ranking that generated the logs rather than to improve on it. The system gets very good at predicting what the old system would have shown.
Naming this is a strong signal, and proposing a remedy is stronger. Inverse propensity weighting reweights each training example by the inverse of its propensity, meaning the probability that the item was seen at the position where the logging policy placed it, so that clicks earned in low-exposure positions count for more. Another approach injects a small amount of randomization into the serving policy so the logs contain unbiased exposure for a fraction of traffic. A candidate who raises position bias has demonstrated that they think about where labels come from, which is a more advanced concern than which model architecture to use. The related issue of selection bias, namely that the system only ever observes outcomes for items it chose to show, is worth a sentence as well.
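A small worked example makes the position-bias correction concrete. The sketch assumes a simple position-based model in which a click requires the user to examine the slot, and it assumes the examination propensities are already known (in practice they would be estimated, for instance from randomized traffic). The propensity table, the log format, and `debiased_ctr` are all hypothetical.

```python
from collections import defaultdict

# Assumed probability that a user examines each feed position (illustrative values).
PROPENSITY = {1: 1.0, 2: 0.5, 3: 0.25}


def debiased_ctr(logs, propensity, max_weight=10.0):
    """Estimate a per-item click rate corrected for position.

    logs: iterable of (item_id, position, clicked) with clicked in {0, 1}.
    Each click is weighted by 1 / propensity of its position; the weight is
    clipped at max_weight to keep rare low-propensity clicks from dominating.
    """
    weighted_clicks = defaultdict(float)
    impressions = defaultdict(int)
    for item, position, clicked in logs:
        weight = min(1.0 / propensity[position], max_weight)
        weighted_clicks[item] += weight * clicked
        impressions[item] += 1
    return {item: weighted_clicks[item] / impressions[item] for item in impressions}


# Item A was always shown in slot 1, item B always in slot 3.
logs = (
    [("A", 1, 1)] * 20 + [("A", 1, 0)] * 80
    + [("B", 3, 1)] * 10 + [("B", 3, 0)] * 90
)
estimates = debiased_ctr(logs, PROPENSITY)
print(estimates)
```

The raw click rates are 0.20 for A and 0.10 for B, but once B's clicks are weighted by 4 (the inverse of 0.25) its estimate rises to 0.40. The same weights could be passed as per-example weights to a ranking loss instead of being averaged.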
Offline metrics do not equal online results
A model that improves offline can lose in a live experiment, and a model that looks flat offline can win. Offline metrics such as AUC, NDCG, and recall at k are useful as a cheap filter to decide which candidates are worth testing, but they are not the source of truth. The source of truth for a recommender is an online experiment that measures real user behavior over a meaningful window, with guardrail metrics watched alongside the primary one.
Strong candidates state this relationship explicitly and treat A/B testing as the decision mechanism. The most advanced answers mention techniques that reduce the cost of getting there, such as off-policy or counterfactual evaluation that estimates how a new policy would have performed using logged data from the old one, and interleaving experiments that compare two rankers on the same user with far less traffic than a standard A/B test. A candidate who presents an offline AUC improvement as proof of success, with no mention of online evaluation, is describing a workflow they have not run end to end.
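The off-policy idea mentioned above can be illustrated with the basic inverse propensity scoring estimator. This is a sketch under strong assumptions: the logging policy's action probabilities were recorded, every action the new policy takes had nonzero probability under the old one, and the log format and function names are invented for the example.

```python
def ips_value(logs, new_policy_prob, max_weight=20.0):
    """Estimate the average reward a new policy would have earned.

    logs: list of (context, action, reward, logging_prob), where logging_prob
          is the probability the old policy gave to the action it took.
    new_policy_prob(context, action): probability the new policy takes that action.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Importance weight, clipped to limit variance.
        weight = min(new_policy_prob(context, action) / logging_prob, max_weight)
        total += weight * reward
    return total / len(logs)


# The old policy showed item "x" or "y" uniformly at random (probability 0.5 each).
logs = (
    [("ctx", "x", 1, 0.5)] * 10 + [("ctx", "x", 0, 0.5)] * 40
    + [("ctx", "y", 1, 0.5)] * 30 + [("ctx", "y", 0, 0.5)] * 20
)


def always_y(context, action):
    """Candidate policy that always shows item "y"."""
    return 1.0 if action == "y" else 0.0


logged_value = sum(reward for _, _, reward, _ in logs) / len(logs)
estimated_value = ips_value(logs, always_y)
print(logged_value, estimated_value)
```

The old policy earned 0.4 clicks per impression, and the estimate for the always-show-y policy is 0.6, matching y's observed click rate of 30 out of 50. An estimate like this helps decide what to test; it does not replace the online experiment.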
Cold start needs a concrete plan
New users arrive with no history and new items arrive with no interactions, and answering cold start with "we use embeddings" is not a plan. Strong answers handle the two cases separately because they have different solutions.
Cold items lean on content features — attributes of the item itself — so they can be scored and surfaced before they accumulate engagement signal. Cold users lean on popularity priors, any context available at the start such as device, location, or referral source, lightweight onboarding signals, and deliberate exploration. The exploration point is the part most candidates omit. A system that always serves its current best guess never gathers the data it needs to improve for a new user or item. Accepting slightly worse short-term recommendations in order to learn, framed as an explore-exploit tradeoff and implemented with a method like Thompson sampling or epsilon-greedy selection, is what separates a complete answer from a partial one.
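A minimal sketch of the explore-exploit idea, using Beta-Bernoulli Thompson sampling over two items. The item names and click rates are made up for the simulation, and a production system would sample per user or per context rather than globally.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of items."""

    def __init__(self, item_ids, seed=0):
        self.rng = random.Random(seed)
        # Per item: [clicks + 1, non-clicks + 1], i.e. a uniform Beta(1, 1) prior.
        self.params = {item: [1, 1] for item in item_ids}

    def choose(self):
        # Draw a plausible click rate for each item and show the best draw.
        # Items with little data have wide posteriors, so they still get shown.
        draws = {item: self.rng.betavariate(a, b) for item, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, item, clicked):
        self.params[item][0 if clicked else 1] += 1


def simulate(true_ctr, rounds=2000, seed=0):
    """Run the sampler against assumed click rates; count impressions per item."""
    sampler = ThompsonSampler(true_ctr, seed=seed)
    world = random.Random(seed + 1)  # separate RNG standing in for user behavior
    shown = {item: 0 for item in true_ctr}
    for _ in range(rounds):
        item = sampler.choose()
        shown[item] += 1
        sampler.update(item, world.random() < true_ctr[item])
    return shown


shown = simulate({"established_item": 0.10, "new_item": 0.30})
print(shown)
```

Both items start with no evidence, so early impressions are spread across them; as clicks accumulate, traffic shifts toward the item with the higher true rate without a hand-tuned exploration schedule.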
Feedback loops and the failure path
A recommender's outputs become the training data for its next version. That closed loop means popular items receive more exposure, which earns them more engagement, which the next model reads as evidence that they deserve still more exposure. Left alone, the system narrows what users see and amplifies whatever it already favored. Naming this dynamic, and proposing diversity objectives or sustained exploration as a counterweight, separates candidates who reason about the system over months from those who reason about a single forward pass.
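One simple form a diversity objective can take is a greedy re-rank that discounts an item for each already-selected item from the same category. The sketch below is one option among many, with an invented penalty value and invented items; it is not a claim about how any particular feed does this.

```python
def diversify(ranked, category_of, penalty=0.2, k=5):
    """Greedy re-rank with a per-category repetition penalty.

    ranked: list of (item_id, score) from the ranking model.
    category_of: dict mapping item_id to a category label.
    An item's effective score is score - penalty * (number of items already
    selected from the same category).
    """
    remaining = dict(ranked)
    picked_per_category = {}
    result = []
    while remaining and len(result) < k:
        best = max(
            remaining,
            key=lambda item: remaining[item]
            - penalty * picked_per_category.get(category_of[item], 0),
        )
        result.append(best)
        category = category_of[best]
        picked_per_category[category] = picked_per_category.get(category, 0) + 1
        del remaining[best]
    return result


ranked = [("cat1", 0.9), ("cat2", 0.85), ("cat3", 0.8), ("news1", 0.7), ("cook1", 0.6)]
category_of = {"cat1": "pets", "cat2": "pets", "cat3": "pets",
               "news1": "news", "cook1": "cooking"}
feed = diversify(ranked, category_of, penalty=0.2, k=4)
print(feed)
```

By raw score the top four slots would be three pet videos and one news item; with the penalty, the feed interleaves news and cooking and the third pet video drops out.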
The failure path deserves a sentence too. When the ranking service is slow or unavailable, the feed still has to return something. A fallback to popularity-based or cached results, rather than an empty screen, is a small detail that signals real production experience. Interviewers notice candidates who design for the day the model is down, because that day always arrives.
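The fallback behavior can be sketched as a timeout wrapper around the ranking call. This assumes the ranker is a blocking function and that a precomputed popularity list is available; `get_feed`, `POPULAR`, and the two example rankers are hypothetical names for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

# Hypothetical precomputed popularity list, refreshed offline.
POPULAR = ["top1", "top2", "top3"]


def get_feed(user_id, rank_fn, fallback, timeout_s=0.2):
    """Return ranked items, or the fallback if ranking is slow, failing, or empty."""
    future = _pool.submit(rank_fn, user_id)
    try:
        items = future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running is not interrupted
        return fallback
    except Exception:
        return fallback  # the ranker raised: degrade rather than show an error
    return items or fallback


def slow_ranker(user_id):
    time.sleep(0.3)  # simulates a ranking service that misses the latency budget
    return ["personalized"]


def broken_ranker(user_id):
    raise RuntimeError("model server unavailable")


print(get_feed("u1", slow_ranker, POPULAR, timeout_s=0.05))
print(get_feed("u1", broken_ranker, POPULAR))
```

Both calls return the popularity list instead of an empty feed. A real service would also log and count these fallbacks so that a silent degradation shows up in monitoring.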
The pattern underneath the question
The architecture is the easy half of this question, and it is the half nearly every candidate covers. The hard half is judgment about objectives, about where training labels come from, about how to evaluate a change honestly, about what happens to new users and items, and about how the system behaves over time and under failure. That is the half the round is built to test, and it is where preparation pays off. Practice reasoning about the decisions around the diagram, because the diagram itself is the part you can draw in your sleep and the part that distinguishes no one.