Video Search (Text to Video)
Video search is a text-to-video retrieval problem: the user types a query and the system needs to surface relevant videos from a billion-item index in under 200 ms. I'll work through business and ML objectives, system architecture, data and features, modeling, infrastructure, evaluation, and robustness.
Solution Walkthrough
Business Objective
The objective is to maximize user satisfaction with search results, measured through successful task completion and quality-adjusted watch time from search sessions, while maintaining search freshness to ensure new creators can gain visibility. This is more nuanced than just optimizing click-through rate. A user might click on a result but immediately bounce if the video doesn't match their intent, or they might watch a video all the way through but still feel unsatisfied if it didn't answer their question.
Quality-adjusted watch time means we're weighting watch duration by satisfaction signals like whether users completed the video, whether they engaged positively (likes, subscribes), and critically, whether they stopped searching after finding this video. That last signal is gold; it tells us we successfully resolved their query. A user who searches for "how to fix a leaky faucet," clicks the third result, watches 80% of it, and doesn't search again in the next hour has had a successful experience.
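To make the weighting concrete, here is a minimal sketch of how a single search-initiated watch could be scored. The SearchWatchEvent fields, the weight values, and the one-hour window are all assumptions chosen for illustration; in practice the weights would be tuned against satisfaction surveys or long-term retention.

```python
from dataclasses import dataclass


@dataclass
class SearchWatchEvent:
    """One watch that started from a search result (hypothetical schema)."""
    watch_seconds: float
    video_seconds: float
    liked: bool = False
    subscribed: bool = False
    searched_again_within_hour: bool = False


def quality_adjusted_watch_time(event: SearchWatchEvent) -> float:
    """Weight raw watch time by satisfaction signals.

    All weights below are illustrative assumptions, not tuned values.
    """
    if event.video_seconds <= 0:
        return 0.0
    completion = min(event.watch_seconds / event.video_seconds, 1.0)

    weight = 0.5 + 0.5 * completion      # completion moves weight from 0.5 to 1.0
    if event.liked:
        weight += 0.25                   # positive engagement
    if event.subscribed:
        weight += 0.25
    if not event.searched_again_within_hour:
        weight += 0.5                    # strongest signal: the query was resolved
    return event.watch_seconds * weight
```

In the faucet example, a user who watches 480 of 600 seconds and does not search again gets a weight of about 1.4, so the watch counts for more than its raw duration. A five-second bounce followed by a new search is weighted at roughly 0.5 and contributes almost nothing.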
The freshness component is critical for ecosystem health. If search results are dominated by established videos with millions of views, new creators can never break through even if their content is excellent. We need to balance relevance with recency and give new high-quality videos a chance to surface.
ML Objective
From an ML perspective, this is a two-stage retrieval and ranking problem operating across a massive corpus. Given a text query and potentially billions of videos, we need to first retrieve a candidate set of potentially relevant videos, then rank them by predicted relevance and quality. The challenge is the semantic gap: matching natural language queries to video content that might not contain those exact words but is conceptually relevant.
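The two-stage shape can be sketched in a few lines. This is a toy illustration, not a production design: the two-dimensional vectors, the video IDs, and the RANKER_SCORES table are made up, the brute-force dot-product scan stands in for an approximate nearest-neighbor index, and the lookup table stands in for a learned ranking model. Embedding similarity is one common way to bridge the semantic gap, since a query and a video can land close together without sharing any words.

```python
import heapq
from typing import Callable

Vector = list[float]

# Hypothetical toy index: video ID -> embedding from some text/video encoder.
TOY_INDEX: dict[str, Vector] = {
    "faucet_repair": [0.9, 0.1],
    "pipe_fix": [0.8, 0.2],
    "cooking_pasta": [0.1, 0.9],
    "cat_compilation": [0.0, 1.0],
}

# Hypothetical second-stage scores; a real system would call a ranking model.
RANKER_SCORES: dict[str, float] = {"faucet_repair": 0.6, "pipe_fix": 0.9}


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, index: dict[str, Vector], k: int) -> list[str]:
    """Stage 1: cheap similarity search over the whole corpus, keep the top k."""
    return heapq.nlargest(k, index, key=lambda vid: dot(query_vec, index[vid]))


def rank(candidates: list[str], score: Callable[[str], float]) -> list[str]:
    """Stage 2: re-order the small candidate set with a more expensive scorer."""
    return sorted(candidates, key=score, reverse=True)


def search(query_vec: Vector, index: dict[str, Vector],
           score: Callable[[str], float], k: int = 100, n: int = 10) -> list[str]:
    return rank(retrieve(query_vec, index, k), score)[:n]
```

The split exists because the two stages have different budgets: retrieval must touch the whole corpus and so has to be cheap per item, while ranking only sees a few hundred candidates and can afford richer features.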
We're predicting multiple signals: relevance to query intent, expected watch time, probability of engagement, and likelihood of query satisfaction. These predictions feed into a ranking function that balances immediate relevance with diversity, freshness, and long-term platform health.
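One simple form such a ranking function could take is a weighted sum of the predicted signals plus a decaying freshness bonus. Everything specific here is an assumption for illustration: the Predictions fields, the weights, the seven-day half-life, and the choice to express expected watch time as a fraction of video length so that all terms share a 0 to 1 scale. Diversity is left out because it depends on the other items in the result list rather than on a single video.

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    """Per-video model outputs for one query (hypothetical schema)."""
    relevance: float            # predicted relevance to query intent, 0..1
    expected_watch_frac: float  # expected watch time as a fraction of length, 0..1
    p_engage: float             # probability of like/subscribe, 0..1
    p_satisfied: float          # probability the user stops searching, 0..1
    age_days: float             # time since upload


# Illustrative weights only; in practice these would be tuned offline and via A/B tests.
WEIGHTS = {"relevance": 0.4, "watch": 0.2, "engage": 0.1, "satisfied": 0.3}
FRESHNESS_BOOST = 0.1
FRESHNESS_HALF_LIFE_DAYS = 7.0


def ranking_score(p: Predictions) -> float:
    base = (
        WEIGHTS["relevance"] * p.relevance
        + WEIGHTS["watch"] * p.expected_watch_frac
        + WEIGHTS["engage"] * p.p_engage
        + WEIGHTS["satisfied"] * p.p_satisfied
    )
    # The bonus halves every FRESHNESS_HALF_LIFE_DAYS, so it fades out for old videos.
    freshness = 0.5 ** (max(p.age_days, 0.0) / FRESHNESS_HALF_LIFE_DAYS)
    return base + FRESHNESS_BOOST * freshness
```

Keeping the freshness bonus small relative to the relevance weight means a new video can win a close contest against an established one, but an irrelevant new upload cannot outrank a clearly relevant older video.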