Language Classification for Posts
Language classification sounds simple, but it is critical infrastructure: it runs on billions of posts daily at sub-millisecond latency and feeds dozens of downstream systems. Getting it wrong cascades failures across the entire platform. I'll work through business and ML objectives, system architecture, data and features, modeling, infrastructure, evaluation, and robustness.
Solution Walkthrough
Business Objective
The objective is to accurately identify the language of every post on the platform so that downstream systems can do their jobs properly. Think about what depends on knowing the language. Users need to see content in languages they understand, so feed ranking has to know the language. Ad serving needs to match language-appropriate ads. Content moderation relies on language-specific models: a hate speech classifier for English is useless on a Hindi post. Translation services need to know what they're translating from. Essentially, language classification is foundational infrastructure, and when it breaks, everything downstream breaks with it.
What makes the accuracy requirement so stringent is this cascading effect. If we misclassify a Portuguese post as Spanish, it gets routed to the wrong moderation pipeline, served to the wrong ad targeting segment, and the user might see a "Translate to English" button that doesn't work because it's trying to translate from the wrong source language. Each individual error might seem minor, but across billions of posts daily, even a 1% error rate means tens of millions of posts getting wrong downstream treatment.
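To make the scale argument concrete, here is a back-of-the-envelope sketch. The daily volume of 3 billion posts is an assumed figure chosen only for illustration (the text says only "billions"), and `misrouted_posts` is a hypothetical helper name.

```python
def misrouted_posts(daily_posts: int, error_rate: float) -> int:
    """Expected number of posts per day that receive a wrong language label."""
    return round(daily_posts * error_rate)


# Assumed volume for illustration only; the real figure is not given.
DAILY_POSTS = 3_000_000_000

for rate in (0.01, 0.001):
    count = misrouted_posts(DAILY_POSTS, rate)
    print(f"error rate {rate:.1%}: {count:,} misrouted posts per day")
```

Under that assumption, a 1% error rate works out to 30 million misrouted posts per day, and even a 0.1% error rate still leaves 3 million.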
ML Objective
At its core, this is a multi-class classification problem: we're assigning one of a hundred or more language labels to each post. But the reality is much more nuanced than a standard classification setup. The primary challenge is extreme class imbalance. English, Spanish, Portuguese, Hindi, and Arabic dominate the platform, while the long tail of other languages has minimal representation. We need greater than 99% macro-F1 on high-resource languages and at least 95% on low-resource ones, because serving Yoruba speakers poorly isn't acceptable just because there are fewer of them.
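The reason to target macro-F1 rather than plain accuracy can be seen in a small sketch. The toy data below (98 English posts, 2 Yoruba posts, and a model that predicts English for everything) is invented for illustration, and the helper names are assumptions rather than part of any particular library.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per label, computed from true positive, false positive, and false negative counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every language counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, invented data: a degenerate model that always answers "en".
y_true = ["en"] * 98 + ["yo"] * 2
y_pred = ["en"] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

On this toy data accuracy is 0.98 while macro-F1 is about 0.49, because the Yoruba class scores zero and is weighted the same as English. That is the property that keeps low-resource languages visible in the metric.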
Beyond the imbalance, there are several phenomena that make this problem genuinely hard. Code-switching is extremely common: people mix languages within a single post, as in Spanglish or Hinglish, and you need to either identify the dominant language or flag the post as mixed. Transliteration adds another layer: Hindi written in Latin characters looks nothing like Hindi in Devanagari to a naive model. Short text is inherently ambiguous: a post that just says "ok" or "lol" could be in any language. And then there's the contextual dimension: the same text might be in different languages depending on who's posting it. "Die" is an English word, but it's also a German article, and the model needs contextual signals to disambiguate.
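The transliteration problem can be illustrated with a minimal script-detection sketch. It uses the first word of each character's Unicode name as a rough proxy for its script, which is a simplifying assumption for illustration and not the method described in this walkthrough. `dominant_script` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script among the letters of `text`, or None.

    The script is approximated by the first word of the Unicode character
    name (e.g. "LATIN", "DEVANAGARI"). This is a rough heuristic.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


# The same Hindi greeting in native script and transliterated.
print(dominant_script("नमस्ते"))   # Devanagari script
print(dominant_script("namaste"))  # Latin script, yet still Hindi
print(dominant_script("ok"))       # Latin script, language unknowable from text alone
```

Script is a cheap and useful feature, but the example shows why it cannot stand alone: transliterated Hindi and a two-letter post like "ok" both come back as Latin, so the model still needs character n-gram or contextual signals to decide the language.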