Building a Production RecSys in a Weekend
There's a genre of machine learning tutorial that teaches you how to build a recommendation model. You load MovieLens, train a matrix factorization or a two-tower network, evaluate recall@10, and declare victory. I've done this a dozen times. It never once helped me understand how recommendation systems actually work in production.
The model is maybe 10% of a recommendation system. The other 90% is everything the tutorials skip: How do you guarantee that the features computed during training are identical to the features computed at serving time? How do you narrow a catalog of millions down to a handful of candidates before your ranking model even sees them? How does the retrieval stage hand results to the ranking stage? How do multi-task objectives interact during training?
These are the questions that determine whether a system works. And the answers aren't in papers; they're in the plumbing.
Why I built this
I had operational experience with recommendation systems (tuning features, debugging skew, shipping ranking changes) but not the kind of deep structural intuition you get from building a system end-to-end, from raw data to scored response.
mmoe-recsys is my attempt to develop that intuition. Not by reading another paper or tutorial, but by making every decision myself: how retrieval hands candidates to ranking, how feature transforms stay consistent across training and serving, how multi-task losses interact during backprop. The kind of understanding you can only get by building.
The collaboration
mmoe-recsys was built in a weekend by three collaborators: me, Claude Code (Anthropic's CLI agent), and Cursor (AI-native IDE). This isn't a throwaway demo. It's a tested, linted, type-checked monorepo with 118 passing tests, CI, MLflow tracking, ONNX export, and a gRPC serving pipeline. The kind of system that would normally take a small team weeks to scaffold.
The workflow looked like this: I held the architectural vision (what the system should do, how stages connect, which abstractions matter). Claude Code handled the heavy implementation: scaffolding the monorepo, writing training loops, wiring up the serving pipeline, building integration tests, setting up CI. Cursor filled the gaps: quick edits, inline completions, navigating unfamiliar code that Claude had just written. I reviewed every PR, caught structural issues, pushed back on over-engineering, and made the judgment calls about what to build and what to skip.
What surprised me wasn't the speed. It was the shape of my contribution. I spent almost no time writing code from scratch. I spent almost all of it reading code, making architectural decisions, and saying "no, not like that." The ratio of thinking to typing shifted dramatically, and the thinking was the part that actually mattered.
What we built
The system implements the standard industrial two-stage pattern:
Retrieval narrows the full item catalog to a few hundred candidates. A Two-Tower model (separate user and item encoders) trained with InfoNCE loss and LogQ popularity correction. The item tower pre-computes embeddings offline; at serving time, only the user tower runs, and Faiss does the nearest-neighbor lookup. This is the stage that makes the problem tractable.
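To make the retrieval loss concrete, here is a minimal sketch of in-batch InfoNCE with a LogQ correction, written in plain Python rather than a tensor library so it runs anywhere. It is not the mmoe-recsys training code: the function name, the temperature value, and the way `item_log_q` is supplied are all assumptions for illustration.

```python
import math


def info_nce_logq(user_vecs, item_vecs, item_log_q, temperature=0.1):
    """In-batch InfoNCE loss with a LogQ correction (illustrative sketch).

    Row i of user_vecs is paired with row i of item_vecs (the positive);
    every other item in the batch serves as a negative. item_log_q[j] is the
    log of the estimated probability that item j is sampled into a batch.
    The temperature default is an assumption, not a value from the project.
    """
    n = len(user_vecs)
    total = 0.0
    for i in range(n):
        logits = []
        for j in range(n):
            dot = sum(u * v for u, v in zip(user_vecs[i], item_vecs[j]))
            # LogQ correction: subtract log q_j so that popular items, which
            # appear as in-batch negatives more often, are not over-penalized.
            logits.append(dot / temperature - item_log_q[j])
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-softmax evaluated at the positive (diagonal) entry.
        total += log_z - logits[i]
    return total / n


users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
log_q = [math.log(0.9), math.log(0.1)]  # item 0 is far more popular
loss = info_nce_logq(users, items, log_q)
```

When every item has the same sampling probability the correction is a constant shift and cancels inside the softmax, so it only changes the loss when popularity is skewed.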
Ranking scores every candidate from retrieval. An MMoE-DeepFM with two heads, CTR (will the user click?) and CVR (will they convert?), sharing expert networks through learned gating. CVR loss uses direct gradient masking so that only clicked samples contribute (not the ESMM tower decomposition; just zero out the CVR loss for unclicked rows). The final score is expected value: CTR * CVR * price. You can optionally enable DCN cross-layers in the experts, inspired by Pinterest's posts on GPU-served two-tower models and next-gen lightweight ranking.
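The masking and the final score are easy to show in a few lines. This is a plain-Python sketch, not the project's loss module: the function names, the equal weighting of the two losses, and normalizing the CVR loss by the number of clicked rows are assumptions on my part.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction, clamped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multitask_loss(ctr_pred, cvr_pred, clicked, converted):
    """CTR loss on every row plus CVR loss on clicked rows only."""
    n = len(clicked)
    ctr_loss = sum(bce(p, y) for p, y in zip(ctr_pred, clicked)) / n
    # Multiply each CVR term by the click label: unclicked rows contribute
    # zero loss, so no gradient would flow from them into the CVR head.
    masked = [bce(p, y) * c for p, y, c in zip(cvr_pred, converted, clicked)]
    # Assumption: average over clicked rows; guard against an all-unclicked batch.
    cvr_loss = sum(masked) / max(sum(clicked), 1)
    # Assumption: equal weighting of the two tasks.
    return ctr_loss + cvr_loss


def expected_value(ctr, cvr, price):
    """Final ranking score: P(click) * P(convert | click) * price."""
    return ctr * cvr * price


clicked = [1, 0]
converted = [1, 0]
loss = multitask_loss([0.8, 0.2], [0.5, 0.9], clicked, converted)
score = expected_value(0.1, 0.2, 50.0)
```

The property worth checking is that changing the CVR prediction on an unclicked row leaves the loss untouched, which is exactly what "zero out the CVR loss for unclicked rows" means.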
What I learned about working with AI
The hardest problems were all at the boundaries: not "how do I implement an attention-weighted user history encoder" but "how do I make sure the padding and masking for that encoder are identical in the dataset class and the serving pipeline." Claude was excellent at implementing components in isolation. The integration points (where retrieval meets ranking, where training artifacts meet serving code) required human judgment about what should be shared and what should be separate.
This maps directly to what matters in the codebase. The model code is maybe 200 lines per stage. The feature processing, dataset construction, ONNX export wrappers, and serving pipeline are 3x that. AI tools are good at the 200 lines. The 600 lines, the parts that determine whether the system actually works in production, are where human architectural judgment still does the heavy lifting. And the most important abstraction in those 600 lines is the FeatureProcessor.
The zero-skew guarantee
Training/serving skew is the silent killer of ML systems. You compute features one way in your training pipeline, a slightly different way in your serving path, and your model degrades in production in ways that are nearly impossible to debug. Every production ML team has a war story about this.
The solution in mmoe-recsys is structural: a single FeatureProcessor class is the only code path that transforms raw features into model inputs. The offline compilation step uses it. The serving pipeline uses it. Same class, same frozen artifacts (vocab mappings, normalization stats), same output. There's an integration test that instantiates two independent processors from the same artifacts and verifies they produce identical results.
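Here is roughly what that pattern looks like, as a minimal sketch rather than the actual mmoe-recsys class. Only the FeatureProcessor name comes from the project; the feature names (`category`, `price`), the JSON artifact layout, and the `from_artifacts` constructor are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureProcessor:
    """Single code path from raw features to model inputs (sketch)."""

    vocab: dict        # category string -> integer id; 0 is reserved for unknown
    price_mean: float  # frozen normalization stats computed offline
    price_std: float

    @classmethod
    def from_artifacts(cls, artifact_json: str) -> "FeatureProcessor":
        # Hypothetical artifact format: one JSON document with all frozen state.
        a = json.loads(artifact_json)
        return cls(a["vocab"], a["price_mean"], a["price_std"])

    def transform(self, raw: dict) -> dict:
        # Both the offline compilation step and the serving path call this.
        return {
            "category_id": self.vocab.get(raw["category"], 0),
            "price_norm": (raw["price"] - self.price_mean) / self.price_std,
        }


artifacts = json.dumps(
    {"vocab": {"shoes": 1, "hats": 2}, "price_mean": 20.0, "price_std": 5.0}
)
# Two independent instances built from the same artifacts, mirroring the
# integration test described above.
offline = FeatureProcessor.from_artifacts(artifacts)
serving = FeatureProcessor.from_artifacts(artifacts)
row = {"category": "hats", "price": 30.0}
assert offline.transform(row) == serving.transform(row)
```

The point of the design is that there is no second implementation to drift: anything the model needs at serving time is either in the frozen artifacts or in `transform`.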
This isn't a novel idea. It's the standard approach at companies that have been burned enough times. But I've never seen it in a runnable reference implementation, which is exactly why we built it as the central abstraction.
Who this is for
If you've outgrown MovieLens tutorials and want to see how a two-stage recommender actually fits together, from retrieval to ranking to scored response, this is a runnable reference. The code is tested, the patterns are production-grade (gradient masking for CVR, LogQ correction, swappable Faiss/Milvus and SQLite/Redis via Python Protocols), and the whole system generates its own synthetic data. There's nothing to download and no credentials to configure. Here's the full quickstart: clone to ranked recommendations over gRPC in about five minutes.
This is the first post in what I hope becomes a series on recommendation systems, from classical pipelines to the generative and sequential architectures that are starting to reshape the field. More soon.