This article discusses the various algorithms that make up the Netflix recommender system, and describes its business purpose.
We also describe the role of search and related algorithms, which for us turns into a recommendations problem as well.
We explain the motivations behind, and review, the approach that we use to improve the recommendation algorithms: A/B testing focused on improving member retention and medium-term engagement, combined with offline experimentation using historical member engagement data.
We discuss some of the issues in designing and interpreting A/B tests.
Finally, we describe some current areas of focused innovation, which include making our recommender system global and language aware.
[Why RL?] …Consumer research suggests that a typical Netflix member loses interest after perhaps 60–90 seconds of choosing, having reviewed 10–20 titles (perhaps 3 in detail) on one or two screens. By then, the user either has found something of interest or the risk of abandoning our service increases substantially. The recommender problem is to make sure that on those two screens each member in our diverse pool will find something compelling to view, and will understand why it might be of interest.
Historically, the Netflix recommendation problem has been thought of as equivalent to the problem of predicting the number of stars that a person would give a video after watching it, on a scale of 1 to 5. We indeed relied on such an algorithm heavily when our main business was shipping DVDs by mail, partly because in that context, a star rating was the main feedback we received that a member had actually watched the video. We even organized a competition aimed at improving the accuracy of rating prediction, resulting in algorithms that we use in production to predict ratings to this day.
But the days when stars and DVDs were the focus of recommendations at Netflix have long passed. Now, we stream the content, and have vast amounts of data that describe what each Netflix member watches, how each member watches (e.g., the device, time of day, day of week, intensity of watching), the place in our product in which each video was discovered, and even the recommendations that were shown but not played in each session. These data and our resulting experience improving the Netflix product have taught us that there are much better ways to help people find videos to watch than focusing only on those with a high predicted star rating.
[Personalization value] …The effective catalog size (ECS) is a metric that describes how spread viewing is across the items in our catalog. If most viewing comes from a single video, it will be close to 1. If all videos generate the same amount of viewing, it is close to the number of videos in the catalog. Otherwise it is somewhere in between. The ECS is described in more detail in Appendix A.
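The exact definition of the ECS lives in Appendix A, which is not reproduced here, but one formulation consistent with the properties described above (equal to 1 when a single video dominates, equal to the catalog size under perfectly uniform viewing) can be sketched as follows. The function name and the hours-based input are illustrative assumptions, not the production implementation:

```python
def effective_catalog_size(hours):
    """Illustrative ECS sketch: hours[i] = viewing hours for video i.

    With viewing shares p sorted in descending order, compute
    ECS = 2 * sum(i * p_i) - 1 (1-indexed). A single dominant video
    yields ECS = 1; uniform viewing over N videos yields ECS = N.
    """
    total = sum(hours)
    shares = sorted((h / total for h in hours), reverse=True)
    return 2 * sum(i * p for i, p in enumerate(shares, start=1)) - 1

# All viewing on one video -> ECS == 1
print(effective_catalog_size([100, 0, 0, 0]))    # 1.0
# Uniform viewing over 4 videos -> ECS == 4
print(effective_catalog_size([25, 25, 25, 25]))  # 4.0
```

The sorted-share weighting makes the metric sensitive to how concentrated viewing is, not just how many videos received any viewing at all.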
Figure 4: (Left) The black line is the effective catalog size (ECS) plotted as a function of the number of most popular videos considered in the catalog, ranging from 1 through n (the number of videos in the catalog) on the x-axis. The red line is the effective catalog size for the first k PVR-ranked videos for each member. At a PVR rank corresponding to the median rank across all plays, the ECS in red is roughly 4× that in black. The values on the x- and y-axes are not shown for competitive reasons. For more details, see Appendix A.
(Right) The take-rate from the first k ranks, as a function of the video popularity rank in black, and as a function of the PVR rank in red. The y-values were normalized by dividing by a constant so that the maximum value shown equals 1.
Without personalization, all our members would get the same videos recommended to them. The black line in the left plot of Figure 4 shows how the ECS without personalization increases as the number of videos we include increases, starting with the most popular video and adding the next most popular video as we move right on the x-axis. The red line on the same plot, by contrast, shows how the ECS grows not as a function of the videos that we include, but rather as a function of the number of PVR ranks that we include to capture personalization.
Although the difference in the amount of catalog exploration with and without personalization is striking, it alone is not compelling enough. After all, perhaps we could spread viewing even more evenly by offering completely random recommendations in each session. More important, personalization allows us to substantially increase our chances of success when offering recommendations. One metric that gets at this is the take-rate: the fraction of recommendations offered that result in a play. The two lines in the right plot of Figure 4 show the take-rate, one as a function of a video's popularity and the other as a function of a video's PVR rank. The lift in take-rate that we get from recommendations is substantial. But, most important, when produced and used correctly, recommendations lead to meaningful increases in overall engagement with the product (e.g., streaming hours) and to lower subscription cancellation rates.
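The take-rate computation can be made concrete with a minimal sketch. The pair-based log format and the by-rank breakdown below are illustrative assumptions, not Netflix's actual logging schema:

```python
from collections import defaultdict

def take_rate_by_rank(impressions):
    """Compute take-rate per recommendation rank.

    impressions: iterable of (rank, played) pairs, one per
    recommendation shown to a member. Returns {rank: plays / shown},
    i.e., the fraction of offered recommendations that led to a play.
    """
    shown = defaultdict(int)
    played = defaultdict(int)
    for rank, was_played in impressions:
        shown[rank] += 1
        played[rank] += int(was_played)
    return {r: played[r] / shown[r] for r in shown}

log = [(1, True), (1, False), (1, True), (2, False), (2, True)]
print(take_rate_by_rank(log))  # rank 1 -> 2/3, rank 2 -> 1/2
```

Breaking the metric out by rank is what allows plotting take-rate against popularity rank or PVR rank, as in the right panel of Figure 4.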
[Long-term A/B testing] …Changes to the product directly impact only current members; thus, the main measurement target of changes to our recommendation algorithms is improved member retention. That said, our retention rates are already high enough that it takes a very meaningful improvement to make a retention difference of even 0.1% (10 basis points). However, we have observed that improving engagement—the time that our members spend viewing Netflix content—is strongly correlated with improving retention. Accordingly, we design randomized, controlled experiments, often called A/B tests, to compare the medium-term engagement with Netflix along with member cancellation rates across algorithm variants. Algorithms that improve these A/B test metrics are considered better. Equivalently, we build algorithms toward the goal of maximizing medium-term engagement with Netflix and member retention rates.
…We then let the members in each cell interact with the product over a period of months, typically 2–6 months.
…While we have found multiple clear wins every year, we see more overall engagement wins that are not large enough to affect retention rates, and even more local engagement wins that do not change overall streaming or retention rates (e.g., because they simply cannibalize streaming from other parts of the product, or because they increase overall engagement or retention rates by too small an amount for us to detect with reasonable statistical confidence given the test's sample size).
[Testing traps] …4.3. Nuances of A/B Testing: A/B test results are our most important source of information for making product decisions. Most of the time, our tests are extremely informative. Yet, despite the statistical sophistication that goes into their design and analysis, interpreting A/B tests remains partly art. For example, we sometimes see retention wins that pass the statistical tests but that are not supported by increases in overall or local engagement metrics. In such cases, we tend to attribute the result to random variation not driven by our test experiences. Our common practice is then to rerun such A/B tests. We usually find that the retention wins do not repeat, unlike clearer wins supported by increases in local and overall engagement metrics.
Other times, we see overall engagement increases without local metric increases. We are similarly skeptical of those, and often repeat them as well, finding that the positive results do not repeat. The number of tests with seemingly confusing results can be decreased through more sophisticated experiment design and analysis, for example, by using so-called variance-reduction techniques such as stratified sampling (e.g., see Deng et al. 2013) to make the cells in a test even more comparable to each other, for instance, in terms of attributes that are likely to correlate highly with streaming and retention rates, such as the method of payment or the device of sign-up.
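Stratified allocation can be sketched in a few lines: group members by a stratum attribute (say, payment method or sign-up device), then spread each stratum evenly across the test cells. This is a generic illustration of the technique, not Netflix's allocation system:

```python
import random
from collections import defaultdict

def stratified_allocate(members, n_cells, stratum_of, seed=0):
    """Assign members to test cells so that each stratum is spread
    evenly across cells, reducing between-cell variance relative to
    simple random allocation.

    stratum_of: function mapping a member to a stratum label.
    Returns {member: cell_index}.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for m in members:
        by_stratum[stratum_of(m)].append(m)
    assignment = {}
    for group in by_stratum.values():
        rng.shuffle(group)  # randomize order within the stratum
        for i, m in enumerate(group):
            assignment[m] = i % n_cells  # round-robin across cells
    return assignment

# Example: 99 members in 3 strata, allocated to 3 cells.
cells = stratified_allocate(range(99), 3, lambda m: m % 3)
```

Because every stratum is split near-equally, attributes that correlate with streaming and retention are balanced across cells by construction rather than only in expectation.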
[Offline RL] …4.6. Faster Innovation Through Offline Experiments: The time scale of our A/B tests might seem long, especially compared with those used by many other companies to optimize metrics such as click-through rates. This is partly addressed by testing multiple variants against a control in each test; thus, rather than having two variants, A and B, we typically include 5–10 algorithm variants in each test, for example, using the same new model but different signal subsets, parameters, and/or model trainings. This is still slow, however: too slow, for example, to help us find the best parameter values for a model with many parameters. For tests on new members, more test cells also means more days of allocating new signups into the test to reach the same sample size in each cell.
Another option to speed up testing is to execute many different A/B tests at once on the same member population. As long as the variations in test experience are compatible with each other, and we judge them not to combine in a nonlinear way on the experience, we might allocate each new member into several different tests at once—for example, a similars test, a PVR algorithm test, and a search test. Accordingly, a single member might get similars algorithm version B, PVR algorithm version D, and search results version F. Over perhaps 30 sessions during the test period, the member’s experience is accumulated into metrics for each of the 3 different tests.
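One common way to give each member an independent assignment in several concurrent tests is to hash the member identifier salted with the test name; the text above does not say how Netflix implements this, so the following is a hypothetical sketch:

```python
import hashlib

def allocate(member_id, test_name, n_variants):
    """Deterministically assign a member to a variant of one test.

    Salting the hash with the test name makes a member's cell in one
    test effectively independent of their cells in other tests, so the
    same population can participate in several tests at once.
    """
    digest = hashlib.sha256(f"{test_name}:{member_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants

member = "member-12345"
cells = {t: allocate(member, t, 4) for t in ("similars", "pvr", "search")}
print(cells)  # one variant per test, stable across sessions
```

Determinism matters here: the member must land in the same cell on every session so that their accumulated experience feeds consistent metrics in each of the concurrent tests.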
But to really speed up innovation, we also rely on a different type of experimentation based on analyzing historical data. This offline experimentation varies from algorithm to algorithm, but it always consists of computing, for every algorithm variant tested, a metric that describes how well that variant fits previous user engagement. For example, for PVR, we might have 100 different variants that differ only in the parameter values used, each trained on data up to two days ago. We then use each algorithm variant to rank the catalog for a sample of members using data up to two days ago, and then find the ranks of the videos that those members played in the last two days. These ranks are used to compute metrics for each user across variants (for example, the mean reciprocal rank, precision, and recall), which are then averaged across the members in the sample, possibly with some normalization. For a different and detailed offline metric example, used for our page construction algorithm, see Alvino and Basilico 2015. Offline experiments allow us to iterate quickly on algorithm prototypes and to prune the candidate variants that we use in actual A/B experiments. The typical innovation flow is shown in Figure 8.
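The per-member ranking metrics named above can be sketched for a single member and variant. The cutoff k and the exact definitions (reciprocal rank of the first played video; precision and recall at k) are illustrative assumptions, since the text does not pin them down:

```python
def offline_metrics(ranked_catalog, played, k=10):
    """Offline metrics for one member under one variant's ranking.

    ranked_catalog: videos ordered by the variant's predicted rank.
    played: set of videos the member actually played afterwards.
    Returns (MRR, precision@k, recall@k), where MRR here is the
    reciprocal rank of the member's highest-ranked played video.
    """
    rank_of = {v: i + 1 for i, v in enumerate(ranked_catalog)}
    ranks = sorted(rank_of[v] for v in played if v in rank_of)
    mrr = 1.0 / ranks[0] if ranks else 0.0
    top_k = set(ranked_catalog[:k])
    hits = len(top_k & played)
    precision = hits / k
    recall = hits / len(played) if played else 0.0
    return mrr, precision, recall

catalog = list("abcdefghij")
print(offline_metrics(catalog, {"c", "f"}, k=5))
# MRR = 1/3, precision@5 = 0.2, recall@5 = 0.5
```

In the workflow described above, these per-member values would then be averaged across the sample for each of the (say) 100 variants, and the best few promoted to an actual A/B test.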
As appealing as offline experiments are, they have a major drawback: they assume that members would have behaved the same way, for example, playing the same videos, if the new algorithm being evaluated had been used to generate the recommendations. Thus, for instance, a new algorithm that results in very different recommendations from the production algorithm is unlikely to find that its recommendations have been played more than the corresponding recommendations from the production algorithm that actually served the recommendations to our members. This suggests that offline experiments need to be interpreted in the context of how different the algorithms being tested are from the production algorithm.
However, it is unclear what distance metric across algorithms can lead to better offline experiment interpretations that will correlate better with A/B test outcomes, since the latter is what we are after. Thus, while we do rely on offline experiments heavily, for lack of a better option, to decide when to A/B test a new algorithm and which new algorithms to test, we do not find them to be as highly predictive of A/B test outcomes as we would like.