To most people in this day and age, Netflix is synonymous with the go-to streaming service for movies and TV shows. What most people do not know, however, is that Netflix started out in the late 1990s with a subscription-based model, posting DVDs to people's homes in the US.
The Netflix Prize
In 2000, Netflix introduced personalised movie recommendations and, in 2006, launched the Netflix Prize, a machine learning and data mining competition with $1 million in prize money. Back then, Netflix used Cinematch, its proprietary recommender system, which had a root mean squared error (RMSE) of 0.9525, and challenged people to beat this benchmark by 10%. The team that achieved the target, or came closest to it after a year, would be awarded the prize money.
The winner of the Progress Prize a year later in 2007 used a linear combination of Matrix Factorisation (a.k.a. SVD) and Restricted Boltzmann Machines (RBM), achieving an RMSE of 0.88. Netflix then put those algorithms into production after some adaptations to the source code. Notably, even though some teams achieved an RMSE of 0.8567 in 2009, the company did not put those algorithms into production: the engineering effort required did not justify the marginal gain in accuracy. This illustrates an important point about real-life recommender systems, namely that model improvements always have to be weighed against the engineering effort needed to ship them.
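The competition metric itself is straightforward to compute. Here is a minimal sketch of RMSE over predicted star ratings (the rating values are toy numbers for illustration, not competition data):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    assert len(predicted) == len(actual) and predicted
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy ratings on a 1-5 star scale
preds = [3.8, 2.5, 4.9, 1.2]
truth = [4.0, 3.0, 5.0, 1.0]
print(round(rmse(preds, truth), 4))  # → 0.2915
```

A 10% improvement over Cinematch's 0.9525 meant pushing this number below roughly 0.8572, which shows just how hard-won the winning 0.8567 was.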
Streaming — the new way of consumption
A more important reason why Netflix did not incorporate the improved models from the Netflix Prize is that it introduced streaming in 2007. With streaming, the amount of data it held surged dramatically, and Netflix had to change the way its recommender system ingested data and generated recommendations.
Fast forward to 2020: Netflix has transformed from a mail service posting DVDs in the US into a global streaming service with 182.8 million subscribers. Consequently, its recommender system has evolved from a regression problem predicting ratings, to a ranking problem, to a page-generation problem, to a problem of maximising user experience (defined as maximising the number of hours streamed, i.e. personalising everything that can be personalised). The main question this article aims to address is:
What is Netflix using as its recommender system?
Netflix as a Business
Netflix has a subscription-based model. Simply put, the more members (Netflix's term for users/subscribers) it has, the higher its revenue. Revenue can be seen as a function of three things:
- Acquisition rate of new users
- Cancellation rates
- Rate at which former members rejoin
How important is Netflix’s Recommender System?
80% of stream time comes through Netflix's recommender system, which is a highly impressive number. Moreover, Netflix believes that a better user experience improves retention, which in turn translates into savings on customer acquisition (estimated at $1B per year as of 2016).
Netflix Recommender System
How does Netflix rank titles?
It is quite clear that Netflix utilises a two-tiered row-based ranking system, where ranking happens:
- Within each row (strongest recommendations on the left)
- Across rows (strongest recommendations on top)
Each row highlights a particular theme (e.g. Top 10, Trending, Horror, etc), and is typically generated using one algorithm. Each member’s homepage consists of approximately 40 rows of up to 75 items, depending on the device the member is using.
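The two-tier layout above can be sketched in a few lines. Assuming each row carries a relevance score and its items arrive already ranked (the scores and row names below are hypothetical), the homepage is essentially a sort across rows followed by truncation:

```python
# Hypothetical scored rows: (row_score, theme, items ranked left-to-right)
rows = [
    (0.62, "Trending", ["T1", "T2", "T3"]),
    (0.91, "Horror",   ["H1", "H2", "H3"]),
    (0.75, "Top 10",   ["P1", "P2", "P3"]),
]

def build_homepage(rows, max_rows=40, max_items=75):
    """Strongest rows on top, strongest items on the left."""
    ordered = sorted(rows, key=lambda r: r[0], reverse=True)[:max_rows]
    return [(theme, items[:max_items]) for _, theme, items in ordered]

for theme, items in build_homepage(rows):
    print(theme, items)  # Horror first, Trending last
```

This naive sort ignores diversity across rows; the Page Generation section below describes why Netflix had to move beyond exactly this kind of score-and-sort approach.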
Why Rows?
The advantages can be seen from two perspectives: 1) as a user, it is more coherent to be presented with a row of similar items and then decide whether you are interested in watching something in that category; 2) as a company, it is easier to collect feedback, since a right-scroll on a row indicates interest whilst a scroll-down (ignoring the row) indicates non-interest (though not necessarily irrelevance).
Fun fact: did you know that artworks are also personalised based on your profile and preferences? Find out more here!
What algorithms are used?
Netflix uses a variety of rankers, mentioned in its paper [1], though the specifics of each model's architecture are not given. Here is a summary of what they are:
Personalised Video Ranking (PVR) — a general-purpose algorithm that usually filters the catalog by certain criteria (e.g. Violent TV Programmes, US TV Shows, Romance), combined with side features including user features and popularity.
Top-N Video Ranker — similar to PVR, except that it considers the entire catalog and only cares about the head of the rankings. It is optimised using metrics that focus on the top of the ranking (e.g. MAP@K, NDCG).
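To make "metrics that focus on the top of the ranking" concrete, here is a minimal sketch of NDCG@k, one of the metrics named above (the relevance values are toy numbers):

```python
import math

def ndcg_at_k(ranked_relevance, k):
    """NDCG@k: discounted gain of the top-k ranking vs the ideal ordering.
    Items near the top of the list dominate the score."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(ranked_relevance, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_relevance[:k]) / denom if denom > 0 else 0.0

# Relevance of items in the order the ranker placed them (toy values)
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=3), 4))
```

Because the logarithmic discount shrinks the weight of lower positions, a ranker optimised for NDCG@k is rewarded almost entirely for what it puts at the head of the list, which is exactly what a Top-N ranker needs.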
Trending Now Ranker — captures temporal trends, which Netflix has found to be strong predictors. These short-term trends can range from a few minutes to a few days. The events/trends are typically:
- Events that have a seasonal trend and repeat themselves (e.g. Valentines day leads to an uptick in Romance videos being consumed)
- One-off, short term events (e.g. Coronavirus or other disasters, leading to short-term interest in documentaries about them)
Continue Watching Ranker — looks at items that the member has started but not completed, typically:
- Episodic content (e.g. drama series)
- Non-episodic content that can be consumed in small bites (e.g. movies that are half-completed, series that are episode independent such as Black Mirror)
The algorithm calculates the probability of the member continuing to watch, and includes other context-aware signals (e.g. time elapsed since viewing, point of abandonment, device watched on, etc.).
In a presentation, Justin Basilico [2] discussed the use of RNNs for time-sensitive sequence prediction, which I believe is used in this algorithm. The idea is that Netflix can use a member's past plays alongside contextual information to predict what the member's next play might be. In particular, using continuous-time context together with discrete-time context as input performed best.
Video-Video Similarity Ranker — a.k.a. "Because You Watched" (BYW)
This algorithm basically resembles content-based filtering. Given an item the member has consumed, it computes similar items (using an item-item similarity matrix) and returns the most similar ones. Amongst the algorithms listed, this is the only unpersonalised one, as no side features are utilised. However, it is personalised in the sense that displaying a particular item's similar items on a member's homepage is a conscious choice (more details in Page Generation below).
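A BYW-style lookup can be sketched with cosine similarity over item vectors. The titles and play-count vectors below are invented for illustration; in practice the item vectors could come from a factorised interaction matrix or item metadata embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy item vectors, e.g. rows of a play-count matrix (users as columns)
items = {
    "Black Mirror":    [5, 0, 3, 4],
    "Stranger Things": [4, 0, 4, 5],
    "The Crown":       [0, 5, 1, 0],
}

def because_you_watched(seed, items, top_n=2):
    """Return the top-n items most similar to the seed item."""
    scores = [(other, cosine(items[seed], vec))
              for other, vec in items.items() if other != seed]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:top_n]

print(because_you_watched("Black Mirror", items))
```

Note that nothing here depends on who is asking: the same seed always returns the same neighbours, which is exactly the sense in which the ranker is unpersonalised.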
Row Generation Process
Each of the above algorithms goes through the row generation process shown in the image below. For example, if PVR is looking at Romance titles, it finds candidates that fit this genre and, at the same time, gathers evidence to support presenting the row (e.g. Romance movies the member has previously watched). From my understanding, this evidence selection algorithm is incorporated into (or used together with) every ranking algorithm listed above to create a more curated ranking of items (see Netflix's model workflow image below).
This evidence selection algorithm uses “all the information [Netflix] shows on the top left of the page, including the predicted star rating that was the focus on the Netflix prize; the synopsis; other facts displayed about the video, such as any awards, cast or other metadata; and the images [Netflix] use to support [their] recommendations in the rows and elsewhere in the UI. [1] ”
Page Generation
After the algorithms generate candidate rows (each already ranked internally), how does Netflix decide which of these tens of thousands of rows to display?
Historically, Netflix used a template-based approach to tackle this problem of page generation, i.e. a massive blood bath of rows competing for precious screen real estate. The task focuses not only on accuracy, but also on providing diversity, accessibility and stability at the same time. Other considerations include hardware capabilities (which device is being used) and which rows/columns are visible at first glance versus upon scrolling.
This means that Netflix wants to accurately predict what members want to watch in a given session, without forgetting that they might want to pick up videos left off halfway. At the same time, it wants to highlight the depth of its catalog by providing something fresh, and perhaps capture trends going on in the member's region. Finally, stability is necessary once members have interacted with Netflix for a while and are used to navigating the page in a certain manner.
With all these requirements, one can see why a template-based approach works quite well for a start, since a fixed set of criteria can be guaranteed to be met at all times. However, having many such rules in place naturally landed Netflix in a local optimum in terms of providing a good member experience.
How then do we approach this row ranking problem?
Row-based approach
The row-based approach uses existing recommendation or learning-to-rank methods to score each row and then ranks rows by those scores. This approach can be relatively fast but lacks diversity: a member might end up seeing a page full of rows that each match their interests but are very similar to one another. How do we then incorporate diversity?
Stage-wise approach
An improvement to the row-based approach is a stage-wise approach, where each row is scored as above, but rows are selected sequentially from the top: whenever a row is selected, the remaining rows are re-scored to take into account their relationship to both the rows and the items already chosen for the page. This is a simple greedy stage-wise approach.
We could improve this by using a k-row lookahead approach, where we consider the next k rows when computing the score for each row. However, neither of these approaches obtains a global optimum.
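The greedy stage-wise idea can be sketched concretely. The candidate rows, base scores and the simple theme-overlap penalty below are all invented for illustration; the point is only the select-then-rescore loop:

```python
# Hypothetical candidate rows: (base_score, theme)
candidates = [(0.9, "Horror"), (0.85, "Horror"),
              (0.8, "Romance"), (0.7, "Documentary")]

def greedy_stagewise(candidates, n_rows, diversity_penalty=0.3):
    """Pick rows one at a time; after each pick, re-score the rest,
    penalising themes already placed on the page."""
    page, remaining = [], list(candidates)
    while remaining and len(page) < n_rows:
        chosen_themes = {theme for _, theme in page}
        rescored = [(score - diversity_penalty * (theme in chosen_themes),
                     score, theme)
                    for score, theme in remaining]
        best = max(rescored)                    # highest adjusted score wins
        page.append((best[1], best[2]))
        remaining.remove((best[1], best[2]))
    return [theme for _, theme in page]

print(greedy_stagewise(candidates, n_rows=3))
# → ['Horror', 'Romance', 'Documentary']
```

With the penalty set to zero this degrades to the plain row-based sort and returns two Horror rows back to back, which shows exactly the diversity failure the stage-wise re-scoring is meant to fix. It also shows why the result is only locally optimal: each pick is made without knowing what it blocks later.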
Machine Learning Approach
The approach that Netflix actually uses is a machine learning one: they train a scoring model on historical information about the homepages they have generated for members, including what members actually saw, what they interacted with and what they played.
Of course, there are many features and ways to represent a particular row of the homepage for the algorithm. It could be as simple as aggregating all the item metadata (as embeddings), or indexing items by position. Regardless of which features represent the page, the main goal is to generate hypothetical pages and see which items the member would have interacted with. Scoring is then done using page-level metrics, such as Precision@m-by-n and Recall@m-by-n (adaptations of Precision@k and Recall@k to a two-dimensional layout).
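A sketch of the two-dimensional metric, under the natural reading that the "retrieved set" is the top-m rows by their first n items (the page contents and relevant set below are toy data):

```python
def precision_at_m_by_n(page, relevant, m, n):
    """Precision over the top-m rows and first n items of each row:
    the 2-D analogue of Precision@k."""
    shown = [item for row in page[:m] for item in row[:n]]
    hits = sum(1 for item in shown if item in relevant)
    return hits / len(shown) if shown else 0.0

# Toy page of 3 rows; the member actually played items A, D and F
page = [["A", "B", "C"],
        ["D", "E", "F"],
        ["G", "H", "I"]]
relevant = {"A", "D", "F"}
print(precision_at_m_by_n(page, relevant, m=2, n=2))  # → 0.5
```

Varying m and n lets the metric mirror real screens: a TV that shows 2 rows of 4 items at first glance is evaluated on a different window than a phone showing 4 rows of 2.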
Cold-start, Deployment and Big Data
Cold-start Problem
The age-old cold-start problem — Netflix has it too. Traditionally, Netflix has tried to curb it by asking new members to fill in a survey of their preferences to 'jump start' the recommendations [6]. If this step is skipped, the recommendation engine falls back to a diverse and popular set of titles.
Also, recently during this Covid-19 period, Netflix Party (a Chrome extension) was created, and in my opinion this could have a large effect on curbing the cold-start problem, as its data is likely sent back to Netflix for analysis.
Simply put, Netflix used to be a single-person activity (at least, as far as Netflix can monitor). You could be at home watching a title alone or with a group of friends, but Netflix has no idea who you are physically watching it with. With Netflix Party, Netflix could potentially build a graph of who you have interacted with, and run a collaborative-filtering-like algorithm to make recommendations to new users as well.
It’s All A/Bout Testing
The gap between offline evaluation and online evaluation remains. Whilst offline metrics help evaluate how well a model performs on historical data, there is no guarantee that those results will translate into actual improvements in user experience (i.e. total watch time). As such, the Netflix team has an incredibly efficient A/B testing process in place to quickly test the new algorithms they build.
Do bear in mind that A/B testing is an art in itself: there are many variables to consider, including how to select the control and test groups, how to determine whether an A/B test is statistically significant (i.e. improves the overall user experience as a whole), choosing the control/test group sizes, what metrics to use, and many more.
Fundamentally, offline evaluation helps Netflix determine when to put models into an A/B test and which models to test. You can read more about Netflix's A/B testing experimentation process here.
Data, data, data and more data
With online streaming, the data that Netflix manages and has access to is practically limitless. Managing this amount of data is only possible with the right architecture — that is, segregating offline, online and nearline computation.
With offline computation, there are fewer limitations on the amount of data and the computational complexity of the algorithms, since everything runs in batch with relaxed timing requirements. However, results can easily grow stale between updates because the most recent data is not incorporated. For a personalised architecture, a key issue is combining online and offline computation in a seamless manner.
With online computation, we expect responses to recent events and user interactions, which therefore have to be produced in real time. Online computation cannot be too complex or computationally costly, and a fallback mechanism is necessary, such as reverting to a precomputed result.
With nearline computation, we have an intermediate compromise between the two: it can perform online-like computations, but is not required to serve them in real time, allowing computation and serving to be asynchronous. This opens the door to more complex processing per event, such as updating recommendations to reflect that a movie has been watched immediately after a member begins watching it. This is useful for incremental learning algorithms.
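The online path with its fallback can be sketched as follows. Everything here is hypothetical scaffolding (the function names, the latency budget, the precomputed store), shown only to illustrate the serve-or-fall-back pattern described above:

```python
import time

# Results produced by the offline batch layer, keyed by member (toy data)
PRECOMPUTED = {"member_42": ["Fallback1", "Fallback2"]}

def realtime_model(member_id):
    """Hypothetical online scorer; here it always fails,
    standing in for an outage or an overloaded service."""
    raise RuntimeError("online path unavailable in this sketch")

def online_recommend(member_id, budget_ms=50):
    """Serve recommendations online, reverting to the offline
    precomputed list if the real-time path fails or runs over budget."""
    start = time.monotonic()
    try:
        recs = realtime_model(member_id)
        if (time.monotonic() - start) * 1000 > budget_ms:
            raise TimeoutError("over latency budget")
        return recs
    except Exception:
        return PRECOMPUTED.get(member_id, [])

print(online_recommend("member_42"))  # → ['Fallback1', 'Fallback2']
```

The design choice is that the online layer is allowed to be fast and fragile precisely because the offline layer guarantees there is always something reasonable to serve.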
Below shows a detailed architecture diagram of Netflix.
For a more in-depth view of how these individual components are used, please read the following blog post.
Conclusion
That said, it's time to binge-watch, boys! Stay safe amid these tough times of Covid-19. Thank God for Netflix.
References
[1] The Netflix Recommender System
[2] Recent Trends in Personalization: A Netflix Perspective
[3] Learning a Personalized Homepage
[4] It’s All A/Bout Testing: The Netflix Experimentation Platform
[5] Selecting the best artwork for videos through A/B testing