A recent Wired article on the eventual winners of the Netflix prize contained some good insights. The data set used for the contest contained over 100 million ratings for about 480,000 users recorded over a period of 6 years. In the end, the winning team incorporated a number of findings within that data that hinted at some general user behaviors. The details of the solutions can be viewed here.
Take, for example, the notion of frequency, defined as the number of movies rated in a day by a particular user. Looking at the solution writeup for the BellKor team [PDF], the authors theorize:
"...when rating in a bulk, users still reflect their normal preferences. However, certain movies exhibit an asymmetric attitude towards them. Some people like them, and will remember them for long as their all-time favorites. On the other hand, some people dislike them and just tend to forget them. Thus, when giving bulk ratings, only those with the positive approach will mark them as their favorites, while those disliking them will not mention them. "
The winning solution also suggests that people use different criteria for rating movies they saw a long time ago, and that people tend to rate differently depending on the day of the week.
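To make two of these signals concrete, here is a minimal sketch of how rating frequency (as defined above) and a day-of-week average might be tabulated from raw rating records. The sample rows, the tuple layout, and the function names `rating_frequency` and `mean_rating_by_weekday` are assumptions made for illustration; this is a simple descriptive tabulation, not the model the BellKor team actually built.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical sample rows: (user_id, movie_id, rating, date the rating was entered).
ratings = [
    ("u1", "m1", 4, date(2005, 3, 6)),  # Sunday
    ("u1", "m2", 5, date(2005, 3, 6)),  # Sunday
    ("u1", "m3", 3, date(2005, 3, 7)),  # Monday
    ("u2", "m1", 2, date(2005, 3, 7)),  # Monday
]

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def rating_frequency(rows):
    """Count how many movies each user rated on each day."""
    return Counter((user, day) for user, _movie, _rating, day in rows)


def mean_rating_by_weekday(rows):
    """Average the ratings entered on each day of the week."""
    totals = defaultdict(lambda: [0, 0])  # weekday name -> [sum, count]
    for _user, _movie, rating, day in rows:
        bucket = totals[WEEKDAYS[day.weekday()]]
        bucket[0] += rating
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


frequency = rating_frequency(ratings)
weekday_means = mean_rating_by_weekday(ratings)
print(frequency[("u1", date(2005, 3, 6))])
print(weekday_means)
```

With these made-up rows, user `u1` has a frequency of 2 on 2005-03-06, and the averages come out to 4.5 for Sunday and 2.5 for Monday. On real data, a gap like that would be a prompt for further questions rather than an answer in itself.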
Insights like these naturally raise further questions. What specific criteria do people use when rating movies they watched recently versus months or years ago? Why do ratings of the same movie skew as more time passes? Are people simply in better moods on Sundays than on Mondays?
These solutions provide an excellent demonstration of the boundary between quantitative, algorithmic solutions and qualitative, user-centered research. Emerging patterns in a large dataset can hint at behavioral theories that can then be examined by qualitative approaches with real users. Conversely, qualitative research may uncover themes in user attitudes that could lead to new ways of looking at the data corpus.
These insights could not only help to improve the solution at hand (in this case, movie recommendations), but could also lead to new designs for the way movies are rated on the site. For instance, one could use the findings to design the process by which ratings are entered, what movies are presented to rate (recently viewed, similar genres, similar ratings by friends, etc.), and how ratings are presented to others.
Take the current rating system. Movies that a user has not yet rated are still shown with a rating that reflects Netflix's "best guess" for that user. The user can then enter their own rating, overriding the best guess. For example, here is Netflix's best guess for my rating of the film Flirting With Disaster:
I have seen this film...in fact, I own the DVD. I just haven't rated it yet. Some questions that come to mind are:
- To what extent does this best guess influence my rating? Is there a natural inclination to match what is already presented there?
- If the best guess is typically close to my own rating, does it have more influence?
- I last viewed this film quite a while ago. How well does the best guess reflect the skew in my rating that, according to the winning solution mentioned above, occurs over time?
These seem like good questions for study, and they could extend beyond movies to other domains (shopping, past jobs, music, events, locations, etc.). In any case, a mixed-method approach would seem to offer the best opportunity for useful answers.
