The Academy Awards are just around the corner and I can’t wait to see who wins! But that is mostly because I want the internet to finally quiet down about the awful, terrible, heartbreaking travesty that is the prospect of Leonardo DiCaprio going another year without a single Oscar on his fireplace mantel.
##The Idea
A little while ago I saw a comment on reddit discussing how well the Critic’s Choice awards matched up with the Oscar winners since 2000. I’m hoping /u/MarcusHalberstram88 didn’t have to search through the archives himself to tabulate the data. In any case, I thought, “hey, maybe this guy is on to something”; it might be interesting to see how well different award ceremonies predicted Oscar winners throughout the years. After careful investigation of all the award ceremonies out there, I settled on 9:
- Academy Awards
- Golden Globes
- Independent Spirits
- Critic’s Choice
- Producer’s Guild
- Screen Actor’s Guild
- Writer’s Guild
- Director’s Guild
Conveniently, IMDb has a pretty comprehensive Awards Central section, where each year for each award follows a pretty standard page layout (Golden Globes 2016 for reference), which made the data relatively easy to scrape. In comes Python with the help of BeautifulSoup, and out come award winners and movie metadata. In order to easily compare across award ceremonies, the names for each award needed to be normalized; titles like “Best Achievement in Directing”, “Best Director”, and “Best Director - Motion Picture” all mapped to the same award: director. I manually went through the 9 ceremonies and mapped their awards to a normalized name. To complicate things, some ceremonies (cough Oscars cough) like to change their award names whenever it suits them, so I had to go through past years to account for these changes. After fixing edge cases in the scraping code and creating a comprehensive map of normalized names, it felt like maybe I hadn’t really saved much time over that reddit user above. In total I scraped 26 years of data, which came to about 650 movies.
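In code, the normalization boils down to a lookup table from raw titles to a canonical name. A minimal sketch (the three director titles are the real examples from above; the full map covered every category across all ceremonies, including their historical renames):

```python
# Map each ceremony's raw award title to one normalized name.
# Excerpt only -- the full map was built by hand for every
# category at all the ceremonies, including renamed awards.
AWARD_NAME_MAP = {
    "Best Achievement in Directing": "director",
    "Best Director": "director",
    "Best Director - Motion Picture": "director",
}

def normalize_award(raw_title):
    """Return the normalized award name, or None if it isn't mapped yet."""
    return AWARD_NAME_MAP.get(raw_title)
```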
##Organizing the Data
The world of web development is constantly changing and I was feeling left behind; I decided to look up the new hotness in webdev, and all these cool-sounding words kept popping up, like “React”, “Flux”, and “Babel”. I have worked with React in the past, but Flux and ES7/Babel were new to me, so I decided, why not make a website? Screenshots of Excel graphs are so boring and interactive websites are so in. I exported the data to JSON and made a website. You can see the website over here. I decided to display each year in a table so you can look at the details and learn interesting things, like the fact that Leading Actress, Supporting Actor, and Supporting Actress were all swept in 2015! Or that in 2005 no ceremony correctly predicted that the Oscar would go to Million Dollar Baby. In the end, I settled on 8 different awards, since these were the ones shared across all ceremonies (with the exception of Adapted Screenplay). I have more data pulled and normalized, so I aim to display and predict more awards for next year. I also used d3 to make some fancy charts which show trends in the number of correct predictions over time.
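For the curious, an exported record might look something like the sketch below; the exact field names are my own illustration, not necessarily the schema the site actually uses:

```python
import json

# Hypothetical shape for one movie's entry in the exported JSON;
# "wins" maps a normalized award name to the ceremonies where
# the movie took that award home.
record = {
    "title": "Million Dollar Baby",
    "year": 2005,
    "wins": {"picture": ["Academy Awards"]},
}

payload = json.dumps(record, indent=2)
```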
##Hopping on to the ML Hype Train
One thing I noticed when looking at the trends was that certain ceremonies clearly predicted the Academy Awards better than others. Moreover, it seemed that different ceremonies were better indicators for specific awards. Prediction? Trends? Indicators? This sounds like a perfect fit for machine learning! Why not throw all this data at a classifier and see what happens? So that’s what I did. The goal was to take a movie and perform binary classification to determine if the movie would or would not win a specific Academy Award (e.g. would The Revenant win Leading Actor?). The features for each movie were its results at the other 8 award ceremonies (0 if the movie did not win the award at that ceremony, 1 if it did), the year of release, and some additional metadata provided by OMDb.
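Concretely, each movie/award pair turns into one row of numbers. A sketch assuming simple dict records (the record shape and function name are my own, not the actual code):

```python
def feature_vector(movie, award, ceremonies):
    """Build one training row: a 0/1 win flag per ceremony, plus the year.

    movie is assumed to look like:
      {"year": 2016, "wins": {"Golden Globes": {"leading_actor"}}}
    """
    flags = [1 if award in movie["wins"].get(c, set()) else 0
             for c in ceremonies]
    return flags + [movie["year"]]

# Example: did this movie win "director" at each of two ceremonies?
row = feature_vector(
    {"year": 2016, "wins": {"Golden Globes": {"director"}}},
    "director",
    ["Golden Globes", "Director's Guild"],
)
# row is [1, 0, 2016]
```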
sklearn provided a great, out-of-the-box LogisticRegression classifier. It took a little bit of hyperparameter tuning, but eventually the accuracy got to a respectable 60-80% depending on the award being predicted. One nice thing about a logistic regression classifier is that it also provides probabilities for each class; we can look at the probability value to see how confident the classifier was in its prediction. We can also extract the coefficients the classifier assigned to each feature to see which was the best predictor when assigning the class.
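The whole modeling step fits in a few lines of sklearn. A toy version with made-up data standing in for the real scraped features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: each row is a movie, each column a binary
# "won at ceremony X" feature; y marks whether it won the Oscar.
X = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba gives the classifier's confidence in each class;
# column 1 is the probability of winning the Oscar.
proba_win = clf.predict_proba([[1, 0, 1]])[0, 1]

# coef_ reveals which ceremony carried the most weight.
weights = clf.coef_[0]
```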
So what did the classifier predict? Go check out the predictions section of the site!