Greg Spahlinger
gspahlin@gmail.com
Github
LinkedIn

Predicting User Approval of and Engagement with Movies on IMDb With Machine Learning


Findings at a Glance

Decision Tree and Random Forest models can be used to predict high user approval of and high engagement with movies on IMDb with some degree of success. When optimized, these models correctly predicted high-approval movies (average IMDb user score above 6) 71% of the time, and high-engagement movies (more than 2250 votes on IMDb) 81% of the time. In both cases, release year, movie duration, and whether English was spoken in the movie were among the most important features for the prediction. In the approval model, the specific genres present in a movie were important, while in the engagement model, the number of genres represented mattered instead. These findings are interesting because they suggest that some relatively superficial aspects of a movie play a significant role in its audience approval and in whether it succeeds or fails.

Background

This project started out as a group project in my bootcamp. The goal was to understand what factors drive user approval and engagement. The original dataset was the IMDb Extensive Dataset published on Kaggle. This dataset is an especially powerful tool for understanding movies, as it contains 85,855 movies spanning the entire history of movie making. To make the data usable for this purpose, we cleaned and manipulated it, deriving a single .csv file. Our main metric for user approval was mean user score, and our main metric for user engagement was total votes. A few of the group's findings are a helpful starting point for the present work. First, early movies are less likely to receive low scores than later movies. Second, among movies with only one genre designation, average user approval differs significantly depending on which genre it is. Finally, the median number of votes a movie on IMDb receives is low (my estimate in the present work is 484 votes), but a substantial population of outliers exists, generating far more engagement - from thousands to millions of votes.

Acknowledgements to Amee Yang, James Bryant II, and Rebecca Lubera, my group members on this project.


Figures showing the distributions of average vote and total votes through time.


But Can These Data Help Us Predict Engagement or Approval?

The initial analysis of the IMDb dataset was interesting, but it fell somewhat short of the original goal of identifying the factors that lead to user approval and engagement. To fulfill that goal, I decided to pick up where my group and I left off and further transform and analyze the data from the IMDb Extensive Dataset. My new goal was to engineer features from the dataset that could be used to model approval and engagement with machine learning algorithms.



Reexploration and Feature Engineering

The main issue I saw initially, with respect to implementing a machine learning model, was that much of the data in this set is categorical. Three columns interested me for this part of the project: genre, language, and country. Each of these columns can contain one or several string values. For example, a genre entry may contain one to three genre designations drawn from a set of 23 unique genres. Data like this is not ready for an ML model and needs to be dummy encoded in some way.

I used the genre column to engineer 24 features for my analysis: one binary feature for each genre, plus a feature called genre complexity, which counts the number of genre designations a movie carries. To compute genre complexity, I looped through the genre column, split each entry into a list, and appended the length of that list to a new column. The genre features were built similarly: for each of the 23 genres, I looped through the genre column, split each entry into a list, and recorded a 1 if that genre was present and a 0 otherwise. In the end, every movie in the set has a genre vector of length 23, and a 1 in any position of that vector indicates that the corresponding genre is present.





Example of a loop used for feature encoding. This is the code used to accomplish the tasks described above.
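A minimal sketch of this kind of encoding loop, assuming a pandas DataFrame named movies with a comma-separated genre column (the file path, column names, and variable names here are illustrative, not the original code):

```python
import pandas as pd

movies = pd.read_csv("imdb_movies_clean.csv")  # assumed cleaned dataset

# Genre complexity: how many genre designations each movie carries
genre_complexity = []
for entry in movies["genre"]:
    genres = [g.strip() for g in str(entry).split(",")]
    genre_complexity.append(len(genres))
movies["genre_complexity"] = genre_complexity

# One binary feature per unique genre: 1 if the genre appears, 0 otherwise
unique_genres = sorted({g.strip() for entry in movies["genre"]
                        for g in str(entry).split(",")})
for genre in unique_genres:
    flags = []
    for entry in movies["genre"]:
        genres = [g.strip() for g in str(entry).split(",")]
        flags.append(1 if genre in genres else 0)
    movies[genre] = flags
```

For what it's worth, pandas can do the one-hot step in a single call with movies["genre"].str.get_dummies(sep=", "), but the explicit loops mirror the approach described above.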


I decided not to do a full encoding of the country and language fields, due to the large number of values these can take. Instead, I tried to anticipate countries and languages that would be strongly represented in the data and encoded those specifically as binary features. For example, my model includes a feature indicating whether English is among a movie's languages, but I did not create such a feature for every language. I used a list of the top countries for movie production to aid in my decision making.
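A sketch of this kind of binary encoding, again with assumed column and feature names (language, country, and the flag names are illustrative):

```python
# Flag a handful of strongly represented languages and countries
movies["english_spoken"] = movies["language"].fillna("").str.contains("English").astype(int)
movies["usa_produced"] = movies["country"].fillna("").str.contains("USA").astype(int)
movies["france_produced"] = movies["country"].fillna("").str.contains("France").astype(int)
```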

In order to decide which of these features to use, I made a lot of box plots to see whether the medians differed between films with a particular country or language and films without it.
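One quick way to make such a comparison, assuming the english_spoken feature from the sketch above and an avg_vote column for the average user score:

```python
import matplotlib.pyplot as plt

# Compare approval for movies with and without English
movies.boxplot(column="avg_vote", by="english_spoken")
plt.xlabel("English spoken (0 = no, 1 = yes)")
plt.ylabel("Average user score")
plt.suptitle("")  # drop the automatic grouped-by title
plt.show()
```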





Box plots were used to judge potential features. Movies containing English are slightly less approved of than movies without English.




Modeling The Data With ML Methods

The first step toward modeling the data with machine learning is deciding what kind of algorithm to apply. I was skeptical of regression models for this problem, because predicting an exact score or vote count seemed overly ambitious for these features. Instead, I decided to treat this as a classification problem. To apply a classification strategy, I encoded two new columns (one in each of two datasets). The first, for average vote, is called is_liked and has a value of 1 if the movie has an average score greater than 6. Six is a reasonable threshold because it is close to the mean value of the average score. The analogous column for total votes is is_outlier, which gets a value of 1 if the movie's total number of votes is greater than the median number of votes plus 1.5*IQR (roughly 2250 votes).
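A minimal sketch of these two target columns, assuming avg_vote and votes columns in the cleaned dataset:

```python
# High approval: average user score above 6
movies["is_liked"] = (movies["avg_vote"] > 6).astype(int)

# High engagement: total votes beyond median + 1.5 * IQR (roughly 2250 in this data)
q1, median, q3 = movies["votes"].quantile([0.25, 0.5, 0.75])
outlier_cutoff = median + 1.5 * (q3 - q1)
movies["is_outlier"] = (movies["votes"] > outlier_cutoff).astype(int)
```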

I prioritized decision tree and random forest models, as these are relatively easy to understand, and I know of ways to extract feature importance from them. Both of these models work by separating populations based on variables. A decision tree uses features to make splits in the data, starting with an initial split (the root node) and continuing through successive layers of nodes, with the goal of producing leaf nodes that are as pure as possible with respect to the two output conditions. The method uses Gini impurity as a metric for how good a split is: a Gini of 0 means a node contains only one class, while a Gini of 0.5 (with two classes) means the node is an even mixture of both, so the split provides no separation at all.
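For a two-class problem, the Gini impurity of a node is 1 minus the sum of the squared class proportions; a tiny sketch:

```python
def gini(p_positive):
    """Gini impurity of a node, given the fraction of samples in the positive class."""
    p = p_positive
    return 1 - (p ** 2 + (1 - p) ** 2)

print(gini(0.0))  # 0.0 -> perfectly pure node
print(gini(0.5))  # 0.5 -> even mixture; the split adds no information
```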

Random forest is built from the same ingredients as a decision tree, but it adds mechanisms that make the model more robust. A random forest constructs many decision trees, each trained on a random resampling of the data and considering a random subset of the features at each split, and then combines their predictions by majority vote.


Implementing and Tuning the Models

All of the machine learning tools used in this work come from the Scikit-learn library for Python. Before implementing the models, the data needed some processing. The first step was to scale the features with a min/max scaler, because year and duration have much larger ranges than most of the other features. Two different datasets were used: the set used to predict is_liked contained 38 features, while the one for is_outlier contained 40. Some features, such as genre_complexity, showed promise for the engagement prediction but not the approval prediction, which accounts for the difference. I optimized the models by tuning the hyperparameters iteratively over a range that seemed reasonable. The tree depth parameter proved particularly important: given an unconstrained tree depth, the models could almost completely sort the training data, but comparison with the test data showed this to be overfitting.
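A sketch of this tuning approach, assuming a feature matrix X and a binary target y built from the engineered columns (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so year and duration don't dominate by sheer magnitude
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Sweep tree depth and compare train vs. test accuracy to spot overfitting
for depth in range(2, 21):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 3),
          round(tree.score(X_test, y_test), 3))
```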



Overly large decision trees overfit the training data badly.



Results: Approval and Engagement are Somewhat Predictable

The best results I was able to obtain for audience approval came from random forest models. A random forest with a maximum tree depth of 16 and the default 100 estimator trees gave an accuracy of 0.709 on the test set.

The best results I was able to obtain for audience engagement were better still and came from a simple decision tree model. Random forest gave only a modest improvement on the approval data, and for engagement it significantly complicated the model without improving it. The best decision tree for engagement had a depth of 8 and gave an accuracy of 0.808. More detailed results follow in the form of classification report tables.
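A sketch of how those two models and their classification reports could be produced with scikit-learn (the split variable names for each dataset are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Approval model: random forest, depth 16, default 100 trees
approval_model = RandomForestClassifier(max_depth=16, n_estimators=100, random_state=42)
approval_model.fit(X_train_appr, y_train_appr)
print(classification_report(y_test_appr, approval_model.predict(X_test_appr)))

# Engagement model: single decision tree, depth 8
engagement_model = DecisionTreeClassifier(max_depth=8, random_state=42)
engagement_model.fit(X_train_eng, y_train_eng)
print(classification_report(y_test_eng, engagement_model.predict(X_test_eng)))
```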

Classification Report: Engagement

Engagement model reliably identifies normal engagement, but struggles with high (outlier) engagement.

Classification Report: Approval

Approval model separates low from high approval at rates better than chance.


Discussion: Engagement

In my view the most interesting results came from the engagement model. The model was able to sort outliers from normal (that is, much lower) engagement with around 80% accuracy, but the classification report supplies some nuance that shows this is a somewhat misleading picture. In the report above, recall is the proportion of a class's true members that the algorithm identifies, and precision is the proportion of the algorithm's predictions for a class that are correct. What we see is that the algorithm is very good at identifying normal engagement - it sorts 94% of those movies into the correct class. On the other hand, it sorts a significant number of the outliers into the normal class. This is most likely because the dataset is imbalanced with respect to these two populations: by definition, the set contains many more ordinary movies than outliers. It's possible that the results could be improved by addressing this imbalance.
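One simple way to address that imbalance (a possible next step, not something done in this project) is to weight the classes inversely to their frequency during training:

```python
from sklearn.tree import DecisionTreeClassifier

# class_weight="balanced" makes mistakes on the rare outlier class cost more
balanced_tree = DecisionTreeClassifier(max_depth=8,
                                       class_weight="balanced",
                                       random_state=42)
balanced_tree.fit(X_train_eng, y_train_eng)
```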



The engagement model made use of relatively few of the features.


The engagement model is striking because of its simplicity. Looking at the figure above, we find that year, duration, presence of the English language, and genre complexity dwarf the other features. It is somewhat surprising that the individual genre categories are not more prominent here.
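A ranking like the one in the figure can be pulled directly out of a fitted model; a sketch assuming the engagement_model from the earlier snippet and a feature_names list matching the feature columns:

```python
import pandas as pd

# Rank the engineered features by their importance in the fitted tree
importances = pd.Series(engagement_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```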


Discussion: Approval

Approval is more straightforward to interpret because the dataset is not imbalanced with respect to approval. The threshold between the two categories was set at 6 because that value is close to the central tendency shown in the first approval figure. The classification report shows that this model is equally precise for both classes, but it misses fewer of the cases where the film is approved of. At about 70% accuracy, this model is not without merit, but it is not highly predictive.



The approval model was a random forest, and slightly more complex.


The model for approval is somewhat more multifaceted than the one for engagement. It is much more surprising to me that the genre categories are marginal in this model, as median approval often differs markedly between genres. A possible reason is that the number of movies in some of the genres is small relative to the overall dataset. Horror and Drama have very different median approval values and are relatively large genres, so it is not surprising that they are the most used genres for splitting the tree. The elephant in the room is the fact that year and duration are highly prominent in all models. To investigate this, the optimized decision tree model for approval is rendered below.


Representative decision tree model with a depth of 8. Note the high Gini values for the initial splits: this indicates these are low-value splits.


The tree above shows each node's splitting condition at the top of its box, with the operator meaning less than or equal to. By position, year is the first feature and duration is the second, so X[0] is year and X[1] is duration. You can see that, while this model still gets about 70% accuracy, the first splits are not very good, as their high Gini values show. It's possible to understand what the model is doing here, but it's hard to understand why these two features would be so important. My hypothesis is that they are continuous variables and therefore allow a relatively even split, even if that split doesn't purify the classes very much.
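A rendering like the one above can be produced directly from a fitted tree; a sketch assuming a fitted decision tree named approval_tree and a feature_names list (both illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text

# Draw the top of the tree with readable feature names instead of X[0], X[1], ...
plt.figure(figsize=(20, 10))
plot_tree(approval_tree, feature_names=feature_names,
          class_names=["not liked", "liked"], max_depth=2, filled=True)
plt.show()

# Or dump the split rules as plain text
print(export_text(approval_tree, feature_names=feature_names, max_depth=2))
```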


Conclusions and Limitations

The most prudent conclusion I can derive from my work starts with acknowledging a limitation: the data I am using to predict engagement and approval is relatively shallow and doesn't really describe the content of those movies. That is, in a sense, the most interesting thing to me about this project. It's not obvious that a model should be able to predict whether people will like or engage with a movie based on how long it is, whether English is spoken in it, how many genres it carries, and which genres those are. From that perspective, this work might point to some trends that movie makers could be cognizant of when making movies. The models I am using are not the most sophisticated in the ML field; however, the small differences between decision tree and random forest, and the tendency of the models to overfit the data at relatively modest tree depths, point to diminishing returns for more complicated models. To make better models of approval and engagement in movies, I believe better data is necessary. This could involve natural language processing or financial data about the movies.


