Exploring the Association of Movie Trailer Performance on YouTube and Box Office Success using Neural Net, Python, and R

This was submitted as a project for my Big Data Analytics class in my MS Business Analytics program. The original title is “Exploring the Association of Movie Trailer Performance on YouTube and Box Office Success”. My other teammates are Yi Cai, Michael Friscia, and Zheyu Tian. This project was done for educational purposes only. Click the photos to enlarge. Check out the GitHub page for the files and data set. Due to policies of thenumbers.com regarding their data, that particular data set won’t be uploaded.

UPDATE: If you scroll below, you will see that the final accuracy was 82.55%. Using genetic algorithms and a Sklearn implementation, the accuracy was improved to 98.66% (with a final generation average accuracy of 92.28%). Check out the code in this GitHub repo.

Problem Statement

The purpose of this study is to determine if there is a correlation between the performance of trailers on YouTube and Hollywood movie sales.

Project Significance

  • By evaluating important predictors from YouTube viewers, studios and agencies can create and publish movie trailers on YouTube more efficiently, thus:
    • driving box office ticket sales domestically and globally
    • generating more revenue
  • Studios can focus on and improve trailer performance if it proves to be correlated with box office and post-show sales

Data Collection

  • Data was collected from YouTube, using its proprietary API, and from thenumbers.com
    • YouTube – trailer performance and comments
    • thenumbers.com – Movie Box Office data
  • 32.4GB (when comments are expanded into 1 line per comment)
  • 1,713 movies
  • 5,244 trailers
  • 2,979,511 comments

YouTube Data

Variable Selection

  • The ROI variable had to be created.
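The post does not say how ROI was derived. A minimal sketch, assuming the common definition (gross minus budget, divided by budget); the function name, field names, and figures below are made up for illustration and are not from thenumbers.com:

```python
def roi(gross: float, production_budget: float) -> float:
    """Return on investment as a fraction of the budget (assumed definition)."""
    if production_budget <= 0:
        raise ValueError("production budget must be positive")
    return (gross - production_budget) / production_budget


# Hypothetical rows, for illustration only.
movies = [
    {"title": "Movie A", "gross": 150_000_000, "budget": 50_000_000},
    {"title": "Movie B", "gross": 30_000_000, "budget": 40_000_000},
]
for movie in movies:
    movie["roi"] = roi(movie["gross"], movie["budget"])
```

Under this definition a positive ROI means the movie earned back more than its budget, and a negative one means it did not.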

Variables selected

Hypothesis and Rationale

  • There is a positive correlation between YouTube movie trailer performance indicators and box office performance/video sales.
    • Rationale: “Likes” = Sales
  • There is a positive correlation between movie trailer comment sentiment and box office/video sales performance.
    • Rationale: If trailers are viewed in a positive manner, then people will be more likely to watch the movie.

Conceptual Model

  • After data extraction using Python, data was transformed using Python. Output files were CSV and TXT files.
  • Three sentiment models were implemented in the project: two polarity-based sentiment models using Bing Liu’s and the Harvard IV-4 dictionaries, and a Naive Bayes classifier (the NLTK sentiment model).
    • To process part of the sentiment analysis, Apache Spark was used.
  • The sentiment scores were also used to help identify the ROI of each movie using a neural network model.

Project Conceptual Model

Results and Discussion

Variable Correlation Test

The graph, which was generated by R, shows the correlations between the independent variables and dependent variables.

There are three main conclusions based on the graph:

1. The graph demonstrates a positive correlation among view counts, comment counts, and likes/dislikes.

2. The graph was also used to test the hypothesis that movie trailer comment counts and movie trailer likes are positively correlated with movie box office performance.

3. Unfortunately, the three sentiment models have little correlation with the box office data (e.g., ROI), which means that the initial hypothesis was not supported. Two feature-based sentiment models have negative correlations with view counts, comment counts, and likes/dislikes.
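The correlation graph itself was produced in R. As a rough illustration of what each cell of such a graph measures, here is a minimal Python sketch of the Pearson correlation coefficient; the function name and the sample numbers are assumptions, not project data:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical trailer metrics, for illustration only.
views = [1_000, 2_000, 3_000, 4_000]
likes = [20, 40, 60, 80]
r_views_likes = pearson(views, likes)
```

A value near 1 corresponds to the strong positive relationship described in the first conclusion, while a value near 0 corresponds to the weak link between sentiment scores and ROI in the third.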

Time Series Analysis

  • Interestingly, overall ROI in 2008 turned out to be good despite the financial crisis.
  • Another interesting finding is that ROI continuously decreased after 2008.

Sentiment Analysis

Two approaches, covering three models in total, were implemented for sentiment analysis.

  • a polarity-based model using Bing Liu’s and a Harvard dictionary, which nets the counts of positive and negative words that can be found in each comment, and
  • the NLTK Sentiment Analyzer using the Vader dictionary, which is a rule-based approach
  • Scores were scaled and centered to zero to maintain positive scores > 0 and negative scores < 0. The scale is [-1,1].

  • Comparing the performance of the three models, the polarity-based models gravitated towards negative sentiment. This could be explained by the internal structure of the dictionaries used: if a dictionary contains more negative than positive words, comments are more likely to end up with a higher negative-word count.
  • For the NLTK Sentiment Analyzer, results showed more positive sentiment towards the comments.
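The polarity-based scoring described above can be sketched in a few lines. This is only an illustration: the tiny word sets stand in for the Bing Liu and Harvard dictionaries, and dividing by the number of matched words to land in [-1, 1] is an assumed scaling, since the post does not spell out the one it used.

```python
import re

# Hypothetical mini-lexicons standing in for the real dictionaries.
POSITIVE = {"good", "great", "love", "amazing"}
NEGATIVE = {"bad", "boring", "hate", "awful"}


def polarity(comment: str) -> float:
    """Net count of positive vs. negative words, scaled to [-1, 1]."""
    words = re.findall(r"[a-z']+", comment.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    matched = pos + neg
    if matched == 0:
        return 0.0  # no dictionary words found: treat as neutral
    return (pos - neg) / matched


scores = [polarity(c) for c in ("I love this, looks great!", "boring and bad")]
```

With this kind of scoring, a dictionary that lists more negative words than positive ones gives a comment more chances to match on the negative side, which is consistent with the skew described above.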

Sentiment Analysis – Movie Studios

  • Based on the Harvard sentiment dictionary, Paramount Vantage has the lowest average sentiment score whereas Weinstein Company has the highest.
  • The Vader sentiment dictionary determined that Apparition has the highest average sentiment score while Focus/Gramercy has the lowest.
  • The Bing Liu sentiment dictionary indicated that Freestyle Releasing and Apparition have the lowest and highest average sentiment scores, respectively.

Sentiment Analysis – Genre

  • When evaluating the Bing Liu and Harvard dictionaries, Romantic Comedies and Documentaries have the highest and lowest average sentiment score respectively.
  • Interestingly, for the NLTK Analyzer, the Concerts and Performances genre has the lowest average sentiment score, while Romantic Comedy has the highest score.

Clustering (to follow)

Predicting Box Office ROI Performance using Neural Net

  • ROI performance was classified using four bins:
    • Poor (below the 25% quantile)
    • Passing (between the 25% and 50% quantiles)
    • Ok (between the 50% and 75% quantiles)
    • Great (above the 75% quantile)
  • Neural Net implemented using R
  • ROI Performance ~ countsComments + countsViews + Ratio_of_Likes_and_Dislikes + ProdBudget + genre + MPAArating + MovieStudio + BingLiuSentiment + HarvardSentiment + VadeSentiment
  • Model Accuracy = 82.55%

    Neural Net Model Results
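The four ROI bins used as the neural net target can be reproduced with quartile cut points. The model itself was built in R; the sketch below uses Python's standard library instead, and the ROI values, the names, and the default interpolation method of `statistics.quantiles` are assumptions rather than the project's actual settings:

```python
from bisect import bisect_right
from statistics import quantiles

LABELS = ["Poor", "Passing", "Ok", "Great"]


def bin_roi(roi_values):
    """Label each ROI value by the quartile range it falls into."""
    cuts = quantiles(roi_values, n=4)  # the 25%, 50%, and 75% cut points
    # bisect_right counts how many cut points are <= the value, so a value
    # exactly on a cut point lands in the higher bin.
    return [LABELS[bisect_right(cuts, value)] for value in roi_values]


# Hypothetical ROI values, for illustration only.
roi_values = [-0.5, 0.1, 0.4, 0.9, 1.5, 2.0, 3.2, 5.0]
labels = bin_roi(roi_values)
```

Because the bins are quantile-based, each class holds roughly a quarter of the movies, so a model that always guessed one class would be right only about 25% of the time.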

Conclusion

  • Due to the success of the neural network model, companies now have the ability to accurately predict the ROI of their movies, specifically with the use of the number of YouTube comments, ratio of likes and dislikes, and their sentiment scores from the three models.
  • With the hypotheses predicted for the research, there is a higher probability of Box Office success which would then in return generate a higher ROI for movie studios and production companies. 
  • The sentiment results differ among the three dictionaries, which implies that some of the dictionaries treat more neutral words as negative or positive.
    • Better alternatives for predicting the sentiment of YouTube comments on movies are domain-specific dictionaries and machine learning classifiers trained on a sample comment-sentiment data set.

Scope and Limitations

  • There are many popular websites and applications that can be used to comment on trailers or movies, such as Rotten Tomatoes, Facebook, Twitter and so on. However, in this case, YouTube is the only trailer source used.

  • Trailers are not the only factors that impact box office and video sales. Other factors such as advertisements, the actors, and the competition from other movies being released at the same time can have an effect on the movie’s box office sales. However, these factors are not included in this study. Further studies could be conducted with those variables included.

Love in the Fastlane – Predicting Success in Speed Dating using Logistic Regression and R

This was submitted as a project for my Statistical Methods and Computation class in my MS Business Analytics program. The original title is “Love in the Fastlane: Success in Speed Dating”. My other teammate is Ruoxuan Gong. This project was done for educational purposes only. Click the photos to enlarge. Check out the GitHub page for the files and data set.

Problem Statement

The purpose of this study is to determine whether speed dating outcomes can be predicted and, if so, which factors are most important in helping speed dating participants successfully match with each other.

Methodology

Data Cleaning and Pre-processing

  • 8,369 rows containing speed dating round and participant information in one table
  • Table was separated into two tables: round-specific data (round condition and relative data) and participant data (demographics and interests)
  • The table contained a large number of double-counted rows:
    • Wave ID 1 with Participant A and Partner D
    • Wave ID 1 with Participant D and Partner A (double count; removed)
  • After removal of double-counts, nrow = 4,184.
  • Columns with a lot of missing data were removed.
  • Domain knowledge used to produce initial set of variables
  • Rows with missing data were removed; new nrow = 3,377
  • All Participants are women and all partners are men
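The double-count removal described above amounts to treating each (wave, pair) as unordered. The project did this in R; the Python sketch below only illustrates the idea, and the column names and sample rows are hypothetical:

```python
def drop_mirrored_rows(rows):
    """Keep the first row seen for each unordered pair within a wave."""
    seen = set()
    kept = []
    for row in rows:
        # frozenset ignores order, so (A, D) and (D, A) share one key.
        key = (row["wave"], frozenset((row["participant"], row["partner"])))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows, for illustration only.
rows = [
    {"wave": 1, "participant": "A", "partner": "D"},
    {"wave": 1, "participant": "D", "partner": "A"},  # mirror of the row above
    {"wave": 1, "participant": "A", "partner": "E"},
    {"wave": 2, "participant": "A", "partner": "D"},  # different wave, kept
]
deduplicated = drop_mirrored_rows(rows)
```

This version keeps whichever copy appears first. To end up with all participants being women, as in the final table, the choice of which copy to keep would instead have to depend on gender.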

Hypothesis

It is possible to predict, to a certain level of confidence, the outcome of a speed dating round (match or no match) by analyzing and taking into consideration the different factors during the round itself.

Variable Selection

  • Dependent Variable: Match (1/0)
  • Independent Variables: Round-specific data, difference in preference ratings (attractiveness, sincerity, intelligence, fun(funny?), ambitious, and shared hobbies), ratings for and by the participant

Tool and Model Selection

  • Tool: R was the main tool used for this project.
  • Model: Logistic Regression
    • Target is binary (Match = 1/0)
    • Assumptions:
      • Explanatory variables are measured without error
      • Model is correctly specified (No important variables are omitted, extraneous variables are excluded)
      • Outcomes not completely separable
      • No outliers
      • Variables should have little or no multicollinearity (VIF test)
      • Observations are independent (no time series, no in group data)
      • Sample size = n = at least 10 observations for each outcome (0/1) per predictor

Results and Discussion

Descriptive Analysis – Overall Data set

  • Here, we could see that the ages of the participants typically ranged from the 20s to 40s, with a few outliers
  • Ages 20-30: mixed ratings for almost all interests
  • Ages 30-40: Trend of rating goes higher for reading, movies, music, museums, and art

Distribution of Interests by Age by Gender

 

  • In the chart below, data shows that women are more interested in theater, art, and shopping, while men like gaming more.
  • Both genders share interests in dining, reading, movies, and music

  • In terms of preference, it would seem that women prefer intelligence over ambition and attractiveness
  • Men, on the other hand, prefer attractiveness over ambition and shared hobbies

Descriptive Analysis – Match = 1 Scenarios

  • We also wanted to take a look at what the average man and woman in match = 1 scenarios looked like
  • From the looks of it, men who joined were generally older than women
  • Also, men expected to be happier at the event than women did
  • Surprisingly, most joined the event to “have a fun night out”, “meet new people”, or “try it out”
  • Only a few joined for “romantic reasons”

Age, Goal, and Expected Happiness Comparison for Match = 1 Scenarios

  • There were high ratings from both genders for exercise, dining, hiking, music, concerts, and movies
  • There were small variances for museums, art, clubbing, reading, and TV
  • Large variances were seen for theater, shopping, and yoga
  • Lastly, both had low ratings for gaming (I don’t understand why… lol)

Interest Rating Comparison by Gender

  • Very low variances were observed in the preferences of men and women
  • Might suggest that a match tends to happen when both participant and partner have the same level of preference, no matter what level it is

Preference Comparison by Gender

  • This is what an average man and woman in a match = 1 scenario would look like in terms of their interest and preference ratings

Predictive Analysis – Logistic Regression

  • Moving on to predictive analytics, the first thing we did was check the distribution of the variables. Although multivariate normality isn’t an assumption of logistic regression, having normally distributed variables helps make the model stable.
  • We selected the variables that were not normal in shape and applied transformations to them.
  • Of the three versions (original, log-transformed, and square-root-transformed), the square root transformation produced the most normal-looking distributions, so we picked it over the original and log-transformed variables.
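One way to put a number on "most normal-looking" is to compare the skewness of each version of a variable, where values closer to zero are more symmetric. The post appears to have judged this from the plots, so the sketch below is an assumed alternative, with made-up ratings:

```python
from math import log, sqrt


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical right-skewed ratings (all positive, so log is defined).
ratings = [1, 1, 2, 2, 2, 3, 3, 4, 6, 10]
candidates = {
    "original": ratings,
    "log": [log(v) for v in ratings],
    "sqrt": [sqrt(v) for v in ratings],
}
skews = {name: skewness(values) for name, values in candidates.items()}
```

Which transformation comes out closest to zero depends on the data, so the result on these toy numbers says nothing about which one suited the speed dating variables.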

  • The initial model produced the following results:
    • Accuracy = 83.41%
    • Recall = 0.244
    • Precision = 0.659
    • AUC = 0.849
    • AIC = 1,945.32
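For reference, accuracy, recall, and precision all come from the same confusion matrix. The counts below are hypothetical, since the project's actual confusion matrix is not given in the text:

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # share of all rounds predicted correctly
        "recall": tp / (tp + fn),       # share of actual matches that were found
        "precision": tp / (tp + fp),    # share of predicted matches that were real
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=30, fp=10, tn=140, fn=20)
```

These formulas also show how a model can pair high accuracy with low recall, as in the results above: when non-matches dominate the data, correctly predicting them lifts accuracy even if many real matches are missed.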

  • We believed that this model could still be improved, so we applied a stepwise selection function to it.
  • After around 12 iterations, our round variables were reduced from 25 to 13
  • AIC decreased to 1,929.27 (from 1,945.32)
  • Although the ROC curve didn’t seem to change, the AUC score increased from 0.84989 to 0.84996
  • The equation below represents the model in its current form

  • To satisfy one of the assumptions (the model should be correctly specified, meaning there are no important variables that are omitted and all extraneous variables should be excluded), we removed all the variables that had high p-values (>0.05, except for the ones marked with “.”, which R uses for p-values between 0.05 and 0.1)
  • By removing the extraneous variables, we were able to make the coefficients of the remaining variables more reliable
  • The new model resulted in the equation below:

  • The improved model has a precision of 0.6739. In addition, the final AUC score is 0.847.
  • It also shows an improvement in the model’s total True Negatives.

Assumption Testing

  • Explanatory variables are measured without error
    • Limitation of using third-party data. We assumed this to be true.
  • Model is correctly specified (No important variables are omitted, extraneous variables are excluded)
    • Demonstrated above.
  • Outcomes not completely separable
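    • In R, the glm() function does not fit cleanly when this assumption is violated: it warns that fitted probabilities numerically 0 or 1 occurred or that the algorithm did not converge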
  • No outliers
  • Observations are independent
    • no time series, no in group data
  • Sample size = n = at least 10 observations for each outcome (0/1) per predictor
    • Our nrow more than covers this requirement
  • Variables should have little or no multicollinearity
    • In order to test for multicollinearity, we ran the Variance Inflation Factor test.
    • A VIF below 5 indicates low or no multicollinearity. Since the VIFs of our variables are all below 2, the level of multicollinearity in the model is negligible.
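The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A minimal sketch for the simplest case of two predictors, where R² is just their squared correlation; the general case needs a full regression per predictor, and the function names and data here are assumptions:

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictor values with correlation 0.8, for illustration only.
attractiveness = [1, 2, 3, 4, 5]
fun = [2, 1, 4, 3, 5]
vif = vif_two_predictors(attractiveness, fun)
```

Uncorrelated predictors give a VIF of 1, and the value grows without bound as the correlation approaches 1, which is why low VIFs are read as little multicollinearity.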

Conclusion

In conclusion, a logistic regression model can be used to predict the outcome of speed dating rounds. It can be represented using the formula below:

Implications

  • Attractiveness has a very big impact on producing a successful match, especially when females (participants) are perceived to be more attractive.
  • For females, meeting someone who brings fun to them increases the chance of getting a match.
  • In other words, males (partners) care about attractiveness more, and females prefer someone who has a sense of humor.

However, factors like having an ambitious personality have a negative impact on a successful match.

Scope and Limitations

  • Many valuable variables were excluded because of missing values
  • Given the reality of using third party data, information on the location and time of data collection is limited, so we cannot tell whether the results are biased

References

[1] Donald, B. (2013, May 6). New Stanford research on speed dating examines what makes couples ‘click’ in four minutes. Retrieved from http://news.stanford.edu/news/2013/may/jurafsky-mcfarland-dating-050613.html

[2] Data set: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/

Predicting the Winner of March Madness 2017 using R, Python, and Machine Learning

This project was done using R and Python, and the results were used as a submission to Deloitte’s March Madness Data Crunch Competition. Team members: Luo Yi, Yufei Long, and Yuyang Yue. Check the GitHub for the code.

Of the 64 teams that competed, we predicted Gonzaga University to win. Unfortunately, they lost to the University of North Carolina.

Methodology

  1. Data transformation
  2. Data exploration
    • Feature Correlation testing
    • Principal Component Analysis
  3. Feature Selection
  4. Model Testing
    • Decision Tree
    • Logistic Regression
    • Random Forest
  5. Results and other analysis

Data Transformation

The data used to train the initial model came from a data set that contained 2002-2016 team performance data, which included statistics, efficiency ratings, etc., from different sources. Each row was a game that consisted of two teams and their respective performance data. For the initial training of the models, we were instructed to use 2002-2013 data as the training set and 2014-2016 data as the testing set. After examining the data, we debated the best way to use it. We finally decided to create new relative variables that reflect the difference/ratio of team 1 and team 2’s performance. Feature correlation testing was also done during this phase. The results supported the need for relative variables.

Features Correlation Heatmap (Original Features)
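The relative variables described above can be sketched as follows. The stat names are hypothetical, and whether a given stat was turned into a difference, a ratio, or both is not stated in the post, so this version simply produces both:

```python
def relative_features(team1: dict, team2: dict) -> dict:
    """Difference and ratio of team 1 vs. team 2 for every shared stat."""
    features = {}
    for stat in team1.keys() & team2.keys():
        features[f"{stat}_diff"] = team1[stat] - team2[stat]
        if team2[stat] != 0:  # skip the ratio when it would divide by zero
            features[f"{stat}_ratio"] = team1[stat] / team2[stat]
    return features


# Hypothetical per-game stats, for illustration only.
game = relative_features(
    {"off_eff": 110.0, "seed": 2},
    {"off_eff": 100.0, "seed": 8},
)
```

Collapsing each pair of team columns into one relative column turns two strongly related inputs into a single signal about which team is ahead on that stat.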

Data Exploration

After transformation, feature correlation testing was repeated. This time, results were much more favorable. The heat map below shows that the correlation between the new variables is acceptable.

Features Correlation Heatmap (New Features)

Principal Component Analysis was also performed on the new features. We hoped to show which features were the most influential, even before running any machine learning models. Imputation was done to deal with missing values. The thicker lines in the chart below signify a more influential link to the 8 new discriminant features. This, however, was used to understand the features more and wasn’t used as an input for all the models.

Feature Selection by PCA

Feature Selection

For this project, we opted to remove anything (aside from seed and distance from game location) that wasn’t a performance metric. Some of the variables that were discarded were ratings data since we believed that they were too subjective to be reliable indicators.

Model Testing

We used three models for this project: Decision Tree, Logistic Regression, and Random Forest.

Decision Tree – Results were less than favorable for this model. Overfitting occurred and we had to drop it.

Random Forest (R) – We chose the Random Forest model for two reasons: it is less prone to overfitting than a single decision tree, and its predictions are democratic, i.e. decided by a vote across many trees.

Predictor Importance

  • OOB Estimate of error rate: 26.9%
  • Error reduction plateaus at approx. 2,600 trees
  • Model Log-loss: 0.5556
  • Chart Legend:
    • Black:  Out-of-bag estimate of error rate
    • Green and Red: Class errors

Forest Error Performance
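Log loss, reported above for the Random Forest model, penalizes confident wrong probabilities more heavily than hesitant ones. A minimal sketch of the binary version, with hypothetical inputs; the clipping constant is an assumption to avoid taking the log of zero:

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)


# Predicting 0.5 for every game is the "coin flip" baseline.
baseline = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

The baseline works out to ln 2, roughly 0.693, so a log loss of 0.5556 is better than always predicting 50/50.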

Logistic Regression (Python) – Based on the PCA and the Random Forest model, 5 features were selected for this model.

Features Selected for Logistic Regression

Results and Other Analysis

Summary of Results

Running both models against the testing set, we got a higher accuracy for the Random Forest model. Log loss, which was also one of the key performance indicators for the competition, was roughly the same for the 2 models. Random Forest was therefore chosen to run on the new 2017 March Madness data.

As previously mentioned, we predicted Gonzaga University to win the tournament, and we came really close. The prediction made a lot of sense because, compared to the other teams, Gonzaga was a frequent contender in March Madness.

One of the more interesting teams this season was the Cinderella team, South Carolina. They had gone against expectations, and this is why we decided to analyze their journey even further.

In the 1st round, we were able to correctly predict that South Carolina was going to win. However, because we were using historical data, it was obvious that we were going to predict them to lose in the next stages, especially since they were going against stronger teams. Despite “water under the bridge” data, they were able to reach the Final 4.

Cinderella Team Win Rate by Stage

One of the questions that we wanted to attempt to answer was why they kept on winning. What was so different this year that they were able to surprise everyone?

One reason that we speculated about was the high performance of one of South Carolina’s players, Sindarius Thornwell. In past years, he averaged 11-13 pts per game. This year, he was dropping 21.4 pts per game. Moreover, in his last 5 appearances, he was able to increase this stat to 23.6 pts per game. Looking at the score difference of South Carolina’s games in March Madness, it is evident that he was very influential in the team’s success. One could even say that without his 23.6 pts per game, the outcome of their campaign would’ve been different. But hey, that’s just speculation.

Score Difference for Cinderella Team Matches

 

Sindarius Thornwell March Madness Stats