Predicting the Winner of March Madness 2017 using R, Python, and Machine Learning

This project was done using R and Python, and the results were used as a submission to Deloitte’s March Madness Data Crunch Competition. Team members: Luo Yi, Yufei Long, and Yuyang Yue. Check the GitHub for the code.

Of the 64 teams that competed, we predicted Gonzaga University to win. Unfortunately, they lost to the University of North Carolina in the championship game.

Methodology

  1. Data transformation
  2. Data exploration
    • Feature Correlation testing
    • Principal Component Analysis
  3. Feature Selection
  4. Model Testing
    • Decision Tree
    • Logistic Regression
    • Random Forest
  5. Results and other analysis

Data Transformation

The initial models were trained on a data set of 2002-2016 team performance data, which combined statistics, efficiency ratings, and other measures from different sources. Each row was a game between two teams, along with their respective performance data. For the initial training, we were instructed to use the 2002-2013 data as the training set and the 2014-2016 data as the testing set. After examining the data, we debated the best way to use it. We finally decided to create new relative variables that reflect the difference or ratio between team 1’s and team 2’s performance. Feature correlation testing was also done during this phase, and the results supported the need for relative variables.
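
As a rough illustration of the relative variables described above, the sketch below derives a difference and a ratio for each statistic of a single game row. The column names (`team1_off_eff`, `team2_seed`, and so on) and the helper `relative_features` are hypothetical, since the original data set's schema is not shown here.

```python
def relative_features(game, stats):
    """Build difference and ratio features for team 1 vs. team 2.

    `game` is one row of the data set as a dict. Column names follow a
    hypothetical `team1_<stat>` / `team2_<stat>` pattern.
    """
    features = {}
    for stat in stats:
        t1 = game[f"team1_{stat}"]
        t2 = game[f"team2_{stat}"]
        features[f"{stat}_diff"] = t1 - t2
        # A ratio is undefined when team 2's value is zero, so mark it as NaN.
        features[f"{stat}_ratio"] = t1 / t2 if t2 != 0 else float("nan")
    return features


# Toy row for illustration only, not taken from the competition data.
game = {"team1_off_eff": 118.0, "team2_off_eff": 104.0,
        "team1_seed": 1, "team2_seed": 8}
rel = relative_features(game, ["off_eff", "seed"])
```

Whether a difference or a ratio is more useful depends on the statistic; seed, for example, is ordinal, so the difference is usually easier to interpret than the ratio.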

Features Correlation Heatmap (Original Features)

Data Exploration

After transformation, feature correlation testing was repeated. This time, results were much more favorable. The heat map below shows that the correlation between the new variables is acceptable.
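
A correlation check of this kind can be sketched in plain Python as shown below. The feature names and values are toy data, and the 0.8 cutoff for flagging a pair is an assumption, not the threshold used in the project.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def highly_correlated(matrix, threshold=0.8):
    """List each feature pair whose absolute correlation reaches the threshold."""
    names = list(matrix)
    return [(a, b, matrix[a][b])
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(matrix[a][b]) >= threshold]


# Toy relative features for four games (illustrative values only).
columns = {
    "seed_diff": [-7, 3, -1, 5],
    "off_eff_diff": [14.0, -6.0, 2.0, -9.0],
    "dist_diff": [100, 250, -50, 30],
}
matrix = correlation_matrix(columns)
flagged = highly_correlated(matrix)
```

Pairs returned by `highly_correlated` are candidates for dropping or combining before model training.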

Features Correlation Heatmap (New Features)

Principal Component Analysis was also performed on the new features. We hoped to show which features were the most influential, even before running any machine learning models. Imputation was done to deal with missing values. The thicker lines in the chart below signify a more influential link to the 8 new discriminant features. This analysis, however, served mainly to help us understand the features better; it was not used as a direct input to every model.

Feature Selection by PCA

Feature Selection

For this project, we opted to remove anything that wasn’t a performance metric, with the exception of seed and distance from game location. Ratings data were among the discarded variables, since we believed they were too subjective to be reliable indicators.

Model Testing

We used three models for this project: Decision Tree, Logistic Regression, and Random Forest.

Decision Tree – Results for this model were poor. It overfit the training data, so we had to drop it.

Random Forest (R) – We decided to use the Random Forest model for 2 reasons: it is less prone to overfitting than a single decision tree, and it is democratic in nature, with many trees voting on each prediction.
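
The voting idea behind a random forest can be illustrated with a minimal sketch. This is not the R model used in the project; `forest_predict` is a hypothetical helper that only shows how per-tree votes turn into a class label and a win probability.

```python
def forest_predict(tree_votes):
    """Combine per-tree votes (1 = team 1 wins, 0 = team 2 wins).

    Returns the majority label and the share of trees voting for team 1,
    which can be read as an estimated win probability.
    """
    p_team1 = sum(tree_votes) / len(tree_votes)
    label = 1 if p_team1 >= 0.5 else 0
    return label, p_team1


# Hypothetical votes from 7 trees for one game.
votes = [1, 1, 0, 1, 0, 1, 1]
label, p_team1 = forest_predict(votes)
```

Because each tree is grown on a different bootstrap sample, a single overfit tree has limited influence on the final vote.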

Predictor Importance

  • OOB Estimate of error rate: 26.9%
  • Error reduction plateaus at approx. 2,600 trees
  • Model Log-loss: 0.5556
  • Chart Legend:
    • Black:  Out-of-bag estimate of error rate
    • Green and Red: Class errors

Forest Error Performance

Logistic Regression (Python) – Based on the PCA and the Random Forest model, 5 features were selected for this model.

Features Selected for Logistic Regression

Results and Other Analysis

Summary of Results

Running both models against the testing set, we got a higher accuracy for the Random Forest model. Log loss, which was also one of the key performance indicators for the competition, was roughly the same for the 2 models. Random Forest was therefore chosen to run on the new 2017 March Madness data.
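
For reference, the two metrics mentioned above can be computed as in the sketch below. The labels and probabilities are toy values, not the project's predictions, and the 0.5 decision threshold is an assumption.

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss: mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities so log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)


def accuracy(y_true, p_pred, threshold=0.5):
    """Share of games where the thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


# Toy example: 1 means team 1 won; p is the predicted probability of that.
y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.2, 0.6, 0.4]
ll = log_loss(y_true, p_pred)
acc = accuracy(y_true, p_pred)
```

Log loss penalizes confident wrong predictions heavily, which is why two models with different accuracy can still have similar log loss.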

As previously mentioned, we predicted Gonzaga University to win the tournament. They fell short, but we came really close. The prediction made a lot of sense because, compared to the other teams, Gonzaga was a frequent contender in March Madness.

One of the more interesting teams this season was the Cinderella team, South Carolina. They defied expectations, which is why we decided to analyze their journey further.

In the 1st round, we correctly predicted that South Carolina would win. However, because we relied on historical data, the model unsurprisingly predicted them to lose in the later stages, especially since they faced stronger teams. Despite what the historical data suggested, they reached the Final Four.

Cinderella Team Win Rate by Stage

One of the questions that we wanted to attempt to answer was why they kept on winning. What was so different this year that they were able to surprise everyone?

One reason that we speculated about was the high performance of one of South Carolina’s players, Sindarius Thornwell. In past years, he averaged 11-13 pts per game. This year, he averaged 21.4 pts per game. Moreover, in his last 5 appearances, he raised this stat to 23.6 pts per game. Looking at the score differences in South Carolina’s March Madness games, it appears that he was very influential in the team’s success. One could even say that without his 23.6 pts per game, the outcome of their campaign would’ve been different. But hey, that’s just speculation.

Score Difference for Cinderella Team Matches

Sindarius Thornwell March Madness Stats

Predicting Success using SPSS: An Indiegogo Prediction Study

This was submitted as a project for my data mining class in my MS Business Analytics program. My team included classmates Anh Duong and Luoqi Deng. This project was done for educational purposes only.

Abstract & Key Learnings

  • The hypothesis that crowdfunding campaigns from Colorado have higher success rates than campaigns located elsewhere was supported.
  • The category, the number of comments, the number of funders, and the fund goal of the campaign contribute significantly to its success.
  • If a campaign creator from Colorado promotes these factors, they will be able to increase their chances of success significantly.

Project Rationale

  • Indiegogo, the largest global crowdfunding and fundraising site online, has funded over one hundred seventy-five thousand (175K) campaigns, with an estimated valuation of eight hundred million (800M) dollars, in the past seven years. Over two million five hundred thousand (2.5M) funders from two hundred sixty-six (266) countries have contributed to the success of these campaigns.
  • A recent tally by Krowdster.co, a crowdfunding marketing and PR solution provider, found that nine out of ten (90%) Indiegogo campaigns fail to reach their goal, which is significantly higher than the 66.6% failure rate of Kickstarter, one of Indiegogo’s biggest competitors.

Problem Statement

This study aimed to determine whether campaign success can be predicted by certain attributes of its profile, campaign activity, and Indiegogo funder engagement.

Data Description/Preprocessing

The data used came from a public data set from BigML.com.

  • 15K rows
  • Global data
  • No time factor
  • Contains finished and unfinished campaigns

Original data set variables

Cleaning

In order to make the data usable, the data set had to undergo cleaning. All the blank cells were turned into zeroes. Variables that had inconsistent units (hours, days, minutes) had to be converted into consistent values. To make analysis easier and more reflective of performance, two calculated fields were created:

  • Attainment Rate = (Raised / Goal) × 100, expressed as a percentage
  • Attained = 1 if Attainment Rate >= 100, else 0
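
The cleaning steps and calculated fields can be sketched as follows. The study itself was done in SPSS, so this is only an illustration. It assumes Attainment Rate is a percentage (which the threshold of 100 implies), uses hypothetical field names `raised` and `goal`, and picks minutes as the common time unit purely for the example.

```python
MINUTES_PER_UNIT = {"minutes": 1, "hours": 60, "days": 1440}


def to_minutes(value, unit):
    """Convert a duration to minutes so every row uses the same unit."""
    return value * MINUTES_PER_UNIT[unit]


def add_attainment_fields(campaign):
    """Return a copy of the campaign with the two calculated fields added."""
    raised = campaign.get("raised") or 0  # blank cells are treated as zero
    goal = campaign.get("goal") or 0
    # Avoid dividing by zero when the goal is missing or zero.
    rate = (raised / goal) * 100 if goal > 0 else 0.0
    return {**campaign,
            "attainment_rate": rate,
            "attained": 1 if rate >= 100 else 0}


# Toy campaign for illustration only.
funded = add_attainment_fields({"raised": 1500, "goal": 1000})
```

Here `funded["attainment_rate"]` is 150.0 and `funded["attained"]` is 1, while a campaign with a blank `raised` value would get 0 for both fields.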

Variable Selection

State and country were disregarded because the scope of the study was focused on Colorado-only campaigns.

Model Building

For this study, Decision Tree, Neural Network, Bayesian Network, and Clustering models were used to determine the most important predictors of a campaign’s success. Multiple models were chosen so that the outcomes could be compared and a more reliable conclusion could be drawn. The association model wasn’t used because, although it would be interesting to determine the confidence and correlation of the occurrence of the different variables, it would contribute little to the goal of this study.

Assumptions and Limitations

  • The study is limited by the researchers’ inability to collect all the variables, such as the important dates of the campaigns. The results of our predictive model are based solely on the pre-given attributes.
  • The observations are independent of each other
  • Campaigns happened within the same year
  • A campaign is successful when the amount of capital raised by the crowdfunders meets or exceeds the goal set by the campaign owners. We therefore created two new calculated fields in our dataset, Attainment Rate and Attained. Attainment Rate is the Raised amount over the Goal amount, expressed as a percentage. If the Attainment Rate is at least 100, the campaign is attained, which is indicated as 1 in the Attained field; otherwise it is 0. The two fields are used as targets for our models.

Hypothesis

According to Krowdster.co, only 10% of Indiegogo campaigns meet their goals. In our sample data set of Colorado campaigns, the actual success rate we found is 30%.

  • The p-value (Sig.) of the one-sample t-test is less than 0.001; thus, we reject the null hypothesis that the sample mean is equal to the hypothesized population mean of 10%.
  • This means that crowdfunding campaigns in Colorado have a higher chance of success than the average campaign globally.
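
The test above was run in SPSS. As a rough sketch of the same idea, the code below computes a one-sample t statistic against a hypothesized mean of 10%. The sample is toy data (60 successes out of 200 campaigns), not the study's data, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sample_t(sample, mu0):
    """Return the t statistic and an approximate two-sided p-value."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    # Normal approximation to the t distribution (assumes large n).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Toy data: 1 = campaign attained its goal, 0 = it did not.
sample = [1] * 60 + [0] * 140
t_stat, p_value = one_sample_t(sample, mu0=0.10)
```

With these toy numbers the t statistic is a little above 6, so the approximate p-value falls well below 0.001.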

Descriptive/Regression Analysis

Correlation Table

  • Consideration: potential correlation between continuous variables.
    • A strong correlation between two variables makes the results harder to interpret. The Correlation table shows the pairwise correlations between the variables. Two pairs, Updates & Gallery (0.589) and Comments & Funders (0.681), show a relatively strong positive correlation.
    • However, since we already have very few variables, we decided not to conduct Principal Component Analysis or Factor Analysis to drop variables.
  • It is helpful to understand whether a high number of comments, updates, gallery items, or funders is correlated with the attainment rate. If the correlation is indeed strong, campaign owners could promote that variable to increase the attainment rate.
    • Attainment Rate has a correlation of 0.397 and 0.44 with Comments and Funders, respectively, which suggests that a high number of comments and funders is associated with a higher attainment rate.

Statistics Table

  • The Statistics table (Figure 2) shows the descriptive statistics about the continuous variables and the outcome Attainment Rate.
    • All the continuous variables are heavily skewed to the right (mean >> median). The skewness of the dataset makes it difficult to apply hypothesis testing, since many common tests assume normally distributed data.
    • Regarding the Attainment Rate, both its mean and median are below 100, which indicates the high failure rate of the sample.
    • The standard deviation of 260 and the maximum value of 14281 suggest that the dataset contains extreme outliers.
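
One quick way to check the direction of skew is to compare the mean with the median. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. The values are toy data, not the study's, and `skew_summary` is a hypothetical helper.

```python
from statistics import mean, median, stdev


def skew_summary(values):
    """Summarize skew direction using Pearson's second skewness coefficient."""
    m, med, s = mean(values), median(values), stdev(values)
    skew = 3 * (m - med) / s
    if skew > 0:
        direction = "right"  # long tail of large values pulls the mean up
    elif skew < 0:
        direction = "left"   # long tail of small values pulls the mean down
    else:
        direction = "none"
    return {"mean": m, "median": med, "skew": skew, "direction": direction}


# Toy attainment rates with one extreme campaign.
summary = skew_summary([0, 5, 10, 20, 40, 1400])
```

A single extreme value is enough to pull the mean far above the median, which is the pattern a right-skewed variable shows.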

Distribution of Campaigns based on Category

  • The Category table (Figure 3) shows the distribution of Category, the only relevant discrete variable in the sample data set, since the geographical dimension has been eliminated.

Predictive Analysis

If the results from the classification models are similar to each other, a conclusion about the predictor of success of a campaign can be confidently drawn.

Decision Tree

Preparations:

  • Partition: 50-50 Training and Testing data
  • Irrelevant attributes, such as text and URLs, were cast as typeless to prevent noise and misrepresentation in the model.
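
A training/testing partition like the ones used for these models can be sketched as below. This is not the SPSS partitioning itself. `partition` is a hypothetical helper that shuffles the rows with a fixed seed and cuts them at the requested fraction.

```python
import random


def partition(rows, train_fraction=0.5, seed=42):
    """Shuffle rows reproducibly and split them into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split can be repeated
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy row ids for illustration only.
train, test = partition(range(10))                  # 50-50 split
train70, test30 = partition(range(10), 0.7)         # 70-30 split
```

Keeping the seed fixed makes accuracy comparisons between models fairer, because every model then sees the same training and testing rows.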

Model Summary

Accuracy test between training and testing data

Neural Network

Preparations:

  • Partition: 70-30 Training and Testing data

Model Summary

Predictor Importance

Accuracy test between training and testing data

Histograms for Goal and Comments

Bayesian Network

Preparations:

  • Partition: 50-50 Training and Testing data

Model Summary

Coincidence Matrix

Clustering

Model Summary

Distribution of Data

Summary

Overall, the performance of our analysis is at a fair to moderate level. The results from our classification models are reasonably consistent with each other, which gives us more confidence in their outcomes. The following table shows the summary of our analysis:

Summary of Predictive Analysis Results

Domain Knowledge

Domain knowledge was gathered to ground the study in what happens in the business context.

  • With the advent of social media, it has become more evident that crowdfunding is not bound by geographical constraints.
  • It was also observed that funders exhibit a herding mentality. This means that as a campaign accumulates capital, more and more individual funders are motivated to make the campaign a success.
  • Immediate markets, such as friends and family, play an important role in the early success of a crowdfunding campaign. This market has been known to spike the fund during the first few days of a campaign.
  • Crowdfunding is a platform of incentives for both the creators and funders.
  • Funders lose motivation in supporting campaigns where creator incompetence and inexperience are evident.

Moving Forward

  • A bigger dataset, in terms of both dimensions and number of records, would likely improve the overall quality of the classification results.