Love in the Fastlane – Predicting Success in Speed Dating using Logistic Regression and R

This was submitted as a project for my Statistical Methods and Computation class in my MS Business Analytics program. The original title is “Love in the Fastlane: Success in Speed Dating”. My teammate was Ruoxuan Gong. This project was done for educational purposes only. Check out the GitHub page for the files and data set.

Problem Statement

The purpose of this study is to determine whether speed dating outcomes can be predicted and, if so, which factors matter most in helping speed dating participants match with each other.

Methodology

Data Cleaning and Pre-processing

  • 8,369 rows containing speed dating round and participant information in one table
  • Table was separated into two tables: round-specific data (round condition and relative data) and participant data (demographics and interests)
  • The data contained a large number of double-counted rows, for example:
    • Wave ID 1 with Participant A and Partner D
    • Wave ID 1 with Participant D and Partner A (a double-count, removed)
  • After removal of double-counts, nrow = 4,184
  • Columns with many missing values were removed
  • Domain knowledge was used to produce the initial set of variables
  • Rows with missing data were removed; new nrow = 3,377
  • All participants are women and all partners are men
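
The double-count removal described above can be sketched as follows. The project itself was done in R; this is a minimal Python illustration, and the field names (`wave`, `participant`, `partner`) and sample rows are hypothetical, not taken from the data set.

```python
def drop_mirrored_rounds(rows):
    """Keep the first row seen for each unordered pairing within a wave."""
    seen = set()
    kept = []
    for row in rows:
        # frozenset ignores order, so (A, D) and (D, A) produce the same key.
        key = (row["wave"], frozenset((row["participant"], row["partner"])))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows for illustration only.
rounds = [
    {"wave": 1, "participant": "A", "partner": "D"},
    {"wave": 1, "participant": "D", "partner": "A"},  # mirror of the row above
    {"wave": 1, "participant": "A", "partner": "E"},
    {"wave": 2, "participant": "A", "partner": "D"},  # different wave, kept
]
unique_rounds = drop_mirrored_rounds(rounds)
print(len(unique_rounds))
```

Which of the two mirrored rows is kept depends on row order here; keeping the row where the participant is a woman, as the last bullet implies, would need an extra filter.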

Hypothesis

It is possible to predict, to a certain level of confidence, the outcome of a speed dating round (match or no match) by analyzing the different factors at play during the round itself.

Variable Selection

  • Dependent Variable: Match (1/0)
  • Independent Variables: Round-specific data, difference in preference ratings (attractiveness, sincerity, intelligence, fun, ambition, and shared hobbies), ratings for and by the participant

Tool and Model Selection

  • Tool: R was the main tool used for this project.
  • Model: Logistic Regression
    • Target is binary (Match = 1/0)
    • Assumptions:
      • Explanatory variables are measured without error
      • Model is correctly specified (No important variables are omitted, extraneous variables are excluded)
      • Outcomes not completely separable
      • No outliers
      • Variables should have little or no multicollinearity (VIF test)
      • Observations are independent (no time series, no grouped data)
      • Sample size: at least 10 observations of each outcome (0/1) per predictor

Results and Discussion

Descriptive Analysis – Overall Data set

  • The ages of the participants typically ranged from the 20s to the 40s, with a few outliers
  • Ages 20-30: mixed ratings for almost all interests
  • Ages 30-40: ratings trend higher for reading, movies, music, museums, and art

Distribution of Interests by Age by Gender

 

  • In the chart below, data shows that women are more interested in theater, art, and shopping, while men like gaming more.
  • Both genders share interests in dining, reading, movies, and music

  • In terms of preference, it would seem that women prefer intelligence over ambition and attractiveness
  • Men, on the other hand, prefer attractiveness over ambition and shared hobbies

Descriptive Analysis – Match = 1 Scenarios

  • We also wanted to see what the average man and woman in match = 1 scenarios looked like
  • Men who joined were generally older than the women
  • Men also expected to be happier at the event than women did
  • Surprisingly, most joined the event to “have a fun night out”, “meet new people”, or “try it out”
  • Only very few joined for “romantic reasons”

Age, Goal, and Expected Happiness Comparison for Match = 1 Scenarios

  • There were high ratings from both genders for exercise, dining, hiking, music, concerts, and movies
  • There were small variances for museums, art, clubbing, reading, and TV
  • Large variances were seen for theater, shopping, and yoga
  • Lastly, both had low ratings for gaming (I don’t understand why… lol)

Interest Rating Comparison by Gender

  • Very low variances were observed in the preferences of men and women
  • This might suggest that a match tends to happen when both participant and partner have the same level of preference, whatever that level is

Preference Comparison by Gender

  • This is what an average man and woman in a match = 1 scenario look like in terms of their interest and preference ratings

Predictive Analysis – Logistic Regression

  • Moving to predictive analytics, we first checked the distribution of the variables. Although multivariate normality isn’t an assumption of logistic regression, roughly normal variables help keep the model stable.
  • We selected the variables that were not normal in shape and applied transformations to them.
  • Of the three versions compared (original, log-transformed, and square-root-transformed), the square root transformation produced the most normal-looking distributions, so we used it instead of the original and log-transformed variables.
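
One way to compare the three versions numerically, rather than by eye, is to look at skewness (values near 0 indicate a more symmetric shape). The sketch below is Python rather than the R used in the project, the ratings are hypothetical, and skewness is only one possible yardstick; the original comparison was visual.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed values (all positive so the log is defined).
raw = [1, 1, 2, 2, 2, 3, 3, 4, 6, 9, 16, 25]

candidates = {
    "original": raw,
    "log": [math.log(v) for v in raw],
    "sqrt": [math.sqrt(v) for v in raw],
}
skew_by_name = {name: skewness(vals) for name, vals in candidates.items()}

for name, value in skew_by_name.items():
    print(f"{name}: skewness = {value:.3f}")
```

Which transformation comes out best depends on the data; on the project's variables the square root was preferred, which this toy sample does not need to reproduce.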

  • The initial model produced the following results:
    • Accuracy = 83.41%
    • Recall = 0.244
    • Precision = 0.659
    • AUC = 0.849
    • AIC = 1,945.32
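
For reference, accuracy, recall, and precision all come from the four cells of the confusion matrix. The Python sketch below shows the standard definitions; the counts are hypothetical and are not the project's confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (match = positive class)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all rounds classified correctly
        "recall": tp / (tp + fn),        # share of actual matches that were found
        "precision": tp / (tp + fp),     # share of predicted matches that were real
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=20, fp=10, tn=150, fn=20)
print(metrics)
```

High accuracy together with low recall, as in the results above, is a common pattern when the positive class (a match) is much rarer than the negative class.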

  • We believed that this model could still be improved, so we applied stepwise selection to it.
  • After around 12 iterations, our round variables were reduced to 13 (from 25)
  • AIC decreased to 1,929.27 (from 1,945.32)
  • Although the ROC curve didn’t seem to change, the AUC score increased from 0.84989 to 0.84996
  • The equation below represents the model in its current form

  • To satisfy the assumption that the model is correctly specified (no important variables omitted, no extraneous variables included), we removed the variables with high p-values (> 0.05, except those flagged with “.” in the R output, which marks 0.05 ≤ p < 0.1)
  • Removing the extraneous variables made the coefficients of the remaining variables more reliable
  • The new model resulted in the equation below:

  • The improved model has a precision of 0.6739, and its final AUC score is 0.847.
  • It also shows an improvement in the model’s total True Negatives.
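
The fitted equations referred to above are not reproduced in this text. As a reminder of the general form only, a logistic regression turns a linear combination of predictors into a probability through the logistic function. In the Python sketch below, the variable names, coefficients, and inputs are all hypothetical and are not the project's fitted values.

```python
import math


def match_probability(intercept, coefficients, features):
    """Logistic regression: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and inputs for illustration only.
coefs = {"attractiveness": 0.5, "fun": 0.3, "ambition": -0.2}
round_features = {"attractiveness": 2.0, "fun": 1.0, "ambition": 1.0}
p = match_probability(-1.1, coefs, round_features)
print(round(p, 3))
```

A round would then be classified as a match when `p` exceeds a chosen cutoff, commonly 0.5.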

Assumption Testing

  • Explanatory variables are measured without error
    • Limitation of using third-party data. We assumed this to be true.
  • Model is correctly specified (No important variables are omitted, extraneous variables are excluded)
    • Demonstrated above.
  • Outcomes not completely separable
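    • In R, glm() typically warns that fitted probabilities of 0 or 1 occurred, or that the algorithm did not converge, when outcomes are completely separable, so a fit without these warnings suggests the assumption holds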
  • No outliers
  • Observations are independent
    • No time series and no grouped data
  • Sample size: at least 10 observations of each outcome (0/1) per predictor
    • Our nrow more than covers this requirement
  • Variables should have little or no multicollinearity
    • In order to test for multicollinearity, we ran the Variance Inflation Factor test.
    • A VIF below 5 means that the model has low or no multicollinearity. Since the VIFs of our variables are all below 2, the level of multicollinearity in the model is negligible.
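
To make the VIF concrete: for predictor j, VIF = 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. With only two predictors that R² is simply their squared Pearson correlation, which is the simplified case sketched below in Python. The ratings are hypothetical; in the project the test was run in R on the full model.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """Two-predictor special case: R^2 equals r^2, so VIF = 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical ratings for illustration only.
attractiveness = [6, 7, 5, 8, 4, 9, 6, 7]
fun = [5, 6, 6, 7, 5, 6, 4, 8]
vif = vif_two_predictors(attractiveness, fun)
print(round(vif, 3))
```

A VIF of exactly 1 means the predictors are uncorrelated; the value grows without bound as the correlation approaches ±1.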

Conclusion

In conclusion, a logistic regression model can be used to predict the outcome of speed dating rounds. It can be represented using the formula below:

Implications

  • Attractiveness has a very big impact on producing a successful match, especially when females (participants) are perceived to be more attractive.
  • For females, meeting someone who brings fun to them increases the chance of getting a match.
  • In other words, males (partners) care about attractiveness more, and females prefer someone who has a sense of humor.

However, factors like having an ambitious personality have a negative impact on the chance of a successful match.

Scope and Limitations

  • Many valuable variables were excluded because of missing values
  • Because this is third-party data, information about where and when it was collected is limited, so we cannot tell whether the results are affected by sampling biases

References

[1] Donald, B. (2013, May 6). New Stanford research on speed dating examines what makes couples ‘click’ in four minutes. Retrieved from http://news.stanford.edu/news/2013/may/jurafsky-mcfarland-dating-050613.html

[2] Data set: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/

Predicting Success using SPSS: An Indiegogo Prediction Study

This was submitted as a project for my data mining class in my MS Business Analytics program. My team included classmates Anh Duong and Luoqi Deng. This project was done for educational purposes only.

Abstract & Key Learnings

  • The hypothesis that crowdfunding campaigns from Colorado have higher success rates than campaigns located elsewhere was supported by the data.
  • The category, the number of comments, the number of funders, and the fund goal of the campaign contribute significantly to its success.
  • If a campaign creator from Colorado promotes these factors, they will be able to increase their chances of success significantly.

Project Rationale

  • Indiegogo, the largest global crowdfunding and fundraising site online, has funded over one hundred seventy-five thousand (175K) campaigns, with an estimated valuation of eight hundred million (800M) dollars, in the past seven years. Over two million five hundred thousand (2.5M) funders from two hundred sixty-six (266) countries have contributed to the success of these campaigns.
  • A recent tally by Krowdster.co, a crowdfunding marketing and PR solution provider, found that nine in ten (90%) Indiegogo campaigns fail to reach their goal, significantly higher than the 66.6% failure rate of Kickstarter, one of Indiegogo’s biggest competitors.

Problem Statement

This study aimed to determine whether campaign success can be predicted by certain attributes of its profile, campaign activity, and Indiegogo funder engagement.

Data Description/Preprocessing

The data used came from a public data set from BigML.com.

  • 15K rows
  • Global data
  • No time factor
  • Contains finished and unfinished campaigns

Original data set variables

Cleaning

In order to make the data usable, the data set had to undergo cleaning. All the blank cells were turned into zeroes. Variables that had inconsistent units (hours, days, minutes) had to be converted into consistent values. To make analysis easier and more reflective of performance, two calculated fields were created:

  • Attainment Rate = Raised / Goal × 100 (expressed as a percentage)
  • Attained = 1 if Attainment Rate >= 100, else 0
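
The two calculated fields can be sketched in Python as below. The sketch assumes the rate is expressed as a percentage, which the threshold of 100 implies; the sample amounts are hypothetical.

```python
def attainment(raised, goal):
    """Return (attainment rate in percent, attained flag) for one campaign."""
    rate = raised / goal * 100.0
    attained = 1 if rate >= 100.0 else 0
    return rate, attained


# Hypothetical (raised, goal) pairs for illustration only.
campaigns = [(500, 1000), (1000, 1000), (1500, 1000)]
for raised, goal in campaigns:
    rate, attained = attainment(raised, goal)
    print(f"raised={raised} goal={goal} rate={rate:.1f}% attained={attained}")
```

A campaign with a goal of zero would need separate handling, since the division is undefined.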

Variable Selection

State and country were disregarded because the scope of the study was focused on Colorado-only campaigns.

Model Building

For this study, the Decision Tree, Neural Network, Bayesian, and Clustering models were used to determine the most important predictors to a campaign’s success. Multiple models were chosen so that the outcomes could be compared and a more logical conclusion could be confidently drawn. The association model wasn’t used because, although it would be interesting to determine the confidence and correlation of the occurrence of the different variables, it would contribute little to the goal of this study.

Assumptions and Limitations

  • The study is limited by the researchers’ inability to collect all relevant variables, such as the important dates of the campaigns. The results of our predictive models are based solely on the attributes provided in the data set
  • The observations are independent of each other
  • Campaigns happened within the same year
  • A campaign is successful when the amount of capital raised from crowdfunders meets or exceeds the goal set by the campaign owner. We therefore created two calculated fields, Attainment Rate and Attained. Attainment Rate is the Raised amount over the Goal amount, expressed as a percentage. If the Attainment Rate is at least 100, the campaign is attained, indicated by 1 in the Attained field, and 0 otherwise. The two fields are used as targets for our models.

Hypothesis

According to Krowdster.co, only 10% of Indiegogo campaigns meet their goals. In our sample of Colorado campaigns, the observed success rate is 30%.

  • The p-value (Sig.) of the one-sample t-test is less than 0.001; thus, we reject the null hypothesis that the sample mean is equal to the hypothesized population mean of 10%.
  • This means that crowdfunding campaigns in Colorado have a higher chance of success than the average campaign globally.
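
The test above can be sketched as a one-sample t-test on the 0/1 Attained field. The study ran it in SPSS; the Python below is only an illustration. The sample size is hypothetical because the write-up reports the 30% rate but not n, and the p-value uses a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math
from statistics import NormalDist


def one_sample_t_binary(successes, n, hypothesized_rate):
    """t statistic and approximate two-sided p-value for a 0/1 variable."""
    p_hat = successes / n
    # Sample variance of a 0/1 variable, with n - 1 in the denominator.
    sample_var = p_hat * (1.0 - p_hat) * n / (n - 1)
    t_stat = (p_hat - hypothesized_rate) / math.sqrt(sample_var / n)
    # Normal approximation to the t distribution (large-n assumption).
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Hypothetical sample: 150 attained campaigns out of 500 (a 30% rate).
t_stat, p_value = one_sample_t_binary(successes=150, n=500, hypothesized_rate=0.10)
print(f"t = {t_stat:.2f}, p < 0.001: {p_value < 0.001}")
```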

Descriptive/Regression Analysis

Correlation Table

  • Consideration: potential correlation between continuous variables.
    • A strong correlation between two variables makes the results harder to interpret. The Correlation table shows the pairwise correlations between the variables. Two pairs, Updates & Gallery (0.589) and Comments & Funders (0.681), show strong positive correlation.
    • However, since we already had very few variables, we decided not to use Principal Component Analysis or Factor Analysis to drop variables.
  • It is helpful to understand whether a high number of comments, updates, gallery items, or funders is correlated with the attainment rate. If the correlation is indeed strong, campaign owners could promote that variable to try to increase the attainment rate.
    • Attainment Rate has a correlation of 0.397 with Comments and 0.44 with Funders, which suggests that campaigns with more comments and funders tend to have higher attainment rates

Statistics Table

  • The Statistics table (Figure 2) shows descriptive statistics for the continuous variables and the outcome, Attainment Rate.
    • All the continuous variables are heavily skewed to the right (mean >> median). This skewness makes hypothesis testing harder, since most tests assume a normal distribution.
    • Both the mean and the median of Attainment Rate are below 100, which reflects the high failure rate of the sample.
    • The standard deviation of 260 and the maximum value of 14281 suggest that there are many outliers in the data set.
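
A mean far above the median is the signature of a long tail of large values: a few very large observations pull the mean up while the median barely moves. The Python sketch below shows the effect on hypothetical attainment rates; these are not values from the data set.

```python
import statistics

# Hypothetical attainment rates (in percent) with one extreme campaign.
rates = [0, 2, 5, 10, 20, 35, 60, 110, 250, 1400]

mean_rate = statistics.fmean(rates)
median_rate = statistics.median(rates)

print(f"mean = {mean_rate:.1f}, median = {median_rate:.1f}")
```

Dropping the single largest value lowers the mean sharply but shifts the median only slightly, which is why the median is often the safer summary for data like this.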

Distribution of Campaigns based on Category

  • The Category table (Figure 3) shows the distribution of Category, the only relevant discrete variable in the sample data set now that the geographical dimension has been eliminated.

Predictive Analysis

If the results from the classification models are similar to each other, a conclusion about the predictor of success of a campaign can be confidently drawn.

Decision Tree

Preparations:

  • Partition: 50-50 Training and Testing data
  • Irrelevant attributes, such as text and URLs, were cast as typeless to prevent noise and misrepresentation in the model itself.
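
The partition step can be pictured as a random split of the records into training and testing sets. The study did this in SPSS; the Python below is a minimal sketch under the assumption of a simple shuffled split, and the record IDs and seed are hypothetical.

```python
import random


def partition(records, train_fraction, seed=0):
    """Shuffle a copy of the records and split it into (training, testing)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical campaign IDs for illustration only.
campaign_ids = list(range(100))
train_half, holdout_half = partition(campaign_ids, 0.5)
train_70, holdout_30 = partition(campaign_ids, 0.7)
print(len(train_half), len(holdout_half), len(train_70), len(holdout_30))
```

The same helper covers both the 50-50 split used for this model and a 70-30 split by changing `train_fraction`.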

Model Summary

Accuracy test between training and testing data

Neural Network

Preparations:

  • Partition: 70-30 Training and Testing data

Model Summary

Predictor Importance

Accuracy test between training and testing data

Histograms for Goal and Comments

Bayesian Network

Preparations:

  • Partition: 50-50 Training and Testing data

Model Summary

Coincidence Matrix

Clustering

Model Summary

Distribution of Data

Summary

Overall, the performance of our models is fair to moderate. The results from our classification models are broadly consistent with each other, which gives us some confidence in their outcomes. The following table shows the summary of our analysis:

Summary of Predictive Analysis Results

Domain Knowledge

Domain knowledge was gathered to ground the study in what happens in the business context.

  • With the advent of social media, it has become more evident that crowdfunding is not bound by geographical constraints.
  • It was also observed that funders exhibit a herding mentality: as a campaign accumulates capital, more and more individual funders are motivated to make the campaign a success.
  • Immediate markets, such as friends and family, play an important role in the early success of a crowdfunding campaign. This market has been known to drive a spike in funding during the first few days of a campaign.
  • Crowdfunding is a platform of incentives for both creators and funders.
  • Funders lose motivation to support campaigns where creator incompetence or inexperience is evident.

Moving Forward

  • A bigger data set, with more variables and more rows, would likely improve the overall quality of the classification results.