*This was submitted as a project for my Statistical Methods and Computation class in my MS Business Analytics program. The original title is “Love in the Fastlane: Success in Speed Dating”. My other teammate is Ruoxuan Gong.*

*This project was done for educational purposes only. Check out the GitHub page for the files and data set.*

## Problem Statement

The purpose of this study is to determine whether speed dating outcomes can be predicted and, if so, which factors are most important in helping speed dating participants successfully match with each other.

## Methodology

### Data Cleaning and Pre-processing

- The raw data set has 8,369 rows containing speed dating round and participant information in a single table
- The table was split into two: round-specific data (round conditions and related measures) and participant data (demographics and interests)
- The data set contains a large number of double-counted rows, e.g.:
    - Wave ID 1 with Participant A and Partner D
    - Wave ID 1 with Participant D and Partner A (double-count, removed)
- After removal of double-counts, nrow = 4,184
- Columns with a large share of missing data were removed
- Domain knowledge was used to produce the initial set of variables
- Rows with missing data were removed; new nrow = 3,377
- All participants are women and all partners are men
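The double-count removal above can be sketched in a few lines. This is an illustrative Python snippet, not the project's actual code (the cleaning was done in R), and the column names `wave`, `participant`, and `partner` are hypothetical:

```python
# Within a wave, (A, D) and (D, A) describe the same date,
# so we keep only the first occurrence of each unordered pair.
rows = [
    {"wave": 1, "participant": "A", "partner": "D", "match": 1},
    {"wave": 1, "participant": "D", "partner": "A", "match": 1},  # double-count
    {"wave": 1, "participant": "B", "partner": "C", "match": 0},
]

seen = set()
deduped = []
for row in rows:
    # frozenset makes the pair order-insensitive within the same wave
    key = (row["wave"], frozenset((row["participant"], row["partner"])))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

print(len(deduped))  # 2
```

The same idea in R would typically be done by sorting the two ID columns into a canonical order and calling `duplicated()` on the result.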

### Hypothesis

It is possible to predict, to a certain level of confidence, the outcome of a speed dating round (match or no match) by analyzing the different factors at play during the round itself.

### Variable Selection

- Dependent variable: Match (1/0)
- Independent variables: round-specific data, differences in preference ratings (attractiveness, sincerity, intelligence, fun, ambition, and shared hobbies), and ratings given to and by the participant

### Tool and Model Selection

- Tool: R was the main tool used for this project
- Model: Logistic regression
    - Target is binary (Match = 1/0)
    - Assumptions:
        - Explanatory variables are measured without error
        - Model is correctly specified (no important variables omitted, extraneous variables excluded)
        - Outcomes are not completely separable
        - No outliers
        - Variables have little or no multicollinearity (checked with a VIF test)
        - Observations are independent (no time series, no grouped data)
        - Sample size: at least 10 observations of each outcome (0/1) per predictor
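The sample-size rule of thumb can be sanity-checked with quick arithmetic. In the sketch below, only the 3,377 row count comes from the cleaning step; the match rate and predictor count are hypothetical placeholders:

```python
n_rows = 3377          # rows after cleaning (from the pre-processing step)
n_predictors = 25      # hypothetical: predictors in the initial model
match_rate = 0.16      # hypothetical: share of rounds with match = 1

n_matches = round(n_rows * match_rate)
n_non_matches = n_rows - n_matches

# Rule of thumb: at least 10 observations of the rarer outcome per predictor.
required = 10 * n_predictors
rarer = min(n_matches, n_non_matches)
print(f"rarer class = {rarer}, required = {required}, ok = {rarer >= required}")
```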

## Results and Discussion

### Descriptive Analysis – Overall Data set

- Here, we can see that the ages of the participants typically ranged from the 20s to the 40s, with a few outliers
    - Ages 20-30: mixed ratings for almost all interests
    - Ages 30-40: ratings trend higher for reading, movies, music, museums, and art

- In the chart below, the data show that women are more interested in theater, art, and shopping, while men like gaming more
    - Both genders share interests in dining, reading, movies, and music

- In terms of preference, it would seem that women prefer intelligence over ambition and attractiveness
- Men, on the other hand, prefer attractiveness over ambition and shared hobbies

### Descriptive Analysis – Match = 1 Scenarios

- We also wanted to take a look at what the average man and woman in **match = 1 scenarios** looked like
    - From the looks of it, men who joined were generally older than women
    - Men also expected to be happier in the event compared to women
- Surprisingly, most joined the event to “have a fun night out”, “meet new people”, or “try it out”
    - Only a very few joined for “romantic reasons”

- There were high ratings from both genders for exercise, dining, hiking, music, concerts, and movies
- There were small variances for museums, art, clubbing, reading, and TV
- Large variances were seen for theater, shopping, and yoga
- Lastly, both had low ratings for gaming
*(I don’t understand why… lol)*

- Very low variances were observed in the preferences of men and women
    - This might suggest that a match tends to happen when both participant and partner have the same level of preference, whatever that level is

- This is what the average man and woman in a match = 1 scenario would look like in terms of their interest and preference ratings

### Predictive Analysis – Logistic Regression

- Moving to predictive analytics, the first thing we did was check the distribution of the variables. Although multivariate normality is not an assumption of logistic regression, having roughly normal variables helps keep the model stable.
- We selected the variables that were not normal in shape and applied transformations to them.
- Of the three candidates (original, log-transformed, and square-root-transformed), the square root transformation produced the most normal-looking distributions, so we chose it over the original and log-transformed variables.
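One quick way to compare candidate transformations numerically is sample skewness (values nearer 0 are more symmetric). The project's own comparison was done visually in R; below is a self-contained Python sketch on simulated right-skewed data standing in for a non-normal predictor:

```python
import math
import random

def skewness(xs):
    """Sample skewness: the third standardized moment (0 for symmetric data)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

random.seed(0)
# Hypothetical right-skewed variable (exponential draws).
raw = [random.expovariate(1.0) for _ in range(2000)]

candidates = {
    "original": raw,
    "log":      [math.log(x + 1) for x in raw],   # +1 guards against log(0)
    "sqrt":     [math.sqrt(x) for x in raw],
}
for name, xs in candidates.items():
    print(f"{name}: skew = {skewness(xs):+.2f}")
```

Whichever transformed version has skewness closest to zero is the "most normal-looking" candidate in this simple sense.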

- The initial model produced the following results:
    - Accuracy = 83.41%
    - Recall = 0.244
    - Precision = 0.659
    - AUC = 0.849
    - AIC = 1,945.32
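For reference, accuracy, recall, and precision all come straight off the confusion matrix at a 0.5 cutoff. The cell counts below are hypothetical, back-solved to roughly reproduce the metrics reported above (the write-up does not show the actual confusion matrix):

```python
# Hypothetical confusion-matrix counts, chosen to approximate the
# reported metrics; not the project's real matrix.
tp, fp, fn, tn = 54, 28, 167, 927

accuracy  = (tp + tn) / (tp + fp + fn + tn)
recall    = tp / (tp + fn)           # share of true matches that were found
precision = tp / (tp + fp)           # share of predicted matches that are real

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
# accuracy=0.834 recall=0.244 precision=0.659
```

The low recall next to the high accuracy reflects the class imbalance: non-matches dominate, so a model can be "accurate" while finding only a quarter of the true matches.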

- We believed that this model could still be improved, so we applied a stepwise selection function to it
    - After around 12 iterations, the model's variables were stripped down to 13 (from 25)
    - AIC decreased to 1,929.27 (from 1,945.32)
    - Although the ROC curve barely changed, the AUC score increased from 0.84989 to 0.84996
- The equation below represents the model in its current form
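R's `step()` drives this stepwise search with AIC, defined as 2k − 2 ln L (lower is better). A minimal sketch; the log-likelihood values below are hypothetical, back-solved from the AICs reported above under the assumption that k counts the intercept plus 25 and 13 predictors respectively:

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2*ln(L); lower is better."""
    return 2 * k - 2 * log_likelihood

# Hypothetical log-likelihoods chosen to reproduce the reported AICs.
full    = aic(log_likelihood=-946.660, k=26)  # 25 predictors + intercept
reduced = aic(log_likelihood=-950.635, k=14)  # 13 predictors + intercept
print(f"full AIC = {full:.2f}, reduced AIC = {reduced:.2f}")
# full AIC = 1945.32, reduced AIC = 1929.27
```

Note how AIC trades fit against size: the reduced model fits slightly worse (lower log-likelihood) but wins overall because it uses 12 fewer parameters.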

- To satisfy one of the assumptions (the model should be correctly specified, meaning no important variables are omitted and all extraneous variables are excluded), we removed the variables with high p-values (> 0.05, except for the ones flagged with “.” in the R summary)
    - By removing the extraneous variables, we made the coefficients of the remaining variables more reliable
- The new model resulted in the equation below:

- The improved model has a precision of 0.6739, and the final AUC score is 0.847
- It also shows an improvement in the model's total true negatives

### Assumption Testing

- Explanatory variables are measured without error
    - This is a limitation of using third-party data; we assumed it to be true

- Model is correctly specified (no important variables omitted, extraneous variables excluded)
    - Demonstrated above

- Outcomes are not completely separable
    - In R, the glm() function warns (fitted probabilities numerically 0 or 1) when outcomes are completely separable; no such warning occurred

- No outliers
- Observations are independent
    - No time series, no grouped data
- Sample size: at least 10 observations of each outcome (0/1) per predictor
    - Our row count more than covers this requirement

- Variables have little or no multicollinearity
    - To test for multicollinearity, we ran the Variance Inflation Factor (VIF) test
    - A VIF below 5 indicates low or no multicollinearity; since the VIFs of our variables are all below 2, the level of multicollinearity in the model is negligible
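The VIF for each predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch; the predictor names and R² values below are hypothetical, not our model's actual auxiliary-regression results:

```python
def vif(r_squared):
    """Variance Inflation Factor: 1 / (1 - R^2), where R^2 is from
    regressing one predictor on all the other predictors."""
    return 1.0 / (1.0 - r_squared)

# Hypothetical auxiliary R^2 values for three made-up predictors.
for name, r2 in [("attr_diff", 0.31), ("fun_rating", 0.45), ("age_gap", 0.10)]:
    print(f"{name}: VIF = {vif(r2):.2f}")  # all below 2 in this illustration
```

In R this is typically done with `vif()` from the car package on the fitted glm object.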

## Conclusion

In conclusion, a logistic regression model can be used to predict the outcome of speed dating rounds. It can be represented using the formula below:

## Implications

- **Attractiveness** has a very big impact on producing a successful match, especially when females (participants) are perceived as more attractive
- For females, meeting someone **who brings fun** to them increases the chance of getting a match
- In other words, males (partners) care more about **attractiveness**, while females prefer someone who has a **sense of humor**

However, factors like having an **ambitious personality** have a negative impact on a successful match.

## Scope and Limitations

- Many valuable variables were excluded because of missing values
- Given the reality of using third-party data, information about the location and time of data collection is limited, leaving us with no way to know whether the results carry any biases

## References

[1] Donald, B. (2013, May 6). New Stanford research on speed dating examines what makes couples ‘click’ in four minutes. Retrieved from http://news.stanford.edu/news/2013/may/jurafsky-mcfarland-dating-050613.html

[2] Data set: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/