Classifying a Company’s True Earnings Quality using Text Analytics and Machine Learning on S&P Proxy Statements’ Compensation Discussion and Analysis [R, Python]

This was submitted as a project for my Text Analytics class in my MS Business Analytics program. The original title is “Text Analytics on the Compensation Discussion and Analysis of S&P 1500 Proxy Statements”. My teammates were Minglu Sun, Jiawen Zhou, and Yi Luo. This project was done for educational purposes only. Check out the GitHub page for the files and data set.

Problem Statement

The purpose of this study is to explore whether the sentiment, structure, and contents of the Compensation Discussion and Analysis (CD&A) section of a company’s proxy statement reflect the company’s real financial performance, measured by the relationship between Earnings per Share and Operating Cash Flow per Share.


  • Public companies submit C-level management compensation reports to the SEC every year. The Compensation Discussion and Analysis (CD&A) section discloses all material elements of the company’s executive compensation programs and explains why the C-suite executives are paid their respective salaries.
  • The compensation report is highly sensitive and must be explained with the utmost transparency. In an attempt to standardize transparency in the document, in early 2017 the SEC proposed rules and regulations that would require companies to disclose the relationship between executive pay and a company’s financial performance.
  • That being said, whether the CD&A actually reflects the company’s real financial performance still needs to be tested.


In this project, we assumed that the more positively a company’s CD&A is written, the better the company’s earnings quality is in a given fiscal year.

According to Investopedia, two financial indicators can be compared to judge whether a company’s earnings are of high or low quality. A company has high-quality earnings if it is generating more cash than is reported in the income statement. Earnings quality is low if the company’s statements do not show its negative operating results, so that its true cash operating results are overstated.

  • High quality earnings: Earnings Per Share (EPS) > Operating Cash Flow Per Share (CFS)
  • Low quality earnings: Earnings Per Share (EPS) < Operating Cash Flow Per Share (CFS)

Data Description

For this project, three data sets were collected:

  1. The Compensation Discussion and Analysis (CD&A) sections of the proxy statements of randomly selected S&P 1500 companies, from the U.S. SEC EDGAR system.
  2. Company performance (Earnings Per Share and Cash Flow Per Share) using the Intrinio Financial Marketplace API.
  3. Ticker and registered company industries from Google Finance.

In addition, two popular sentiment lexicons were selected for the sentiment analysis portion: Bing Liu’s sentiment dictionary and the Loughran-McDonald Master Dictionary, which was developed for Tim Loughran and Bill McDonald’s paper in the Journal of Finance entitled “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks” (2011).

Document Structural Dimension

  • Although the SEC does not prescribe a structure for the proxy statement, most companies structure the statement, including the Compensation Discussion and Analysis part, in a similar way.
  • In general, the first paragraph of the CD&A is the introduction, which briefly describes what this part contains.
  • The second paragraph is the Executive Summary. A large number of companies disclose the current year’s financial performance in the Executive Summary, making it the most important section for sentiment information. Most of the positive or negative words and phrases were extracted from the Executive Summary.
  • The rest of the CD&A consists of detailed descriptions of the compensation policy, the subcategories of compensation, and the approval from the compensation committee. A few companies explain their compensation decisions within these detailed components, which also reveals sentiment information.

Document Content Dimension

  • The documents share the characteristics of public financial statements. Most of the sentences analyze and compare numeric values, which represent financial performance.


Data Preprocessing

  • Text data was extracted from the CD&As. It underwent cleaning, which involved the removal of punctuation and special characters.
  • Domain-specific lexicon creation. In the process, positive and negative words, phrases, and templates were extracted from 200 of the 500 documents.
    • Templates:
      • e.g. an <increase>/<decrease> of <amount> from <number>/<year>
      • <metric> <increased>/<decreased> <amount> over/compared to <year>

Domain-specific lexicon sample

  • The team acted as the “experts” and manually classified the 200 documents as reflecting positive or negative performance/sentiment.
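As an illustration only, the two templates listed above could be matched with regular expressions along these lines. This is a minimal sketch, not the project's actual extraction code: the pattern names, the accepted units, and the rule that an increase counts as "positive" are all assumptions (an increase in costs, for example, would need separate handling).

```python
import re

# Hypothetical pattern for: an <increase>/<decrease> of <amount> from <number>/<year>
CHANGE_OF = re.compile(
    r"\ban?\s+(?P<direction>increase|decrease)\s+of\s+"
    r"(?P<amount>\$?\d[\d,.]*\s*(?:%|percent|million|billion)?)\s+from\s+"
    r"(?P<base>\$?\d[\d,.]*)",
    re.IGNORECASE,
)

# Hypothetical pattern for: <metric> <increased>/<decreased> <amount> over/compared to <year>
METRIC_CHANGED = re.compile(
    r"(?P<metric>[A-Za-z][A-Za-z ]*?)\s+(?P<direction>increased|decreased)\s+"
    r"(?:by\s+)?(?P<amount>\$?\d[\d,.]*\s*(?:%|percent|million|billion)?)\s+"
    r"(?:over|compared\s+to)\s+(?P<year>(?:19|20)\d{2})",
    re.IGNORECASE,
)


def match_templates(sentence):
    """Return a (polarity, amount) pair for every template hit in a sentence."""
    hits = []
    for pattern in (CHANGE_OF, METRIC_CHANGED):
        for m in pattern.finditer(sentence):
            # Assumption: "increase" wording is treated as positive, "decrease" as negative.
            rising = m.group("direction").lower().startswith("increase")
            hits.append(("positive" if rising else "negative", m.group("amount").strip()))
    return hits
```

For example, `match_templates("Revenue increased 12% over 2015")` yields one positive hit, which could then be counted as a lexicon feature alongside plain words and phrases.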

Sentiment Analysis

Feature-level Analysis

Feature-level Sentiment Analysis model

  • Polarity-based sentiment analysis was conducted using the two publicly available lexicons mentioned above.
  • Because the results were inadequate, the team decided to create a new domain-specific lexicon in the hope of producing better results.
  • To complement the sentiment analysis, IBM Tone Analyzer was used to obtain scores on 13 tonal dimensions for each company’s CD&A.
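To make the polarity-based approach concrete, here is a minimal sketch of presence-based netting: each lexicon entry counts at most once per document, and the score is positive hits minus negative hits. The function names and the tiny term lists are illustrative (the terms are samples from our domain-specific dictionary), and the input is assumed to be text that has already been cleaned.

```python
import re

# Illustrative sample entries; a real lexicon would be much larger.
POSITIVE_TERMS = ["strong performance", "outperformed", "exceeding our target", "revenue increased"]
NEGATIVE_TERMS = ["decrease", "reduction", "did not achieve"]


def contains_term(term, text):
    """True if the term occurs in the text as a whole word or phrase."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text) is not None


def polarity_score(text, positive_terms=POSITIVE_TERMS, negative_terms=NEGATIVE_TERMS):
    """Net polarity based on existence: positive terms present minus negative terms present."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    pos_hits = sum(contains_term(t, text) for t in positive_terms)
    neg_hits = sum(contains_term(t, text) for t in negative_terms)
    return pos_hits - neg_hits


def polarity_label(score):
    """Map a net score to a sentiment class."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Swapping in a different dictionary only changes the two term lists, which is what makes it easy to compare lexicons on the same documents.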

Document-level Analysis

Document-level Sentiment Analysis Model

  • Using the “expert” classifications of the 200 labeled documents and the domain-specific lexicon as the feature set, a term-document matrix was created. It contains the quantity/existence of each feature in all the documents (500 in total).

Term-Document Matrix
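A term-document matrix over a fixed lexicon can be sketched in a few lines of plain Python. This is only an illustration of the idea, with hypothetical function names; it builds one row per document and one column per lexicon entry, holding either the count ("quantity") or a 0/1 flag ("existence").

```python
import re


def count_term(term, text):
    """Number of whole-word (or whole-phrase) occurrences of term in text."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text))


def build_term_document_matrix(documents, lexicon, binary=False):
    """Return a list of rows; row i holds the lexicon counts for documents[i]."""
    matrix = []
    for doc in documents:
        text = " ".join(doc.lower().split())  # lowercase and collapse whitespace
        row = [count_term(term, text) for term in lexicon]
        if binary:
            row = [1 if n > 0 else 0 for n in row]  # existence instead of quantity
        matrix.append(row)
    return matrix
```

The rows for the labeled documents, paired with their labels, form the training set; the rows for the unlabeled documents are what the classifier scores.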

  • Using a Neural Network, the remaining 300 documents were classified into positive or negative sentiment classes.

Classification of Earnings Quality

  • Four scenarios were tested with multiple classification models (random forest, neural network, and logistic regression). The predictors were the sentiment classification from the polarity-based model that used the domain-specific dictionary and the tonal information; the target variable was earnings quality.
  • The following scenarios were tested:
    • Scenario 1: Financial Performances ~ CD&A Tones
    • Scenario 2: Financial Performances ~ CD&A Sentiment
    • Scenario 3: Financial Performances ~ CD&A Tones + Sentiment
    • Scenario 4: Financial Performances ~ Top 5 Predictor Importance (Tone + Sentiment)

Results and Discussion

Sentiment Analysis

Sentiment analytics, in this project, was approached in two ways: feature-level analysis by using polarity-based classification models and document-level analysis using document classification.

Feature-level Analysis

  • Feature-level sentiment analysis was initially conducted with two dictionaries: Bing Liu’s Lexicon and the Loughran-McDonald Master Lexicon, which focuses on financial concepts and finance-driven directional phrases.
  • Running these dictionaries through a polarity-based sentiment analyzer (which nets the counts of positive and negative words based on their existence in the text) produced very polarized results.

Bing Liu and Loughran-McDonald Sentiment Results

  • Given the unsatisfactory results of these dictionaries, it became clear that a more domain-specific lexicon was needed.
  • Since no such dictionary existed, we decided to create one by reading 200 documents and extracting positive and negative words, phrases, and templates.
    • For instance, the positive dictionary included “strong performance”, “outperformed”, “exceeding our target”, “revenue increased”, etc.
    • The negative dictionary included “decrease”, “slow down in”, “reduction”, “did not achieve”, etc.
    • In the process, each document was categorized as positive or negative. This served as input to the document-level approach.
  • Surprisingly, the new dictionary classified 487 documents as positive, 1 as negative, and the remaining 12 as neutral.

Total Polarity-based Sentiment Results

  • The accuracy of the model using the domain-specific dictionary is 67%.

Document-level Analysis

  • Using the classifications generated during the domain-specific dictionary creation phase, classification models were used to determine the sentiment class of the remaining 300 unlabeled documents.
  • Using the words and phrases in the created dictionary as predictors of sentiment, a neural network model with an accuracy of 57.89% was created. Because this was lower than the accuracy of the polarity-based model, we decided that the classifications from the polarity-based model that used the domain-specific dictionary would be used as input for the succeeding steps.

Evaluation of Document-level Sentiment Analysis model

Tonal Analysis

Input data also included tonal results computed by the IBM Tone Analyzer. The 13 dimensions extracted are anger, disgust, fear, joy, sadness, analytical, confident, tentative, openness, conscientiousness, extraversion, agreeableness, and emotional range.

Sample Tonal Results

To give a better idea of how tonal results performed throughout the company list, we decided to aggregate results up to the industry level.

Slice of the Industry-level Tonal Results

  • The CD&As of the Telecommunication Services industry have the highest joy value, at 0.41.
  • Basic Materials, Energy, and Industrials share the highest sadness value, at 0.41.
  • Compared to sadness and joy, the tentative tone is less pronounced. The radar chart above is a slice of the tonal analysis that contains only three tones.

Classification of Earnings Quality

Model Evaluation

  • Among all the created models, the random forest model in scenario 3 produces the highest accuracy (83%), precision (0.8), recall (0.84), and F-score (0.8).
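For reference, all four evaluation metrics follow from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the function name and the choice of 1 as the positive class are illustrative, and it is not the code that produced the figures above.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that are found
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Reporting precision and recall next to accuracy matters when classes are imbalanced, since a model can reach high accuracy by mostly predicting the majority class.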


According to the classification results, companies whose CD&A documents have a positive sentiment score are more likely to have high earnings quality, which we characterized as Earnings per Share being higher than the company’s Cash Flow per Share.

In addition, there is no significant difference in tone scores and sentiment scores among the different industries.


  • The domain-specific lexicon for the Compensation Discussion and Analysis can help users and stakeholders of the proxy statement recognize positive and negative features, enabling them to make effective and efficient decisions.
  • Since the Compensation Discussion and Analysis shares the characteristics of financial statements, the dictionary can also be applied to analyze other financial statements.

Limitations and Future Direction

  • First, the syntactic templates were not matched against the text content, so some features were lost.
  • Secondly, CD&As tend to use positive words and phrases and avoid negative expressions. Even when a company was in a negative situation in a given year, the description in the discussion seems positive. Therefore, the frequency of positive words and phrases overstates how positive the actual situation was.
  • Thirdly, the records in the original training set are imbalanced: there are far more positive documents than negative ones. A model that uses a balanced data set can be created in the future.
  • Also, other financial performance parameters could be used as target variables instead of cash flow per share and earnings per share.


Exploring the Association of Movie Trailer Performance on YouTube and Box Office Success using Neural Net, Python, and R

This was submitted as a project for my Big Data Analytics class in my MS Business Analytics program. The original title is “Exploring the Association of Movie Trailer Performance on YouTube and Box Office Success”. My teammates were Yi Cai, Michael Friscia, and Zheyu Tian. This project was done for educational purposes only. Check out the GitHub page for the files and data set. Due to the box office data source’s policies regarding its data, that particular data set won’t be uploaded.

UPDATE: If you scroll below, you will see that the final accuracy was 82.55%. Using genetic algorithms and a Sklearn implementation, the accuracy was improved to 98.66% (with a final generation average accuracy of 92.28%). Check out the code in this GitHub repo.

Problem Statement

The purpose of this study is to determine if there is a correlation between the performance of trailers on YouTube and Hollywood movie sales.

Project Significance

  • By evaluating important predictors from YouTube viewers, studios and agencies can create and publish movie trailers on YouTube more efficiently, thus:
    • driving box office ticket sales domestically and globally
    • generating more revenue
  • Studios can focus on and improve trailer performance if it is shown to correlate with box office/post-show sales

Data Collection

  • Data was collected from two sources: YouTube, using its proprietary API, and a movie box office data provider
    • YouTube – trailer performance and comments
    • Box office data provider – movie box office data
  • 32.4GB (when comments are expanded into 1 line per comment)
  • 1,713 movies
  • 5,244 trailers

YouTube Data

Variable Selection

  • The ROI variable had to be created.

Variables selected

Hypothesis and Rationale

  • There is a positive correlation between YouTube movie trailer performance indicators and box office performance/video sales.
    • Rationale: “Likes” = Sales
  • There is a positive correlation between movie trailer comment sentiment and box office/video sales performance.
    • Rationale: If trailers are viewed in a positive manner, then people will be more likely to watch the movie.

Conceptual Model

  • Data was extracted and then transformed using Python. The output files were CSV and TXT files.
  • Three sentiment models were implemented in the project: two polarity-based sentiment models, using Bing Liu’s and the Harvard IV-4 dictionaries, and a Naive Bayes classifier (the NLTK sentiment model).
    • To process part of the sentiment analysis, Apache Spark was used.
  • The sentiment scores were also used to help identify the ROI of each movie using a neural network model.

Project Conceptual Model

Results and Discussion

Variable Correlation Test

The graph, which was generated by R, shows the correlations between the independent variables and dependent variables.

There are three main conclusions based on the graph:

1. The graph demonstrates a positive correlation among Count Views, Count Comments, and Likes/Dislikes.

2. The graph was also used to test the hypothesis regarding movie trailer features and movie performance, which assumed that movie trailer comment counts/movie trailer likes and movie box office are positively correlated.

3. Unfortunately, the three sentiment models have little correlation with the box office data (e.g. ROI), which means that the initial hypothesis was not supported. The two feature-based sentiment models have negative correlations with Count Views, Count Comments, and Likes/Dislikes.

Time Series Analysis

  • It was interesting to see that overall ROI turned out to be good in 2008, despite the financial crisis.
  • Another interesting finding is that ROI decreased continuously after 2008.

Sentiment Analysis

Two approaches were implemented for sentiment analysis, resulting in three models.

  • a polarity-based model using Bing Liu’s and a Harvard dictionary, which nets the counts of positive and negative words that can be found in each comment, and
  • the NLTK Sentiment Analyzer using the VADER dictionary, which is a rule-based approach
  • Scores were scaled and centered to zero to maintain positive scores > 0 and negative scores < 0. The scale is [-1,1].

  • Comparing the performance of the three models, the polarity-based models gravitated towards negative sentiment. This could be explained by the internal structure of the dictionaries used: if a dictionary contains more negative than positive words, a comment is more likely to end up with a higher negative-word count.
  • The NLTK Sentiment Analyzer, on the other hand, scored the comments as more positive.
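One simple way to put raw net scores on a [-1, 1] scale while keeping zero as the neutral point is to divide by the largest absolute score. The sketch below shows that idea only; the exact scaling used in the project is not reproduced here, so treat the function and its name as assumptions.

```python
def scale_preserving_sign(scores):
    """Scale scores into [-1, 1] so that positives stay > 0 and negatives stay < 0."""
    max_abs = max((abs(s) for s in scores), default=0)
    if max_abs == 0:
        # All scores are zero (or the list is empty): everything is neutral.
        return [0.0 for _ in scores]
    return [s / max_abs for s in scores]
```

Unlike min-max scaling or mean-centering, this never moves a score across zero, so a comment with more negative than positive words still ends up below zero.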

Sentiment Analysis – Movie Studios

  • Based on the Harvard sentiment dictionary, Paramount Vantage has the lowest average sentiment score whereas Weinstein Company has the highest.
  • The VADER sentiment dictionary determined that Apparition has the highest average sentiment score while Focus/Gramercy has the lowest.
  • Bing Liu sentiment dictionary predicted that Freestyle Releasing and Apparition have the lowest and highest average sentiment score, respectively.

Sentiment Analysis – Genre

  • When evaluating the Bing Liu and Harvard dictionaries, Romantic Comedies and Documentaries have the highest and lowest average sentiment score respectively.
  • Interestingly, for the NLTK Analyzer, the Concerts and Performances genre has the lowest average sentiment score, while Romantic Comedy has the highest score.

Clustering (to follow)

Predicting Box Office ROI Performance using Neural Net

  • ROI performance was classified using four bins:
    • Poor (less than the 25% quantile)
    • Passing (between 25% and 50% quantile)
    • Ok (between the 50% and 75% quantile)
    • Great (above the 75% quantile)
  • Neural Net implemented using R
  • ROI Performance ~ countsComments + countsViews + Ratio_of_Likes_and_Dislikes + ProdBudget + genre + MPAArating + MovieStudio + BingLiuSentiment + HarvardSentiment + VadeSentiment
  • Model Accuracy = 82.55%
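The quartile binning of ROI can be sketched as follows. Two things here are assumptions rather than details from the project: ROI is computed as (gross - budget) / budget, and values that fall exactly on a cut point go to the higher bin. The model itself was built in R; Python is used here only for illustration.

```python
from statistics import quantiles


def roi(gross, budget):
    """Return on investment as a fraction of the production budget (assumed formula)."""
    return (gross - budget) / budget


def bin_roi(values):
    """Label each ROI value as Poor, Passing, Ok, or Great using the sample quartiles."""
    q1, q2, q3 = quantiles(values, n=4)  # 25%, 50%, and 75% cut points
    labels = []
    for v in values:
        if v < q1:
            labels.append("Poor")
        elif v < q2:
            labels.append("Passing")
        elif v < q3:
            labels.append("Ok")
        else:
            labels.append("Great")
    return labels
```

Because the bins are quartiles, each class holds roughly a quarter of the movies, which gives a four-class classifier a baseline accuracy of about 25%.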

Neural Net Model Results


  • Given the accuracy of the neural network model, companies may be able to predict the ROI bracket of their movies using the number of YouTube comments, the ratio of likes and dislikes, and the sentiment scores from the three models.
  • If the hypothesized relationships hold, stronger trailer performance implies a higher probability of box office success, which would in turn generate a higher ROI for movie studios and production companies.
  • The sentiment results differ among the three dictionaries, which suggests that some of the dictionaries treat effectively neutral words as negative or positive.
    • Better alternatives for predicting the sentiment of YouTube comments on movie trailers would be domain-specific dictionaries and machine learning classifiers trained on a sample of comments labeled with their sentiment.

Scope and Limitations

  • There are many popular websites and applications that can be used to comment on trailers or movies, such as Rotten Tomatoes, Facebook, Twitter, and so on. However, in this case, YouTube is the only trailer source used.

  • Trailers are not the only factor that impacts box office and video sales. Other factors, such as advertisements, the actors, and competition from other movies released at the same time, can affect a movie’s box office sales. However, these factors are not included in this study. Further studies could be conducted with those variables included.


Love in the Fastlane – Predicting Success in Speed Dating using Logistic Regression and R

This was submitted as a project for my Statistical Methods and Computation class in my MS Business Analytics program. The original title is “Love in the Fastlane: Success in Speed Dating”. My teammate was Ruoxuan Gong. This project was done for educational purposes only. Check out the GitHub page for the files and data set.

Problem Statement

The purpose of this study is to determine whether speed dating outcomes can be predicted and, if so, which factors are most important in helping speed dating participants successfully match with each other.


Data Cleaning and Pre-processing

  • 8,369 rows containing speed dating round and participant information in one table
  • Table was separated into two tables: round-specific data (round condition and relative data) and participant data (demographics and interests)
  • The data contained a large number of double-counted rows:
    • Wave ID 1 with Participant A and Partner D
    • Wave ID 1 with Participant D and Partner A (double count, removed)
  • After removal of double-counts, nrow = 4,184.
  • Columns with a lot of missing data were removed.
  • Domain knowledge used to produce initial set of variables
  • Rows with missing data were removed; new nrow = 3,377
  • All Participants are women and all partners are men
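The double-count removal can be sketched like this: a round is identified by its wave plus the unordered pair of people, so (A, D) and (D, A) in the same wave collapse into a single row. The column names below are hypothetical, and this version simply keeps whichever row appears first.

```python
def drop_mirrored_rounds(rows):
    """Keep one row per (wave, unordered participant/partner pair)."""
    seen = set()
    unique = []
    for row in rows:
        # frozenset ignores order, so A-D and D-A produce the same key.
        key = (row["wave"], frozenset((row["participant"], row["partner"])))
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique
```

To end up with all participants being women, as in our cleaned table, one could instead filter the rows on the participant's gender, which removes the mirrored copies in the same step.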


It is possible to predict, to a certain level of confidence, the outcome of a speed dating round (match or no match) by analyzing the different factors at play during the round itself.

Variable Selection

  • Dependent Variable: Match (1/0)
  • Independent Variables: Round-specific data, difference in preference ratings (attractiveness, sincerity, intelligence, fun(funny?), ambitious, and shared hobbies), ratings for and by the participant

Tool and Model Selection

  • Tool: R was the main tool used for this project.
  • Model: Logistic Regression
    • Target is binary (Match = 1/0)
    • Assumptions:
      • Explanatory variables are measured without error
      • Model is correctly specified (No important variables are omitted, extraneous variables are excluded)
      • Outcomes not completely separable
      • No outliers
      • Variables should have little or no multicollinearity (VIF test)
      • Observations are independent (no time series, no in group data)
      • Sample size = n = at least 10 observations for each outcome (0/1) per predictor

Results and Discussion

Descriptive Analysis – Overall Data set

  • Here, we could see that the ages of the participants typically ranged from the 20s to 40s, with a few outliers
  • Ages 20-30: mixed ratings for almost all interests
  • Ages 30-40: Trend of rating goes higher for reading, movies, music, museums, and art

Distribution of Interests by Age by Gender


  • In the chart below, data shows that women are more interested in theater, art, and shopping, while men like gaming more.
  • Both genders share interests in dining, reading, movies, and music

  • In terms of preference, it would seem that women prefer intelligence over ambition and attractiveness
  • Men, on the other hand, prefer attractiveness over ambition and shared hobbies

Descriptive Analysis – Match = 1 Scenarios

  • We also wanted to take a look at what the average man and woman in match = 1 scenarios looked like
  • From the looks of it, men who joined were generally older than women
  • Also, men expected to be happier with the event than women did
  • Surprisingly, most joined the event to “have a fun night out”, “meet new people”, or “try it out”
  • Only a very few joined for “romantic reasons”

Age, Goal, and Expected Happiness Comparison for Match = 1 Scenarios

  • There were high ratings from both genders for exercise, dining, hiking, music, concerts, and movies
  • There were small variances for museums, art, clubbing, reading, and TV
  • Large variances were seen for theater, shopping, and yoga
  • Lastly, both had low ratings for gaming (I don’t understand why… lol)

Interest Rating Comparison by Gender

  • Very low variances were observed in the preferences of men and women
  • Might suggest that a match tends to happen when both participant and partner have the same level of preference, no matter what level it is

Preference Comparison by Gender

  • This is what an average man and woman in a match = 1 scenario would look like in terms of their interest and preference ratings

Predictive Analysis – Logistic Regression

  • Moving on to predictive analytics, the first thing we did was check the distribution of the variables. Although multivariate normality isn’t an assumption of logistic regression, having approximately normal variables helps make the model stable.
  • We selected the variables that were not normal in shape and applied transformations to them.
  • Of the three versions (original, log-transformed, and square-root-transformed), the square root transformation produced the most normal-looking distributions. This prompted us to pick it over the original and log-transformed variables.
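We judged normality visually, but the comparison can also be made numerically. The sketch below computes sample skewness for the original, square-root, and log versions of a variable; values closer to zero indicate a more symmetric shape. The function names and the use of log(1 + x) to handle zeros are assumptions, and which transformation wins depends entirely on the data.

```python
import math


def skewness(values):
    """Sample skewness (third standardized moment); 0 means symmetric."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def compare_transforms(values):
    """Skewness of a non-negative variable before and after sqrt and log transforms."""
    candidates = {
        "original": list(values),
        "sqrt": [math.sqrt(v) for v in values],
        "log": [math.log1p(v) for v in values],  # log(1 + v), defined at v = 0
    }
    return {name: skewness(vals) for name, vals in candidates.items()}
```

A right-skewed rating variable will usually show a smaller absolute skewness after either transformation; picking between them is then a matter of which result sits closest to zero.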

  • The initial model produced the following results:
    • Accuracy = 83.41%
    • Recall = 0.244
    • Precision = 0.659
    • AUC = 0.849
    • AIC = 1,945.32

  • We believed that this model could still be improved, so we applied a stepwise selection function to it.
  • After around 12 iterations, our round variables were reduced to 13 (from 25)
  • AIC decreased to 1,929.27 (from 1,945.32)
  • Although the ROC curve didn’t seem to change, the AUC score increased from 0.84989 to 0.84996
  • The equation below represents the model in its current form

  • To satisfy one of the assumptions (the model should be correctly specified, meaning no important variables are omitted and all extraneous variables are excluded), we removed all the variables that had high p-values (>0.05, except for the ones marked with “.”)
  • By removing the extraneous variables, we were able to make the coefficients of the remaining variables more reliable
  • The new model resulted in the equation below:

  • The improved model has a precision of 0.6739. In addition, the final AUC score is 0.847.
  • It also shows an improvement in the model’s total True Negatives.

Assumption Testing

  • Explanatory variables are measured without error
    • Limitation of using third-party data. We assumed this to be true.
  • Model is correctly specified (No important variables are omitted, extraneous variables are excluded)
    • Demonstrated above.
  • Outcomes not completely separable
    • In R, the glm() function does not converge cleanly (it warns that fitted probabilities of 0 or 1 occurred) when the outcomes are completely separable
  • No outliers
  • Observations are independent
    • no time series, no in group data
  • Sample size = n = at least 10 observations for each outcome (0/1) per predictor
    • Our nrow more than covers this requirement
  • Variables should have little or no multicollinearity
    • In order to test for multicollinearity, we ran the Variance Inflation Factor test.
    • As a rule of thumb, a VIF of <5 means that the model has low or no multicollinearity. Since the VIFs of our variables are all <2, we can say that the level of multicollinearity in the model is negligible.
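For intuition, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. With exactly two predictors, R_j^2 is just their squared Pearson correlation, which is the case the sketch below covers. It illustrates the formula only; it is not the R routine we ran, and the function names are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)
```

Uncorrelated predictors give a VIF of 1, and a correlation of 0.8 already gives about 2.8, so VIFs below 2 correspond to fairly weak linear relationships among the predictors.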


In conclusion, a logistic regression model can be used to predict the outcome of speed dating rounds. It can be represented using the formula below:


  • Attractiveness has a very big impact on producing a successful match, especially when females (participants) are perceived to be more attractive.
  • For females, meeting someone who brings fun to them increases the chance of getting a match.
  • In other words, males (partners) care about attractiveness more, and females prefer someone who has a sense of humor.

However, factors like having an ambitious personality have a negative impact on a successful match.

Scope and Limitations

  • Many valuable variables were excluded because of missing values
  • Given the reality of using third party data, information in terms of location and time of data collection is limited, leaving us without any knowledge if there are any biases in the results




Predicting the Winner of March Madness 2017 using R, Python, and Machine Learning

This project was done using R and Python, and the results were used as a submission to Deloitte’s March Madness Data Crunch Competition. Team members: Luo Yi, Yufei Long, and Yuyang Yue. Check the GitHub for the code.

Of the 64 teams that competed, we predicted Gonzaga University to win. Unfortunately, they lost to University of North Carolina.


  1. Data transformation
  2. Data exploration
    • Feature Correlation testing
    • Principal Component Analysis
  3. Feature Selection
  4. Model Testing
    • Decision Tree
    • Logistic Regression
    • Random Forest
  5. Results and other analysis

Data Transformation

The data used to train the initial model came from a data set that contained 2002-2016 team performance data, which included statistics, efficiency ratings, etc., from different sources. Each row was a game that consisted of two teams and their respective performance data. For the initial training of the models, we were instructed to use 2002-2013 data as the training set and 2014-2016 data as the testing set. After examining the data, we debated the best way to use it. We finally decided to create new relative variables that would reflect the difference/ratio of team 1 and team 2’s performance. Feature correlation testing was also done during this phase. The results supported the need for relative variables.
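To illustrate what relative variables look like, the sketch below turns two teams' statistics for one game into difference and ratio features. The statistic names and the function are hypothetical, and whether a given statistic is better expressed as a difference or a ratio is a modeling choice.

```python
def relative_features(team1, team2, stats):
    """Build difference and ratio features from two teams' per-game statistics."""
    features = {}
    for stat in stats:
        a, b = team1[stat], team2[stat]
        features[f"{stat}_diff"] = a - b
        # A ratio is undefined when team 2's value is zero, so it is left as None.
        features[f"{stat}_ratio"] = a / b if b != 0 else None
    return features
```

Collapsing each pair of columns into one relative column halves the feature count and removes the strong correlation between a team's raw stats and its opponent's.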

Features Correlation Heatmap (Original Features)

Data Exploration

After transformation, feature correlation testing was repeated. This time, results were much more favorable. The heat map below shows that the correlation between the new variables is acceptable.

Features Correlation Heatmap (New Features)

Principal Component Analysis was also performed on the new features. We hoped to show which features were the most influential, even before running any machine learning models. Imputation was done to deal with missing values. The thicker lines in the chart below signify a more influential link to the 8 new discriminant features. This, however, was used to understand the features more and wasn’t used as an input for all the models.

Feature Selection by PCA

Feature Selection

For this project, we opted to remove anything (aside from seed and distance from game location) that wasn’t a performance metric. Some of the variables that were discarded were ratings data since we believed that they were too subjective to be reliable indicators.

Model Testing

We used three models for this project: Decision Tree, Logistic Regression, and Random Forest.

Decision Tree – Results were less than favorable for this model. Overfitting occurred and we had to drop it.

Random Forest (R) – We decided to use the Random Forest model for two reasons: it is less prone to overfitting than a single decision tree, and its predictions come from a “democratic” vote across many trees.

Predictor Importance

  • OOB Estimate of error rate: 26.9%
  • Error reduction plateaus at approx. 2,600 trees
  • Model Log-loss: 0.5556
  • Chart Legend:
    • Black: Out-of-bag estimate of error rate
    • Green and Red: Class errors
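Log-loss, one of the competition's key metrics, penalizes confident wrong predictions much more heavily than hesitant ones. A minimal sketch of the standard binary formula follows; the function name and the clipping constant are illustrative choices.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; y_true holds 0/1 labels, p_pred holds P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

Predicting 0.5 for every game gives a log-loss of ln 2, or about 0.693, so a model log-loss of 0.5556 is better than a coin flip.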

Forest Error Performance

Logistic Regression (Python) – Based on the PCA and the Random Forest model, 5 features were selected for this model.

Features Selected for Logistic Regression

Results and Other Analysis

Summary of Results

Running both models against the testing set, we got a higher accuracy for the Random Forest model. Log loss, which was also one of the key performance indicators for the competition, was roughly the same for the 2 models. That being said, Random Forest was chosen to run on the new 2017 March Madness data.

As previously mentioned, we predicted Gonzaga University to win the tournament. They lost to North Carolina, but we came really close. The prediction made a lot of sense because, compared to the other teams, Gonzaga was a frequent contender in March Madness.

One of the more interesting teams this season was the Cinderella team, South Carolina. They went against expectations, which is why we decided to analyze their journey even further.

In the 1st round, we correctly predicted that South Carolina was going to win. However, because we were using historical data, it was no surprise that the model predicted them to lose in the next stages, especially since they were going up against stronger teams. Despite what the “water under the bridge” data suggested, they were able to reach the Final 4.

Cinderella Team Win Rate by Stage

One of the questions that we wanted to attempt to answer was why they kept on winning. What was so different this year that they were able to surprise everyone?

One reason that we speculated about was the high performance of one of South Carolina’s players, Sindarius Thornwell. In past years, he was averaging 11-13 pts per game. This year, he was dropping 21.4 pts per game. Moreover, in his last 5 appearances, he was able to increase this stat to 23.6 pts per game. Looking at the score difference of South Carolina’s games in March Madness, it is evident that he was very influential in the team’s success. One could even say that without his 23.6 pts per game, the outcome of their campaign would’ve been different. But hey, that’s just speculation.

Score Difference for Cinderella Team Matches


Sindarius Thornwell March Madness Stats


Solving the Greenhouse Gas Problem through Sustainable Meat Consumption (Watson Analytics)


This is my team’s official entry to the 2017 Watson Analytics Global Competition. Beyond our hope to win the competition is the hope that our recommendations will be put to use by policy makers in the different countries. We believe that this is something that can make a difference. Team members are Ruoxuan Gong and Liyi Li.


People are rarely aware of meat consumption’s contribution to greenhouse gas emissions. The purpose of this study is to utilize IBM Watson Analytics to identify relationships among meat consumption, greenhouse gas emission, and potential thermal depolymerization by-products from meat production funnels. Thorough data collection, data preprocessing, and data analysis, using both descriptive and predictive analytics, were conducted. As a result, three solutions: policies to optimize meat consumption, transformation of solid waste to sustainable by-products, and social media methods to increase people’s awareness have been proposed in this project.

The dashboard and the research-based, data-driven insights can be used by environmental policy makers, business owners, and the public to make meat consumption more sustainable in the long run. Network effects can be expected from the improvement of public awareness.


  1. Data collection from OECD, FAO, and other sources.
  2. Data processing to relate meat production & consumption data with greenhouse gas emission data
  3. Variable Selection
  4. Data Analysis
    • Chart creation
    • Dashboarding
    • Simulation of thermal depolymerization by-product conversion
    • Retrospective Analysis
    • Social Media Awareness Analysis
  5. Conclusion and Recommendations

Predicting Success using SPSS: An Indiegogo Prediction Study

This was submitted as a project for my data mining class in my MS Business Analytics program. My team included classmates: Anh Duong and Luoqi Deng. This project was done for educational purposes only. Click the photos to enlarge.

Abstract & Key Learnings

  • The hypothesis that crowdfunding campaigns from Colorado have higher success rates than campaigns located elsewhere was supported by the data.
  • The category, the number of comments, the number of funders, and the fund goal of the campaign contribute significantly to its success.
  • If a campaign creator from Colorado promotes these factors, they will be able to increase their chances of success significantly.

Project Rationale

  • Indiegogo, the largest global crowdfunding and fundraising site online, has funded over one hundred seventy-five thousand (175K) campaigns, with an estimated valuation of eight hundred million (800M) dollars, in the past seven years. Over two million five hundred thousand (2.5M) funders from two hundred sixty-six (266) countries have contributed to the success of these campaigns.
  • According to a recent tally by a crowdfunding marketing and PR solution provider, nine of ten (90%) Indiegogo campaigns fail to reach their goal, which is significantly higher than the 66.6% failure rate of Kickstarter, one of Indiegogo’s biggest competitors.

Problem Statement

This study aimed to determine whether campaign success can be predicted by certain attributes of its profile, campaign activity, and Indiegogo funder engagement.

Data Description/Preprocessing

The data used came from a public data set from

  • 15K rows
  • Global data
  • No time factor
  • Contains finished and unfinished campaigns

Original data set variables


In order to make the data usable, the data set had to undergo cleaning. All the blank cells were turned into zeroes. Variables that had inconsistent units (hours, days, minutes) had to be converted into consistent values. To make analysis easier and more reflective of performance, two calculated fields were created:

  • Attainment Rate = (Raised / Goal) × 100, i.e., the raised amount as a percentage of the goal
  • Attained = 1 if Attainment Rate >= 100, else 0
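The two calculated fields can be sketched in Python as follows. This is only an illustration: we assume the Attainment Rate is expressed as a percentage (which is what the threshold of 100 implies), and we assume a goal of zero, which can appear once blank cells are turned into zeroes, should count as not attained. The function names are our own.

```python
def attainment_rate(raised, goal):
    """Raised amount as a percentage of the goal."""
    if goal <= 0:
        # Assumption: blank goals were zero-filled, so treat them as 0% attained.
        return 0.0
    return 100.0 * raised / goal

def attained(raised, goal):
    """1 if the campaign reached at least 100% of its goal, else 0."""
    return 1 if attainment_rate(raised, goal) >= 100 else 0

# Hypothetical campaigns as (raised, goal) pairs.
for raised, goal in [(250, 1000), (500, 500), (1200, 800)]:
    print(raised, goal, attainment_rate(raised, goal), attained(raised, goal))
```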

Variable Selection

State and country were disregarded because the scope of the study was focused on Colorado-only campaigns.

Model Building

For this study, the Decision Tree, Neural Network, Bayesian, and Clustering models were used to determine the most important predictors to a campaign’s success. Multiple models were chosen so that the outcomes could be compared and a more logical conclusion could be confidently drawn. The association model wasn’t used because, although it would be interesting to determine the confidence and correlation of the occurrence of the different variables, it would contribute little to the goal of this study.

Assumptions and Limitations

  • The study is limited by the researchers’ inability to collect all the variables, such as the important dates of the campaigns. The result of our predictive model is based solely on the pre-given attributes.
  • The observations are independent of each other
  • Campaigns happened within the same year
  • A campaign is considered successful when the amount of capital raised by the crowdfunders meets or exceeds the goal set by the campaign owner. Therefore, we created two new calculated fields in our dataset, Attainment Rate and Attained. Attainment Rate is the Raised amount over the Goal amount, expressed as a percentage. If the Attainment Rate is at least 100, the campaign is attained, which is indicated as 1 in the Attained field, and 0 otherwise. The two fields are used as the targets for our models.


According to the tally cited in the Project Rationale, only 10% of Indiegogo campaigns meet their goals. In our Colorado sample, the observed success rate is 30%.

  • The p-value (Sig.) of the one-sample t-test is less than 0.001; thus, we reject the null hypothesis that the sample mean is equal to the hypothesized population mean.
  • This means that crowdfunding campaigns in Colorado do have a higher chance of success than the average campaign globally.
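The test itself was run in SPSS. As a rough illustration of the same idea, the sketch below computes a one-sample t statistic in plain Python and approximates the p-value with a normal distribution, which is an assumption that is only reasonable for large samples. The sample of 300 campaigns with 90 successes is hypothetical and merely mirrors the 30% rate we observed.

```python
import math
from statistics import NormalDist, mean, stdev

def one_sample_t(sample, mu0):
    """Return the t statistic and an approximate two-sided p-value."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    t_stat = (mean(sample) - mu0) / standard_error
    # Normal approximation to the t distribution (assumes a large sample).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value

# Hypothetical Attained values: 90 successes out of 300 campaigns (30%).
colorado = [1] * 90 + [0] * 210
t_stat, p_value = one_sample_t(colorado, mu0=0.10)
print(t_stat, p_value)
```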


Descriptive/Regression Analysis

Correlation Table

  • Consideration: potential correlation between continuous variables.
    • A strong correlation between two predictors makes it harder to explain each one’s individual effect. The correlation table shows the pairwise correlations between the variables. Two pairs, Updates & Gallery (0.589) and Comments & Funders (0.681), show a strong positive correlation.
    • However, since we already have very few variables, we decided not to conduct either Principal Component Analysis or Factor Analysis to drop variables.
  • It is helpful to understand whether a high number of comments, updates, gallery items, or funders is correlated with the attainment rate. If the correlation is indeed strong, campaign owners could promote the variable to increase the attainment rate.
    • Attainment Rate has a correlation of 0.397 with Comments and 0.44 with Funders, which suggests that a high number of comments and funders is associated with a higher attainment rate.
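The correlations above came from SPSS. For reference, a Pearson correlation coefficient can be computed by hand as in the sketch below; the comment and funder counts are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical per-campaign counts.
comments = [0, 3, 5, 12, 40]
funders = [1, 10, 8, 30, 95]
print(round(pearson(comments, funders), 3))
```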

Statistics Table

  • The Statistics table (Figure 2) shows the descriptive statistics about the continuous variables and the outcome Attainment Rate.
    • All the continuous variables are heavily skewed to the right (mean >> median). The skewness of the dataset makes it difficult to apply hypothesis testing since most tests are based on the assumption of a normal distribution.
    • Regarding the Attainment Rate, the fact that it has the mean and median below 100 indicates the high failure rate of the sample.
    • The standard deviation of 260 and the maximum value of 14281 suggest that there are many outliers in the dataset.
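A quick way to check this kind of skew is to compare the mean with the median and compute a skewness coefficient. The sketch below does both on a handful of made-up attainment rates with one extreme outlier; it is not our actual data.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness: the average cubed z-score."""
    center = mean(values)
    spread = pstdev(values)
    return sum(((v - center) / spread) ** 3 for v in values) / len(values)

# Hypothetical attainment rates (%): most campaigns fall short, one far overshoots.
rates = [0, 2, 5, 10, 15, 30, 60, 110, 900]
print(mean(rates), median(rates), skewness(rates))
```

A mean far above the median together with a positive skewness value points to a long tail of very large values, which is the pattern the outliers in our table suggest.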

Distribution of Campaigns based on Category

  • The Category table (Figure 3) shows the distribution of Category, the only relevant discrete variable of the sample data set since the geographical dimension has been eliminated.

Predictive Analysis

If the results from the classification models are similar to each other, a conclusion about the predictor of success of a campaign can be confidently drawn.

Decision Tree


  • Partition: 50-50 Training and Testing data
  • Irrelevant attributes, such as text and URLs, were cast as typeless to prevent noise and misrepresentation in the model itself.
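The partitioning was done inside SPSS. As a sketch of what a random training/testing partition does, the hypothetical helper below shuffles the rows with a fixed seed and cuts them at the requested fraction (0.5 for a 50-50 split, 0.7 for a 70-30 split).

```python
import random

def partition(rows, train_fraction=0.5, seed=42):
    """Shuffle the rows reproducibly and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical row ids standing in for campaigns.
train, test = partition(range(100), train_fraction=0.5)
print(len(train), len(test))
```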

Model Summary

Accuracy test between training and testing data

Neural Network


  • Partition: 70-30 Training and Testing data

Model Summary

Predictor Importance

Accuracy test between training and testing data

Histograms for Goal and Comments

Bayesian Network


  • Partition: 50-50 Training and Testing data

Model Summary

Coincidence Matrix



Model Summary

Distribution of Data


Overall, the performance of our analysis stays at a fair-to-moderate level. The results from our classification models are broadly consistent with each other, which gives us more confidence in their outcomes. The following table shows the summary of our analysis:

Summary of Predictive Analysis Results

Domain Knowledge

Domain knowledge was gathered to ground the study in what happens in the business context.

  • With the advent of social media, it has become more evident that crowdfunding is not bound by geographical constraints.
  • It was also observed that funders exhibit a herding mentality. This means that as a campaign accumulates capital, more and more individual funders are motivated to make the campaign a success.
  • Immediate markets, such as friends and family, play an important role in the early success of a crowdfunding campaign. This market has been known to spike the fund during the first few days of a campaign.
  • Crowdfunding is a platform of incentives for both the creators and the funders.
  • Funders lose motivation to support campaigns where creator incompetence and inexperience are evident.

Moving Forward

  • A bigger dataset, in terms of both dimensions and quantity, would likely improve the overall quality of the classification results.

Job Listing Mining to get the Industry Standard Skills Requirements of a Job Position using NLP and Python

This was submitted as a project for my web analytics class in my MS Business Analytics program. The original title is “Bridging the Gap: Improving the link between job applicant competitiveness and the MOOC business model”. My team included classmates: Liyi Li, Long Wan, Yiting Cai. This project was done for educational purposes only. Click the photos to enlarge.

Abstract & Key Learnings

  • Text Analytics was used to analyze thousands of job descriptions from various employment websites to determine the top requirements of a particular job position.
  • The Data Scientist position has the most technical inclination, while the Business Analyst position, although it still dabbles in technical aspects, plays a bigger role in terms of business fulfilment.
  • In terms of specific skills and software, the top three skills needed for Business Analyst positions are the following: Communication Skills, Project Management, and Verbal and Written Skills, while the top three software required are Microsoft Office, Data Warehousing software, and Big Data software.
  • For the Data Analyst position, the top skills required are Verbal and Written Skills, Communication Skills, and Data Analysis, while SQL, Big Data software, and Microsoft Office are the top software required.
  • Lastly, for the Data Scientist position, the top three skills required are Data Analytics, Communication Skills, and Data Visualization. In terms of software, the top three required are Machine Learning software, Big Data software, and Visualization Tools.

Project Rationale

  • Year after year, New York is swarmed by thousands of unemployed and newly graduated hopefuls with the main goal of securing a job.
  • With 44% of the city’s current working-age population unemployed, competition becomes overwhelming, making it harder and harder to differentiate oneself from other applicants.
  • MOOCs have become a legitimate source for learning skills and knowledge that can potentially increase the marketability and competitiveness of a job applicant. The only problem is that, with the sheer number of available online courses on different MOOC sites, it becomes harder to distinguish which course is appropriate and applicable to fulfilling a specific skill-set required in many of the currently open job positions.

Problem Statement

The purpose of this study is to be able to determine, at any given time, the top requirements of a particular job position. This can be done by using text content analysis on job descriptions from top employment websites, under a specific search term.

Information from this study will help two types of entities:

  1. Job applicants, by giving them accurate ideas of what companies are looking for when hiring for a position, and
  2. Massive Open Online Course (MOOC) providers, by offering them the ability to discover which skills to prioritize for course creation.

This could speed up how job applicants make themselves marketable to prospective companies through means other than traditional school and work experience, because the same information is available to both MOOC providers and users. Also, since paid courses and verified certificates are the main source of revenue for the MOOC business model, this study can provide a research-based methodology for increasing the value of verified certificates and improving learning environments, in the hope that they will meet the ever-changing requirements of different types of learners and, more importantly, be recognized by employers.


Data Specifications

Creating the data sets entailed scraping the aforementioned sites’ content, cleaning the extracted data, and compiling it into separate data sets based on the search query.

Table 1 Raw Data Set Variables

Here are the raw compiled datasets for the three query terms: Download

Sample Rows

Analytical Techniques

  1. Word Frequency: Word clouds were used to illustrate the frequency of words from the job descriptions. Limitations: Although this works well for keywords that have an immediately relevant meaning, such as software (e.g. Python, SQL), it proved to be inadequate in pinpointing the relevance of more ambiguous keywords, such as “business” and “experience”. These words can, however, give a general idea of the important themes that hover over a certain job position. Also, beyond the top ten most frequently mentioned words, word clouds become much less informative.
  2. N-Gram Analysis: This allowed relevant keywords and their respective contexts to surface. To support this analysis, tree maps were chosen to visualize phrase frequency and make the study more robust. Four maps were generated for each job position to show the results of bi-gram and tri-gram counts for both software and skills.
  3. Network Analysis: Chosen to present the different connections and interactions among the software skills. It can supplement N-gram analysis and support the later categorical-level analysis as well. With the results of this analysis, MOOC providers and job applicants will not only learn which software is the most useful, but will also have a progressive and systematic understanding of software and how they relate with each other.
  4. Categorical Analysis: This resulted in a bird’s eye view of the different types of skills and software, and how they funnelled into particular areas of expertise. This is important to have because information coming from this analysis will allow candidates to position themselves, depending on what job and specialty they want to apply for. With regards to MOOC providers, having access to this kind of information will allow them to create more robust curriculums that would focus on specific areas of expertise.
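To make the first two techniques concrete, here is a minimal sketch of word and n-gram frequency counting. The tokenizer, the function names, and the two sample descriptions are illustrative assumptions and are not the scripts or data we used in the study.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep word-like tokens (allowing '+' and '#' for names like C++ or C#)."""
    return re.findall(r"[a-z0-9+#]+", text.lower())

def ngram_counts(descriptions, n):
    """Count n-grams (n=1 for words, 2 for bi-grams, 3 for tri-grams) across all descriptions."""
    counts = Counter()
    for text in descriptions:
        tokens = tokenize(text)
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

# Hypothetical job description snippets.
descriptions = [
    "Strong communication skills and experience with SQL.",
    "Communication skills required; machine learning experience preferred.",
]
print(ngram_counts(descriptions, 1).most_common(3))
print(ngram_counts(descriptions, 2).most_common(3))
```

N-grams are counted within each description, so a phrase never spans two different job listings.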

Results and Discussion

I. Word Frequency

From left-right-down: Business Analyst, Data Analyst, Data Scientist

Word Frequency - Business Analyst Word Frequency - Data Analyst

Word Frequency - Data Scientist

II. N-Gram Analysis (Bi-Grams and Tri-Grams)

Business Analyst – Skills

Business Analyst – Software

Data Analyst – Skills


Data Analyst – Software


Data Scientist – Skills


Data Scientist – Software


III. Software Association Analysis

From left-right-down: Business Analyst, Data Analyst, Data Scientist


IV. Categorical Analysis

From left-right-down: Business Analyst, Data Analyst, Data Scientist

categorical analysis - business analyst                          categorical analysis - data analyst

categorical analysis - data scientist

Most important software, skills, and education for all job positions


  • Other popular employment websites, such as ZipRecruiter, CareerBuilder, and Monster, were tried but were not included in this study because of availability and information completeness issues. Some websites are protected from automatic parsing.
  • Because the web is semi-structured, one of the main limitations of the data set cleaning process is the fact that each job listing has a different format. Because of this, scraping the listings entailed extracting the whole job description page, as compared to the ideal scenario of extracting only the requirements. Admittedly, there were some minor cleaning issues that were missed by the Python scripts that the researchers created. An example is joined words (e.g. “applicationsResponsible”, “dependenciesOptimizing”). Moreover, word stemming was not included in this study because job requirements usually refer to particular terms. However, this leaves other issues unresolved, such as plural words.
  • An unsupervised machine learning model (K-means) was tried, but the majority of the results of this study are based on human analysis. Since the study was unsupervised, there were no defined skill and software lists. This made the counting process harder to accomplish through Python scripts since there was no training data to learn from. Because of this, the top twenty words for each category and job position were counted and aggregated manually. Admittedly, this opens up the process to potential human errors.
  • Another limitation was experienced during the word and n-gram frequency count process. As a precursor to the process, general ground rules were agreed upon to preserve the uniformity and consistency of the results. What wasn’t accounted for was that, despite the general ground rules, each researcher’s results from counting and combining terms were still affected by their own personal judgements. Because of this, further adjustments had to be made to make the results as consistent as possible.
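The joined-words issue mentioned above could be handled with a small post-processing step. The sketch below is one possible approach and was not part of our scripts: it inserts a space wherever a lowercase letter is directly followed by an uppercase letter. The PROTECTED set is a hypothetical whitelist for product names that legitimately contain such a boundary.

```python
import re

# Hypothetical whitelist of terms that should not be split.
PROTECTED = {"JavaScript", "PowerPoint", "MongoDB"}

def split_joined_words(text, protected=PROTECTED):
    """Split tokens such as 'applicationsResponsible' into 'applications Responsible'."""
    fixed = []
    for token in text.split():
        if token in protected:
            fixed.append(token)
        else:
            # Zero-width match between a lowercase and an uppercase letter.
            fixed.append(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token))
    return " ".join(fixed)

print(split_joined_words("manage applicationsResponsible for JavaScript dependenciesOptimizing"))
```

The whitelist check is an exact match, so a protected term with punctuation attached (for example "JavaScript,") would still be split; a fuller version would strip punctuation first.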


Given the various limitations of this study, it is recommended that further research be done.

  • Incorporating machine learning into the research could improve the accuracy of the results considerably.
  • Unsupervised machine learning studies about this topic should be pursued because its expected results, such as dictionaries of skills, software, and education, will be able to support future supervised research, thus paving the way for automation.

Sample Code for Web Scraping

#this code was created for the purposes of my web analytics class project
#francisco mendoza
#web scraping
from bs4 import BeautifulSoup
import requests
import urllib
import urllib2
def initializeURL():
	pagenumber = 1
	URL = ''
	data = urllib.urlopen(URL)
	soup = BeautifulSoup(data, "html.parser")
	return soup

pagecount = 0

# cheatsheets
# print 'soup.title: ', soup.title
# print ' ',
# print 'soup.title.string: ', soup.title.string
# print 'soup.p: ', soup.p
# print 'soup.p.string: ', soup.p.string
# print 'soup.a: ', soup.a
# print 'soup.find_all("a") - just the links: '
# for link in soup.find_all(attrs={'class':'serp-result-div'}):
# 	for urlinfo in link.find_all('a'):
# 		if urlinfo.get('href').startswith(''):
# 			listingcount+=1
# 			print "--- --- ---"
# 			print str(listingcount)
# 			print "Posted on: \t X"
# 			print "URL: \t", urlinfo.get('href')

#gets total pages
def countpage():
	posicounter = 0
	positiontotal = 0
	for positions in soup.find_all('div', {'class':'col-md-12'}):
		for posicount in positions.find_all('span'):
			posicounter += 1
			if posicounter == 6:
				positiontotal = int(posicount.string)

	# compute number of pages (30 listings per page); ceiling division so a partial last page is still counted
	pagestotal = (positiontotal + 29) // 30
	return pagestotal


def getAll(pagestotal, soup):
#cycle through all pages
	listingcount = 0
	for pagecount in range(pagestotal):
		URL = ''+str(pagecount)+'-limit-30-jobs?searchid=3908011389118'
		data = urllib.urlopen(URL)
		soup = BeautifulSoup(data, "html.parser")
		for searchresults in soup.find_all(id='serp'):
			for listing in searchresults.find_all('div', {'class': 'complete-serp-result-div'}):
				print "-----------"
				listingcount +=1
				print str(listingcount)+"."
				for urlinfo in listing.find_all('a'):
					if urlinfo.get('href').startswith(''):
						print "URL: \t", urlinfo.get('href')
				for shortdescription in listing.find_all('div', {'class':'shortdesc'}):
					for string in shortdescription.stripped_strings:
						print "Short Description: \t", repr(string)
				for smalldetails in listing.find_all('ul', {'class':'list-inline'}):
					for companyli in smalldetails.find_all('li', {'class':'employer'}):
						for companyprint in companyli.find_all('span', {'class':'hidden-xs'}):
							print "Company: \t", companyprint.text
					for locationli in smalldetails.find_all('li', {'class':'location'}):
						print "Location: \t", locationli.text
					for postedli in smalldetails.find_all('li', {'class':'posted'}):
						print "Posted: \t", postedli.text


#get description

def getlistingdescription():
	listingcount = 0
	for pagecount in range(pagestotal):
		URL = ''+str(pagecount)+'-limit-30-jobs?searchid=3908011389118'
		data = urllib.urlopen(URL)
		soup = BeautifulSoup(data, "html.parser")
		for searchresults in soup.find_all(id='serp'):
			for listing in searchresults.find_all('div', {'class': 'complete-serp-result-div'}):
				print "-----------"
				listingcount +=1
				print str(listingcount)+"."
				for urlinfo in listing.find_all('a'):
					if urlinfo.get('href').startswith(''):
						print "URL: \t", urlinfo.get('href')
						listingurl = urlinfo.get('href')
						access = requests.get(listingurl)
						content = access.content
						listingsoup = BeautifulSoup(content, "html.parser")
						for title in listingsoup.find_all('h1', {'class':'jobTitle'}):
							print "Job Title:\t", title.text
						for jd in listingsoup.find_all('div', {'id':'jobdescSec'}):
							print "Job Description:"
							print jd.text
						for metadatas in listingsoup.find_all('ul',{'class':'list-inline'}):
							for companyli in metadatas.find_all('li',{'class':'employer'}):
								print "Company"
								print companyli.text
							for locationli in metadatas.find_all('li',{'class':'location'}):
								print "Location"
								print locationli.text
							for postedli in metadatas.find_all('li',{'class':'posted hidden-xs'}):
								print postedli.text

						titleTofile = title.text
						titleTofilepre1 = titleTofile.encode('utf-8','ignore')
						titleTofilepre2 = titleTofilepre1.strip()
						titleTofilepre3 = titleTofilepre2.replace('\n','')
						jdTofile = jd.text
						jdTofilepre1 = jdTofile.encode('utf-8','ignore')
						jdTofilepre2 = jdTofilepre1.strip()
						jdTofilepre3 = jdTofilepre2.replace('\n','')
						companyliTofile = companyli.text
						companyliTofile1 = companyliTofile.encode('utf-8','ignore')
						companyliTofile2 = companyliTofile1.strip()
						companyliTofile3 = companyliTofile2.replace(',','')
						companyliTofile4 = companyliTofile3.replace('\n','')
						locationliTofile = locationli.text
						locationliTofile1 = locationliTofile.encode('utf-8','ignore')
						locationliTofile2 = locationliTofile1.strip()
						locationliTofile3 = locationliTofile2.replace('\n','')
						postliTofile = postedli.text
						postliTofile1 = postliTofile.encode('utf-8','ignore')
						postliTofile2 = postliTofile1.strip()
						postliTofile3 = postliTofile2.replace('\n','')
						postliTofile4 = postliTofile3.replace('Posted by','')

						with open ('datascientistNYClistings.txt','a') as csvfile:
							csvfile.write(str(urlinfo.get('href'))+ '\t' + str(titleTofilepre3) + '\t' + str(companyliTofile4) + '\t' + str(locationliTofile3) + '\t' + str(postliTofile4) + '\t' + str(jdTofilepre3) + '\n')
						# with open ('datascientistNYClistingsNoURLNoTitle.txt','a') as csvfile:
						# 	csvfile.write(str(jdTofilepre3) + '\n')	


soup = initializeURL()
pagestotal = countpage()
# getAll(pagestotal,soup)



