Classifying a Company’s True Earnings Quality using Text Analytics and Machine Learning on S&P Proxy Statements’ Compensation Discussion and Analysis [R, Python]

This was submitted as a project for the Text Analytics class in my MS Business Analytics program. The original title was “Text Analytics on the Compensation Discussion and Analysis of S&P 1500 Proxy Statements.” My teammates were Minglu Sun, Jiawen Zhou, and Yi Luo. This project was done for educational purposes only. Check out the GitHub page for the files and data set.

Problem Statement

The purpose of this study is to explore whether the sentiment, structure, and contents of a company’s Proxy Statement Compensation Discussion and Analysis (CD&A) reflect the company’s real financial performance, measured by the relationship between Earnings per Share and Operating Cash Flow per Share.


  • Public companies submit C-level management compensation reports to the SEC every year. The Compensation Discussion and Analysis (CD&A) section discloses all material elements of the company’s executive compensation programs and explains why the C-suite executives are paid what they are.
  • The compensation report is highly sensitive and must be explained with the utmost transparency. In an attempt to standardize transparency in the document, in early 2017 the SEC proposed rules and regulations that would require companies to disclose the relationship between executive pay and a company’s financial performance.
  • That being said, whether the Compensation Discussion and Analysis actually reflects the company’s real financial performance still needs to be tested.


In this project, we assumed that the more positively a company’s CD&A is written, the better the company’s earnings quality is in a given fiscal year.

According to Investopedia, two financial indicators are used to show whether a company’s earnings are of high or low quality. A company has high-quality earnings if it is generating more cash than is reported in the income statement. Earnings quality is low if the company’s statements do not show the negative operating results of the company, in which case the true cash operating results are also overstated.

  • High quality earnings: Earnings Per Share (EPS) > Operating Cash Flow Per Share (CFS)
  • Low quality earnings: Earnings Per Share (EPS) < Operating Cash Flow Per Share (CFS)
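The labelling rule in the two bullets above can be written as a small function. This is only a minimal sketch: the function name `label_earnings_quality` and the sample figures are hypothetical, and since the write-up does not say how ties (EPS equal to CFS) were handled, the sketch simply returns `None` for them.

```python
from typing import Optional


def label_earnings_quality(eps: float, cfs: float) -> Optional[str]:
    """Apply the rule from the bullets above: compare EPS with operating CFS."""
    if eps > cfs:
        return "high"
    if eps < cfs:
        return "low"
    return None  # tie: not covered by the rule as stated (assumption)


# Hypothetical figures, for illustration only.
companies = {"AAA": (2.10, 1.75), "BBB": (0.90, 1.40), "CCC": (1.00, 1.00)}
labels = {
    ticker: label_earnings_quality(eps, cfs)
    for ticker, (eps, cfs) in companies.items()
}
print(labels)
```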

Data Description

For this project, three data sets were collected:

  1. Compensation Discussion and Analysis (CD&A) sections from the proxy statements of randomly selected S&P 1500 companies, taken from the U.S. SEC EDGAR System.
  2. Company performance (Earnings Per Share and Cash Flow Per Share) using the Intrinio Financial Marketplace API.
  3. Ticker and registered company industries from Google Finance.

In addition, two popular sentiment lexicons were selected for the sentiment analysis portion: Bing Liu’s sentiment dictionary and the Loughran-McDonald Master Dictionary, which was developed for Tim Loughran and Bill McDonald’s paper in the Journal of Finance entitled “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks” (2011).

Document Structural Dimension

  • Although the SEC does not prescribe the structure of the Proxy Statement, most companies use a similar structure for the statement as a whole and for the Compensation Discussion and Analysis part.
  • In general, the first paragraph of the CD&A is the introduction, which briefly describes what this part contains.
  • The second paragraph is the Executive Summary. A large number of companies disclose the current year’s financial performance in the Executive Summary, making it the most important section for sentiment information. Most of the positive or negative words and phrases were extracted from the Executive Summary.
  • The rest of the Compensation Discussion and Analysis consists of detailed descriptions of the compensation policy, the subcategories of compensation, and the approval from the compensation committee. A few companies explain their compensation decisions within these detailed components, which also reveals sentiment information.

Document Content Dimension

  • The documents share the characteristics of public financial statements. Most of the sentences analyze and compare numeric values, which represent financial performance.


Data Preprocessing

  • Text data was extracted from the CD&As. It underwent cleaning, which involved removal of punctuation and special characters.
  • Domain-specific lexicon creation. In the process, positive and negative words, phrases, and templates were extracted from 200 of the 500 documents.
    • Templates, e.g.:
      • an <increase>/<decrease> of <amount> from <number>/<year>
      • <metric> <increased>/<decreased> <amount> over/compared to <year>

Domain-specific lexicon sample

  • The team simulated “expertise” and classified the 200 documents into positive or negative performance/sentiment.
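To make the cleaning step and the two templates above more concrete, here is a minimal sketch using regular expressions. The use of regex is an assumption on my part (the templates were collected by reading the documents), and the names `clean_text`, `AMOUNT`, `TEMPLATE_1`, `TEMPLATE_2`, and `match_templates` are all hypothetical. Note that the templates are matched on the raw text, since cleaning would strip symbols such as "%" and "$" that the amounts rely on.

```python
import re


def clean_text(text: str) -> str:
    """Cleaning as described above: drop punctuation and special characters."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Assumed shape of an <amount>: optional "$", digits, optional unit.
AMOUNT = r"\$?\d[\d,]*(?:\.\d+)?\s?(?:%|percent|million|billion)?"

# Template 1: an <increase>/<decrease> of <amount> from <number>/<year>
TEMPLATE_1 = re.compile(
    rf"\ban? (?P<direction>increase|decrease) of (?P<amount>{AMOUNT}) "
    r"from (?P<base>\$?\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)

# Template 2: <metric> <increased>/<decreased> <amount> over/compared to <year>
TEMPLATE_2 = re.compile(
    r"\b(?P<metric>[A-Za-z]+(?: [A-Za-z]+)?) (?P<direction>increased|decreased) "
    rf"(?P<amount>{AMOUNT}) (?:over|compared to) (?P<year>\d{{4}})",
    re.IGNORECASE,
)


def match_templates(sentence: str) -> list:
    """Return one dict per template hit found in the raw sentence."""
    hits = []
    for pattern in (TEMPLATE_1, TEMPLATE_2):
        for m in pattern.finditer(sentence):
            hits.append({
                # "increased"/"decreased" -> "increase"/"decrease" (8 letters)
                "direction": m.group("direction").lower()[:8],
                "amount": m.group("amount").strip(),
                "text": m.group(0),
            })
    return hits


sample = "Net revenue increased 8% over 2015, an increase of $1.2 million from 2015."
for hit in match_templates(sample):
    print(hit)
```

Whether an increase counts as positive or negative depends on the metric (higher revenue versus higher costs), so the sketch only records the direction and leaves the polarity decision to the lexicon.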

Sentiment Analysis

Feature-level Analysis

Feature-level Sentiment Analysis model

  • Polarity-based sentiment analysis was conducted using the two publicly available lexicons mentioned above.
  • Because the results were inadequate, the team decided to create a new domain-specific lexicon in the hope of producing a better result.
  • To complement the sentiment analysis, IBM Tone Analyzer was used to acquire 13 tonal dimension results for each company’s CD&A.
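As a rough illustration of polarity-based scoring, the sketch below nets the number of positive and negative lexicon terms that appear in a document (presence, not frequency) and labels the document by the sign of the result. The tiny word lists reuse the sample terms quoted in the results section of this write-up; the function names and the whole-word matching rule are assumptions, not the exact implementation we used.

```python
import re

# Sample terms from the domain-specific lexicon described in this write-up.
POSITIVE = ["strong performance", "outperformed", "exceeding our target", "revenue increased"]
NEGATIVE = ["decrease", "slow down in", "reduction", "did not achieve"]


def contains(term: str, text: str) -> bool:
    """True if the term occurs as a whole word or phrase (case-insensitive)."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


def polarity(text: str):
    """Net the positive and negative terms present and label by sign."""
    pos = sum(contains(term, text) for term in POSITIVE)
    neg = sum(contains(term, text) for term in NEGATIVE)
    net = pos - neg
    if net > 0:
        label = "positive"
    elif net < 0:
        label = "negative"
    else:
        label = "neutral"
    return net, label


print(polarity("We delivered strong performance and outperformed peers despite a reduction in volume."))
```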

Document-level Analysis

Document-level Sentiment Analysis Model

  • Using the “expert” classifications of the 200 labeled documents and the domain-specific lexicon as the feature set, a term-document matrix containing the count/existence of each feature in all the documents (500 in total) was created.

Term-Document Matrix

  • Using a Neural Network, the remaining 300 documents were classified into positive or negative sentiment classes.
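The term-document matrix described above can be sketched in a few lines: one row per document, one column per lexicon term, and each cell holding either the count or the existence (0/1) of that term. The function name `term_document_matrix`, the regex-based counting, and the sample sentences are assumptions for illustration only.

```python
import re


def term_document_matrix(docs, terms, binary=False):
    """Rows are documents, columns are lexicon terms.

    Cells hold the number of occurrences, or 0/1 when binary=True.
    """
    matrix = []
    for doc in docs:
        row = []
        for term in terms:
            pattern = rf"\b{re.escape(term)}\b"
            count = len(re.findall(pattern, doc, re.IGNORECASE))
            row.append(min(count, 1) if binary else count)
        matrix.append(row)
    return matrix


# Hypothetical documents; the terms come from the lexicon samples in this post.
docs = [
    "Revenue increased and we outperformed. We outperformed again.",
    "There was a reduction in margin.",
]
terms = ["outperformed", "revenue increased", "reduction"]
print(term_document_matrix(docs, terms))
print(term_document_matrix(docs, terms, binary=True))
```

The rows for the 200 labeled documents would then serve as training data, and the rows for the remaining 300 as the inputs to classify.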

Classification of Earnings Quality

  • Four scenarios were set up with earnings quality as the target variable. The predictors were the sentiment classification from the polarity-based sentiment analysis model (using the domain-specific dictionary) and the tonal information. Each scenario was run through multiple classification models (random forest, neural network, and logistic regression).
  • The following scenarios were tested:
    • Scenario 1: Financial Performances ~ CD&A Tones
    • Scenario 2: Financial Performances ~ CD&A Sentiment
    • Scenario 3: Financial Performances ~ CD&A Tones + Sentiment
    • Scenario 4: Financial Performances ~ Top 5 Predictor Importance (Tone + Sentiment)
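One way to organize the four scenarios is as four lists of predictor columns. The sketch below is illustrative only: the column names, the single `sentiment` column, and the importance scores passed in are all hypothetical, and in practice the importances for Scenario 4 would come from a fitted model.

```python
# The 13 tone dimensions returned by the tone analyzer (column names assumed).
TONES = [
    "anger", "disgust", "fear", "joy", "sadness",
    "analytical", "confident", "tentative",
    "openness", "conscientiousness", "extraversion",
    "agreeableness", "emotional_range",
]
SENTIMENT = ["sentiment"]  # assumed name of the sentiment class column


def scenario_features(importance, top_n=5):
    """Return the predictor columns for each of the four scenarios."""
    combined = TONES + SENTIMENT
    ranked = sorted(combined, key=lambda f: importance.get(f, 0.0), reverse=True)
    return {
        "scenario_1": list(TONES),
        "scenario_2": list(SENTIMENT),
        "scenario_3": combined,
        "scenario_4": ranked[:top_n],
    }


# Hypothetical importance scores, not the values from our models.
example_importance = {
    "joy": 0.30, "sadness": 0.25, "sentiment": 0.20,
    "analytical": 0.10, "tentative": 0.05, "fear": 0.01,
}
print(scenario_features(example_importance)["scenario_4"])
```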

Results and Discussion

Sentiment Analysis

Sentiment analytics, in this project, was approached in two ways: feature-level analysis by using polarity-based classification models and document-level analysis using document classification.

Feature-level Analysis

  • Feature-level sentiment analysis was initially conducted with two dictionaries: Bing Liu’s lexicon and the Loughran-McDonald Master Dictionary, which focuses on financial concepts and finance-driven directional phrases.
  • Running these dictionaries through a polarity-based sentiment analyzer (netting the counts of positive and negative words based on existence) produced very polarized results.

Bing Liu and Loughran-McDonald Sentiment Results

  • Due to the unsatisfactory results of these dictionaries, it became clear that there was a need to use a more domain-specific lexicon.
  • Since no such dictionary existed, we decided to create one by reading 200 documents and extracting positive and negative words, phrases, and templates.
    • For instance, the positive dictionary included “strong performance”, “outperformed”, “exceeding our target”, “revenue increased”, etc.
    • The negative dictionary included “decrease”, “slow down in”, “reduction”, “did not achieve”, etc.
    • In the process, each document was categorized as positive or negative. This served as input for the document-level approach.
  • Surprisingly, the new dictionary classified 487 documents as positive, 1 as negative, and the remaining 12 as neutral.

Total Polarity-based Sentiment Results

  • The accuracy of the model using the domain-specific dictionary is 67%.

Document-level Analysis

  • Using the classifications generated during the domain-specific dictionary creation phase, classification models were used to determine the sentiment class of the remaining 300 unlabeled documents.
  • Using the words and phrases in the created dictionary as predictors of sentiment, a neural network model with an accuracy of 57.89% was created. Given this lower accuracy, it was decided that the classifications from the polarity-based model that used the domain-specific dictionary would be used as input for the succeeding steps.

Evaluation of Document-level Sentiment Analysis model

Tonal Analysis

Input data also included tonal results computed by the IBM Tone Analyzer. The 13 dimensions extracted are anger, disgust, fear, joy, sadness, analytical, confident, tentative, openness, conscientiousness, extraversion, agreeableness, and emotional range.

Sample Tonal Results

To give a better idea of how the tonal results varied across the list of companies, we decided to aggregate the results up to the industry level.

Slice of the Industry-level Tonal Results

  • The Telecommunication Services industry’s CD&As have the highest joy value, 0.41.
  • Basic Materials, Energy, and Industrials share the highest sadness value, 0.41.
  • Compared to sadness and joy, the tentative tone is less pronounced. The radar chart above is a slice of the tonal analysis that contains only three tones.
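Rolling company-level tone scores up to the industry level amounts to a grouped average. The sketch below assumes a simple mean per industry and tone (the write-up does not state which aggregate was used), and the rows are made-up numbers, not our actual results.

```python
from collections import defaultdict
from statistics import mean


def industry_tone_means(rows):
    """Average each tone score across the companies in each industry."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for tone, score in row["tones"].items():
            grouped[row["industry"]][tone].append(score)
    return {
        industry: {tone: mean(scores) for tone, scores in tones.items()}
        for industry, tones in grouped.items()
    }


# Hypothetical company-level rows, for illustration only.
rows = [
    {"industry": "Energy", "tones": {"joy": 0.30, "sadness": 0.40}},
    {"industry": "Energy", "tones": {"joy": 0.50, "sadness": 0.42}},
    {"industry": "Industrials", "tones": {"joy": 0.20, "sadness": 0.40}},
]
print(industry_tone_means(rows))
```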

Classification of Earnings Quality

Model Evaluation

  • Among all the created models, the random forest model in Scenario 3 produced the highest accuracy (83%), precision (0.8), recall (0.84), and F-score (0.8).
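For reference, the four metrics above can be computed from true and predicted labels as in the sketch below. The function name, the choice of "high" as the positive class, and the sample labels are assumptions for illustration; they are not the predictions from our models.

```python
def classification_metrics(y_true, y_pred, positive="high"):
    """Accuracy, precision, recall and F1 for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
y_true = ["high", "high", "high", "low", "low"]
y_pred = ["high", "high", "low", "high", "low"]
print(classification_metrics(y_true, y_pred))
```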


According to the classification results, CD&A documents with a positive sentiment score are more likely to belong to companies with high earnings quality, which is characterized by Earnings per Share being higher than the company’s Cash Flow per Share.

In addition, there is no significant difference in tone scores and sentiment scores among different industries.


  • The domain-specific lexicon for the Compensation Discussion and Analysis will help users and stakeholders of the Proxy Statement recognize positive and negative features, enabling them to make effective and efficient decisions.
  • Since the Compensation Discussion and Analysis shares the characteristics of financial statements, the dictionary can also be applied to analyze other financial statements.

Limitations and Future Direction

  • First, the syntactic templates have not been matched against the text content, so some of the features were lost.
  • Secondly, CD&As tend to use positive words and phrases and avoid negative expressions. Even when a company was in a negative situation in a given year, the description in the discussion still seemed positive. Therefore, the frequency of positive words and phrases is higher than the actual performance warrants.
  • Thirdly, the original training set is imbalanced: there are far more positive documents than negative ones. A model that uses a balanced data set can be created in the future.
  • Also, other financial performance parameters could be used as target variables instead of cash flow per share or earnings per share.


Predicting the Winner of March Madness 2017 using R, Python, and Machine Learning

This project was done using R and Python, and the results were used as a submission to Deloitte’s March Madness Data Crunch Competition. Team members: Luo Yi, Yufei Long, and Yuyang Yue. Check the GitHub for the code.

Of the 64 teams that competed, we predicted Gonzaga University to win. Unfortunately, they lost to the University of North Carolina.


  1. Data transformation
  2. Data exploration
    • Feature Correlation testing
    • Principal Component Analysis
  3. Feature Selection
  4. Model Testing
    • Decision Tree
    • Logistic Regression
    • Random Forest
  5. Results and other analysis

Data Transformation

The data used to train the initial model came from a data set that contained 2002-2016 team performance data, which included statistics, efficiency ratings, etc., from different sources. Each row was a game that consisted of two teams and their respective performance data. For the initial training of the models, we were instructed to use 2002-2013 data as the training set and 2014-2016 data as the testing set. After examining the data, we debated what the best way to use it would be. We finally decided on creating new relative variables that would reflect the difference/ratio of team 1 and team 2’s performance. Feature correlation testing was also done during this phase. The results supported the need for relative variables.

Features Correlation Heatmap (Original Features)
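The relative variables described above can be sketched as follows: for each per-team statistic, compute the difference and the ratio between team 1 and team 2. The column naming scheme (`team1_<stat>`, `team2_<stat>`), the function name, and the sample game are hypothetical.

```python
def relative_features(game, stats):
    """Build difference and ratio features from team 1 and team 2 stats."""
    features = {}
    for stat in stats:
        a = game[f"team1_{stat}"]
        b = game[f"team2_{stat}"]
        features[f"{stat}_diff"] = a - b
        # Guard against division by zero; how such rows were treated in the
        # project is not covered here, so None is just a placeholder.
        features[f"{stat}_ratio"] = a / b if b != 0 else None
    return features


# Hypothetical game row, for illustration only.
game = {"team1_seed": 1, "team2_seed": 16, "team1_ppg": 80.0, "team2_ppg": 64.0}
print(relative_features(game, ["seed", "ppg"]))
```

A single relative feature replaces two raw columns that describe the same quantity for each team, which is what the move to relative variables was meant to achieve.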

Data Exploration

After transformation, feature correlation testing was repeated. This time, results were much more favorable. The heat map below shows that the correlation between the new variables is acceptable.

Features Correlation Heatmap (New Features)
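A correlation heat map is a color-coded matrix of pairwise correlations. The sketch below computes such a matrix with the Pearson coefficient, which is an assumption here since the type of correlation is not named above; the feature names and values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # raises ZeroDivisionError for constant input


def correlation_matrix(columns):
    """Pairwise correlations for a dict of {feature name: list of values}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical relative features, for illustration only.
columns = {
    "seed_diff": [-15, -3, 4, 8],
    "ppg_diff": [16.0, 2.5, -1.0, -6.0],
}
print(correlation_matrix(columns))
```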

Principal Component Analysis was also performed on the new features. We hoped to show which features were the most influential, even before running any machine learning models. Imputation was done to deal with missing values. The thicker lines in the chart below signify a more influential link to the 8 new discriminant features. This, however, was used to understand the features more and wasn’t used as an input for all the models.

Feature Selection by PCA

Feature Selection

For this project, we opted to remove anything (aside from seed and distance from game location) that wasn’t a performance metric. Some of the variables that were discarded were ratings data since we believed that they were too subjective to be reliable indicators.

Model Testing

We used three models for this project: Decision Tree, Logistic Regression, and Random Forest.

Decision Tree – Results were less than favorable for this model. Overfitting occurred and we had to drop it.

Random Forest (R) – We decided to use the Random Forest model for two reasons: it is less prone to overfitting than a single decision tree, and it is “democratic” in nature, since the final prediction is decided by a vote across many trees.

Predictor Importance

  • OOB Estimate of error rate: 26.9%
  • Error reduction plateaus at approx. 2,600 trees
  • Model Log-loss: 0.5556
  • Chart Legend:
    • Black: Out-of-bag estimate of error rate
    • Green and Red: Class errors

Forest Error Performance

Logistic Regression (Python) – Based on the PCA and the Random Forest model, 5 features were selected for this model.

Features Selected for Logistic Regression

Results and Other Analysis

Summary of Results

Running the models against the testing set, we got a higher accuracy for the Random Forest model. Log loss, which was also one of the key performance indicators for the competition, was roughly the same for the 2 models. That being said, Random Forest was chosen to run on the new 2017 March Madness data.
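For readers unfamiliar with log loss, the sketch below shows the standard binary formula: the average negative log-likelihood of the true outcomes under the predicted win probabilities. The function name, the clipping constant `eps`, and the sample probabilities are illustrative assumptions.

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; y_true holds 0/1 outcomes, p_pred holds P(outcome = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)


# A coin-flip prediction scores ln(2), about 0.693; confident and correct
# predictions score closer to 0.
print(log_loss([1, 0], [0.5, 0.5]))
print(log_loss([1, 0], [0.9, 0.1]))
```

Unlike accuracy, log loss penalizes confident wrong predictions heavily, which is why two models with different accuracies can still end up with similar log loss.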

As previously mentioned, we predicted Gonzaga University to win the tournament, and we came really close. The prediction made a lot of sense because, compared to the other teams, Gonzaga was a frequent contender in March Madness.

One of the more interesting teams this season was the Cinderella team, South Carolina. They had gone against expectations, and this is why we decided to analyze their journey even further.

In the 1st round, we were able to correctly predict that South Carolina was going to win. However, because we were using historical data, it was obvious that we were going to predict them to lose in the next stages, especially since they were going against stronger teams. Despite “water under the bridge” data, they were able to reach the Final 4.

Cinderella Team Win Rate by Stage

One of the questions that we wanted to attempt to answer was why they kept on winning. What was so different this year that they were able to surprise everyone?

One reason that we speculated about was the high performance of one of South Carolina’s players, Sindarius Thornwell. In past years, he was averaging 11-13 pts per game. This year, he was dropping 21.4 pts per game. Moreover, in his last 5 appearances, he was able to increase this stat to 23.6 pts per game. Looking at the score difference of South Carolina’s games in March Madness, it is evident that he was very influential in the team’s success. One could even say that without his 23.6 pts per game, the outcome of their campaign would’ve been different. But hey, that’s just speculation.

Score Difference for Cinderella Team Matches


Sindarius Thornwell March Madness Stats