
Predicting Success using SPSS: An Indiegogo Prediction Study

This was submitted as a project for my data mining class in my MS Business Analytics program. My team included classmates: Anh Duong and Luoqi Deng. This project was done for educational purposes only. Click the photos to enlarge.

Abstract & Key Learnings

  • The hypothesis that crowdfunding campaigns from Colorado have a higher success rate than campaigns located elsewhere was supported.
  • The category, the number of comments, the number of funders, and the fund goal of the campaign contribute significantly to its success.
  • A campaign creator from Colorado who pays attention to these factors can significantly increase the campaign’s chances of success.

Project Rationale

  • Indiegogo, the largest global crowdfunding and fundraising site online, has funded over 175,000 campaigns, with an estimated valuation of $800 million, in the past seven years. Over 2.5 million funders from 266 countries have contributed to the success of these campaigns.
  • A recent tally by Krowdster.co, a crowdfunding marketing and PR solution provider, found that nine out of ten (90%) Indiegogo campaigns fail to reach their goal, which is significantly higher than the 66.6% failure rate of Kickstarter, one of Indiegogo’s biggest competitors.

Problem Statement

This study aimed to determine whether campaign success can be predicted by certain attributes of its profile, campaign activity, and Indiegogo funder engagement.

Data Description/Preprocessing

The data used came from a public data set from BigML.com.

  • 15K rows
  • Global data
  • No time factor
  • Contains finished and unfinished campaigns
Original data set variables

Cleaning

In order to make the data usable, the data set had to undergo cleaning. All blank cells were turned into zeroes, and variables with inconsistent units (hours, days, minutes) were converted into consistent values. To make the analysis easier and more reflective of performance, two calculated fields were created (a pandas sketch of this step follows the list):

  • Attainment Rate = (Raised / Goal) × 100
  • Attained = 1 if Attainment Rate ≥ 100, else 0
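
The cleaning itself was done outside Python; purely as an illustration, a minimal pandas version of these two calculated fields could look like the following (the file name and the Raised/Goal column names are assumptions, not the project's actual artifacts).

# Illustrative sketch only: the study's cleaning was not done in pandas, and the
# file/column names here are assumptions.
import pandas as pd

df = pd.read_csv("indiegogo_raw.csv")
df = df.fillna(0)                                   # blank cells -> zeroes

# Attainment Rate as a percentage of the goal, and the binary Attained target
df["Attainment Rate"] = df["Raised"] / df["Goal"].replace(0, float("nan")) * 100
df["Attained"] = (df["Attainment Rate"] >= 100).astype(int)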

Variable Selection

State and country were disregarded because the scope of the study was focused on Colorado-only campaigns.

Model Building

For this study, the Decision Tree, Neural Network, Bayesian, and Clustering models were used to determine the most important predictors to a campaign’s success. Multiple models were chosen so that the outcomes could be compared and a more logical conclusion could be confidently drawn. The association model wasn’t used because, although it would be interesting to determine the confidence and correlation of the occurrence of the different variables, it would contribute little to the goal of this study.

Assumptions and Limitations

  • The study is limited by the researchers’ inability to collect all relevant variables, such as the important dates of the campaigns. The results of the predictive models are based solely on the given attributes.
  • The observations are independent of each other
  • Campaigns happened within the same year
  • A campaign is considered successful when the amount of capital raised by the crowdfunders meets or exceeds the goal set by the campaign owner. Therefore, two calculated fields, Attainment Rate and Attained, were created: Attainment Rate is the raised amount over the goal amount (expressed as a percentage), and Attained is 1 when the Attainment Rate reaches 100 and 0 otherwise. These two fields are used as the targets for the models.

Hypothesis

According to Krowdster.co, only 10% of Indiegogo campaigns meet their goals. In our Colorado sample, the observed success rate was 30%.

  • The p-value (Sig.) of the one-sample t-test is less than 0.001; thus, we reject the null hypothesis that the sample mean is equal to the hypothesized population value.
  • This means that crowdfunding campaigns in Colorado do have a higher chance of success than the average campaign globally (a scipy sketch of the test follows the list).
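
The test itself was run in SPSS; as a rough illustration of the same one-sample t-test in Python, the sketch below assumes a DataFrame df of the Colorado sample with the binary Attained column (1 = goal met) and the 10% global benchmark.

# Minimal sketch of the hypothesis test above; scipy stands in for SPSS here.
from scipy import stats

# H0: the mean success rate of Colorado campaigns equals the global 10% benchmark
t_stat, p_value = stats.ttest_1samp(df["Attained"], popmean=0.10)
print("t = %.3f, p = %.4f" % (t_stat, p_value))
if p_value < 0.001:
    print("Reject H0: the Colorado success rate differs from the 10% benchmark")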


Descriptive/Regression Analysis

Correlation Table


  • Consideration: potential correlation between continuous variables.
    • A strong correlation between two variables makes it harder to explain what is going on. The correlation table shows the pairwise correlations between the continuous variables. Two pairs, Updates & Gallery (0.589) and Comments & Funders (0.681), show strong positive correlation.
    • However, since we already have very few variables, we decided not to conduct Principal Component Analysis or Factor Analysis to drop variables.
  • It is helpful to understand whether a high number of comments, updates, gallery items, or funders is correlated with the attainment rate. If the correlation is indeed strong, campaign owners could promote those variables to increase the attainment rate (a short pandas sketch of this check follows the list).
    • Attainment Rate has a correlation of 0.397 and 0.44 with Comments and Funders, respectively, which is a good indicator that a high number of comments and funders could lead to a higher attainment rate.
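
The correlation and descriptive statistics shown in the figures were produced in SPSS; as an illustration only, the same checks in pandas might look like this, reusing the df from the earlier sketch (the column names are assumptions).

# Pearson correlation matrix and descriptive statistics for the continuous variables
continuous = ["Updates", "Gallery", "Comments", "Funders", "Goal", "Attainment Rate"]
print(df[continuous].corr().round(3))
print(df[continuous].describe())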
Statistics Table


  • The Statistics table (Figure 2) shows the descriptive statistics about the continuous variables and the outcome Attainment Rate.
    • All the continuous variables are heavily skewed to the right (mean >> median). The skewness of the data set makes it difficult to apply hypothesis testing, since most tests are based on a normality assumption.
    • Regarding the Attainment Rate, the fact that it has the mean and median below 100 indicates the high failure rate of the sample.
    • The standard deviation of 260 and the maximum value of 14281 suggest that there are many outliers in the dataset.
Distribution of Campaigns based on Category


  • The Category table (Figure 3) shows the distribution of Category, the only relevant discrete variable of the sample data set, since the geographical dimension has been eliminated.

Predictive Analysis

If the results from the classification models are similar to each other, a conclusion about the predictor of success of a campaign can be confidently drawn.

Decision Tree

Preparations:

  • Partition: 50-50 Training and Testing data
  • Irrelevant attributes, such as text and URLs, were cast as typeless to prevent noise and misrepresentation in the model (a rough scikit-learn analogue of this setup follows the list).
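
The decision tree was built in SPSS Modeler; the sketch below is only a loose scikit-learn analogue of the same 50-50 setup, and the file and feature names are assumptions.

# Illustrative analogue of the 50-50 decision tree setup; not the study's actual SPSS model.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("indiegogo_colorado_clean.csv")            # assumed file name
features = ["Category", "Comments", "Funders", "Updates", "Gallery", "Goal"]
X = pd.get_dummies(data[features], columns=["Category"])      # encode the discrete variable
y = data["Attained"]

# 50-50 partition into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print("Training accuracy:", accuracy_score(y_train, tree.predict(X_train)))
print("Testing accuracy:", accuracy_score(y_test, tree.predict(X_test)))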
Model Summary


Accuracy test between training and testing data


Neural Network

Preparations:

  • Partition: 70-30 Training and Testing data (a rough scikit-learn analogue is sketched below)
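
As with the decision tree, the neural network was built in SPSS Modeler; purely for illustration, scikit-learn's MLPClassifier with a 70-30 split could stand in for it, reusing X and y from the previous sketch.

# Illustrative analogue of the 70-30 neural network setup
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
nn = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000))
nn.fit(X_train, y_train)
print("Training accuracy:", nn.score(X_train, y_train))
print("Testing accuracy:", nn.score(X_test, y_test))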
Model Summary


Predictor Importance


Accuracy test between training and testing data


Histograms for Goal and Comments


Bayesian Network

Preparations:

  • Partition: 50-50 Training and Testing data (a loose naive Bayes illustration is sketched below)
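
The study used SPSS Modeler's Bayesian Network node; a naive Bayes classifier is a much simpler relative of that model, shown here only to mirror the same 50-50 train/test workflow, again reusing X and y from the decision tree sketch.

# Loose illustration only: naive Bayes stands in for the SPSS Bayesian Network node.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
nb = GaussianNB().fit(X_train, y_train)
print("Testing accuracy:", nb.score(X_test, y_test))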
Model Summary


Coincidence Matrix



Clustering

Model Summary


Distribution of Data


Summary

Overall, the performance of our analysis is at a fair-to-moderate level of acceptability. The results from our classification models are broadly consistent with each other, which means we can trust their outcomes. The following table shows the summary of our analysis:

Summary of Predictive Analysis Results


Domain Knowledge

Domain knowledge was gathered to be able to ground the study to what happens in the business context.

  • With the advent of social media, it has become more evident that crowdfunding is not bound by geographical constraints.
  • It was also observed that funders exhibit herding behavior. This means that as a campaign accumulates capital, more and more individual funders are motivated to make the campaign a success.
  • Immediate markets, such as friends and family, play an important role in the early success of a crowdfunding campaign. This market is known to drive a spike in funding during the first few days of a campaign.
  • Crowdfunding is a platform of incentives to both the creators and funders.
  • Funders lose motivation in supporting campaigns where creator incompetence and inexperience is evident.

Moving Forward

  • A bigger data set, in terms of both dimensions and quantity, would improve the overall quality of the classification results.

Job Listing Mining to get the Industry Standard Skills Requirements of a Job Position using NLP and Python

This was submitted as a project for my web analytics class in my MS Business Analytics program. The original title is “Bridging the Gap: Improving the link between job applicant competitiveness and the MOOC business model”. My team included classmates: Liyi Li, Long Wan, Yiting Cai. This project was done for educational purposes only. Click the photos to enlarge.

Abstract & Key Learnings

  • Text Analytics was used to analyze thousands of job descriptions from various employment websites to determine the top requirements of a particular job position.
  • The Data Scientist position is the most technical of the three, while the Business Analyst position, although it still involves technical work, plays a bigger role in terms of business fulfilment.
  • In terms of specific skills and software, the top three skills needed for Business Analyst positions are the following: Communication Skills, Project Management, and Verbal and Written Skills, while the top three software required are Microsoft Office, Data Warehousing software, and Big Data software.
  • For the Data Analyst position, the top skills required are Verbal and Written Skills, Communication Skills, and Data Analysis, while SQL, Big Data software, and Microsoft Office are the top software required.
  • Lastly, for the Data Scientist position, the top three skills required are Data Analytics, Communication Skills, and Data Visualization. In terms of software, the top three required are Machine Learning software, Big Data software, and Visualization Tools.

Project Rationale

  • Year after year, New York is swarmed by thousands of unemployed and newly graduated hopefuls with the main goal of securing a job.
  • With 44% of the city’s current working-age population unemployed, competition becomes overwhelming, making it harder and harder to differentiate oneself from other applicants.
  • MOOCs have become a legitimate source for learning skills and knowledge that can potentially increase the marketability and competitiveness of a job applicant. The only problem is that, with the sheer number of available online courses on different MOOC sites, it becomes harder to distinguish which course is appropriate and applicable to fulfilling a specific skill-set required in many of the currently open job positions.

Problem Statement

The purpose of this study is to determine, at any given time, the top requirements of a particular job position. This is done by applying text content analysis to job descriptions collected from top employment websites under a specific search term.

Information from this study will help two types of entities:

  1. Job applicants, by giving them accurate ideas of what companies are looking for when hiring for a position, and
  2. Massive Open Online Course (MOOC) providers, by offering them the ability to discover which skills to prioritize for course creation.

This could change how quickly job applicants can make themselves marketable to prospective companies through means other than traditional schooling and work experience, by providing the same information to MOOC providers and users. Also, since paid courses and verified certificates are the main source of revenue for the MOOC business model, this study can provide a research-based methodology for increasing the value of verified certificates and improving learning environments, in the hope that they will meet the ever-changing requirements of different types of learners and, more importantly, be recognized by employers.

Scope

Data Specifications

Creating the data sets entailed scraping the aforementioned sites’ content, cleaning the extracted data, and compiling it into separate data sets based on search query.

Table 1 Raw Data Set Variables


The raw compiled data sets for the three query terms are available for download.

Sample Rows


Analytical Techniques

  1. Word Frequency: Word clouds were used to illustrate the frequency of words in the job descriptions. Limitations: Although this is perfect for keywords with an immediately relevant meaning, such as software (e.g. Python, SQL), it proved inadequate for pinpointing the relevance of more ambiguous keywords, such as “business” and “experience”. These words can, however, give a general idea of the important themes that surround a certain job position. Also, beyond the top ten most frequently mentioned words, the effect of word clouds becomes negligible.
  2. N-Gram Analysis: This allowed relevant keywords and their respective contexts to surface; a minimal counting sketch follows this list. To support this analysis, tree maps were chosen to visualize phrase frequency and make the study more robust. Four maps were generated for each job position to show the results of bi-gram and tri-gram counts for both software and skills.
  3. Network Analysis: Chosen to present the different connections and interactions among the software skills. It can supplement the N-gram analysis and support the later categorical-level analysis as well. With the results of this analysis, MOOC providers and job applicants will not only learn which software is the most useful, but will also gain a progressive and systematic understanding of the software and how they relate to each other.
  4. Categorical Analysis: This resulted in a bird’s-eye view of the different types of skills and software and how they funnelled into particular areas of expertise. This is important because the information coming from this analysis will allow candidates to position themselves depending on what job and specialty they want to apply for. For MOOC providers, having access to this kind of information will allow them to create more robust curricula that focus on specific areas of expertise.
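
The actual counts in the study were produced with the team's Python scripts plus manual aggregation; the fragment below is only a minimal illustration of word and n-gram counting over a scraped listings file (the file name and stop-word list are assumptions).

# Minimal word/bi-gram/tri-gram counting sketch; file name and stop words are assumptions.
import re
from collections import Counter

with open("datascientistNYClistings.txt") as f:
    text = f.read().lower()

tokens = re.findall(r"[a-z][a-z+#.]*", text)                      # crude tokenizer
stopwords = {"the", "and", "of", "to", "a", "in", "with", "for", "or", "on", "is", "as"}
words = [t for t in tokens if t not in stopwords]

print(Counter(words).most_common(10))                             # word frequency
print(Counter(zip(words, words[1:])).most_common(10))             # bi-grams
print(Counter(zip(words, words[1:], words[2:])).most_common(10))  # tri-grams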

Results and Discussion

I. Word Frequency

From left to right, then down: Business Analyst, Data Analyst, Data Scientist

Word Frequency - Business Analyst Word Frequency - Data Analyst

Word Frequency - Data Scientist

II. N-Gram Analysis (Bi-Grams and Tri-Grams)

Business Analyst – Skills

Business Analyst – Software

Data Analyst – Skills


Data Analyst – Software


Data Scientist – Skills


Data Scientist – Software


III. Software Association Analysis

From left to right, then down: Business Analyst, Data Analyst, Data Scientist


IV. Categorical Analysis

From left to right, then down: Business Analyst, Data Analyst, Data Scientist


Most important software, skills, and education for all job positions


Limitations

  • Other popular employment websites, such as ZipRecruiter, CareerBuilder, and Monster, were tried but not included in this study due to issues with availability and information completeness. Some websites are also protected against automated parsing.
  • Because job listing pages are only semi-structured, one of the main limitations of the data cleaning process is that each job listing has a different format. Because of this, scraping the listings entailed extracting the whole job description page, as opposed to the ideal scenario of extracting only the requirements. Admittedly, there were some minor cleaning issues missed by the Python scripts the researchers created, such as joined words (e.g. “applicationsResponsible”, “dependenciesOptimizing”); a possible fix is sketched after this list. Moreover, word stemming was not included in this study, because job requirements usually refer to particular terms; however, this leaves other issues, such as plural words.
  • An unsupervised machine learning model (k-means) was tried, but the majority of the results of this study are based on human analysis. Since the study was unsupervised, there were no predefined skill and software lists. This made the counting process harder to accomplish through Python scripts, since there was no training data to learn from. Because of this, the top twenty words for each category and job position were counted and aggregated manually. Admittedly, this opens the process up to potential human error.
  • Another limitation was experienced during the word and n-gram frequency counting process. As a precursor, general ground rules were deliberated upon to preserve the uniformity and consistency of the results. What wasn’t accounted for was that, despite the general ground rules, each researcher’s counting and combining of terms was still affected by their own personal judgement. Because of this, further adjustments had to be made to make the results as consistent as possible.
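
The joined-words artifact mentioned above could plausibly be handled with a small regular expression; the helper below is hypothetical and was not part of the original cleaning scripts.

# Hypothetical fix for joined words: insert a space wherever a lowercase letter
# is immediately followed by an uppercase one.
import re

def split_joined_words(text):
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(split_joined_words("applicationsResponsible dependenciesOptimizing"))
# -> applications Responsible dependencies Optimizing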

Recommendations

Given the various limitations of this study, it is recommended that further research be done.

  • Being able to incorporate machine learning into the research will improve result accuracy drastically.
  • Unsupervised machine learning studies about this topic should be pursued because its expected results, such as dictionaries of skills, software, and education, will be able to support future supervised research, thus paving the way for automation.

Sample Code for Web Scraping

#this code was created for the purposes of my web analytics class project
#francisco mendoza
#web scraping dice.com
from bs4 import BeautifulSoup
import requests
import urllib
import urllib2
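
# Note: this script was written for Python 2 (print statements, urllib.urlopen);
# the dice.com URL structure and CSS class names below reflect the site at the
# time of the project and may no longer match the live site.

# fetch the first page of search results and return its parsed soup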
def initializeURL():
	pagenumber = 1
	URL = 'https://www.dice.com/jobs/q-data_scientist-limit-30-l-New_York_City%2C_NY-radius-30-startPage-1-limit-30-jobs?searchid=3908011389118'
	data = urllib.urlopen(URL)
	soup = BeautifulSoup(data, "html.parser")
	return soup

pagecount = 0

# cheatsheets
# print 'soup.title: ', soup.title
# print 'soup.title.name: ', soup.title.name
# print 'soup.title.string: ', soup.title.string
# print 'soup.p: ', soup.p
# print 'soup.p.string: ', soup.p.string
# print 'soup.a: ', soup.a
# print 'soup.find_all("a") - just the links: '
# for link in soup.find_all(attrs={'class':'serp-result-div'}):
# 	for urlinfo in link.find_all('a'):
# 		if urlinfo.get('href').startswith('https://www.dice.com/jobs/detail'):
			
# 			listingcount+=1
# 			print "--- --- ---"
# 			print str(listingcount)
# 			print "Posted on: \t X"
# 			print "URL: \t", urlinfo.get('href')
			

#gets total pages
def countpage():
	posicounter = 0
	positiontotal = 0
	for positions in soup.find_all('div', {'class':'col-md-12'}):
		for posicount in positions.find_all('span'):
			posicounter += 1
			if posicounter == 6:
				positiontotal = int(posicount.string)

	# compute number of pages (30 results per page); round up so a partial last page is not dropped
	pagestotal = (positiontotal + 29) // 30
	return pagestotal


def getAll(pagestotal, soup):
#cycle through all pages
	listingcount = 0
	for pagecount in range(pagestotal):
		URL = 'https://www.dice.com/jobs/q-data_scientist-limit-30-l-New_York_City%2C_NY-radius-30-startPage-'+str(pagecount)+'-limit-30-jobs?searchid=3908011389118'
		data = urllib.urlopen(URL)
		soup = BeautifulSoup(data, "html.parser")
		for searchresults in soup.find_all(id='serp'):
			for listing in searchresults.find_all('div', {'class': 'complete-serp-result-div'}):
				print "-----------"
				listingcount +=1
				print str(listingcount)+"."
				for urlinfo in listing.find_all('a'):
					if urlinfo.get('href').startswith('https://www.dice.com/jobs/detail'):
						print "URL: \t", urlinfo.get('href')
				for shortdescription in listing.find_all('div', {'class':'shortdesc'}):
					for string in shortdescription.stripped_strings:
						print "Short Description: \t", repr(string)
				for smalldetails in listing.find_all('ul', {'class':'list-inline'}):
					for companyli in smalldetails.find_all('li', {'class':'employer'}):
						for companyprint in companyli.find_all('span', {'class':'hidden-xs'}):
							print "Company: \t", companyprint.text
					for locationli in smalldetails.find_all('li', {'class':'location'}):
						print "Location: \t", locationli.text
					for postedli in smalldetails.find_all('li', {'class':'posted'}):
						print "Posted: \t", postedli.text
	return


#get description

def getlistingdescription():
	listingcount = 0
	for pagecount in range(pagestotal):
		URL = 'https://www.dice.com/jobs/q-data_scientist-limit-30-l-New_York_City%2C_NY-radius-30-startPage-'+str(pagecount)+'-limit-30-jobs?searchid=3908011389118'
		data = urllib.urlopen(URL)
		soup = BeautifulSoup(data, "html.parser")
		for searchresults in soup.find_all(id='serp'):
			for listing in searchresults.find_all('div', {'class': 'complete-serp-result-div'}):
				print "-----------"
				listingcount +=1
				print str(listingcount)+"."
				for urlinfo in listing.find_all('a'):
					if urlinfo.get('href').startswith('https://www.dice.com/jobs/detail'):
						print "URL: \t", urlinfo.get('href')
						listingurl = urlinfo.get('href')
						access = requests.get(listingurl)
						access.raise_for_status()
						content = access.content
						listingsoup = BeautifulSoup(content, "html.parser")
						
						for title in listingsoup.find_all('h1', {'class':'jobTitle'}):
							print "Job Title:\t", title.text
						for jd in listingsoup.find_all('div', {'id':'jobdescSec'}):
							print "Job Description:"
							print jd.text
						for metadatas in listingsoup.find_all('ul',{'class':'list-inline'}):
							for companyli in metadatas.find_all('li',{'class':'employer'}):
								print "Company"
								print companyli.text
							for locationli in metadatas.find_all('li',{'class':'location'}):
								print "Location"
								print locationli.text
							for postedli in metadatas.find_all('li',{'class':'posted hidden-xs'}):
								print postedli.text

						#strings
						titleTofile = title.text
						titleTofilepre1 = titleTofile.encode('utf-8','ignore')
						titleTofilepre2 = titleTofilepre1.strip()
						titleTofilepre3 = titleTofilepre2.replace('\n','')
						jdTofile = jd.text
						jdTofilepre1 = jdTofile.encode('utf-8','ignore')
						jdTofilepre2 = jdTofilepre1.strip()
						jdTofilepre3 = jdTofilepre2.replace('\n','')
						companyliTofile = companyli.text
						companyliTofile1 = companyliTofile.encode('utf-8','ignore')
						companyliTofile2 = companyliTofile1.strip()
						companyliTofile3 = companyliTofile2.replace(',','')
						companyliTofile4 = companyliTofile3.replace('\n','')  # chain from the comma-stripped string
						locationliTofile = locationli.text
						locationliTofile1 = locationliTofile.encode('utf-8','ignore')
						locationliTofile2 = locationliTofile1.strip()
						locationliTofile3 = locationliTofile2.replace('\n','')
						postliTofile = postedli.text
						postliTofile1 = postliTofile.encode('utf-8','ignore')
						postliTofile2 = postliTofile1.strip()
						postliTofile3 = postliTofile2.replace('\n','')
						postliTofile4 = postliTofile3.replace('Posted by','')


						with open ('datascientistNYClistings.txt','a') as csvfile:
							csvfile.write(str(urlinfo.get('href'))+ '\t' + str(titleTofilepre3) + '\t' + str(companyliTofile4) + '\t' + str(locationliTofile3) + '\t' + str(postliTofile4) + '\t' + str(jdTofilepre3) + '\n')
							
						# with open ('datascientistNYClistingsNoURLNoTitle.txt','a') as csvfile:
						# 	csvfile.write(str(jdTofilepre3) + '\n')	
	return


soup = initializeURL()
pagestotal = countpage()
# getAll(pagestotal,soup)
getlistingdescription()


 
