Multi-agent system simulation: Quick Start with ZeroMQ [Python]

Created using Python 2.7 and ZMQ 4.2.1.

Recent work has pushed me in the direction of multi-agent A.I. systems. This was fairly challenging because I had no prior training in multi-agent systems. In fact, whenever I had to code, I had always defaulted to function-driven code, always shying away from object-oriented environments.

Realizing that there was no escaping object-oriented programming this time, I dove right in. Subway rides became refresher sessions on classes and boot camps on multi-agent systems and communication.

The biggest question that I had was: how do multiple agents, whether hosted locally or on different systems, communicate with each other? Quick research brought me to four libraries. The code is located in this GitHub repo.

  1. Threading Module
  2. SimPy (Discrete Event Simulation for Python)
  3. PyRo (Python Remote Objects)
  4. ZeroMQ (Distributed Messaging)

My method of going through this project was to start writing code and just reassess the process whenever I started breaking stuff. Take note that I didn’t dive into all the capabilities of the first three libraries; it just so happened that ZeroMQ gave me what I wanted faster than the other three.

Threading Module

Python’s threading module is beautiful, especially when you get multiple threads working. Unfortunately, I ran into a wall when I needed to change the agent type. As you can see below, although the Basic agents changed into Better agents, the Better agent threads that were started did not run past the initial three processes. I also had the problem that this simulation might not show what I wanted, which was multiple independent agents communicating with each other. Since they were housed in threads of the same Python instance, I didn’t see how my goal could be realized.
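
For reference, here’s a minimal sketch of the agents-as-threads approach (this is not the original simulation, just the general shape: each agent is a Thread and messages pass through a shared queue):

try:
    import Queue as queue  # Python 2.7, which this post uses
except ImportError:
    import queue           # Python 3
import threading

class BasicAgent(threading.Thread):
    """Toy agent: it just drops one message into a shared queue."""
    def __init__(self, agent_id, outbox):
        threading.Thread.__init__(self)
        self.agent_id = agent_id
        self.outbox = outbox

    def run(self):
        self.outbox.put('Basic agent %d reporting in' % self.agent_id)

if __name__ == '__main__':
    outbox = queue.Queue()
    agents = [BasicAgent(i, outbox) for i in range(3)]
    for agent in agents:
        agent.start()
    for agent in agents:
        agent.join()
    # everything still lives inside one Python process, which was the limitation
    while not outbox.empty():
        print(outbox.get())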

SimPy (Discrete Event Simulation)

SimPy was pretty cool as well. I wasn’t able to dive really deep into it, but from what I initially saw in the tutorials, SimPy objects have to run in an environment. This is generally how A.I. agents run, i.e., in an environment, but I had a problem with the fact that I would need to run an environment for these agents to work with each other. I need the agents to run on multiple computers in the future. Also, I need a brokerless system where the agents communicate with each other without a central “server” or environment making them run. Here is an implementation of code that came from the tutorial. The code introduces three cars and a refilling station with two slots. It’s basically a queueing system.
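
A minimal sketch along those lines, modeled after SimPy’s tutorial-style example (the slot count, timings, and messages here are illustrative, not the exact snippet I ran):

import simpy

def car(env, name, station):
    """A car arrives, waits for one of the station's slots, refills, and leaves."""
    print('%s arrives at t=%d' % (name, env.now))
    with station.request() as slot:
        yield slot                    # queue up for a free slot
        print('%s starts refilling at t=%d' % (name, env.now))
        yield env.timeout(5)          # refilling takes 5 time units
        print('%s leaves at t=%d' % (name, env.now))

env = simpy.Environment()                   # everything has to run inside this environment
station = simpy.Resource(env, capacity=2)   # refilling station with two slots
for i in range(3):
    env.process(car(env, 'Car %d' % i, station))
env.run()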

PyRo (Python Remote Objects)

This library was closer to what I needed than Threading and SimPy. It allowed me to run a server and three (3) agents. Truth be told, I was set on using this, but a few things still bothered me:

  • Will I be able to convert remote objects into servers as well? As I said, my goal is a brokerless system, where agents communicate with each other directly. I tried this, but to no avail; all communication had to go through the server/remote server.
  • Pyro4’s expose function, which exposes a remote object’s class variables or functions to the other instances, seemed to not be working correctly. I might have been using it wrong, but I doubt it, since some of the variables were accessible.
  • Communication was extremely easy… to tangle up.

There are a few things that I liked though.

  • Pyro has a nameserver, which allows agents to connect to a server or remote object using its name. This is extremely convenient, in my opinion, for situations where the agents have to dynamically connect to new servers or remote objects, since they can just call the name.
  • Daemon loop capabilities
  • Supports multiple python instances.

*GIF demonstration and code to follow

ZeroMQ

Now, this. Among the four libraries, ZeroMQ gave me what I really needed (a brokerless system), and it did so in style. Communication among agents was easy because ZeroMQ’s communication style is built around patterns. The basic patterns that you can use are the following:

  • PUSH-PULL: one-way communication. Commands can be executed this way.
  • REQ-REP: Request and reply communication. This allows agents to respond to another agent’s request. This basically starts a conversation between the agents.
  • PUB-SUB: Publish-Subscribe communication. This allows an agent to continuously broadcast a message. Agents that are subscribed to a certain topic will react or get woken up whenever a matching message arrives. Update streams, heartbeats, and wake-up signals come to mind when this type of comms is mentioned.

There are more patterns that allow sending and receiving multiple messages simultaneously without locking up the system and network, but even with these three patterns, it’s pretty doable and easy to set up multiple connection types between an agent and parallel instances of itself.
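
To make the patterns concrete, here’s a minimal REQ-REP sketch using pyzmq (port 5555 and the messages are arbitrary; the replier runs in a thread only to keep the example in one file, whereas in the simulation each agent is its own Python instance):

import threading
import zmq

def replier():
    context = zmq.Context.instance()
    rep = context.socket(zmq.REP)
    rep.bind('tcp://*:5555')
    message = rep.recv()              # wait for a request
    rep.send(b'Reply to: ' + message)
    rep.close()

agent_b = threading.Thread(target=replier)
agent_b.start()

context = zmq.Context.instance()
req = context.socket(zmq.REQ)
req.connect('tcp://localhost:5555')
req.send(b'Hello from agent A')
print(req.recv())                     # prints the reply from agent B
req.close()
agent_b.join()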

Quick Tips:

  • Messages can only be delivered in string or serialized format. Currently, I’ve only tried serializing a dictionary to JSON format (using simplejson). To my knowledge, pickle, cPickle, and MessagePack are also supported.
  • Agents can act as PUSHERS, PULLERS, PUBLISHERS, SUBSCRIBERS, REQUESTERS, and REPLIERS (and other roles not mentioned above), all at the same time. This allows agents to communicate with each other without having to rely on a server to get the message across. If you are designing an agent to be more than one of these types at the same time, you will have to assign a different port to each type of socket (see the sketch after this list).
  • There are different types of connections. I’m currently using TCP port connections on my local machine. If I have to deploy more agents, there will have to be a way to assign, reassign, and kill ports dynamically.
  • It allows connections between agents that have been coded using different languages.
  • Running multiple processes of functions of a certain agent instance will not work, at least in my current implementation, due to overlapping port connections.
  • The best thing is, in my case, I only need to create one .py file for a specific agent type. I can run this code in different Python instances and have them interact dynamically with each other, since they can take up different roles, as mentioned above. This is pretty handy because it helps with code organization. It also allows for easier management and deployment of the code across multiple systems.
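
As an example of the multi-role and serialization tips above, here’s a hedged sketch of one agent holding a PUB socket and a PULL socket on two arbitrary ports (5556 and 5557) and publishing a dictionary serialized to JSON:

import json
import zmq

context = zmq.Context.instance()

# role 1: publish status updates on one port
publisher = context.socket(zmq.PUB)
publisher.bind('tcp://*:5556')

# role 2: pull commands on a different port -- each role gets its own port
puller = context.socket(zmq.PULL)
puller.bind('tcp://*:5557')

status = {'agent': 'agent-1', 'state': 'Basic', 'timer': 12}
publisher.send_string('status ' + json.dumps(status))   # topic prefix + JSON payload

# a subscriber on another agent would do something like:
#   sub = context.socket(zmq.SUB)
#   sub.connect('tcp://localhost:5556')
#   sub.setsockopt_string(zmq.SUBSCRIBE, u'status')
#   topic, payload = sub.recv_string().split(' ', 1)
#   data = json.loads(payload)

publisher.close()
puller.close()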

Below is a simple implementation of three Basic agents trying to establish who is the best among themselves. The faster an agent’s timer runs out, the faster it can become a ‘Better’ agent. Since the Better agent is the first one to communicate with and receive replies from the Basic agents, it is deemed the Best agent. Afterwards, it sends a heartbeat to the two other agents to prevent them from assuming that no Best agent is alive and restarting their countdowns.
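
The heartbeat part of that idea, in rough pyzmq form (port 5558, the one-second interval, and the three-second timeout are placeholders, not the actual simulation code):

import time
import zmq

context = zmq.Context.instance()

heartbeat = context.socket(zmq.PUB)
heartbeat.bind('tcp://*:5558')

for _ in range(5):                    # the real Best agent would keep looping
    heartbeat.send_string(u'heartbeat best-agent-alive')
    time.sleep(1.0)

heartbeat.close()

# On each Basic agent, the countdown reset would look roughly like this:
#   sub = context.socket(zmq.SUB)
#   sub.connect('tcp://localhost:5558')
#   sub.setsockopt_string(zmq.SUBSCRIBE, u'heartbeat')
#   poller = zmq.Poller()
#   poller.register(sub, zmq.POLLIN)
#   if poller.poll(timeout=3000):     # heard a heartbeat within 3 seconds
#       sub.recv_string()             # consume it and reset the countdown
#   else:
#       pass                          # no Best agent alive -> restart the countdown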

This article will be updated as I learn more about ZeroMQ. Check the section below for the most recent updates. Drop me a comment if you have any questions as well.

Credits:

  1. ZeroMQ’s guide – Because this library keeps being updated, most of the code that I saw outside of the site and number 2 below was outdated. Understanding this guide, and thinking about how people think and communicate, helped me become accustomed to ZeroMQ’s patterns.
  2. PyZMQ’s docs – Python bindings of ZMQ.

Exploring the Association of Movie Trailer Performance on YouTube and Box Office Success using Neural Net, Python, and R

This was submitted as a project for my Big Data Analytics class in my MS Business Analytics program. The original title is “Exploring the Association of Movie Trailer Performance on YouTube and Box Office Success”. My teammates were Yi Cai, Michael Friscia, and Zheyu Tian. This project was done for educational purposes only. Click the photos to enlarge. Check out the GitHub page for the files and data set. Due to policies of thenumbers.com regarding their data, that particular data set won’t be uploaded.

UPDATE: If you scroll below, you will see that the final accuracy was 82.55%. Using genetic algorithms and a Sklearn implementation, the accuracy was improved to 98.66% (with a final generation average accuracy of 92.28%). Check out the code in this GitHub repo.

Problem Statement

The purpose of this study is to determine if there is a correlation between the performance of trailers on YouTube and Hollywood movie sales.

Project Significance

  • By evaluating important predictors from YouTube viewers, studios and agencies can create and publish movie trailers on YouTube more efficiently, thus:
    • driving box office ticket sales domestically and globally
    • generating more revenue
  • Trailer performance can be focused on and improved if it shows that there is a correlation to box office/post-show sales

Data Collection

  • Data was collected from YouTube, using its proprietary API, and from thenumbers.com
    • YouTube – trailer performance and comments
    • thenumbers.com – Movie Box Office data
  • 32.4GB (when comments are expanded into 1 line per comment)
  • 1,713 movies
  • 5,244 trailers
  • 2,979,511 comments

YouTube Data

Variable Selection

  • The ROI variable had to be created.

Variables selected

Hypothesis and Rationale

  • There is a positive correlation between YouTube movie trailer performance indicators and Box Office performance/Video Sales.
    • Rationale: “Likes” = Sales
  • There is a positive correlation between movie trailer comment sentiments and Box Office/Video Sales performance.
    • Rationale: If trailers are viewed in a positive manner, then people will be more likely to watch the movie.

Conceptual Model

  • Data was extracted and then transformed using Python. The output files were CSV and TXT files.
  • Three sentiment models were implemented in the project: two polarity-based models, using Bing Liu’s and the Harvard IV-4 dictionaries, and the NLTK sentiment model.
    • To process part of the sentiment analysis, Apache Spark was used.
  • The sentiment scores were also used to help predict the ROI of each movie using a neural network model.

Project Conceptual Model

Results and Discussion

Variable Correlation Test

The graph, which was generated in R, shows the correlations between the independent variables and the dependent variables.

There are three main conclusions based on the graph:

  1. The graph demonstrates a positive correlation among Count Views, Count Comments, and Likes/Dislikes.
  2. The graph was also used to test the hypothesis that movie trailer comment counts and movie trailer likes are positively correlated with movie Box Office performance.
  3. Unfortunately, the three sentiment models have little correlation with the Box Office data (e.g. ROI), which means that the initial hypothesis wasn’t supported. The two feature-based sentiment models even have negative correlations with Count Views, Count Comments, and Likes/Dislikes.

Time Series Analysis

  • It was interesting to see that for 2008, even with the financial crisis, overall ROI turned out to be good.
  • Another interesting finding is that ROI continuously decreased after 2008.

Sentiment Analysis

Two types of models were implemented for sentiment analysis:

  • a polarity-based model using Bing Liu’s and a Harvard dictionary, which nets the counts of positive and negative words found in each comment, and
  • the NLTK Sentiment Analyzer using the Vader dictionary, which is a rule-based approach (a minimal sketch of both approaches follows these notes)
  • Scores were scaled and centered at zero so that positive scores are > 0 and negative scores are < 0; the scale is [-1,1].

  • Comparing the performance of the three models, the polarity-based models gravitated towards negative sentiment, which could be explained by the internal structure of the dictionaries used: if a dictionary contains more negative than positive words, a higher negative-word count becomes more likely.
  • The NLTK Sentiment Analyzer, on the other hand, showed more positive sentiment across the comments.
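
A minimal sketch of both approaches in Python (this is not the project pipeline; it assumes NLTK’s bundled copy of Bing Liu’s opinion lexicon and the VADER lexicon have been downloaded):

# Requires: nltk.download('opinion_lexicon') and nltk.download('vader_lexicon')
from nltk.corpus import opinion_lexicon
from nltk.sentiment.vader import SentimentIntensityAnalyzer

positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

def polarity_score(comment):
    """Net positive/negative word count, scaled to roughly [-1, 1]."""
    tokens = comment.lower().split()
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    total = pos + neg
    return 0.0 if total == 0 else float(pos - neg) / total

vader = SentimentIntensityAnalyzer()

comment = "This trailer looks amazing, I cannot wait for the movie"
print(polarity_score(comment))                      # dictionary-based score
print(vader.polarity_scores(comment)['compound'])   # VADER compound score in [-1, 1]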

Sentiment Analysis – Movie Studios

  • Based on the Harvard sentiment dictionary, Paramount Vantage has the lowest average sentiment score whereas Weinstein Company has the highest.
  • The Vader sentiment dictionary determined that Apparition has the highest average sentiment score while Focus/Gramercy has the lowest average sentiment score.
  • Bing Liu sentiment dictionary predicted that Freestyle Releasing and Apparition have the lowest and highest average sentiment score, respectively.

Sentiment Analysis – Genre

  • When evaluating the Bing Liu and Harvard dictionaries, Romantic Comedies and Documentaries have the highest and lowest average sentiment score respectively.
  • Interestingly, for the NLTK Analyzer, the Concerts and Performances genre has the lowest average sentiment score, while Romantic Comedy has the highest score.

Clustering (to follow)

Predicting Box Office ROI Performance using Neural Net

  • ROI performance was classified using four bins:
    • Poor (less than the 25% quantile)
    • Passing (between 25% and 50% quantile)
    • Ok (between the 50% and 75% quantile)
    • Great (above the  75% quantile)
  • Neural Net implemented using R (a hedged Python sketch follows the results below)
  • ROI Performance ~ countsComments + countsViews + Ratio_of_Likes_and_Dislikes + ProdBudget + genre + MPAArating + MovieStudio + BingLiuSentiment + HarvardSentiment + VadeSentiment
  • Model Accuracy = 82.55%

    Neural Net Model Results
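
For reference, a hedged Python sketch of the same kind of model (the project’s neural net was built in R; the file name, the label column, and the network size below are assumptions, while the feature names follow the formula above):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('movie_trailer_features.csv')   # hypothetical file name

numeric = ['countsComments', 'countsViews', 'Ratio_of_Likes_and_Dislikes', 'ProdBudget',
           'BingLiuSentiment', 'HarvardSentiment', 'VadeSentiment']
categorical = ['genre', 'MPAArating', 'MovieStudio']

X = pd.get_dummies(df[numeric + categorical], columns=categorical)
y = df['ROI_Performance']                        # 'Poor' / 'Passing' / 'Ok' / 'Great' bins

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))               # classification accuracy on the held-out set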

Conclusion

  • Due to the success of the neural network model, companies now have the ability to accurately predict the ROI of their movies, specifically with the use of the number of YouTube comments, the ratio of likes and dislikes, and their sentiment scores from the three models.
  • With the hypotheses predicted for the research, there is a higher probability of Box Office success, which would in turn generate a higher ROI for movie studios and production companies.
  • The sentiment results differ among the three dictionaries, which implies that some of the dictionaries used in the models treat more neutral words as negative or positive.
    • Better alternatives for predicting the sentiment of YouTube movie comments are domain-specific dictionaries and machine learning classifiers trained on a labeled comment-sentiment data set.

Scope and Limitations

  • There are many popular websites and applications that can be used to comment on trailers or movies, such as Rotten Tomatoes, Facebook, Twitter, and so on. However, in this case, YouTube is the only trailer source used.

  • Trailers are not the only factor that impacts box office and video sales. Other factors, such as advertisements, the actors, and competition from other movies released at the same time, can have an effect on a movie’s box office sales. However, these factors are not included in this study. Further studies could be conducted with those variables included.


Predicting the Winner of March Madness 2017 using R, Python, and Machine Learning

This project was done using R and Python, and the results were used as a submission to Deloitte’s March Madness Data Crunch Competition. Team members: Luo Yi, Yufei Long, and Yuyang Yue. Check the GitHub for the code.

Of the 64 teams that competed, we predicted Gonzaga University to win. Unfortunately, they lost to University of North Carolina.

Methodology

  1. Data transformation
  2. Data exploration
    • Feature Correlation testing
    • Principal Component Analysis
  3. Feature Selection
  4. Model Testing
    • Decision Tree
    • Logistic Regression
    • Random Forest
  5. Results and other analysis

Data Transformation

The data that was used to train the initial model came from a data set that contained 2002-2016 team performance data (statistics, efficiency ratings, etc.) from different sources. Each row was a game that consisted of two teams and their respective performance data. For the initial training of the models, we were instructed to use 2002-2013 data as the training set and 2014-2016 data as the testing set. After examining the data, we debated the best way to use it and finally decided on creating new relative variables that reflect the difference/ratio of team 1’s and team 2’s performance. Feature correlation testing was also done during this phase, and the results supported the need for relative variables.

Features Correlation Heatmap (Original Features)
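
A hedged sketch of the relative-variable idea (the file and column names here are illustrative stand-ins for the competition data, not its actual headers):

import pandas as pd

games = pd.read_csv('games_2002_2016.csv')              # hypothetical file name

stats = ['off_efficiency', 'def_efficiency', 'seed']    # illustrative stat columns
for stat in stats:
    games[stat + '_diff'] = games['team1_' + stat] - games['team2_' + stat]
    games[stat + '_ratio'] = games['team1_' + stat] / games['team2_' + stat]

train = games[games['season'] <= 2013]                  # 2002-2013 as the training set
test = games[games['season'] >= 2014]                   # 2014-2016 as the testing set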

Data Exploration

After transformation, feature correlation testing was repeated. This time, results were much more favorable. The heat map below shows that the correlation between the new variables is acceptable.

Features Correlation Heatmap (New Features)

Principal Component Analysis was also performed on the new features. We hoped to show which features were the most influential, even before running any machine learning models. Imputation was done to deal with missing values. The thicker lines in the chart below signify a more influential link to the 8 new discriminant features. This, however, was used to understand the features better and wasn’t used as an input to all of the models.

Feature Selection by PCA
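
A hedged sketch of the imputation and PCA step with scikit-learn (the chart above came from a different tool; the file name is assumed and the features are assumed to be numeric):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = pd.read_csv('relative_features.csv')          # hypothetical file of relative features

X = features.fillna(features.mean())                     # simple mean imputation for missing values
X = StandardScaler().fit_transform(X)                    # PCA is sensitive to feature scale

pca = PCA(n_components=8)                                # 8 components, as in the chart above
pca.fit(X)
print(pca.explained_variance_ratio_)                     # variance explained per component
print(pd.DataFrame(pca.components_, columns=features.columns))   # feature loadings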

Feature Selection

For this project, we opted to remove anything (aside from seed and distance from game location) that wasn’t a performance metric. Some of the variables that were discarded were ratings data since we believed that they were too subjective to be reliable indicators.

Model Testing

We used three models for this project: Decision Tree, Logistic Regression, and Random Forest.

Decision Tree – Results were less than favorable for this model. Overfitting occurred and we had to drop it.

Random Forest (R) – We decided to use the Random Forest model for two reasons: to get around the overfitting we ran into with a single decision tree, and for its democratic (majority-vote) nature.

Predictor Importance

  • OOB Estimate of error rate: 26.9%
  • Error reduction plateaus at approx. 2,600 trees
  • Model Log-loss: 0.5556
  • Chart Legend:
    • Black:  Out-of-bag estimate of error rate
    • Green and Red: Class errors

Forest Error Performance

Logistic Regression (Python) – Based on the PCA and the Random Forest variable importance, 5 features were selected for this model.

Features Selected for Logistic Regression
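
A hedged sketch of this step (the five feature names and the label column are placeholders; the actual selected features are in the table above):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

games = pd.read_csv('relative_features.csv')             # hypothetical file of relative features
selected = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']   # placeholders

train = games[games['season'] <= 2013]
test = games[games['season'] >= 2014]

clf = LogisticRegression()
clf.fit(train[selected], train['team1_won'])              # 'team1_won' is an assumed 0/1 label

probs = clf.predict_proba(test[selected])[:, 1]
print(clf.score(test[selected], test['team1_won']))       # accuracy
print(log_loss(test['team1_won'], probs))                 # log loss, the competition's other metric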

Results and Other Analysis

Summary of Results

Running them against the testing set, we were able to get a higher accuracy with the Random Forest model. Log loss, which was also one of the key performance indicators for the competition, was relatively the same for the two models. That being said, Random Forest was chosen to run on the new 2017 March Madness data.

As previously mentioned, we had predicted Gonzaga University to win the tournament. We came really close though. It made a lot of sense because, compared to the other teams, Gonzaga was a frequent contender in March Madness.

One of the more interesting teams this season was the cinderella team, South Carolina. They had gone against expectations, and this is why we decided to analyze their journey even further.

In the 1st round, we were able to correctly predict that South Carolina was going to win. However, because we were using historical data, it was obvious that we were going to predict them to lose in the next stages, especially since they were going against stronger teams. Despite “water under the bridge” data, they were able to reach the Final 4.

Cinderella Team Win Rate by Stage

One of the questions that we wanted to attempt to answer was why they kept on winning. What was so different this year that they were able to surprise everyone?

One reason that we speculated about was the high performance of one of South Carolina’s players, Sindarius Thornwell. In past years, he was averaging 11-13 pts per game. This year, he was dropping 21.4 pts per game. Moreover, in his last 5 appearances, he was able to increase this stat to 23.6 pts per game. Looking at the score difference of South Carolina’s games in March Madness, it is evident that he was very influential in the team’s success. One could even say that without his 23.6 pts per game, the turnout of their campaign would’ve been different. But hey, that’s just speculation.

Score Difference for Cinderella Team Matches

 

Sindarius Thornwell March Madness Stats

 

Solving the Greenhouse Gas Problem through Sustainable Meat Consumption (Watson Analytics)

 

This is my team’s official entry to the 2017 Watson Analytics Global Competition. Beyond our hope to win the competition is the hope that our recommendations will be put to use by policy makers in different countries. We believe that this is something that can make a difference. Team members are Ruoxuan Gong and Liyi Li.

Abstract

People are rarely aware of meat consumption’s contribution to greenhouse gas emissions. The purpose of this study is to utilize IBM Watson Analytics to identify relationships among meat consumption, greenhouse gas emissions, and potential thermal depolymerization by-products from meat production funnels. Thorough data collection, data preprocessing, and data analysis, using both descriptive and predictive analytics, were conducted. As a result, three solutions are proposed in this project: policies to optimize meat consumption, transformation of solid waste into sustainable by-products, and social media methods to increase people’s awareness.

The dashboard and the research-based, data-driven insights can be used by environmental policy makers, business owners, and the public to make meat consumption more sustainable in the long run. Network effects can be expected from the improvement of public awareness.

Methodology

  1. Data collection from OECD, FAO, and other sources.
  2. Data processing to relate meat production & consumption data with greenhouse gas emission data
  3. Variable Selection
  4. Data Analysis
    • Chart creation
    • Dashboarding
    • Simulation of thermal depolymerization by-product conversion
    • Retrospective Analysis
    • Social Media Awareness Analysis
  5. Conclusion and Recommendations

Data set transformation of FDNY’s Fire Incident Dispatch data using Pandas (Python)

This quick data set transformation was done for one of my current analytics projects. We decided to study and run models on FDNY’s public Fire Incident Dispatch data from NYC Open Data. It’s an ongoing project, and the results will be posted at the end of the semester.

The goal of this transformation is to produce a data set that has the following filters and conditions:

  • Remove all rows with empty values
  • Filter out the year 2017
  • Split date strings to convert them into Python-readable date formats
  • Filter out “Medical Emergencies” and “Medical MFAs” from Incident_Classification_Group
  • Remove insignificant columns
  • Create drill down variables for some of the columns (Incident_Datetime, Incident_Borough, etc.)
  • Replace Police Precinct category with a count of the number of police precincts in each zip code

The end result decreased the row count from 1.8M to 800K+.

The transformation was done in Python. The code can be seen below.

Original data set: https://data.cityofnewyork.us/Public-Safety/Fire-Incident-Dispatch-Data/8m42-w767/data

Initial Transformation:

import csv
import pandas as pd
import numpy as np
from datetime import date, time, datetime

rowlist = []
totallist = []
countrows = 0
#PART 1: remove all rows with empty values
# data = pd.read_csv('Fire_Incident_Dispatch_Data.csv', sep = ',', dtype = {'INCIDENT_RESPONSE_SECONDS_QY':np.int64, 'INCIDENT_TRAVEL_TM_SECONDS_QY': np.int64, 'ENGINES_ASSIGNED_QUANTITY': np.int64, 'LADDERS_ASSIGNED_QUANTITY':np.int64,'OTHER_UNITS_ASSIGNED_QUANTITY': np.int64})
# # data.dropna().to_csv('fireincidentdatasetnoblank.csv')
# data = data.dropna()


#PART 2: data set transformation
#get month, year, day, daydate, and time
		#1/01/2013 12:00:37 AM - format
def convert_datetime(csv_datetime):

	DATETIME_SPLIT = csv_datetime.split(' ')
	DATE_SPLIT = DATETIME_SPLIT[0].split('/')
	MONTH = int(DATE_SPLIT[0])
	DAYDATE = int(DATE_SPLIT[1])
	YEAR = int(DATE_SPLIT[2])
	dateformat = date(YEAR,MONTH,DAYDATE)
	WEEKDAY_INDEX = dateformat.weekday()

	#format INCIDENT date
	if WEEKDAY_INDEX == 0:
		WEEKDAY = 'MONDAY'
	elif WEEKDAY_INDEX == 1:
		WEEKDAY = 'TUESDAY'
	elif WEEKDAY_INDEX == 2:
		WEEKDAY = 'WEDNESDAY'
	elif WEEKDAY_INDEX == 3:
		WEEKDAY = 'THURSDAY'
	elif WEEKDAY_INDEX == 4:
		WEEKDAY = 'FRIDAY'
	elif WEEKDAY_INDEX == 5:
		WEEKDAY = 'SATURDAY'
	else:
		WEEKDAY = 'SUNDAY'

	#format time
	TIME_SPLIT = DATETIME_SPLIT[1].split(':')
	HOUR = int(TIME_SPLIT[0])
	MINUTE = int(TIME_SPLIT[1])
	SECOND = int(TIME_SPLIT[2])
	

	if DATETIME_SPLIT[2] == 'AM':
		if HOUR in [12,1,2,3,4,5]:
			QTR_DAY = 'Early Morning'
			if HOUR == 12:
				HOUR_PYTHON = 0
			else:
				HOUR_PYTHON = HOUR
		else:
			QTR_DAY = 'Morning'
			HOUR_PYTHON = HOUR
	else:
		if HOUR in [12,1,2,3,4,5]:
			QTR_DAY = 'Afternoon'
			if HOUR == 12:
				HOUR_PYTHON = HOUR
			else:
				HOUR_PYTHON = HOUR + 12
		else:
			QTR_DAY = 'Evening'
			if HOUR == 12:
				HOUR_PYTHON = HOUR
			else:
				HOUR_PYTHON = HOUR + 12
	timeformat = time(HOUR_PYTHON, MINUTE, SECOND)
	datetimeformat = datetime(YEAR,MONTH,DAYDATE,HOUR_PYTHON,MINUTE,SECOND)
	datetimelist = [YEAR,MONTH,DAYDATE,WEEKDAY,QTR_DAY,str(timeformat),str(datetimeformat)]
	return datetimelist

#process and transform data
with open('fireincidentdatasetnoblank.csv', 'rb') as csv_in:
	myreader = csv.reader(csv_in, delimiter = ',')
	next(myreader) #skips column headers

	for row in myreader:

		INCIDENT_DATETIME = row[2]
		INCIDENT_BOROUGH = row[6]
		ZIPCODE = int(float(row[7]))
		POLICEPRECINCT = int(float(row[8]))
		ALARM_SOURCE_DESCRIPTION_TX = row[13]
		HIGHEST_ALARM_LEVEL = row[15]
		INCIDENT_CLASSIFICATION = row[16]
		INCIDENT_CLASSIFICATION_GROUP = row[17]
		DISPATCH_RESPONSE_SECONDS_QY = row[18]
		FIRST_ASSIGNMENT_DATETIME = row[19]
		FIRST_ACTIVATION_DATETIME = row[20]
		FIRST_ON_SCENE_DATETIME = row[21]
		INCIDENT_CLOSE_DATETIME = row[22]
		VALID_INCIDENT_RSPNS_TIME_INDC = row[24]
		INCIDENT_RESPONSE_SECONDS_QY = row[25]
		INCIDENT_TRAVEL_TM_SECONDS_QY = row[26]
		ENGINES_ASSIGNED_QUANTITY = row[27]
		LADDERS_ASSIGNED_QUANTITY = row[28]
		OTHER_UNITS_ASSIGNED_QUANTITY = row[29]

		#compute for total_resource_qty
		TOTAL_RESOURCE_QTY = int(str(ENGINES_ASSIGNED_QUANTITY)) + int(str(LADDERS_ASSIGNED_QUANTITY)) + int(str(OTHER_UNITS_ASSIGNED_QUANTITY))

		INCIDENT_DATETIME_LIST = convert_datetime(INCIDENT_DATETIME)
		FIRST_ASSIGNMENT_DATETIME_LIST = convert_datetime(FIRST_ASSIGNMENT_DATETIME)
		FIRST_ACTIVATION_DATETIME_LIST = convert_datetime(FIRST_ACTIVATION_DATETIME)
		FIRST_ON_SCENE_DATETIME_LIST = convert_datetime(FIRST_ON_SCENE_DATETIME)
		INCIDENT_CLOSE_DATETIME_LIST = convert_datetime(INCIDENT_CLOSE_DATETIME)
		#datetimelist = [YEAR,MONTH,DAYDATE,WEEKDAY,QTR_DAY,timeformat,datetimeformat]
		#Compute for Incident_Resolution_Sec
		INCIDENT_RESOLUTION_SEC = datetime.strptime(INCIDENT_CLOSE_DATETIME_LIST[6],'%Y-%m-%d %H:%M:%S') - datetime.strptime(INCIDENT_DATETIME_LIST[6],'%Y-%m-%d %H:%M:%S')

		if VALID_INCIDENT_RSPNS_TIME_INDC == "Y":
			if INCIDENT_DATETIME_LIST[0] == 2017: #YEAR is stored as an int, so compare against an int
				pass
			else:
				#INDEPENDENT VARIABLES
				#datetime drilldown
				rowlist.append(INCIDENT_DATETIME_LIST[6])
				# rowlist.append(INCIDENT_DATETIME_LIST[0])
				# rowlist.append(INCIDENT_DATETIME_LIST[1])
				# rowlist.append(INCIDENT_DATETIME_LIST[2])
				# rowlist.append(INCIDENT_DATETIME_LIST[3])
				# rowlist.append(INCIDENT_DATETIME_LIST[4])
				# rowlist.append(INCIDENT_DATETIME_LIST[5])
				# #location drilldown
				# rowlist.append(INCIDENT_BOROUGH)
				rowlist.append(ZIPCODE)
				#policeprecinct
				rowlist.append(POLICEPRECINCT)
				#alarmsource
				rowlist.append(ALARM_SOURCE_DESCRIPTION_TX)
				#alarmlevel
				rowlist.append(HIGHEST_ALARM_LEVEL)
				#incidentclassification drilldown
				# rowlist.append(INCIDENT_CLASSIFICATION_GROUP)
				rowlist.append(INCIDENT_CLASSIFICATION)
				#rowlist.append(FIRST_ASSIGNMENT_DATETIME)
				# rowlist.append(FIRST_ASSIGNMENT_DATETIME_LIST[0])
				# rowlist.append(FIRST_ASSIGNMENT_DATETIME_LIST[1])
				# rowlist.append(FIRST_ASSIGNMENT_DATETIME_LIST[2])
				# rowlist.append(FIRST_ASSIGNMENT_DATETIME_LIST[3])
				# rowlist.append(FIRST_ASSIGNMENT_DATETIME_LIST[4])
				# rowlist.append(FIRST_ASSIGNMENT_DATETIME_LIST[5])
				#rowlist.append(FIRST_ACTIVATION_DATETIME)
				# rowlist.append(FIRST_ACTIVATION_DATETIME_LIST[0])
				# rowlist.append(FIRST_ACTIVATION_DATETIME_LIST[1])
				# rowlist.append(FIRST_ACTIVATION_DATETIME_LIST[2])
				# rowlist.append(FIRST_ACTIVATION_DATETIME_LIST[3])
				# rowlist.append(FIRST_ACTIVATION_DATETIME_LIST[4])
				# rowlist.append(FIRST_ACTIVATION_DATETIME_LIST[5])
				# #rowlist.append(FIRST_ON_SCENE_DATETIME)
				# rowlist.append(FIRST_ON_SCENE_DATETIME_LIST[0])
				# rowlist.append(FIRST_ON_SCENE_DATETIME_LIST[1])
				# rowlist.append(FIRST_ON_SCENE_DATETIME_LIST[2])
				# rowlist.append(FIRST_ON_SCENE_DATETIME_LIST[3])
				# rowlist.append(FIRST_ON_SCENE_DATETIME_LIST[4])
				# rowlist.append(FIRST_ON_SCENE_DATETIME_LIST[5])
				#rowlist.append(INCIDENT_CLOSE_DATETIME)
				#rowlist.append(VALID_INCIDENT_RSPNS_TIME_INDC) #qualifier (should be 'Y')
				#totalresource drilldown
				rowlist.append(TOTAL_RESOURCE_QTY)
				# rowlist.append(ENGINES_ASSIGNED_QUANTITY)
				# rowlist.append(LADDERS_ASSIGNED_QUANTITY)
				# rowlist.append(OTHER_UNITS_ASSIGNED_QUANTITY)

				#DEPENDENT VARIABLES
				rowlist.append(DISPATCH_RESPONSE_SECONDS_QY)		
				# rowlist.append(INCIDENT_RESPONSE_SECONDS_QY)
				rowlist.append(INCIDENT_TRAVEL_TM_SECONDS_QY)
				rowlist.append(int(INCIDENT_RESOLUTION_SEC.total_seconds()))	
				countrows += 1
				print countrows
				# if countrows == 500000:
				# 	print countrows
				# elif countrows == 1000000:
				# 	print countrows
				# elif countrows == 1400000:
				# 	print countrows
				# else:
				# 	pass

				# with open('cleanfireincidentdataset.csv','a') as csvfile:
				# 	for i in range(0,38):
				# 		csvfile.write(str(rowlist[i]))
				# 		if i == 38:
				# 			csvfile.write('\n')
				# 		else:
				# 			csvfile.write(',')
				totallist.append(rowlist)
				rowlist = []
	# print totallist
#labels = ['Incident Date and Time','INCIDENT_YEAR','INCIDENT_MONTH','INCIDENT_DAY','INCIDENT_WEEKDAY','INCIDENT_QTR_OF_DAY','INCIDENT_TIME','INCIDENT_BOROUGH','Zip Code','Count of Police Precincts','Alarm Source Description','Highest Alarm Level','INCIDENT_CLASSIFICATION_GROUP','Incident Classification','FIRST_ASSIGNMENT_YEAR','FIRST_ASSIGNMENT_MONTH','FIRST_ASSIGNMENT_DAY','FIRST_ASSIGNMENT_WEEKDAY','FIRST_ASSIGNMENT_QTR_OF_DAY','FIRST_ASSIGNMENT_TIME','FIRST_ACTIVATION_YEAR','FIRST_ACTIVATION_MONTH','FIRST_ACTIVATION_DAY','FIRST_ACTIVATION_WEEKDAY','FIRST_ACTIVATION_QTR_OF_DAY','FIRST_ACTIVATION_TIME','FIRST_ON_SCENE_YEAR','FIRST_ON_SCENE_MONTH','FIRST_ON_SCENE_DAY','FIRST_ON_SCENE_WEEKDAY','FIRST_ON_SCENE_QTR_OF_DAY','FIRST_ON_SCENE_TIME','Total Quantity of Resources Dispatched', 'ENGINES_ASSIGNED_QUANTITY','LADDERS_ASSIGNED_QUANTITY', 'OTHER_UNITS_ASSIGNED_QUANTITY','Dispatch Response Time (in seconds)','INCIDENT_RESPONSE_SECONDS_QY','Incident Travel Time (in seconds)','Total Resolution Time (in seconds)']
labels = ['Incident Date and Time','Zip Code','Count of Police Precincts','Alarm Source Description','Highest Alarm Level','Incident Classification','Total Quantity of Resources Dispatched', 'Dispatch Response Time (in seconds)','Incident Travel Time (in seconds)','Total Resolution Time (in seconds)']
df = pd.DataFrame.from_records(totallist,columns = labels)	
df.to_csv('cleanfireincident_final.csv')

*Notes: It’s definitely easier and more elegant to create a working CSV file using Pandas’ DataFrame and .to_csv() functions, since the structure is automatically created as a whole.
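
Following up on that note, a compact pandas-only version of most of the filters listed above might look like this (column names follow the headers used in the script; the exact classification labels are taken from the goal list and may need adjusting):

import pandas as pd

data = pd.read_csv('Fire_Incident_Dispatch_Data.csv', parse_dates=['INCIDENT_DATETIME'])

data = data.dropna()                                               # remove rows with empty values
data = data[data['VALID_INCIDENT_RSPNS_TIME_INDC'] == 'Y']         # keep valid response times only
data = data[data['INCIDENT_DATETIME'].dt.year != 2017]             # filter out the year 2017
data = data[~data['INCIDENT_CLASSIFICATION_GROUP'].isin(
    ['Medical Emergencies', 'Medical MFAs'])]                      # drop medical incidents

data.to_csv('cleanfireincident_pandas.csv', index=False)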

Creation of unique zipcode-police precinct and count list:

import pandas as pd
import numpy as np

#create data frame from the clean data set and set integer variables
data = pd.read_csv('cleanfireincident_final.csv', sep = ',', dtype = {'Total Quantity of Resources Dispatched':np.int64, 'Dispatch Response Time (in seconds)': np.int64, 'Incident Travel Time (in seconds)': np.int64, 'Total Resolution Time (in seconds)':np.int64})

#create new dataframe only consisting of zip code and police precinct categories (this is not yet the count)
zcu = data[['Zip Code','Count of Police Precincts']].copy()

#create a group key that will be counted later
countzip = zcu.groupby(['Zip Code','Count of Police Precincts'])
#count the rows in each group; size() works even though both columns are group keys
counting = countzip.size()
#insert zcu and count into another dataframe and rename columns
abc = pd.DataFrame(counting.reset_index())
abc.columns = ['Zip Code','Count of Police Precincts','count']
#export to csv
abc.to_csv('uniquezipcodesandcountsofPP.csv')

 

Matching of Zip Code and insertion of real count of police precincts per zip code:

import csv
import pandas as pd
import numpy as np
from datetime import date, time, datetime

#initialize lists
listrow = []
totallist = []
rowlist = []
newcsvlist = []
#same process of opening and reading each column by row
with open('uniquezipcodesandcountsofPP.csv', 'rb') as csv_in:
    myreader2 = csv.reader(csv_in, delimiter = ',')
    next(myreader2) #skips column headers    

    for row in myreader2:
        #row[0] is the unnamed index column written by to_csv, so the data starts at row[1]
        zipcode2 = row[1]
        pp = row[2]
        count = row[3]
        listrow.append(zipcode2)
        listrow.append(pp)
        listrow.append(count)
       #create a list of lists of zip code-precinct-count
        totallist.append(listrow)
        listrow = []


#same process of opening and reading each column by row
with open('cleanfireincident_final.csv', 'rb') as csv_in:
    myreader = csv.reader(csv_in, delimiter = ',')
    next(myreader) #skips column headers

    for row in myreader:
        dateandtime = row[1]
        zipcode= row[2]
        pp = row[3]
        alarmsource = row[4]
        alarmlevel = row[5]
        classif = row[6]
        qty = row[7]
        dispatchtime = row[8]
        traveltime = row[9]
        resoltime = row[10]
        #print classif

        #matching of zipcode and police precincts
        for ziprow in totallist:
            if zipcode == ziprow[0] and pp == ziprow[1]:
                rowlist.append(dateandtime)
                rowlist.append(zipcode)
               #insert count if match
                rowlist.append(ziprow[2])
                rowlist.append(alarmsource)
                rowlist.append(alarmlevel)
                rowlist.append(classif)
                rowlist.append(qty)
                rowlist.append(dispatchtime)
                rowlist.append(traveltime)
                rowlist.append(resoltime)
                newcsvlist.append(rowlist)
                rowlist = []
            else:
                pass
            
labels = ['Incident Date and Time','Zip Code','Count of Police Precincts','Alarm Source Description','Highest Alarm Level','Incident Classification','Total Quantity of Resources Dispatched', 'Dispatch Response Time (in seconds)','Incident Travel Time (in seconds)','Total Resolution Time (in seconds)']
df = pd.DataFrame.from_records(newcsvlist,columns = labels)	
df.to_csv('new.csv')
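
For comparison, a merge-based sketch of the same matching step (same files and column names as above), which avoids the nested loop:

import pandas as pd

clean = pd.read_csv('cleanfireincident_final.csv', index_col=0)
counts = pd.read_csv('uniquezipcodesandcountsofPP.csv', index_col=0)

merged = clean.merge(counts, on=['Zip Code', 'Count of Police Precincts'], how='inner')
#replace the precinct number with the matched count, then drop the helper column
merged['Count of Police Precincts'] = merged['count']
merged = merged.drop('count', axis=1)
merged.to_csv('new.csv')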

Job Listing Mining to get the Industry Standard Skills Requirements of a Job Position using NLP and Python

This was submitted as a project for my web analytics class in my MS Business Analytics program. The original title is “Bridging the Gap: Improving the link between job applicant competitiveness and the MOOC business model”. My team included classmates Liyi Li, Long Wan, and Yiting Cai. This project was done for educational purposes only. Click the photos to enlarge.

Abstract & Key Learnings

  • Text Analytics was used to analyze thousands of job descriptions from various employment websites to determine the top requirements of a particular job position.
  • The Data Scientist position has the most technical inclination, while the Business Analyst position, although it still dabbles in technical aspects, plays a bigger role in terms of business fulfilment.
  • In terms of specific skills and software, the top three skills needed for Business Analyst positions are the following: Communication Skills, Project Management, and Verbal and Written Skills, while the top three software required are Microsoft Office, Data Warehousing software, and Big Data software.
  • For the Data Analyst position, the top skills required are Verbal and Written Skills, Communication Skills, and Data Analysis, while SQL, Big Data software, and Microsoft Office are the top software required.
  • Lastly, for the Data Scientist position, the top three skills required are Data Analytics, Communication Skills, and Data Visualization. In terms of software, the top three required are Machine Learning software, Big Data software, and Visualization Tools.

Project Rationale

  • Year after year, New York is swarmed by thousands of unemployed and newly graduated hopefuls with the main goal of securing a job.
  • With around 44% of the city’s current working-age population unemployed, competition becomes overwhelming, making it harder and harder to differentiate oneself from other applicants.
  • MOOCs have become a legitimate source for learning skills and knowledge that can potentially increase the marketability and competitiveness of a job applicant. The only problem is that, with the sheer number of available online courses on different MOOC sites, it becomes harder to distinguish which course is appropriate and applicable to fulfilling a specific skill-set required in many of the currently open job positions.

Problem Statement

The purpose of this study is to be able to determine, at any given time, the top requirements of a particular job position. This can be done by using text content analysis on job descriptions from top employment websites, under a specific search term.

Information from this study will help two types of entities:

  1. Job applicants, by giving them accurate ideas of what companies are looking for when hiring for a position, and
  2. Massive Open Online Course (MOOC) providers, by offering them the ability to discover which skills to prioritize for course creation.

This could revolutionize how job applicants make themselves marketable to prospective companies through means other than traditional schooling and work experience, by providing the same information to MOOC providers and users. Also, since paid courses and verified certificates are the main source of revenue for the MOOC business model, this study can provide a research-based methodology for increasing the value of verified certificates and improving learning environments, in the hopes that they will meet the ever-changing requirements of different types of learners and, more importantly, be recognized by employers.

Scope

Data Specifications

Creating the data sets entails scraping the aforementioned sites’ content, cleaning the extracted data, and compiling it into separate data sets based on search query.

Table 1 Raw Data Set Variables


Here are the raw compiled datasets for the three query terms: Download

Sample Rows


Analytical Techniques

  1. Word Frequency: Word clouds were used to illustrate the frequency of words from the job descriptions. Limitations: Although this is perfect for keywords that have an immediately relevant meaning, such as software (e.g. Python, SQL), it proved inadequate in pinpointing the relevance of more ambiguous keywords, such as “business” and “experience”. These words can, however, give a general idea of the important themes that hover over a certain job position. Also, beyond the top ten most frequently mentioned words, the effect of word clouds becomes negligible.
  2. N-Gram Analysis: This allowed relevant keywords and their respective contexts to surface. To support this analysis, tree maps were chosen to visualize phrase frequency and make the study more robust. Four maps were generated for each job position to show the results of bi-gram and tri-gram counts for both software and skills (a small counting sketch follows this list).
  3. Network Analysis: Chosen to present the different connections and interactions among the software skills. It can supplement N-gram analysis and support the later categorical-level analysis as well. With the results of this analysis, MOOC providers and job applicants will not only learn which software is the most useful, but will also have a progressive and systematic understanding of software and how they relate with each other.
  4. Categorical Analysis: This resulted in a bird’s eye view of the different types of skills and software, and how they funnelled into particular areas of expertise. This is important to have because information coming from this analysis will allow candidates to position themselves, depending on what job and specialty they want to apply for. With regards to MOOC providers, having access to this kind of information will allow them to create more robust curriculums that would focus on specific areas of expertise.
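
A small sketch of the bi-gram/tri-gram counting behind the tree maps (the real input is the scraped job description data set; the text below is a stand-in, and NLTK’s ‘punkt’ tokenizer data is assumed to be downloaded):

from collections import Counter
from nltk import ngrams
from nltk.tokenize import word_tokenize    # requires nltk.download('punkt')

description = ("Experience with Python and SQL required. Strong communication skills "
               "and project management experience preferred.")

tokens = [t.lower() for t in word_tokenize(description) if t.isalpha()]

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(5))        # e.g. ('communication', 'skills'), ...
print(trigram_counts.most_common(5))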

Results and Discussion

I. Word Frequency

From left to right, top to bottom: Business Analyst, Data Analyst, Data Scientist

Word Frequency - Business Analyst Word Frequency - Data Analyst

Word Frequency - Data Scientist

II. N-Gram Analysis (Bi-Grams and Tri-Grams)

Business Analyst – Skills

Business Analyst – Software

Data Analyst – Skills


Data Analyst – Software


Data Scientist – Skills


Data Scientist – Software


III. Software Association Analysis

From left to right, top to bottom: Business Analyst, Data Analyst, Data Scientist


IV. Categorical Analysis

From left to right, top to bottom: Business Analyst, Data Analyst, Data Scientist

categorical analysis - business analyst                          categorical analysis - data analyst

categorical analysis - data scientist

Most important software, skills, and education for all job positions


Limitations

  • Other popular employment websites, such as ZipRecruiter, CareerBuilder, and Monster, were tried but were not included in this study due to availability and information completeness; some websites are protected against automated parsing.
  • Because of the semi-structured nature of the web, one of the main limitations of the data set cleaning process is the fact that each job listing has a different format. Because of this, scraping the listings entailed extracting the whole job description page, as compared to the ideal scenario of extracting only the requirements. Admittedly, there were some minor cleaning issues that were missed by the Python scripts that the researchers created, such as joined words (e.g. “applicationsResponsible”, “dependenciesOptimizing”). Moreover, word stemming was not included in this study, because job requirements usually refer to particular terms; this, however, leaves other issues such as plural forms.
  • An unsupervised machine learning model (k-means) was tried, but the majority of the results of this study are based on human analysis. Since the study was unsupervised, there were no predefined skill and software lists. This made the counting process harder to accomplish through Python scripts, since there was no training data to learn from. Because of this, the top twenty words for each category and job position were counted and aggregated manually. Admittedly, this opens up the process to potential human error.
  • Another limitation was experienced during the word and n-gram frequency counting process. As a precursor to the process, general ground rules were deliberated upon to preserve the uniformity and consistency of the results. What wasn’t accounted for was that, despite the general ground rules, each researcher’s counting and combining of terms was still affected by their own personal judgement. Because of this, further adjustments had to be made to make the results as consistent as possible.

Recommendations

Given the various limitations of this study, it is recommended that further research be done.

  • Being able to incorporate machine learning into the research will improve result accuracy drastically.
  • Unsupervised machine learning studies about this topic should be pursued because its expected results, such as dictionaries of skills, software, and education, will be able to support future supervised research, thus paving the way for automation.

Sample Code for Web Scraping

#this code was created for the purposes of my web analytics class project
#francisco mendoza
#web scraping dice.com
from bs4 import BeautifulSoup
import requests
import urllib
import urllib2
def initializeURL():
	pagenumber = 1
	URL = 'https://www.dice.com/jobs/q-data_scientist-limit-30-l-New_York_City%2C_NY-radius-30-startPage-1-limit-30-jobs?searchid=3908011389118'
	data = urllib.urlopen(URL)
	soup = BeautifulSoup(data, "html.parser")
	return soup

pagecount = 0

# cheatsheets
# print 'soup.title: ', soup.title
# print 'soup.title.name: ', soup.title.name
# print 'soup.title.string: ', soup.title.string
# print 'soup.p: ', soup.p
# print 'soup.p.string: ', soup.p.string
# print 'soup.a: ', soup.a
# print 'soup.find_all("a") - just the links: '
# for link in soup.find_all(attrs={'class':'serp-result-div'}):
# 	for urlinfo in link.find_all('a'):
# 		if urlinfo.get('href').startswith('https://www.dice.com/jobs/detail'):
			
# 			listingcount+=1
# 			print "--- --- ---"
# 			print str(listingcount)
# 			print "Posted on: \t X"
# 			print "URL: \t", urlinfo.get('href')
			

#gets total pages
def countpage():
	posicounter = 0
	positiontotal = 0
	for positions in soup.find_all('div', {'class':'col-md-12'}):
		for posicount in positions.find_all('span'):
			posicounter += 1
			if posicounter == 6:
				positiontotal = int(posicount.string)

	# compute number of pages
	pagestotal = (positiontotal + 29) // 30 #round up so the last partial page is included
	return pagestotal

#this code was created for the purposes of my web analytics class project
#francisco mendoza
#web scraping dice.com

def getAll(pagestotal, soup):
#cycle through all pages
	listingcount = 0
	for pagecount in range(pagestotal):
		URL = 'https://www.dice.com/jobs/q-data_scientist-limit-30-l-New_York_City%2C_NY-radius-30-startPage-'+str(pagecount)+'-limit-30-jobs?searchid=3908011389118'
		data = urllib.urlopen(URL)
		soup = BeautifulSoup(data, "html.parser")
		for searchresults in soup.find_all(id='serp'):
			for listing in searchresults.find_all('div', {'class', 'complete-serp-result-div'}):
				print "-----------"
				listingcount +=1
				print str(listingcount)+"."
				for urlinfo in listing.find_all('a'):
					if urlinfo.get('href').startswith('https://www.dice.com/jobs/detail'):
						print "URL: \t", urlinfo.get('href')
				for shortdescription in listing.find_all('div', {'class':'shortdesc'}):
					for string in shortdescription.stripped_strings:
						print "Short Description: \t", repr(string)
				for smalldetails in listing.find_all('ul', {'class':'list-inline'}):
					for companyli in smalldetails.find_all('li', {'class':'employer'}):
						for companyprint in companyli.find_all('span', {'class':'hidden-xs'}):
							print "Company: \t", companyprint.text
					for locationli in smalldetails.find_all('li', {'class':'location'}):
						print "Location: \t", locationli.text
					for postedli in smalldetails.find_all('li', {'class':'posted'}):
						print "Posted: \t", postedli.text
	return

#this code was created for the purposes of my web analytics class project
#francisco mendoza
#web scraping dice.com

#get description

def getlistingdescription():
	listingcount = 0
	for pagecount in range(pagestotal):
		URL = 'https://www.dice.com/jobs/q-data_scientist-limit-30-l-New_York_City%2C_NY-radius-30-startPage-'+str(pagecount)+'-limit-30-jobs?searchid=3908011389118'
		data = urllib.urlopen(URL)
		soup = BeautifulSoup(data, "html.parser")
		for searchresults in soup.find_all(id='serp'):
			for listing in searchresults.find_all('div', {'class', 'complete-serp-result-div'}):
				print "-----------"
				listingcount +=1
				print str(listingcount)+"."
				for urlinfo in listing.find_all('a'):
					if urlinfo.get('href').startswith('https://www.dice.com/jobs/detail'):
						print "URL: \t", urlinfo.get('href')
						listingurl = urlinfo.get('href')
						access = requests.get(listingurl)
						access.raise_for_status()
						content = access.content
						listingsoup = BeautifulSoup(content, "html.parser")
						
						for title in listingsoup.find_all('h1', {'class':'jobTitle'}):
							print "Job Title:\t", title.text
						for jd in listingsoup.find_all('div', {'id':'jobdescSec'}):
							print "Job Description:"
							print jd.text
						for metadatas in listingsoup.find_all('ul',{'class':'list-inline'}):
							for companyli in metadatas.find_all('li',{'class':'employer'}):
								print "Company"
								print companyli.text
							for locationli in metadatas.find_all('li',{'class':'location'}):
								print "Location"
								print locationli.text
							for postedli in metadatas.find_all('li',{'class':'posted hidden-xs'}):
								print postedli.text

						#strings
						titleTofile = title.text
						titleTofilepre1 = titleTofile.encode('utf-8','ignore')
						titleTofilepre2 = titleTofilepre1.strip()
						titleTofilepre3 = titleTofilepre2.replace('\n','')
						jdTofile = jd.text
						jdTofilepre1 = jdTofile.encode('utf-8','ignore')
						jdTofilepre2 = jdTofilepre1.strip()
						jdTofilepre3 = jdTofilepre2.replace('\n','')
						companyliTofile = companyli.text
						companyliTofile1 = companyliTofile.encode('utf-8','ignore')
						companyliTofile2 = companyliTofile1.strip()
						companyliTofile3 = companyliTofile2.replace(',','')
						companyliTofile4 = companyliTofile3.replace('\n','') #continue from the comma-stripped string
						locationliTofile = locationli.text
						locationliTofile1 = locationliTofile.encode('utf-8','ignore')
						locationliTofile2 = locationliTofile1.strip()
						locationliTofile3 = locationliTofile2.replace('\n','')
						postliTofile = postedli.text
						postliTofile1 = postliTofile.encode('utf-8','ignore')
						postliTofile2 = postliTofile1.strip()
						postliTofile3 = postliTofile2.replace('\n','')
						postliTofile4 = postliTofile3.replace('Posted by','')


						with open ('datascientistNYClistings.txt','a') as csvfile:
							csvfile.write(str(urlinfo.get('href'))+ '\t' + str(titleTofilepre3) + '\t' + str(companyliTofile4) + '\t' + str(locationliTofile3) + '\t' + str(postliTofile4) + '\t' + str(jdTofilepre3) + '\n')
							
						# with open ('datascientistNYClistingsNoURLNoTitle.txt','a') as csvfile:
						# 	csvfile.write(str(jdTofilepre3) + '\n')	
	return

#this code was created for the purposes of my web analytics class project
#francisco mendoza
#web scraping dice.com

soup = initializeURL()
pagestotal = countpage()
# getAll(pagestotal,soup)
getlistingdescription()

