Revision as of 14:37, 17 February 2017

Possible Tools

McNair Project
Industry Classifier

Python Tools

SciKit Learn SVM

http://scikit-learn.org/stable/modules/svm.html#svm

Its training time complexity is between O(n^2) and O(n^3), where n is the number of training samples. Seems easy to use. This is not a neural net; it is a support vector machine.
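As a rough illustration of how little code the SVM route needs, here is a minimal sketch. It assumes scikit-learn is installed, and the four descriptions and their labels are made-up toy data, not SDC records. The bag-of-words step (`CountVectorizer`) is also an assumption about how text would be fed to the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy data: short descriptions and industry labels (not from SDC).
descriptions = [
    "develops gene therapy drugs",
    "biotech firm researching cancer drugs",
    "streams online video and music",
    "publishes digital news and video",
]
industries = ["biotech", "biotech", "media", "media"]

# Turn each description into a bag-of-words count vector.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(descriptions)

# Fit a support vector classifier with a linear kernel.
svm_model = SVC(kernel="linear")
svm_model.fit(X, industries)

# Classify an unseen description using the same vocabulary.
svm_prediction = svm_model.predict(vectorizer.transform(["new cancer drugs"]))[0]
print(svm_prediction)
```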

SciKit Learn Neural Net

http://scikit-learn.org/stable/modules/neural_networks_supervised.html

This IS a neural net using back propagation.

Its complexity is listed in the documentation as follows: Suppose there are n training samples, m features, k hidden layers, each containing h neurons (for simplicity), and o output neurons. The time complexity of backpropagation is O(n * m * h^k * o * i), where i is the number of iterations. Since backpropagation has a high time complexity, it is advisable to start with a smaller number of hidden neurons and few hidden layers for training.
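To get a feel for why the number of hidden layers matters so much, the growth term quoted above can be evaluated directly. The sizes below are hypothetical and chosen only for the arithmetic; the result is a relative cost, not a running time.

```python
def backprop_cost(n, m, h, k, o, i):
    """Evaluate n * m * h**k * o * i, the growth term quoted above."""
    return n * m * h ** k * o * i

# Hypothetical sizes: 1000 samples, 500 features, 10 neurons per hidden
# layer, 5 output classes, 200 iterations.
one_layer = backprop_cost(n=1000, m=500, h=10, k=1, o=5, i=200)
three_layers = backprop_cost(n=1000, m=500, h=10, k=3, o=5, i=200)

# Ratio of the two growth terms; h**k is the only factor that changes.
print(three_layers // one_layer)
```

Under this formula, going from one hidden layer to three with 10 neurons each multiplies the term by 10^2.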

WE ENDED UP USING THIS ONE
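A minimal sketch of what using this package can look like, assuming scikit-learn 0.18 or later (where `MLPClassifier` is available). The toy descriptions, the bag-of-words features, and every parameter value here are illustrative assumptions, not the settings used in the project scripts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data: short descriptions and industry labels (not from SDC).
train_descriptions = [
    "develops gene therapy drugs",
    "biotech firm researching cancer drugs",
    "streams online video and music",
    "publishes digital news and video",
]
train_industries = ["biotech", "biotech", "media", "media"]

# Bag-of-words features; one column per word seen in training.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_descriptions)

# One small hidden layer, following the advice to start small.
mlp = MLPClassifier(hidden_layer_sizes=(5,), solver="lbfgs",
                    max_iter=500, random_state=0)
mlp.fit(X_train, train_industries)

# Predict industries for unseen descriptions.
new_descriptions = ["cancer drugs", "online news video"]
mlp_predictions = mlp.predict(vectorizer.transform(new_descriptions))
print(list(mlp_predictions))
```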

SK Neural Network Package

This is a separate package from the one listed above. It requires a separate installation. Documentation is provided at:

https://scikit-neuralnetwork.readthedocs.io/en/latest/index.html

We ran into deprecation warnings, and the program would not execute because g++ was missing.

R Tools

R has a package called "neuralnet". It is an add-on package installed from CRAN, not part of base R.

An example is given at:

https://www.packtpub.com/books/content/training-and-visualizing-neural-network-r

Scripts

Scripts and data for this project are located in:

E:\McNair\Projects\Accelerators\Code+Final_Data\ChristyCode

Industry Classifier

This is a neural net built in Python that trains on industry designation data from the SDC Platinum database. It predicts the industry classification of a given company from its description. The file is located in the directory listed above.

FindTrainData.py

Builds a tab-delimited text file containing 200 companies for each industry classification (e.g. 200 biotech, 200 media, etc.). Hopefully, if we use this as our training data, we will get more accurate classifications.
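The idea of a balanced training file can be sketched as below. This is not the actual FindTrainData.py. The column layout (company, description, industry), the function name `take_per_industry`, and the choice to keep the first rows encountered rather than a random sample are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

def take_per_industry(rows, industry_col, per_class=200):
    """Keep at most per_class rows for each value found in industry_col."""
    counts = defaultdict(int)
    kept = []
    for row in rows:
        industry = row[industry_col]
        if counts[industry] < per_class:
            counts[industry] += 1
            kept.append(row)
    return kept

# Hypothetical tab-delimited input: company, description, industry.
sample_text = (
    "A Corp\tgene therapy\tbiotech\n"
    "B Corp\tcancer drugs\tbiotech\n"
    "C Corp\tvaccines\tbiotech\n"
    "D Corp\tonline video\tmedia\n"
)
rows = list(csv.reader(io.StringIO(sample_text), delimiter="\t"))

# per_class=2 keeps the demo small; the page describes 200 per industry.
training_rows = take_per_industry(rows, industry_col=2, per_class=2)

# Write the balanced sample back out as tab-delimited text.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(training_rows)
print(out.getvalue())
```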

FixDescriptions.py

Deals with the problem that output files from SDC are poorly formatted when the description goes beyond one line. Outputs a tab-delimited text file where the whole description is on the same line and can be read.
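One way such a fix could work is sketched below. This is not the actual FixDescriptions.py, and the real SDC layout may differ. The sketch assumes that a complete record has a known number of tab-separated fields, that the description is the last field, and that any line with a different field count is a wrapped continuation of the previous description.

```python
def merge_wrapped_descriptions(lines, n_fields):
    """Join continuation lines onto the record they belong to.

    Assumptions: a complete record has n_fields tab-separated fields and
    the description is the last one. Any other non-blank line is treated
    as the continuation of the previous record's description.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) == n_fields or not records:
            records.append(fields)
        else:
            # Continuation: append the text to the previous description.
            records[-1][-1] = records[-1][-1].rstrip() + " " + line.strip()
    return ["\t".join(fields) for fields in records]

# Hypothetical input: name, industry, description (wrapped on record one).
raw_lines = [
    "Acme\tbiotech\tDevelops drugs for\n",
    "rare diseases.\n",
    "Beta\tmedia\tStreams video.\n",
]
fixed_lines = merge_wrapped_descriptions(raw_lines, n_fields=3)
for fixed in fixed_lines:
    print(fixed)
```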

Addresses.txt

This text file contains the investment info, name, address, city, and state of portfolio companies.

Descriptions.txt

This text file contains the company, short description, major industry, and minor industry of portfolio companies.

Statistics

Statistical methods for analyzing results from a neural network.

Precision and Recall

Quick check using Excel: find the number of correct matches between two columns:

=SUMPRODUCT(--(range1=range2))

See an example at https://exceljet.net/formula/count-matches-between-two-columns.
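The same check can be done in Python, along with per-industry precision and recall. This is a minimal sketch using only the standard library. The two label lists are made-up examples, and the function names are hypothetical.

```python
def count_matches(actual, predicted):
    """Count positions where the two columns agree (like the SUMPRODUCT check)."""
    return sum(a == p for a, p in zip(actual, predicted))

def precision_recall(actual, predicted, label):
    """Precision and recall for one industry label."""
    true_pos = sum(a == label and p == label for a, p in zip(actual, predicted))
    predicted_pos = sum(p == label for p in predicted)
    actual_pos = sum(a == label for a in actual)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical actual and predicted industries for five companies.
actual = ["biotech", "biotech", "media", "media", "media"]
predicted = ["biotech", "media", "media", "media", "biotech"]

matches = count_matches(actual, predicted)
accuracy = matches / len(actual)
biotech_precision, biotech_recall = precision_recall(actual, predicted, "biotech")
print(matches, accuracy, biotech_precision, biotech_recall)
```

Precision answers "of the companies labeled biotech by the classifier, how many really are biotech", while recall answers "of the real biotech companies, how many did the classifier find".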

Comments and Thoughts

2/17/17

Christy: No matter what parameters I change in the NN, I can't get the accuracy to go above around 30%. Looking at the descriptions that the classifier fails on, I realized that it pretty much guesses randomly a lot of the time when the descriptions are terrible, like "We provide services to our customers." I think we need to be training and classifying based on the longer description, which is why I started working on the FixDescriptions.py script.

