Ecosystem Organization Classifier
Ecosystem Organization Classifier | |
---|---|
Project Information | |
Has title | Ecosystem Organization Classifier |
Has owner | Libby Bassini, Anne Freeman |
Has start date | |
Has deadline date | |
Has project status | Active |
Is dependent on | Crunchbase Database, VentureXpert Database |
Does subsume | Defining Incubators, Incubator Seed Data, Incubators in Five Ecosystems |
Copyright © 2019 edegan.com. All Rights Reserved. |
Introduction
The purpose of this project is to build a classifier that takes the description of an ecosystem organization (e.g., a startup, a venture capitalist, or an incubator) and either classifies the organization's type or distinguishes incubators from non-incubators.
Text Processing
There are two obvious methods for processing the textual descriptions. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to select words or phrases with discriminant power. The second is a Word2Vec approach, which uses a shallow, two-layer neural network to reduce descriptions to vectors with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.) We will try both approaches.
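As a rough illustration of the bag-of-words idea, the sketch below computes TF-IDF weights in plain Python. It is not the project's code: the function names `tokenize` and `tfidf`, the toy descriptions, and the smoothed IDF formula (one common variant among several) are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts and a smoothed IDF, log((1 + N) / (1 + df)) + 1.
    This particular formula is an assumption; other variants exist.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of descriptions containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


# Invented toy descriptions, not taken from Crunchbase.
descriptions = [
    "An incubator that mentors early stage startups",
    "A venture capital firm that funds startups",
    "A startup that builds payment software",
]
weights = tfidf(descriptions)

# Terms that appear in few descriptions get higher weights.
for term in ("incubator", "startups", "that"):
    print(term, round(weights[0][term], 3))
```

In this toy corpus, "incubator" appears in one description and "that" in all three, so "incubator" receives the larger weight. Terms with high weights in one class and low weights elsewhere are the candidates for discriminant features.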
Code built already
We have previously used bag-of-words in the Demo Day Page Google Classifier and in early versions of the Industry Classifier. Later versions of the Industry Classifier were based on our Deep Text Classifier project.
First data
For the first dataset, we will use organization descriptions from Crunchbase. Run this code on crunchbase3 (see Crunchbase Database):
\COPY (SELECT uuid, company_name, short_description FROM Organizations) TO 'CrunchbaseShortOrgDescs.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--744332
\COPY (SELECT A.uuid, A.company_name, B.description FROM Organizations AS A JOIN organization_descriptions AS B ON A.uuid=B.uuid) TO 'CrunchbaseLongOrgDescs.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--520698
The resulting files are in Z:\crunchbase3 and have been copied to E:\projects\crunchbase3.
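The exported files are tab-delimited with a header row and empty strings for NULLs, so they can be read with Python's standard `csv` module. The following is a minimal sketch under those assumptions; the helper name `load_descriptions` and the sample rows are invented for illustration, while the column names come from the query above.

```python
import csv
import io


def load_descriptions(handle, text_column="short_description"):
    """Map uuid -> (company_name, description), skipping empty descriptions."""
    reader = csv.DictReader(handle, delimiter="\t")
    records = {}
    for row in reader:
        text = (row[text_column] or "").strip()
        if text:  # NULLs were exported as empty strings
            records[row["uuid"]] = (row["company_name"], text)
    return records


# Invented sample rows in the same layout as the export.
sample = io.StringIO(
    "uuid\tcompany_name\tshort_description\n"
    "u1\tExample Labs\tAn incubator for hardware startups\n"
    "u2\tNo Text Inc\t\n"
)
records = load_descriptions(sample)
print(records)
```

For the real export, the same function could be called on `open("CrunchbaseShortOrgDescs.txt", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption about how the file was written. For the long descriptions, pass `text_column="description"`.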
We can use The Matcher (Tool) to match organization names to portfolio companies and to VC funds and firms taken from vcdb3 (see VentureXpert Database). We will also search this data by hand for incubators to get an initial set. Later, we will match this initial list of incubators to Crunchbase organization names to expand it.
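To make the name-matching step concrete, here is a deliberately simple sketch that joins a seed list of incubator names to organization names on a normalized form. It is a stand-in for illustration only and does not reproduce The Matcher (Tool); the functions `normalize` and `match_names`, the suffix list, and the sample names are all hypothetical.

```python
import re


def normalize(name):
    """Lowercase, drop punctuation, and strip a few common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    suffixes = {"inc", "llc", "ltd", "corp", "co"}  # illustrative, not exhaustive
    while words and words[-1] in suffixes:
        words.pop()
    return " ".join(words)


def match_names(seed_names, org_names):
    """Return {seed name: [organization names with the same normalized form]}."""
    index = {}
    for org in org_names:
        index.setdefault(normalize(org), []).append(org)
    return {seed: index.get(normalize(seed), []) for seed in seed_names}


# Invented names for illustration only.
seeds = ["Example Incubator, Inc.", "Missing Hub"]
orgs = ["EXAMPLE INCUBATOR", "Another Startup LLC"]
matches = match_names(seeds, orgs)
print(matches)
```

Exact matching on normalized names misses misspellings and abbreviations, so unmatched seeds would still need review by hand or a more tolerant matcher.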
Related Projects
Subsumed Projects: Defining Incubators, Incubator Seed Data, Incubators in Five Ecosystems
This project is dependent on: Crunchbase Database, VentureXpert Database