Difference between revisions of "Seed Accelerator Data Assembly"
(Created page with "{{McNair Projects |Has title=Seed Accelerator Data Assembly |Has owner=Ed Egan, Hira Farooqi, |Has start date=Summer 2018 |Has project status=Active }}") |
|||
Line 1: | Line 1: | ||
− | + | ==Useful pages== | |
− | + | ||
− | + | *[[U.S. Seed Accelerators]] | |
− | + | **[[Accelerator Seed List (Data)]] | |
− | + | *[[Accelerator Demo Day]] | |
− | + | **[[Mechanical Turk (Tool)]] | |
+ | *[[Crunchbase Accelerator Founders]] | ||
+ | **[[Crunchbase Data#Collecting Company Information]] | ||
+ | **[[Merging Existing Data with Crunchbase]] | ||
+ | *[[Whois Parser]] | ||
+ | *[[URL Finder (Tool)]] | ||
+ | *[[Industry Classifier]] | ||
+ | *[[Industry Classifier yang]] | ||
+ | *[[VentureXpert Data]] | ||
+ | |||
+ | Please add (or subtract) other relevant (or irrelevant) pages! | ||
+ | |||
+ | ==Database specification== | ||
+ | |||
+ | ===Preamble=== | ||
+ | |||
+ | We need to get the data into approximately 3NF to prevent errors and make it more useable. | ||
+ | |||
+ | Inputs: | ||
+ | E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx | ||
+ | https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0 | ||
+ | |||
+ | Recent work on the above from Connor: | ||
+ | *made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column) | ||
+ | *recoded the employee count | ||
+ | *normalized the investment amount | ||
+ | |||
+ | Connor's next to do: | ||
+ | *Founders Experience: code job title | ||
+ | *Founders Education: remove unknowns, code degree and code major | ||
+ | *Fix multiple campus and cohorts (see below). | ||
+ | |||
+ | This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]]. | ||
+ | |||
+ | The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in: | ||
+ | E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql | ||
+ | |||
+ | Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere). | ||
+ | |||
+ | However, we should build a new database for this project: | ||
+ | createdb accelerator | ||
+ | |||
+ | As well as a new folder (Z:\accelerator) for the data. | ||
+ | |||
+ | The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there! | ||
+ | |||
+ | ===Suggested Spec=== | ||
+ | |||
+ | We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor). | ||
+ | |||
+ | Accelerator Table (fieldname colname/DISCARD, from sheet "Accelerators Final") | ||
+ | *acceleratorname (Primary Key) Accelerators | ||
+ | *url homepage_url | ||
+ | *cohorturl cohort page URL | ||
+ | *cohortlisting Break out cohorts on the website? (Y/N) | ||
+ | *type Type | ||
+ | *alive Alive | ||
+ | *typenote Type notes | ||
+ | *weeks Weeks | ||
+ | *durationnotes Duration Notes | ||
+ | *city city (note: address info is for the accelerator HQ here) | ||
+ | *state state/region (note: address info is for the accelerator HQ here) | ||
+ | *foundingdate Creation Date | ||
+ | *terms Terms of Joining Accelerator (non-equity pay/equity/free) | ||
+ | *equity Equity? | ||
+ | *equityamountdesc Equity Amount | ||
+ | *equityamount Equity Amount Normalized (only 1s, range is averaged) | ||
+ | *investmentamountdesc Investment Amount | ||
+ | *investmentamount Investment Amount Normalized (Midpoint) | ||
+ | *investmentnotes Investment Notes | ||
+ | *discard industry DISCARD | ||
+ | *industry Industry | ||
+ | *specgenindu Specified/General | ||
+ | *address Address (note: address info is for the accelerator HQ here) | ||
+ | *subtype DISCARD | ||
+ | *nonprofit nonprofit? | ||
+ | *studentfocus designed for students (Y/N) | ||
+ | *multicampus Multiple campuses? (Y/N) | ||
+ | *software tech (Y/N) DISCARD | ||
+ | *stagepref What stage do they look for in cohort companies (SEL) | ||
+ | *cbconame Name in Crunchbase | ||
+ | *uuid UUID | ||
+ | |||
+ | Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption. | ||
+ | |||
+ | CohortCompany Table (fieldname colname/DISCARD, from sheet "Cohorts Final") | ||
+ | *conamestd (primary key) | ||
+ | *coname | ||
+ | *conameorg | ||
+ | *colocation | ||
+ | *city | ||
+ | *state_code | ||
+ | *country_code | ||
+ | *address | ||
+ | *codescription | ||
+ | *short_desc | ||
+ | *long_desc | ||
+ | *cosectors | ||
+ | *costatus | ||
+ | *cofundingstage | ||
+ | *courl | ||
+ | *uuid | ||
+ | *category_list | ||
+ | *category_group_list | ||
+ | *founded_on | ||
+ | *employee_count | ||
+ | *emp_count_scale | ||
+ | *linkedin_url | ||
+ | *gotvc | ||
+ | |||
+ | We then need to deduplicate this table and make sure that conamestd is a valid primary key. | ||
+ | |||
+ | Note that we won't be taking the following from "Cohorts Final" for this table: | ||
+ | *year | ||
+ | *accelerator | ||
+ | *cohort | ||
+ | *quarter | ||
+ | *acclocation | ||
+ | *accperks DISCARD | ||
+ | *cofounder DISCARD | ||
+ | |||
+ | CohortParticipation | ||
+ | *Cohort | ||
+ | *year | ||
+ | *quarter | ||
+ | *accelerator (foreign key) | ||
+ | *conamestd (foreign key) | ||
+ | |||
+ | ===Going further=== | ||
+ | |||
+ | We should likely rebuild the cohort variable to make it a sole "primary key" for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables: | ||
+ | |||
+ | CohortParticipation | ||
+ | *Cohort | ||
+ | *accelerator (foreign key) | ||
+ | *conamestd (foreign key) | ||
+ | |||
+ | Cohort | ||
+ | *Cohort | ||
+ | *year | ||
+ | *quarter | ||
+ | |||
+ | We could then add campus and a seperate table for campuses: | ||
+ | |||
+ | Campus Table | ||
+ | *CampusName (e.g., Techstars Boulder) | ||
+ | *Accelerator (foreign key) | ||
+ | *Address | ||
+ | *City | ||
+ | *State | ||
+ | *Zip | ||
+ | *Description | ||
+ | |||
+ | ===Founders Information=== | ||
+ | |||
+ | The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer. | ||
+ | |||
+ | Founders: | ||
+ | *Accelerator | ||
+ | *First Name | ||
+ | *Last Name | ||
+ | *Full Name | ||
+ | *Current Job | ||
+ | *Current Location | ||
+ | |||
+ | FoundersExperience: | ||
+ | *Accelerator | ||
+ | *First Name | ||
+ | *Last Name | ||
+ | *Full Name | ||
+ | *Employer | ||
+ | *Job Title | ||
+ | *Dates Employed | ||
+ | *Time Employed | ||
+ | *Location | ||
+ | *Extra Description | ||
+ | |||
+ | FoundersEducation: | ||
+ | *Accelerator | ||
+ | *First Name | ||
+ | *Last Name | ||
+ | *Full Name | ||
+ | *School Name | ||
+ | *Degree Name | ||
+ | *Major | ||
+ | *Attended | ||
+ | *Graduated | ||
+ | *Societies | ||
+ | |||
+ | |||
+ | ==To do/For consideration== | ||
+ | |||
+ | Minh: | ||
+ | *Create a format for collecting timing data | ||
+ | *Put the timing job on Turk | ||
+ | |||
+ | Connor: | ||
+ | *Try for missing timings that we really (new process) need | ||
+ | *Col W should be headquarter address | ||
+ | *What stage- Clean up? | ||
+ | |||
+ | Maxine: | ||
+ | *Build the google URL finder | ||
+ | *Industry classification from description | ||
+ | |||
+ | Someone: | ||
+ | *Process and Join in new timing data | ||
+ | *Make a category group to minorcode lookup | ||
+ | *Run WHOIS crawler on all valid URLs (not facebook pages, etc.) | ||
+ | *Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) |
Revision as of 18:50, 26 July 2018
Contents
Useful pages
- U.S. Seed Accelerators
- Accelerator Demo Day
- Crunchbase Accelerator Founders
- Whois Parser
- URL Finder (Tool)
- Industry Classifier
- Industry Classifier yang
- VentureXpert Data
Please add (or subtract) other relevant (or irrelevant) pages!
Database specification
Preamble
We need to get the data into approximately 3NF to prevent errors and make it more useable.
Inputs:
E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0
Recent work on the above from Connor:
- made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)
- recoded the employee count
- normalized the investment amount
Connor's next to do:
- Founders Experience: code job title
- Founders Education: remove unknowns, code degree and code major
- Fix multiple campus and cohorts (see below).
This is documented on U.S._Seed_Accelerators#Update_for_Hira.
The current database work is in vcdb2. The code to build the relevant tables in vcdb2 is in:
E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql
Note that AddCBData.sql can be run on crunchbase2 to get the relevant crunchbase data for import into vcdb2 (or elsewhere).
However, we should build a new database for this project:
createdb accelerator
As well as a new folder (Z:\accelerator) for the data.
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!
Suggested Spec
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).
Accelerator Table (fieldname colname/DISCARD, from sheet "Accelerators Final")
- acceleratorname (Primary Key) Accelerators
- url homepage_url
- cohorturl cohort page URL
- cohortlisting Break out cohorts on the website? (Y/N)
- type Type
- alive Alive
- typenote Type notes
- weeks Weeks
- durationnotes Duration Notes
- city city (note: address info is for the accelerator HQ here)
- state state/region (note: address info is for the accelerator HQ here)
- foundingdate Creation Date
- terms Terms of Joining Accelerator (non-equity pay/equity/free)
- equity Equity?
- equityamountdesc Equity Amount
- equityamount Equity Amount Normalized (only 1s, range is averaged)
- investmentamountdesc Investment Amount
- investmentamount Investment Amount Normalized (Midpoint)
- investmentnotes Investment Notes
- discard industry DISCARD
- industry Industry
- specgenindu Specified/General
- address Address (note: address info is for the accelerator HQ here)
- subtype DISCARD
- nonprofit nonprofit?
- studentfocus designed for students (Y/N)
- multicampus Multiple campuses? (Y/N)
- software tech (Y/N) DISCARD
- stagepref What stage do they look for in cohort companies (SEL)
- cbconame Name in Crunchbase
- uuid UUID
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.
CohortCompany Table (fieldname colname/DISCARD, from sheet "Cohorts Final")
- conamestd (primary key)
- coname
- conameorg
- colocation
- city
- state_code
- country_code
- address
- codescription
- short_desc
- long_desc
- cosectors
- costatus
- cofundingstage
- courl
- uuid
- category_list
- category_group_list
- founded_on
- employee_count
- emp_count_scale
- linkedin_url
- gotvc
We then need to deduplicate this table and make sure that conamestd is a valid primary key.
Note that we won't be taking the following from "Cohorts Final" for this table:
- year
- accelerator
- cohort
- quarter
- acclocation
- accperks DISCARD
- cofounder DISCARD
CohortParticipation
- Cohort
- year
- quarter
- accelerator (foreign key)
- conamestd (foreign key)
Going further
We should likely rebuild the cohort variable to make it a sole "primary key" for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:
CohortParticipation
- Cohort
- accelerator (foreign key)
- conamestd (foreign key)
Cohort
- Cohort
- year
- quarter
We could then add campus and a seperate table for campuses:
Campus Table
- CampusName (e.g., Techstars Boulder)
- Accelerator (foreign key)
- Address
- City
- State
- Zip
- Description
Founders Information
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.
Founders:
- Accelerator
- First Name
- Last Name
- Full Name
- Current Job
- Current Location
FoundersExperience:
- Accelerator
- First Name
- Last Name
- Full Name
- Employer
- Job Title
- Dates Employed
- Time Employed
- Location
- Extra Description
FoundersEducation:
- Accelerator
- First Name
- Last Name
- Full Name
- School Name
- Degree Name
- Major
- Attended
- Graduated
- Societies
To do/For consideration
Minh:
- Create a format for collecting timing data
- Put the timing job on Turk
Connor:
- Try for missing timings that we really (new process) need
- Col W should be headquarter address
- What stage- Clean up?
Maxine:
- Build the google URL finder
- Industry classification from description
Someone:
- Process and Join in new timing data
- Make a category group to minorcode lookup
- Run WHOIS crawler on all valid URLs (not facebook pages, etc.)
- Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)