Difference between revisions of "Talk:Accelerator Seed List (Data)"
BenBaldazo (talk | contribs) |
Revision as of 10:48, 24 April 2017
Hi Veeral,
Intro
Welcome to the project. The documents are here: E:\Mcnair\Projects\Accelerators
SQL documents are here: E:\Mcnair\Projects\Accelerators\SQL_Data
Database Drive is here: Z:\Bulk\Accelerators
Important docs
The SDC pull that includes all of the round data since 1999: E:\Mcnair\Projects\Accelerators\VC_Data_Repeated_Down.txt or E:\Mcnair\Projects\Accelerators\"VC Data.xlsx"
The Cohorts of accelerators (under the Updated tab on the bottom): E:\Mcnair\Projects\Accelerators\"Clean Cohort Data.xlsx"
The Crunchbase Snapshots of organizations: E:\Mcnair\Projects\Accelerators\"Crunchbase Snapshot"\organizations.csv
To-do list
- Filter out actual accelerators from the Crunchbase organizations data
  - Possibly by running accelerator_keywords.py
  - Possibly by string searching in organizations.csv
  - Watch out for venture capital companies: the organizations file has many of these, and we'll probably pick up a lot of them in our "accelerator" filtered list
- Match this list against the current list of accelerators
  - We have our own copy of the matcher in the accelerators E drive (try mode 1 and mode 2 for different results; mode 2 might be more helpful)
  - This will tell you whether each one was part of the old list (and therefore whether we need to get data for it)
- Find cohort data for all of the new accelerators (the ones not previously on the list); if they're not actually accelerators, take them off the list
  - We used regex for this
  - Once you find the cohort data, put it into the updated cohort data list Excel file
- Match the cohort data against the round data from SDC
  - Make sure to get both the accelerator name and the cohort company name in the first document
  - In the second document (to match against the first), put the list of all companies funded in rounds (from SDC)
  - In summary: File1 = Accelerator Cohorts and File2 = SDC data
- Upload the match file into the psql database, then follow the code in accelerators.sql
  - When writing new code for your newly uploaded tables and documents, you should be able to follow what we've already done to get a similar percentVC table
  - The previous percent VC table that yours should look like is PercentVc4
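For the first to-do item, here is a rough sketch of what string searching in organizations.csv could look like. This is not accelerator_keywords.py (I'm not reproducing that script here). The keyword lists, the column names "name" and "description", and the sample rows are all made up for illustration, so swap in whatever the real file uses.

```python
import csv
import io

# Hypothetical keyword lists; adjust to whatever the project actually uses.
ACCELERATOR_KEYWORDS = ("accelerator",)
VC_KEYWORDS = ("venture capital",)


def flag_rows(rows, text_columns):
    """Yield (row, looks_like_accelerator, looks_like_vc) for each row."""
    for row in rows:
        # Join the chosen text columns and lowercase them for case-insensitive search.
        text = " ".join((row.get(col) or "") for col in text_columns).lower()
        is_accel = any(keyword in text for keyword in ACCELERATOR_KEYWORDS)
        is_vc = any(keyword in text for keyword in VC_KEYWORDS)
        yield row, is_accel, is_vc


def candidate_accelerators(rows, text_columns):
    """Keep rows that mention an accelerator keyword and no VC keyword."""
    return [row for row, is_accel, is_vc in flag_rows(rows, text_columns)
            if is_accel and not is_vc]


# Tiny in-memory stand-in for organizations.csv (made-up rows and column names).
SAMPLE = io.StringIO(
    "name,description\n"
    "Alpha Accelerator,Seed accelerator for startups\n"
    "Beta Partners,Venture capital firm backing accelerator graduates\n"
    "Gamma Foods,Restaurant chain\n"
)
candidates = candidate_accelerators(csv.DictReader(SAMPLE), ["name", "description"])
print([row["name"] for row in candidates])
```

Dropping every row that mentions a VC keyword is probably too aggressive for real data, so it may be safer to keep the VC flag as an extra column and review those rows by hand.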
Don't worry about this stuff
Rank on VC
- Getting a VC percentage for each Accelerator
Also categorize
- Age
- Nonprofit or not
- Location
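To make "a VC percentage for each accelerator" concrete, here is a minimal sketch under some assumptions: the cohorts come as (accelerator, company) pairs, the SDC side is just a list of funded company names, and names are matched by lowercasing and trimming only. The real pipeline goes through the matcher and accelerators.sql, so treat the function name and the sample data below as hypothetical.

```python
from collections import defaultdict


def percent_vc(cohorts, funded_companies):
    """Return {accelerator: percent of its cohort companies found in funded_companies}."""
    # Naive normalization (an assumption): lowercase and strip whitespace.
    funded = {name.strip().lower() for name in funded_companies}
    totals = defaultdict(int)
    hits = defaultdict(int)
    for accelerator, company in cohorts:
        totals[accelerator] += 1
        if company.strip().lower() in funded:
            hits[accelerator] += 1
    return {acc: 100.0 * hits[acc] / totals[acc] for acc in totals}


# Made-up example data.
cohorts = [
    ("Accel A", "Foo"),
    ("Accel A", "Bar"),
    ("Accel A", "Qux"),
    ("Accel A", "Quux"),
    ("Accel B", "Baz"),
]
funded_companies = ["foo", "Baz "]
result = percent_vc(cohorts, funded_companies)
print(result)
```

The resulting dictionary can then be sorted by value to rank accelerators on VC.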
RegEx code for repeating data down in the round data from SDC (the first line below is the search pattern, the second is the replacement):
\n([^\t]+\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t)(.*)\n\t\t\t\t\t\t\t\t\t\t
\n\1\2\n\1
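If you would rather script this than run it in a text editor, the same idea can be sketched in Python. Two things here are my assumptions rather than part of the original pattern: fields are written as [^\t\n] instead of [^\t] so a field cannot run across a line break, and the substitution is repeated until nothing changes, because a single pass only fills the row directly below each filled row. The sample data is made up.

```python
import re

N_FIELDS = 10  # number of leading columns that get copied down

# First field must be non-empty; the remaining nine may be empty.
PREFIX = r"[^\t\n]+\t" + r"[^\t\n]*\t" * (N_FIELDS - 1)
# A filled row, followed by a row whose first ten columns are blank.
PATTERN = re.compile(r"\n(" + PREFIX + r")(.*)\n" + r"\t" * N_FIELDS)


def repeat_down(text):
    """Fill blank leading columns with the values from the row above."""
    while True:
        new_text = PATTERN.sub(r"\n\1\2\n\1", text)
        if new_text == text:
            return new_text
        text = new_text


# Made-up sample: one header line, one filled row, two continuation rows.
prefix = "\t".join(["A", "1", "2", "3", "4", "5", "6", "7", "8", "9"]) + "\t"
blank = "\t" * N_FIELDS
sample = "hdr\n" + prefix + "r1\n" + blank + "r2\n" + blank + "r3\n"
filled = repeat_down(sample)
print(filled)
```

Because the pattern starts with a newline, the first filled row needs a line above it (a header row, as in the sample) for its continuation rows to be picked up.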
=if(isnumber(search("blah",B2))=TRUE,1,0) Here "blah" is the substring (what you're searching for) and B2 is the string (what you're searching in). The formula returns 1 if the substring is present and 0 if it isn't.
=sum(A1:C1) This just sums the cells from A1 to C1.