Difference between revisions of "Talk:Accelerator Seed List (Data)"
VeeralShah (talk | contribs) |
Revision as of 14:05, 15 June 2017
Hi Veeral,
Contents
- 1 Intro
- 2 Important docs
- 3 To-do list
- 4 Don't worry about this stuff
- 5 Veeral's Plan
- 5.1 Complete Master List of Accelerators
- 5.1.1 Transfer all of the organizations data into Access
- 5.1.2 Use Access keyword queries with the short descriptions of each organization to accumulate a list of Potential Accelerators from Organizations data
- 5.1.3 Match Potential Accelerators with Cleaned Cohort Data using The Matcher (Tool).
- 5.1.4 Necessities to Parse the Global Accelerator Network HTML
Intro
Welcome to the project. The documents are here: E:\Mcnair\Projects\Accelerators
SQL documents are here: E:\Mcnair\Projects\Accelerators\SQL_Data
Database Drive is here: Z:\Bulk\Accelerators
The database is called accelerator
Important docs
The SDC pull that includes all of the round data since 1999: E:\Mcnair\Projects\Accelerators\VC_Data_Repeated_Down.txt or E:\Mcnair\Projects\Accelerators\"VC Data.xlsx"
The Cohorts of accelerators (under the Updated tab on the bottom): E:\Mcnair\Projects\Accelerators\"Clean Cohort Data.xlsx"
The Crunchbase Snapshots of organizations: E:\Mcnair\Projects\Accelerators\"Crunchbase Snapshot"\organizations.csv
To-do list
1. Filter out actual accelerators from the Crunchbase organizations data
- Possibly by running accelerator_keywords.py HOW DO YOU RUN THIS?
- Possibly by using string searching in organizations.csv SHOULD I ADD MORE FILTERS?
- Watch out for venture capital companies (the organizations file has many of these, and we'll probably pick up a lot of them in our "accelerator" filtered list)
2. Match this list against the current list of accelerators
- We have our own copy of the matcher in the accelerators E drive (try mode 1 and mode 2 for different results, mode 2 might be more helpful) CAN'T SEEM TO FIND DIFFERENT MODES - ALSO HOW DO YOU USE?
- This will tell you whether it was part of the old list or not (and therefore whether we need to get data for it or not)
3. Find cohort data for all of the new accelerators (ones not previously on the list & if they're not accelerators take them off the list)
- We used regex for this
- Once you find the cohort data, put it into the updated cohort data list Excel file
- You just need the cohort company name and the name of the accelerator
4. Match the cohort data against the round data from SDC
- Make sure to get both the accelerator name and the cohort company name in the first document
- In the second document (to match against the first) put the list of all companies funded in rounds (from SDC)
- In summary: File1 = Accelerator Cohorts and File2 = SDC data
5. Upload the match file into the psql database, then follow the code in accelerators.sql
- Write new code for your newly uploaded tables and documents; you should be able to follow what we've already done to get a similar percentVC table
- The new table should look like the previous percent VC table, PercentVc4
^this above is all for the VC percentage rankings
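As a rough illustration of steps 4 and 5, here is a minimal Python sketch of how a VC percentage per accelerator could be computed. It assumes that "percent VC" means the share of an accelerator's cohort companies that also appear in the SDC round data, and it uses exact matching on lightly normalized names as a stand-in for the matcher tool. The names `normalize` and `percent_vc` are hypothetical and are not taken from accelerators.sql.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase a company name and drop commas, periods, and extra spaces."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def percent_vc(cohorts, funded_companies):
    """Return {accelerator: percent of its cohort companies found in the funded list}.

    cohorts: iterable of (accelerator_name, cohort_company_name) pairs (File1).
    funded_companies: iterable of company names from the SDC round data (File2).
    """
    funded = {normalize(n) for n in funded_companies}
    totals = defaultdict(int)
    hits = defaultdict(int)
    for accelerator, company in cohorts:
        totals[accelerator] += 1
        if normalize(company) in funded:
            hits[accelerator] += 1
    return {acc: 100.0 * hits[acc] / totals[acc] for acc in totals}
```

Exact matching will miss spelling variants that a fuzzy matcher would catch, so the percentages from a sketch like this are a lower bound at best.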
For more info you can use the whoisparser, which gets data on website registration (location, time, who, and potentially age, if you consider the website registration date as an age). You can also do an automated Google lookup (this will harvest addresses that are within Google).
^These two will get you the information of where & how old
Don't worry about this stuff
Rank on VC
- Getting a VC percentage for each Accelerator
Also categorize
- Age
- Nonprofit or not
- Location
RegEx Code for repeating data down for the round data from SDC:
\n([^\t]+\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t)(.*)\n\t\t\t\t\t\t\t\t\t\t
\n\1\2\n\1
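The same fill-down can be sketched in Python. This is a minimal version that works line by line instead of applying the regex, assuming the round data is tab-separated and that the first ten columns are the ones to repeat, as the pattern above implies. `fill_down` and `key_cols` are illustrative names.

```python
def fill_down(lines, key_cols=10):
    """Copy the leading key columns from the last full row into rows where they are blank.

    A row counts as a continuation row when its first `key_cols` fields are all
    empty and it has at least one more field after them.
    """
    filled = []
    last_key = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key, rest = fields[:key_cols], fields[key_cols:]
        if last_key is not None and len(fields) > key_cols and not any(key):
            # Continuation row: reuse the key columns from the previous full row.
            fields = last_key + rest
        elif fields[0]:
            # Full row: remember its key columns for the rows that follow.
            last_key = key
        filled.append("\t".join(fields))
    return filled
```

Unlike the search-and-replace, which has to be run repeatedly until nothing changes, this handles any number of consecutive continuation rows in one pass.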
=if(isnumber(search("blah",B2))=TRUE,1,0) where blah is the substring (what you're searching for), B2 is the string (what you're searching in), 1 means the substring is present, and 0 means it isn't.
=sum(A1:C1) This just sums the cells from A1 to C1
Veeral's Plan
Complete Master List of Accelerators
(Note: all files are found and stored under E:\McNair\Projects\Accelerators)
Transfer all of the organizations data into Access
- Done - Organizations.accdb
Use Access keyword queries with the short descriptions of each organization to accumulate a list of Potential Accelerators from Organizations data
- Companies with at least 2 keywords from [accel, startup, mentor, seed, program, week, pitch, found, stage, incubat]
- Companies with location_country_code = USA
- 381 Potential Accelerators (These are not exclusively accelerators: at first glance, some VC firms and startup firms snuck into the list. The plan is to match this list against the list of accelerators and, in that step, eliminate the entries that do not match and are not accelerators.)
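The two filters above can also be sketched in Python against organizations.csv, as an alternative to the Access queries. This is only a sketch: the column name `short_description` is an assumption (the country column `location_country_code` is the one named above), and the function names are hypothetical.

```python
import csv

# Keyword stems from the list above; each is matched as a substring.
KEYWORDS = ["accel", "startup", "mentor", "seed", "program",
            "week", "pitch", "found", "stage", "incubat"]


def keyword_hits(description):
    """Count how many distinct keyword stems occur in the description."""
    text = (description or "").lower()
    return sum(1 for kw in KEYWORDS if kw in text)


def is_potential_accelerator(row, min_hits=2):
    """Apply both filters: country code USA and at least `min_hits` keywords."""
    return (row.get("location_country_code") == "USA"
            and keyword_hits(row.get("short_description")) >= min_hits)


def potential_accelerators(csv_path):
    """Read the organizations CSV and return the rows that pass both filters."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f) if is_potential_accelerator(row)]
```

Because the stems are matched as substrings, "found" also hits "founder" and "foundation", which is one way VC firms and ordinary startups can end up in the list.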
Match Potential Accelerators with Cleaned Cohort Data using The Matcher (Tool).
- List of current accelerators obtained from Cleaned Cohort Data is in Organizations.accdb under the query, "List of Accelerators". The 381 Potential Accelerators are under the "Potential Accelerators" Query.
- Discovered the Global Accelerator Network - downloaded all of the HTML and examined it to find out how we can parse the website.
Necessities to Parse the Global Accelerator Network HTML
An entry:
<div class="member_entry clear"...</div>
Within an entry:
Logo:
<div class="logo">
<a href="http://gan.co/members/view/desai-accelerator"><img alt="123_large" src="./GAN_files/123_large.png"></a>
</div>
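A minimal sketch of pulling the member link and logo path out of each logo block, using only Python's standard `html.parser`. It assumes every logo block is a `div` with class `logo` that contains one `a` and one `img` and no nested `div`, as in the sample above. `LogoParser` and `extract_logos` are illustrative names.

```python
from html.parser import HTMLParser


class LogoParser(HTMLParser):
    """Collect the member URL and logo image path from each <div class="logo">."""

    def __init__(self):
        super().__init__()
        self.in_logo = False
        self.current = None
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "logo" in (attrs.get("class") or "").split():
            self.in_logo = True
            self.current = {}
        elif self.in_logo and tag == "a":
            self.current["member_url"] = attrs.get("href")
        elif self.in_logo and tag == "img":
            self.current["logo_src"] = attrs.get("src")

    def handle_endtag(self, tag):
        # Assumes no nested <div> inside the logo block.
        if tag == "div" and self.in_logo:
            self.entries.append(self.current)
            self.in_logo = False


def extract_logos(html):
    """Return a list of {"member_url": ..., "logo_src": ...} dicts."""
    parser = LogoParser()
    parser.feed(html)
    return parser.entries
```

The last path segment of `member_url` (here `desai-accelerator`) looks like a usable identifier for the accelerator, though that is a guess from this single sample.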