Difference between revisions of "AngelList Database"
AnneFreeman (talk | contribs) |
Revision as of 14:20, 1 May 2019
AngelList Database | |
---|---|
Project Information | |
Has title | AngelList Database |
Has start date | |
Has deadline date | |
Has project status | Active |
Copyright © 2019 edegan.com. All Rights Reserved. |
The purpose of this project is to build a database of incubators, perhaps as well as other ecosystem organizations, from AngelList.
Crawler Specification
There are incubators here
Process from before:
- Opened source link (http://www.angel.co)
- Typed "incubator" in the search box
- Clicked on "Search for 'incubator'"
500 Results
Revised process:
- Visit https://angel.co/search?q=incubator
- Click More (a lot)
- Save the HTML page as E:\projects\AngelList\AngelList.html
- That gets you 500 (out of 1,447 claimed results)
- Process the HTML using Regular Expressions to produce AngelListPages.txt, which is in the format:
- URL\tConame
- Note that restricting to "Companies" reduces it to 1,339 results.
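The regular-expression step above could look roughly like the following. This is a minimal sketch, not the script actually used: it assumes each result in the saved page is an anchor whose href is an angel.co profile URL and whose text is the company name, which may not match the real markup. The names RESULT_LINK, extract_pages and to_tsv are hypothetical.

```python
import re
from html import unescape

# Assumed markup: <a ... href="https://angel.co/slug" ...>Company Name</a>.
# The real saved page may be structured differently.
RESULT_LINK = re.compile(
    r'<a[^>]*href="(https://angel\.co/[^"?#]+)"[^>]*>([^<]+)</a>'
)


def extract_pages(html):
    """Return unique (url, name) pairs in the order they appear."""
    seen = set()
    rows = []
    for url, name in RESULT_LINK.findall(html):
        name = unescape(name.strip())
        if name and url not in seen:
            seen.add(url)
            rows.append((url, name))
    return rows


def to_tsv(rows):
    """Format rows as URL<TAB>Coname lines."""
    return "".join(f"{url}\t{name}\n" for url, name in rows)


# Made-up sample markup, only to exercise the functions.
sample = (
    '<div><a class="title" href="https://angel.co/example-incubator">'
    "Example Incubator</a></div>"
    '<div><a href="https://angel.co/example-incubator">Example Incubator</a></div>'
    '<div><a href="https://angel.co/another-lab"> Another Lab </a></div>'
)
print(to_tsv(extract_pages(sample)), end="")
```

Against the real file, the HTML would be read from E:\projects\AngelList\AngelList.html and the output written to AngelListPages.txt.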
Failed workarounds
Tried a workaround using page numbers:
- https://angel.co/search?page=13&q=incubator&type=companies
- https://angel.co/search?page=14&q=incubator&type=companies
But there are 40 results per page: page 13 ends with No Results Yet after More, and page 14 opens with it. So the search is still capped at 500 results.
It appears from the format of the results that AngelList has a type "incubator", though some likely incubators have other types (e.g., BMW iVentures Incubator is a "VC Firm" and Austin Technology Incubator is a "Company"). And I can't see a way to restrict the search by type.
Signed up for an account as Ed Egan, ed@edegan.com, littleAmount. Then the link More -> Incubators takes you to https://angel.co/accelerators/apply. But there doesn't seem to be an advanced search. Count of incubator results increased while on the site!
400 Results
The page https://angel.co/incubators shows 6,054 companies. It stopped adding to the list after 20 More clicks, which turned out to be 400 results. Saved page as E:\projects\AngelList\Incubator - CompanyTypes - AngelList.html
Given the page title, this is likely just the "Incubator" company type organizations. However, there is some useful information that could be extracted from just that page. The incubator type also clearly includes accelerators and other things.
Possible Processes
In either of the cases below, we'd need a Selenium web driver to click More (a lot). For the latter case, we'd also need to get the URL encodings (probably by hand) for the state names we'd like to search.
Restricted Search
Tried searching incubator TX, but it looks like only the name and text descriptions are searched. Tried searching "incubator a", "incubator b", "incubator c" and each had fewer than 500 results, so that might work.
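If the letter-suffix idea were pursued, the query URLs could be generated as below. This is only a sketch: it assumes the search endpoint keeps accepting the q parameter as in https://angel.co/search?q=incubator, and that running all 26 letters is the intended extension of the three queries tried. The function name letter_queries is hypothetical.

```python
import string
from urllib.parse import urlencode

BASE = "https://angel.co/search"


def letter_queries(keyword="incubator"):
    """Build one search URL per letter, e.g. q=incubator+a."""
    return [
        f"{BASE}?{urlencode({'q': f'{keyword} {letter}'})}"
        for letter in string.ascii_lowercase
    ]


urls = letter_queries()
print(urls[0])
print(len(urls))
```

The per-letter result sets would overlap, so the combined output would still need duplicates removed.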
Company Search
https://angel.co/companies has a search function. You can select type as incubator and location as US: https://angel.co/companies?company_types[]=Incubator&locations[]=1688-United+States This gives 993 companies...
It might be possible to go state by state. California has 385, Massachusetts has 36, New York has 141, etc. But again, this is limited to the incubator type.
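The company search URL can be assembled from its two parameters, as in the sketch below. It reproduces the United States URL given above. The location tokens for individual states are not known here and would have to be collected by hand, so none are filled in. The function name company_search_url is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://angel.co/companies"


def company_search_url(location_token, company_type="Incubator"):
    """Build a company search URL.

    location_token is the value AngelList uses for a location, such as
    "1688-United States". State tokens are not known here and must be
    looked up by hand. Pass company_type=None to leave the type unrestricted.
    """
    params = []
    if company_type is not None:
        params.append(("company_types[]", company_type))
    params.append(("locations[]", location_token))
    # safe="[]" keeps the brackets in the parameter names unescaped.
    return f"{BASE}?{urlencode(params, safe='[]')}"


print(company_search_url("1688-United States"))
```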
Crawler
We decided to build a web crawler using Selenium to search for incubators, starting from the AngelList company search URL https://angel.co/companies? with the locations[]= option appended to specify a state (the 50 states and the District of Columbia). The crawler loaded the page as specified and then clicked the load more button while there were still more results to load. No state exceeded 500 results. The crawler then collected information for every company listed, including the state, the company name, a brief description, and the company's AngelList URL. This information was stored in a tab-delimited text file.
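The click-until-done loop and the tab-delimited output can be sketched without a browser, as below. This is not the project's script. find_more_button is a hypothetical callable; with Selenium it might wrap driver.find_elements(...) and return the first match or None. A real crawler would also need waits between clicks. The column order and the sample row are assumptions for illustration.

```python
import csv
import io


def click_until_exhausted(find_more_button, max_clicks=100):
    """Click the button returned by find_more_button() until it returns None.

    max_clicks guards against a button that never disappears.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_more_button()
        if button is None:
            break
        button.click()
        clicks += 1
    return clicks


def write_rows(rows, handle):
    """Write (state, name, description, url) rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["state", "name", "description", "url"])
    writer.writerows(rows)


class FakeMoreButton:
    """Stand-in for a page element, used only to exercise the loop offline."""

    def __init__(self, pages_left):
        self.pages_left = pages_left

    def click(self):
        self.pages_left -= 1


button = FakeMoreButton(pages_left=3)
clicks = click_until_exhausted(lambda: button if button.pages_left > 0 else None)

buffer = io.StringIO()
write_rows(
    [("TX", "Example Incubator", "A placeholder description",
      "https://angel.co/example-incubator")],
    buffer,
)
print(clicks)
print(buffer.getvalue(), end="")
```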
Crawler By Company Type
This crawler appended company_types[]=Incubator to the url so that the companies appearing in the search results were only those with the listed company type of incubator. It yielded 1068 results. The script (angelList_companyTypeIncubator.py) and the data it generated (AngelList_companyTypeIncubator.txt) are on the RDP in the folder AngelList.
Crawler By Keyword
This crawler clicked on the search bar and entered the keyword "incubator", so that the companies in the results contained the keyword incubator somewhere on their company page. It yielded 840 results. The script (angelList_keywordIncubator.py) and the data it generated (AngelList_keywordIncubator.txt) are on the RDP in the folder AngelList.
Master File of Results
We compared the two files and merged them into a master file containing only unique results. The master file contains 1512 unique results from the two crawlers. We dropped the state when determining whether results were unique, because the same company was occasionally listed in different states, which produced duplicate results.
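One way to read this step is as a union of the two files with duplicates removed, ignoring the state column. The sketch below illustrates that reading. It is not the procedure actually used, and it assumes each row is (state, name, description, url) in that order. The function name merge_unique and the sample rows are made up.

```python
def merge_unique(*sources):
    """Merge row collections, keeping one row per (name, description, url).

    The state is dropped from both the uniqueness key and the output, so a
    company listed under several states appears only once.
    """
    seen = set()
    merged = []
    for rows in sources:
        for _state, name, description, url in rows:
            key = (name, description, url)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Placeholder rows, not real data.
type_rows = [
    ("TX", "Example Incubator", "Placeholder", "https://angel.co/example-incubator"),
    ("CA", "Example Incubator", "Placeholder", "https://angel.co/example-incubator"),
]
keyword_rows = [
    ("NY", "Another Lab", "Placeholder", "https://angel.co/another-lab"),
    ("TX", "Example Incubator", "Placeholder", "https://angel.co/example-incubator"),
]

master = merge_unique(type_rows, keyword_rows)
print(len(master))
```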