AngelList Database
Latest revision as of 12:44, 21 September 2020


Project Information

  • Has title: AngelList Database
  • Has owner: Anne Freeman
  • Has project status: Active
  • Has sponsor: Kauffman Incubator Project
  • Has project output: Data, Tool

The purpose of this project is to build a database of incubators, and perhaps other ecosystem organizations as well, from AngelList.

Crawler Specification

There are incubators here

Process from before:

  • Opened source link (http://www.angel.co)
  • Typed "incubator" in the search box
  • Clicked on "Search for 'incubator'"

500 Results

Revised process:

  • Visit https://angel.co/search?q=incubator
  • Click More (a lot)
  • Save the HTML page as E:\projects\AngelList\AngelList.html
  • That gets you 500 (out of 1,447 claimed results)
  • Process the HTML using Regular Expressions to produce AngelListPages.txt, which is in the format:
    • URL\tConame
  • Note that restricting to "Companies" reduces it to 1,339 results.
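The regular-expression step above can be sketched as follows. This is a hypothetical sketch: the real markup of the saved results page would need to be inspected, and the "startup-link" class name and link structure assumed here are placeholders.

```python
import re

# Assumed markup: each result renders as an anchor like
#   <a class="startup-link" href="https://angel.co/company/foo">Foo Labs</a>
# The real AngelList HTML will differ, so the pattern would need adjusting.
RESULT_RE = re.compile(
    r'<a[^>]*class="[^"]*startup-link[^"]*"[^>]*href="([^"]+)"[^>]*>([^<]+)</a>'
)

def extract_pages(html):
    """Return 'URL\\tConame' lines, the format of AngelListPages.txt."""
    return [f"{url}\t{name.strip()}" for url, name in RESULT_RE.findall(html)]

sample = '<a class="startup-link" href="https://angel.co/company/foo">Foo Labs</a>'
```

Running `extract_pages` over the saved HTML would yield one tab-separated line per company, ready to write out to AngelListPages.txt.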

Failed workarounds

Tried a workaround with paged URLs. But at 40 results per page, page 13 ends with "No Results Yet" after clicking More, and page 14 opens with it. So results are still capped at 500.

It appears from the format of the results that AngelList has a type "incubator", though some likely incubators have other types (e.g., BMW iVentures Incubator is a "VC Firm" and Austin Technology Incubator is a "Company"). And I can't see a way to restrict search by type.

Signed up for an account as Ed Egan, ed@edegan.com, littleAmount. Then the link More -> Incubators takes you to https://angel.co/accelerators/apply, but there doesn't seem to be an advanced search. The count of incubator results increased while on the site!

400 Results

The page https://angel.co/incubators shows 6,054 companies. It stopped adding to the list after 20 More clicks, which turned out to be 400 results. Saved page as E:\projects\AngelList\Incubator - CompanyTypes - AngelList.html

Given the page title, this is likely just the "Incubator" company type organizations. However, there is some useful information that could be extracted from just that page. The incubator type also clearly includes accelerators and other things.

Possible Processes

In either of the cases below, we'd need a Selenium web driver to click More (a lot). For the latter case, we'd also need to get the URL encodings (probably by hand) for the state names we'd like to search.

Restricted Search

Tried searching "incubator TX", but it looks like only the name and text description are searched. Tried searching "incubator a", "incubator b", "incubator c", and each had fewer than 500 results, so that might work.
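The letter-suffix trick above splits the result set into per-letter buckets that each stay under the 500-result cap. Generating the queries is trivial; note the buckets overlap (a company can match several letters), so any results collected this way would need de-duplicating afterwards:

```python
import string

def keyword_queries(base="incubator"):
    """One search query per letter of the alphabet, e.g. 'incubator a',
    'incubator b', ... — each expected to return fewer than 500 results."""
    return [f"{base} {letter}" for letter in string.ascii_lowercase]

queries = keyword_queries()
```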

Company Search

https://angel.co/companies has a search function. You can select the type as Incubator and the location as US: https://angel.co/companies?company_types[]=Incubator&locations[]=1688-United+States This gives 993 companies...

It might be possible to go state by state. California has 385, Massachusetts has 36, New York has 141, etc. But again, this is limited to the incubator type.
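Building these filtered search URLs programmatically might look like the sketch below. The location code "1688-United States" is taken from the example URL above; the per-state codes would still need to be found by hand, as noted earlier.

```python
from urllib.parse import urlencode

BASE = "https://angel.co/companies"

def company_search_url(company_type=None, location=None):
    """Build a filtered AngelList company-search URL.

    urlencode percent-encodes the [] in the parameter names and turns
    spaces into '+', which is equivalent to the hand-written URL in the
    text (locations[]=1688-United+States)."""
    params = []
    if company_type:
        params.append(("company_types[]", company_type))
    if location:
        params.append(("locations[]", location))
    return f"{BASE}?{urlencode(params)}"

url = company_search_url("Incubator", "1688-United States")
```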

Crawler

We decided to build a web crawler using Selenium to search for incubators, using the AngelList companies domain https://angel.co/companies? with the locations[]= option appended to specify a state (50 states and the District of Columbia). The crawler loaded the specified page and then clicked the "load more" button while there were still more results to load. No state exceeded 500 results. The crawler then collected information for all of the companies listed, including the state, the name of the company, a brief description, and the URL for the company within AngelList. This information was stored in a tab-delimited text file.
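The "click load-more while there are more results" loop can be sketched as below. This is a skeleton, not the project's actual code: `driver` is expected to be a Selenium 3.x WebDriver, and the "more" class name is a placeholder for whatever selector the real page uses.

```python
def click_all_more(driver, button_class="more", max_clicks=200):
    """Keep clicking the 'More' button until it disappears.

    `driver` is duck-typed: anything exposing find_elements_by_class_name
    (the Selenium 3.x API) works. The class name 'more' is an assumption;
    the real button's selector would have to be inspected on the page.
    max_clicks is a safety cap so a stuck page can't loop forever.
    """
    clicks = 0
    while clicks < max_clicks:
        buttons = driver.find_elements_by_class_name(button_class)
        if not buttons:
            break  # button gone: all results are loaded
        buttons[0].click()
        clicks += 1
    return clicks
```

After the loop finishes, the fully expanded page can be scraped for the state, company name, description, and URL fields.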

Crawler By Company Type

This crawler appended company_types[]=Incubator to the URL so that only companies with the listed company type of Incubator appeared in the search results. It yielded 1068 results. The script (angelList_companyTypeIncubator.py) and the data it generated (AngelList_companyTypeIncubator.txt) are on the RDP in the folder AngelList.

Crawler By Keyword

This crawler clicked on the search bar and entered the keyword "incubator", so that the companies appearing in the results contained the keyword incubator somewhere on their company page. It yielded 840 results. The script (angelList_keywordIncubator.py) and the data it generated (AngelList_keywordIncubator.txt) are on the RDP in the folder AngelList.

Master File of Results

We performed a diff of the two files to create a master file with only unique results. The master file containing the unique results from the two crawlers contains 1512 results. We decided to drop the state when determining whether results were unique, because occasionally the same company would be listed under different states, leading to repeated results.
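The merge step above can be sketched as follows, assuming the column order the crawlers wrote (state, name, description, URL). The state column is excluded from the uniqueness key, as described:

```python
def merge_unique(lines_a, lines_b):
    """Merge two crawler outputs, keeping one row per company.

    Each line is 'state\\tname\\tdescription\\turl' (the assumed column
    order). Everything after the state column forms the uniqueness key,
    so the same company listed under two states appears only once, with
    the first-seen state kept.
    """
    seen = set()
    merged = []
    for line in list(lines_a) + list(lines_b):
        state, rest = line.rstrip("\n").split("\t", 1)
        if rest not in seen:
            seen.add(rest)
            merged.append(line.rstrip("\n"))
    return merged
```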

Saving AngelList Pages

Failed Attempts

The AngelList website was excellent at detecting bot activity and blocking our IP address. We attempted several different ways of downloading the pages from the master list, all of which were blocked by AngelList:

  • urllib from Python
  • the Scrapy web-crawling framework
  • accessing them directly with a curl/wget command

All three methods were blocked by the AngelList site, so we decided to use Selenium.

Selenium Script

The Selenium script to download the pages opens each URL and then saves the page in a data folder. It also checks for a reCAPTCHA and pauses the script so that the reCAPTCHA can be solved manually. Even using Selenium and manually solving reCAPTCHAs, AngelList would occasionally block our IP address, making it necessary to run the script in small batches, collecting only ~600 webpages before changing Wi-Fi networks. The Selenium code save_angelList_pages.py is in the RDP folder angelList.

Parsing Saved AngelList Pages

We used Beautiful Soup to iterate through the static HTML files that were saved from the AngelList website. We created three tab-separated text files. The first was populated via parse_company_info.py and contains basic information about the company, including the company name, a short description, the location, the company size, a URL to the company website, and the business tags on AngelList. The second was populated via parse_portfolio.py and contains the company name and the name of a portfolio company. The third was populated via parse_employees.py and contains the company name and the name of an employee/founder at the company. The three Python files and the data files they generated are in the RDP folder angelList.

E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\angelList
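The parsing step works on saved static HTML. The project used Beautiful Soup; the self-contained sketch below uses the stdlib html.parser instead, and assumes (hypothetically) that the company name sits in the page's first <h1>. The real AngelList markup differs, so the tag and class choices would need adjusting.

```python
from html.parser import HTMLParser

class NameExtractor(HTMLParser):
    """Grab the text of the first <h1>, standing in for the company name.

    Treating <h1> as the company name is an assumption for this sketch;
    the project's actual parsers used Beautiful Soup against the real
    AngelList page structure.
    """
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.name = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.name is None:
            self.in_h1 = True

    def handle_data(self, data):
        if self.in_h1 and data.strip():
            self.name = data.strip()
            self.in_h1 = False

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

def company_name(html):
    parser = NameExtractor()
    parser.feed(html)
    return parser.name
```

Each saved file would be read from the data folder, passed through parsers like this one, and the extracted fields appended to the tab-separated output files.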


Things to note/What needs work

The Selenium script to download the HTML files from AngelList cannot be run in one pass over the masterFile. The masterFile needs to be split into smaller files, which are then run on devices connected to different Wi-Fi networks to avoid being blocked.
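Splitting the master file into run-sized batches is straightforward; the sketch below uses the ~600-page limit observed before AngelList blocked an IP address (the batch size is a tunable assumption, not a hard limit):

```python
def split_batches(lines, batch_size=600):
    """Split the master file's rows into batches of at most batch_size,
    one batch per run/network, since AngelList tended to block an IP
    after roughly 600 downloads."""
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]
```

For the 1512-row master file this yields three batches (600, 600, and 312 rows), each of which can be written to its own file and run separately.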

The script parse_employees.py does not collect all the necessary information on the employees from the downloaded HTML files; there is a bug in the Beautiful Soup code.

How to Run

The following scripts were coded in a virtualenv on a Mac, using Python 3.6.5:

  • angelList_companyTypeIncubator.py
  • angelList_keywordIncubator.py
  • masterFile.py
  • save_angelList_pages.py
  • parse_company_info.py
  • parse_portfolio.py
  • parse_employees.py

The following packages were installed in that virtualenv:

  • beautifulsoup4 4.7.1
  • bs4 0.0.1
  • certifi 2019.3.9
  • chardet 3.0.4
  • idna 2.8
  • numpy 1.16.2
  • pandas 0.24.2
  • pip 19.1.1
  • python-dateutil 2.8.0
  • pytz 2019.1
  • requests 2.21.0
  • selenium 3.141.0
  • setuptools 41.0.0
  • six 1.12.0
  • soupsieve 1.9.1
  • urllib3 1.24.1
  • wheel 0.33.1

Summary of Python Files

angelList_companyTypeIncubator.py

  • input: text file with URL endings for states
  • output: tab separated text file (AngelList_companyTypeIncubator.txt)
  • description: Uses Selenium to search AngelList for companies with the type Incubator, using a list of the proper URL endings for the states (and Washington DC) to create the AngelList URL. It clicks the More button at the bottom of the screen when necessary. It stores the state, company name, short description, and URL to the company's page within AngelList in a tab separated text file.

angelList_keywordIncubator.py

  • input: text file with URL endings for states
  • output: tab separated text file (AngelList_keywordIncubator.txt)
  • description: Uses Selenium to search AngelList for companies that appear for the keyword "incubator", using a list of the proper URL endings for the states (and Washington DC) to create the AngelList URL. It clicks the More button at the bottom of the screen when necessary. It stores the state, company name, short description, and URL to the company's page within AngelList in a tab separated text file.

masterFile.py

  • inputs: two tab separated files (AngelList_companyTypeIncubator.txt, AngelList_keywordIncubator.txt)
  • outputs: one tab separated file (angelList_masterFile.txt)
  • description: masterFile.py performs a diff on the two tab separated files of AngelList data and creates a master file containing unique entries for use in save_angelList_pages.py.

save_angelList_pages.py

  • input: one tab separated file (angelList_masterFile.txt)
  • output: data folder containing html files
  • description: Uses Selenium to open the URL for each incubator's page within AngelList, then saves the webpage as an HTML file in a specified folder.

parse_company_info.py

  • input: path to data folder containing html files
  • output: tab separated file containing company info (angelList_companyInfo.txt)
  • description: Iterates through the saved AngelList files and collects information such as the company name, a short description, the location, the company size, the URL to the company website, and the business tags. It saves the information in a tab separated text file.

parse_portfolio.py

  • input: path to data folder containing html files
  • output: tab separated file containing portfolio info (angelList_portfolio.txt)
  • description: Iterates through the saved AngelList files and collects information on the company's portfolio, saving the company name and the portfolio company name in a tab separated text file.

parse_employees.py

  • input: path to data folder containing html files
  • output: tab separated file containing employee/founder info (angelList_employees.txt)
  • description: Iterates through the saved AngelList files and collects information on people who work at the company, saving the company name and the founder/employee name in a tab separated text file.