== Parsing Saved AngelList Pages ==
We used Beautiful Soup to iterate through the static HTML files that were saved from the AngelList website and created three tab-separated text files. The first, populated by parse_company_info.py, contains basic information about each company: the company name, a short description, the location, the company size, a URL to the company website, and the business tags on AngelList. The second, populated by parse_portfolio.py, contains the company name and the name of each portfolio company. The third, populated by parse_employees.py, contains the company name and the name of each employee/founder at the company. The three Python files and the data files they generated are in the RDP folder angelList:
E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\angelList
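
All three parsers follow the same basic pattern: walk the folder of saved pages, parse each file with Beautiful Soup, and append one row per page to a tab-separated file. A minimal sketch of that shared pattern, assuming the saved pages sit in a local <code>data/</code> folder; the <code>extract_company_name()</code> helper and its selector are hypothetical placeholders, not the selectors used in the real scripts:

<syntaxhighlight lang="python">
import csv
import os
from bs4 import BeautifulSoup

DATA_DIR = "data"                       # folder of saved AngelList HTML files (assumed layout)
OUT_FILE = "angelList_companyInfo.txt"  # tab-separated output

def extract_company_name(soup):
    # Placeholder selector; the real scripts use selectors that match the saved pages.
    tag = soup.find("h1")
    return tag.get_text(strip=True) if tag else ""

with open(OUT_FILE, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["company_name"])   # header row; the real files carry more columns
    for filename in os.listdir(DATA_DIR):
        if not filename.endswith(".html"):
            continue
        with open(os.path.join(DATA_DIR, filename), encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        writer.writerow([extract_company_name(soup)])
</syntaxhighlight>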
 
 
== Things to note/What needs work ==
The Selenium script that downloads the HTML files from AngelList cannot be run in one pass over the full masterFile. The masterFile needs to be split into smaller files, which are then run on devices connected to different Wi-Fi networks to avoid being blocked.
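
A minimal sketch of one way to split the master file into smaller pieces before distributing it across machines; the chunk size and output file names are arbitrary choices, not part of the existing scripts, and this assumes the master file has a header row:

<syntaxhighlight lang="python">
import pandas as pd

CHUNK_SIZE = 500  # rows per split file; adjust to taste

master = pd.read_csv("angelList_masterFile.txt", sep="\t")
for i, start in enumerate(range(0, len(master), CHUNK_SIZE)):
    chunk = master.iloc[start:start + CHUNK_SIZE]
    chunk.to_csv(f"angelList_masterFile_part{i}.txt", sep="\t", index=False)
</syntaxhighlight>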
 
The script parse_employees.py does not collect all the necessary information on the employees from the downloaded HTML files; there is a bug in the Beautiful Soup code.
 
== How to Run ==
The following scripts were written in a virtualenv on a Mac using Python 3.6.5:
* angelList_companyTypeIncubator.py
* angelList_keywordIncubator.py
* masterFile.py
* save_angelList_pages.py
* parse_company_info.py
* parse_portfolio.py
* parse_employees.py
 
The following packages were installed in that virtualenv:
* beautifulsoup4 4.7.1
* bs4 0.0.1
* certifi 2019.3.9
* chardet 3.0.4
* idna 2.8
* numpy 1.16.2
* pandas 0.24.2
* pip 19.1.1
* python-dateutil 2.8.0
* pytz 2019.1
* requests 2.21.0
* selenium 3.141.0
* setuptools 41.0.0
* six 1.12.0
* soupsieve 1.9.1
* urllib3 1.24.1
* wheel 0.33.1
 
== Summary of Python Files ==
=== angelList_companyTypeIncubator.py ===
* input: text file with URL endings for states
* output: tab separated text file (AngelList_companyTypeIncubator.txt)
* description: Uses Selenium to search AngelList for companies with the company type "incubator", using a list of the proper URL endings for the states (and Washington, DC) to build each AngelList search URL. It clicks the "More" button at the bottom of the page when necessary and stores the results (state, company name, short description, and URL to the company's page within AngelList) in a tab-separated text file. A sketch of this Selenium pattern appears after the next script summary.
 
=== angelList_keywordIncubator.py ===
* input: text file with URL endings for states
* output: tab separated text file (AngelList_keywordIncubator.txt)
* description: Uses Selenium to search AngelList for companies that appear for the keyword "incubator", using the same list of state URL endings (including Washington, DC) to build each AngelList search URL. It clicks the "More" button at the bottom of the page when necessary and stores the results (state, company name, short description, and URL to the company's page within AngelList) in a tab-separated text file, as in the sketch below.
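
Both search scripts share the same Selenium pattern: load a state-specific results page, then keep clicking "More" until no further results load. A minimal sketch of that pattern; the example URL format and the CSS selector for the "More" button are assumptions, not the values used in the real scripts:

<syntaxhighlight lang="python">
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on the PATH

def load_all_results(url, more_button_css="div.more"):  # selector is a placeholder
    driver.get(url)
    while True:
        try:
            more = driver.find_element(By.CSS_SELECTOR, more_button_css)
        except NoSuchElementException:
            break              # no "More" button left, so all results are loaded
        more.click()
        time.sleep(2)          # give the page time to append the next batch of results
    # The result rows would then be scraped from driver.page_source and written to the TSV.

# Example usage with one state ending; the scripts loop over all 50 states plus DC.
load_all_results("https://angel.co/incubators/l/new-york")  # URL format is an assumption
</syntaxhighlight>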
 
=== masterFile.py ===
* inputs: two tab separated files (AngelList_companyTypeIncubator.txt, AngelList_keywordIncubator.txt)
* outputs: one tab separated file (angelList_masterFile.txt)
* description: masterFile.py performs a diff on the two tab-separated files of AngelList data and creates a master file containing the unique entries, for use in save_angelList_pages.py.
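
A minimal sketch of the merge-and-dedupe step using pandas, assuming the two input files have no header row and share the same four columns; the column names below are guesses used only for illustration:

<syntaxhighlight lang="python">
import pandas as pd

cols = ["state", "company_name", "description", "angelList_url"]  # assumed column layout

by_type = pd.read_csv("AngelList_companyTypeIncubator.txt", sep="\t", names=cols)
by_keyword = pd.read_csv("AngelList_keywordIncubator.txt", sep="\t", names=cols)

# Stack the two result sets and keep one row per unique entry.
master = pd.concat([by_type, by_keyword]).drop_duplicates()
master.to_csv("angelList_masterFile.txt", sep="\t", index=False)
</syntaxhighlight>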
 
 
=== save_angelList_pages.py ===
* input: one tab separated file (angelList_masterFile.txt)
* output: data folder containing html files
* description: Uses Selenium to open the URL of each incubator's page within AngelList and saves the rendered page as an HTML file in a specified folder.
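
A minimal sketch of the save step, assuming the master file has a header row with company-name and URL columns (the column names here are placeholders) and writing each rendered page into a <code>data/</code> folder:

<syntaxhighlight lang="python">
import os
import pandas as pd
from selenium import webdriver

OUT_DIR = "data"
os.makedirs(OUT_DIR, exist_ok=True)

master = pd.read_csv("angelList_masterFile.txt", sep="\t")
driver = webdriver.Chrome()  # assumes chromedriver is on the PATH

for _, row in master.iterrows():
    driver.get(row["angelList_url"])                    # column name is an assumption
    safe_name = row["company_name"].replace("/", "_")   # avoid slashes in file names
    out_path = os.path.join(OUT_DIR, safe_name + ".html")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(driver.page_source)

driver.quit()
</syntaxhighlight>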
 
=== parse_company_info.py ===
* input: path to data folder containing html files
* output: tab separated file containing company info (angelList_companyInfo.txt)
* description: Iterates through the saved AngelList files and collects information such as the company name, a short description, the location, the company size, a URL to the company website, and the business tags. It saves the information in a tab-separated text file.
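
A minimal sketch of pulling a few of these fields from one saved page; the CSS selectors below are hypothetical placeholders, not the selectors used in parse_company_info.py:

<syntaxhighlight lang="python">
from bs4 import BeautifulSoup

def parse_company_page(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    def text_or_blank(selector):
        tag = soup.select_one(selector)  # selectors are placeholders
        return tag.get_text(strip=True) if tag else ""

    return {
        "name": text_or_blank("h1"),
        "description": text_or_blank(".company-summary"),
        "location": text_or_blank(".location"),
        "tags": [t.get_text(strip=True) for t in soup.select(".tags a")],
    }
</syntaxhighlight>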
 
 
 
=== parse_portfolio.py ===
* input: path to data folder containing html files
* output: tab separated file containing portfolio info (angelList_portfolio.txt)
* description: Iterates through the saved AngelList files and collects information on each company's portfolio, saving the company name and each portfolio company name to a tab-separated text file.
 
 
=== parse_employees.py ===
* input: path to data folder containing html files
* output: tab separated file containing employee/founder info (angelList_employees.txt)
* description: Iterates through the saved AngelList files and collects information on the people who work at each company, saving the company name and each founder/employee name to a tab-separated text file.