Difference between revisions of "Scraping GAN Data"
Revision as of 11:14, 19 June 2017
Overview
The goal of this project was to acquire accelerator data from the GAN members page: http://gan.co/members
Desired Fields
1. Name
2. Country/Location
3. Seed Money
4. Equity
5. Funding Raised
6. Companies
7. Companies Funded
8. Companies Funding Raised
9. Exits
10. Exit Funding
11. Employees
12. Mentors
13. Years
Scraper Use
The scraper is implemented in Python using BeautifulSoup, a library for parsing HTML and extracting data from it.
The scraper requires the following libraries:
1. Pandas
2. BeautifulSoup
It takes as input the full HTML file of the website, converts it to a "soup" object, and scrapes the desired fields from that object.
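The parsing step described above can be sketched as follows. This is only an illustration, not the project's actual scraper: the sample markup, the class names (member, name, location), and the function parse_members are all hypothetical, since the real structure of the GAN page is not documented here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the saved GAN page; the real tags
# and class names may be different.
SAMPLE_HTML = """
<div class="member">
  <h3 class="name">Example Accelerator</h3>
  <span class="location">Houston, USA</span>
</div>
<div class="member">
  <h3 class="name">Another Accelerator</h3>
  <span class="location">Berlin, Germany</span>
</div>
"""


def parse_members(html):
    """Convert raw HTML to a soup object and collect one dict per member."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="member"):
        name = card.find("h3", class_="name")
        location = card.find("span", class_="location")
        rows.append({
            # Guard against missing tags so one incomplete card
            # does not stop the whole scrape.
            "Name": name.get_text(strip=True) if name else None,
            "Location": location.get_text(strip=True) if location else None,
        })
    return rows


rows = parse_members(SAMPLE_HTML)
```

To run the same function on a saved page, read the file with an explicit encoding, for example open(path, encoding="utf-8").read(), and pass the resulting string to parse_members.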
The scraped items are stored in a Pandas DataFrame, which is then written to a tab-separated text file.
When writing the text file, explicitly set encoding = "utf-8", sep = "\t", and index = False.
These settings ensure, respectively, that strings are encoded correctly, that the file is tab-separated, and that the DataFrame's row index is not written as an extra column.
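A minimal sketch of the export step, assuming the scraped items are held as a list of dictionaries. The sample rows and the helper name write_tsv are hypothetical; the call itself is the standard Pandas DataFrame.to_csv with the three settings named above.

```python
import pandas as pd

# Hypothetical scraped rows; the real scraper collects all 13 desired fields.
rows = [
    {"Name": "Example Accelerator", "Location": "São Paulo, Brazil"},
    {"Name": "Another Accelerator", "Location": "Berlin, Germany"},
]


def write_tsv(rows, path):
    """Build a DataFrame from the scraped rows and write it as tab-separated text."""
    df = pd.DataFrame(rows)
    # encoding="utf-8" keeps non-ASCII characters intact,
    # sep="\t" makes the file tab-separated,
    # index=False leaves out the DataFrame's row-number column.
    df.to_csv(path, sep="\t", encoding="utf-8", index=False)
    return df
```

Calling write_tsv(rows, "accelerators.txt") (a hypothetical file name) writes a header line followed by one line per accelerator.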
Code Location and Necessary Files
The code and the resulting text file are located here:
E:\McNair\Projects\Accelerators\Web Scraping for Accelerators
The HTML file to scrape is located here:
E:\McNair\Projects\Accelerators\GAN_data.txt
To better understand what is being scraped, see the parser specs:
[[2]]