Scraping GAN Data

From edegan.com
Revision as of 11:16, 19 June 2017 by AbhiB (talk | contribs)

Overview

The goal of this project was to acquire accelerator data from [1].

Desired Fields

 1. Name
 2. Country/Location
 3. Seed Money
 4. Equity
 5. Funding Raised
 6. Companies
 7. Companies Funded
 8. Companies Funding Raised
 9. Exits
 10. Exit Funding
 11. Employees
 12. Mentors
 13. Years

Scraper Use

The scraper is implemented in Python using BeautifulSoup, a Python library for parsing HTML.
The scraper requires the following libraries:

 1. Pandas
 2. BeautifulSoup

It takes as input the full HTML file of the website, parses it into a BeautifulSoup "soup" object, and scrapes the desired fields from that object.
The scraped items are stored in a Pandas DataFrame object, which is then written out as a tab-separated text file.
When writing the text file, make sure to set the following explicitly: encoding = "utf-8", sep = "\t", and index = False.
This ensures that the resulting strings are encoded properly, the file is tab separated, and the DataFrame's row index is left out of the output, respectively.
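
The workflow described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual scraper: the sample HTML, the class names "accelerator", "name", and "location", and the helper names scrape_accelerators and write_tsv are all assumptions made for the example, since the real page markup is not described here. Only two of the desired fields are shown.

```python
import os
import tempfile

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the real GAN page structure is not documented here,
# so these tags and class names are placeholders for illustration only.
SAMPLE_HTML = """
<html><body>
  <div class="accelerator">
    <span class="name">Example Accelerator</span>
    <span class="location">Houston, USA</span>
  </div>
  <div class="accelerator">
    <span class="name">Café Labs</span>
    <span class="location">Paris, France</span>
  </div>
</body></html>
"""


def scrape_accelerators(html):
    """Parse an HTML string and return one DataFrame row per accelerator."""
    # "html.parser" is Python's built-in parser, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="accelerator"):
        name = block.find("span", class_="name")
        location = block.find("span", class_="location")
        rows.append({
            "Name": name.get_text(strip=True) if name else None,
            "Location": location.get_text(strip=True) if location else None,
        })
    return pd.DataFrame(rows, columns=["Name", "Location"])


def write_tsv(df, path):
    """Write the DataFrame as a UTF-8, tab-separated file without the row index."""
    df.to_csv(path, sep="\t", encoding="utf-8", index=False)


if __name__ == "__main__":
    frame = scrape_accelerators(SAMPLE_HTML)
    out_path = os.path.join(tempfile.mkdtemp(), "accelerators.txt")
    write_tsv(frame, out_path)
    print(frame)
```

To run this against a saved page instead of the sample string, read the file with open(path, encoding="utf-8") and pass its contents to scrape_accelerators, after adjusting the tag and class names to match the real markup.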

Code Location and Necessary Files

The code and the resulting text file are located here:

   E:\McNair\Projects\Accelerators\Web Scraping for Accelerators

The html file to scrape is located here:

   E:\McNair\Projects\Accelerators\GAN_data.txt

To better understand what is being scraped, look under the parser specs here:

   [[2]]