Training data: 2,600 pages in total, and we are a little more than halfway through (~1,500-1,600 pages done).
==Finding Company URLs==
Excel master datasets are in:
E:\McNair\Projects\Accelerators\Summer 2018
Code and files specific to this URL finder are in:
E:\McNair\Projects\Accelerators\Summer 2018\url finder
====Results====
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1,000 more URLs to 'The File to Rule Them All.xlsx'.
====Testing====
In this file (sheet: 'Most Recent Merged Data'; note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):
E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx
We filter for companies (~4,000) that did not receive VC, are not in Crunchbase, and do not have URLs.
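As a rough illustration of this filter, here is a minimal sketch in plain Python. The field names (received_vc, in_crunchbase, url) and the sample rows are made up for the example; the actual column headers in the spreadsheet may differ.

```python
def needs_url(row):
    """True if the company received no VC, is not in Crunchbase, and has no URL.

    The keys used here are hypothetical, not the real spreadsheet headers.
    """
    return (
        not row.get("received_vc")
        and not row.get("in_crunchbase")
        and not (row.get("url") or "").strip()
    )


# Made-up sample rows, for illustration only.
rows = [
    {"company": "Acme Labs", "received_vc": False, "in_crunchbase": False, "url": ""},
    {"company": "Beta Co", "received_vc": True, "in_crunchbase": False, "url": ""},
    {"company": "Gamma Inc", "received_vc": False, "in_crunchbase": False,
     "url": "http://gamma.example"},
]

# Names of the companies that still need a URL lookup.
to_search = [r["company"] for r in rows if needs_url(r)]
```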
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.
To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as the search term and taking the first 4 valid results (see the "don't collect" list in the code). The Google crawler and URL matcher correctly identified around 20 URLs. They also misidentify some URLs that look very similar to the company name, but the results are accurate for the most part as long as the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + "startup" as the search term. This did not identify any more correct URLs.
It seems reasonable to assume that if a company's URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 companies whose URLs were not found in my test run above.
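The "first 4 valid results" rule can be sketched roughly as below. This is not the code in STEP1_crawl.py. The function name and the contents of the exclusion list are assumptions for illustration (the real "don't collect" list is in the script), and the sketch assumes each result URL includes its scheme (http:// or https://).

```python
from urllib.parse import urlparse

# Hypothetical exclusion list; the real "don't collect" list is in STEP1_crawl.py.
DONT_COLLECT = {"linkedin.com", "facebook.com", "twitter.com", "crunchbase.com"}


def first_valid_results(urls, limit=4, dont_collect=DONT_COLLECT):
    """Return the first `limit` URLs whose domain is not on the exclusion list."""
    valid = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        # Skip an excluded domain itself and any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in dont_collect):
            continue
        valid.append(url)
        if len(valid) == limit:
            break
    return valid
```

Given search results in ranked order, this keeps the ranking and simply skips directory and social media pages, so a company's own site is not crowded out of the first 4 slots.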
====Actual Run Info====
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.
Note that in the end, I decided to keep only URLs that were given a match score greater than 0.9.
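To show how a 0.9 cutoff behaves, here is a minimal sketch of name-to-URL matching. It is not the scoring used in STEP2_findcorrecturl.py, which is not described on this page. As a stand-in, it assumes the score is the difflib.SequenceMatcher similarity between the normalized company name and the first label of the URL's host. The function names match_score and best_url are hypothetical.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def match_score(company, url):
    """Similarity in [0, 1] between a company name and the first label of the URL host.

    Stand-in scoring for illustration; the real script may score differently.
    """
    name = re.sub(r"[^a-z0-9]", "", company.lower())
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # First host label, e.g. "acmelabs" for "acmelabs.com".
    label = re.sub(r"[^a-z0-9]", "", host.split(".")[0])
    if not name or not label:
        return 0.0
    return SequenceMatcher(None, name, label).ratio()


def best_url(company, candidates, threshold=0.9):
    """Return the best-scoring candidate if its score exceeds the threshold, else None."""
    best = max(candidates, key=lambda u: match_score(company, u), default=None)
    if best is not None and match_score(company, best) > threshold:
        return best
    return None
```

With a cutoff this strict, a partial match such as "acme.io" for "Acme Labs" is rejected, which trades some missed URLs for fewer false matches. One limitation of the sketch is that it only looks at the first host label, so a site hosted on a subdomain (for example blog.acme.com) would be scored on "blog".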