Difference between revisions of "URL Finder (Tool)"
Maxine.tao (talk | contribs) |
Revision as of 10:15, 2 August 2018
URL Finder (Tool) | |
---|---|
Project Information | |
Project Title | URL Finder (Tool) |
Owner | Veeral Shah |
Start Date | Summer 2016 |
Deadline | |
Keywords | URL, Webcrawler, Tool |
Primary Billing | |
Notes | |
Has project status | Complete |
Copyright © 2016 edegan.com. All Rights Reserved. |
There are four types of URL Finders. All of them obtain URLs, but by different methods and for distinct purposes. Some of this work was recompiled and edited during Summer 2018. See below for more information.
All of the URL Finders are found in Bulk->McNair->Software->Scripts->URL Finders:
E:\McNair\Software\Scripts\URLFinders\URL Matcher.py
E:\McNair\Software\Scripts\URLFinders\AboutPageFinder.py
E:\McNair\Software\Scripts\URLFinders\URL Compiler.py
E:\McNair\Software\Scripts\URLFinders\Specific Search URL Finder.py
Summer 2018 URL Finder work
Excel master datasets are in:
E:\McNair\Projects\Accelerators\Summer 2018
Code and files specific to this URL finder are in:
E:\McNair\Projects\Accelerators\Summer 2018\url finder
Results
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.
Testing
The test data is in the 'Most Recent Merged Data' sheet of the file below (note that this sheet is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):
E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs. Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we try to find as many URLs as possible.
To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as the search term and taking the first 4 valid results (see the DONT_COLLECT list in the code). The Google crawler and URL matcher correctly identified around 20 URLs. They also misidentify some URLs that look very similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + "startup" as the search term. This did not identify any more correct URLs.
It seems reasonable to assume that if a company's URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 companies whose URLs were not found in the test run above.
Actual Run Info
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.
In the end, I kept only URLs with a match score greater than 0.9, a restriction set in STEP2_findcorrecturl.py. I then manually removed any duplicates and inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company.
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser.
Using Python files
To use STEP1_crawl.py:
INPUT: a list of company names (or anything else) that you would like to find websites for by searching on Google
OUTPUT: a list of company names and the top X results from Google
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search.
2. Change NUMRESULT to be however many results you would like from Google.
3. Adjust DONT_COLLECT to include any websites that you don't want.
4. If you would like to add another search keyword, add it on line 87, which is queries.append(name + "whatever you want here").
5. Change line 127 to be the name of your output file.
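The settings in the steps above can be pictured with a minimal sketch. NUMRESULT, DONT_COLLECT and the queries.append line are names taken from the script. The helper functions build_queries and keep_valid_results, and the example values, are hypothetical, and the Google search itself is left out. The sketch is written for Python 3.

```python
from urllib.parse import urlparse

# Placeholder values; the real ones live in STEP1_crawl.py.
NUMRESULT = 4
DONT_COLLECT = ["linkedin.com", "facebook.com"]  # assumed examples only


def build_queries(names, extra_keyword=""):
    """Return one search query per name, optionally with an extra keyword."""
    queries = []
    for name in names:
        queries.append((name + " " + extra_keyword).strip())
    return queries


def keep_valid_results(urls, dont_collect=DONT_COLLECT, num_results=NUMRESULT):
    """Skip URLs hosted on a don't-collect site and keep the first N others."""
    valid = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if any(host == d or host.endswith("." + d) for d in dont_collect):
            continue
        valid.append(url)
        if len(valid) == num_results:
            break
    return valid
```

Filtering on the host name rather than the whole URL avoids discarding a company site just because a blocked name appears somewhere in its path.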
To use STEP2_findcorrecturl.py:
INPUT: output file from STEP1
OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set are replaced with "no match"
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part.
2. In the if statement on line 44, set your desired threshold. Anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75, and my choice of 0.9 ensures almost exact matches. However, if your list of companies (or whatever you are searching) includes very common names, then even a 90%+ match might not be the site you are looking for.
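To make the threshold concrete, here is a minimal sketch of score-based matching. It assumes the score is a string-similarity ratio like the one returned by difflib.SequenceMatcher.ratio(), compared against the main label of the URL's host name. The page does not say how STEP2_findcorrecturl.py computes its score, so the function names and the comparison itself are assumptions.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

THRESHOLD = 0.9  # the value used for the actual run


def domain_label(url):
    """Return the first host label, e.g. 'acmelabs' for https://www.acmelabs.com/."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def match_score(company, url):
    """Similarity in [0, 1] between the squashed company name and the domain label."""
    name = "".join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, domain_label(url)).ratio()


def filter_matches(company, urls, threshold=THRESHOLD):
    """Replace every URL that does not score above the threshold with 'no match'."""
    return [u if match_score(company, u) > threshold else "no match" for u in urls]
```

With this kind of score, a generic name can match an unrelated site's domain just as well as the right one, which is why a high score is not proof of a correct match.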
To use STEP3_clean.py:
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2.
1. Change file f to be the output file from STEP2. Before running this step, delete every entry that says "no match", and make sure STEP2 also writes the ratio score to its text file. Change g to be the desired name of the output file for this part.
Your output should be a text file containing each company name and the URL that had the highest assigned score in STEP2. If more than one URL has the highest score, the script should take the first one.
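The selection rule described above can be sketched as follows. The layout of the STEP2 output file is not documented here, so this hypothetical helper works on already-parsed (company, url, score) tuples instead of reading the file.

```python
def best_url_per_company(rows):
    """rows: iterable of (company, url, score). Returns {company: (url, score)}."""
    best = {}
    for company, url, score in rows:
        # Strict '>' keeps the first URL seen when two scores tie.
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return best
```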
URL FINDER #1 - URL Matcher
Description
Notes: The URL Finder Tool is an automated program that locates, retrieves and matches URLs to their corresponding startup companies using the Google API. Developed in Python 2.7.
Input: CSV file containing a list of startup company names
Output: Matched URL for each company in the CSV file.
How to Use
1) Assign "input_path" = the input CSV file address
2) Assign "out_path" = the file address in which to dump all the downloaded JSON files.
3) Assign "output_path" = the new output file address
4) Run the program
Development Notes
7/7: Project start
- I am using the pandas library to read and write CSV files, starting with the input CSV file. From there, I simplify the company names using several functions from the helper program, glink, which remove company identifiers such as "Co.", "INC." and "LLC." and put the names in a form that the Google Search API can handle.
- I am then searching each company name into the Google Search API and collecting a number of URLs that come up from the custom search. All of these URLs are put into a JSON file.
- Attempted to use program on 1500 Startup company names but ran into a KeyError with the JSON files. I am not able to access specific keys in each data
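The name simplification mentioned in the 7/7 notes can be sketched as below. The glink functions themselves are not shown on this page, so simplify_name and the SUFFIXES set are hypothetical stand-ins, written for Python 3.

```python
import re

# Assumed list of identifiers; the glink helpers may use a different set.
SUFFIXES = {"co", "inc", "llc", "ltd", "corp"}


def simplify_name(name):
    """Strip punctuation and any trailing company identifiers from a name."""
    words = re.sub(r"[^\w\s&-]", " ", name).split()
    while words and words[-1].lower() in SUFFIXES:
        words.pop()
    return " ".join(words)
```

Only trailing identifiers are removed, so a name that merely contains a word like "Co" in the middle is left alone.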
7/8:
- Created conditionals for keys in JSON dictionaries. Successfully ran the tool on my 50 companies and then again on 1500 companies. Changed ratio to .75 and higher to elicit URLs that were close but not exact and got more results.
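A minimal sketch of the kind of key check described in the 7/8 note, assuming each saved JSON file holds one search response with an optional "items" list whose entries carry a "link" key. That layout and the function name extract_links are assumptions, not taken from the script.

```python
def extract_links(response):
    """Return the result URLs from one parsed search response, or [] if none."""
    links = []
    # .get() avoids a KeyError when a response has no results or an entry has no link.
    for item in response.get("items", []):
        link = item.get("link")
        if link:
            links.append(link)
    return links
```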
7/14 - About Us URL Finder
- Created a function, "about_us_url", that takes the url of a company obtained using the above function and identifies if the company has an "about" page.
- The function tests whether the company URL exists with either "about" or "about-us" as the sub-URL. If it does, the new URL is placed next to the original URL in a new column, "about_us_url".
7/18 - Company Description Finder
- Created a function, "company_description", that takes a URL and returns all of the substantial text blocks on the page (used to find company descriptions).
- Uses BeautifulSoup to access and explore HTML files.
- The function explores the HTML source code of the URL and finds all parts of the source code with the <p> tag, which indicates a text paragraph.
- Then, the function goes through each paragraph, and if it is above a certain number of characters (to eliminate short, unnecessary text), the function adds it in a new column of the CSV file under "description".
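The paragraph filter described in these notes can be sketched as follows. The original uses BeautifulSoup; to stay self-contained this sketch swaps in the standard library's html.parser, so it illustrates the idea rather than the actual function. The 300-character default mirrors the cutoff given for the AboutPageFinder, and the names ParagraphCollector and long_paragraphs are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def _flush(self):
        # Join the pieces of one paragraph and normalise its whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:  # a new <p> implicitly closes the previous one
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def long_paragraphs(html, min_chars=300):
    """Return the <p> text blocks that are at least min_chars long."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return [p for p in parser.paragraphs if len(p) >= min_chars]
```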
URL Finder #2 - AboutPageFinder
Description
Notes: The AboutPageFinder is an automated algorithmic program to match company URLs to their corresponding company About pages and extract company descriptions from these About pages. Developed through Python 2.7.
Input: CSV file with a column of URLs under the column name "url"
Output: The About page URL and company description for each company URL in the input, where an About page exists.
Process:
- From that csv file, the program takes the URL strings under the column name "url".
- The program then adds "about" to the end of the URL as a sub-URL and checks if the site exists.
- If the site exists, the program returns the about page URL next to the original URL.
- If the site does not exist, the program adds "about-us" instead and checks again, returning the new URL if it exists and returning an empty string if not.
- The program then sifts the HTML of the About Page and returns all text blocks of 300 characters or more to obtain company descriptions.
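The "about", then "about-us" lookup in the process above can be sketched like this. How the program checks that a page exists is not described, so the check is passed in as a function. find_about_page and the exists parameter are hypothetical names.

```python
from urllib.parse import urljoin

ABOUT_PATHS = ("about", "about-us")  # tried in this order, as in the process above


def find_about_page(url, exists):
    """Return the first candidate About URL for which exists(candidate) is true, else ''."""
    base = url if url.endswith("/") else url + "/"
    for path in ABOUT_PATHS:
        candidate = urljoin(base, path)
        if exists(candidate):
            return candidate
    return ""
```

In a real run, exists would wrap an HTTP request. Keeping it as a parameter lets the lookup logic be tested without network access.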
Includes:
E:\McNair\Software\Scripts\Gscript2.py
URL Finder #3 - URL Compiler
Description
Notes: The URL Compiler is an automated algorithmic program to search company names through the Google Custom Search API and compile the first 10 URLs received, matching each URL to the company search and its result order number (1st result; 2nd result; etc). Developed through Python 2.7.
Input: CSV file with a list of company names in a column.
Output: New CSV file with three columns:
- Company name
- URL Result
- Result Order Number
URL Finder #4 - Specific Search URL Finder
Description
Notes: The Specific Search URL Finder is an automated algorithmic program to process specific searches through the Google Custom Search API and compile up to the first 10 URLs received, matching each URL to the company search and its result order number (1st result; 2nd result; etc). Developed through Python 2.7.
Input: CSV file with search text strings in the first column
Output: New CSV file with three columns:
- Original Search Text
- URL Result
- Result Order Number
How To Use
The program takes the input of a csv file
- From that csv file, the program takes the search strings in the first column of that csv file and puts the searches through the google custom search API and compiles the results.
- From there, the program forms a new csv file with 4 columns
- In the first column is the search string from the input file.
- In the second is all of the URL results that came up from the google custom search.
- In the third is a short description of the content of the URL in column 2.
- In the fourth is a number indicating the order of the search result. For example, a “1” would indicate that the URL in that row was the first result that came up from searching the corresponding search string in the row.
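The four-column layout described above can be sketched as below. It assumes each search result is a dict with "link" and "snippet" keys, and the header names are made up for the example. Neither detail is taken from the script.

```python
import csv
import io


def result_rows(search_text, items, max_results=10):
    """Build (search, url, description, order) rows from one search's results."""
    rows = []
    for order, item in enumerate(items[:max_results], start=1):
        rows.append((search_text, item.get("link", ""), item.get("snippet", ""), order))
    return rows


def write_rows(rows, handle):
    """Write a header line plus the rows to an open text file handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["search", "url", "description", "order"])
    writer.writerows(rows)


# Usage: write to an in-memory buffer instead of a real output file.
buffer = io.StringIO()
write_rows(result_rows("acme labs", [{"link": "https://a.com", "snippet": "A site"}]), buffer)
```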
You only have to adjust three parts of the code to make the program work.
- Replace the address in the input_file read with the address of your input CSV file.
- Replace the address in output_file with the address of the new CSV file you want to write the data to.
- Lastly, the program needs a place to store all of its results from the Google searches. Replace the address in out_path with the address of wherever you want to store them.