Latest revision as of 12:41, 21 September 2020


Project
Crunchbase Accelerator Founders
Project Information
Has title Crunchbase Accelerator Founders
Has owner Grace Tan
Has start date 6/18/18
Has deadline date
Has project status Active
Dependent(s): Crunchbase Data, U.S. Seed Accelerators
Has sponsor McNair Center
Has project output Data


Related Pages

Crunchbase Data

Crunchbase Accelerator Equity

LinkedIn Crawler (Python)

Project Introduction

This project uses the Crunchbase Data and API to find founders of the accelerators we are interested in. We then take the founders and run their names through the LinkedIn Crawler to find information about them.

Part 1: Getting Data

To get the founder UUIDs for each accelerator, insert the accelerator UUID (or its name in all lowercase, if the name is one word) into this link:

 https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&user_key=662e263576fe3e4ea5991edfbcfb9883

We use the API because it returned more results than searching the people table in the Crunchbase database for the keyword "founders."
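
The link above can be assembled programmatically. The following is a minimal sketch, not the project's actual code: the function name founders_url and the placeholder values are assumptions made for illustration, and the user key is passed in as a parameter rather than hard-coded.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://api.crunchbase.com/v3.1/organizations/"


def founders_url(org_id, user_key):
    """Build the founders-relationship URL for one organization.

    org_id is the accelerator UUID, or its lowercase one-word name.
    founders_url is a hypothetical helper, not part of scrapefounders.py.
    """
    query = urlencode({"relationships": "founders", "user_key": user_key})
    return API_ROOT + quote(org_id.strip(), safe="") + "?" + query


if __name__ == "__main__":
    # "techstars" and "YOUR_USER_KEY" are placeholders, not real values.
    print(founders_url("techstars", "YOUR_USER_KEY"))
```

The resulting string can be passed to any HTTP client, for example urllib.request.urlopen.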

scrapefounders.py

This code is located in:

 Z:\crunchbase2\scrapefounders.py

This program reads Accelerators and UUIDs.txt (found at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the information for each founder from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.
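
The shape of that dictionary can be sketched as follows. This is an illustration under stated assumptions rather than the contents of scrapefounders.py: it assumes each line of the input file looks like "name<TAB>uuid", and that the API response nests founders under data, relationships, founders, items. All function names are hypothetical.

```python
import json


def read_accelerator_uuids(lines):
    """Return accelerator UUIDs from lines shaped like 'name<TAB>uuid' (assumed layout)."""
    uuids = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[1].strip():
            uuids.append(parts[1].strip())
    return uuids


def founder_uuids(response_text):
    """Pull founder UUIDs out of one API response body.

    Assumes the JSON nests founders under
    data -> relationships -> founders -> items (an assumption, not confirmed here).
    """
    payload = json.loads(response_text)
    founders = payload.get("data", {}).get("relationships", {}).get("founders", {})
    return [item["uuid"] for item in founders.get("items", []) if "uuid" in item]


def build_founder_map(accelerator_uuids, fetch):
    """Map each accelerator UUID to its founder UUIDs.

    fetch(uuid) is any callable that returns the response text for that UUID.
    """
    return {uuid: founder_uuids(fetch(uuid)) for uuid in accelerator_uuids}
```

Passing the download step in as fetch keeps the parsing logic testable without network access.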

Part 2: Updated LinkedIn Crawler

We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in

 E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin

My code is on the selenium computer at the root and at

 E:\McNair\Projects\LinkedIn Crawler 2018

There are 6 python files in the LinkedIn Crawler 2018 directory.

linkedin_crawler_main.py

This contains the main function that runs the LinkedIn Crawler. It includes two test accounts at the top, which I alternated between to keep LinkedIn from flagging me.

Inputs (set outside of the function): username (of the test account), password (of the test account), query_filepath (txt file that includes the accelerator name, first_name, last_name, linkedin_url)

Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt

The main function opens a browser window, visits the known LinkedIn URL of each founder, and writes the information into the txt files. It then uses the search box to search for founder name + company, selects the href from the result HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (the file might be called something else, but it should have "unavailable" in the name).
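
One reading of this flow is sketched below: use the known URL when there is one, fall back to search otherwise, and set aside founders with no search results. The crawler methods scrape_profile and search, and both function names, are hypothetical stand-ins and not the real LinkedInCrawler interface.

```python
def crawl_founders(rows, crawler):
    """Scrape each founder, falling back to search when no URL is known.

    rows: iterable of (accelerator, first_name, last_name, linkedin_url).
    crawler: any object with scrape_profile(url) and search(query), where
             search returns a profile URL or None (hypothetical interface).
    Returns (profiles, unavailable).
    """
    profiles = []
    unavailable = []
    for accelerator, first_name, last_name, linkedin_url in rows:
        url = linkedin_url.strip()
        if not url:
            # No known URL: search for founder name + company instead.
            url = crawler.search(f"{first_name} {last_name} {accelerator}")
        if not url:
            # No search results: keep the founder for the "unavailable" file.
            unavailable.append((f"{first_name} {last_name}", accelerator))
            continue
        profiles.append(crawler.scrape_profile(url))
    return profiles, unavailable


def write_unavailable(path, unavailable):
    """Append one 'name<TAB>company' line per founder that could not be found."""
    with open(path, "a", encoding="utf-8") as handle:
        for name, company in unavailable:
            handle.write(f"{name}\t{company}\n")
```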

linked_in_crawler.py

This contains the LinkedInCrawler class, which includes the functions (login, logout, search, etc.) that the driver calls to crawl the LinkedIn website.

The class relies heavily on locating the XPath of the element we want on the webpage. I ran into a lot of difficulty with this because the XPaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other attributes we could use to build an XPath to the element we were looking for, and I also located elements by their CSS.
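
A minimal sketch of that workaround is shown here: match on a stable fragment of the class attribute instead of a dynamic id, and fall back to a CSS selector if the XPath finds nothing. The helper names and the class names in NAME_LOCATORS are placeholders invented for illustration; they are not taken from linked_in_crawler.py or from LinkedIn's actual HTML.

```python
def xpath_class_contains(tag, fragment):
    """XPath matching a tag whose class attribute contains a stable fragment."""
    return f"//{tag}[contains(@class, '{fragment}')]"


def find_first(driver, locators):
    """Try (strategy, expression) pairs in order; return the first match or None.

    driver is expected to expose find_elements(strategy, expression), as a
    Selenium WebDriver does. "xpath" and "css selector" are the string values
    of Selenium's By.XPATH and By.CSS_SELECTOR.
    """
    for strategy, expression in locators:
        matches = driver.find_elements(strategy, expression)
        if matches:
            return matches[0]
    return None


# Hypothetical locators for a profile name; the class names are placeholders.
NAME_LOCATORS = [
    ("xpath", xpath_class_contains("h1", "profile-name")),
    ("css selector", "section.profile h1"),
]
```

With a live Selenium driver, find_first(driver, NAME_LOCATORS) would return the first matching element, or None if neither locator matches.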

New Test Account

 Username: mcboatfaceboaty670@gmail.com
 Password: McNair2018

Obstacles and Notes

Use the selenium computer on Rice Visitor wifi.

After you log in a couple of times, LinkedIn gets suspicious and asks you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, reCaptcha sometimes loses connection and forces you to repeat the tests, which can be frustrating. When this happens, I disconnect from and reconnect to the wifi, and switch between the test accounts.

I never figured out how to stop reCaptcha from losing connection, so I spent a lot of time completing reCaptcha tests.
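
The 3-minute delay could also be written as a polling wait that resumes as soon as the test has been completed by hand. This is a variation on the fixed delay described above, not the code that was used; the function name and the captcha_present callback are assumptions.

```python
import time


def wait_for_captcha(captcha_present, timeout=180, poll=5, sleep=time.sleep):
    """Pause until captcha_present() returns False or timeout seconds pass.

    captcha_present is a caller-supplied check (hypothetical), for example
    one that looks for the reCaptcha frame on the current page.
    Returns True if the captcha cleared, False if the timeout ran out.
    """
    waited = 0
    while captcha_present():
        if waited >= timeout:
            return False
        sleep(poll)
        waited += poll
    return True
```

The default timeout of 180 seconds matches the 3-minute delay; a False return is a natural point to switch test accounts.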