Changes

Crunchbase Accelerator Founders (view source)

Revision as of 16:23, 30 July 2018

13 bytes added , 16:23, 30 July 2018

no edit summary

Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt

linkedin_crawler_main.py contains the main function for the crawler. It opens a browser window, visits the known LinkedIn URL of each founder, and writes the information to the txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (the file might be called something else, but it should have "unavailable" in the name).
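The "search, then record missing founders" step can be sketched as below. This is only an illustration, not the project's actual code: `record_profiles`, `find_profile_url`, and the tab-separated file layout are assumptions made for the example, and the search itself is passed in as a function so that the sketch runs without a browser.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional, Tuple

Founder = Tuple[str, str]  # (founder name, company)


def record_profiles(
    founders: Iterable[Founder],
    find_profile_url: Callable[[str], Optional[str]],  # hypothetical search hook
    unavailable_path: Path,
) -> Dict[Founder, str]:
    """Return {founder: profile URL} and log founders with no search result."""
    found: Dict[Founder, str] = {}
    missing = []
    for name, company in founders:
        # The query is "founder name + company", as described above.
        url = find_profile_url(f"{name} {company}")
        if url is None:
            missing.append((name, company))
        else:
            found[(name, company)] = url
    # Assumed layout: one tab-separated line per founder that was not found.
    unavailable_path.write_text(
        "".join(f"{name}\t{company}\n" for name, company in missing),
        encoding="utf-8",
    )
    return found
```

In the real crawler, `find_profile_url` would wrap the search-box lookup; for a dry run, any function that returns a URL string or `None` (for example a dictionary's `.get`) can stand in for it.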

===linked_in_crawler.py===

This file contains the LinkedInCrawler class, which includes functions (login, logout, search, etc.) that the driver calls to crawl the LinkedIn website.

The class relies heavily on locating the XPath of the element that we want on the webpage. I ran into a lot of difficulty with this because the XPaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I located different aspects of the page that we could use to build an XPath to the element we were looking for, and I also located elements by their CSS.
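One way to avoid depending on dynamic ids is to match on part of a more stable attribute, or to fall back to a CSS selector. The helpers below are a minimal sketch of that idea, not the locators used in linked_in_crawler.py: the function names and the example tag, attribute, and class values are hypothetical. Each helper returns a `(strategy, expression)` pair in the form Selenium's `find_element` accepts, so the sketch runs without Selenium installed.

```python
from typing import Tuple

# (strategy, expression); "xpath" and "css selector" are the string values
# of Selenium's By.XPATH and By.CSS_SELECTOR.
Locator = Tuple[str, str]


def xpath_by_stable_attr(tag: str, attr: str, fragment: str) -> Locator:
    """XPath matching part of an attribute value instead of a dynamic id."""
    if "'" in fragment:
        # Keep the sketch simple: no escaping of quotes inside the XPath.
        raise ValueError("fragment must not contain a single quote")
    return ("xpath", f"//{tag}[contains(@{attr}, '{fragment}')]")


def css_by_class(tag: str, class_name: str) -> Locator:
    """CSS selector for an element with a recognizable class name."""
    return ("css selector", f"{tag}.{class_name}")
```

With a Selenium driver, a locator would be used as `driver.find_element(*xpath_by_stable_attr("input", "placeholder", "Search"))`. The attribute to match on has to be checked against the current page HTML, since the markup changes over time.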

I never figured out how to stop reCAPTCHA from losing connection, so I spent a lot of time completing reCAPTCHA tests.


GraceTan

108

edits
