Difference between revisions of "LinkedIn Crawler (Python)"

From edegan.com
Revision as of 14:02, 21 September 2017


McNair Project
LinkedIn Crawler (Python)
Project Information
Project Title: LinkedIn Crawler (Python)
Start Date: April 3, 2017
Deadline:
Keywords: Selenium, LinkedIn, Crawler, Tool
Primary Billing
Notes
Has project status
Copyright © 2016 edegan.com. All Rights Reserved.


Overview

Files for this project can be found on our Git Server under the directory LinkedIn_Crawler.

This page is dedicated to a new LinkedIn Crawler built using Selenium and Python. The goal of this project is to be able to crawl LinkedIn without being caught by LinkedIn's aggressive anti-scraping rules. To do this, we will use Selenium to behave like a human, and use time delays to hide bot-like tendencies.

The documentation for Selenium WebDriver can be found at http://selenium-python.readthedocs.io/index.html.

Relevant scripts can be found in the following directory:

E:\McNair\Projects\LinkedIn Crawler

The resulting data for accelerator founders can be found at:

E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\accelerator_founders_data

The code from the original Summer 2016 Project can be found in:

web_crawler\linkedin

The next section will provide details on the construction and functionality of the scripts located in the linkedin directory.

The old documentation said that the programs/scripts (see details below) are located on our Bonobo Git Server.

repository: Web_Crawler
branch: researcher/linkedin
directory: /linkedin


Accounts

Test Account:

email: testapplicat6@gmail.com

pass: McNair2017

Real Account:

email: ed.edgan@rice.edu

pass: This area has intentionally been left blank.

LinkedIn Scripts

Overview

This section provides a file by file breakdown of the contents of the folder located at:

E:\McNair\Projects\LinkedIn Crawler\web_crawler\linkedin

The main script to run is:

run_linkedin_recruiter.py

run_linkedin_recruiter.py

This script runs the LinkedIn Recruiter crawler. At the top of the file, just below the imports, are three fields: username, password, and query_filepath. The username and password fields hold the credentials of the Recruiter Pro account you want to log into, and query_filepath is the path to a text file containing a list of properly formatted queries that can be read by the LinkedIn Crawler's simple_search method. The script defines the following functions.

main()

This function runs the LinkedIn Crawler and starts automatically when the script is called from the command line. If you only want to go through some of the queries, you can change the range of the slice in line 32, and if you only want to look at a certain number of search results, you can change the range of the slice in line 40.
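The query file and the two ranges described above can be sketched as follows. This is a minimal illustration, not the code in run_linkedin_recruiter.py: the helper names load_queries and select_queries are hypothetical, and the one-query-per-line file layout is an assumption.

```python
from pathlib import Path


def load_queries(query_filepath):
    """Read queries from a text file (assumed layout: one query per line)."""
    text = Path(query_filepath).read_text(encoding="utf-8")
    # Skip blank lines and trim surrounding whitespace.
    return [line.strip() for line in text.splitlines() if line.strip()]


def select_queries(queries, start=0, stop=None):
    """Return the part of the list to crawl (start inclusive, stop exclusive)."""
    return queries[start:stop]
```

For example, select_queries(queries, 10, 20) would restrict a run to the eleventh through twentieth queries, which is the kind of change the slice ranges in main() allow.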

open_new_window(driver, element)

This function shift-clicks a web element to open the link in a new window. It then switches the driver to the new window's handle. This makes it simple to view search results and close them quickly.

close_window_and_return(driver)

This function closes the current window, and returns to the main window. It is used in conjunction with open_new_window() to view search results and close them in an iterative manner.
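The open, view, close, and return cycle described above might look roughly like the sketch below. It is not the project's implementation: view_in_new_window, open_link, and visit are hypothetical names. The driver argument is expected to be a Selenium WebDriver, and only its current_window_handle, window_handles, switch_to.window(), and close() members are used.

```python
def view_in_new_window(driver, open_link, visit):
    """Open a link in a new window, run visit() there, then return to the main window.

    open_link: hypothetical callable that triggers the new window
               (for example, a shift-click on a search result).
    visit:     hypothetical callable that receives the driver and scrapes the page.
    """
    main_handle = driver.current_window_handle
    open_link()
    # Assumption: the window that just opened is the last handle that is not the main one.
    new_handle = [h for h in driver.window_handles if h != main_handle][-1]
    driver.switch_to.window(new_handle)
    try:
        return visit(driver)
    finally:
        # Always close the result window and give focus back to the main window.
        driver.close()
        driver.switch_to.window(main_handle)
```

Remembering the main handle before opening the link avoids relying on the order of window_handles when switching back.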

close_tab(driver)

When necessary, this function is used to close the current tab and return to the main tab. It is similar to close_window_and_return(). This function is used to log out of the account.

crawlererror.py

This script defines a simple class for error messages. Other scripts use it to raise errors to the user when something goes wrong with the crawler.
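A class of this kind can be very small. The sketch below is only an illustration: the class name CrawlerError and the optional url attribute are assumptions, not the contents of crawlererror.py.

```python
class CrawlerError(Exception):
    """Raised when the crawler cannot complete a step (hypothetical name)."""

    def __init__(self, message, url=None):
        super().__init__(message)
        self.message = message
        self.url = url  # assumed extra context: the page being crawled, if known

    def __str__(self):
        if self.url:
            return f"{self.message} (at {self.url})"
        return self.message
```

A caller could then write raise CrawlerError("search box not found", url=driver.current_url) and catch CrawlerError separately from other exceptions.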

linked_in_crawler.py

This script constructs a class that provides navigation functionality around the traditional LinkedIn site. The beginning section lists some global xpaths that will be used by Selenium throughout the process. These xpaths are used to locate elements within the HTML. The following are some important functions to keep in mind when designing original programs using this code.

login(self, username, password)

This function takes a username and password and logs in to LinkedIn. During the process, it calls the move_random() method of the MouseMove class to move the mouse randomly across the screen.

logout(self)

This function logs out of LinkedIn. It works by clicking on the profile picture, and then selecting logout.

go_back(self)

This function goes back a page. It does not currently seem to work.

simple_search(self, query)

This function takes a string as a query and searches for it using the search box. When the function finishes, a page of search results for the query will be on the screen.

advance_search(self, query)

This function uses the advanced search feature of LinkedIn. Instead of a string, this function takes in a dictionary mapping predetermined keywords to their necessary values. This function has not been debugged yet.

get_search_results_on_page(self)

This function is supposed to return all the search results on the current page. This function has not been debugged yet.

get_next_search_page(self)

This function is supposed to click and load the next search page if one exists. This function has not been debugged yet.

linked_in_crawler_recruiter.py

This script constructs a class called LinkedInCrawlerRecruiter that implements functionality specifically for the Recruiter Pro feature of LinkedIn. Similar to the regular linked_in_crawler, the program begins with a list of relevant xpaths. It is followed by multiple functions. Their functionalities are listed below.

login(self, username, password)

This function logs into a normal LinkedIn account and then launches the Recruiter Pro session from the LinkedIn home page. When the function finishes, a window with the Recruiter Pro feature will be open, and the Selenium driver will be focused on that window.

simple_search(self, query)

Similar to the original LinkedIn Crawler, this function implements a basic string query search for the Recruiter Pro feature. At the end of the function run, a page will be up with the relevant search results of the search query.

help_search_handler_stuff(self)

This function does some things on the current page in an attempt to appear more human. As of now, the function has a notes feature that will randomly jot down notes on the current page.


utils.py

This file contains a few useful functions for waiting and moving the mouse. These are the functions that make the crawler behave more like a human.

sleep_secs(secs)

This is a simple function that has the browser wait for a specified number of seconds.

sleep_rand(limit=__SLEEP_LIMIT__)

This function has the browser wait for a random amount of time less than the user provided limit. If the user does not provide a limit, the browser waits for a random time less than 5 seconds.

move_strategy1(self)

This is a function within the MouseMove class. This function moves the mouse randomly across the window. It uses autopy to move the mouse across the window visibly to the user.

move_to(self, x=None, y=None)

This is a function within the MouseMove class. Given x and y coordinates on the screen, it moves the mouse to that point.

move_random(self)

This function chooses a random MouseMove method and executes it.

web_driver.py

This file contains the relevant functions from the Selenium library that are used for web driving.

Constructing Your Query

Using Recruiter to search generic terms such as "CompanyName Founder" does not turn up valuable search results. For optimal performance, it is recommended that you determine through another source the exact person you are looking for. Methods to get such information will be listed below.

format_founders.py

Script location:

TBD

This Python script takes a text file of company names and uses the Crunchbase Snapshot to determine the founder names of each company. If Crunchbase does not have a record of the founder, it is unlikely that a generic search on LinkedIn will provide any useful results. The script returns a new text file with each company name replaced with "CompanyName Founder FounderName" for each founder of the company listed in the Crunchbase Snapshot. This new text file can then be used directly with the LinkedIn Crawler to generate accurate search results and retrieve accurate HTML pages.

The following lists the functionality of functions in the format_founders.py script.

create_pickle()

This function creates a pickled python dictionary of the Crunchbase Snapshot, people.csv. If a different dataset should be used in the future, one should pickle a dictionary in a similar fashion to this function, and then use that pickled result in the next function to reformat your queries.

reformat(pathname, output_filename)

This function takes the pathname of a text file and an output filename, and converts each entry in the text file into a searchable query using the data from the pickled Crunchbase Snapshot. The new text file with the corrected queries is saved to the output filename.
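The pickle-then-reformat flow can be illustrated with the sketch below. It is not the code in format_founders.py: the function names are hypothetical, and the dictionary shape (company name mapped to a list of founder names) is an assumption about what the pickled Crunchbase Snapshot contains. Companies with no founders on record are simply skipped here, which is also an assumption.

```python
import pickle


def save_founders(founders_by_company, pickle_path):
    """Pickle a dict mapping company name -> list of founder names (assumed shape)."""
    with open(pickle_path, "wb") as handle:
        pickle.dump(founders_by_company, handle)


def load_founders(pickle_path):
    """Load the dict written by save_founders()."""
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)


def build_queries(company_names, founders_by_company):
    """Produce one 'CompanyName Founder FounderName' query per known founder."""
    queries = []
    for company in company_names:
        for founder in founders_by_company.get(company, []):
            queries.append(f"{company} Founder {founder}")
    return queries
```

Writing the returned list to a file, one query per line, would give a text file in the form the crawler's query_filepath expects.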

Results with Accelerator Data

Of the 265 recorded accelerators we have data on, 94 of them have founders listed through the Crunchbase Snapshot. Some of these companies will have multiple founders with profiles, and some of these founders will not have LinkedIn profiles.

The final data is a text file with accelerator name, founder name, profile summary, experience, and education. It can be found at:

E:\McNair\Projects\Accelerators\LinkedIn Founders Data

Fall 2017

Accelerator Founders Search

These results are for the paper: The Jockey, The Horse, or the RaceTrack


Our LinkedIn Recruiter Pro account has expired. Unfortunately, it turns out that profiles cannot be viewed through LinkedIn if the target profile is 3rd degree away or further. However, a Google search on such a LinkedIn profile will still let you view the profile, provided that an account has been logged into prior to the search.

Piggybacking Google

In order to get our data, we will piggyback on Google's web crawler to work around the LinkedIn protective wall. The crawler begins by logging into our test LinkedIn Account (credentials displayed at the top), and then launching a Google search for each query. By adding "LinkedIn" before the query, and "Founder" after the query, we can turn up relevant search results. The top 5 results on Google search are explored, scraped, and saved.
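The query construction described above can be sketched as follows. The function name google_search_url is hypothetical, and building the search URL directly is an assumption made for illustration; the actual crawler may instead type the terms into the search box through Selenium.

```python
from urllib.parse import urlencode


def google_search_url(query):
    """Wrap a query as 'LinkedIn <query> Founder' and build a Google search URL."""
    search_terms = f"LinkedIn {query} Founder"
    return "https://www.google.com/search?" + urlencode({"q": search_terms})
```

For a query such as "Acme Jane Doe", the search terms become "LinkedIn Acme Jane Doe Founder", and the top results of that search are the pages that would be explored.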

We ultimately decided not to use the Google method for various reasons.

Crunchbase API

Instead, we opted to use Crunchbase data that we have access to through a license. The Crunchbase Data wiki page describes the data and how to use the API. The data can be accessed either through the web API (discussed on that page) or through the bulk download we have in our SQL server.

The web API has the nice added feature of having a Founders section. The API returns a JSON when a GET request is submitted using the correct company identifier. The Founders section of this JSON contains information on the Founders of the accelerator if Crunchbase has said data. Details about the data can be found on the Crunchbase Data Page.

The script that queried the API is called crunchbase_founders.py and can be found:

E:\McNair\Projects\Accelerators\crunchbase_founders.py

The resulting text file, founders_linkedin.txt, contains the names and LinkedIn URLs of founders after some additional work with the database. It can be found at:

E:\McNair\Projects\Accelerators\founders_linkedin.txt

Crawling LinkedIn

The next step of the process uses this data to get information about these founders from their LinkedIn profiles. For founders whose LinkedIn URLs we have, we will use those URLs. For those whose URLs we do not have, we will do a simple LinkedIn search with their name and accelerator name. The code for this crawler, linkedin_founders.py, can be found at:

E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\linkedin_founders.py

NOTE: Right now, this code needs to run in a virtual environment that contains Python3. This is due to the origins of the project, and this needs to be addressed when we have a lull in the development process. The only virtual environment we have managed to get working is on the Ubuntu machine sitting in the corner of the room.

Using the Ubuntu Virtual Environment

Step 1: Login using the researcher credentials. If you don't know what these are, ask someone.

Step 2: Open the command prompt. Type:

source dev/python3_venv_linkedin/bin/activate

Your prompt should now show (python3_venv_linkedin) next to any command you write. The virtual environment has been activated.

Step 3: Change directories to:

 ~/dev/web_crawler/linkedin

Step 4: All the files for any sort of LinkedIn Crawler are here. The file for this project is:

linkedin_founders.py

This file executes the crawler on all of the information stored in the file founders_linkedin.txt. Any file in which each line has the format company, tab, first name, tab, last name, tab, LinkedIn URL, newline will work. The output will be stored in founders_linkedin_main.txt, founders_linkedin_experience.txt, and founders_linkedin_education.txt.

Step 5: To run the file, enter:

python linkedin_founders.py

The crawler will begin running automatically.

Step 6: If you want to leave the virtual environment and return to the normal environment, simply enter the following in the command prompt:

deactivate
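The input format from Step 4, together with the rule of using a known LinkedIn URL when there is one and falling back to a name-and-accelerator search otherwise, can be sketched as below. This is an illustration only, not the code in linkedin_founders.py: Founder, read_founders, and crawl_target are hypothetical names, and skipping short rows is an assumption.

```python
import csv
from typing import NamedTuple


class Founder(NamedTuple):
    company: str
    first_name: str
    last_name: str
    linkedin_url: str  # empty string when no URL is recorded


def read_founders(lines):
    """Parse tab-separated lines: company, first name, last name, LinkedIn URL."""
    founders = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # assumption: skip blank or incomplete lines
        row = row + [""] * (4 - len(row))  # pad a missing URL column
        founders.append(Founder(*(field.strip() for field in row[:4])))
    return founders


def crawl_target(founder):
    """Return ('url', link) if a profile URL is known, else ('search', query)."""
    if founder.linkedin_url:
        return ("url", founder.linkedin_url)
    query = f"{founder.first_name} {founder.last_name} {founder.company}"
    return ("search", query)
```

Passing an open file handle for founders_linkedin.txt to read_founders() works because csv.reader accepts any iterable of lines.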

Crawling Google for unknown LinkedIn accounts

For accelerator founders without a recorded LinkedIn profile, a quick Google search will most likely find the correct page if the person has a LinkedIn profile. The script that runs this process is in the same folder and is called:

goog_linkedin_founders.py

This file uses the same formatted text file for its queries.


Previous Posts about the LinkedIn Crawler

To what extent are we able to reproduce the network structure in LinkedIn (From Previous)

Example 1: 1st degree contact- You are connected to his profile Albert Nabiullin (485 connections)

Example 2: 2nd degree contact- You are connected to someone who is connected to him Amir Kazempour Esmati (63 connections)

Example 3: 3rd degree contact- You are connected to someone who is connected to someone else who is connected to her. Linda Szabados(500+ connections)

Any profile with a distance greater than three is defined as outside your network.

Summary: Individual-specific network information is not accessible, even for first-degree connections. Therefore, any plan to construct a network structure based on the connections of every individual is not feasible.

It seems that the only possible direction would be using the advanced search feature.