LinkedIn Crawler (Python)

From edegan.com
Revision as of 11:05, 17 April 2017 by Peterjalbert (talk | contribs)


McNair Project
Project Information
Project Title: LinkedIn Crawler (Python)
Start Date: March 2, 2017
Keywords: Selenium, LinkedIn, Crawler, Tool
Copyright © 2016 edegan.com. All Rights Reserved.


Overview

This page is dedicated to a new LinkedIn Crawler built using Selenium and Python. The goal of this project is to be able to crawl LinkedIn without being caught by LinkedIn's aggressive anti-scraping rules. To do this, we will use Selenium to behave like a human, and use time delays to hide bot-like tendencies.

The documentation for Selenium WebDriver can be found at http://selenium-python.readthedocs.io/index.html.

Relevant scripts can be found in the following directory:

E:\McNair\Projects\LinkedIn Crawler

The code from the original Summer 2016 Project can be found in:

web_crawler\linkedin

The next section will provide details on the construction and functionality of the scripts located in the linkedin directory.

According to the old documentation, the programs/scripts (see details below) are also located on our Bonobo Git Server:

repository: Web_Crawler
branch: researcher/linkedin
directory: /linkedin


Accounts

Test Account:

email: testapplicat6@gmail.com

pass: McNair2017

Real Account:

email: ed.edgan@rice.edu

pass: This area has intentionally been left blank.

LinkedIn Scripts

Overview

This section provides a file-by-file breakdown of the contents of the folder located at:

E:\McNair\Projects\LinkedIn Crawler\web_crawler\linkedin

The main script to run is:

run_linkedin_crawler.py

crawlererror.py

This script defines a simple class for error messages. The other scripts use it to raise errors to the user when something goes wrong with the crawler.
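As a rough illustration of what such an error class can look like, here is a minimal sketch. The class name CrawlerError, the optional url argument, and the require_element helper are assumptions made for this example and may not match the contents of crawlererror.py.

```python
class CrawlerError(Exception):
    """Raised when the crawler hits a problem it cannot recover from."""

    def __init__(self, message, url=None):
        super().__init__(message)
        self.message = message
        # Hypothetical extra context; not confirmed by the original script.
        self.url = url

    def __str__(self):
        if self.url:
            return "CrawlerError: {} (at {})".format(self.message, self.url)
        return "CrawlerError: {}".format(self.message)


def require_element(element, description):
    """Hypothetical helper: raise CrawlerError if a page lookup found nothing."""
    if element is None:
        raise CrawlerError("could not find " + description)
    return element
```

Other scripts could then wrap a failed lookup in require_element(...) so the user sees a readable message rather than a bare traceback.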

linked_in_crawler.py

This script constructs a class that provides navigation functionality around the traditional LinkedIn site. The script begins with a list of global XPath expressions, which Selenium uses throughout the process to locate elements within the HTML. The following are some important functions to keep in mind when designing original programs using this code.

login(self, username, password)

This function takes a username and password, and logs in to LinkedIn. During the process, the function calls the MouseMove class's move_random() function to move the mouse randomly across the screen.

logout(self)

This function logs out of LinkedIn. It works by clicking on the profile picture, and then selecting logout.

go_back(self)

This function goes back one page. It does not currently seem to work.

simple_search(self, query)

This function takes a string as a query, and searches for it using the search box. At the end of the function's run, a page with results relevant to the search query will be on the screen.

advance_search(self, query)

This function uses the advanced search feature of LinkedIn. Instead of a string, this function takes in a dictionary mapping predetermined keywords to their necessary values. This function has not been debugged yet.
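Since advance_search takes a dictionary rather than a string, it can help to check the dictionary before handing it to the crawler. The sketch below is only an illustration: the keys in ALLOWED_KEYS and the helper clean_advanced_query are hypothetical, and the real keywords are whatever linked_in_crawler.py defines.

```python
# Hypothetical keyword set; the real keys are defined in linked_in_crawler.py.
ALLOWED_KEYS = {"first_name", "last_name", "company", "title"}


def clean_advanced_query(query):
    """Validate an advanced-search dictionary and drop empty values."""
    if not isinstance(query, dict):
        raise TypeError("advanced search expects a dict, not a string")
    unknown = set(query) - ALLOWED_KEYS
    if unknown:
        raise ValueError("unknown search keys: " + ", ".join(sorted(unknown)))
    cleaned = {}
    for key, value in query.items():
        # Skip keys whose value is missing or only whitespace.
        if value and value.strip():
            cleaned[key] = value.strip()
    return cleaned
```

A call such as clean_advanced_query({"first_name": "Jane", "company": "ExampleCo"}) would return the same mapping with whitespace trimmed, while a misspelled key raises an error before any browser interaction starts.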

get_search_results_on_page(self)

This function is supposed to return all the search results on the current page. This function has not been debugged yet.

get_next_search_page(self)

This function is supposed to click and load the next search page if one exists. This function has not been debugged yet.
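Once the two functions above are debugged, they could be combined to walk through every page of results. The sketch below assumes that get_next_search_page returns True when a next page was loaded and False otherwise, which the source does not confirm. FakeCrawler and collect_all_results are hypothetical names, and the fake class only stands in for the real crawler so the loop can run without a browser.

```python
class FakeCrawler:
    """Stand-in for the crawler class; serves pre-built pages of results."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    def get_search_results_on_page(self):
        return list(self._pages[self._index])

    def get_next_search_page(self):
        # Assumed contract: True if a next page was loaded, else False.
        if self._index + 1 < len(self._pages):
            self._index += 1
            return True
        return False


def collect_all_results(crawler, max_pages=10):
    """Gather results page by page, stopping at the last page or max_pages."""
    results = []
    for _ in range(max_pages):
        results.extend(crawler.get_search_results_on_page())
        if not crawler.get_next_search_page():
            break
    return results
```

The max_pages cap is a design choice for this sketch: it keeps a crawl from running indefinitely if the next-page check misbehaves.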

linked_in_crawler_recruiter.py

This script constructs a class called LinkedInCrawlerRecruiter that implements functionality specifically for the Recruiter Pro feature of LinkedIn. Like the regular linked_in_crawler, the script begins with a list of relevant XPath expressions, followed by several functions, which are described below.

login(self, username, password)

This function logs into a normal LinkedIn account, and then launches the Recruiter Pro session from the LinkedIn home page. At the end of the function's run, there will be a window with the Recruiter Pro feature open, and Selenium will be focused on that window.

simple_search(self, query)

Similar to the original LinkedIn Crawler, this function implements a basic string query search for the Recruiter Pro feature. At the end of the function's run, a page with the results relevant to the search query will be on the screen.

help_search_handler_stuff(self)

This function performs extra actions on the current page in an attempt to appear more human. As of now, it uses a notes feature to randomly jot down notes on the current page.


utils.py

This file contains a few useful functions for waiting and moving the mouse. These are the functions that make the crawler behave like a human.

sleep_secs(secs)

This is a simple function that has the browser wait for a specified number of seconds.

sleep_rand(limit=__SLEEP_LIMIT__)

This function has the browser wait for a random amount of time less than the user-provided limit. If the user does not provide a limit, the browser waits for a random time less than 5 seconds.

move_strategy1(self)

This function belongs to the MouseMove class. It moves the mouse randomly across the window, using autopy so that the movement is visible to the user.

move_to(self, x=None, y=None)

This function belongs to the MouseMove class. Given x and y coordinates on the screen, it moves the mouse to that point.

move_random(self)

This function chooses a random MouseMove method and executes it.
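The relationship between the three MouseMove functions can be sketched as follows. This toy version records positions in a list instead of calling autopy, so it runs without a display. The constructor arguments, the path attribute, and the choice to fill in missing coordinates with random points are all assumptions for illustration.

```python
import random


class MouseMove:
    """Toy version that records moves instead of driving the real mouse."""

    def __init__(self, width=1280, height=800, rng=None):
        self.width = width
        self.height = height
        self.rng = rng or random.Random()
        self.path = []  # positions visited, in order

    def move_to(self, x=None, y=None):
        # Assumption: a missing coordinate becomes a random point in the window.
        if x is None:
            x = self.rng.randrange(self.width)
        if y is None:
            y = self.rng.randrange(self.height)
        self.path.append((x, y))

    def move_strategy1(self):
        # Visit a handful of random points across the window.
        for _ in range(self.rng.randint(2, 5)):
            self.move_to()

    def move_random(self):
        # Pick one of the available strategies at random and run it.
        strategies = [self.move_strategy1]
        self.rng.choice(strategies)()
```

Keeping the strategies in a list means a new movement pattern only needs to be written as a method and appended to that list.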

Constructing Your Query

Using Recruiter to search generic terms such as "CompanyName Founder" does not turn up valuable search results. For optimal performance, it is recommended that you determine through another source the exact person you are looking for. Methods to get such information will be listed below.
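Once the exact person is known, the query string can be assembled from the name and company rather than from a generic term. The helper below is a hypothetical sketch. The function name, its parameters, and the sample values in the usage note are illustrative and do not come from the project's scripts.

```python
def build_recruiter_query(full_name, company, title=None):
    """Join a known person's details into one search string."""
    if not full_name or not full_name.strip():
        raise ValueError("a full name is required")
    if not company or not company.strip():
        raise ValueError("a company name is required")
    parts = [full_name, company]
    if title and title.strip():
        parts.append(title)
    # Collapse repeated whitespace inside each part before joining.
    return " ".join(" ".join(part.split()) for part in parts)
```

For example, build_recruiter_query("Jane Doe", "ExampleCo", "Founder") produces a string that could then be passed to simple_search.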

Using Crunchbase

Currently, we have SnapShot data for the year 2013. This method works for the companies that existed in that period, but is not useful for any companies not listed in that given file. Ideally, we will be able to get data directly from Crunchbase. If not, one option is to crawl Crunchbase directly.

To what extent are we able to reproduce the network structure in LinkedIn? (From Previous)

Example 1: 1st degree contact. You are connected to his profile directly: Albert Nabiullin (485 connections)

Example 2: 2nd degree contact. You are connected to someone who is connected to him: Amir Kazempour Esmati (63 connections)

Example 3: 3rd degree contact. You are connected to someone who is connected to someone else who is connected to her: Linda Szabados (500+ connections)

Any profile at a distance greater than three is defined as outside your network.
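The degree definition above corresponds to shortest-path distance in a connection graph. The sketch below computes it with a breadth-first search on a small made-up graph. It only illustrates the definition: the graph, the names in it, and the function connection_degree are hypothetical, and nothing here reads data from LinkedIn.

```python
from collections import deque


def connection_degree(graph, start, target, max_degree=3):
    """Return the degree of `target` relative to `start`, or None if the
    target is more than `max_degree` steps away (outside the network)."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        if dist == max_degree:
            continue  # do not look beyond the network boundary
        for neighbor in graph.get(person, ()):
            if neighbor in seen:
                continue
            if neighbor == target:
                return dist + 1
            seen.add(neighbor)
            queue.append((neighbor, dist + 1))
    return None


# Made-up chain of connections: you - a - b - c - d
toy_graph = {
    "you": ["a"],
    "a": ["you", "b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}
```

In this toy graph, "a", "b", and "c" are 1st, 2nd, and 3rd degree contacts of "you", and "d" falls outside the network.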

Summary: Individual-specific network information is not accessible, even for first degree connections. Therefore, any plan to construct a network structure based on the connections of every individual is not feasible.

It seems that the only possible direction would be using the advanced search feature.