LinkedIn Crawler (Python)
LinkedIn Crawler (Python) | |
---|---|
Project Information | |
Project Title | LinkedIn Crawler (Python) |
Start Date | March 2, 2017 |
Deadline | |
Keywords | Selenium, LinkedIn, Crawler, Tool |
Primary Billing | |
Notes | |
Has project status | |
Copyright © 2016 edegan.com. All Rights Reserved. |
Overview
This page documents a new LinkedIn crawler built with Selenium and Python. The goal of the project is to crawl LinkedIn without triggering LinkedIn's aggressive anti-scraping measures. To do this, the crawler uses Selenium to drive a real browser the way a human would, and inserts time delays between actions to hide bot-like behavior.
The documentation for Selenium WebDriver can be found at http://selenium-python.readthedocs.io/index.html.
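The time delays described above can be sketched as a small helper that pauses for a random interval between browser actions. This is a minimal illustration rather than the project's actual code: the name `human_delay`, its default bounds, and the injectable `sleep` parameter are all assumptions made for this example.

```python
import random
import time


def human_delay(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Pause for a random duration and return the number of seconds waited.

    The bounds are illustrative assumptions, not values tuned against LinkedIn.
    `sleep` can be replaced (for example in tests) to avoid really waiting.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    seconds = random.uniform(min_seconds, max_seconds)
    sleep(seconds)
    return seconds
```

Calling `human_delay()` between page loads, clicks, and keystrokes avoids the fixed, regular timing that tends to make automated traffic easy to spot.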
Relevant scripts can be found in the following directory:
E:\McNair\Projects\LinkedIn Crawler
The code from the original Summer 2016 Project can be found in:
web_crawler\linkedin
The next section will provide details on the construction and functionality of the scripts located in the linkedin directory.
The old documentation said that the programs/scripts (see details below) are located on our Bonobo Git Server.
repository: Web_Crawler
branch: researcher/linkedin
directory: /linkedin
Accounts
Test Account:
email: testapplicat6@gmail.com
pass: McNair2017
Real Account:
email: ed.edgan@rice.edu
pass: This area has intentionally been left blank.
LinkedIn Scripts
Overview
This section provides a file-by-file breakdown of the contents of the folder located at:
E:\McNair\Projects\LinkedIn Crawler\web_crawler\linkedin
The main script to run is:
run_linkedin_crawler.py
crawlererror.py
This script defines a simple class for crawler error messages. Other scripts use it to raise errors to the user when something goes wrong in the crawler.
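An error class of this kind might look like the sketch below. The class name `CrawlerError` and the optional `url` attribute are hypothetical and chosen for illustration; the actual contents of crawlererror.py may differ.

```python
class CrawlerError(Exception):
    """Raised when the crawler cannot complete a step (hypothetical name)."""

    def __init__(self, message, url=None):
        super().__init__(message)
        self.message = message
        self.url = url  # page being visited when the error occurred, if known

    def __str__(self):
        if self.url:
            return f"{self.message} (at {self.url})"
        return self.message
```

Other scripts could then signal a failure with, for example, `raise CrawlerError("search box not found", url=current_url)`, where `current_url` is whatever page the caller was on.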
linked_in_crawler.py
This script defines a class that provides navigation around the traditional LinkedIn site. The beginning of the file lists global XPaths that Selenium uses throughout the process to locate elements within the HTML. The following functions are the important ones to keep in mind when writing new programs on top of this code.
login(self, username, password)
This function takes a username and password and logs in to LinkedIn. During the process, it calls the MouseMove move_random() function to move the mouse to random positions on the screen, so that the session looks less like an automated one.
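A login step along these lines could be sketched as follows. Everything specific here is an assumption for illustration: the login URL, the three XPaths (the real ones are listed at the top of linked_in_crawler.py), and the helper `type_like_human`. The sketch takes any Selenium-style `driver` and passes the locator strategy as the string `"xpath"`, which is the value of Selenium's `By.XPATH`. The random mouse movement done by MouseMove move_random() is left out because its implementation is not shown on this page.

```python
import random
import time

# Hypothetical values: the project's real XPaths may differ.
LOGIN_URL = "https://www.linkedin.com/login"
EMAIL_XPATH = "//input[@id='username']"
PASSWORD_XPATH = "//input[@id='password']"
SUBMIT_XPATH = "//button[@type='submit']"


def type_like_human(element, text, pause=time.sleep):
    """Send text one character at a time with a short random pause after each."""
    for character in text:
        element.send_keys(character)
        pause(random.uniform(0.05, 0.25))


def login(driver, username, password, pause=time.sleep):
    """Open the login page, fill in the credentials, and submit the form."""
    driver.get(LOGIN_URL)
    pause(random.uniform(2.0, 5.0))  # let the page load, as a person would
    type_like_human(driver.find_element("xpath", EMAIL_XPATH), username, pause)
    type_like_human(driver.find_element("xpath", PASSWORD_XPATH), password, pause)
    pause(random.uniform(0.5, 1.5))
    driver.find_element("xpath", SUBMIT_XPATH).click()
```

Passing `pause` as a parameter keeps the delays easy to switch off in tests while leaving them on during real crawls.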
logout(self)
This function logs out of LinkedIn. It works by clicking on the profile picture, and then selecting logout.
go_back(self)
This function navigates back one page in the browser history.
simple_search(self, query)
This function takes a query string and searches for it using the search box.
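The two navigation helpers above can be sketched against any Selenium-style `driver`. The search box XPath is a hypothetical placeholder, and the pause lengths are assumptions; `ENTER_KEY` is the same code point as Selenium's `Keys.ENTER`, written out so the sketch has no imports beyond the standard library.

```python
import random
import time

SEARCH_BOX_XPATH = "//input[@placeholder='Search']"  # hypothetical locator
ENTER_KEY = "\ue007"  # same code point as selenium's Keys.ENTER


def go_back(driver, pause=time.sleep):
    """Go back one page in the browser history, then wait briefly."""
    driver.back()
    pause(random.uniform(1.0, 3.0))


def simple_search(driver, query, pause=time.sleep):
    """Type the query into the search box and press Enter."""
    box = driver.find_element("xpath", SEARCH_BOX_XPATH)
    box.clear()  # remove any text left over from a previous search
    box.send_keys(query)
    pause(random.uniform(0.5, 1.5))
    box.send_keys(ENTER_KEY)
```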
Functionality
This section lists functions in the crawl_linkedin.py script that can be combined for higher functionality.
login(username, password)
This function opens the LinkedIn home page and logs in using the credentials given to the function. You will be taken to the home news feed for your account.
search(query)
This function assumes you are already logged in to LinkedIn. It types the given query into the search bar and starts the search.
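As one example of combining these functions, the sketch below logs in once and then runs a list of queries, collecting any that fail. The helper `search_many` is hypothetical. It receives `login` and `search` as arguments instead of importing them, because how crawl_linkedin.py is meant to be imported is not documented on this page.

```python
def search_many(login, search, username, password, queries):
    """Log in once, run each query in order, and return the ones that failed.

    `login` and `search` are expected to behave like login(username, password)
    and search(query) described above. The result is a list of
    (query, error message) pairs, empty when every search succeeded.
    """
    login(username, password)
    failed = []
    for query in queries:
        try:
            search(query)
        except Exception as error:  # keep going so one bad query does not stop the run
            failed.append((query, str(error)))
    return failed
```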
To what extent are we able to reproduce the network structure in LinkedIn? (From Previous)
Example 1: 1st-degree contact. You are connected directly to his profile. Albert Nabiullin (485 connections)
Example 2: 2nd-degree contact. You are connected to someone who is connected to him. Amir Kazempour Esmati (63 connections)
Example 3: 3rd-degree contact. You are connected to someone who is connected to someone else who is connected to her. Linda Szabados (500+ connections)
Any profile at a distance greater than three is defined as outside your network.
Summary: Individual-specific network information is not accessible, even for first-degree connections. Therefore, constructing a network structure from each individual's connections is not feasible.
The only feasible direction appears to be the advanced search feature.