Twitter Webcrawler (Tool)
Description
Notes: The Twitter Webcrawler, in its alpha version, is an exploratory project built around the Twitter API, in search of a sustainable and scalable way to excavate retweet-retweeter, favorited-favoriter, and following-follower relationships in the entrepreneurship Tweet-o-sphere. In parallel, we also seek to document the tweeting activities/timelines of important Twitter users in the same Tweet-o-sphere.
Input: Twitter database
Output: Local database documenting important timelines and relationships in the entrepreneurship Tweet-o-sphere.
Development Notes
7/11: Project start
- Dan wanted: a first-take on a Twitter API overview
- The API is cumbersome and not directly accessible; it requires a great deal of configuration if one chooses to leverage, e.g., the import requests library
  - Turns out Twitter has a long, controversial history with respect to third-party development. There is no clean canonical interface to access its database
  - DO NOT attempt to access the Twitter API through the canonically documented methods - huge waste of time
  - The documented authentication process is obsolete - do not use the canonical documentation for the OAuth procedure
- Instead, DO USE third-party Python interfaces such as python-twitter by bear - highly recommended in hindsight
- Follow python-twitter's documented methods for authentication
- The Twitter account that I am using is shortname: BIPPMcNair with password: amount
- One can obtain the consumer key, consumer secret, access key, and access secret by logging into the dev portal with that account and tapping TOOLS > Manage Your Apps in the footer bar of the portal
- There is no direct access to the Twitter database over plain HTTP as before, so expect to do all processing in a Python dev environment.
7/12: Grasping API
- The python-twitter library is extremely intricate and well-synchronized
  - All queries are launched through a twitter.api.Api object, which is produced by the authentication process implemented yesterday:
>>> import twitter
>>> api = twitter.Api(consumer_key='consumer_key',
                      consumer_secret='consumer_secret',
                      access_token_key='access_token',
                      access_token_secret='access_token_secret')
- Some potentially very useful query methods are:
  - Api.GetUserTimeline(user_id=None, screen_name=None), which returns up to 200 recent tweets of the input user. Really nice that the Twitter database operates on something as simple as screen_name, which is the @shortname that is very public and familiar
  - Api.GetRetweeters(status_id=None) and Api.GetRetweets(status_id=None), which identify a tweet as a status by its status_id and spit out all the retweets that this particular tweet has undergone
  - Api.GetFavorites(user_id=None), which seems to satisfy our need for tracking favorited tweets
  - Api.GetFollowers(user_id=None, screen_name=None) and Api.GetFollowerIDs(user_id=None, screen_name=None), which seem to be a good relationship-mapping mechanism, especially for the mothernode tweeters we care about
- After retrieving data objects using these query methods, we can understand and process them using instructions from the Twitter-python Models Source Code (a minimal usage sketch follows below)
  - To note that tweets are expressed as Status objects
    - A Status holds useful parameters such as 'text', 'created_at', 'user', etc.
    - They can be retrieved by classical object expressions such as Status.created_at
  - To note that users are expressed as User objects
  - Best part? All these model objects come with methods such as AsJsonString(self) and AsDict(self), so that we can read and write them as JSON or dict objects in the py environment
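A minimal usage sketch, assuming the authenticated api object from above (the screen name is just our own test account):

# Fetch up to 200 recent tweets and inspect them.
statuses = api.GetUserTimeline(screen_name='BIPPMcNair', count=200)
recent = statuses[0]
print(recent.created_at)      # e.g. 'Tue Jul 12 15:30:00 +0000 2016'
print(recent.text)
print(recent.AsDict())        # the whole tweet as a plain dict...
print(recent.AsJsonString())  # ...or as a JSON string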
7/13: Full Dev
Documented in-file, as below:
Twitter Webcrawler
- Summary: Rudimentary (and slightly generalized) webcrawler that queries the Twitter database using the Twitter API. At the current stage of development/discussion, the user shortname (in Twitter, @shortname) is used as the query key, and this script publishes the 200 most recent tweets of said user in a tab-delimited, UTF-8 document, along with the details and social interactions each tweet possesses
- Input: Twitter database, Shortname string of queried user (@shortname)
- Output: Local database of queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By".
- Version: 1.0 Alpha
- Development environment specs: Twitter API, JSON library, twitter-python library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3
Pseudo-code
- function I: main driver
  - generate an empty table with apt columns for subsequent building
  - iterate through each status object in the obtained data, and fill up the table rows as apt, one row per event
  - the main processing task being: write the table to the output file
- function II: empty table generator
  - modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing
- function IV: authenticator + twitter API access interface setup
  - authenticate using our granted consumer keys and access tokens
  - obtains a working twitter API object, post-authentication
- function V: subquery #1
  - iterate through the main query object in order to further query for retweeters, i.e. GetRetweeter() and ???
- function VI: raw data acquirer
  - grabs raw data of recent tweets using the master_working_api object
  - makes it JSON so we can access and manipulate it easily (a condensed sketch of the whole pipeline follows this list)
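Not the in-file code itself, but a condensed sketch of how functions I, IV, and VI might hang together, with invented helper names and the placeholder credentials from above (the "Retweeted By"/"Favorited By" subquery columns are omitted here):

import pandas as pd
import twitter

COLUMNS = ['Content', 'User', 'Created at', 'Hashtags', 'User Mentions',
           'Retweet Count', 'Favorite Count']

def authenticate():
    # function IV: authenticator -- returns a working twitter.Api object.
    return twitter.Api(consumer_key='consumer_key',
                       consumer_secret='consumer_secret',
                       access_token_key='access_token',
                       access_token_secret='access_token_secret')

def get_recent_tweets(api, screen_name):
    # function VI: raw data acquirer -- up to 200 recent tweets as dicts.
    return [s.AsDict() for s in
            api.GetUserTimeline(screen_name=screen_name, count=200)]

def main(screen_name, outfile):
    # function I: main driver -- build the table row by row, then write it.
    api = authenticate()
    rows = []
    for tweet in get_recent_tweets(api, screen_name):
        rows.append({'Content': tweet.get('text', ''),
                     'User': screen_name,
                     'Created at': tweet.get('created_at', ''),
                     'Hashtags': tweet.get('hashtags', []),
                     'User Mentions': tweet.get('user_mentions', []),
                     'Retweet Count': tweet.get('retweet_count', 0),
                     'Favorite Count': tweet.get('favorite_count', 0)})
    df = pd.DataFrame(rows, columns=COLUMNS)
    # Write a tab-delimited, UTF-8 document, per the output spec above.
    df.to_csv(outfile, sep='\t', encoding='utf-8', index=False)

# e.g. main('BIPPMcNair', 'twitter_crawl_output.txt')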
Notes:
- Modular development and unit testing are integral to writing fast, working code. No joke.
- Problems with the GetFavorites() method, as it only returns the favorited list with respect to the authenticated user (i.e. BIPPMcNair), not the input target user.
- Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we were to scale; one possible back-off approach is sketched below.
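A plausible mitigation, sketched with an invented wrapper (not yet in the code): back off and retry whenever python-twitter raises an API error.

import time
import twitter

def query_with_backoff(query_fn, *args, **kwargs):
    # Hypothetical wrapper: sleep out the 15-minute rate-limit window,
    # then retry. (As written, this also retries other API errors.)
    while True:
        try:
            return query_fn(*args, **kwargs)
        except twitter.TwitterError as err:
            print('Twitter API error, sleeping 15 minutes: %s' % err)
            time.sleep(15 * 60)

# e.g. retweeters = query_with_backoff(api.GetRetweeters, status_id=some_id)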
- A tweet looks like this in JSON (the sketch below is abbreviated, with invented field values):
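# Abbreviated, invented example of a tweet as a dict/JSON object,
# i.e. roughly the output of Status.AsDict() or
# json.loads(Status.AsJsonString()).
tweet = {
    'created_at': 'Tue Jul 12 15:30:00 +0000 2016',
    'id': 752940000000000000,
    'text': 'Measuring the entrepreneurship Tweet-o-sphere #startups',
    'user': {'screen_name': 'BIPPMcNair', 'followers_count': 120},
    'hashtags': ['startups'],
    'user_mentions': [],
    'retweet_count': 4,
    'favorite_count': 7,
}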
7/14 & 7/15: Alpha dev wrap-up
- Black box can be pretty much sealed after a round of debugging
- All output requirements fulfilled except for the output "retweeter" list per tweet
- Code is live
- Sample output is live
- Awaiting more discussion, modifications
- Ed mentioned populating a database according to Past Tweet-o-sphere experimentation/excavation results
7/18: Application on Todd's Hub Project
Notes and pseudo-code for Todd's hub data
- Input: csv of twitter @shortnames
- Output: A main datasheet tagging each @shortname to the following keys: # of followers, # of following, # of tweets made in the past month; a side datasheet for each @shortname detailing the time signature, text, retweet count and other details of each tweet made by given @shortname in the past month.
- Summary: need to fix up the automatic .csv writing methods, and add parameters to query the timeline by time signature (UPDATE: NOT POSSIBLE, LET'S JUST DO 200 RESULTS) instead of by # of searched tweets.
- Pseudo-code
- We need a driver function to write the main datasheet, as well as iterate through the input list of @shortnames and run the alpha scraper on each iteration (a sketch follows below)
  - It doesn't need to have a read.csv side function - no room for failure, no need to test
  - Make one query per iteration, please
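A minimal sketch of that driver; scrape_user() is a hypothetical stand-in for the alpha scraper, and the input csv is assumed to hold one @shortname per line:

import csv
import twitter

def scrape_user(api, shortname):
    # Stand-in for the alpha scraper: exactly one timeline query per user.
    statuses = api.GetUserTimeline(screen_name=shortname, count=200)
    return {'shortname': shortname, 'tweets_fetched': len(statuses)}

def run_hub_driver(shortname_csv, api):
    # Driver: iterate the input list of @shortnames, one query per iteration.
    summary_rows = []
    with open(shortname_csv, 'rb') as f:  # 'rb' for the csv module on Py 2.7
        for row in csv.reader(f):
            if row:  # skip blank lines
                summary_rows.append(scrape_user(api, row[0].strip().lstrip('@')))
    return summary_rows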
7/19: Application on Todd's Hub Project Pt.II
- As documented in the twitter-python documentation, there is no direct way to filter timeline query results by start date/end date. So I've decided to write a support module, time_signature_processor, to help with counting the number of tweets made since a month ago (a sketch follows this list)
  - first-take with from datetime import datetime
  - usage of the datetime.strptime() method to parse the (luckily) well-formatted date strings provided by twitter.Status objects into smart datetime objects that support mathematical comparisons (i.e. if tweet_time_obj < one_month_ago_obj:)
  - Does not support timezone-aware counting; the current Python version (2.7) does not support timezone-awareness in my datetime objects
    - functionality to be subsequently improved
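A minimal sketch of what time_signature_processor amounts to; the function names are illustrative, but the format string matches the created_at strings twitter.Status provides:

# time_signature_processor: count tweets made within the last month.
from datetime import datetime, timedelta

# twitter.Status.created_at looks like 'Tue Jul 19 16:30:00 +0000 2016'.
# Python 2.7's strptime cannot parse the offset via %z, so '+0000' is
# matched literally and all results are naive (UTC) datetimes.
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

def parse_created_at(created_at):
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT)

def count_recent_tweets(statuses, days=30):
    # Count statuses newer than `days` days, comparing naive UTC datetimes.
    cutoff = datetime.utcnow() - timedelta(days=days)
    return sum(1 for s in statuses if parse_created_at(s.created_at) >= cutoff)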
- To retrieve data regarding the # of following for each shortname, it seems like I have to call twitter.api.GetUser() in addition to twitter.api.GetTimeline. To ration token usage, I will omit this second call for now (the omitted call is sketched below)
  - functionality to be subsequently improved
- Improvements to debugging interface and practice
  - Do note Komodo IDE's Unexpected Indent error message, which procs when it cannot distinguish between whitespace created by tabs and spaces. Use the editor debugger instead of the interactive shell in this case; fixing the issue in the latter is tedious to the point of impossible
- The pandas.DataFrame data structure can be built in a smart fashion by passing a dictionary that maps each column name to a list of column values (the dictionary's key-value pairs become the columns of the df proper). This is more efficient than the past method of creating an empty table and then populating it cell-by-cell:
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age',
                                     'preTestScore', 'postTestScore'])
df