Twitter Webcrawler (Tool)

From edegan.com
Revision as of 16:30, 19 July 2016 by GunnyLiu (talk | contribs)
Jump to navigation Jump to search

Description

Notes: The Twitter Webcrawler, in its alpha version, is an expedition project involving the Twittwer API in search of a sustainable and scale-able way to excavate retweet-retweeter, favorited-favoriter following-follower relationships in the entrepreneurship Tweet-o-sphere. On the same beat, we also seek to document tweeting activities/timelines of important twitters in the same Tweet-o-sphere.

Input: Twitter database

Output: Local database documenting important timelines and relationships in the entrepreneurship Tweet-o-sphere.

Development Notes

7/11: Project start

  • Dan wanted:
  • First-take on Twitter API Overview
    • Cumbersome API that is not directly accessible/requires great deal of configuration if one chooses to leverage e.g. import requests library.
      • Turns out Twitter has a long controversial history wrt third-party development. There is no clean canonical interface to access its database.
      • DO NOT attempt to access Twitter API through canonical documented methods - huge waste of time
      • Obsolete authentication process documented - do not be use canonical documentation for Oauth procedure
  • Instead, DO USE third-party developed python interfaces such as python-twitter by bear - highly recommended in hindsight
    • Follow python-twitter's documented methods for authentication
    • The twitter account that I am using is shortname: BIPPMcNair and password: amount
      • One can obtain the consumer key, consumer secret, access key and access secret through accessing the dev portal using the account and tapping TOOLS > Manage Your Apps in the footer bar of the portal.
    • There is no direct access to Twitter database through http://, as before, so expect to do all processing in a py dev environment.

7/12: Grasping API

  • The python-twitter library is extremely intricate and well-synchronized
    • All queries are to be launched through a twitter.api.Api object, which is produced by the authentication process implemented yesterday
>>> import twitter
>>> api = twitter.Api(consumer_key='consumer_key',
                      consumer_secret='consumer_secret',
                      access_token_key='access_token',
                      access_token_secret='access_token_secret')
    • Some potentially very useful query methods are:
      • Api.GetUserTimeline(user_id=None, screen_name=None) which returns up to 200 recent tweets of input user. Really nice that twitter database operates on something as simple as screen_name, which is @shortname that is v public and familiar.
      • Api.GetRetweeters(status_id=None) and Api.GetRetweets(status_id=None) which identifies a tweet as a status by its status_id and spits out all the retweets that this particular tweet has undergone.
      • Api.GetFavorites(user_id=None) which seems to satisfy our need for tracking favorited tweets
      • Api.GetFollowers(user_id=None, screen_name=None) and Api.GetFollowerIDs(user_id=None, screen_name=None) which seems to be a good relationship mapping mechanism for esp. the mothernodes tweeters we care about.
    • After retrieving data objects using these query methods, we can understand and process them using instructions from Twitter-python Models Source Code
      • To note that tweets are expressed as Status objects
        • It holds useful parameters such as 'text', 'created_at', 'user', etc
        • They can be retrieved by classical object expressions such as Status.created_at
      • To note that users are expressed as User objects
      • Best part? All these objects inherit .Api methods such as AsJsonString(self) and AsDict(self) so that we can read and write them as JSON or DICT objects in the py environment

7/13: Full Dev

Documented in-file, as below:

Twitter Webcrawler

  • Summary: Rudimentary (and slightly generalized) webcrawler that queries twitter database with using twitter API. At current stage of development/discussion, user shortname (in twitter, @shortname) is used as the query key, and this script publishes 200 recent tweets of said user in a tab delimited, UTF-8 document, along with the details and social interactions each tweet possesses
  • Input: Twitter database, Shortname string of queried user (@shortname)
  • Output: Local database of queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By".
  • Version: 1.0 Alpha
  • Development environment specs: Twitter API, JSON library, twitter-python library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3

Pseudo-code

  • function I: main driver
    • generate empty table for subsequent building with apt columns
    • iterate through each status object in the obtained data, and fill up the table rows as apt, one row per event
    • and the main processing task being: write table to output file
  • function II: empty table generator
    • modular caus of my unfamiliarity with pandas.DataFrame; modularity enables testing
  • function IV: authenticator + twitter API access interface setup
    • authenticate using our granted consumer keys and access tokens
    • obtains working twitter API object, post-authentication
  • function V: subquery #1
    • iterate through main query object in order to further query for retweeters, i.e. GetRetweeter() and ???
  • function VI: raw data acquisitior
    • grabs raw data of recent tweets using master_working_api object
    • make it json so we can access and manipulate it easily

Notes:

  • Modular development and unit testing are integral to writing fast, working code. no joke
  • Problems with GetFavorites() method as it only returns the favorited list wrt authenticated use (i.e. BIPPMcNair), not input target user.
  • Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we were to scale.
  • A tweet looks like this in json:

7/14 & 7/15: Alpha dev wrap-up

7/18: Application on Todd's Hub Project

Notes and PC for the Todd's hub data

  • Input: csv of twitter @shortnames
  • Output: A main datasheet tagging each @shortname to the following keys: # of followers, # of following, # of tweets made in the past month; a side datasheet for each @shortname detailing the time signature, text, retweet count and other details of each tweet made by given @shortname in the past month.
  • Summary: need to fix up auto .csv writing methods, parameters to query timeline by time signature (UPDATE: NOT POSSIBLE, LET'S JUST DO 200 RESULTS), instead of # of searched tweets.
  • Pseudo-code
    • We need a driver function to write the main datasheet, as well as iterate through the input list of @shortname and run alpha scrapper on each iteration.
    • doesn't need to have a read.csv side function - no room for failure, no need to test
    • Make ***one query*** per iteration, please.

7/19: Application on Todd's Hub Project Pt.II

  • As documented on twitter-python documentation, there is no direct way to filter timeline query results by start date/end date. So I've decided to write a support module time_signature_processor to help with counting the number of tweets that have elapsed since a month ago
    • first-take with from datetime import datetime
    • usage of datetime.datetime.stptime() method to parse formatted (luckily) date strings provided by twitter.Status objects into smart datetime.datetime objects to support mathematical comparisons (i.e. if tweet_time_obj < one_month_ago_obj:
    • Does not support timezone-aware counting. current python version (2.7) does not support timezone-awareness in my datetime.datetime objects.
      • functionality to be subsequently improved
  • To retrieve data regarding # of following for each shortname, it seems like I have to call twitter.api.GetUser() in addition to twitter.api.GetTimeline. To ration token usage, I will omit this second call for now.
    • functionality to be subsequently improved
  • Improvements to debugging interface and practice
    • Do note Komodo IDE's Unexpected Indent error message that procs when it cannot distinguish between whitespaces created by /tab or /space. Use editor debugger instead of interactive shell in this case. Latter is tedious and impossible to fix.
  • data structure pandas.DataFrame can be built in a smart fashion by putting together various dictionaries that uses list-indices and list-values as key-value pairs in the df proper. More efficient than past method of creating empty table then populating it cell-by-cell.
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df