Difference between revisions of "Twitter Webcrawler (Tool)"

From edegan.com
(Created page with "{{McNair Projects |Project Title=Twitter Webcrawler (Tool) |Topic Area=Resources and Tools |Owner=Gunny Liu |Start Term=Summer 2016 |Status=Active |Deliverable=Tool |Audience=...")
 
 
(29 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{McNair Projects
+
{{Project
|Project Title=Twitter Webcrawler (Tool)
+
|Has project output=Tool
|Topic Area=Resources and Tools
+
|Has title=Twitter Webcrawler (Tool)
|Owner=Gunny Liu
+
|Has owner=Gunny Liu
|Start Term=Summer 2016
+
|Has start date=Summer 2016
|Status=Active
+
|Has keywords=Webcrawler, Database, Twitter, API, Python,Tool
|Deliverable=Tool
+
|Has sponsor=McNair Center
|Audience=McNair Staff
+
|Has notes=
|Keywords=Webcrawler, Database, Twitter, API, Python
+
|Is dependent on=
|Primary Billing=AccNBER01
+
|Depends upon it=
 +
|Has project status=Complete
 
}}
 
}}
  
Line 24: Line 25:
  
 
===7/11: Project start===
 
----
 
 
*Dan wanted:
 
 
[[File:Capture 15.PNG|400px|none]]
 
Line 37: Line 37:
 
***One can obtain the consumer key, consumer secret, access key and access secret through accessing the dev portal using the account and tapping <code>TOOLS > Manage Your Apps</code> in the footer bar of the portal.
 
 
**There is '''no''' direct access to Twitter database through http://, as before, so expect to do all processing in a py dev environment.  
 
 
  
 
===7/12: Grasping API===
 
Line 47: Line 46:
 
                       access_token_key='access_token',
 
 
                       access_token_secret='access_token_secret')
 
 +
 
**Some potentially very useful query methods are:  
 
 
***<code>Api.GetUserTimeline(user_id=None, screen_name=None)</code> which returns up to 200 recent tweets of input user. Really nice that twitter database operates on something as simple as <code>screen_name</code>, which is @shortname that is v public and familiar.
 
Line 53: Line 53:
 
***<code>Api.GetFollowers(user_id=None, screen_name=None)</code> and <code>Api.GetFollowerIDs(user_id=None, screen_name=None)</code> which seems to be a good relationship mapping mechanism for esp. the mothernodes tweeters we care about.
 
  
===7/5: Eventbrite API First-Take===
+
**After retrieving data objects using these query methods, we can understand and process them using instructions from [http://python-twitter.readthedocs.io/en/latest/_modules/twitter/models.html Twitter-python Models Source Code]
----
+
***To note that tweets are expressed as <code>Status</code> objects
*Eventbrite developer account for McNair Center:
+
****It holds useful parameters such as <code>'text'</code>, <code>'created_at'</code>, <code>'user'</code>, etc
**first name: '''Anne''', last name: '''Dayton'''
+
****They can be retrieved by classical object expressions such as <code>Status.created_at</code>
**Login Email: '''admin@mcnaircenter.org'''
+
***To note that users are expressed as <code>User</code> objects
**Login Password: '''amount'''
+
***Best part? All these objects inherit .Api methods such as AsJsonString(self) and AsDict(self) so that we can read and write them as JSON or DICT objects in the py environment
*Eventbrite API is well-documented and its database readily accessible. In the python dev environment, I am using the http <code>requests</code> library to make queries to the database, to obtain json data containing event objects that in turn contain organizer objects, venue objects, start/end time values, longitude/latitude values specific to each event. The <code>requests</code> library has inbuilt <code>.json()</code> access methods, simplifying the json reading/writing process. Bang.
+
 
**In querying for events organized by techstar, one of the biggest startup programs organization in the U.S., I use the following. Note that the organizer ID of techstar is 2300226659.
+
===7/13: Full Dev===
import requests
+
'''Documented in-file, as below:'''
response = requests.get(
+
 
    "https://www.eventbriteapi.com/v3/organizers/2300226659/events/",
+
====Twitter Webcrawler====
    headers = {
+
*Summary: Rudimentary (and slightly generalized) webcrawler that queries twitter database with using twitter API. At current stage of development/discussion, user shortname (in twitter, @shortname) is used as the query key, and this script publishes 200 recent tweets of said user in a tab delimited, UTF-8 document, along with the details and social interactions each tweet possesses
        "Authorization": "Bearer CRAQ5MAXEGHKEXSUSWXN",
+
*Input: Twitter database, Shortname string of queried user (@shortname)
    },
+
*Output: Local database of queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By".
    verify = True,
+
*Version: 1.0 Alpha
)
+
*Development environment specs: Twitter API, JSON library, twitter-python library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3
**In querying for, instead, keywords such as "startup weekend," I use the following.
+
 
import requests
+
====Pseudo-code====
response = requests.get(
+
*function I: main driver
    "https://www.eventbriteapi.com/v3/events/search/?q=startup+weekend",
+
**generate empty table for subsequent building with apt columns
    headers = {
+
**iterate through each status object in the obtained data, and fill up the table rows as apt, one row per tweet
        "Authorization": "Bearer CRAQ5MAXEGHKEXSUSWXN",
+
**and the main processing task being: write table to output file
    },
+
 
    verify = True, 
+
*function II: empty table generator
)
+
**'''modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing'''
**In querying for events parked under the category "science and technology", I use the following. However, this query also returns scientific seminars unrelated to entrepreneurship and is yet to be refined.
+
 
**Note that the category ID of science and technology is 102.
+
*function IV: authenticator + twitter API access interface setup
import requests
+
**authenticate using our granted consumer keys and access tokens
response = requests.get(
+
**obtains working twitter API object, post-authentication
    "https://www.eventbriteapi.com/v3/categories/102",
+
 
    headers = {
+
*function V: subquery #1
        "Authorization": "Bearer CRAQ5MAXEGHKEXSUSWXN",
+
**iterate through main query object in order to further query for retweeters, i.e. GetRetweeters() and ???
    },
+
 
    verify = True, 
+
*function VI: raw data acquisition
)
+
**grabs raw data of recent tweets using master_working_api object
**In each case, var <code>response</code> is a json object, that can be read/written in python using the requests method <code>response.json()</code>. Each endpoint used above are instances of e.g. <code>GET events/search/</code> or <code>GET categories/:id</code> EventBrite API methods. There are different parameters each GET function can harness to get more specific results. To populate a comprehensive local database, the '''dream''' is to systematic queries from different endpoints and collecting all results, without repetition, in a centralized database. In order to do this, I'll have to familarize further with these GET functions and develop a systematic approach to automate queries to the eventbrite server. One way to do this is to import entrepreneurship buzzword libraries that are available on the web, and make queries by iterating through these search strings systematically.
+
**make it json so we can access and manipulate it easily
*Eventbrite event objects in json are well-organized and consistent. There are many interesting fields such as the longitude/latitude decimals, apart from name/location/organizer/start-time/end-time data which are data we want to amass initially.  
+
 
**For instance, the upcoming startup weekend event in Seville looks like the following.
+
====Notes:====
[[File:Capture 12.PNG|400px|none]]
+
*Modular development and unit testing are integral to writing fast, working code. no joke
**In the events object, organizer and venue are represented as ID's and have to be queried separately since they contain a multitude of string-value pairs such as "description", "logo", and "url" in the case of organizer data. Huge opportunity here for more data extraction. Kudos to eventbrite for documenting their stuff meticulously. Can you tell I'm impressed?
+
*Problems with the GetFavorites() method, as it only returns the favorited list wrt the authenticated user (i.e. BIPPMcNair), not the input target user.
**To produce a local database, I'm using the <code>import pandas as pd</code> library, the <code>pandas.DataFrame</code> object and the <code>pandas.DataFrame.to_csv()</code> method. Currently, I initialize a dataframe with columns of variables that I seek to extract, and iterate through event objects and venue/organizer objects within to populate the dataframe with rows of event data.  
+
*'''Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we were to scale.'''
**'''Still debugging/writing at the moment'''.
+
*A tweet looks like this in json:
**RDP went down, major sadness.
+
[[File:Capture 16.PNG|800px|none]]
 +
 
 +
===7/14 & 7/15: Alpha dev wrap-up===
 +
*Black box can be pretty much sealed after a round of debugging
 +
*All output rqts fulfilled except for output "retweeter" list per tweet
 +
*[https://github.com/scroungemyvibe/mcnair_center_builds Code is live]
 +
*[https://github.com/scroungemyvibe/mcnair_center_builds Sample output is live]
 +
*Awaiting more discussion, modifications
 +
*Ed mentioned populating a database according to [[Social_Media_Entrepreneurship_Resources|Past Tweet-o-sphere experimentation/excavation results]]<!-- flush flush -->
 +
 
 +
===7/18: Application on Todd's Hub Project===
 +
====Notes and PC for the Todd's hub data====
 +
*Input: csv of twitter @shortnames
 +
*Output: A main datasheet tagging each @shortname to the following keys: # of followers, # of following, # of tweets made in the past month; a side datasheet for each @shortname detailing the time signature, text, retweet count and other details of each tweet made by given @shortname in the past month.
 +
*Summary: need to fix up auto .csv writing methods, parameters to query timeline by time signature (UPDATE: NOT POSSIBLE, LET'S JUST DO 200 RESULTS), instead of # of searched tweets.  
  
 +
*Pseudo-code
 +
**We need a driver function to write the main datasheet, as well as iterate through the input list of @shortname and run alpha scrapper on each iteration.
 +
**doesn't need to have a read.csv side function - no room for failure, no need to test
 +
**Make ***one query*** per iteration, please.
  
===7/6: Alpha Development===
+
===7/19: Application on Todd's Hub Project Pt.II===
----
+
*As documented on <code>twitter-python</code> documentation, there is no direct way to filter timeline query results by start date/end date. So I've decided to write a support module <code>time_signature_processor</code> to help with counting the number of tweets that have elapsed since a month ago
*Eventbrite stipulates a system of ID-numbering for all organizers and venues objects, for instance.
+
**first-take with <code>from datetime import datetime</code>
**For the endpoint <code>GET /venues/:id/</code>, replace <code>:id</code> with the venue_id associated with desired organizer
+
**usage of the datetime.datetime.strptime() method to parse the (luckily) formatted date strings provided by <code>twitter.Status</code> objects into smart datetime.datetime objects that support mathematical comparisons (i.e. <code>if tweet_time_obj < one_month_ago_obj: </code>)
**For the endpoint <code>GET /organizers/:id</code>, replace <code>:id</code> with the organizer_id associated with desired organizer
+
**Does not support timezone-aware counting. current python version (2.7) does not support timezone-awareness in my datetime.datetime objects.
**Where are these ID numbers located, you ask? Any query for an event will return them as the values of the strings "venue_id" and "organizer_id"
+
***'''functionality to be subsequently improved'''
*Script development slowed considerably by lack of modularity and debugging functionality
+
*To retrieve data regarding # of following for each shortname, it seems like I have to call <code>twitter.api.GetUser()</code> in addition to <code>twitter.api.GetUserTimeline()</code>. To ration token usage, I will omit this second call for now.
**Modules to generate query url strings from input GET
+
**'''functionality to be subsequently improved'''
**Module to create empty <code>pandas.DataFrame</code> table based on input rows and columns
+
*Improvements to debugging interface and practice
**Modules to retrieve information from venues and organizer data from their respective ID numbers
+
**Do note Komodo IDE's <code>Unexpected Indent</code> error message that procs when it cannot distinguish between whitespaces created by /tab or /space. Use editor debugger instead of interactive shell in this case. Latter is tedious and impossible to fix.
**To learn and operate komodo debugger and write appropriate tests for each modules detached from main driver function
+
*data structure <code>pandas.DataFrame</code> can be built in a smart fashion by putting together various dictionaries that uses list-indices and list-values as key-value pairs in the df proper. More efficient than past method of creating empty table then populating it cell-by-cell. This is clearly the way to go, I was young and stupid.
**To learn pandas.DataFrame and appropriate methods to update it
+
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
*'''Notes and Ideas'''
+
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
**Develop smart iteration to query for all events sought
+
        'age': [42, 52, 36, 24, 73],
:::To create intelligent searches:
+
        'preTestScore': [4, 24, 31, 2, 3],
:::Note that eventbrite is esp good for free events
+
        'postTestScore': [25, 94, 57, 62, 70]}
:::Note that past events may extend only to a certain point
+
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
:::Note that eventbrite was launched in 2006, but is the first major player in online event ticketing
+
df<!-- flush -->
:::Category is always science and tech
 
:::Organiser is impt; some entrepreneurship events are organised by known collectives
 
:::Organiser description also has many impt keywords
 
:::keywords from SEO material on [[marketing artfully]] is very good
 
:::Event series, dates and venues endpoints are secondarily important
 
  
 +
===7/20: Application on Todd's Hub Project Pt. III===
 +
*Major debugging session
 +
**Note: <code>str()</code> method in python attempts to convert input into ASCII chars. When input are already UTF-8 chars, or have ambiguous components such as a backslash, <code>str()</code> will malfunction!!
 +
**Note: wrote additional function <code>empty_timeline_filter()</code> to address problems with certain @shortnames having no tweets, ever, and thus no timeline to speak of. Ran function and manually removed these @shortnames from the input .csv
 +
**'''Re: Twitter API TOKENS''' i.e. this is important. Refer to [https://dev.twitter.com/rest/public/rate-limits API Rate Limit Chart] for comprehensive information on what traffic Twitter allows, and does not allow us to query.
 +
***In calling <code>GET statuses/user_timeline</code> for all 109 @shortnames in the input list, I am barely hitting the ''''180 calls per 15 minutes'''' rate limit. But do take note that while testing modules, one is likely to repeatedly call the same <code>GET</code> in a short burst span of time.
 +
***In terms of future developments, <code>GET</code> methods such as <code>GET statuses/retweeters/ids</code> are capped at a mere ''''15 calls per 15 minutes''''. This explains why it was previously impossible to populate a list of retweeter ID's for each tweet processed in the alpha scraper. (See above)
 +
***There is a sleeper parameter we can use with the <code>twitter.Api</code> object in <code>python-twitter</code>
 +
import twitter
 +
api = twitter.Api(consumer_key=[consumer key],
 +
                  consumer_secret=[consumer secret],
 +
                  access_token_key=[access token],
 +
                  access_token_secret=[access token secret],
 +
                  '''sleep_on_rate_limit=True''')
 +
***It is, however, unclear if this is useful. Considering that the sleeper is triggered at a certain point, it is hard to keep track of the chokepoint and, more importantly, how long is the wait and how long already has elapsed.
 +
**Note: it was important to add progress print() statements at each juncture of the scrapper driver for each iteration of data scrapping, as follows. They helped me track the progress of the data query and writing, and alerted me to possible bugs that arise for individual @shortname and timelines.
 +
[[File:Capture 18.PNG|800px|none]]
 +
Note to self: full automation/perfectionism is not necessary or helpful in a dev environment. It is of paramount importance to seek the shortest path, the max effect and the most important problem at each given step.
 +
*'''Development complete'''
 +
**Output files can be found in E:\McNair\Users\GunnyLiu, with E:\ being McNair's shared bulk drive.
 +
***Main datasheet that maps each row of @shortname to its count of followers and past month tweets is named <code>Hub_Tweet_Main_DataSheet.csv</code>
 +
***Individual datasheets for each @shortname that maps each tweet to tweet details can be found at <code>Twitter_Data_Where_@shortname_Tweets.csv</code>
 +
**Code will be LIVE on <code>mcnair git</code> soon
 +
*Output/Process Shortcoming:
 +
**Unable to retrieve retweeter list for each tweet, because this current pull has a total of 200x109=21800 tweets. Making 1 call a minute due to rate limit will amount to a runtime of >21800 minutes. 363 Hours approx. If an intern is paid $10 an hour, this data could cost $3630. Let's talk about opportunity cost.
 +
**Unable to process past month tweet count if count exceeds 199. Will need to write additional recursive modules to do additional pulls to achieve actual number. To be discussed
 +
**Unable to correct for timezone in calculating tweets over the past month. Needs to install <code>python 3.5.3</code>
 +
**Unable to process data for a single @shortname i.e. @FORGEPortland because they don't tweet and that's annoying
  
===7/7 Alpha Development #2===
+
===7/21: Application to Todd's Hub Project Pt. IV===
----
+
*Fix for time signatures in output
*Full swing: pseudo-code, modularity, docstrings, tests, naming style
+
**Instead of discrete strings, we want the "Creation Time" value of tweets in the output to be in the format of MM/DD/YYYY, which supports performance on MS Excel and other GUI-based analysis environments
*Komodo debugger works
+
**Wrote new function time_signature_simplifier() and time_signature_mass_simplification()
*Alpha development complete. All tests passed. Complete code as below.
+
**Functions iterate through all existing .csv tweetlogs of listed hubs @shortnames and process them in a python environment as pd.DataFrame objects
https://github.com/scroungemyvibe/mcnair_center_builds/blob/master/EventBrite_Webcrawler_Build.py
+
**For each date string that exists under the "Creation Time" column, function converts them to datetime.datetime objects, and overwrite using <code>.date().month</code>, <code>.date().day</code>, <code>.date().year</code> attributes of each object.
*'''Notes'''
+
***Met problems with date strings such as "29 Feb"; datetime has compatibility issues with leap years esp. when year is defaulted to 1900. Do take note.
**Current query (without input parameters) by organizer ID returns only active events listed under organizer. For instance, techstars has 45 upcoming events and I am pulling 45 json event objects from the database.
+
**test passed; new data is available, for every input @shortname <code>Twitter_Data_Where_@shortname_Tweets_v2.csv</code>
**Current build should be applied systematically to lists of organizer_id's
 
**Further build ideas/notes documented in code proper on the git
 

Latest revision as of 12:47, 21 September 2020


Project
Twitter Webcrawler (Tool)
Project Information
Has title Twitter Webcrawler (Tool)
Has owner Gunny Liu
Has start date Summer 2016
Has deadline date
Has keywords Webcrawler, Database, Twitter, API, Python, Tool
Has project status Complete
Has sponsor McNair Center
Has project output Tool


Description

Notes: The Twitter Webcrawler, in its alpha version, is an exploratory project that uses the Twitter API to find a sustainable and scalable way to excavate retweet-retweeter, favorited-favoriter and following-follower relationships in the entrepreneurship Tweet-o-sphere. Along the way, we also seek to document the tweeting activities/timelines of important tweeters in the same Tweet-o-sphere.

Input: Twitter database

Output: Local database documenting important timelines and relationships in the entrepreneurship Tweet-o-sphere.

Development Notes

7/11: Project start

  • Dan wanted:
Capture 15.PNG
  • First-take on Twitter API Overview
    • Cumbersome API that is not directly accessible and requires a great deal of configuration if one chooses to query it with e.g. the requests library.
      • Turns out Twitter has a long, controversial history wrt third-party development. There is no clean canonical interface to access its database.
      • DO NOT attempt to access the Twitter API through the canonical documented methods - huge waste of time
      • The documented authentication process is obsolete - do not use the canonical documentation for the OAuth procedure
  • Instead, DO USE third-party developed python interfaces such as python-twitter by bear - highly recommended in hindsight
    • Follow python-twitter's documented methods for authentication
    • The twitter account that I am using is shortname: BIPPMcNair and password: amount
      • One can obtain the consumer key, consumer secret, access key and access secret through accessing the dev portal using the account and tapping TOOLS > Manage Your Apps in the footer bar of the portal.
    • There is no direct access to Twitter database through http://, as before, so expect to do all processing in a py dev environment.

7/12: Grasping API

  • The python-twitter library is extremely intricate and well-synchronized
    • All queries are to be launched through a twitter.api.Api object, which is produced by the authentication process implemented yesterday
>>> import twitter
>>> api = twitter.Api(consumer_key='consumer_key',
                      consumer_secret='consumer_secret',
                      access_token_key='access_token',
                      access_token_secret='access_token_secret')
    • Some potentially very useful query methods are:
      • Api.GetUserTimeline(user_id=None, screen_name=None) which returns up to 200 recent tweets of input user. Really nice that twitter database operates on something as simple as screen_name, which is @shortname that is v public and familiar.
      • Api.GetRetweeters(status_id=None) and Api.GetRetweets(status_id=None) which identifies a tweet as a status by its status_id and spits out all the retweets that this particular tweet has undergone.
      • Api.GetFavorites(user_id=None) which seems to satisfy our need for tracking favorited tweets
      • Api.GetFollowers(user_id=None, screen_name=None) and Api.GetFollowerIDs(user_id=None, screen_name=None) which seem to be a good relationship-mapping mechanism, esp. for the mothernode tweeters we care about.
    • After retrieving data objects using these query methods, we can understand and process them using instructions from Twitter-python Models Source Code
      • To note that tweets are expressed as Status objects
        • They hold useful attributes such as 'text', 'created_at', 'user', etc
        • These can be retrieved by classical object expressions such as Status.created_at
      • To note that users are expressed as User objects
      • Best part? All these objects inherit model methods such as AsJsonString(self) and AsDict(self) so that we can read and write them as JSON or dict objects in the py environment
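
As a rough sketch of how the objects described above might be consumed, the snippet below flattens Status-like objects into plain dictionaries using the attributes named in these notes (text, created_at, user). FakeStatus and FakeUser are hypothetical stand-ins so the example runs without credentials, and the use of user.screen_name is an assumption about what we would want in the "User" column. With python-twitter, the list would instead come from a GetUserTimeline() call on the authenticated api object.

```python
from collections import namedtuple

# Hypothetical stand-ins for twitter.Status and twitter.User, so the sketch
# runs offline. Only a handful of attributes are modelled.
FakeUser = namedtuple("FakeUser", ["screen_name"])
FakeStatus = namedtuple("FakeStatus", ["text", "created_at", "user"])


def status_to_row(status):
    """Flatten one Status-like object into a plain dict."""
    return {
        "Content": status.text,
        "Created at": status.created_at,
        "User": status.user.screen_name,
    }


def statuses_to_rows(statuses):
    """Flatten a list of Status-like objects, preserving their order."""
    return [status_to_row(s) for s in statuses]


sample = [
    FakeStatus("hello world", "Mon Jul 11 15:00:00 +0000 2016", FakeUser("BIPPMcNair")),
    FakeStatus("second tweet", "Tue Jul 12 09:30:00 +0000 2016", FakeUser("BIPPMcNair")),
]
rows = statuses_to_rows(sample)
print(rows[0]["User"])
```

Keeping the flattening step separate from the API call means it can be tested on canned objects without spending any rate-limited queries.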

7/13: Full Dev

Documented in-file, as below:

Twitter Webcrawler

  • Summary: Rudimentary (and slightly generalized) webcrawler that queries the Twitter database using the Twitter API. At the current stage of development/discussion, the user shortname (in twitter, @shortname) is used as the query key, and this script writes up to 200 recent tweets of said user to a tab-delimited, UTF-8 document, along with the details and social interactions each tweet possesses
  • Input: Twitter database, Shortname string of queried user (@shortname)
  • Output: Local database of queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By".
  • Version: 1.0 Alpha
  • Development environment specs: Twitter API, JSON library, python-twitter library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3

Pseudo-code

  • function I: main driver
    • generate empty table for subsequent building with apt columns
    • iterate through each status object in the obtained data, and fill up the table rows as apt, one row per tweet
    • and the main processing task being: write table to output file
  • function II: empty table generator
    • modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing
  • function IV: authenticator + twitter API access interface setup
    • authenticate using our granted consumer keys and access tokens
    • obtains working twitter API object, post-authentication
  • function V: subquery #1
    • iterate through main query object in order to further query for retweeters, i.e. GetRetweeters() and ???
  • function VI: raw data acquisition
    • grabs raw data of recent tweets using master_working_api object
    • make it json so we can access and manipulate it easily
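
The "write table to output file" step of the pseudo-code could be sketched as below. This is not the project's actual code: the original build used pandas, whereas this sketch swaps in the standard-library csv module so that it runs anywhere, and the column list is a shortened, assumed subset of the output keys listed above.

```python
import csv
import io

# Assumed subset of the output keys named in the notes.
COLUMNS = ["Content", "User", "Created at", "Retweet Count", "Favorite Count"]


def write_table(rows, handle):
    """Write rows (a list of dicts) as a tab-delimited table with a header."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


rows = [
    {"Content": "hello world", "User": "BIPPMcNair",
     "Created at": "Mon Jul 11 15:00:00 +0000 2016",
     "Retweet Count": 3, "Favorite Count": 1},
]
buffer = io.StringIO()  # in-memory stand-in for the output file
write_table(rows, buffer)
output = buffer.getvalue()
print(output)
```

To produce a real UTF-8 file in Python 3, the same function can be handed a file opened with open(path, "w", encoding="utf-8", newline="").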

Notes:

  • Modular development and unit testing are integral to writing fast, working code. no joke
  • Problems with the GetFavorites() method, as it only returns the favorited list wrt the authenticated user (i.e. BIPPMcNair), not the input target user.
  • Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we were to scale.
  • A tweet looks like this in json:
Capture 16.PNG

7/14 & 7/15: Alpha dev wrap-up

7/18: Application on Todd's Hub Project

Notes and PC for the Todd's hub data

  • Input: csv of twitter @shortnames
  • Output: A main datasheet tagging each @shortname to the following keys: # of followers, # of following, # of tweets made in the past month; a side datasheet for each @shortname detailing the time signature, text, retweet count and other details of each tweet made by given @shortname in the past month.
  • Summary: need to fix up the automatic .csv writing methods, and the parameters so that the timeline is queried by time signature instead of by # of searched tweets (UPDATE: NOT POSSIBLE, LET'S JUST DO 200 RESULTS).
  • Pseudo-code
    • We need a driver function to write the main datasheet, as well as iterate through the input list of @shortnames and run the alpha scraper on each iteration.
    • doesn't need to have a read.csv side function - no room for failure, no need to test
    • Make one query per iteration, please.
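
A minimal sketch of the driver described in this pseudo-code, assuming the timeline query is passed in as a function so that exactly one query is made per @shortname. The names build_main_datasheet and fake_fetch, and the canned data, are hypothetical; in the real script the fetch function would wrap the timeline call of the authenticated API object.

```python
def build_main_datasheet(shortnames, fetch_timeline):
    """Run one timeline query per @shortname and summarise each result.

    fetch_timeline is any callable that takes a shortname and returns a list
    of tweets; it is called exactly once per shortname.
    """
    main_rows = []
    for name in shortnames:
        tweets = fetch_timeline(name)  # the single query for this iteration
        main_rows.append({"shortname": name, "tweet_count": len(tweets)})
    return main_rows


# Hypothetical canned data standing in for the Twitter API.
canned = {"hub_one": ["t1", "t2", "t3"], "hub_two": []}
calls = []


def fake_fetch(name):
    """Record each query so the one-query-per-iteration rule can be checked."""
    calls.append(name)
    return canned[name]


main_rows = build_main_datasheet(["hub_one", "hub_two"], fake_fetch)
print(main_rows)
```

Injecting the fetch function also makes it cheap to test the driver repeatedly without touching the rate limit.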

7/19: Application on Todd's Hub Project Pt.II

  • As noted in the python-twitter documentation, there is no direct way to filter timeline query results by start date/end date. So I've decided to write a support module time_signature_processor to help count the number of tweets made in the past month
    • first-take with from datetime import datetime
    • usage of the datetime.datetime.strptime() method to parse the (luckily) formatted date strings provided by twitter.Status objects into smart datetime.datetime objects that support mathematical comparisons (i.e. if tweet_time_obj < one_month_ago_obj: )
    • Does not support timezone-aware counting. In the current python version (2.7), strptime() cannot parse the UTC offset (%z), so my datetime.datetime objects are timezone-naive.
      • functionality to be subsequently improved
  • To retrieve data regarding # of following for each shortname, it seems like I have to call twitter.api.GetUser() in addition to twitter.api.GetUserTimeline(). To ration token usage, I will omit this second call for now.
    • functionality to be subsequently improved
  • Improvements to debugging interface and practice
    • Do note Komodo IDE's Unexpected Indent error message that procs when it cannot distinguish between whitespace created by tabs and by spaces. Use the editor debugger instead of the interactive shell in this case. The latter is tedious and impossible to fix.
  • A pandas.DataFrame can be built in a smart fashion from a dictionary that maps each column name to a list of that column's values. More efficient than the past method of creating an empty table then populating it cell-by-cell. This is clearly the way to go, I was young and stupid.
import pandas as pd

# Each key is a column name; each list holds that column's values, one per row.
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
# The columns argument fixes the column order of the resulting table.
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
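
For the tweet-counting support module mentioned at the top of this entry, a sketch of the idea could look like the following. It is not the actual time_signature_processor: it assumes the created_at strings look like "Mon Jul 18 12:00:00 +0000 2016", approximates "a month" as 30 days, and is written for Python 3, where strptime() accepts %z and the comparison can be timezone-aware. The name count_recent is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed format of the created_at strings, e.g. "Mon Jul 18 12:00:00 +0000 2016".
# %z needs Python 3; Python 2.7's strptime() rejects it.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def count_recent(created_at_strings, now, days=30):
    """Count timestamps that fall within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    parsed = (datetime.strptime(s, CREATED_AT_FORMAT) for s in created_at_strings)
    return sum(1 for moment in parsed if cutoff <= moment <= now)


now = datetime(2016, 7, 19, tzinfo=timezone.utc)
stamps = [
    "Mon Jul 18 12:00:00 +0000 2016",  # less than a day old: counted
    "Wed Jun 22 08:00:00 +0000 2016",  # roughly 27 days old: counted
    "Sun May 01 08:00:00 +0000 2016",  # older than 30 days: skipped
]
recent = count_recent(stamps, now)
print(recent)
```

Passing `now` in explicitly, rather than calling datetime.now() inside the function, keeps the count reproducible when testing.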

7/20: Application on Todd's Hub Project Pt. III

  • Major debugging session
    • Note: the str() function in python 2.7 attempts to convert its input into ASCII chars. When the input is unicode text containing non-ASCII chars, or has ambiguous components such as a backslash, str() will malfunction!!
    • Note: wrote additional function empty_timeline_filter() to address problems with certain @shortnames having no tweets, ever, and thus no timeline to speak of. Ran function and manually removed these @shortnames from the input .csv
    • Re: Twitter API TOKENS i.e. this is important. Refer to API Rate Limit Chart for comprehensive information on what traffic Twitter allows, and does not allow, us to query.
      • In calling GET statuses/user_timeline for all 109 @shortnames in the input list, I stay under the '180 calls per 15 minutes' rate limit. But do take note that while testing modules, one is likely to repeatedly call the same GET in a short burst.
      • In terms of future developments, GET methods such as GET statuses/retweeters/ids are capped at a mere '15 calls per 15 minutes'. This explains why it was previously impossible to populate a list of retweeter ID's for each tweet processed in the alpha scraper. (See above)
      • There is a sleeper parameter we can use with the twitter.Api object in python-twitter
import twitter
# Replace the placeholder strings with the real credentials.
api = twitter.Api(consumer_key='consumer_key',
                  consumer_secret='consumer_secret',
                  access_token_key='access_token',
                  access_token_secret='access_token_secret',
                  sleep_on_rate_limit=True)
      • It is, however, unclear if this is useful. Since the sleeper is triggered at a certain point, it is hard to keep track of the chokepoint and, more importantly, of how long the wait is and how much of it has already elapsed.
    • Note: it was important to add progress print() statements at each juncture of the scraper driver for each iteration of data scraping, as follows. They helped me track the progress of the data query and writing, and alerted me to possible bugs that arise for individual @shortnames and timelines.
Capture 18.PNG
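
One possible alternative to the opaque sleeper is to pace the calls ourselves, so the wait between queries is always known. The sketch below only does the arithmetic, using the two limits quoted above (180 and 15 calls per 15 minutes); the function names are hypothetical and nothing here talks to Twitter.

```python
def min_seconds_between_calls(calls_per_window, window_minutes=15):
    """Smallest even spacing (in seconds) that stays inside the rate limit."""
    return (window_minutes * 60.0) / calls_per_window


def calls_fit_in_one_window(planned_calls, calls_per_window):
    """True if all planned calls can be made without waiting for a reset."""
    return planned_calls <= calls_per_window


timeline_gap = min_seconds_between_calls(180)   # GET statuses/user_timeline
retweeter_gap = min_seconds_between_calls(15)   # GET statuses/retweeters/ids
print(timeline_gap, retweeter_gap)
print(calls_fit_in_one_window(109, 180))
```

The resulting gap could be handed to time.sleep() between queries, which makes the remaining runtime easy to print alongside the progress statements.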

Note to self: full automation/perfectionism is not necessary or helpful in a dev environment. It is of paramount importance to seek the shortest path, the max effect and the most important problem at each given step.

  • Development complete
    • Output files can be found in E:\McNair\Users\GunnyLiu, with E:\ being McNair's shared bulk drive.
      • Main datasheet that maps each row of @shortname to its count of followers and past month tweets is named Hub_Tweet_Main_DataSheet.csv
      • Individual datasheets for each @shortname that maps each tweet to tweet details can be found at Twitter_Data_Where_@shortname_Tweets.csv
    • Code will be LIVE on mcnair git soon
  • Output/Process Shortcoming:
    • Unable to retrieve the retweeter list for each tweet, because this current pull has a total of 200 x 109 = 21,800 tweets. Making 1 call a minute due to the rate limit will amount to a runtime of about 21,800 minutes, i.e. approx. 363 hours. If an intern is paid $10 an hour, this data could cost $3,630. Let's talk about opportunity cost.
    • Unable to process the past month tweet count if the count exceeds 199. Will need to write additional recursive modules to do additional pulls to achieve the actual number. To be discussed
    • Unable to correct for timezone in calculating tweets over the past month. Needs python 3.5.3 to be installed
    • Unable to process data for a single @shortname i.e. @FORGEPortland because they don't tweet and that's annoying
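
The runtime and cost estimate in the first shortcoming can be reproduced with a few lines of arithmetic. The constants are the figures quoted in these notes; the script keeps the fractional hours, so the cost comes out slightly above the rounded $3,630.

```python
TWEETS_PER_TIMELINE = 200
SHORTNAMES = 109
CALLS_PER_WINDOW = 15      # GET statuses/retweeters/ids
WINDOW_MINUTES = 15
HOURLY_WAGE = 10           # dollars, the figure used in the notes

total_calls = TWEETS_PER_TIMELINE * SHORTNAMES          # one call per tweet
minutes = total_calls * WINDOW_MINUTES / CALLS_PER_WINDOW
hours = minutes / 60
cost = hours * HOURLY_WAGE
print(total_calls, round(hours, 1), round(cost))
```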

7/21: Application to Todd's Hub Project Pt. IV

  • Fix for time signatures in output
    • Instead of discrete strings, we want the "Creation Time" value of tweets in the output to be in the format of MM/DD/YYYY, which supports performance on MS Excel and other GUI-based analysis environments
    • Wrote new function time_signature_simplifier() and time_signature_mass_simplification()
    • Functions iterate through all existing .csv tweetlogs of listed hubs @shortnames and process them in a python environment as pd.DataFrame objects
    • For each date string that exists under the "Creation Time" column, the function converts it to a datetime.datetime object, and overwrites it using the .date().month, .date().day, .date().year attributes of that object.
      • Met problems with date strings such as "29 Feb"; datetime has compatibility issues with leap years esp. when year is defaulted to 1900. Do take note.
    • Test passed; new data is available for every input @shortname as Twitter_Data_Where_@shortname_Tweets_v2.csv
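
The date simplification described above can be sketched as follows. This is an illustration rather than the actual time_signature_simplifier(): the function names are hypothetical, the input format is assumed to be the full created_at string including the year, and the code targets Python 3 for the %z directive. Parsing the full string avoids the "29 Feb" problem, because the year no longer defaults to 1900, which was not a leap year.

```python
from datetime import datetime

CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # assumed input format


def simplify_time_signature(created_at):
    """Convert a full created_at string into MM/DD/YYYY."""
    return datetime.strptime(created_at, CREATED_AT_FORMAT).strftime("%m/%d/%Y")


def simplify_all(created_at_values):
    """Apply the conversion to every value in a column-like list."""
    return [simplify_time_signature(value) for value in created_at_values]


dates = simplify_all([
    "Mon Feb 29 08:15:00 +0000 2016",  # leap day parses because the year is present
    "Thu Jul 21 17:45:00 +0000 2016",
])
print(dates)
```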