*As documented in the <code>python-twitter</code> documentation, there is no direct way to filter timeline query results by start date/end date. So I've written a support module, <code>time_signature_processor</code>, to help count the number of tweets posted in the past month.
**First take: uses <code>from datetime import datetime</code>
**Uses the <code>datetime.strptime()</code> method to parse the (luckily well-formatted) date strings provided by <code>twitter.Status</code> objects into <code>datetime</code> objects that support mathematical comparisons (i.e. <code>if tweet_time_obj < one_month_ago_obj:</code>)
**Does not support timezone-aware counting: the current Python version (2.7) does not give my <code>datetime</code> objects timezone-awareness.
***'''functionality to be subsequently improved'''
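The parsing-and-comparison approach above can be sketched as follows. The function name <code>tweets_in_past_month</code> and the 30-day cutoff are illustrative assumptions, not the module's actual code; the format string matches the <code>created_at</code> strings that <code>twitter.Status</code> objects carry (e.g. "Wed Jun 28 13:08:45 +0000 2017"):

```python
from datetime import datetime, timedelta

# Twitter created_at strings look like "Wed Jun 28 13:08:45 +0000 2017".
# The "+0000" offset is matched as literal text because strptime here
# produces naive (not timezone-aware) datetime objects.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"

def tweets_in_past_month(created_at_strings):
    """Count tweets whose created_at falls within the last 30 days (naive UTC)."""
    one_month_ago = datetime.utcnow() - timedelta(days=30)
    count = 0
    for created_at in created_at_strings:
        tweet_time = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
        if tweet_time >= one_month_ago:
            count += 1
    return count
```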
*Improvements to debugging interface and practice
**Do note Komodo IDE's <code>Unexpected Indent</code> error message, which procs when the parser cannot distinguish between whitespace created by tabs versus spaces. Use the editor's debugger rather than the interactive shell in this case; tracking down mixed indentation in the shell is tedious and effectively impossible.
*A <code>pandas.DataFrame</code> can be built in a smart fashion from a dictionary whose keys become column names and whose list values become the columns' data. This is more efficient than my past method of creating an empty table and then populating it cell by cell; the dictionary approach is clearly the way to go.
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
===7/20: Application on Todd's Hub Project Pt. III===
*Major debugging session
**Note: in Python 2, <code>str()</code> attempts to encode its input as ASCII. When the input already contains non-ASCII (e.g. UTF-8) characters, or ambiguous components such as a backslash, <code>str()</code> will fail or misbehave!
**Note: wrote an additional function, <code>empty_timeline_filter()</code>, to address problems with certain @shortnames that have never tweeted and thus have no timeline to speak of. Ran the function and manually removed these @shortnames from the input .csv
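A hypothetical sketch of what <code>empty_timeline_filter()</code> does; only the function's name appears in the log, so the signature below is an assumption, and the timeline lookup is injected as a parameter so the filtering logic stands alone without API calls:

```python
def empty_timeline_filter(shortnames, get_timeline):
    """Split shortnames into (has_tweets, empty) based on their timelines.

    get_timeline(name) should return that account's list of tweets,
    which is empty for accounts that have never tweeted.
    """
    has_tweets, empty = [], []
    for name in shortnames:
        timeline = get_timeline(name)
        # Accounts with no tweets get collected for manual removal
        # from the input .csv.
        (has_tweets if timeline else empty).append(name)
    return has_tweets, empty
```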
**'''Re: Twitter API TOKENS''' i.e. this is important. Refer to the [https://dev.twitter.com/rest/public/rate-limits API Rate Limit Chart] for comprehensive information on what query traffic Twitter does, and does not, allow.
***In calling <code>GET statuses/user_timeline</code> for all 109 @shortnames in the input list, I am barely staying under the '''180 calls per 15 minutes''' rate limit. But do take note that while testing modules, one is likely to call the same <code>GET</code> endpoint repeatedly in short bursts.
***In terms of future developments, <code>GET</code> methods such as <code>GET statuses/retweeters/ids</code> are capped at a mere '''15 calls per 15 minutes'''. This explains why it was previously impossible to populate a list of retweeter IDs for each tweet processed in the alpha scraper. (See above.)
***There is a sleeper parameter we can use with the <code>twitter.Api</code> object in <code>python-twitter</code>
import twitter
api = twitter.Api(consumer_key=[consumer key],
                  consumer_secret=[consumer secret],
                  access_token_key=[access token],
                  access_token_secret=[access token secret],
                  sleep_on_rate_limit=True)
***It is, however, unclear how useful this is. Since the sleeper triggers at some point mid-run, it is hard to keep track of where execution has choked and, more importantly, how long the wait is and how much of it has already elapsed.
**Note: it was important to add progress <code>print()</code> statements at each juncture of the scraper driver, for each iteration of data scraping, as follows. They helped me track the progress of the data querying and writing, and alerted me to bugs arising for individual @shortnames and timelines.
[[File:Capture 18.PNG|400px|none]]
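The driver loop with progress statements could be sketched like this. The function name <code>run_scraper</code>, the message wording, and the injected <code>fetch_timeline</code> parameter are all illustrative assumptions; the real driver calls the Twitter API and writes out CSVs:

```python
def run_scraper(shortnames, fetch_timeline, log=print):
    """Pull each @shortname's timeline, printing progress at each juncture."""
    results = {}
    total = len(shortnames)
    for i, name in enumerate(shortnames, 1):
        # Progress marker before the (slow, rate-limited) API call.
        log("[%d/%d] querying timeline for @%s ..." % (i, total, name))
        try:
            timeline = fetch_timeline(name)
        except Exception as exc:
            # Surface per-@shortname failures instead of dying silently.
            log("[%d/%d] FAILED for @%s: %s" % (i, total, name, exc))
            continue
        results[name] = timeline
        log("[%d/%d] got %d tweets for @%s" % (i, total, len(timeline), name))
    return results
```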
Note to self: full automation/perfectionism is neither necessary nor helpful in a dev environment. It is of paramount importance to seek the shortest path, the maximum effect, and the most important problem at each given step.
*'''Development complete'''
**Output files can be found in E:\McNair\Users\GunnyLiu, with E:\ being McNair's shared bulk drive.
***Main datasheet that maps each row of @shortname to its count of followers and past month tweets is named <code>Hub_Tweet_Main_DataSheet.csv</code>
***Individual datasheets for each @shortname that maps each tweet to tweet details can be found at <code>Hub_Tweet_Main_DataSheet.csv</code>
**Code will be LIVE on <code>mcnair git</code> soon
*Output/Process Shortcoming:
**Unable to retrieve the retweeter list for each tweet, because this pull covers a total of 200×109 = 21,800 tweets. Making one call per minute due to the rate limit would amount to a runtime of over 21,800 minutes, approximately 363 hours. If an intern is paid $10 an hour, this data would cost about $3,630. Let's talk about opportunity cost.
**Unable to process the past-month tweet count if the count exceeds 199. Will need to write additional recursive modules that perform further pulls to reach the actual number. To be discussed.
**Unable to correct for timezone in calculating tweets over the past month. Would need to install <code>python 3.5.3</code>
**Unable to process data for a single @shortname, @FORGEPortland, because the account has never tweeted, which is annoying.
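The recursive pull needed for counts above 199 could work by paging backwards with <code>max_id</code>, a parameter <code>GET statuses/user_timeline</code> does accept: each page's lowest tweet ID minus one becomes the next request's <code>max_id</code>. The sketch below is an assumption about how that module might look, with the page fetch injected as a function so no live API is needed:

```python
def pull_full_timeline(fetch_page):
    """Walk back through a timeline one page at a time.

    fetch_page(max_id) should return a list of (tweet_id, status) tuples,
    newest first, at most one API page's worth; max_id=None means "newest".
    Returns the concatenated pages, newest first.
    """
    tweets = []
    max_id = None
    while True:
        page = fetch_page(max_id)
        if not page:
            break  # walked past the oldest tweet
        tweets.extend(page)
        # Next page: everything strictly older than this page's oldest tweet.
        max_id = min(tweet_id for tweet_id, _ in page) - 1
    return tweets
```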