Twitter Webcrawler (Tool)
Revision as of 16:52, 15 July 2016
| Twitter Webcrawler (Tool) | |
|---|---|
| Project Information | |
| Project Title | Twitter Webcrawler (Tool) |
| Start Date | Summer 2016 |
| Deadline | |
| Primary Billing | |
| Notes | |
| Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Description
Notes: The Twitter Webcrawler, in its alpha version, is an expedition project involving the Twitter API, in search of a sustainable and scalable way to excavate retweet-retweeter, favorite-favoriter, and follower-following relationships in the entrepreneurship Tweet-o-sphere. On the same beat, we also seek to document the tweeting activities/timelines of important tweeters in that same Tweet-o-sphere.
Input: Twitter database
Output: Local database documenting important timelines and relationships in the entrepreneurship Tweet-o-sphere.
Development Notes
7/11: Project start
- Dan wanted:

[[File:Capture 15.PNG|400px|none]]
- First-take on Twitter API Overview
- Cumbersome API that is not directly accessible and requires a great deal of configuration if one chooses to leverage e.g. the `import requests` library.
  - Turns out Twitter has a long, controversial history with respect to third-party development; there is no clean canonical interface to its database.
  - DO NOT attempt to access the Twitter API through the canonically documented methods - a huge waste of time.
  - The documented authentication process is obsolete - do not use the canonical documentation for the OAuth procedure.
- Instead, DO USE a third-party python interface such as python-twitter by bear - highly recommended in hindsight.
  - Follow python-twitter's documented methods for authentication.
  - The Twitter account that I am using is `shortname: BIPPMcNair` with `password: amount`.
    - One can obtain the consumer key, consumer secret, access key, and access secret by accessing the dev portal with this account and tapping `TOOLS > Manage Your Apps` in the footer bar of the portal.
- There is **no** direct access to the Twitter database through http://, as before, so expect to do all processing in a py dev environment.
7/12: Grasping API
- The python-twitter library is extremely intricate and well-synchronized
  - All queries are to be launched through a `twitter.api.Api` object, which is produced by the authentication process implemented yesterday:
```python
>>> import twitter
>>> api = twitter.Api(consumer_key='consumer_key',
                      consumer_secret='consumer_secret',
                      access_token_key='access_token',
                      access_token_secret='access_token_secret')
```
- Some potentially very useful query methods are:
  - `Api.GetUserTimeline(user_id=None, screen_name=None)`, which returns up to 200 recent tweets of the input user. Really nice that the Twitter database operates on something as simple as `screen_name`, i.e. the @shortname that is very public and familiar.
  - `Api.GetRetweeters(status_id=None)` and `Api.GetRetweets(status_id=None)`, which identify a tweet as a status by its status_id and spit out all the retweets that this particular tweet has undergone.
  - `Api.GetFavorites(user_id=None)`, which seems to satisfy our need for tracking favorited tweets.
  - `Api.GetFollowers(user_id=None, screen_name=None)` and `Api.GetFollowerIDs(user_id=None, screen_name=None)`, which seem to be a good relationship-mapping mechanism, esp. for the mothernode tweeters we care about.
- After retrieving data objects using these query methods, we can understand and process them using instructions from the [Twitter-python Models Source Code](http://python-twitter.readthedocs.io/en/latest/_modules/twitter/models.html)
  - Note that tweets are expressed as `Status` objects
    - A `Status` holds useful parameters such as `'text'`, `'created_at'`, `'user'`, etc.
    - These can be retrieved by classical object expressions such as `Status.created_at`
  - Note that users are expressed as `User` objects
  - Best part? All these objects inherit API methods such as `AsJsonString(self)` and `AsDict(self)`, so we can read and write them as JSON or dict objects in the py environment
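Since `AsDict(self)` hands us plain dicts, the downstream flattening step can be sketched in pure Python. This is only a sketch: `sample_status` and `flatten_status` are hypothetical names, and the dict shape is an assumption modelled on the model parameters and output keys noted in this log, not the library's exact schema.

```python
import json

# Hypothetical sample of what Status.AsDict() might return; the field
# names follow the model parameters noted above ('text', 'created_at',
# 'user') plus the entity fields our output table needs.
sample_status = {
    "created_at": "Mon Jul 11 16:52:00 +0000 2016",
    "text": "Announcing our new #startup accelerator",
    "user": {"screen_name": "BIPPMcNair"},
    "hashtags": [{"text": "startup"}],
    "retweet_count": 3,
    "favorite_count": 5,
}

def flatten_status(status):
    """Flatten a Status-style dict into one flat row for the output table."""
    return {
        "Content": status.get("text", ""),
        "User": status.get("user", {}).get("screen_name", ""),
        "Created at": status.get("created_at", ""),
        "Hashtags": ", ".join(h["text"] for h in status.get("hashtags", [])),
        "Retweet Count": status.get("retweet_count", 0),
        "Favorite Count": status.get("favorite_count", 0),
    }

print(json.dumps(flatten_status(sample_status), indent=2))
```

Using `.get()` with defaults keeps the row-builder from crashing on tweets that lack an entity (e.g. no hashtags), which matters when iterating over 200 heterogeneous statuses.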
7/13: Full Dev
Documented in-file, as below:
Twitter Webcrawler
- Summary: A rudimentary (and slightly generalized) webcrawler that queries the Twitter database using the Twitter API. At the current stage of development/discussion, the user shortname (in Twitter, @shortname) is used as the query key, and the script publishes the 200 most recent tweets of said user in a tab-delimited, UTF-8 document, along with the details and social interactions each tweet possesses.
- Input: Twitter database, Shortname string of queried user (@shortname)
- Output: Local database of queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By".
- Version: 1.0 Alpha
- Development environment specs: Twitter API, JSON library, twitter-python library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3
Pseudo-code
- function I: main driver
  - generate an empty table with apt columns for subsequent building
  - iterate through each status object in the obtained data and fill up the table rows as apt, one row per event
  - the main processing task: write the table to the output file
- function II: empty table generator
  - **kept modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing**
- function IV: authenticator + Twitter API access interface setup
  - authenticate using our granted consumer keys and access tokens
  - obtain a working Twitter API object, post-authentication
- function V: subquery #1
  - iterate through the main query object in order to further query for retweeters, i.e. GetRetweeter() and ???
- function VI: raw data acquisitor
  - grabs raw data of recent tweets using the master_working_api object
  - converts it to JSON so we can access and manipulate it easily
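The driver's table-writing step above can be sketched with the standard-library csv module in place of pandas. A hedged sketch only (written in Python 3, versus the Py 2.7 spec above): the `write_table` helper, the column subset, and the sample rows are assumptions modelled on the output keys listed in the summary.

```python
import csv
import io

# Column names follow the output keys listed in the summary above;
# this is a subset, for illustration.
COLUMNS = ["Content", "User", "Created at", "Retweet Count"]

def write_table(rows, handle):
    """Write one row per tweet to a tab-delimited stream, header first."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical rows standing in for processed Status objects.
rows = [
    {"Content": "Announcing our new accelerator", "User": "BIPPMcNair",
     "Created at": "2016-07-13", "Retweet Count": 2},
]

buffer = io.StringIO()          # swap for open(path, "w", encoding="utf-8")
write_table(rows, buffer)
print(buffer.getvalue())
```

Writing to a `StringIO` keeps the function unit-testable, in the spirit of the modularity note above; the real driver would pass an opened UTF-8 file handle instead.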
Notes:
- Modular development and unit testing are integral to writing fast, working code. No joke.
- Problems with the GetFavorites() method: it only returns the favorited list with respect to the authenticated user (i.e. BIPPMcNair), not the input target user.
- Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we were to scale.
- A tweet looks like this in JSON:

[[File:Capture 16.PNG|400px|none]]
7/14: Alpha dev wrap-up
- The black box can pretty much be sealed after a round of debugging
- All output requirements fulfilled except for the output "retweeter" list per tweet
- Code is live
- Sample output is live
- Awaiting more discussion, modifications
- Ed mentioned populating a database according to past Tweet-o-sphere experimentation/excavation results (see [[Social_Media_Entrepreneurship_Resources]])