Difference between revisions of "Twitter Webcrawler (Tool)"

From edegan.com
(Created page with "{{McNair Projects |Project Title=Twitter Webcrawler (Tool) |Topic Area=Resources and Tools |Owner=Gunny Liu |Start Term=Summer 2016 |Status=Active |Deliverable=Tool |Audience=...")
 

Revision as of 16:52, 15 July 2016


McNair Project
Twitter Webcrawler (Tool)
Copyright © 2016 edegan.com. All Rights Reserved.


Description

Notes: The Twitter Webcrawler, in its alpha version, is an exploratory project that uses the Twitter API in search of a sustainable and scalable way to excavate retweet-retweeter, favorited-favoriter and following-follower relationships in the entrepreneurship Tweet-o-sphere. In the same vein, we also seek to document the tweeting activities/timelines of important Twitter users in the same Tweet-o-sphere.

Input: Twitter database

Output: Local database documenting important timelines and relationships in the entrepreneurship Tweet-o-sphere.

Development Notes

7/11: Project start

  • Dan wanted:
Capture 15.PNG
  • First-take on Twitter API Overview
    • Cumbersome API that is not directly accessible and requires a great deal of configuration if one chooses to call it with, e.g., the requests library.
      • Turns out Twitter has a long, controversial history wrt third-party development. There is no clean canonical interface to access its database.
      • DO NOT attempt to access the Twitter API through the canonical documented methods - huge waste of time
      • The documented authentication process is obsolete - do not use the canonical documentation for the OAuth procedure
  • Instead, DO USE third-party developed python interfaces such as python-twitter by bear - highly recommended in hindsight
    • Follow python-twitter's documented methods for authentication
    • The twitter account that I am using is shortname: BIPPMcNair and password: amount
      • One can obtain the consumer key, consumer secret, access key and access secret through accessing the dev portal using the account and tapping TOOLS > Manage Your Apps in the footer bar of the portal.
    • There is no direct access to Twitter database through http://, as before, so expect to do all processing in a py dev environment.

7/12: Grasping API

  • The python-twitter library is extremely intricate and well-synchronized
    • All queries are to be launched through a twitter.api.Api object, which is produced by the authentication process implemented yesterday
>>> import twitter
>>> api = twitter.Api(consumer_key='consumer_key',
                      consumer_secret='consumer_secret',
                      access_token_key='access_token',
                      access_token_secret='access_token_secret')
    • Some potentially very useful query methods are:
      • Api.GetUserTimeline(user_id=None, screen_name=None), which returns up to 200 recent tweets of the input user. Really nice that the twitter database operates on something as simple as screen_name, which is the @shortname that is very public and familiar.
      • Api.GetRetweeters(status_id=None) and Api.GetRetweets(status_id=None), which identify a tweet (a status) by its status_id and return, respectively, the users who retweeted it and the retweets it has received.
      • Api.GetFavorites(user_id=None), which seems to satisfy our need for tracking favorited tweets
      • Api.GetFollowers(user_id=None, screen_name=None) and Api.GetFollowerIDs(user_id=None, screen_name=None), which seem to be a good relationship mapping mechanism, esp. for the mothernode tweeters we care about.
    • After retrieving data objects using these query methods, we can understand and process them using instructions from Twitter-python Models Source Code
      • To note that tweets are expressed as Status objects
        • Each one holds useful attributes such as 'text', 'created_at', 'user', etc
        • They can be retrieved by classical object expressions such as Status.created_at
      • To note that users are expressed as User objects
      • Best part? All these model objects provide methods such as AsJsonString(self) and AsDict(self) so that we can read and write them as JSON strings or dict objects in the py environment
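
The query-then-convert flow above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes an already authenticated twitter.Api object is passed in as `api`, and the helper name `fetch_timeline_dicts` is hypothetical. It relies only on GetUserTimeline() and AsDict() as named in the notes above.

```python
def fetch_timeline_dicts(api, screen_name, count=200):
    """Return a user's recent tweets as plain dicts.

    `api` is assumed to be an authenticated twitter.Api object
    (python-twitter). The helper name is illustrative only.
    """
    # One query per user; Twitter caps a single timeline request at 200 tweets.
    statuses = api.GetUserTimeline(screen_name=screen_name, count=count)
    # Each Status object can be turned into a dict for easy inspection.
    return [status.AsDict() for status in statuses]


# Hypothetical usage, given the `api` object built during authentication:
#   tweets = fetch_timeline_dicts(api, "BIPPMcNair")
#   print(tweets[0].get("text"))
```

Passing the api object in as a parameter keeps the helper easy to test with a stand-in object, without touching the network.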

7/13: Full Dev

Documented in-file, as below:

Twitter Webcrawler

  • Summary: Rudimentary (and slightly generalized) webcrawler that queries the twitter database using the twitter API. At the current stage of development/discussion, the user shortname (in twitter, @shortname) is used as the query key, and this script writes up to 200 recent tweets of said user to a tab-delimited, UTF-8 document, along with the details and social interactions of each tweet
  • Input: Twitter database, Shortname string of queried user (@shortname)
  • Output: Local database of queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By".
  • Version: 1.0 Alpha
  • Development environment specs: Twitter API, JSON library, twitter-python library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3

Pseudo-code

  • function I: main driver
    • generate empty table for subsequent building with apt columns
    • iterate through each status object in the obtained data, and fill up the table rows as apt, one row per tweet
    • and the main processing task being: write table to output file
  • function II: empty table generator
    • modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing
  • function IV: authenticator + twitter API access interface setup
    • authenticate using our granted consumer keys and access tokens
    • obtains working twitter API object, post-authentication
  • function V: subquery #1
    • iterate through main query object in order to further query for retweeters, i.e. GetRetweeters() and ???
  • function VI: raw data acquisition
    • grabs raw data of recent tweets using master_working_api object
    • convert it to json so we can access and manipulate it easily
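
The table-building steps in the pseudo-code can be sketched as below. This is a simplified illustration under stated assumptions, not the project's implementation: it uses the standard-library csv module in place of pandas.DataFrame so that it runs without third-party packages, it targets Python 3 rather than Py 2.7, and it leaves out the "Retweeted By" and "Favorited By" columns because they need separate subqueries. The dict keys ('text', 'user', 'hashtags', and so on) are assumed to match what Status.AsDict() yields; the names COLUMNS, status_to_row and write_table are hypothetical.

```python
import csv
import io

# Columns taken from the output description, minus the subquery-based ones.
COLUMNS = ["Content", "User", "Created at", "Hashtags", "User Mentions",
           "Retweet Count", "Favorite Count"]


def status_to_row(status):
    """Flatten one tweet dict into a row keyed by COLUMNS.

    Missing keys fall back to empty values, since a tweet without
    hashtags or mentions may simply lack those entries.
    """
    return {
        "Content": status.get("text", ""),
        "User": status.get("user", {}).get("screen_name", ""),
        "Created at": status.get("created_at", ""),
        "Hashtags": ",".join(
            h.get("text", "") for h in status.get("hashtags", [])),
        "User Mentions": ",".join(
            m.get("screen_name", "") for m in status.get("user_mentions", [])),
        "Retweet Count": status.get("retweet_count", 0),
        "Favorite Count": status.get("favorite_count", 0),
    }


def write_table(statuses, handle):
    """Write a header plus one tab-delimited row per tweet to a text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for status in statuses:
        writer.writerow(status_to_row(status))


# Small in-memory demonstration with a made-up tweet dict.
demo = io.StringIO()
write_table([{"text": "hi", "user": {"screen_name": "BIPPMcNair"}}], demo)
```

To produce a file, open it with `open(path, "w", encoding="utf-8", newline="")` and pass the handle to write_table.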

Notes:

  • Modular development and unit testing are integral to writing fast, working code. no joke
  • Problems with the GetFavorites() method, as it only returns the favorited list wrt the authenticated user (i.e. BIPPMcNair), not the input target user.
  • Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we are to scale.
  • A tweet looks like this in json:
Capture 16.PNG
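
One possible way to soften the rate-limit problem noted above is sketched here. It is an assumption-laden illustration, not a tested fix: it skips the retweeter subquery for tweets with no retweets, and it waits and retries when a call fails. The helper names call_with_backoff and collect_retweeters are hypothetical, tweets are assumed to be dicts with 'id' and 'retweet_count' keys, and the 900-second default reflects Twitter's 15-minute rate-limit windows. By default any exception triggers a retry; in real use one would probably pass retry_on=(twitter.TwitterError,).

```python
import time


def call_with_backoff(func, *args, retries=3, wait_seconds=900,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func; on a retry_on error, sleep and retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of retries, surface the original error
            sleep(wait_seconds)


def collect_retweeters(api, statuses, **backoff_options):
    """Map tweet id -> retweeter ids, skipping tweets with no retweets.

    `api` is assumed to be an authenticated twitter.Api object.
    """
    result = {}
    for status in statuses:
        if not status.get("retweet_count"):
            continue  # nothing to look up, which saves one API call
        result[status["id"]] = call_with_backoff(
            api.GetRetweeters, status["id"], **backoff_options)
    return result
```

The sleep function is injectable so the retry logic can be exercised without real waiting. This only spreads queries over time; it does not raise the limit itself.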

7/14: Alpha dev wrap-up

  • Black box can be pretty much sealed after a round of debugging
  • All output requirements fulfilled except for the "retweeter" list per tweet
  • Code is live
  • Sample output is live
  • Awaiting more discussion, modifications
  • Ed mentioned populating a database according to [Social_Media_Entrepreneurship_Resources|Past Tweet-o-sphere experimentation/excavation results]