Twitterverse Exploration

From edegan.com
Revision as of 19:18, 28 February 2017 by Ed (talk | contribs)

The Report

After more than three days of reconnaissance, here is a comprehensive update on the Twitterverse, Twitter mining, and the dream case that we seek from Twitter here at McNair.

Beliefs Update

Most importantly, Twitter Mining for McNair should be an aggregate of three approaches.

  • Network Visualization
    • Fundamentally, one ought to think of Twitter as an interest group, not a bona fide social network. Consider this: Twitter represents the degree of interest in, for instance, #ycombinator, not the stable business and personal connections made to and from @ycombinator. It also houses everyone from the very important @barackobama to the fictional and frivolous @homersimpson. At the outset, Twitter represents trends rather than material fact. The following-follower relationship is mono-directional and voyeuristic, representing what people care and think about, instead of who they really are and what they really do. Twitter activity happens at the speed of thought (140 chars) and represents our rapidly-changing minds and perceptions.

At this level, let's consider the classical aspect of Twitter mining: Network Visualization. This is sociological and concerned with the self-organization of interest-based communities. It primarily provides us with a sense of social roles in an interest group: broadcaster vs receiver, influencer vs influenced. We can also learn about the quantity of interest in a social group, and, when measured over time, the delta/changes in this quantity within the group. We gain knowledge about trends that rise and fall, people that move in and out of the interest group, and community structures of a given interest group.

  • Tweet Analytics
    • Digging a little deeper beyond this superficial exchange, we come to a point where we need to think qualitatively about tweets. What people care about reflects some material facts about their material selves. Tweets containing hashtags such as #kpceoworkshop, for instance, tell us which people are attending the event physically and which people are passing commentary on it. When a startup has an IPO/Acquisition, it will attract a tremendous volume of mentions. When the presidential candidates talk about their technology policies, the entrepreneurship twitterverse responds.

This is the next level of Twitter mining, often associated with Natural Language Processing techniques: Tweet Analytics. Combined with the Network Visualization, we can learn about events that are unfolding in different parts of the entrepreneurship world, as well as new organizations and topics that appear in the conversation. These new organizations and topics will, in turn, generate the beginnings of new interest networks. When measured over time, we can get a handle on the up-and-coming stars in the field, and emerging trends that are of note.
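As a minimal sketch of the simplest form of tweet analytics described above, the following counts hashtags and mentions across a batch of tweet texts. It assumes the tweets have already been collected as plain strings. The regular expressions are a rough approximation of Twitter's own entity rules, and the sample tweets are made up for illustration.

```python
import re
from collections import Counter

# Rough patterns; Twitter's real entity parsing is more involved.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def extract_entities(tweets):
    """Count hashtags and mentions (case-insensitively) in an iterable of tweet texts."""
    hashtags, mentions = Counter(), Counter()
    for text in tweets:
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
        mentions.update(name.lower() for name in MENTION.findall(text))
    return hashtags, mentions

# Made-up sample tweets for illustration only.
sample = [
    "Great session at #kpceoworkshop with @ycombinator",
    "Following #KPCEOWorkshop remotely, thanks @techcrunch",
    "@ycombinator demo day is coming up",
]
hashtags, mentions = extract_entities(sample)
print(hashtags.most_common(3))
print(mentions.most_common(3))
```

Running the same count over successive time windows would give the "new organizations and topics" signal, since a term that was absent in one window and present in the next is a candidate for closer inspection.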

  • Geo Visualization
    • On a physical level, tweets contain geo-information such as @user's home location and the tweet-from location. Through this, we stand to learn about the people's interests stratified by location. When combined with the former two forms of twitter mining, it can enhance what we know about physically-bound social dynamics and physically-bound shifts in interest and opinions.

Geo Visualization is the process of mapping tweets to a real map of the Earth. Applying tweet analytics and network visualization to it, we stand to have an understanding of the geographical dimension of entrepreneurship activities in terms of people, organizations and events in particular places, for instance Palo Alto, CA or Austin, TX. When measured over time, we can observe the crests and troughs of activity in these places. This would be extremely promising especially for the HUBS research project.
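Before anything is drawn on a map, the "crests and troughs of activity" can be read from a simple table of tweet counts per place per month. The sketch below assumes each tweet has been reduced to a dict with a hypothetical `place` label and an ISO-format `created_at` string. These field names are illustrative and are not Twitter's API schema, and the sample rows are made up.

```python
from collections import Counter, defaultdict

def activity_by_place(tweets):
    """Count tweets per place per month.

    `tweets` is an iterable of dicts with hypothetical keys:
      'place'      - a label such as "Austin, TX", or None when the tweet
                     carries no geo-information
      'created_at' - an ISO string such as "2016-07-28T15:27:00"
    """
    table = defaultdict(Counter)
    for tweet in tweets:
        place = tweet.get("place")
        if not place:
            continue  # tweets without geo-information cannot be mapped
        month = tweet["created_at"][:7]  # "YYYY-MM"
        table[place][month] += 1
    return table

# Made-up sample rows for illustration only.
sample = [
    {"place": "Palo Alto, CA", "created_at": "2016-07-01T09:00:00"},
    {"place": "Palo Alto, CA", "created_at": "2016-07-15T12:30:00"},
    {"place": "Austin, TX", "created_at": "2016-07-20T18:45:00"},
    {"place": "Palo Alto, CA", "created_at": "2016-08-02T08:10:00"},
    {"place": None, "created_at": "2016-08-03T08:10:00"},
]
activity = activity_by_place(sample)
for place, months in sorted(activity.items()):
    print(place, dict(sorted(months.items())))
```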

For simplicity, I will refer to the above aggregate as Viz&Ana.

Key Ideas

  • Viz&Ana: DaaS
    • While exploring the web, I realized that DaaS (Data as a Service) firms focus on providing Twitter Viz&Ana services to businesses and individuals to enable data-driven decision-making. In other words, they wrap the Twitter data they mine in a user interface through which the client can interpret Twitter as an observable phenomenon. Clients exercise their own judgment as to whether a marketing campaign or event organization is successful, and make decisions based on these Viz&Ana.
    • To contribute to the research work at McNair, I would propose that we assemble tools and software in the spirit of a DaaS. In other words, Twitter Mining per se is not meaningful. Constructing a working system where researchers can observe the twitterverse, as if interpreting a primary source of data, is meaningful. For data scientists, running statistical analyses on outputs from this working system is meaningful.
  • Portability & Flexibility
    • This is the bit where we distinguish ourselves and dream bigger than a run-of-the-mill SaaS, whose work ends when the Viz&Ana is delivered to the hands of the clients.
    • Since the Viz&Ana is for research consumption, further research and analysis must be carried out on the graphs, maps and tables produced by the Viz&Ana. We would therefore do well to avoid blackbox scenarios where beautiful but inflexible graphs are produced but cannot be taken any further by the researchers. Open-source tools, a stronger backend and a good data management system are therefore important considerations when building our Viz&Ana system.
    • In other words, I want data structures that can move between software packages, not just a poster to hang on the walls.
  • "When measured over time..."
    • Since twitter represents the movement of trends, it is best interpreted as an organic body of knowledge that is contingent on the passage of time. Any Viz&Ana that we conduct on the twitterverse must be able to be viewed and extracted (and further processed) as a function of time.
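The "function of time" requirement above can be sketched as a per-day count plus a day-over-day difference. This is an illustration under stated assumptions, not a proposed design: the input is assumed to be a list of (ISO timestamp, hashtag) pairs that some upstream collector has already produced, and the sample records are made up.

```python
from collections import Counter
from datetime import datetime

def counts_by_day(records):
    """Map each calendar day to a Counter of hashtag occurrences.

    `records` is an iterable of (iso_timestamp, hashtag) pairs.
    """
    buckets = {}
    for stamp, tag in records:
        day = datetime.fromisoformat(stamp).date()
        buckets.setdefault(day, Counter())[tag] += 1
    return buckets

def daily_delta(buckets, tag):
    """Return [(day, change versus the previous observed day)] for one hashtag."""
    days = sorted(buckets)
    return [(cur, buckets[cur][tag] - buckets[prev][tag])
            for prev, cur in zip(days, days[1:])]

# Made-up sample records for illustration only.
records = [
    ("2016-07-27T09:00:00", "ycombinator"),
    ("2016-07-28T10:00:00", "ycombinator"),
    ("2016-07-28T15:27:00", "ycombinator"),
    ("2016-07-28T16:00:00", "ycombinator"),
    ("2016-07-29T08:00:00", "ycombinator"),
]
for day, change in daily_delta(counts_by_day(records), "ycombinator"):
    print(day, change)
```

Note that days with no records at all are skipped rather than treated as zero. Whether that is the right choice depends on how the collector samples the stream.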

Mining Tools

Blackboxes

Legacy Viz&Ana software that predates the web revolution, such as Pajek, tends to be a blackbox whose functionality was developed by a dedicated team of commissioned engineers who knew that their target audience was unlikely to know how to code. Much of the Viz&Ana software, as you will see below, falls into this category.

Modules and Scripts

There is a large community of developers and researchers who are actively involved in developing open-source, free-to-use modules and scripts. Most of the work done by them lies in one of the three aforementioned Twitter Mining approaches. I have not yet explored time-based or webhook-styled modules that we can harness, but am pretty sure that they exist.

The resources can all be built upon each other, with the help of intermediaries, to create the form of aggregate Viz&Ana that McNair needs. Having limited lived experience with different programming languages and joining modules, I cannot offer optimal advice on how exactly to build them together efficiently. However, they all possess the capability. To be inspected further.

Network Visualization

Tweet Analytics

Geo Visualization

Dream Case

Picture this: a query into, for instance, #ycombinator at 7/28/2016, 1527 hours CST will yield a geo-visualized world map at the bottom-most layer, indicating activities of tweets associated with #ycombinator. Above the world map, there will be neatly separated communities of nodes and edges, network-visualized to indicate the interest groups talking about this topic. There will also be lists of reports produced by tweet analytics. Each part of the Viz&Ana can then be converted into other data structures and processed by other analysis software.

Field Notes

Developmental

NodeXL

In a nutshell

    • Enclosed system that auto-pulls, auto-cleans and auto-graphs Twitter networks revolving around input SEARCH TERM (read: this is important).
    • MSExcel-based (thus unsure of its portability, i.e. can we port the graph and its data structure to other software and development environments for further processing/analysis?)
    • Highly mathematical, formal graph theory
    • Highly customizable
    • Vertices are @twitterhandles and edges are relationships between them (follower/following, mentions, replies, favorites, etc).
    • Operates on Twitter's Streaming API, requires user authentication
    • GUI; very user-friendly and accessible to even
    • Requires background in graph theory to understand mathematical concepts
    • Developed open-source by the Social Media Research Foundation, with help from academics from Cornell to Cambridge.

Features and Review

Automation

    • This being a clean-up process for the input data before analysis and display in the form of a graph
    • Group vertices by cluster (e.g. the Clauset-Newman-Moore algorithm to identify community structures) and calculate clustering coefficient
    • Count and merge duplicate edges (and therefore scale the resultant edge by width proportional to the number of edges merged)
    • Layout method - e.g. the Harel-Koren Fast Multiscale Layout algorithm

Centrality measures

    • Betweenness centrality - identification of corridor/ambassador nodes that are important links between adjacent network communities. In other words, identification of the most BROADLY CONNECTED nodes in the network. Think: few friends in high places, as opposed to an abundance of low-level friends
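    • Closeness centrality - how close a node is, on average, to every other node (the inverse of its mean shortest-path distance). Identification of nodes that can reach the rest of the network quickly
    • Eigenvector centrality - unclear
    • Clustering coefficient - how densely a node's neighbours are connected to one another. Identification of strong communities within a larger network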
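Two of the measures listed above, closeness centrality and the clustering coefficient, can be computed directly from an adjacency structure. The sketch below is for a small undirected, unweighted graph stored as a dict of neighbour sets. It is not how NodeXL computes them, and the closeness variant used (reachable nodes divided by the sum of shortest-path distances) is one common convention among several.

```python
from collections import deque

def bfs_distances(adj, start):
    """Shortest-path distances (in hops) from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def closeness(adj, node):
    """Number of reachable nodes divided by the sum of distances to them."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def clustering_coefficient(adj, node):
    """Fraction of pairs of `node`'s neighbours that are themselves connected."""
    neighbours = list(adj[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbours[j] in adj[neighbours[i]])
    return 2 * links / (k * (k - 1))

# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
for node in sorted(adj):
    print(node, round(closeness(adj, node), 3), round(clustering_coefficient(adj, node), 3))
```

In this toy graph node c has the highest closeness among the nodes with at least two neighbours, yet the lowest clustering coefficient of the three, a reminder that the two quantities measure different things.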

Overall graph metrics

    • In a nutshell: Highly customizable
    • Vertices and edge count
    • Unique edges
    • Edge width - can be a function of number of merged edges, etc
    • Node size/color - can be a function of node's degree, centrality measures, etc
    • Egonet - user can look at each node as the "center of the network universe"
      • Pagerank - useful google coefficient that measures how good one node's IN-FLOW is, i.e. the tendency to end up at subject node as agent travels around its neighborhood
      • Number of tweets ever created
      • Number of tweets favorited
      • Other common "user data"
      • User can view egonets in a matrix and apply "sort by" to easily identify the nodes with the highest in/out-degree, centrality, PageRank, etc.
      • Graph density - 2*|E|/(|V|*(|V|-1))
      • Connected Components calculation
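The density formula above, 2*|E|/(|V|*(|V|-1)), applies to an undirected graph without self-loops or duplicate edges. A minimal sketch of it, together with a connected-components calculation, follows. The graph is assumed to be stored as a dict of neighbour sets, and the toy data is made up.

```python
def graph_density(num_vertices, num_edges):
    """Density of an undirected simple graph: 2|E| / (|V| (|V| - 1))."""
    if num_vertices < 2:
        return 0.0
    return 2 * num_edges / (num_vertices * (num_vertices - 1))

def connected_components(adj):
    """Return the connected components of an undirected graph as a list of node sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components

# Toy graph: a path a-b-c and a separate pair d-e.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": {"e"}, "e": {"d"}}
num_edges = sum(len(neighbours) for neighbours in adj.values()) // 2
print(graph_density(len(adj), num_edges))
print(connected_components(adj))
```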

Inspiration, or the "Dream Case"

    • WHAT IF WE tap on NLP capabilities to monitor twitter handles that are known to be important, and have a constant feed of important rising new words, rising new mentions and rising new hashtags. Using this feed, we can populate and update graphs constantly, measuring 'delta' instead of using graph data per se, and thus develop a good grasp of rising organizations, events and startups in the twitterverse. We would know things before other people do. Value.
      • Our question will be: What is going on with startup XYZ?
      • Empirically, and in a micro way, I have observed that a new startup known as Aminohealth @aminohealth (enables end-users to shop around for doctors based on price range; seems very novel and in-demand) has been appearing very constantly on important feeds such as @techcrunch, @redpointvc and @accel. It has just received a 'huge launch' but is relatively unknown in the bigger twitter picture. There is also nothing conclusive about what this launch entailed, and what kind of funding it received. Using the NodeXL tool, we can conceivably find out everyone that's involved in @aminohealth's recent activities, and systematically mine knowledge from this network.
      • @aminohealth itself possesses only around 1,000 followers, despite having 700+ tweets. Delta is far more important than what-is for rising startups as such.
      • Empirically, the twitterverse is populated by important organizations as well as, we often forget, their staff. @jflomenb is constantly mentioned by @redpointvc and @accel, and exposes interesting information about the entrepreneur scene, as shown. Again, delta is crucial.
    • WHAT IF WE compare social networks against themselves over time?
      • If we generate useful network graphs and data OVER TIME that revolves around a single entity e.g. @redpointvc, we would be able to do a few pretty amazing statistical analyses:
        • The mean number of mentions before a startup gets signed to a VC
        • What are the quantitative tweet indicators that a startup is succeeding/failing?
        • All the startups a VC has signed since the VC obtained a twitter handle
        • The average pace at which a VC signs startups
        • What are the qualitatively trendy topics that are mentioned in the history of a VC? Does this influence their activity, if at all?
        • Any regression for the above, and more
    • WHAT IF WE track ongoing events such as #kpceoworkshop
      • It'll be easy to find out who are the people that are attending the workshop, and add them to our watchlist of important people
      • Also, how important or impactful are these events? We can track their mentioners and likers and followers to identify and think about follow-up events that occur after the events themselves conclude.

Limitations

    • An input query is 'necessary'. I don't think the user can simply ask for a graph of all the followers of @xxx, for instance.
    • It's a black box - this tool is designed for end-users that want to study contingent trends and discrete events, instead of a comprehensive and stable picture of a certain "scene" (i.e. the entrepreneur scene, in our case).
      • We can, of course, run the tool continuously for all trends that we identify. But would we be able to join them all up in an aggregate fashion?
    • Unsure of the usefulness of output
      • Sure, it will be nice to generate graphs and knowledge about upcoming events and organizations, but will we be able to harness this information and use it to do other stuff?
      • In other words, it's unclear how portable our output data is

Thoughts

    • In my recent days of interacting with the twitterverse, it has become clear that Twitter is spectacular because of its malleability, flexibility and decentralized nature. All forms of social organization on Twitter are explicitly time-contingent and user-contingent. This is why it is such an important hotbed for sociological research: it provides wonderful material for the study of social dynamics and social organization.
    • In this vein, what we think of as the "Entrepreneurship Twitterverse" can be, more clearly, thought of as a time-contingent and very specific community shaped by its own trends, influencers, and cultural values, all of which are in turn shaped by the very specific people that are interested and involved in the same ideas/things. In our case, these are investments, foundings, IPOs, acquisitions, etc.
    • In light of this, does it make more sense for us to study deltas instead of things as-they-are?

Demo

    • Test case by www.pewinternet.org
      • User attempted to graph the community activity regarding the topic "pew internet"
      • User used search string "pew internet" over a fixed period of 58 days
      • Output graph nodes are created for each @shortname on the broadcasting or receiving end of tweets that include "pew internet". Output graph edges are created for each mention and reply that appeared over the course of the time bracket.
        • Graph edge colors and widths are proportional to the number of mentions/replies that occurred between two nodes (users).
        • The color and transparency of the nodes are related to follower values, i.e. how many followers each node has.
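The graph construction described in this demo (a node per handle, an edge per mention or reply, restricted to tweets that match a search string within a time bracket) can be sketched as follows. This is not the code behind the Pew example. The tweet dict keys `user`, `text` and `created_at` are hypothetical, replies are approximated as mentions found in the text, and the sample tweets are made up.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

MENTION = re.compile(r"@(\w+)")

def mention_edges(tweets, query, start, end):
    """Count (author, mentioned handle) pairs among tweets that contain
    `query` and were created in the half-open interval [start, end)."""
    edges = Counter()
    for tweet in tweets:
        if not (start <= tweet["created_at"] < end):
            continue
        if query.lower() not in tweet["text"].lower():
            continue
        author = tweet["user"].lower()
        for name in MENTION.findall(tweet["text"]):
            if name.lower() != author:  # ignore self-mentions
                edges[(author, name.lower())] += 1
    return edges

# Made-up sample tweets for illustration only.
tweets = [
    {"user": "alice", "text": "New Pew Internet report, worth a read @bob",
     "created_at": datetime(2016, 7, 1)},
    {"user": "bob", "text": "@alice thanks, the pew internet numbers are striking",
     "created_at": datetime(2016, 7, 2)},
    {"user": "alice", "text": "@bob more pew internet charts here",
     "created_at": datetime(2016, 7, 3)},
    {"user": "carol", "text": "@alice lunch?",
     "created_at": datetime(2016, 7, 3)},
    {"user": "dave", "text": "pew internet recap @alice",
     "created_at": datetime(2016, 10, 1)},
]
start = datetime(2016, 7, 1)
print(mention_edges(tweets, "pew internet", start, start + timedelta(days=58)))
```

The resulting counts are exactly the quantity that edge width would be made proportional to.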

R Packages Galore

Herein lies a great introduction to R for programmers already familiar with OOP

igraph
network
statnet
tnet
rsiena
sna

In a nutshell

  • Many R packages include social media analysis functionality
  • The advantage of using R, instead of a blackbox nice-UI, is R's portability and flexibility. Data can move easily between packages and into other software such as MSExcel or SPSS (Statistical Package for the Social Sciences).
  • According to the R community, it is widely held that despite their difference in specific functionalities, one can achieve all basic operations and visualization needs with any one of these R packages
  • For all R-based analysis, we have to use our in-house Twitter Webcrawler (Tool) to grab raw data and convert them into appropriate structures for R consumption (unsure)
  • Typically, they are all OOP with graphs, nodes and edges as objects

Features and Review

igraph

  • Powerful, feature-rich library
  • igraph on GitHub: https://github.com/igraph/igraph
  • igraph on its own domain
  • Also available for Py and C
  • Known for ease of calculating basic graph metrics such as:
    • g.edge_betweenness()
    • g.degree()
    • g.pagerank()
    • g.betweenness()
    • g.select() to enable easy node/edge selection
  • Known for its community detection algorithms (e.g. Newman-Girvan)
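To make one of the metrics named above concrete, here is a plain power-iteration sketch of PageRank on a small directed "who follows whom" graph. It is written with the standard library only and is not igraph's implementation. The damping factor of 0.85 is the conventional default, the fixed iteration count is an arbitrary choice, and the toy graph is made up.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    `out_links` maps every node to the list of nodes it points to;
    every node must appear as a key, even if its list is empty.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node in nodes:
            targets = out_links[node]
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank

# Toy graph: a, b and d all point to c; c points back to a.
follows = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(follows)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Node c, which receives the most in-flow, ends up with the highest score, and a benefits from being the only node c points to.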

statnet

  • Implements recent advances in statistical modelling of networks - unsure if we need such high levels of sophistication in graph theory implementation.
  • Focuses on statistical modelling of network data
  • Includes the libraries network and sna (which stands for, naturally, Social Network Analysis)
    • 3-D graph plot
    • Subgraph census routines, including component information, paths/cycles/cliques, removing isolates
    • Positional Analysis
  • Unlike igraph, statnet is developed by a team of statisticians from the University of Washington. It is thus heavy on the statistical analysis side.
    • ERGMs
      • Exponential-family Random Graph Models
      • Advanced technique associated with analyzing data, especially in social networks
      • The statistical model operates on the premise that all alternative networks are to be considered as much as the observed one. Alternative networks are, for example, generated through the Degree Preserving Randomization method.
    • Includes tools for model estimation, model evaluation, model-based network simulation, and network visualization.
      • Broad functionalities powered by central MCMC (Markov Chain Monte Carlo) algorithm
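The Degree Preserving Randomization method mentioned above is usually implemented as a series of double edge swaps: pick two edges (a, b) and (c, d), and rewire them to (a, d) and (c, b) unless that would create a self-loop or a duplicate edge. The sketch below shows that idea for an undirected simple graph. It is not statnet's code and says nothing about how statnet samples networks internally. The function name and the retry cap are illustrative choices.

```python
import random
from collections import Counter

def degree_preserving_randomization(edges, swaps, seed=None):
    """Rewire an undirected simple graph with double edge swaps.

    Every node keeps its original degree. Needs at least two edges.
    Gives up after 100 * swaps attempts if valid swaps are hard to find.
    """
    rng = random.Random(seed)
    edges = [tuple(edge) for edge in edges]
    present = {frozenset(edge) for edge in edges}
    done = attempts = 0
    while done < swaps and attempts < 100 * swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        # Proposed swap: (a, b), (c, d) -> (a, d), (c, b)
        if a == d or c == b:
            continue  # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def degrees(edges):
    """Degree of every node in an undirected edge list."""
    count = Counter()
    for a, b in edges:
        count[a] += 1
        count[b] += 1
    return count

# Toy graph: a six-node ring with one chord.
observed = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4)]
alternative = degree_preserving_randomization(observed, swaps=10, seed=42)
print(alternative)
print(degrees(alternative) == degrees(observed))
```

Comparing a statistic of the observed network against its distribution over many such alternatives is the basic null-model idea the bullet describes.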

Others

  • tnet
    • Two-mode networks (i.e. rows and columns of a two-mode matrix are different entities; e.g. persons vs. organizations)
  • RSiena
    • Actor-oriented model of network dynamics
      • An extremely theoretical and, at present, largely academic discipline.
      • Addresses the very realistic question of networks as an evolving system driven by actors (nodes of twitter users, in our case).
      • Stochastic; statistical modelling, Markov Chain
    • DREAM CASE:
      • Could we use this modelling technique to predict future twitter trends of the entrepreneurship interest group?
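The two-mode idea noted under tnet above (persons on one side, organizations on the other) is often reduced to an ordinary one-mode network by projection: two persons are linked when they share an organization, weighted by how many they share. The sketch below shows that simple count-based projection. It does not use tnet's API, which offers more refined weighting options, and the names in the sample are hypothetical.

```python
from collections import Counter
from itertools import combinations

def project_two_mode(memberships):
    """Project (person, organization) pairs onto a person-person network.

    Returns a Counter mapping each alphabetically ordered pair of persons
    to the number of organizations they have in common.
    """
    members = {}
    for person, organization in memberships:
        members.setdefault(organization, set()).add(person)
    shared = Counter()
    for people in members.values():
        for pair in combinations(sorted(people), 2):
            shared[pair] += 1
    return shared

# Hypothetical memberships for illustration only.
memberships = [
    ("ann", "fund_x"), ("bob", "fund_x"),
    ("ann", "fund_y"), ("bob", "fund_y"), ("cat", "fund_y"),
]
print(project_two_mode(memberships))
```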

Famous Classic Modelling Tools

PAJEK and UCINET are two of the most widely-used modelling toolkits on the internet. They are both blackboxes with a GUI, but also portable in the sense that their output can be easily converted for further analysis in MSExcel, SPSS and R.

PAJEK

  • Open-source, free and has been the recipient of numerous software awards. Numerous books have been written about this tool.
  • Most obvious advantage being:
    1. Scalability - handles a billion vertices (more than we will ever need)
    2. Speed - the recent release of PAJEK XXL reduced processing time by a factor of 2 to 3
    3. Algorithms - handles classic algorithmic operations such as the shortest-path problem
    4. Decomposition - (recursive) decomposition of a large network into several smaller networks that can be treated further using more sophisticated methods
  • Unlike other object-oriented tools, Pajek has some unique data types:

network (graph); partition (nominal or ordinal properties of vertices); vector (numerical properties of vertices); cluster (subset of vertices); permutation (reordering of vertices, ordinal properties); and hierarchy (general tree structure on vertices)

  • Powerful graph theory operations
  • Unique network models
    • Temporal networks - networks that change over time
    • Multirelational networks - different set of relations imposed on the same set of vertices
    • Signed networks - networks with positive and negative lines
  • Powerful visualization support

Kamada-Kawai optimization, Fruchterman Reingold optimization, VOS mapping, Pivot MDS, drawing in layers, FishEye transformation. Layouts obtained by Pajek can be exported to different 2D or 3D output formats (e.g., SVG, EPS, X3D, VOSViewer, Mage,…). Special viewers and editors for these formats are available (e.g., inkscape, GSView, instantreality, KiNG,…)
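Portability also runs in the other direction: a network built elsewhere can be handed to Pajek as a plain-text .net file. The sketch below writes the basic layout of that format (a `*Vertices` section with 1-based numbered labels, then an `*Edges` section with one "source target weight" line per undirected edge). It covers only this minimal subset of the format, the function name is illustrative, and the edge weights in the sample are made up.

```python
def to_pajek_net(weighted_edges):
    """Return Pajek .net text for an undirected weighted edge dict {(u, v): weight}."""
    labels = sorted({node for edge in weighted_edges for node in edge})
    # Pajek numbers vertices from 1.
    index = {label: number for number, label in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[label]} "{label}"' for label in labels]
    lines.append("*Edges")  # "*Arcs" would be used for directed lines instead
    lines += [f"{index[u]} {index[v]} {weight}"
              for (u, v), weight in sorted(weighted_edges.items())]
    return "\n".join(lines) + "\n"

# Made-up mention counts between handles that appear in these notes.
edges = {("accel", "redpointvc"): 3, ("accel", "techcrunch"): 1}
print(to_pajek_net(edges), end="")
```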

Geo-Visualization

In layman's terms, this is known as mapping. While there is a large collection of geo-visualization tools available on GitHub, I have listed here several collections that stand out in terms of:

  • Flexibility
  • Portability
  • Aesthetic

I imagine that the geo-visualization projects we do at McNair should offer beautiful, accessible graphic outputs as well as a launchpad for integration with other analysis tools.

The Mapbox Suite

In a nutshell

  • Most aesthetically pleasing geo-visualization output I have seen, thus far
  • Open-source technology from end-to-end
  • Researcher-friendly - i.e. geo-viz built on Mapbox creates an information-rich and nuanced UI for researchers to play around with and look up the data that they seek. CEO Eric Gundersen puts it in some beautiful words: "(mapbox visualizations) let you explore the stories of space, language, and access to technology."

How it works

We need:

Demo

The following map distinguishes locals from tourists who tweet in Greater NYC

"To make this map, Tweets are grouped by user and sorted into locals—who post in one city for one consecutive month—and tourists—whose tweets are center in another city. Relatively inactive users simply don’t appear on the map, since we can’t confidently determine their group."

Limitations

  • aforementioned github tools are written in C
  • data cleaning and processing for Mapbox was done primarily by data firm GNIP, a black box.
  • unsure of the data structures that are used in this suite