The Report

McNair Project
Twitterverse Exploration
Project Information
Project Title	Twitterverse Exploration (Tool)
Owner	Gunny Liu
Start Date	Summer 2016
Deadline
Primary Billing	AccNBER01
Notes
Has project status

After more than three days of reconnaissance, here is a comprehensive update on the Twitterverse, Twitter mining, and the dream case that we seek from Twitter here at McNair.

Beliefs Update

Most importantly, Twitter Mining for McNair should be an aggregate of three approaches.

Network Visualization
- Fundamentally, one ought to think of Twitter as an interest group, not a bona fide social network. Consider this: Twitter represents the degree of interest in, for instance, #ycombinator, not the stable business and personal connections made to and from @ycombinator. It also houses everyone from the very important @barackobama to the fictional and frivolous @homersimpson. At the outset, Twitter represents trends rather than material fact. The following-follower relationship is one-directional and voyeuristic, representing what people care and think about instead of who they really are and what they really do. Twitter activity happens at the speed of thought (140 chars) and represents our rapidly changing minds and perceptions.

At this level, let's consider the classical aspect of Twitter mining: Network Visualization. This is sociological and concerned with the self-organization of interest-based communities. It primarily provides us with a sense of social roles in an interest group: broadcaster vs receiver, influencer vs influenced. We can also learn about the quantity of interest in a social group and, when measured over time, the delta/changes in this quantity within the group. We gain knowledge about trends that rise and fall, people that move in and out of the interest group, and the community structures of a given interest group.

Tweet Analytics
- Digging a little deeper beyond this superficial exchange, we come to a point where we need to think qualitatively about tweets. What people care about reflects some material facts about their material selves. Tweets containing hashtags such as #kpceoworkshop, for instance, tell us which people are attending the event physically and which people are passing commentary on it. When a startup has an IPO/Acquisition, it will attract a tremendous volume of mentions. When the presidential candidates talk about their technology policies, the entrepreneurship twitterverse responds.

This is the next level of Twitter mining, often associated with Natural Language Processing techniques: Tweet Analytics. Combined with the Network Visualization, we can learn about events that are unfolding in different parts of the entrepreneurship world, as well as new organizations and topics that appear in the conversation. These new organizations and topics will, in turn, generate the beginnings of new interest networks. When measured over time, we can get a handle on the up-and-coming stars in the field, and emerging trends that are of note.
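As a starting point for Tweet Analytics, the sketch below counts hashtags and mentions across a batch of tweet texts. It is only an illustration: the regular expressions are simplified assumptions (Twitter's own tokenization rules are more involved), the function name `count_entities` is hypothetical, and the sample tweets are made up.

```python
import re
from collections import Counter

# Simplified patterns (an assumption): real Twitter entity rules are stricter.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def count_entities(tweets):
    """Return (hashtag_counts, mention_counts) for an iterable of tweet texts."""
    hashtags, mentions = Counter(), Counter()
    for text in tweets:
        # Lower-case so that #KPCEOWorkshop and #kpceoworkshop count together.
        hashtags.update(tag.lower() for tag in HASHTAG_RE.findall(text))
        mentions.update(name.lower() for name in MENTION_RE.findall(text))
    return hashtags, mentions


# Made-up sample tweets, for illustration only.
sample = [
    "Heading to #KPCEOWorkshop with @redpointvc",
    "Great panel at #kpceoworkshop today",
    "@accel and @redpointvc on funding trends #startups",
]
hashtags, mentions = count_entities(sample)
print(hashtags.most_common(2))
print(mentions.most_common(2))
```

Counts like these are the raw material for the "new organizations and topics" signal; proper NLP (entity recognition, topic models) would sit on top of this step.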

Geo Visualization
- On a physical level, tweets contain geo-information such as the @user's home location and the tweet-from location. Through this, we stand to learn about people's interests stratified by location. When combined with the former two forms of Twitter mining, it can enhance what we know about physically-bound social dynamics and physically-bound shifts in interest and opinions.

Geo Visualization is the process of mapping tweets to a real map of the Earth. Applying tweet analytics and network visualization to it, we stand to gain an understanding of the geographical dimension of entrepreneurship activities in terms of people, organizations and events in particular places, for instance Palo Alto, CA or Austin, TX. When measured over time, we can observe the crests and troughs of activity in these places. This would be especially promising for the HUBS research project.
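Before anything is drawn on a map, tweets have to be grouped by place. One simple way to do that, sketched below, is to snap each (latitude, longitude) pair to a grid cell and count tweets per cell. The one-degree cell size, the function names and the approximate sample coordinates are all assumptions made for illustration.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=1.0):
    """Snap a coordinate to the south-west corner of its grid cell."""
    return (
        math.floor(lat / cell_deg) * cell_deg,
        math.floor(lon / cell_deg) * cell_deg,
    )


def tweets_per_cell(points, cell_deg=1.0):
    """Count how many (lat, lon) points fall into each grid cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)


# Approximate, illustrative coordinates only.
points = [
    (37.44, -122.14),  # roughly Palo Alto, CA
    (37.45, -122.16),  # roughly Palo Alto, CA
    (30.27, -97.74),   # roughly Austin, TX
]
counts = tweets_per_cell(points)
print(counts)
```

Repeating the count for successive time windows would give the crests and troughs of activity per place.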

For simplicity, I will refer to the above aggregate as Viz&Ana.

Key Ideas

Viz&Ana: DaaS
- While exploring the web, I realized that DaaS firms focus on providing Twitter Viz&Ana services to businesses and individuals to enable data-driven decision-making. In other words, the Twitter data they mine is presented through a user interface that lets the client interpret Twitter as an observable phenomenon. Clients exercise their own judgment as to whether a marketing campaign or event organization is successful, and make decisions based on these Viz&Ana.
- To contribute to the research work at McNair, I would propose that we assemble tools and software in the spirit of a DaaS. In other words, Twitter Mining per se is not meaningful. Constructing a working system where researchers can observe the twitterverse, as if interpreting a primary source of data, is meaningful. For data scientists, running statistical analyses on outputs from this working system is meaningful.

Portability & Flexibility
- This is the bit where we distinguish ourselves and dream bigger than a run-of-the-mill SaaS, whose work ends when the Viz&Ana is delivered into the hands of the clients.
- Since the Viz&Ana is for research consumption, further research and analysis must be carried out on the graphs, maps and tables produced by the Viz&Ana. We would therefore do well to avoid blackbox scenarios where beautiful but inflexible graphs are produced but cannot be taken any further by the researchers. Open-source tools, a stronger backend and a good data management system are therefore important considerations when building our Viz&Ana system.
- In other words, I want data structures that can move between software packages, not just a poster to hang on the walls.
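To make "data structures that can move between software packages" concrete, here is a minimal sketch that writes the same small graph out as JSON and as a CSV edge list, then reads the CSV back. The node-link layout, the column names and the function names are assumptions about what downstream tools might accept, not a fixed standard for this project.

```python
import csv
import io
import json


def to_node_link_json(nodes, edges):
    """Serialize a graph to a JSON string with 'nodes' and 'links' lists."""
    payload = {
        "nodes": [{"id": n} for n in sorted(nodes)],
        "links": [{"source": s, "target": t, "weight": w} for s, t, w in edges],
    }
    return json.dumps(payload, indent=2)


def to_edge_csv(edges):
    """Serialize (source, target, weight) edges to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)
    return buf.getvalue()


def from_edge_csv(text):
    """Read the CSV produced by to_edge_csv back into a list of edge tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(r["source"], r["target"], int(r["weight"])) for r in reader]


# Hypothetical handles and weights, for illustration only.
edges = [("user_a", "user_b", 3), ("user_b", "user_c", 1)]
nodes = {"user_a", "user_b", "user_c"}
print(to_node_link_json(nodes, edges))
print(to_edge_csv(edges))
```

Plain text formats like these can be opened in a spreadsheet, loaded into R or Python, or handed to a graph tool, which is the property that matters here.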

"When measured over time..."
- Since Twitter represents the movement of trends, it is best interpreted as an organic body of knowledge that is contingent on the passage of time. Any Viz&Ana that we conduct on the twitterverse must be viewable and extractable (and further processable) as a function of time.
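One way to treat activity "as a function of time" is to bucket tweet timestamps and look at the change between buckets. The sketch below uses hourly buckets; the bucket size, the function names and the sample timestamps are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime


def counts_by_hour(timestamps):
    """Count events per hour; keys are datetimes truncated to the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


def deltas(series):
    """Change in count between consecutive non-empty buckets, in time order."""
    keys = sorted(series)
    return [(b, series[b] - series[a]) for a, b in zip(keys, keys[1:])]


# Made-up timestamps, for illustration only.
stamps = [
    datetime(2016, 7, 28, 14, 5),
    datetime(2016, 7, 28, 15, 27),
    datetime(2016, 7, 28, 15, 40),
    datetime(2016, 7, 28, 15, 59),
]
series = counts_by_hour(stamps)
print(deltas(series))
```

Note that empty hours are skipped rather than reported as zero; a fuller version would fill in the missing buckets before taking differences.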

Mining Tools

Blackboxes

Legacy Viz&Ana software that predates the web revolution, such as Pajek, tends to be a blackbox whose functionality was developed by a dedicated team of commissioned engineers who knew that their target audience was not likely to know code. Much Viz&Ana software, as you will see below, falls into this category.

Modules and Scripts

There is a large community of developers and researchers who are actively involved in developing open-source, free-to-use modules and scripts. Most of their work lies in one of the three aforementioned Twitter Mining approaches. I have not yet explored time-based or webhook-styled modules that we can harness, but I am fairly sure that they exist.

The resources can all be built upon each other, with the help of intermediaries, to create the form of aggregate Viz&Ana that McNair needs. Having limited experience with different programming languages and with joining modules, I cannot offer optimal advice on exactly how to build them together efficiently. However, they all possess the capability. To be inspected further.

Network Visualization

Collection of R Packages - see Field Notes for detail
Intro to NodeXL - see Field Notes for detail
NodeXL Canon - see Field Notes for detail
Academic Scholarship on NodeXL - see Field Notes for detail

Tweet Analytics

Geo Visualization

ericfischer's Datamap in C - see Field Notes for detail
Geo Visualization on Mapbox - see Field Notes for detail

Dream Case

Picture this: a query into, for instance, #ycombinator at 15:27 CST on 7/28/2016 will yield a geo-visualized world map at the bottom-most layer, indicating the activity of tweets associated with #ycombinator. Above the world map, there will be neatly separated communities of nodes and edges, network-visualized to indicate the interest groups talking about this topic. There will also be lists of reports produced by tweet analytics. Each part of the Viz&Ana can then be converted into other data structures and processed by other analysis software.

Field Notes

Developmental

NodeXL

In a nutshell

- Enclosed system that auto-pulls, auto-cleans and auto-graphs Twitter networks revolving around input SEARCH TERM (read: this is important).
- MSExcel-based (thus unsure of its portability, i.e. can we port the graph and its data structure to other software and development environments for further processing/analysis?)
- Highly mathematical, formal graph theory
- Highly customizable
- Vertices are @twitterhandles and edges are the relationships between them (follower/following, mentions, replies, favorites, etc.).
- Operates on Twitter's Streaming API, requires user authentication
- GUI; very user-friendly and accessible to even
- Requires background in graph theory to understand mathematical concepts
- Developed open-source by the Social Media Research Foundation, with help from academics from Cornell to Cambridge.

Features and Review

Automation

- This being a clean-up process for the input data before analysis and display in the form of a graph
- Group vertices by cluster (e.g. the Clauset-Newman-Moore algorithm to identify community structures) and calculate clustering coefficient
- Count and merge duplicate edges (and therefore scale the resultant edge by width proportional to the number of edges merged)
- Layout method - e.g. the Harel-Koren Fast Multiscale Layout algorithm
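The "count and merge duplicate edges" step above can be sketched in a few lines. This is not NodeXL's implementation, only an illustration of the idea: repeated (source, target) pairs are collapsed into one edge with a count, and the drawn width is made proportional to that count. The function names and the `base_width` parameter are hypothetical.

```python
from collections import Counter


def merge_duplicate_edges(edges):
    """Collapse repeated (source, target) pairs into one edge with a count."""
    return Counter(edges)


def edge_width(count, base_width=0.5):
    """Width proportional to the number of merged edges (arbitrary scale)."""
    return base_width * count


# Directed toy edges: ("a", "b") and ("b", "a") are kept as distinct edges.
raw = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c"), ("a", "b")]
merged = merge_duplicate_edges(raw)
for (source, target), count in sorted(merged.items()):
    print(source, target, count, edge_width(count))
```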

Centrality measures

- Betweenness centrality - identification of corridor/ambassador nodes that are important links between adjacent network communities. In other words, identification of the most BROADLY CONNECTED nodes in the network. Think: few friends in high places, as opposed to an abundance of low-level friends
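- Closeness centrality - how near a node is to every other node, i.e. the inverse of its average shortest-path distance to the rest of the network. Identifies nodes that can reach (or be reached by) everyone else quickly
- Eigenvector centrality - a node scores highly when it is connected to other high-scoring nodes; PageRank is a variant of this idea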
- Clustering coefficient - as above
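To make two of these measures concrete, the sketch below computes betweenness and closeness centrality on a toy undirected, unweighted graph using only the standard library. Betweenness follows Brandes' accumulation scheme and is left unnormalized; tools such as NodeXL may normalize or define these measures slightly differently, so treat the numbers as illustrative. The adjacency-dict representation is an assumption.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}     # predecessors on shortest paths from s
        sigma = dict.fromkeys(adj, 0)    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                     # back-propagate path dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


def closeness(adj, node):
    """Reachable nodes divided by the sum of shortest-path distances to them."""
    dist, queue = {node: 0}, deque([node])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Toy graph: triangle a-b-c, plus d attached only to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
bc = betweenness(adj)
print(bc)
print({v: closeness(adj, v) for v in adj})
```

In the toy graph, node c is the only route to d, so it is the sole node with non-zero betweenness: the "corridor" role described above.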

Overall graph metrics

- In a nutshell: Highly customizable
- Vertices and edge count
- Unique edges
- Edge width - can be a function of number of merged edges, etc
- Node size/color - can be a function of node's degree, centrality measures, etc
- Egonet - user can look at each node as the "center of the network universe"
  - PageRank - Google's ranking score, which measures how good a node's IN-FLOW is, i.e. the tendency of an agent travelling around the neighborhood to end up at the subject node
  - Number of tweets ever created
  - Number of tweets favorited
  - Other common "user data"
  - User can view egonets in a matrix and apply "sort by" to easily identify the nodes with the highest values (e.g. in/out-degree, centrality, PageRank, etc.)
  - Graph density - 2*|E|/(|V|*(|V|-1))
  - Connected Components calculation
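The last two metrics in the list are easy to reproduce outside the tool. The sketch below evaluates the density formula 2*|E|/(|V|*(|V|-1)), which applies to undirected graphs without self-loops, and finds connected components with a depth-first traversal. The adjacency-dict representation and the toy graph are assumptions.

```python
def density(num_vertices, num_edges):
    """Undirected simple-graph density: 2|E| / (|V| * (|V| - 1))."""
    if num_vertices < 2:
        return 0.0
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


def connected_components(adj):
    """Return a list of node sets, one per connected component."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy graph: triangle a-b-c and a separate edge d-e (5 vertices, 4 edges).
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e"}, "e": {"d"},
}
num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
print(density(len(adj), num_edges))
print(connected_components(adj))
```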

Inspiration, or the "Dream Case"

- WHAT IF WE tap into NLP capabilities to monitor Twitter handles that are known to be important, and keep a constant feed of rising new words, rising new mentions and rising new hashtags? Using this feed, we can populate and update graphs constantly, measuring 'delta' instead of using graph data per se, and thus develop a good grasp of rising organizations, events and startups in the twitterverse. We would know things before other people do. Value.
  - Our question will be: What is going on with startup XYZ?
  - Empirically, and on a small scale, I have observed that a new startup known as Aminohealth @aminohealth (enables end-users to shop around for doctors based on price range; seems very novel and in-demand) has been appearing frequently on important feeds such as @techcrunch, @redpointvc and @accel. It has just had a 'huge launch' but is relatively unknown in the bigger Twitter picture. There is also nothing conclusive about what this launch entailed, or what kind of funding it received. Using the NodeXL tool, we can conceivably find out everyone who is involved in @aminohealth's recent activities, and systematically mine knowledge from this network.
  - @aminohealth itself has only around 1,000 followers, despite having 700+ tweets. For rising startups such as this, delta is far more important than what-is.

  - Empirically, the twitterverse is populated by important organizations as well as, we often forget, their staff. @jflomenb is constantly mentioned by @redpointvc and @accel, and posts interesting exposé-style information about the entrepreneur scene. Again, delta is crucial.

- WHAT IF WE compare social networks against themselves over time?
  - If we generate useful network graphs and data OVER TIME that revolves around a single entity e.g. @redpointvc, we would be able to do a few pretty amazing statistical analyses:
    - The mean number of mentions before a startup gets signed to a VC
    - What are the quantitative tweet indicators that a startup is succeeding/failing?
    - All the startups a VC has signed since the VC obtained a twitter handle
    - The average pace at which a VC signs startups
    - What are the qualitatively trendy topics that are mentioned in the history of a VC? Does this influence their activity, if at all?
    - Any regression for the above, and more
- WHAT IF WE track ongoing events such as #kpceoworkshop?
  - It'll be easy to find out who is attending the workshop, and add them to our watchlist of important people
  - Also, how important or impactful are these events? We can track their mentioners and likers and followers to identify and think about follow-up events that occur after the events themselves conclude.
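The "delta instead of what-is" idea running through these scenarios can be sketched as a comparison between two time windows: rank each term by how much its count grew, rather than by its absolute count. The handles, hashtags and counts below are invented for illustration, and `rising_terms` is a hypothetical helper name.

```python
from collections import Counter


def rising_terms(previous, current, top_n=3):
    """Rank terms by their increase in count from one window to the next."""
    changes = {
        term: current[term] - previous.get(term, 0) for term in current
    }
    # Largest increase first; ties are broken alphabetically.
    ranked = sorted(changes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, change) for term, change in ranked if change > 0][:top_n]


# Invented counts for two consecutive weeks, for illustration only.
last_week = Counter({"@bigoutlet": 40, "@startupxyz": 1, "#ipo": 5})
this_week = Counter(
    {"@bigoutlet": 42, "@startupxyz": 12, "#ipo": 4, "#launch": 6}
)
print(rising_terms(last_week, this_week))
```

Here the large, established account barely moves, while the small one tops the list, which is the signal the watchlist is meant to surface. A relative measure (growth rate) may be a better ranking than the absolute difference used here.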

Limitations

- An input query is 'necessary'. I don't think the user can simply ask for a graph of all the followers of @xxx, for instance.
- It's a black box - this tool is designed for end-users that want to study contingent trends and discrete events, instead of a comprehensive and stable picture of a certain "scene" (i.e. the entrepreneur scene, in our case).
  - We can, of course, run the tool continuously for all trends that we identify. But would we be able to join them all up in an aggregate fashion?
- Unsure of the usefulness of output
  - Sure, it will be nice to generate graphs and knowledge about upcoming events and organizations, but will we be able to harness this information and use it to do other things?
  - In other words, it's unclear how portable our output data is

Thoughts

- In my recent days of interacting with the twitterverse, I have come to see that Twitter is spectacular because of its malleability, flexibility and decentralized nature. All forms of social organization on Twitter are explicitly time-contingent and user-contingent. This is why it is such an important hotbed for sociological research: it provides wonderful material for the study of social dynamics and social organization
- In this vein, what we think of as the "Entrepreneurship Twitterverse" can be thought of, more clearly, as a time-contingent and very specific community shaped by its own trends, influencers, and cultural values, all of which are in turn shaped by the very specific people that are interested and involved in the same ideas/things. In our case, these are investments, foundings, IPOs, acquisitions, etc.
- In light of this, does it make more sense for us to study deltas instead of things as-they-are?

Demo

- Test case by www.pewinternet.org
  - User attempted to graph the community activity regarding the topic "pew internet"
  - User used search string "pew internet" over a fixed period of 58 days
  - Output graph nodes are created for each @shortname on the broadcasting or receiving end of tweets that include "pew internet". Output graph edges are created for each mention and reply that appeared over the course of the time bracket.
    - Graph edge colors and widths are proportional to the number of mentions/replies that occurred between two nodes (users).
    - The color and transparency of the nodes are related to follower values, i.e. how many followers each node has.
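The graph construction described in this demo can be approximated as follows: keep tweets that contain the search string and fall inside the time bracket, then create one edge per mention, counting repeats. This is a sketch and not how NodeXL or the Pew test case was implemented. The tweet dictionaries use hypothetical field names (`author`, `text`, `created_at`) rather than the Twitter API schema, replies are treated simply as mentions, and the sample tweets are made up.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

MENTION_RE = re.compile(r"@(\w+)")  # simplified handle pattern (assumption)


def mention_edges(tweets, search, start, end):
    """Count (author, mentioned_user) edges for matching tweets in [start, end)."""
    edges = Counter()
    needle = search.lower()
    for tweet in tweets:
        if not (start <= tweet["created_at"] < end):
            continue
        if needle not in tweet["text"].lower():
            continue
        author = tweet["author"].lower()
        for target in MENTION_RE.findall(tweet["text"]):
            if target.lower() != author:  # ignore self-mentions
                edges[(author, target.lower())] += 1
    return edges


# Made-up tweets with hypothetical field names, for illustration only.
tweets = [
    {"author": "alice", "text": "New Pew Internet report, thanks @bob",
     "created_at": datetime(2016, 6, 1)},
    {"author": "alice", "text": "@bob did you read the pew internet survey?",
     "created_at": datetime(2016, 6, 10)},
    {"author": "carol", "text": "Lunch with @bob",
     "created_at": datetime(2016, 6, 11)},
    {"author": "dave", "text": "pew internet data via @alice",
     "created_at": datetime(2016, 9, 1)},
]
start = datetime(2016, 6, 1)
edges = mention_edges(tweets, "pew internet", start, start + timedelta(days=58))
print(edges)
```

The resulting edge counts are what would drive edge color and width in the drawn graph.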