Twitterverse Exploration
 
=Exploration Notes=

==NodeXL==

===In a nutshell===

**'''Enclosed system that auto-pulls, auto-cleans and auto-graphs Twitter networks revolving around an input SEARCH TERM (read: this is important).'''

**MS Excel-based (thus unsure of its portability, i.e. can we port the graph and its data structure to other software and development environments for further processing/analysis?)

**Highly mathematical, formal graph theory

**Highly customizable

**Vertices are @twitterhandles and edges are relationships (follower/following, mentions, replies, favorites, etc.)

**Operates on Twitter's Streaming API; requires user authentication
 
**GUI; very user-friendly and accessible

**Requires background in graph theory to understand the mathematical concepts

**Developed open-source by the [http://nodexl.codeplex.com/ Social Media Research Foundation], with help from academics from Cornell to Cambridge.

[[File:Capture 24.PNG|600px|none]]
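The vertex/edge model described above (handles as vertices, interactions as edges) can be sketched outside of Excel. This is a minimal, hypothetical Python representation; the handle names and relationship labels are illustrative, not taken from the tool:

```python
from collections import defaultdict

# Hypothetical edge list: (source handle, target handle, relationship type).
# Relationship types mirror those listed above: follows, mention, reply, favorite.
edges = [
    ("@techcrunch", "@redpointvc", "mention"),
    ("@redpointvc", "@accel", "follows"),
    ("@accel", "@redpointvc", "reply"),
]

# Build a directed adjacency list keyed by handle.
graph = defaultdict(list)
for src, dst, rel in edges:
    graph[src].append((dst, rel))

print(dict(graph))
```

A structure like this is what we would need to export from NodeXL if we want to process the graph in other environments, which is exactly the portability question raised above.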
===Features and Review===

====Automation====

**Clean-up before analysis and display

**Group vertices by cluster (e.g. the Clauset-Newman-Moore algorithm to identify community structures) and calculate clustering coefficients

**Count and merge duplicate edges (and scale the resultant edge's width in proportion to the number of edges merged)

**Layout method - e.g. the Harel-Koren Fast Multiscale Layout algorithm
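The duplicate-edge merging step above can be sketched in a few lines of Python. This is a stand-in for what NodeXL automates, not its actual implementation, and the width-scaling rule (one unit of width per merged edge) is an illustrative assumption:

```python
from collections import Counter

# Raw directed edges, possibly repeated (one per observed mention/reply).
raw_edges = [
    ("@a", "@b"), ("@a", "@b"), ("@a", "@b"),
    ("@b", "@c"),
]

# Merge duplicates: count how many raw edges collapse into each unique edge.
merged = Counter(raw_edges)

# Scale edge width in proportion to the number of merged edges
# (base width of 1.0 per merged edge -- an illustrative choice).
widths = {edge: 1.0 * count for edge, count in merged.items()}

print(widths)  # {("@a", "@b"): 3.0, ("@b", "@c"): 1.0}
```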
  
====Centrality measures====

**Betweenness centrality - identification of corridor/ambassador nodes that are important links between adjacent network communities. In other words, identification of the most BROADLY CONNECTED nodes in the network. Think: few friends in high places, as opposed to an abundance of low-level friends

**Closeness centrality - related to the clustering coefficient. Identification of strong communities within a larger network

**Eigenvector centrality - unclear

**Clustering coefficient - as above
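To make the centrality measures above concrete, here is a small self-contained sketch computing closeness centrality on a toy undirected graph by breadth-first search. This is the textbook definition, closeness = (n - 1) / (sum of shortest-path distances from the node), not NodeXL's exact routine:

```python
from collections import deque

# Toy undirected graph as an adjacency dict (a path: a - b - c - d).
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}

def closeness(graph, start):
    """Closeness centrality: (n - 1) / sum of BFS distances from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    total = sum(dist.values())
    return (len(graph) - 1) / total if total else 0.0

# Interior nodes are "closer" to everyone than the endpoints.
print(closeness(graph, "b"))  # 3 / (0 + 1 + 1 + 2) = 0.75
print(closeness(graph, "a"))  # 3 / (0 + 1 + 2 + 3) = 0.5
```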
  
====Overall graph metrics====

**In a nutshell: '''Highly customizable'''

**Vertices and edge count

**Unique edges

**Edge width - can be a function of the number of merged edges, etc.

**Node size/color - can be a function of the node's degree, centrality measures, etc.

**Egonet - the user can look at each node as the "center of the network universe"

***Pagerank - useful Google coefficient that measures how good one node's IN-FLOW is, i.e. the tendency to end up at the subject node as an agent travels around its neighborhood

***Number of tweets ever created

***Number of tweets favorited

***Other common "user data"

***The user can view egonets in a matrix and apply "sort by" to easily identify the nodes with the highest e.g. in/out-degree, centrality, pagerank, etc.

***Graph density - 2*|E|/(|V|*(|V|-1))

***Connected Components calculation

[[File:Capture 23.PNG|600px|none]]
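The density formula and the Pagerank bullet above can be checked with a short sketch. This is a basic power-iteration PageRank, not NodeXL's exact routine; the damping factor of 0.85 is the conventional choice, and the toy graph is made up:

```python
# Toy directed graph: who mentions whom.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]

# Density for a directed graph: |E| / (|V| * (|V| - 1)).
# (The 2*|E|/(|V|*(|V|-1)) form above is the undirected variant.)
density = len(edges) / (len(nodes) * (len(nodes) - 1))
print(density)  # 4 / 6

def pagerank(nodes, edges, damping=0.85, iters=50):
    """Basic power-iteration PageRank over a directed edge list."""
    out_deg = {n: 0 for n in nodes}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, dst in edges:
            new[dst] += damping * rank[src] / out_deg[src]
        rank = new
    return rank

ranks = pagerank(nodes, edges)
print(ranks)  # ranks sum to ~1; "c" has the most in-flow here
```

The "tendency to end up at the subject node" intuition above corresponds to the `new[dst] += damping * rank[src] / out_deg[src]` step: each node passes its rank along its outgoing edges.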
  
*Inspiration, or "dream case" as Ed will
+
 
 +
===Inspiration, or the "Dream Case"===
 
**'''What if we''' tap on NLP capabilities to monitor twitter handles that are known to be important, and have a constant feed of important rising new '''words''', rising new mentions and  
 
**'''What if we''' tap on NLP capabilities to monitor twitter handles that are known to be important, and have a constant feed of important rising new '''words''', rising new mentions and  
 
rising new hashtags. Using this feed, we can populate and update graphs constantly, measuring delta instead of using graph data per se, and thus develop a good grasp of rising organizations, events and startups in the twitterverse. We would know things before other people do. Value.
 
rising new hashtags. Using this feed, we can populate and update graphs constantly, measuring delta instead of using graph data per se, and thus develop a good grasp of rising organizations, events and startups in the twitterverse. We would know things before other people do. Value.
 
***Empirically, and in a micro way, I have observed that a new startup known as '''Aminohealth''' (enables end-users to shop around for doctors based on price range; seems very novel and in-demand) has been appearing very constantly on important feeds such as @techcrunch, @redpointvc and @accel. It has just received a funding round (I'm writing this in 7/27) but is relatively unknown in the bigger twitter picture. In fact, Aminohealth does not have a twitter handle, nor is its hashtag populated. Delta is far more important than what-is for rising startups as such.
 
***Empirically, and in a micro way, I have observed that a new startup known as '''Aminohealth''' (enables end-users to shop around for doctors based on price range; seems very novel and in-demand) has been appearing very constantly on important feeds such as @techcrunch, @redpointvc and @accel. It has just received a funding round (I'm writing this in 7/27) but is relatively unknown in the bigger twitter picture. In fact, Aminohealth does not have a twitter handle, nor is its hashtag populated. Delta is far more important than what-is for rising startups as such.
***Empirically, the twitterverse is populated by important organizations as well as, we often forget, their staff. @jflomenb has useful tweets but has came into activity from a 6-year twitter hiatus. His name has been constantly mentioned by @redpointvc and @accel too. Again, delta is crucial.  
+
***Empirically, the twitterverse is populated by important organizations as well as, we often forget, their staff. @jflomenb is constantly mentioned by @redpointvc and @accel, and has interesting exposes information about the entrepreneur scene, as shown. Again, delta is crucial.  
 +
[[File:Capture 22.PNG|600px|none]]
 
**'''What if we''' compare social networks against themselves over time?

***If we generate useful network graphs and data '''OVER TIME''' that revolve around a single entity, e.g. @redpointvc, we would be able to do a few pretty amazing statistical analyses:

****The mean number of mentions before a startup gets signed to a VC

****What are the quantitative tweet indicators that a startup is succeeding/failing?

****All the startups a VC has signed since the VC obtained a twitter handle

****The average pace at which a VC signs startups

****What are the qualitatively trendy topics that are mentioned in the history of a VC? Does this influence their activity, if at all?

****Any regression for the above, and more

**'''What if we''' track ongoing events such as #kpceoworkshop?

***It'll be easy to find out who are the people attending the workshop, and add them to our watchlist of important people

***Also, how important or impactful are these events? We can track their mentioners, likers and followers to identify and think about events that happen after these events conclude. (hmm..)
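The delta idea above can be sketched directly: count entity mentions per time window and surface the biggest risers. The handles and counts here are made up for illustration:

```python
from collections import Counter

# Hypothetical mention counts scraped from watched feeds in two windows.
last_week = Counter({"aminohealth": 1, "uber": 40, "kpceoworkshop": 0})
this_week = Counter({"aminohealth": 9, "uber": 42, "kpceoworkshop": 5})

# Delta per entity: what is rising, regardless of absolute volume.
delta = {name: this_week[name] - last_week[name]
         for name in set(last_week) | set(this_week)}

# Sort by delta, descending: the risers come first.
risers = sorted(delta.items(), key=lambda kv: kv[1], reverse=True)
print(risers)  # aminohealth (+8) outranks uber (+2) despite uber's volume
```

This is the sense in which "delta is far more important than what-is": a low-volume entity with a sharp rise sorts above a high-volume entity that is flat.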
  
===Limitations===

**An input query is ''necessary''. I don't think the user can simply ask for a graph of all the followers of @xxx, for instance.

**It's a black box - this tool is designed for end-users who want to study contingent trends and discrete events, rather than a comprehensive and stable picture of a certain "scene" (i.e. the entrepreneur scene, in our case).

***We can, of course, run the tool continuously for all trends that we identify. But would we be able to join them all up in an aggregate fashion?

**Unsure of the usefulness of the output

***Sure, it will be nice to generate graphs and knowledge about upcoming events and organizations, but will we be able to harness this information and use it to do other things?

***In other words, it's unclear how portable our output data is

===Thoughts===

**In my recent days of interacting with the twitterverse, it has become clear that Twitter is spectacular because of its malleability, flexibility and decentralized nature. All forms of social organization on Twitter are explicitly time-contingent and user-contingent. This is why it is such an important hotbed for sociological research - it provides wonderful material for the study of social dynamics and social organization.

**In this vein, what we think of as the "Entrepreneurship Twitterverse" can, more precisely, be thought of as a time-contingent and very specific community shaped by its own trends, influencers, and cultural values, all of which are in turn shaped by the very specific people who are interested and involved in the same ideas/things. In our case: investments, foundings, IPOs, acquisitions, etc.

**'''In light of this, does it make more sense for us to study deltas instead of things as-they-are?'''

===Demo===
 
**In the following test case done by www.pewinternet.org, the user attempted to graph the community activity around the topic "pew internet". He entered a list of search strings, all including the keywords "pew internet", over a fixed period of 58 days and some miscellaneous hours. His edges are created for each mention and reply that appeared in the time bracket. His edge colors and widths are proportional to the number of mentions/replies that occurred between two nodes (users). The color and transparency of his nodes are related to follower values, i.e. how many followers each node has.

[[File:Capture 20.PNG|600px|none]]

[[File:Capture 21.PNG|600px|none]]

Revision as of 15:20, 27 July 2016

