Research Snapshot: Twitter's big data infovis

[Note: What you're reading is the first installment of a new series I'm writing at Thinky.org. I'm frustrated that cutting-edge academic and industry research from technology domains isn't getting into the hands of product managers and front-line workers. That's because the conferences are expensive and the papers are hard to read. I'm going to boil down current research in user experience design, interaction design, infovis, and other fields into short (500-word) posts that make it relevant to other product designers and product owners. Should be fun, right?!]

Paper: Using Visualizations to Monitor Changes and Harvest Insights from a Global-Scale Logging Infrastructure at Twitter [ PDF ]

Authors: Krist Wongsuphasawat and Jimmy Lin

Tags: New System, Infovis, Big Data, Product Management

Web analytics is about getting the right data to make decisions about your product, your website, or your marketing program. Big data means you have more data than you can easily make sense of. So what do we do? Learn from the experts: few companies have bigger big data than Twitter. A note about the scale of the problem: Twitter has thousands of Hadoop clusters that ingest more than 100TB of log files daily (not to mention the TBs of tweets themselves). That is some purty big data!

Wongsuphasawat & Lin (with help from the data science team at Twitter) published a great paper at 2014’s VAST conference. The authors created Scribe Radar, a welcome addition to the open-source logging platform Scribe (developed at Facebook in 2008 - here’s a little bit about Scribe).

The software allows the Twitter product management team to visualize millions of user sessions in a flexible and exploratory way.

The value of infovis is often a virtuous cycle: better data inspires you to ask better questions, which in turn inspires further data exploration. But this cycle had been blocked at Twitter (and likely at your company or organization too) because no one in product design, marketing, or development can keep the different attributes straight. Twitter has over 10,000 event types. There is also the question of how an infovis system can support all the different paths users take, not just one-funnel-at-a-time analysis (see Shen et al. 2013 for some work on the eBay checkout flow).

The product has interactive visualizations (view a 30 second 'preview' video) built around an icicle diagram that shows path details and depths. The authors then add some nice ways to show which paths have increased or decreased in frequency over the last n days or weeks, along with pop-up data visualizations that allow deeper inspection by element or over time.
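To make the idea concrete, here is a minimal Python sketch of the kind of data an icicle diagram of user paths could be drawn from: a count of sessions for every path prefix, plus a comparison between two time periods. This is not Scribe Radar's implementation. The function names, the page names, and the two sample weeks are all made up for illustration.

```python
from collections import Counter


def count_path_prefixes(sessions):
    """Count how many sessions pass through each path prefix.

    Each session is a list of event names in the order they occurred.
    Every prefix of a session is one rectangle in an icicle diagram,
    and its count determines the width of that rectangle.
    """
    counts = Counter()
    for events in sessions:
        for depth in range(1, len(events) + 1):
            counts[tuple(events[:depth])] += 1
    return counts


def frequency_change(before, after):
    """Return the change in session count for every prefix seen in either period."""
    return {
        prefix: after.get(prefix, 0) - before.get(prefix, 0)
        for prefix in set(before) | set(after)
    }


# Hypothetical sessions for two periods (illustrative data only).
last_week = [
    ["home", "profile", "follow"],
    ["home", "profile"],
    ["home", "search"],
]
this_week = [
    ["home", "profile", "follow"],
    ["home", "profile", "follow"],
    ["home", "search"],
]

before = count_path_prefixes(last_week)
after = count_path_prefixes(this_week)
change = frequency_change(before, after)

# List the prefixes whose frequency went up or down.
for prefix, delta in sorted(change.items()):
    if delta != 0:
        print(" > ".join(prefix), delta)
```

A real system would work on aggregated counts rather than raw session lists, but the shape of the result (one count per path prefix, compared across periods) is the part that matters for the visualization.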

What you can learn from this project:

  • Good conceptual overview of what 'big data' log files look like at a major corporation
  • What kinds of questions data scientists, product managers, and developers ask. Here's a quick list that's further explored in the paper:
    • What do users do when they first visit Twitter?
    • How did the introduction of a new feature change user behavior?
    • Where do 'follows' happen?
  • How to carefully collapse (technically, 'aggregate') data in smart ways to reduce the number of queries. Some suggestions that the authors implemented in Radar: reduce the number of event types, force analysts to select a subset of event types, and allow the user to select a start and an end page/state (and the maximum number of steps between them).
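
The three aggregation ideas in the last bullet can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not the authors' code: the colon-separated event names, the `collapse` and `extract_paths` helpers, and the sample session are all hypothetical.

```python
def collapse(event, levels=1):
    """Reduce the number of event types by keeping only the first
    `levels` parts of a colon-separated event name (assumed format)."""
    return ":".join(event.split(":")[:levels])


def extract_paths(session, start, end, max_steps, keep=None, levels=1):
    """Return the sub-paths of one session that go from `start` to `end`.

    session   -- list of raw event names, in order
    start/end -- collapsed event names that bound the path
    max_steps -- maximum number of steps allowed after `start` to reach `end`
    keep      -- optional set of collapsed event names; all others are dropped
    """
    events = [collapse(e, levels) for e in session]
    if keep is not None:
        events = [e for e in events if e in keep]

    paths = []
    for i, event in enumerate(events):
        if event != start:
            continue
        # Look ahead at most max_steps events for the first occurrence of `end`.
        last = min(i + max_steps, len(events) - 1)
        for j in range(i + 1, last + 1):
            if events[j] == end:
                paths.append(tuple(events[i:j + 1]))
                break
    return paths


# Hypothetical session using made-up "page:element:action" event names.
session = [
    "home:timeline:view",
    "search:box:query",
    "search:results:click",
    "profile:header:view",
    "profile:button:follow",
]

# All pages, up to three steps from home to profile.
print(extract_paths(session, "home", "profile", max_steps=3))

# Only home and profile events, so the search steps disappear.
print(extract_paths(session, "home", "profile", max_steps=1,
                    keep={"home", "profile"}))
```

Each of the three knobs shrinks the space of distinct paths, which is what keeps the number of queries (and the size of the diagram) manageable.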