Wednesday, 19 November 2014

Making the most out of big data: computer-mediated methods

Patrick Readshaw is a Media and Cultural Studies Doctoral Candidate at Canterbury Christ Church University. Patrick is interested in social media as an alternative and empowering source of information on current events, free from the constraints of other agenda-setting media forms. You can contact Patrick by email on  

When I was asked to write a blog post for NSMNSS I was certainly excited, and as this is my first post of this kind I was suitably anxious about the prospect. However, my ongoing thesis has never ceased to provide interesting discussions with individuals in linked or parallel fields relating to social media. The main challenge in these discussions is that I often have to try not to overcomplicate things. With that in mind, and my ham-fisted introduction out of the way, I want to take some time to break down the value of so-called “new media systems” like Twitter, and how I personally go about dealing with the data I collect.

Since social media sites such as Facebook burst onto the scene ten years ago, researchers and market analysts have been looking for ways to tap into the content on these sites. In recent years there have been several attempts to do this, some more successful than others (Lewis, Zamith & Hermida, 2013), particularly with regard to the scale of the medium in question. For the uninitiated (apologies to those who aren't), “Big Data” is the catch-all term for the enormous trails of information generated by consumers going about their day in an increasingly digitised world (Manyika et al., 2011). It is this sheer volume of information that poses the first hurdle to be overcome when conducting research online. For example, earlier this year I was collecting data on the European Parliamentary election and gathered over 16,000 tweets in about three weeks. Bearing in mind that the average tweet contains approximately 12 words in 1.5 sentences (Twitter, 2013), those three weeks left me with roughly 196,500 words, or 24,500 sentences, to come to terms with. That is a lot of data for one person to deal with alone, especially using only manual techniques such as content analysis.
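The scale problem is easy to reproduce as a back-of-envelope calculation. The sketch below uses only the per-tweet averages cited above; the exact tweet count (16,375) is an assumption chosen to illustrate how the word figure arises, not a number from the study.

```python
# Back-of-envelope corpus size from a tweet count, using the cited
# averages of ~12 words and ~1.5 sentences per tweet.
AVG_WORDS_PER_TWEET = 12
AVG_SENTENCES_PER_TWEET = 1.5

def corpus_size(n_tweets):
    """Return (approx. word count, approx. sentence count) for n_tweets."""
    return (n_tweets * AVG_WORDS_PER_TWEET,
            int(n_tweets * AVG_SENTENCES_PER_TWEET))

words, sentences = corpus_size(16_375)  # hypothetical tweet count
print(words, sentences)  # 196500 24562
```

Even at three weeks of collection, the totals quickly outrun what one coder can read, which is the point being made above.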

So ultimately you have to ask two questions. First, how many undergraduates or interns chained to computers running basic content analysis would it take to complete the analysis in a reasonable space of time, and would that analysis be reliable between analysts? Second, while computational methods save time on analysis, can they guarantee the same depth as manual content analysis? Given that content analysis goes beyond the basic frequency statistics that can be collected from Twitter's own search engine, I advocate computer-mediated techniques in which the collected data are first reduced using filters that remove retweets and spam responses, and then structured, or at least conceptualised along a number of important factors, using methods such as hierarchical cluster analysis. Both Howard (2011) and Papacharissi (2010) utilise this mixed-methods approach, as do Lewis, Zamith and Hermida (2013), whose method I adapted to my own work and applied as described above. These studies also suggest the value of the medium as a source of data, given its role as a primary news disseminator when access to mainstream news media is blocked, as during the 2011 Arab Spring. Burgess and Bruns (2012) conducted additional research on the 2010 federal election campaign in Australia, advising the use of computational methods to reduce the sample so that manual methods remain feasible and depth is maintained during content analysis. Lewis, Zamith and Hermida (2013) and Manovich (2012) both support the methodologies utilised by the studies above and advocate making the most of the technical advances that allow such content to be organised and harnessed efficiently.
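The reduction step described above (strip retweets and spam before any clustering or manual coding) can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: the record fields ("user", "text") are hypothetical, the retweet test is the classic "RT @" prefix, and the spam filter is a crude exact-duplicate check. The resulting term frequencies are the kind of input one might then pass to a hierarchical clustering routine (e.g. in scipy).

```python
# Minimal sketch of a tweet-reduction filter: drop retweets and
# exact-duplicate texts, then count terms in what remains.
from collections import Counter

def reduce_tweets(tweets):
    """Drop retweets and duplicate texts, keeping first occurrences."""
    seen = set()
    kept = []
    for t in tweets:
        text = t["text"].strip()
        if text.lower().startswith("rt @"):  # classic retweet marker
            continue
        if text in seen:                     # crude spam/duplicate filter
            continue
        seen.add(text)
        kept.append(t)
    return kept

sample = [
    {"user": "a", "text": "RT @b: vote today"},   # retweet, dropped
    {"user": "b", "text": "vote today"},
    {"user": "c", "text": "vote today"},          # duplicate, dropped
    {"user": "d", "text": "polls open at 7"},
]
reduced = reduce_tweets(sample)
terms = Counter(w for t in reduced for w in t["text"].lower().split())
print(len(reduced))  # 2
```

In practice the filtering rules would be more sophisticated, but the principle is the same: shrink the corpus mechanically so that the depth of manual analysis stays affordable.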

The application of mixed methodologies will continue to develop the techniques integral to the oncoming age of computational social science (Lazer et al., 2009), or “New Social Science”. At the same time, it is vitally important that this readily available source of data is not exploited in ways that could damage the medium as a whole, and that good research practice is maintained concerning the ethics of consumer privacy. As a final aside, I would remind everyone that these data are hugely fascinating and rich beyond belief, but there are dangers associated with quantifying social life, and this should be at the front of our minds before, during and after conducting research online (Boyd & Crawford, 2012; Oboler, Welsh & Cruz, 2012).


Boyd, d. & Crawford, K. (2012). Critical questions for Big Data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15 (5), 662–679.

Burgess, J., & Bruns, A. (2012). (Not) the Twitter election: The dynamics of the #ausvotes conversation in relation to the Australian media ecology. Journalism Practice, 6 (3), 384–402.
Howard, P. (2011). The digital origins of dictatorship and democracy: Information technology and political Islam. London, UK: Oxford University Press.

Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D. & Van Alstyne, M. (2009). Life in the network: The coming age of computational social science. Science, 323 (5915), 721-723.

Lewis, S. C., Zamith, R., & Hermida, A. (2013). Content analysis in an era of big data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57 (1), 34–52.

Manovich, L. (2012). Trending: The promises and the challenges of big social data. In M. K. Gold (Ed.), Debates in the Digital Humanities (pp. 460–475). Minneapolis, MN: University of Minnesota Press.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.

Oboler, A., Welsh, K., & Cruz, L. (2012). The danger of big data: Social media as computational social science. First Monday, 17 (7-2).

Papacharissi, Z. (2010). A private sphere: Democracy in a digital age. Cambridge, England: Polity Press.

Thursday, 13 November 2014

The changing nature of who produces and owns data: How will it impact survey research?

Brian Head is a research methodologist at RTI International. This post first appeared on SurveyPost on 20 May, 2014. You can follow Brian on Twitter @BrianFHead.


Survey researchers have become interested in big data because it offers potential solutions to problems we’re experiencing with traditional methods. Much of the focus so far has been on social media (e.g., Tweets), but sensors (wearable tech) and the internet of things (IoT) are producing an increasingly rich, complex, and massive source of data. These new data sources could lead to an important change in how individuals see the data collected about them, and thus have ramifications for those interested in gathering and analyzing those data.

Who compiles data?

Quantitative data about people have been gathered for millennia. But with technological advances and the identification of new purposes for such data, the past 100 years have seen significant increases in the amount of data produced and collected (e.g., data on consumer patterns and other market research, probability surveys, etc.).

Three factors are common to these data: 1) the data are a commodity compiled, used, or traded by third parties; 2) there are generally no direct benefits to the individuals about whom data are gathered; and 3) the organizations interested in the data gather, store, and analyze it. This is not to say that individuals haven't collected information about themselves throughout history. They have collected qualitative data in the form of diaries and biographies, and some quantitative data, though generally to satisfy a third party (e.g., collecting financial information to file taxes). But now, in addition to all of the data others compile about them, new technologies like wearable sensors and IoT devices allow people to voluntarily produce and compile massive amounts of data about themselves, and doing so can benefit them directly. (Involuntary data collection through connected devices is already taking place; for example, internet-connected devices are being used for geo-targeted advertising.)

Who owns or controls data?

Data are collected in different ways. Census data are collected periodically (intervals vary by nation) through a mandatory government data collection. Surveys generally operate under the requirement of voluntary participation, although there are exceptions.  Much of the consumer data gathered now is done surreptitiously. Examples include browser cookies that collect information about the websites we visit, search engines that collect information about the internet searches people conduct, email providers that scan emails, and apps that use geodata to market goods and services to prospective clients.

The public seems increasingly aware of, and concerned with, the sum of these data collections. According to a recent Robert Wood Johnson Foundation (RWJF) study, large majorities of self-tracking app and device users believe they own (84%), or want to own (75%), the data collected with the device. There have been attempts to limit data collection, such as the recent attempt to limit the data the U.S. government collects on citizens. Advocates of such efforts tend to cite concerns over burden and privacy. The exponential growth of data collected both voluntarily and involuntarily through apps, sensors, and the IoT may prompt similar (perhaps successful) attempts to change government and corporate policies to give individuals more control over their data. In fact, market researchers are already responding to this interest among consumers by offering to pay them for access to their browsing history, social network activity, and online transactions, while at the same time giving those consumers control over which data they sell to the brokers.
As the amount of data collected about us increases, there is a good chance individuals will increasingly see their data as their own, understand its value to various third parties, demand more control over it, and expect to be compensated for it. At first brush that may seem concerning. However, the type of compensation individuals desire will likely depend on how the data will be used. For example, consumers are likely to continue trading data for convenience in services (see thesis #12). And the RWJF report cited above suggests the usual leverages used to gain survey participation (e.g., topic salience and altruism) may also work in gaining access to big data when the purpose of the study is “public good research.”

Need for further research

Further research is needed in this area of big data to answer questions like these: 1) to what extent, and how soon, will a larger proportion of the population begin to voluntarily use sensor and IoT devices; 2) will the general public continue to tolerate involuntary data collection when those data are collected by connected devices; 3) will the general public share the opinions of the early adopters in the RWJF study about sharing personal data from connected devices with survey researchers; 4) will the leverages that work for gaining survey participation also work for gaining access to personal big data, or will new or additional leverages be needed; 5) will we be able to use techniques similar to those used to access administrative record data, or will we need to develop new protocols for seeking permission to access these data? I look forward to seeing and contributing toward the research to answer these questions. What are your thoughts?

Thursday, 6 November 2014

You Are What You Tweet: An Exploration of Tweets as an Auxiliary Data Source

Ashley Richards is a survey methodologist at RTI International. This post first appeared on SurveyPost on 29 July 2014.

Last fall at MAPOR, Joe Murphy presented the findings of a fun study he did with our colleague, Justin Landwehr, and me. We asked survey respondents if we could look at their recent Tweets and combine them with their survey data. We took a subset of those respondents and masked their responses on six categorical variables. We then had three human coders and a machine algorithm try to predict the masked responses by reviewing the respondents' Tweets and guessing how they would have responded on the survey. The coders looked for any clues in the Tweets, while the algorithm used a subset of Tweets and survey responses to find patterns in the way words were used. We found that both the humans and the machine were better than random at predicting values of most of the variables.
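The post doesn't specify which algorithm the study used, but the general idea of learning word-use patterns from respondents with known answers and applying them to respondents with masked answers can be illustrated with a toy bag-of-words Naive Bayes classifier. Everything below (the labels, the example Tweets, the model itself) is an illustrative stand-in, not the study's implementation.

```python
# Toy multinomial Naive Bayes: learn word-label associations from
# (tweet_text, label) pairs, then predict a masked categorical label.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tweet_text, label). Returns a simple model."""
    word_counts = defaultdict(Counter)  # label -> word frequency table
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Pick the label with the highest log-probability for this text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label, n in label_counts.items():
        lp = math.log(n / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)  # Laplace
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical training data: Tweets from respondents whose answers we know.
model = train([
    ("haha omg so fun :)", "young"),
    ("lol cant wait haha", "young"),
    ("quarterly earnings report released", "older"),
    ("mortgage rates and pension news", "older"),
])
print(predict(model, "haha this is so fun"))  # young
```

A real system would need far more training data and better features, but this is the shape of the "find patterns in the way words were used" step described above.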

We recently took this research a step further and, with the help of our colleague Darryl Creel, compared the accuracy of these approaches to multiple imputation. Imputation is the approach traditionally used to account for missing data, and we wanted to see how the nontraditional approaches stack up. We also wanted to examine these approaches because imputation cannot be used when survey questions are not asked at all, which commonly occurs because of space limitations, the desire to reduce respondent burden, or other factors. I will be presenting this research at the upcoming Joint Statistical Meetings (JSM) in early August. I'll give a brief summary here, but if you'd like more details please check out my presentation or email me for a copy of the paper.
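For readers unfamiliar with imputation, the simplest version of the idea looks like this: fill each missing value with a draw from the observed responses. This is a deliberately crude sketch (closer to hot-deck imputation than to the multiple-imputation procedure the study actually used, which fits a model and produces several completed datasets), and the income categories shown are hypothetical.

```python
# Crude single imputation for a categorical variable: replace each
# missing value with a random draw from the observed distribution.
import random

def impute(values, missing=None, seed=0):
    """Return a copy of values with missing entries filled by random
    draws from the observed (non-missing) values."""
    rng = random.Random(seed)  # seeded for reproducibility
    observed = [v for v in values if v is not missing]
    return [v if v is not missing else rng.choice(observed)
            for v in values]

responses = ["<25k", "25-50k", None, "50-100k", None, ">100k"]
completed = impute(responses)
print(completed)
```

Multiple imputation repeats a model-based version of this fill-in several times and pools the results, which is what makes it a fair benchmark for the Tweet-based predictions discussed below.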

Income was the only variable for which imputation was the most accurate approach, but the differences between imputation and the other approaches were not statistically significant. Imputation correctly predicted income 32% of the time, compared to 25% for human coders and 26% for the machine algorithm. Considering that there were four income categories and a person would have a 25% chance of randomly selecting the correct response, I am unimpressed with these success rates of 25%-32%.

Human coders outperformed imputation on the other demographic items (age and sex), but imputation was more accurate than the machine algorithm. For these variables, the human coders picked up on clues in respondents’ Tweets. I was one of the coders and found myself jumping to conclusions, but I did so with a pretty good rate of success. For instance, if a Tweeter said “haha” a lot or used smiley faces, I was more likely to guess the person was young and/or female. These are tendencies that I’ve observed personally but I’ve read about them too.

As a coder I struggled to predict respondents’ health and depression statuses, and this was evident in the results. Imputation was better than humans at predicting these, but the machine algorithm was even more accurate. The machine was also best at predicting who respondents voted for in the previous presidential election, with human coders in second place and imputation in last place. As a coder I found that predicting voting was fairly simple among the subset of respondents who Tweeted about politics. Many Tweeters avoided the subject altogether, but those who Tweeted about politics tended to make it obvious who they supported.

So what does this all mean? We found that even with a small set of respondents, Tweets can be used to produce estimates with accuracy in the same range as, or better than,[1] imputation procedures. There is quite a bit of room for improvement in our methods that could make them even more accurate. For example, we could use a larger sample of Tweets to train the machine algorithm, and we could select human coders who are especially perceptive and detail-oriented. The finding that Tweets are as good as or better than imputation is important because imputation cannot be used when survey questions were not asked.

As interesting as these findings may be, they need to be taken with a grain of salt, especially because of our small sample size (n=29).[2] Relying on Twitter data is challenging because many respondents are not on Twitter, and those who are on Twitter are not representative of the general population and may not be willing to share their Tweets for these purposes. Another challenge is the variation in Tweet content. For example, as I mentioned earlier, some people Tweet their political views while others stay away from the topic on Twitter.

Despite these limitations, Twitter may represent an important resource for estimating values that are desired but not asked for in a survey. Many of our survey respondents are dropping clues about these values across the Internet, and now it’s time to decide if and how to use them. How many clues have you dropped about yourself online? Is your online identity revealing of your true characteristics?!?

[1] Even if approaches using Tweets are more accurate than imputation, they require more time and money, and in many cases may not be worth the tradeoff. As discussed later, these findings need to be taken with a grain of salt.

[2] We had more than 2,000 respondents, but our sample size for this portion of the study was greatly reduced after excluding respondents who don’t use Twitter, respondents who did not authorize our use of their Tweets, and respondents whose Tweets were not in English. Furthermore, half of the remaining respondents’ Tweets were used to train the machine algorithm.