Dr Luke Sloan is a Senior Lecturer in Quantitative Methods, Deputy Director of Cardiff Q-Step and a member of the Collaborative Online Social Media Observatory (COSMOS: www.cosmosproject.net). He is based in the School of Social Sciences at Cardiff University and his research focuses on the development of demographic proxies for Twitter data and understanding how social media data can augment traditional modes of social scientific analysis. @drlukesloan
A perennial criticism of Twitter data is that it is missing many of the variables that interest us as social scientists and that, because of this, it will never be a viable source of data for social scientific analysis. We are anchored to the practices of survey methodology, in which a question is asked and answered; this ensures that the researcher collects the relevant demographic information, allowing us to compare gender, ethnic and socio-economic groups. This is the bread and butter of social science.
In contrast, social media data is naturally occurring – it is not elicited! Because of this it is unfocused, messy and does not neatly address a pre-conceived research question. But it is a rich source of information on attitudes and provides insights into immediate reactions following key events. It’s been used to predict elections, box office revenue and even to calculate the epicentre of an earthquake. So clearly we shouldn’t be so quick to dismiss this data as useless, particularly if we are creative and innovative in how we conceptualise the manner in which demographic data may manifest and thus open this data up to social scientific analysis.
Imagine that you are walking down the street and have decided that today you are going to guess the demographic characteristics of the people that you see – the only rule is that you cannot ask them outright; you must observe their behaviour without being obtrusive. How might you work out someone’s gender? Well, perhaps you overhear someone shouting his or her name. What about their occupation? Maybe they have an ID badge or are carrying tools. What about their age? Well, we all make guesses about age based on appearance, often at the risk of offending someone. The point is that through the passive uptake of incidental information which is there to be analysed (and which you have not elicited!) you can tell quite a bit about a person.
Now let’s consider this in the context of Twitter. People put their name on Twitter, thus allowing us to derive a proxy for their gender. For those who have geo-tagging switched on we can tell where they were when they tweeted, or we can use profile information to work out their home town. If we have enough time we can even look at the places which they make reference to in their tweets. We know about their hobbies as they report on their leisure activities and we know a bit about their work if they report on it via social media. Are they employed? Well, we can have a look at whether they’re complaining about work, about colleagues or about the printer breaking down (‘again!’). When we look closely enough we are flooded with ‘signatures’ that offer us an indication of characteristics that would typically be found in the demographics section of a survey.
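As a rough illustration of how two of these location signatures might be combined, here is a minimal Python sketch. The dictionary keys `coordinates` and `profile_location` are hypothetical names chosen for this example, not fields of any particular Twitter API, and the function records which signature it used because a geo-tag says where someone was when they tweeted, whereas a profile entry is a guess at their home town.

```python
from typing import Optional, Tuple


def location_proxy(tweet: dict) -> Tuple[Optional[str], str]:
    """Return (location, source) for a tweet record.

    Assumes a hypothetical record layout:
      - "coordinates": a (latitude, longitude) pair when geo-tagging is on
      - "profile_location": free text typed by the user in their profile
    """
    coords = tweet.get("coordinates")
    if coords is not None:
        lat, lon = coords
        # Where the user was at the moment of tweeting
        return f"{lat:.4f},{lon:.4f}", "geotag"

    profile = (tweet.get("profile_location") or "").strip()
    if profile:
        # A self-reported place, used here as a stand-in for home town
        return profile, "profile"

    # No usable signature: leave the user unclassified rather than guess
    return None, "unknown"
```

Keeping the source label alongside the value means the two kinds of location are never silently mixed in later analysis.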
The sticking point is that we can’t derive this information for all tweeters, and not all proxies are equally reliable. First names are actually quite an accurate proxy for gender as identity play is a minority pursuit. As long as you have stringent classification rules and accept that around 52% of UK users can’t be classified, you still have information for the remaining 48%* (around 600,000 successfully identified users). You could think of this 48% as a sample of Twitter users which is analogous to a survey sample, although not randomly sampled… but even then, do we have any reason to think that the users we have been able to identify are substantively different to those we can’t?
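To make the idea of stringent classification rules concrete, here is a minimal Python sketch of a first-name gender proxy. It is an illustration only and not the procedure behind the figures quoted above: the name lists are tiny, made-up placeholders, and the rule (classify only when the first token is purely alphabetic and appears in exactly one list) is an assumption chosen to show how strictness leaves many users unclassified.

```python
from typing import Iterable, Optional

# Hypothetical, deliberately tiny lookup tables for illustration only;
# a real study would rely on a much larger, validated list of first names.
FEMALE_NAMES = {"emma", "sarah", "olivia", "alex"}
MALE_NAMES = {"luke", "james", "david", "alex"}  # "alex" is in both: ambiguous


def first_token(display_name: str) -> Optional[str]:
    """Return the lower-cased first word of a display name, or None."""
    parts = display_name.strip().split()
    if not parts:
        return None
    token = parts[0].lower()
    # Stringent rule: discard anything containing digits or symbols
    return token if token.isalpha() else None


def classify_gender(display_name: str) -> Optional[str]:
    """Return "female", "male", or None when no confident label exists."""
    token = first_token(display_name)
    if token is None:
        return None
    in_female = token in FEMALE_NAMES
    in_male = token in MALE_NAMES
    if in_female and not in_male:
        return "female"
    if in_male and not in_female:
        return "male"
    # Unknown or ambiguous names are left unclassified, not guessed
    return None


def classification_rate(display_names: Iterable[str]) -> float:
    """Share of display names that received a gender label."""
    names = list(display_names)
    if not names:
        return 0.0
    labelled = sum(1 for name in names if classify_gender(name) is not None)
    return labelled / len(names)
```

Calling `classification_rate` on a collection of display names gives the proportion that could be labelled, which is the kind of figure the 48% above represents; the unclassified remainder is reported rather than forced into a category.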
The bottom line is that it is possible to derive important demographic information from Twitter data if we’re prepared to think creatively. The methods will get better and programmes of work will emerge which allow the confirmation of proxy demographic reliability. We’re only a few metres off the ground on our climb up this new methodological edifice, but seeking out a viable trail enables others to follow and establish safer, more secure routes.