FEATURE 27 June 2016

The language of the internet


Johannes Eichstaedt mines social data to determine the psychological state of populations, with some compelling findings. He spoke to Jane Bainbridge


Throughout his academic career, Johannes Eichstaedt has always worked with data – it’s just that initially, as a physicist, it was particle physics data rather than psychological data he was processing. But when he realised that he “didn’t much care for working in a particle accelerator”, Eichstaedt switched to psychology.

Now, as a data scientist in psychology at the University of Pennsylvania, and co-founder of the World Well-Being Project, his time is spent using natural language processing to measure well-being among populations.

His research has included using tweets to predict heart disease and Facebook statuses to identify depression.

Language patterns

In the case of heart disease and Twitter, he was part of the team of scientists that analysed more than 50,000 tweeted words to characterise community-level psychological correlates of dying from atherosclerotic heart disease (AHD) in the US.

The language patterns identified as risk factors reflected negative social relationships, disengagement and negative emotions such as anger; while positive emotions and psychological engagement emerged as protective factors. In their findings, published in Psychological Science, the researchers found that “a cross-sectional model based only on Twitter language predicted AHD mortality significantly better than a model combining 10 common demographic and socioeconomic risk factors, including smoking, diabetes and obesity”.

Twitter topics that positively correlated with county-level AHD mortality included hostility and aggression; hate and interpersonal tension; and boredom and fatigue. In comparison, topics that negatively correlated were skilled occupations, positive experiences and optimism.

“It’s not that Twitter has some magical prediction power that other variables don’t have. It’s an extremely good predictor of income and education and of communities where people smoke – so it picks up predictors of health behaviour, and then it adds a sliver of psychological causation that the other variables don’t seem to be getting at,” says Eichstaedt.

A latent Dirichlet allocation (LDA) algorithm crunches the data by working with 2,000 language clusters, or topics, that distil what people talk about in their Facebook statuses or tweets.
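To make the idea of language clusters concrete, here is a minimal sketch of topic modelling on a handful of short posts. It is not the team's pipeline: the posts are invented for illustration, the function name `fit_lda` and all its parameters are hypothetical, and it uses a toy collapsed Gibbs sampler with two topics rather than the 2,000 mentioned above.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for topic modelling (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens per (post, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                            # tokens per topic
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all the other tokens.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1

    # Share of each post attributed to each topic (each row sums to 1).
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    # The three most frequent words in each topic.
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[k][word_id[w]])[:3]
        for k in range(n_topics)
    ]
    return theta, top_words


# Invented example posts, already split into words.
docs = [
    "hate this boring tired day".split(),
    "so tired and bored hate mondays".split(),
    "great conference learned so much".split(),
    "wonderful opportunity great team".split(),
]
theta, top_words = fit_lda(docs, n_topics=2)
```

Each row of `theta` is the mix of topics in one post. In an analysis of the kind described above, it is these topic shares, aggregated by county, that would be compared with health outcomes.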

But how accurate is this social media data? Eichstaedt says there are two biases in it – sample bias and desirability bias. However, he says the sample bias is overestimated: “The median age on Twitter is 32 and for the US population it’s about 36/37”, adding that once the sample is big enough, the model re-stratifies the sample to be more representative.
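The article does not say how the re-stratification is done. One common approach is post-stratification, in which each group in the sample is weighted by its population share divided by its sample share. The sketch below assumes that approach; the age bands, counts and shares are made up for illustration, and the function name is hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight each stratum so the weighted sample matches the population mix."""
    total = sum(sample_counts.values())
    weights = {}
    for stratum, count in sample_counts.items():
        sample_share = count / total
        weights[stratum] = population_shares[stratum] / sample_share
    return weights


# Hypothetical numbers: a sample that skews young compared with the population.
sample_counts = {"18-29": 500, "30-49": 350, "50+": 150}
population_shares = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

weights = poststratification_weights(sample_counts, population_shares)

# After weighting, each stratum's share of the sample equals its population share.
total = sum(sample_counts.values())
weighted_shares = {
    s: sample_counts[s] * weights[s] / total for s in sample_counts
}
```

With these invented figures, younger users are weighted down and older users weighted up. That only works if every group has enough people in the sample, which is why sample size matters.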

He says there’s some evidence that people misrepresent themselves (desirability bias), in particular by suppressing negative emotion, but that “the variance between people is still highly interpretable”.

Facebook data

The platforms also differ in the data they yield. Facebook users have to give permission for their data to be used, which means “if you get 50,000 in a sample that’s amazing – generally data from Facebook users is 3 – 10 times as good as Twitter users”.

For his depression study he used Facebook data.

“For psychological insight, Facebook is preferable; it’s just you can’t get its data for that many people.” But in this case he was using data collected by someone else, which he reinterpreted to understand depression.

Looking forward, he thinks diabetes, which seems to have a lot of behavioural predictors, might be an area worth researching, though not all areas of well-being research are ripe for social media data analysis.

“As long as your data is big enough it will always work; the question is, will it improve on other methods? And there the answer is sometimes no; when trying to predict something like cancer, it didn’t work because income and education appear to be a much better predictor than what’s happening on Twitter.”
