
Big data and old wives’ tales

Autumn is upon us – can you feel it in your bones? Many people say the weather affects their joints, but is there an actual link between pain and meteorological conditions or is it just an old wives’ tale? That is the question a University of Warwick research student in statistics spent his summer helping to pick apart, after he became one of the first interns at the Alan Turing Institute in London.

David Selby, a PhD student in the Department of Statistics at the University of Warwick, was one of a team of four researchers working on Cloudy with a Chance of Pain, a citizen science project that asks people to record their pain levels on their smartphones using a specially designed app, which also captures the local weather conditions at the time. It is the world’s first smartphone-based study to investigate the possible connection.

The human dimension

The results of the study, funded by Arthritis Research UK, may be able to inform behaviour and treatment in the future. From a statistician’s point of view though, there was a lot of work to do on the data before any conclusions could be drawn.

“The problem with working with information collected by non-expert human beings is that it is not uniform,” says David. “People who signed up to the project had different ways of inputting the data each day, so we needed to find out what the data looked like and whether it could be analysed to give meaningful results.”

Complex data set

“The information collected so far by the app has formed a very complex data set,” continues David. “We set to work looking for a method of analysis. We looked at grouping the data in different ways to reflect the participants’ way of recording or interacting with the app. We had people who dipped in and out of the study, recording sporadically, those who were very active, and a range of behaviour in between.

“We devised a way of clustering the data according to the level of engagement, which allowed us to begin to look at what the information told us.

“We used a mixed hidden Markov model to cluster the participants according to the way in which they entered data using the app. Looking at the demographics of the 'most active' group, it was important to check whether such users tended to be older or younger, disproportionately male or female, or to have a higher prevalence of certain medical conditions relative to everybody else in the study. If so, this would bias any subsequent analysis. The model could also be useful for identifying patterns of usage associated with dropping out of the study, so that we can find out why these people are dropping out and inform recruitment strategies.

“Our work on the data set could have implications for how these kinds of e-health and self-reported studies operate in the future. It could help people design apps to make future analyses more straightforward.”

Alongside the main weather and pain analysis, the team also applied topic modelling to the written feedback to try to identify recurring themes. The weather and pain analysis involved fitting regularised logistic regression models (among others) for each participant, using the mean, minimum, maximum, gradient and variance of the weather over the one to five days before each pain reading. This task was computationally intensive and crashed the server a few times.
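To make the weather summaries concrete, here is a minimal Python sketch of how such features might be built for a single pain reading. It is an illustration only, not the team's code: the function names, the one-measurement-per-day layout and the temperature values are all assumptions, and "gradient" is taken here to mean the least-squares slope over the window. The resulting row could then be passed to a regularised logistic regression, which is not shown.

```python
from statistics import mean, pvariance


def window_features(daily_values, reading_day, window):
    """Summarise the `window` days immediately before `reading_day`.

    daily_values: one weather measurement per day (hypothetical layout).
    reading_day:  index of the day on which pain was reported.
    """
    start = reading_day - window
    if window < 1 or start < 0:
        raise ValueError("need between 1 and reading_day days of history")
    values = daily_values[start:reading_day]

    # Gradient (assumed meaning): least-squares slope against the day index.
    days = list(range(len(values)))
    if len(values) > 1:
        day_mean, value_mean = mean(days), mean(values)
        covariance = sum(
            (d - day_mean) * (v - value_mean) for d, v in zip(days, values)
        )
        spread = sum((d - day_mean) ** 2 for d in days)
        gradient = covariance / spread
    else:
        gradient = 0.0  # a single day has no trend

    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "gradient": gradient,
        "variance": pvariance(values),
    }


def features_for_reading(daily_values, reading_day, max_window=5):
    """Build one feature row covering windows of 1 to `max_window` days."""
    row = {}
    for window in range(1, max_window + 1):
        summary = window_features(daily_values, reading_day, window)
        for name, value in summary.items():
            row[f"{name}_{window}d"] = value
    return row


# Made-up daily temperatures; the pain reading is on day index 5.
temperature = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
row = features_for_reading(temperature, reading_day=5)
print(row["mean_3d"], row["gradient_3d"])
```

With five summaries over five window lengths, each weather variable contributes 25 columns per reading, which gives some sense of why fitting a model per participant was computationally demanding.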

Promising start

The work by David and his colleagues allowed some of the preliminary results to be released to the public in September. The initial results, which suggest there is an association between the number of sunny days, rainfall and levels of pain, generated media interest and were accompanied by a call for more volunteers to join the Cloudy with a Chance of Pain project, which will run for the rest of the year.

The whole experience gave David and the team a taste of what it would be like working alongside industry.

“Our group was from various backgrounds — statistics, social sciences and computer science — and we were put together as a project team to work around the problem. We had to give regular feedback and updates to the project leaders through conference calls and presentations. This kind of inter-disciplinary collaboration is how the Alan Turing Institute operates and it gave us real experience of what it is like working on a real application, outside a university environment.”

The Alan Turing Institute, the UK’s national institute for data science, was founded in 2015 as a joint venture by five of the UK’s top universities, including Warwick, in partnership with the UK Engineering and Physical Sciences Research Council. David was one of just twelve doctoral students given the opportunity to work at the prestigious new research centre this summer and was thrilled to take part in some of the first academic research conducted at the brand new facility based at the British Library.

“It really was exciting to be down there this summer. There was a real buzz around the place and the general public seemed to know all about the Institute opening.”

The Alan Turing Institute begins its first academic year this month (October 2016) when over 100 research fellows will begin work on a range of different data science projects.

Back to the bibliometrics

Meanwhile, David has returned to his PhD at Warwick where he is looking into bibliometrics — the statistical analysis of written publications, such as books or articles. David is using network theory and statistical modelling to measure the flow of influence in academia, based on how researchers give and receive citations in their published articles. David is supervised by Professor David Firth.

David says: “A nice result would be if we could automate an approximation to the REF (Research Excellence Framework), while quantifying the level of uncertainty we have in our final ranking. In more general terms though, paired comparison models and network analysis are more closely related than we might first realise. A paired comparison model can be used to rank films by showing you two at a time and asking you which you would rather watch. Network analysis can measure the influence of people on Twitter based on their followers. If you wanted to build a system that recommends films to you, or find out who is most important in your social network, you can actually borrow ideas and techniques from both areas.”
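The film example David gives can be sketched with a Bradley-Terry model, one common type of paired comparison model. The Python below is a minimal illustration, not the method used in his research: the film names and win counts are made up, and the fitting routine is a simple iterative update that assumes every film has won and lost at least one comparison.

```python
def bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a matrix of win counts.

    wins[i][j] is the number of times item i was preferred to item j.
    Assumes every item has at least one win and one loss.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing is weighted by how many times it was compared.
            denominator = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denominator)
        total = sum(updated)
        strengths = [s / total for s in updated]  # normalise to sum to 1
    return strengths


# Hypothetical data: each pair of films was shown to viewers four times.
films = ["Film A", "Film B", "Film C"]
wins = [
    [0, 3, 4],  # Film A preferred to B three times, to C four times
    [1, 0, 3],  # Film B preferred to A once, to C three times
    [0, 1, 0],  # Film C preferred to B once
]
strengths = bradley_terry(wins)
ranking = sorted(films, key=lambda f: strengths[films.index(f)], reverse=True)
print(ranking)
```

The same table of wins can be read as a directed network in which each comparison is an edge pointing from the loser to the winner, much as a citation points from the citing article to the cited one. That shared structure is one way to see why the two areas can borrow from each other.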