In front of a massive white board, full of hand-scrawled symbols and equations, sits Professor Graham Cormode. He has recently been awarded the Adams Prize for Mathematics. It’s a prestigious award from the University of Cambridge, given each year to a UK-based mathematician for distinguished research in the mathematical sciences.
It has been won previously by some notable, even household, names: James Clerk Maxwell, for example, the man who formulated the classical theory of electromagnetic radiation in the 1800s, and Professor Stephen Hawking, voted 25th in the BBC’s poll of the 100 Greatest Britons. This year it went to Professor Cormode – but he isn’t a mathematician; he’s a computer scientist.
“It’s a great prize to win,” explains Cormode, Professor of Computer Science at the University of Warwick. “Rather than just giving it to the best mathematician – which is rather difficult to decide – each year they select a topic within the realm of mathematics and reward work in that area. This year the Adams Prize was for the Statistical Analysis of Big Data, and the award was split between me, a computer scientist, and Richard Samworth, a statistician at Cambridge.
“This rather trendy sounding thing – Big Data – is where several different disciplines are meeting at the moment,” continues Cormode. “It’s true in a lot of areas of maths, statistics and computer science that there are certain fundamental questions people are trying to understand, and we look at those questions from the point of view of our own disciplinary background, using the tools that we know. But it turns out that similar tools get invented in different areas. So sometimes you get methods converging, and people do want to think of a problem using the same mathematical representations, even if the language is a bit different.”
It’s OK to cross the streams!
Trained as a computer scientist, with a PhD from Warwick, Professor Cormode has experience working in industry and academia. He says: “I have always seen myself as a computer scientist first and foremost. But in terms of the problems I look at – they are near the border of the three disciplines and I interact with mathematicians and statisticians – we have enough of a common understanding.”
This ‘crossing of the streams’ is not just happening within Professor Cormode’s work but also on campus at the University of Warwick with work ongoing on the Computer Sciences, Statistics and Mathematics departmental buildings which will see them joined physically with a new extension – destined to become an interdisciplinary hub. Professor Cormode is also University Liaison Director for Warwick at the Alan Turing Institute – the UK’s national institute for data science, founded by Warwick and four other institutions and headquartered at the British Library in London.
Big data – it’s huge
Big Data is something relatively new – as a society, we’ve never really had it before.
Professor Cormode continues: “I tend to think about it like this: We spent a lot of the 20th century developing computers and computation, and now for the most part there are computing devices everywhere and in every home. Laptops, smart phones – even dishwashers have a computer in them. Most innovations in a modern car involve computation and data – the science of how to get it to move around is mostly settled.
“So because of that ubiquitous computation you have a much greater volume of data being collected by sensors. The challenge for the 21st century is to say: OK, we’ve got the devices which can collect and store this data, but what shall we do with it? What are the advantages of understanding that data? Will it help us navigate society better and improve the way that we live? And there are lots of different aspects to working that out.
“People are looking at how you build better networks and move data around, some are working on how you build a data centre to analyse all these measurements, and others are looking at how you can improve the mathematical tools and models to understand the data. The goal is to go from lots of bits and bytes to a decision: should we change a policy about how we prescribe a certain drug? How should the National Grid balance energy stored in batteries or behind hydroelectric dams against increasing the output from nuclear power plants?
“Some of these questions are very applied and you need to understand the details of a specific situation, but there are also cross-cutting questions about the general techniques which let you represent data. My work is concerned with how we can reduce a whole lot of data to a small value. It’s all about reducing as we analyse, and I look at how quickly we can reduce the volume of the data without making it less robust.
“If you look at the big tech companies like Google and Facebook – they go out to remote locations with cheap sources of energy and cooling, and build a huge data centre connected to a source of power. These have millions of different computer devices working in parallel – lots of discs and lots of data moving around. I am asking to what extent we can reduce the volume earlier in the process so we don’t need to do this. To what extent can we see data live as it arrives and relate it into some sort of mathematical representation or summary?”
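Summarising data live as it arrives, rather than shipping everything to a data centre, can be made concrete with a classic streaming algorithm: the Misra–Gries heavy-hitters summary. The minimal Python sketch below is illustrative only – the stream and parameters are made up, not drawn from Cormode’s own systems:

```python
def misra_gries(stream, k):
    """Summarise a stream using at most k-1 counters.

    Any item occurring more than len(stream)/k times is guaranteed
    to survive in the summary; surviving counts may undercount the
    true frequency by at most len(stream)/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No room: decrement every counter, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# 88 items, but only 3 counters of state at any moment:
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
summary = misra_gries(stream, k=4)
print(summary)  # the frequent items "a" and "b" survive; rare ones are pruned
```

The point is the memory footprint: the summary never holds more than k-1 counters, no matter how long the stream runs.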
There are limits
Professor Cormode is working on ways to distil data so that it may be analysed and used more quickly and more locally.
He explains: “There are various techniques which we can use. Sometimes simple sampling does come into it – if you can show that the way you are doing the sampling does not lose fidelity, or if you do lose fidelity, that it is at a sufficiently low level that you are still capturing the overall picture.
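One simple sampling technique of this kind is reservoir sampling, which keeps a fixed-size uniform random sample of a stream whose length isn’t known in advance. A minimal Python sketch, with illustrative sizes:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using only O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randint(0, i)         # replace with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# A million sensor readings reduced to a 1,000-item sample:
readings = range(1_000_000)
sample = reservoir_sample(readings, k=1000)
approx_mean = sum(sample) / len(sample)
# approx_mean tracks the true mean (~499999.5) closely
```

Statistics computed on the small sample – means, quantiles, topic frequencies – approximate those of the full stream, which is exactly the “low level of lost fidelity” trade-off described above.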
“There are some fundamental limits to what you can do. If I wanted to know exactly what you tweeted on a Tuesday in April last year, I would have to have the original, specific data. But if I wanted to know which topics are important to you or your peers, we can distil the data to get that kind of information more compactly. So instead of having a massive data set in a data centre, I can explore and analyse my data on something more akin to an ordinary laptop.
“This allows greater efficiency in understanding our world, and being able to do it in different places. At the moment we have to pull all the data together – take it off every sensor across the network, bit by bit, to one place – at great expense. You could ask the question: how much of this could I push back out, so I can do the analysis on the smartphone or remote device and distil it to the small signal I’m looking for, which is much cheaper and faster to gather?”
Are we cheating though?
“In some senses we are changing the question,” says Cormode. “There are some mathematical questions where we understand them well enough to know there are no shortcuts to the exact answer. A lot of my work is saying, what if I can give you an answer that is within one or two per cent of the right answer and which has a mathematical guarantee around it – that is the statistical part of this work. We go from a huge quantity of data, apply an algorithm and use probability – and get an answer where 99.9% accurate is good enough.
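The “mathematical guarantee” Cormode mentions can be illustrated with the Hoeffding bound, which says how many random samples suffice to estimate a proportion to within a given error, at a given confidence level – independent of how large the full data set is. A short Python sketch; the data and parameters here are made up for illustration:

```python
import math
import random

def sample_size_for(epsilon, delta):
    """Hoeffding bound: this many samples estimate a proportion to
    within +/- epsilon, except with failure probability delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# To be within 1% of the true answer, 99.9% of the time:
n = sample_size_for(epsilon=0.01, delta=0.001)
print(n)  # 38005 samples suffice, whether the data set has a million rows or a trillion

# Illustrative check on synthetic data where ~30% of records match:
rng = random.Random(42)
data = [1 if rng.random() < 0.3 else 0 for _ in range(1_000_000)]
estimate = sum(rng.choice(data) for _ in range(n)) / n
# estimate lands within 0.01 of the truth, with probability >= 0.999
```

This is the trade Cormode describes: a fixed, provable error bar in exchange for touching only a tiny fraction of the data.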
“We wouldn’t necessarily use this approach in every situation though. If we were asking what is the danger that a nuclear power plant will go critical, I’d probably want the calculation to be 100 per cent accurate on something like that! But if you are asking, can I understand my customer base and the topics they are concerned about based on what they are tweeting, then yes, this is where approximation can help. There is a lot of noise in that sort of data already, so if you do an accurate thing on the noisy data you end up with an approximate answer anyway. What I’m proposing is you do an approximate thing with the noisy data and get something which is still just as good.”
What can we do with it?
Twitter is already using a nuanced version of Professor Cormode’s work to develop ‘sketches’. He explains: “If you think about the total volume of data on Twitter – it’s big, but not insurmountably big in terms of the number of tweets. But if you want to look at the next level of impact of a tweet – say someone tweets something, it gets retweeted and then perhaps embedded in a news article on a webpage, and you want to track the impact of that tweet as it goes around – then there’s a lot more data. So they use summary techniques – called sketches – to track where a tweet is getting the most attention. That is a situation where you could tolerate a bit of uncertainty – you can get the general picture and it’s much more efficient.
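One of the best-known summaries of this kind is the Count-Min sketch, which Cormode co-developed: a small grid of counters indexed by hash functions, which never undercounts an item’s frequency and overcounts only slightly. A minimal Python illustration – the grid sizes and item names below are made up:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in fixed memory.

    Estimates never undercount; they may overcount by an amount
    that shrinks as the width of the table grows."""

    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hash-derived bucket per row.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum over rows filters out most hash collisions.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for _ in range(10_000):
    cms.add("breaking-news-tweet")
cms.add("obscure-tweet")
print(cms.estimate("breaking-news-tweet"))  # at least 10000, and typically exact
```

The whole table here is 5 × 2048 counters – the same size whether it has absorbed ten thousand events or ten billion, which is what makes it practical for tracking attention across a firehose of retweets.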
“It’s used elsewhere for tracking views in online content – places like Netflix use it to monitor peaks and usage patterns over their network. A lot of the initial applications are a bit more in the consumer internet world, but that is the area where we have the best instrumentation – the web is all about capturing information – it’s already doing it.”
In the near future more and more devices will become connected. The much-hyped “Internet of Things” means household appliances like fridges, dishwashers and washing machines will be on the network. Smart meters are already up and running in many homes.
“These advances are generating larger quantities of data,” continues Cormode. “You can use it at face value – you can take a smart meter, add up all the numbers, and at the end of six months there’s your bill. That’s what it was designed for. But as data scientists we see much greater potential there to learn much more about patterns of life. Can we go beyond the smart meter and look at individual devices keeping track of their own energy usage? Would my laptop be a lot more efficient if it figured out how much battery to charge up?”
It’s hard to predict when these sorts of advances will start to appear, but work is certainly going on around the world identifying where data reduction and summarisation can be used either to make things more efficient or to open up new possibilities. Professor Cormode himself is looking at whether the data generated by modern cars could enable predictive maintenance – could a car figure out that a component is about to fail, so the manufacturer can schedule the fix before it breaks down at the side of the road? In the future, will an individual car be able to interact with its owner to give a customised experience – for example, know how warm you like the air conditioning on a November morning?
The Holy Grail of data work is understanding human behaviour in order to improve quality of life, a question being worked on by many of Professor Cormode’s colleagues at Warwick and at the Turing Institute.
“Many people have been looking to social networks to see if we can use them as a weathervane to understand the opinion of the populace – but it is a lot more difficult than you would think,” says Cormode. “That is partly because the volume or ferocity of online discussion doesn’t correlate with how people make their decisions in real life – for example, at the polling station. The big point is we now have a huge volume of data about all aspects of life and you want to use it to learn about something. Sometimes the data tells you something directly and sometimes it’s informing you much more indirectly. It is a big challenge for statistical modelling to take signals on one subject and translate them to other domains.
“Twitter data can tell us the most popular party on Twitter – but it can’t tell us how the whole population is going to vote. There is clearly an overlap – there is an inference to be made – but it is not straightforward. Neither is it a scientific experiment under controlled conditions in a lab. Twitter gets headlines as social media activity is now part of the national dialogue and so you have social media teams for organisations who try to influence the conversation. You can even pay for robots to automatically tweet things – raw data on retweets and trending topics doesn’t give us a true picture.”
So can you discern the signal from the noise? “It’s a challenge,” says Cormode. “Even more traditional polling methods, which don’t try to guess or discern opinion – they ring people up and ask them who they are going to vote for – don’t elicit true responses. There is still a huge challenge for data science to deliver more accurate interpretations of sentiment and human behaviour.”
Individuals on the internet
Irony, sarcasm, complicated sentences, spelling mistakes and other human factors make sentiment analysis a difficult problem. And now there are other, more sinister aspects of human behaviour appearing on live streaming sites like Facebook Live. Can Big Data technology help us police online content?
Professor Cormode concludes: “There are some real fundamental problems – maybe even impossibilities – in dealing with inappropriate content. There is no simple test which says this is good and this is bad. The scale you are dealing with is massive, you have very noisy data, and the best you could hope for would be to take the social network employees whose job it is to monitor content and make the best use of their attention – direct them to those areas which you know are problematic. But you soon get into subjective areas, with the need to judge whether something is free speech or incitement. This is not a purely technological question. These are very human problems. There is a lot of subjectivity in the way we formulate laws and regulation, where the gold standard in deciding what is right involves the deliberation of judges and juries. We are very far from being happy with allowing an algorithm to make final decisions – even if we increasingly see examples of this happening.
“Behind this is an age old problem. It’s not that more people are saying contentious things, it’s that it is now easier for other people to hear them. We view it as technological problem because it is technology which has enabled the broadcasting of these ideas, but it is the human component that generates the tough questions.”
Graham Cormode was appointed Professor of Computer Science at the University of Warwick in 2013. His interests are in all aspects of the “data lifecycle”, from data collection and cleaning, through mining and analytics, to private data release.