Google tries to know everything. From my browsing history the search engine has (successfully?) classified me as a 35-44 year old male with interests in parenting, East Asian music and dogs, in order to better target its advertisements. Google has also tried to estimate how many of us have flu. In 2008 Google started using some of the five billion web searches it records every day to estimate the prevalence of flu from the number of flu-related searches taking place, in what is called Google Flu Trends.
What Google has tapped into is the increasingly widespread use of syndromic surveillance: the collection and analysis of symptom data as a proxy for confirmed lab or clinical diagnoses, in order to enable earlier detection of infectious disease epidemics. Two other examples of syndromic surveillance are identifying bioterrorist attacks in the US, and monitoring health during the London 2012 Olympics.
Following the September 11th terrorist attack and the subsequent anthrax cases, the US Government increased its use of syndromic surveillance to try to gather (close to) real-time information on outbreaks of infectious diseases, with the aim of identifying a bioterrorist attack. One system it uses takes nightly data from ambulance patient records of syndromes, patient demographics and ZIP codes. This data is analysed to look for clustering, and if any cluster exceeds a pre-defined threshold, an electronic alert is sent out to health care providers.
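To make the idea of threshold-based cluster alerts concrete, here is a minimal sketch in Python. It is not the method used by any real system: the record layout (a syndrome paired with a ZIP code), the function name `find_clusters`, the threshold value and the sample data are all assumptions made up for illustration.

```python
from collections import Counter

def find_clusters(records, threshold):
    """Count cases per (syndrome, ZIP code) and keep groups above the threshold.

    `records` is a hypothetical list of (syndrome, zip_code) pairs; real
    ambulance records would carry many more fields.
    """
    counts = Counter((syndrome, zip_code) for syndrome, zip_code in records)
    return {key: n for key, n in counts.items() if n > threshold}

# Invented example of one night's records.
nightly_records = [
    ("respiratory", "10001"),
    ("respiratory", "10001"),
    ("respiratory", "10001"),
    ("gastrointestinal", "10001"),
    ("respiratory", "10002"),
]

# Assumed threshold: alert when a group has more than 2 cases in one night.
alerts = find_clusters(nightly_records, threshold=2)
for (syndrome, zip_code), n in alerts.items():
    print(f"ALERT: {n} {syndrome} cases in ZIP {zip_code}")
```

A real system would probably look for clustering over neighbouring areas and several days rather than a single ZIP code on a single night, but the basic shape (count, compare with a threshold, alert) is the same.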
During the 2012 Olympics in London the Health Protection Agency (HPA) realised that having 17,000 athletes and officials from all over the world live in close quarters in the Olympic village, along with the many other travellers who were visiting London in this period, was the perfect recipe for an infectious disease outbreak. The HPA brought together data from many different sources including records from emergency departments, GPs and health-care hotlines like NHS Direct in their syndromic surveillance system.
These examples of syndromic surveillance use different and varied data sources. The data footprints that we each leave when we use the internet, buy over-the-counter medicines or interact with health care professionals all contribute to big electronic datasets, increasing the ease and viability of syndromic surveillance. Ideally, the data used should be available in as near real-time as possible, and be generated automatically so as not to impose an additional burden on data providers.
Current syndromic surveillance systems show varying levels of success in predicting the prevalence of a disease and in issuing epidemic alerts. Most analysis and alert systems look for incidence above a pre-defined threshold. This approach is susceptible to false positives: an alert issued simply because noise exceeded the threshold. There is also the question of whether data initially collected for other purposes is suitable. Google Flu Trends uses searches for key words relating to flu to estimate prevalence, and as with its classification of users for advertising purposes, its conclusions are not always correct. Google Flu Trends drastically overestimated peak flu levels of the winter flu season in the United States in 2012. The flu season arrived unusually early in the US that winter, bringing more serious illness and increased media coverage. An increase in flu-related searches by healthy people looking for news, rather than sick people looking for information, could explain the overprediction by Google.
Using big data sets to indirectly infer disease levels is a complex topic, and there is potential for improved analysis of syndromic data in many areas, in particular in investigating different ways to define the baseline that cases must exceed in order for an alert to be issued. As data collection and collation improve, the potential for this surveillance only increases.
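To illustrate what "different ways to define the baseline" might mean, the sketch below compares a fixed threshold with one simple alternative: a moving baseline that raises an alert only when a day's count exceeds the mean of the previous week by more than a few standard deviations. This is one common textbook-style rule, not the approach of any system mentioned above. The function names, the window of 7 days, the multiplier `k=3.0` and the daily counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def fixed_threshold_alerts(counts, threshold):
    """Return the days (0-based) whose count exceeds a fixed threshold."""
    return [day for day, n in enumerate(counts) if n > threshold]

def moving_baseline_alerts(counts, window=7, k=3.0):
    """Return the days whose count exceeds mean + k * stdev of the
    preceding `window` days (window must be at least 2)."""
    alerts = []
    for day in range(window, len(counts)):
        history = counts[day - window:day]
        limit = mean(history) + k * stdev(history)
        if counts[day] > limit:
            alerts.append(day)
    return alerts

# Invented daily case counts: ordinary noise around 11, with a spike on day 8.
daily_counts = [10, 12, 11, 9, 13, 10, 12, 11, 30, 12]

print(fixed_threshold_alerts(daily_counts, threshold=12))  # [4, 8]
print(moving_baseline_alerts(daily_counts))                # [8]
```

With these made-up numbers the fixed threshold also fires on day 4, where the count of 13 is just ordinary noise, while the moving baseline flags only the spike on day 8. The trade-off is that a moving baseline needs a week of history before it can say anything, and a recent spike inflates the baseline for the days that follow.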