Estimating COVID-19 prevalence using web search activity

Web search activity has been used to estimate the prevalence of infectious diseases, such as influenza. However, developing models for a novel disease is a different and perhaps more challenging task.

In our paper "Tracking COVID-19 using online search", published in npj Digital Medicine, we show how Google search data can be used to develop complementary public health surveillance methods for COVID-19. We present results for a multilingual and multicultural selection of 8 countries: United States (US), United Kingdom (UK), Australia, Canada, France, Italy, Greece, and South Africa. Our analysis covers the period from October 2019 to the end of May 2020, i.e. the first wave(s) of the COVID-19 pandemic.

Using the symptom profile of COVID-19, we identify related web searches and compute a COVID-19 score (Figure 1, blue line). We also compare it with a historical average (2011-2018; Figure 1, dashed line). As web searches can be influenced by public interest, which is often reflected in the media coverage of a topic, we also develop a method that attempts to reduce this effect (Figure 1, black line). This latter scoring function provides on average an early warning of 16.7 (10.2-23.2) and 22.1 (17.4-26.9) days compared to confirmed COVID-19 cases and deaths, respectively.

Figure 1. Unsupervised COVID-19 scores based on web search activity.
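
To make the idea of a symptom-based score more concrete, here is a minimal sketch in Python. It assumes that each symptom has one daily search frequency series, that each series is min-max scaled, and that the scaled series are combined using per-symptom weights. The function names, the weights, and the numbers are all hypothetical and chosen for illustration only; the exact scoring function is described in the article.

```python
def min_max(series):
    """Scale a series to the [0, 1] range; a constant series maps to zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]


def symptom_weighted_score(frequencies, weights):
    """Weighted average of min-max scaled search frequencies, one value per day.

    frequencies: dict mapping a symptom to its daily search frequencies
    weights:     dict mapping a symptom to a non-negative weight (assumed)
    """
    scaled = {s: min_max(f) for s, f in frequencies.items()}
    total = sum(weights[s] for s in scaled)
    n_days = len(next(iter(scaled.values())))
    return [
        sum(weights[s] * scaled[s][t] for s in scaled) / total
        for t in range(n_days)
    ]


# Made-up daily search frequencies and weights (not taken from the study).
frequencies = {
    "fever": [1.0, 2.0, 3.0, 5.0],
    "cough": [2.0, 2.0, 4.0, 6.0],
    "loss of smell": [0.0, 0.0, 1.0, 4.0],
}
weights = {"fever": 0.8, "cough": 0.6, "loss of smell": 0.3}

score = symptom_weighted_score(frequencies, weights)
print(score)
```

Dividing by the total weight keeps the score between 0 and 1, which makes it easier to compare against a baseline computed in the same way from earlier years.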

We then transfer a COVID-19 incidence model from one country to another. We first train a regression model for Italy (one of the first major hotspots in Europe) using web searches and confirmed cases, and then transfer it to the other countries (Figure 2), similarly to previous work focusing on influenza-like illness. Because it is based on supervised learning, the transfer learning approach is less affected by media coverage. It corroborates our previous findings from the unsupervised approach, albeit with a further delay of about 5 days.

Figure 2. COVID-19 incidence scores (standardised) based on a transfer learning method. The source model is based on data from Italy.
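
The transfer step can be illustrated with a deliberately simplified sketch. It assumes a single aggregated search feature per country, standardises that feature within each country, fits an ordinary least squares line on the source country, and applies the fitted line to the target country. This is a stand-in for the regression model used in the article, not a reproduction of it; all names and numbers below are hypothetical.

```python
from statistics import mean, pstdev


def standardise(series):
    """Z-score a series within one country; a constant series maps to zeros."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up series: search activity and confirmed cases in the source country,
# and search activity only in the target country.
source_searches = [10.0, 12.0, 20.0, 35.0, 50.0]
source_cases = [0.0, 5.0, 40.0, 120.0, 200.0]
target_searches = [3.0, 3.0, 4.0, 6.0, 9.0]

# Train on the source country, then apply the same model to the target.
slope, intercept = fit_line(standardise(source_searches), source_cases)
target_estimate = [slope * z + intercept for z in standardise(target_searches)]
print(target_estimate)
```

Standardising within each country is what lets the same coefficients be reused across locations with very different search volumes. The output is best read as a relative trend rather than an absolute case count.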

We then conduct a regression analysis, estimating confirmed cases from web searches, to uncover important search terms. To reduce clinical reporting bias, we use a joint data set from 4 English-speaking countries (US, UK, Australia, and Canada). We were among the first to indicate that there is a relationship between clinical COVID-19 indicators and the symptoms of anosmia (loss of the sense of smell), ageusia (loss of the sense of taste), and skin rash.

A limitation of our study is that, in contrast to past efforts, it was hard to evaluate the accuracy of our approach, as clinical indicators were (and are) not necessarily representative of disease prevalence. However, when we compared our COVID-19 scores in England with prevalence estimates obtained from a COVID-19 swabbing scheme (Royal College of General Practitioners), which also included non-COVID-19 cases and could therefore provide a more representative statistic, we found strong correlations (> 0.80). We also assessed the hypothesis that the COVID-19 outbreak in Italy, which was the first major outbreak in Europe, might have Granger-caused an increase in the frequency of search terms elsewhere. We concluded that more than 70% of the search terms we used were not affected by the events in Italy. In addition, (a) we attempted to reduce the influence of news coverage, and (b) Granger-causality might be misleading in this case, because COVID-19 could have emerged at the same time in Italy and other locations (especially in Europe).
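
As an illustration of the comparison with swabbing-based prevalence estimates, the sketch below computes a correlation coefficient between two aligned time series. It assumes Pearson correlation, which may differ from the exact measure reported in the article, and the two series are made-up numbers rather than study data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Hypothetical weekly values: a search-based score and a swabbing-based
# prevalence estimate for the same weeks.
search_score = [0.10, 0.15, 0.30, 0.55, 0.70, 0.60, 0.40]
swab_prevalence = [0.5, 0.8, 1.9, 3.2, 4.1, 3.9, 2.2]

r = pearson(search_score, swab_prevalence)
print(round(r, 2))
```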

Since March 2020, we have been sending our COVID-19 scores to Public Health England (PHE) on a weekly basis. These are included in PHE’s syndromic surveillance reports and have been used as a complementary early-warning resource for epidemiological monitoring and planning.

For more details on our methodology and outcomes, read our open-access article.