From the English sewage to COVID-19 infections using machine learning

During the pandemic, we explored what we could learn from an under-utilised resource: sewage. Using both a forward-modelling and a data-driven approach, this study provides insights into the links between SARS-CoV-2 concentration in wastewater and infections in the population.

Measuring SARS-CoV-2 in wastewater 

Accurate surveillance of the COVID-19 pandemic can be weakened by under-reporting of cases, particularly due to asymptomatic or pre-symptomatic infections. Wastewater monitoring of SARS-CoV-2 has the potential to alleviate these biases, by effectively probing the entire population falling into a catchment area regardless of its disease status or propensity to get tested.  

The underlying principle is fairly simple: (i) fragments of SARS-CoV-2 RNA are shed into wastewater by infected people; (ii) performing quantitative reverse transcription polymerase chain reaction (qRT-PCR) on collected samples allows the concentration of the virus to be quantified; (iii) this concentration can in turn be used to infer the local level or evolution of infections. All else being equal, a higher virus concentration indicates a higher proportion of infected people. In this study we benefitted from measurements made three times a week across 45 urban catchment areas in England over 5 months. We first tested how well we could estimate infection rates from concentrations using a simple process-based model, in which wastewater concentrations at the treatment plant are modelled as a linear function of the estimated amount of virus shed per person and the volume of water per person.
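To make the process-based idea concrete, here is a minimal sketch, assuming the simplest possible form: the concentration is the product of prevalence and the amount of virus shed per infected person, divided by the volume of water per person. The function names and the numerical values are illustrative assumptions, not the model or parameters used in the study.

```python
def expected_concentration(prevalence, shed_gc_per_infected_per_day, water_l_per_person_per_day):
    """Expected RNA concentration (gene copies per litre) at the treatment plant.

    Assumes every infected person sheds the same amount and that dilution is
    fully described by the volume of water per person (both simplifications).
    """
    return prevalence * shed_gc_per_infected_per_day / water_l_per_person_per_day


def estimate_prevalence(concentration_gc_per_l, shed_gc_per_infected_per_day, water_l_per_person_per_day):
    """Invert the forward model: recover prevalence from a measured concentration."""
    return concentration_gc_per_l * water_l_per_person_per_day / shed_gc_per_infected_per_day


# Hypothetical values chosen only for illustration.
SHED = 1e9      # gene copies shed per infected person per day
WATER = 400.0   # litres of wastewater per person per day

concentration = expected_concentration(0.01, SHED, WATER)
recovered_prevalence = estimate_prevalence(concentration, SHED, WATER)
```

In this toy setting the inversion is exact. In practice neither the shedding rate nor the water volume is known precisely, which is where the difficulty lies.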

However, the precise mapping from measured SARS-CoV-2 RNA concentrations in wastewater to infection rates constitutes a challenging inverse problem. The main reason (beyond the uncertainty of the qRT-PCR measurement itself) is that the dilution factors affecting the measured concentration, e.g. rainfall or industrial activity, vary in both time and space.

Figure 1: Diagram of factors influencing the measured concentrations of SARS-CoV-2 RNA in wastewater.
Figure from Wade et al. (2021).

Methods and results 

As part of our collaboration between the UKHSA and academics, we designed a way around these complexities: using measurements of other markers as covariates to disentangle the causes of the observed variability. This is one of the key ingredients we used to build a successful regression model that estimates the prevalence of COVID-19 from the concentration of SARS-CoV-2 RNA supplemented by these covariates. For instance, the concentration of ammonia provides a good proxy for time-varying dilution, since ammonia is shed by everyone regardless of infection status.
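A toy illustration of why ammonia is informative: if rainfall dilutes a sample, it lowers the SARS-CoV-2 RNA and ammonia concentrations by the same factor, so their ratio is unchanged. The sketch below assumes exactly that and uses made-up numbers. In the study ammonia enters the regression model as a covariate, so this explicit ratio is only a simplified stand-in.

```python
def ammonia_normalised(rna_gc_per_l, ammonia_mg_per_l):
    """RNA concentration per unit of ammonia (gene copies per mg of ammonia).

    Assumes dilution scales both quantities by the same factor, so that the
    ratio is insensitive to dilution. Function name and units are illustrative.
    """
    if ammonia_mg_per_l <= 0:
        raise ValueError("ammonia concentration must be positive")
    return rna_gc_per_l / ammonia_mg_per_l


# Same underlying infection level; the rainy-day sample is diluted two-fold.
dry_day = ammonia_normalised(rna_gc_per_l=20000.0, ammonia_mg_per_l=40.0)
rainy_day = ammonia_normalised(rna_gc_per_l=10000.0, ammonia_mg_per_l=20.0)
```

The raw RNA concentrations differ by a factor of two, while the normalised values agree.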

The second key ingredient we used to build our regression model was the unbiased prevalence estimates from the Office for National Statistics (ONS) COVID-19 Infection Survey (CIS). These were paramount for accurately calibrating our predictions at each of the sampled locations. The results are quite compelling: on the test set, our regression model provides prevalence estimates within 1.1% of those from the CIS on average, with 95% confidence. When aggregated at sub-regional level, the time series of prevalence estimated from wastewater (blue) fall within the 95% confidence interval of the CIS prevalence (black), as shown in Figure 2.

Figure 2: Regional 7-day rolling averages (median) of CIS prevalence estimates (black) with 95% credible intervals using Bayesian modelling (grey regions), with corresponding predictions of prevalence from WW data only (blue) with 95% confidence interval from bootstrapping (blue vertical lines), and raw SARS-CoV-2 concentrations (yellow, right axis). The WW prevalence estimates are provided at a sub-regional level and combined to produce regional estimates for comparison.

Furthermore, we use our regression model to perform a lead & lag analysis between wastewater and the two key sources of data from the COVID-19 pandemic in England: the CIS and the NHS Test and Trace (T&T) data. By carefully shifting our input backward and forward in time, we evaluate the model’s performance for various lag values and conclude that wastewater measurements do not appear to be temporally shifted with respect to the CIS data. However, when repeated with the T&T data, this same experiment indicates that wastewater data lead the time series of T&T cases by about 5 days. 
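The shifting procedure can be sketched as follows. Note the substitution: the study re-evaluates the regression model at each lag, whereas this sketch scores each lag with a plain Pearson correlation, which is simpler but follows the same logic. The data are synthetic, with a 5-day delay built in by construction, and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lag_scores(wastewater, cases, max_lag):
    """Correlate wastewater[t] with cases[t + lag] for each lag in [-max_lag, max_lag].

    A positive best lag means the wastewater series leads the case series.
    """
    n = len(wastewater)
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = wastewater[: n - lag], cases[lag:]
        else:
            x, y = wastewater[-lag:], cases[: n + lag]
        scores[lag] = pearson(x, y)
    return scores


def signal(day):
    """Synthetic epidemic curve: a slow oscillation on top of a linear trend."""
    return math.sin(day / 7.0) + 0.01 * day


# Synthetic daily series: cases reproduce the wastewater signal 5 days later.
wastewater = [signal(day) for day in range(100)]
cases = [signal(day - 5) for day in range(100)]

scores = lag_scores(wastewater, cases, max_lag=10)
best_lag = max(scores, key=scores.get)
```

With real data the score curve is much flatter than in this noise-free example, so the location of its maximum carries some uncertainty.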

Various regression models were trialled for this problem, starting from a simple linear regression, but in the end results were presented for the model found to perform best: XGBoost, a highly efficient implementation of gradient-boosted regression trees. It is interesting to see another example where machine learning can help to learn a complex function for which we do not have an analytical form, provided we have enough data. However, understanding wastewater data, its sampling and the causes of variation between sites was key when building our regression model and interpreting its results.
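For readers unfamiliar with gradient boosting, the core idea fits in a few lines: start from the mean, then repeatedly fit a small tree to the current residuals and add a shrunken copy of it to the prediction. The sketch below is not XGBoost and not the study's model. It is a deliberately simplified pure-Python version with one feature, depth-1 trees and squared-error loss, run on synthetic data; every name and value in it is an assumption made for illustration.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split (depth-1 tree) under squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosting(xs, ys, n_rounds=200, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


# Synthetic, non-linear relationship between log10 concentration and prevalence (%).
log_concentration = [3.0 + 0.05 * i for i in range(50)]
prevalence_pct = [0.5 * (x - 3.0) ** 2 for x in log_concentration]

model = fit_boosting(log_concentration, prevalence_pct)
mean_prevalence = sum(prevalence_pct) / len(prevalence_pct)
mse_baseline = sum((y - mean_prevalence) ** 2 for y in prevalence_pct) / len(prevalence_pct)
mse_model = sum(
    (y - predict(model, x)) ** 2 for x, y in zip(log_concentration, prevalence_pct)
) / len(prevalence_pct)
```

Comparing `mse_model` with `mse_baseline` shows how much of the non-linear relationship the boosted stumps capture on the training data. A real analysis would evaluate on held-out data, as the study does with its test set, and would use several covariates rather than one.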

Beyond this work 

When combined with genomic sequencing, wastewater data can also be used to track the spread of SARS-CoV-2 variants. More generally, beyond the COVID-19 pandemic, wastewater monitoring has various beneficial applications in public health: the same methodology can be reused to monitor other pathogens. Because the method has spread quickly since it resurfaced during the COVID-19 pandemic, increasing amounts of wastewater and environmental data are likely to be collected. Making these data publicly available, with agreed standards, can only help the field to maximise the scientific understanding and the benefits it provides to the public in return.

Behind this work 

It is worth saying a few words about the context in which the study was carried out. The science and research team of the UK Environmental Monitoring for Health Protection programme was spread across the country and came from diverse fields: epidemiology, water engineering, mathematics, computer science, data science, neuroscience and even astrophysics. Most of us had to put our usual day-to-day activities on hold to carry out this work. A critical element for working effectively as a remote and diverse team was a good dose of humour - fortunately, the sewage-based nature of the work provided plenty of material to work with! 


  1. Wade, Matthew, Anna Lo Jacomo, Elena Armenise, Mathew Brown, Joshua Bunce, Graeme Cameron, Zhou Fang, et al. 2021. ‘Understanding and Managing Uncertainty and Variability for Wastewater Monitoring beyond the Pandemic: Lessons Learned from the United Kingdom National COVID-19 Surveillance Programmes’. Earth and Space Science Open Archive, July.