A collaborative approach that engages the community to develop digital biomarkers
Despite substantial efforts to develop digital biomarkers, benchmarking and validating these algorithms remains difficult. Open, crowd-sourced efforts such as DREAM Challenges are one approach to speeding the process.
Every smartphone, as well as the increasingly popular smartwatches and fitness trackers (e.g. Fitbit), comes embedded with a myriad of sensors. As of 2019, approximately 96% of Americans owned a cellphone, and 81% owned a smartphone1. The use of smartphones is growing rapidly, even among older people2. Given the wide availability of these devices, there is broad interest in the medical community in how they can be used to help people monitor health beyond activity tracking.
In the case of movement disorders and other diseases with strong motor involvement, the motion sensors in these relatively inexpensive devices can help monitor symptom severity in clinical trials and patient care, and can capture real-world data on the lived experience of disease. However, very few devices have transitioned to active use in this way.
Challenges to developing digital biomarkers
One of the primary challenges to effectively using these types of data is interpreting and distilling them into “digital biomarkers”, or measures of disease. Many efforts to solve this problem involve small validation studies and computational approaches carried out by a small number of individuals. But developing accurate, robust models can require enormous amounts of data. As we note in our paper, sophisticated deep learning methods can be highly accurate for interpreting sensor data from phones and wearables, but these types of models require vast amounts of data to train.
Deep learning models, which use neural networks inspired by the way brains work, were overwhelmingly more accurate than signal processing-based approaches when we had massive amounts of data (in this case, ~40,000 sets of sensor recordings). But they performed similarly to signal processing methods when the data set was an order of magnitude smaller. Of course, data sets of sufficient size are incredibly rare and prohibitively expensive for most groups to collect. Democratizing data access can facilitate better biomarker development.
Access to data is not the only barrier to the development of quality digital biomarkers. In a publish-or-perish environment, researchers’ self-interest can lead them to subtle tactics, such as the selective choice of data and metrics, that improve the perceived accuracy of their model relative to competing methods. Unfortunately, the peer-review process is often not sufficient to detect these subtle manipulations, which makes it difficult to understand the relative performance of competing methods in a truly unbiased fashion. For a more comprehensive discussion of this issue, I suggest Norel, Rice and Stolovitzky’s paper “The self-assessment trap: can we all be better than average?”3
In addition to these conscious manipulations, eager modelers can make inadvertent mistakes that inflate the perceived accuracy of their models. For example, when repeated measures from the same individuals are present in the data, splitting the data without regard to the identity of the individual from which each measurement was derived can inflate accuracy whenever the individual signature is stronger than the disease signature4. In this case, the models learn to detect subject identity rather than disease, which becomes problematic when measures from the same individual appear in both the training and the test set. Digital health is rife with examples where individual signatures are particularly distinct, including the one we explore in our paper: gait in Parkinson’s disease.
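The safeguard is to split at the subject level rather than the record level. Here is a minimal Python sketch of that idea, using hypothetical data (this is an illustration, not the code from our paper): every record from a given individual lands on one side of the train/test divide, so a model cannot score well simply by recognizing people it has already seen.

```python
import random

def subject_wise_split(records, test_fraction=0.3, seed=0):
    """Split (subject_id, features, label) records so that no subject
    contributes data to both the training and the test set."""
    subjects = sorted({sid for sid, _, _ in records})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Hold out a fraction of *subjects*, not a fraction of records.
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r[0] not in test_subjects]
    test = [r for r in records if r[0] in test_subjects]
    return train, test

# Toy example: five repeated walk recordings from each of four subjects.
records = [(sid, [0.1 * rep], sid % 2) for sid in range(4) for rep in range(5)]
train, test = subject_wise_split(records)
```

A record-wise split of the same data would scatter each subject's five recordings across both sets, which is exactly the leak described above.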
Identity confounding is not the only type of confounding that can artificially inflate model accuracy. Age, gender and other factors may be strongly associated with gait properties, as well as with a variety of other digital measurements. When these confounders are not properly balanced in a machine learning data set, they can contribute to predictive models in ways that do not reflect the disease or trait being predicted. While a savvy reviewer can theoretically catch these types of issues, we commonly see these mistakes in the mobile health literature, which suggests the field is generally not yet educated on this topic.
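A basic sanity check before modeling is to ask whether a candidate confounder differs systematically between the groups being predicted. The sketch below (illustrative only; a real analysis would use a proper statistical workflow) measures the gap in mean age between cases and controls and uses a simple permutation test to judge whether that gap could plausibly arise by chance:

```python
import random

def mean_gap(values, labels):
    """Difference in mean confounder value (e.g. age) between label groups."""
    g1 = [v for v, y in zip(values, labels) if y == 1]
    g0 = [v for v, y in zip(values, labels) if y == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """How often does a random relabeling of subjects produce a group
    gap at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(mean_gap(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(mean_gap(values, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

# Toy example: cases (label 1) are systematically older than controls.
ages = [72, 70, 68, 75, 55, 52, 58, 54]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
```

A large, non-random gap like this one warns that a model could achieve apparent accuracy by predicting age rather than disease.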
An open science approach to crowdsource solutions
So what is the best way to compare models while protecting against self-evaluation bias and inflated accuracy estimates? One approach is to compare methods in the context of a data analysis challenge, in which evaluation is performed by a neutral third party using data that are unavailable to the teams building the models. Typically, participants are provided with a training dataset, an objective, and a predefined metric (or set of metrics) used to evaluate model performance. The models are then evaluated on a held-out test set that is not available to participants. This ensures an apples-to-apples comparison of models under predefined rules of engagement, avoiding bias in the evaluations. The organizers can also take care to avoid data artifacts, such as the identity confounding noted above, that inflate accuracy estimates.
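The mechanics of such a neutral evaluation are simple. The hypothetical sketch below (names and metric are illustrative; this is not the actual challenge infrastructure) shows the core idea: the organizers keep the test labels, teams submit only predictions, and every submission is scored with the same predefined metric:

```python
def score_submission(predictions, held_out_labels):
    """Score a participant's predictions against held-out labels that
    were never distributed to participants. Accuracy stands in here for
    whatever metric the organizers predefined."""
    missing = set(held_out_labels) - set(predictions)
    if missing:
        raise ValueError(f"submission missing predictions for {sorted(missing)}")
    correct = sum(predictions[k] == held_out_labels[k] for k in held_out_labels)
    return correct / len(held_out_labels)

# Toy example: organizers hold the labels; a team submits only predictions.
held_out = {"rec_1": 1, "rec_2": 0, "rec_3": 1, "rec_4": 0}
team_a = {"rec_1": 1, "rec_2": 0, "rec_3": 0, "rec_4": 0}
score = score_submission(team_a, held_out)
```

Because no team ever sees `held_out`, no team can tune its model to the test set, and because the metric is fixed in advance, no team can choose the metric that flatters its method.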
Data analysis challenges are growing in popularity both in the commercial space (via platforms like Kaggle and InnoCentive) and in the research space, where more than a dozen different organizations run biomedical challenges5. These include the DREAM Challenges, the organization under which we ran the Parkinson’s Disease Digital Biomarker (PDDB) DREAM Challenge6. DREAM launched in 2006 and has organized a variety of challenges, primarily in genomics and computational biology, and is committed to using “open science” practices to solve biomedical problems. Specifically, challenge solutions are required to include documentation and code or a Dockerized implementation, which are made publicly available upon the close of the challenge.
Open science speeds innovation by reducing duplication of effort and allowing researchers to build on one another’s findings. It contributes to the democratization of research by increasing the likelihood that smaller research groups, and those operating in resource-limited environments, have access to high-quality algorithms, not just the larger and better-funded ones. In a space where productization is common, it is important to note that open science challenges do not preclude it: participants retain full ownership of their inventions. Commercial entities commonly participate in DREAM challenges, and we have seen companies launched based on solutions generated during the course of challenges.
While data analysis challenges represent a relatively new contribution to the digital health space, we hope the community will recognize the benefits of the approach for solving problems in the field.
1. https://www.pewresearch.org/internet/fact-sheet/mobile/ Accessed March 17, 2021.
3. Norel, R., Rice, J.J. & Stolovitzky, G. The self-assessment trap: can we all be better than average? Mol. Syst. Biol. 7, 537 (2011). https://doi.org/10.1038/msb.2011.70
4. Chaibub Neto, E., Pratap, A., Perumal, T.M. et al. Detecting the impact of subject characteristics on machine learning-based diagnostic applications. npj Digit. Med. 2, 99 (2019). https://doi.org/10.1038/s41746-019-0178-x
5. Saez-Rodriguez, J. et al. Crowdsourcing biomedical research: leveraging communities as innovation engines. Nat. Rev. Genet. 17, 470–486 (2016). https://doi.org/10.1038/nrg.2016.69
6. Sieberts, S.K., Schaff, J., Duda, M. et al. Crowdsourcing digital health measures to predict Parkinson’s disease severity: the Parkinson’s Disease Digital Biomarker DREAM Challenge. npj Digit. Med. 4, 53 (2021). https://doi.org/10.1038/s41746-021-00414-7
Schaffter, T., Buist, D.S.M., Lee, C.I. et al. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw. Open 3(3), e200265 (2020). https://doi.org/10.1001/jamanetworkopen.2020.0265
Bergquist, T. et al. Evaluation of crowdsourced mortality prediction models as a framework for assessing AI in medicine. Preprint at https://www.medrxiv.org/content/10.1101/2021.01.18.21250072v1