Identification of Robust Deep Neural Network Models of Longitudinal Clinical Measurements

We systematically compared nine state-of-the-art deep learning approaches using simulated longitudinal body mass index, glucose, and systolic blood pressure measurements to objectively evaluate model performance.
Every year, large volumes of longitudinal data are generated during patient interactions with the healthcare system, much of which is recorded in electronic health record (EHR) systems. The large amount of data available in EHRs can be used to create prediction models [1] that identify individuals at risk of developing specific diseases or, among those who already have a disease, individuals likely to benefit from or be harmed by particular interventions [2]. However, most predictive models use only snapshots of recent data and do not take full advantage of the longitudinal data available in EHRs [2]. Previous studies have demonstrated that longitudinal measures can improve the predictive ability of models for precision medicine. For example, a recent study by Yang et al. demonstrated that variability in longitudinal blood glucose measurements can supplement traditional cross-sectional predictors of risk for microvascular complications in patients with type 2 diabetes [3].

Deep learning is a subfield of machine learning and artificial intelligence that has demonstrated its capabilities for robust and accurate predictions from big data in many disciplines. In recent years, there has been an explosion of new deep learning approaches, including many that utilize longitudinal or time series data for prediction [4]. Although some of these approaches have been applied to EHR data, it has been unclear which methods produce the most robust predictions and are best at handling the challenges routinely present in clinical data, such as missingness and irregular observation times. Identifying robust model architectures that are specific to EHR data may expedite and improve future development of clinical prediction models. Objective comparison of model performance is, however, challenging, as true membership of a clinical class (e.g., disease/no-disease) is typically not known. For example, when a model predicts a patient's status to be positive but the patient's record indicates they are negative (i.e., an apparent false positive), it may be that the patient's condition is undiagnosed or is soon to be diagnosed. Simulated data enable objective comparisons of model performance because true class membership is specified. They also allow model performance to be evaluated across different conditions of the data, such as the amount of signal, class imbalance, noise, and data missingness.
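To make the idea concrete, here is a minimal sketch, not the procedure used in the study, of how a simulation can fix true class membership while varying signal, class imbalance, noise, and missingness. The function name `simulate_cohort`, all of its parameters, and the BMI-like baseline values are illustrative assumptions, and the data here are fully synthetic rather than derived from real patient records.

```python
import random


def simulate_cohort(n_patients=200, n_visits=12, case_fraction=0.2,
                    signal=1.5, noise_sd=1.0, missing_rate=0.3, seed=0):
    """Simulate labelled trajectories (hypothetical example, not the study's code).

    Returns (trajectories, labels). Each trajectory is a list of length
    n_visits whose entries are floats, or None where a visit is missing.
    """
    rng = random.Random(seed)
    trajectories, labels = [], []
    for _ in range(n_patients):
        # Class imbalance: each patient is a case with probability case_fraction.
        is_case = rng.random() < case_fraction
        # Assumed BMI-like starting value; the numbers are illustrative only.
        baseline = rng.gauss(25.0, 3.0)
        # Signal: cases rise by `signal` units in total by the last visit.
        slope = signal / max(n_visits - 1, 1) if is_case else 0.0
        series = []
        for t in range(n_visits):
            value = baseline + slope * t + rng.gauss(0.0, noise_sd)
            # Missingness: drop each visit independently with missing_rate.
            series.append(None if rng.random() < missing_rate else value)
        trajectories.append(series)
        labels.append(int(is_case))
    return trajectories, labels


if __name__ == "__main__":
    X, y = simulate_cohort(seed=1)
    print(len(X), "patients,", sum(y), "true cases")
```

Because the labels are generated rather than read from a record, any classifier's errors on such data are unambiguous, and each knob (for example `missing_rate`) can be changed in isolation to see how performance degrades.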

Developing simulated data that adequately represent real-world longitudinal data is challenging. Although simulated data have been used to evaluate time series-based deep learning methods, many of these simulated datasets were designed for audio and video signal processing [5,6] and are unlikely to represent the specific characteristics seen in patient data, such as measurements from clinical labs. We took a semi-synthetic approach in which we started with real patient data and then modified the signal, noise, and other parameters while maintaining the correlation structure observed in real data [7]. In our study, we simulated data based on real body mass index (BMI), glucose, and systolic blood pressure (SBP) trajectories to benchmark how well deep learning approaches can classify patients based on differences in trajectory magnitudes and shapes. We then evaluated the performance of nine state-of-the-art deep learning approaches for time series classification to identify the most robustly predictive models for these types of data. Notably, two model architectures, time series forest convolutional neural networks (TSF-CNN) and Gramian angular field convolutional neural networks (GAF-CNN), emerged as the most robustly predictive under a wide variety of simulated conditions, such as class imbalance, irregular sampling, and data missingness.
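For readers unfamiliar with Gramian angular fields, the sketch below shows the standard summation-field encoding that turns a one-dimensional series into an image-like matrix a CNN can consume. It is an illustration under stated assumptions: the post does not say which GAF variant or preprocessing the study used, the function name `gramian_angular_field` is hypothetical, and the input is assumed to be a complete series with no missing values.

```python
import math


def gramian_angular_field(series):
    """Gramian angular summation field of a list of floats (illustrative).

    Steps: rescale values to [-1, 1], map each value x to an angle
    phi = arccos(x), then fill the matrix with cos(phi_i + phi_j).
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries no shape information; map it to 0.
        scaled = [0.0] * len(series)
    else:
        scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]
    # Clamp to guard against floating-point values just outside [-1, 1].
    scaled = [max(-1.0, min(1.0, x)) for x in scaled]
    phi = [math.acos(x) for x in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


if __name__ == "__main__":
    # Three evenly spaced, made-up BMI-like values.
    for row in gramian_angular_field([20.0, 22.0, 24.0]):
        print([round(v, 3) for v in row])
```

The resulting matrix is symmetric and has one row and column per time point, so temporal relationships between every pair of measurements are exposed to the convolutional layers.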

The promise of the TSF-CNN architecture was then further demonstrated by applying it to pediatric (ages 2-18 years) BMI trajectories to predict the onset of pediatric type 2 diabetes (T2D). Because patients have varying amounts of data available (some may have BMI measurements from the age of 2 to 10 years, whereas others may only have data from ages 4 to 6 years), we developed different TSF-CNN models for different age ranges and used them to characterize i) the added value of having additional longitudinal data and ii) the model performance under common real-world conditions with missing age ranges. Notably, based solely on BMI trajectories, the TSF-CNN model was able to predict the risk of developing pediatric T2D with reasonable accuracy (area under the receiver operating characteristic curve, AUC, of up to 0.72 using an age range of 3-12 years). We showed that models incorporating BMIs at later age ranges, and with larger cohort sizes, had higher predictive accuracies. This novel simulation and application study supports the use of longitudinal clinical measurements in the development of predictive models, and provides guidance on which deep learning approaches are best suited for analyses of such measurements.
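As a rough illustration of the two ingredients mentioned above, the sketch below restricts a patient's measurements to an age window and computes an AUC from model scores. Both helper names (`restrict_to_age_range`, `auc`) and all numbers are hypothetical; this is not the study's pipeline, only the textbook pairwise definition of AUC as the probability that a randomly chosen case is scored above a randomly chosen non-case.

```python
def restrict_to_age_range(measurements, min_age, max_age):
    """Keep (age_years, bmi) pairs with min_age <= age <= max_age."""
    return [(age, bmi) for age, bmi in measurements if min_age <= age <= max_age]


def auc(scores, labels):
    """AUC by pairwise comparison; ties between a case and a non-case count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up measurements for one patient: (age in years, BMI).
    visits = [(2.0, 16.5), (5.5, 15.8), (13.0, 22.1)]
    print(restrict_to_age_range(visits, 3, 12))
    # Made-up model scores and true labels for four patients.
    print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported value of up to 0.72 should be read.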


This work was supported in part by the National Institutes of Health grant R61 NS113258 (D.M.R.).

Image Credits: We would like to thank Mahnaz Rastgoumoghaddam for graphical illustrations of the poster picture. Some of the graphical icons in the image are under free license from


  1. Moons, K. G. M., Altman, D. G., Vergouwe, Y. & Royston, P. Prognosis and prognostic research: application and impact of prognostic models in clinical practice. BMJ 338, 1487–1490 (2009).
  2. Goldstein, B. A., Navar, A. M., Pencina, M. J. & Ioannidis, J. P. A. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J. Am. Med. Informatics Assoc. 24, 198–208 (2017).
  3. Yang, C.-Y., Su, P.-F., Hung, J.-Y., Ou, H.-T. & Kuo, S. Comparative predictive ability of visit-to-visit HbA1c variability measures for microvascular disease risk in type 2 diabetes. Cardiovasc. Diabetol. 19, 1–10 (2020).
  4. Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L. & Muller, P. A. Deep learning for time series classification: a review. Data Min. Knowl. Discov. 33, 917–963 (2019).
  5. Bianco, M. J. et al. Machine learning in acoustics: Theory and applications. J. Acoust. Soc. Am. 146, 3590 (2019).
  6. Weisberg, K., Gannot, S. & Schwartz, O. An Online Multiple-speaker DOA Tracking Using the Cappé-Moulines Recursive Expectation-maximization Algorithm. ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc. 2019-May, 656–660 (2019).
  7. Mathur, R., Rotroff, D., Ma, J., Shojaie, A. & Motsinger-Reif, A. Gene set analysis methods: A systematic comparison. BioData Min. 11, 1–19 (2018).
