Making AI Algorithms Safer


A myriad of publications in the scientific and lay literature can now be found under the heading of ‘machine learning’ (ML) or ‘artificial intelligence’ (AI) in healthcare. These ML and AI techniques mine massive clinical databases to find higher-order correlations and relationships, with the goal of prediction or prognostication. In recent years, a number of commentaries and publications have highlighted the huge discrepancy between the volume of AI/ML publications and the relative scarcity of successful implementation studies [1-3]. The field of clinical AI/ML is in desperate need of rigorous clinical evidence from multicenter randomized clinical trials. Among the barriers to executing such trials are the difficulty of integrating with different electronic health records (EHRs) and of demonstrating generalizability. Factors contributing to this problem include:

1) Precise mapping of the features/input elements of an algorithm across different EHR vendors is challenging, and even for a given EHR vendor, different system builds/customizations complicate standardization (see more on syntactic and semantic interoperability [9]).
2) Clinical constructs and inclusion/exclusion criteria used to establish the gold-standard diagnosis/outcomes are often inconsistently implemented across sites (i.e., label noise).
3) The frequency of measurement of clinical variables (e.g., labs) is often healthcare system-specific and tied to factors such as severity of illness, workflow design, staffing levels, and utilization of point-of-care technologies.
4) The distribution of patient characteristics (such as demographics and care level/care unit type) is widely variable, and generalizability has to be assessed across a geographically diverse patient population.
5) Data recorded in EHRs (such as those from monitors, ventilators, IV pumps, etc.) are often biased by vendor-specific data downsampling methods and human verification; as a consequence, real-time data (e.g., via direct HL7 feeds from devices) and retrospective/archived data may not exactly match.
6) Temporal data drifts may occur due to a number of factors, including changes in processes of care or the introduction of new measurement devices (e.g., point-of-care lactate measurement).
7) Implementation of AI/ML algorithms can induce changes in clinical workflow and practice patterns that can alter the distribution of data.

Therefore, successful real-time implementation of AI/ML models developed using retrospective data requires continuous monitoring (see phase 4 in Figure 1) and the establishment of effective algorithm change protocols (ACPs).

Figure 1. Developmental life cycle of AI/ML models (credit: Borrowed with simplifications from Michael Matheny et al. [4])

Recent literature has emphasized the importance of including ‘model facts labels’ with machine learning models. These labels provide information such as the model name, locale, and version; a summary of the model; the mechanism of risk score calculation; validation and performance; uses and directions; warnings; and other information. A model facts label defines factual information about a model and best practice guidelines. However, it does not include any mechanism to determine, at the level of individual patient measurements, when an algorithm can be used (i.e., ‘conditions for use’). As an analogy, a model facts label may state that a given appliance requires a 100V AC power outlet. A complementary approach is to design a ‘circuit breaker’ that disconnects the power to the appliance when it is connected to a 220V socket (i.e., incorrect power supply or incorrect data). Therefore, there is a need for continuous monitoring solutions for AI/ML-based clinical decision support tools that can flag improperly behaving models [10].
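To make the ‘circuit breaker’ analogy concrete, here is a minimal Python sketch of a guard that withholds a prediction when an input falls outside declared conditions for use. It is a deliberately simple range check, not the statistical mechanism used by COMPOSER; the feature names, the bounds, and the functions `violations` and `guarded_predict` are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical 'conditions for use': a permitted range per input feature.
# The bounds are illustrative assumptions, not clinical guidance.
CONDITIONS_FOR_USE: Dict[str, Tuple[float, float]] = {
    "heart_rate_bpm": (20.0, 300.0),
    "temperature_c": (25.0, 45.0),
    "lactate_mmol_l": (0.0, 30.0),
}


def violations(inputs: Dict[str, Optional[float]]) -> List[str]:
    """List every feature that is missing or outside its permitted range."""
    problems = []
    for name, (low, high) in CONDITIONS_FOR_USE.items():
        value = inputs.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


def guarded_predict(
    model: Callable[[Dict[str, Optional[float]]], float],
    inputs: Dict[str, Optional[float]],
) -> Optional[float]:
    """Circuit breaker: return None (indeterminate) instead of a risk score
    whenever any condition for use is violated."""
    if violations(inputs):
        return None
    return model(inputs)
```

For example, a temperature of 98.6 arriving in a field that expects degrees Celsius (a unit-mapping error of the kind that can occur across EHR builds) would trip the breaker, and no risk score would be shown.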

In recent years, a number of frameworks for monitoring AI/ML algorithms in healthcare have been proposed, including 1) the use of governance panels that can identify and report changes to data collection practices, clinical guidelines, and coding schemes [5], and 2) algorithm performance monitoring via periodic model evaluation (e.g., AUC, PPV, or calibration) [6-8]. The former approach requires significant human resources (especially as the number of deployed AI/ML models increases) and may not be effective at identifying nuanced changes in data distribution that are indiscernible to the human eye, while the latter requires access to gold-standard labels for model performance evaluation, which are often expensive and labor-intensive to obtain.
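As an illustration of what periodic model evaluation involves, the following Python sketch computes AUC and PPV for one evaluation window. It assumes that gold-standard binary labels are already available for that window, which is exactly the costly requirement noted above. The function names and the toy data are hypothetical.

```python
from typing import Optional, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive case scores higher
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ppv(labels: Sequence[int], scores: Sequence[float],
        threshold: float) -> Optional[float]:
    """Positive predictive value among cases flagged at the given threshold.
    Returns None when no case is flagged."""
    flagged = [y for y, s in zip(labels, scores) if s >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


# Toy evaluation window: four patients with adjudicated labels.
window_labels = [0, 0, 1, 1]
window_scores = [0.1, 0.4, 0.35, 0.8]
window_auc = auc(window_labels, window_scores)       # 3 of 4 pairs ordered correctly
window_ppv = ppv(window_labels, window_scores, 0.5)  # 1 of 1 flagged case is positive
```

Tracking these values across successive windows would reveal performance degradation, but only after the labels for each window have been obtained.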

Here, we present COMPOSER (COnformal Multidimensional Prediction Of SEpsis Risk), a deep learning model for the early prediction of sepsis. COMPOSER provides a built-in statistical mechanism for monitoring input data for quality assessment and potential data distribution shift. This is achieved by flagging outlier inputs that do not satisfy the ‘conditions for use’ of the algorithm; these inputs are subsequently assigned to an indeterminate predicted label class. Notably, COMPOSER performs outlier detection in an unsupervised manner (i.e., no need for gold-standard labels). Moreover, the rejection statistics (over predefined weekly or monthly intervals) can be used to trigger an ACP [11], which follows the developer’s detailed plan to update the model in a safe and effective manner. This approach might provide an alternative to deploying continuously learning AI/ML systems that may have unintended consequences (e.g., ‘echo chambers’ [11-14] or emergence of unintended positive feedback loops).
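The general idea of label-free, conformal-style outlier rejection can be sketched in a few lines of Python. This is not COMPOSER's implementation: the nonconformity score used here (Euclidean distance to the mean of the training inputs), the class `ConformalGate`, the significance level, and the 20% rejection-rate limit are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ConformalGate:
    """Flags inputs that look unlike the data the model was developed on.
    No outcome labels are needed, only input features."""

    def __init__(self, train: List[Sequence[float]],
                 calibration: List[Sequence[float]], epsilon: float = 0.05):
        n, dim = len(train), len(train[0])
        # Stand-in nonconformity measure: distance to the training mean.
        self.center = [sum(row[j] for row in train) / n for j in range(dim)]
        # Scores of held-out calibration inputs define what counts as typical.
        self.cal_scores = [distance(row, self.center) for row in calibration]
        self.epsilon = epsilon

    def p_value(self, x: Sequence[float]) -> float:
        """Conformal p-value: share of calibration scores at least as extreme
        as the score of x, with the usual +1 correction."""
        score = distance(x, self.center)
        at_least = sum(1 for s in self.cal_scores if s >= score)
        return (at_least + 1) / (len(self.cal_scores) + 1)

    def accepts(self, x: Sequence[float]) -> bool:
        """False means the input is an outlier and the prediction should be
        reported as indeterminate."""
        return self.p_value(x) > self.epsilon


def rejection_rate(gate: ConformalGate, batch: List[Sequence[float]]) -> float:
    """Fraction of inputs rejected over one monitoring interval."""
    if not batch:
        raise ValueError("batch must not be empty")
    return sum(1 for x in batch if not gate.accepts(x)) / len(batch)


def should_trigger_acp(rate: float, limit: float = 0.2) -> bool:
    """Hypothetical rule: start the algorithm change protocol when the
    rejection rate for an interval exceeds the limit."""
    return rate > limit
```

In use, the gate would be fitted once on development data, `rejection_rate` would be computed over each weekly or monthly batch of live inputs, and a sustained rise would prompt review under the ACP.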

Ongoing work on algorithm monitoring and AI/ML safety is very encouraging [15] and is likely to bring about safer clinical decision support tools in the service of improving the quality of care and outcomes for patients.


[1] Seneviratne, M.G., Shah, N.H. and Chu, L., 2020. Bridging the implementation gap of machine learning in healthcare. BMJ Innovations, 6(2).

[2] Li, R.C., Asch, S.M. and Shah, N.H., 2020. Developing a delivery science for artificial intelligence in healthcare. NPJ digital medicine, 3(1), pp.1-3.

[3] Fleuren, L.M., Klausch, T.L., Zwager, C.L., Schoonmade, L.J., Guo, T., Roggeveen, L.F., Swart, E.L., Girbes, A.R., Thoral, P., Ercole, A. and Hoogendoorn, M., 2020. Machine learning for the prediction of sepsis: a systematic review and meta-analysis of diagnostic test accuracy. Intensive care medicine, 46(3), pp.383-400.

[4] Matheny, M., Israni, S.T., Ahmed, M. and Whicher, D., 2019. Artificial intelligence in health care: the hope, the hype, the promise, the peril. NAM Special Publication. Washington, DC: National Academy of Medicine, p.154.

[5] Finlayson, S.G., Subbaswamy, A., Singh, K., Bowers, J., Kupke, A., Zittrain, J., Kohane, I.S. and Saria, S., 2020. The clinician and dataset shift in artificial intelligence. The New England Journal of Medicine, pp.283-286.

[6] Hannan, E.L., Cozzens, K., King, S.B., Walford, G. and Shah, N.R., 2012. The New York State cardiac registries: history, contributions, limitations, and lessons for future efforts to assess and publicly report healthcare outcomes. Journal of the American College of Cardiology, 59(25), pp.2309-2316.

[7] Siregar, S., Nieboer, D., Vergouwe, Y., Versteegh, M.I., Noyez, L., Vonk, A.B., Steyerberg, E.W. and Takkenberg, J.J., 2016. Improved prediction by dynamic modeling: an exploratory study in the Adult Cardiac Surgery database of the Netherlands Association for Cardio-Thoracic Surgery. Circulation: Cardiovascular Quality and Outcomes, 9(2), pp.171-181.

[8] Jin, R., Furnary, A.P., Fine, S.C., Blackstone, E.H. and Grunkemeier, G.L., 2010. Using Society of Thoracic Surgeons risk models for risk-adjusting cardiac surgery results. The Annals of thoracic surgery, 89(3), pp.677-682.

[9] URL:

[10] Eaneff, S., Obermeyer, Z. and Butte, A.J., 2020. The case for algorithmic stewardship for artificial intelligence and machine learning technologies. JAMA, 324(14), pp.1397-1398.

[11] US FDA artificial intelligence and machine learning discussion paper. Technical report, April 2019

[12] Lenert, M.C., Matheny, M.E. and Walsh, C.G., 2019. Prognostic models will be victims of their own success, unless…. Journal of the American Medical Informatics Association, 26(12), pp.1645-1650.

[13] Jiang, R., Chiappa, S., Lattimore, T., György, A. and Kohli, P., 2019, January. Degenerate feedback loops in recommender systems. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 383-390).

[14] Petersen, C., Smith, J., Freimuth, R.R., Goodman, K.W., Jackson, G.P., Kannry, J., Liu, H., Madhavan, S., Sittig, D.F. and Wright, A., 2021. Recommendations for the safe, effective use of adaptive CDS in the US healthcare system: an AMIA position paper. Journal of the American Medical Informatics Association, 28(4), pp.677-684.

[15] Feng, J., 2020. Learning how to approve updates to machine learning algorithms in non-stationary settings. arXiv preprint arXiv:2012.07278.

Supreeth Shashikumar

Research Scientist, UCSD