Beyond predictive metrics: simulating the impact of healthcare machine learning models  



Machine learning is now widely applied in healthcare and clinical medicine. It is being used to predict which patients are at risk of acute kidney injury [1], which patients are likely to experience septic shock in intensive care units [2], and which patients are at risk of post-operative mortality before they undergo surgery [3]. Using data from modern electronic health record systems together with modern machine learning algorithms, it is possible to achieve strong predictive performance, as measured by the area under the ROC curve (AUC) and other popular machine learning metrics.

But what do metrics such as AUC really tell us? A hospital that implements a machine learning model is ultimately interested in improving patient outcomes or possibly reducing costs. What magnitude of improvement can one expect from, say, a model with an AUC of 0.8 compared to a model with an AUC of 0.7? Will there even be an improvement at all? These questions become harder still when the resources needed to act on a prediction are limited and must be rationed.


Our framework and methodology

In our paper, we developed a method for simulating the real-time application of a machine learning model in a hospital environment, and for forecasting the impact of that model in terms of both prevented outcomes and cost savings. To make things concrete, let us take the setting studied in the paper: surgical readmissions.

Our method assumes that we have a machine learning model that can predict whether a surgical patient is readmitted post-operatively via the emergency department (ED). This machine learning model is then used to guide an intervention provider/team. The provider operates according to a fixed weekly schedule and can only see a limited number of patients on each day in its schedule. If the provider selects a patient who does end up experiencing a readmission, then with some probability (which we call the effectiveness constant), the readmission will be prevented.

Starting from a particular day, the simulation proceeds forward in time and keeps track of which patients become available and which are discharged. On each day that falls in the provider's schedule, the provider selects the eligible patients with the highest risk predicted by the machine learning model, up to its capacity limit, and applies its intervention to those patients. Thus, rather than selecting patients based on a predefined cutoff for readmission risk, the provider weighs the predicted readmission probabilities of all of the eligible patients and focuses on those who are most likely to be readmitted. The simulation determines which patients are eligible by keeping track of which patients have already been seen by the provider and which have been discharged.
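The selection loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the paper: the `Patient` fields, the `simulate` function, the convention that day 1 is a Monday, and the toy patients are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    pid: int
    available_day: int   # first day a prediction exists (e.g. day surgery is completed)
    discharge_day: int   # last day the patient can still be seen
    risk: float          # predicted readmission probability
    readmitted: bool     # whether the patient is readmitted if nobody intervenes


def simulate(patients, work_weekdays, capacity, effectiveness, horizon):
    """Greedy top-risk selection on scheduled days.

    Days are numbered 1..horizon and day 1 is treated as a Monday
    (weekday 0); both conventions are assumptions of this sketch.
    Returns (patient ids selected on each working day,
             expected number of readmissions prevented).
    """
    seen = set()
    selections = {}
    expected_prevented = 0.0
    for day in range(1, horizon + 1):
        if (day - 1) % 7 not in work_weekdays:
            continue  # the provider is not working on this day
        # Eligible: prediction available, not yet discharged, not yet seen.
        eligible = [
            p for p in patients
            if p.available_day <= day <= p.discharge_day and p.pid not in seen
        ]
        # Rank by predicted risk and take as many as capacity allows.
        chosen = sorted(eligible, key=lambda p: p.risk, reverse=True)[:capacity]
        selections[day] = [p.pid for p in chosen]
        for p in chosen:
            seen.add(p.pid)
            if p.readmitted:
                # Each reached readmission is prevented with this probability.
                expected_prevented += effectiveness
    return selections, expected_prevented


# Toy data (made up): Monday-only schedule, one patient per working day.
patients = [
    Patient(1, 1, 4, 0.2, False),
    Patient(2, 1, 3, 0.9, True),
    Patient(3, 2, 5, 0.6, True),
    Patient(4, 8, 9, 0.7, True),
]
selections, prevented = simulate(
    patients, work_weekdays={0}, capacity=1, effectiveness=0.1, horizon=10
)
```

In this toy run the provider works on days 1 and 8 only. Patient 3 is in the hospital from day 2 to day 5 and is therefore never seen, even though that patient goes on to be readmitted; this is the kind of scheduling effect that a metric like the AUC cannot capture.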

Visualization of simulation. In this example, there are 10 patients who complete surgery and are discharged over a 13-day time horizon. For example, patient 1 finishes surgery on day 1, and is discharged on day 4. For each patient, the number in the box indicates the predicted probability of readmission. The provider selects the top 2 patients on each day highlighted in the rounded purple rectangle. For example, on day 3, the provider selects patients 2 and 3. Patients who are highlighted in gray are patients who were already selected, and thus not eligible. The boxes under "Final Outcome" on the right-hand side indicate what the outcome of the simulation was.

When the simulation concludes, we calculate the expected number of readmissions prevented by the provider and the expected net cost savings, which are obtained by subtracting the provider’s cost (for example, in the form of wages) from the expected cost savings associated with the prevented readmissions.
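As a minimal sketch of this calculation, the function below subtracts the provider's cost from the savings attributed to prevented readmissions. The function name, parameter names and dollar figures are hypothetical and chosen only for illustration; they are not the calibrated values from the paper.

```python
def expected_net_savings(expected_prevented, cost_per_readmission, provider_cost):
    """Expected savings from prevented readmissions minus the provider's cost.

    expected_prevented   -- expected number of readmissions prevented
    cost_per_readmission -- cost avoided for each prevented readmission
    provider_cost        -- total cost of the provider over the period (e.g. wages)
    """
    gross_savings = expected_prevented * cost_per_readmission
    return gross_savings - provider_cost


# Made-up numbers: 2 expected prevented readmissions at $10,000 each,
# with a provider who costs $1,500 over the simulated period.
net = expected_net_savings(
    expected_prevented=2.0, cost_per_readmission=10_000.0, provider_cost=1_500.0
)
```

A negative result would mean that the provider costs more than the readmissions it is expected to prevent.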


Our findings

We tested our methodology in the context of ED readmissions using data from UCLA's Ronald Reagan Medical Center.

In a previous paper published in Anesthesiology [4], we developed machine learning models to predict emergency department (ED) readmissions for the same hospital. We took two of our models from that paper based on L1-regularized (LASSO) logistic regression (one that uses general health, perioperative and demographic features, and one that additionally uses lab-based features), as well as HOSPITAL [5] and LACE [6], which are two commonly used scoring rules. These models differ in their AUCs (our lab-based LASSO model had the highest AUC of 0.85, while the other three models had AUCs in the 0.71-0.74 range), as well as in their availability windows (the lab-based and non-lab-based models are available as of the day the patient completes surgery, while LACE and HOSPITAL are available only on the day of discharge).

Using these four models, we simulated three different provider schedules: a Monday-only schedule, a Monday/Wednesday schedule and a Monday-to-Friday schedule. In each schedule, the provider was limited to selecting at most 8 patients on each day. We simulated these schedules using a set of over 19,000 admissions in the two-year period of 2017-2018. We calibrated the cost savings of a prevented readmission using national data from HCUP, and the provider’s cost using salary data for nurse practitioners from UCLA.

Across all three schedules, the lab-based LASSO logistic regression model achieved the highest number of prevented readmissions and the largest expected net cost savings, which agrees with the AUCs that we found.

But while our non-lab-based LASSO logistic regression model achieves a comparable AUC to LACE and HOSPITAL, it actually results in up to 1.8 times as many prevented readmissions as HOSPITAL and LACE, because its predictions are available earlier than those two scoring rules. Our method can thus help clinicians quantify the value of having predictions available earlier as opposed to later in the patient's hospital stay, which is not possible with metrics like the AUC.

In addition, the number of work days in a schedule affects the gap between the predictive models. In the most constrained Monday-only schedule, the lab-based LASSO model prevented 3 times as many readmissions as HOSPITAL, whereas in the Monday-to-Friday schedule, the lab-based LASSO model only prevented 1.5 times as many readmissions as HOSPITAL. Thus, the value of a machine learning model is intricately tied to how constrained the provider is. If our provider could see every postoperative patient, then any two models would result in the same outcomes, and any differences in predictive performance would become moot.
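The limiting case in the last sentence can be checked with a toy example. The two score dictionaries below are made up and stand in for two different models scoring the same four patients; `top_k` is a hypothetical helper, not something taken from the paper.

```python
def top_k(scores, k):
    """Return the set of patient ids with the k highest predicted risks."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


# Two hypothetical models that rank the same patients very differently.
model_a = {"p1": 0.9, "p2": 0.2, "p3": 0.6, "p4": 0.4}
model_b = {"p1": 0.3, "p2": 0.8, "p3": 0.5, "p4": 0.7}

# Tight capacity: the models send the provider to different patients.
tight_a = top_k(model_a, 1)
tight_b = top_k(model_b, 1)

# Capacity covers everyone: both models select the same set of patients.
loose_a = top_k(model_a, 4)
loose_b = top_k(model_b, 4)
```

With a capacity of 1 the two models pick different patients, so their rankings matter. With a capacity of 4 every patient is seen regardless of the ranking, so the choice of model no longer affects the outcome.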

Perhaps most interestingly, we can understand whether it is cost effective to use a particular machine learning model at all. In our base analysis, we assumed that the effectiveness constant, which is the probability of being able to fully prevent a readmission, is set to 10%. We varied this effectiveness constant and found that for values below around 6-7%, LACE and HOSPITAL result in negative expected net cost savings; in other words, it is actually not cost effective to use LACE and HOSPITAL to guide the provider.
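One way to see why such a threshold exists: if each readmission the provider reaches is prevented with probability equal to the effectiveness constant, and the provider's cost does not depend on that constant, then expected net savings grow linearly with it and cross zero at a single break-even value. The sketch below works through that arithmetic under those assumptions. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def net_savings(effectiveness, readmissions_reached, cost_per_readmission, provider_cost):
    """Expected net savings, assumed linear in the effectiveness constant."""
    return effectiveness * readmissions_reached * cost_per_readmission - provider_cost


def break_even_effectiveness(readmissions_reached, cost_per_readmission, provider_cost):
    """Effectiveness constant at which expected net savings are exactly zero."""
    if readmissions_reached <= 0 or cost_per_readmission <= 0:
        raise ValueError("need a positive number of reached readmissions and cost")
    return provider_cost / (readmissions_reached * cost_per_readmission)


# Made-up inputs: the provider reaches 8 patients who would otherwise be
# readmitted, each readmission costs $10,000, and the provider costs $4,000.
threshold = break_even_effectiveness(
    readmissions_reached=8, cost_per_readmission=10_000.0, provider_cost=4_000.0
)
below = net_savings(0.04, 8, 10_000.0, 4_000.0)  # effectiveness under the threshold
above = net_savings(0.06, 8, 10_000.0, 4_000.0)  # effectiveness over the threshold
```

Under these assumptions, a model that steers the provider toward more patients who would truly be readmitted has a lower break-even value, so it remains cost effective even when the intervention is less effective.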

Example of expected net cost savings in dollars over the 2017-2018 simulation period, as a function of the effectiveness constant (the probability of the provider successfully preventing a readmission). In this example, the provider is assumed to follow a Monday-only schedule with a capacity of up to 8 patients.



Machine learning has many uses; in clinical medicine, its most prevalent use is to guide the allocation of rationed resources. In this work, we set out to create a simulation methodology for forecasting the impact, in terms of patient outcomes and costs, of a machine learning model that drives resource allocation. The main takeaway from this work is that traditional metrics like AUC do not tell us the whole story, and that we need to consider how a model is integrated into clinical care to fully evaluate its value. More specifically, in addition to considering predictive performance, we need to consider the timing of when the prediction is available, along with the capacity and effectiveness of the treatment pathway.


While our paper focuses on readmission prediction models, the same idea could apply to other types of machine learning models, such as ones for acute kidney injury, mortality and septic shock, that might be used to guide a constrained prevention resource. Our hope is that this simulation methodology will be a helpful complement to traditional performance metrics such as the AUC, and a valuable stepping stone from model development to implementation.



  1. Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).
  2. Henry, K. E., Hager, D. N., Pronovost, P. J. & Saria, S. A targeted real-time early warning score (TREWScore) for septic shock. Sci. Transl. Med. 7, 299ra122 (2015).
  3. Lee, C. K., Hofer, I., Gabel, E., Baldi, P. & Cannesson, M. Development and validation of a deep neural network model for prediction of postoperative in-hospital mortality. Anesthesiology 129, 649–662 (2018).
  4. Mišić, V. V., Gabel, E., Hofer, I., Rajaram, K. & Mahajan, A. Machine learning prediction of postoperative emergency department hospital readmission. Anesthesiology 132, 968–980 (2020).
  5. Donzé, J., Aujesky, D., Williams, D. & Schnipper, J. L. Potentially avoidable 30-day hospital readmissions in medical patients. JAMA Intern. Med. 173, 632 (2013).
  6. van Walraven, C. et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. CMAJ 182, 551–557 (2010).

