Multi-center validation of machine learning model for preoperative prediction of postoperative mortality

This study aimed to create a machine-learning prediction model for 30-day mortality after non-cardiac surgery that relies on a manageable amount of clinical information as input features and is validated against multi-center rather than single-center data.

Accurate prediction of postoperative mortality is important not only for successful postoperative patient care but also for informed shared decision making with patients and efficient allocation of medical resources. This study thus aimed to create a machine-learning prediction model for 30-day mortality after non-cardiac surgery that relies on a manageable amount of objective and quantitative clinical information as input features and is validated against multi-center rather than single-center data. We hypothesized that a lighter model using only an appropriately small number of input variables would predict 30-day mortality after non-cardiac surgery at multiple institutions at least as well as previous complex and heavy artificial-intelligence models that use many clinical input variables. By demonstrating that the model has appropriate predictive power and transferability when applied to multiple hospitals, we sought to confirm its clinical utility.

Figure 1. Schematic diagram of external validation of each hospital model.
SNUH, Seoul National University Hospital; AMC, Asan Medical Center; EUMC, Ewha Womans University Medical Center; BRMH, Boramae Hospital.

Data were collected retrospectively from 454,404 patients over 18 years of age who underwent non-cardiac surgery at four independent institutions. Only 12–18 clinical variables were used for model training. Logistic regression, random forest, extreme gradient boosting (XGBoost), and deep neural network methods were applied and their prediction performances compared. To reduce overfitting and create a robust model, bootstrapping and grid search with 10-fold cross-validation were performed.
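The tuning step described above can be sketched as follows. This is a minimal illustration, not the study's code: it assumes a small L2-penalised logistic regression trained by gradient descent, synthetic two-feature data, and a hypothetical penalty grid, and it scores each grid value by the mean AUROC over stratified 10-fold cross-validation. All function names are illustrative.

```python
import math
import random

def stratified_folds(y, k, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, l2, epochs=60, lr=0.5):
    """Batch gradient descent for L2-penalised logistic regression."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(model, X):
    w, b = model
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

def auroc(y, scores):
    """Chance that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, v in zip(scores, y) if v == 1]
    neg = [s for s, v in zip(scores, y) if v == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def grid_search_cv(X, y, l2_grid, k=10):
    """Return the penalty with the best mean held-out AUROC, plus all scores."""
    folds = stratified_folds(y, k)
    results = {}
    for l2 in l2_grid:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            model = fit_logistic([X[i] for i in train], [y[i] for i in train], l2)
            preds = predict(model, [X[i] for i in held_out])
            fold_scores.append(auroc([y[i] for i in held_out], preds))
        results[l2] = sum(fold_scores) / k
    return max(results, key=results.get), results

# Synthetic stand-in data (assumption): 200 patients, about 20% events.
rng = random.Random(42)
X, y = [], []
for _ in range(200):
    label = 1 if rng.random() < 0.2 else 0
    X.append([rng.gauss(1.0 if label else 0.0, 1.0), rng.gauss(0.0, 1.0)])
    y.append(label)

best_l2, cv_scores = grid_search_cv(X, y, l2_grid=[0.0, 0.01, 0.1])
print("best L2 penalty:", best_l2)
```

Stratifying the folds matters for a rare outcome such as 30-day mortality, because an unstratified split can leave a fold with no events, in which case AUROC is undefined for that fold.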

Figure 2. Performance evaluation of machine learning algorithms for postoperative 30-day mortality prediction.
AUROC (a) and AUPRC (b) of several models for postoperative 30-day mortality in the SNUH dataset. The values of AUROC and AUPRC are presented with 95% confidence intervals. AUROC area under the receiver operating characteristic curve, AUPRC area under the precision-recall curve, DNN deep neural network, XGB extreme gradient boosting, RF random forest, LR logistic regression, ASA-PS American Society of Anesthesiologists physical status classification.
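As a rough illustration of how metrics like those in Figure 2 can be computed, the sketch below estimates AUROC, AUPRC (here via average precision, one common estimator), and a percentile bootstrap 95% confidence interval. The data are synthetic and the function names are illustrative; the original analysis may have used a different estimator or interval method.

```python
import random

def auroc(y, s):
    """Chance that a random positive outranks a random negative (ties = 0.5)."""
    pos = [v for v, t in zip(s, y) if t == 1]
    neg = [v for v, t in zip(s, y) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y, s):
    """Mean of the precision values measured at the rank of each positive."""
    order = sorted(range(len(y)), key=lambda i: -s[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

def bootstrap_ci(y, s, metric, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval; resamples lacking a class are redrawn."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if 0 < sum(yb) < n:
            stats.append(metric(yb, [s[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic stand-in data (assumption): 300 patients, about 5% events.
rng = random.Random(1)
y_demo = [1 if rng.random() < 0.05 else 0 for _ in range(300)]
s_demo = [rng.gauss(1.5 if t else 0.0, 1.0) for t in y_demo]

print("AUROC", auroc(y_demo, s_demo), bootstrap_ci(y_demo, s_demo, auroc))
print("AUPRC", average_precision(y_demo, s_demo),
      bootstrap_ci(y_demo, s_demo, average_precision))
```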

Figure 2 presents the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) of each candidate modeling method on the SNUH data. All four candidate models outperformed the ASA-PS class in terms of both AUROC and AUPRC. XGBoost delivered the best performance, with an AUROC of 0.942 and an AUPRC of 0.175. In general, the SNUH and EUMC models, which were trained on large amounts of data, performed better when externally validated on data from other institutions. In external validation, the highest AUROC (0.941) was obtained when the SNUH model was validated on EUMC data, and the highest AUPRC (0.180) when the SNUH model was validated on BRMH data. For the lab model, which uses laboratory variables only, the highest AUROC (0.923) was obtained when the EUMC model was validated on AMC data, and the highest AUPRC (0.177) when the SNUH model was validated on BRMH data.
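The external validation design, in which each hospital's model is scored on every other hospital's data, can be sketched as a simple train-site by test-site loop. The example below is only a schematic under strong assumptions: the cohorts are synthetic, the per-site effect sizes are invented, and a difference-of-class-means linear score stands in for the XGBoost models used in the study.

```python
import random

def make_site(n, effect, seed):
    """Hypothetical synthetic cohort: two features, about 10% events."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        died = 1 if rng.random() < 0.1 else 0
        x = [rng.gauss(effect if died else 0.0, 1.0), rng.gauss(0.0, 1.0)]
        rows.append((x, died))
    return rows

def fit_centroid_model(rows):
    """Stand-in model: project onto the difference between class means."""
    pos = [x for x, d in rows if d == 1]
    neg = [x for x, d in rows if d == 0]
    w = [sum(p[j] for p in pos) / len(pos) - sum(q[j] for q in neg) / len(neg)
         for j in range(len(rows[0][0]))]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))

def auroc(y, s):
    """Chance that a random positive outranks a random negative (ties = 0.5)."""
    pos = [v for v, t in zip(s, y) if t == 1]
    neg = [v for v, t in zip(s, y) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def external_validation(sites):
    """AUROC for every ordered (training site, different test site) pair."""
    table = {}
    for train_name, train_rows in sites.items():
        model = fit_centroid_model(train_rows)
        for test_name, test_rows in sites.items():
            if test_name == train_name:
                continue  # internal performance is assessed separately
            y = [d for _, d in test_rows]
            s = [model(x) for x, _ in test_rows]
            table[(train_name, test_name)] = auroc(y, s)
    return table

# Site names follow the article; the data and effect sizes are made up.
sites = {
    "SNUH": make_site(300, 1.5, seed=1),
    "AMC": make_site(300, 1.2, seed=2),
    "EUMC": make_site(300, 1.4, seed=3),
    "BRMH": make_site(300, 1.0, seed=4),
}
results = external_validation(sites)
for (train_name, test_name), value in sorted(results.items()):
    print(f"{train_name} -> {test_name}: AUROC {value:.3f}")
```

With four sites this yields twelve train-to-test pairs, which is the kind of grid from which "best external validation" figures such as those quoted above are read off.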

Figure 3. Feature importance of each hospital model with respect to SHAP value.
a SNUH model, b AMC model, c EUMC model, d BRMH model. SHAP Shapley additive explanations, SNUH Seoul National University Hospital, AMC Asan Medical Center, EUMC Ewha Womans University Medical Center, BRMH Boramae Hospital.

The feature importance extracted from the XGBoost algorithm differed between models (Figure 3). In the SNUH and AMC models, the preoperative albumin level was the variable with the highest influence on mortality within 30 days after surgery, whereas age in the EUMC model and the preoperative prothrombin time (PT) in the BRMH model were the most important variables for postoperative 30-day mortality.
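To make the notion of SHAP-based importance concrete, the sketch below computes exact Shapley values by enumerating feature subsets for a tiny hypothetical risk function, then ranks features by mean absolute value. This brute-force approach is an assumption chosen for clarity; it is feasible only for a handful of features and is not the tree-specific algorithm typically used with XGBoost. The risk function, feature order, and numbers are invented.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values of f at x, relative to a background sample."""
    n = len(x)

    def value(subset):
        # Mean prediction when features in `subset` take x's values and
        # the remaining features take each background row's values.
        total = 0.0
        for b in background:
            total += f([x[j] if j in subset else b[j] for j in range(n)])
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi

def mean_abs_importance(f, rows, background):
    """Global importance: mean absolute Shapley value of each feature."""
    per_row = [shapley_values(f, x, background) for x in rows]
    return [sum(abs(p[j]) for p in per_row) / len(per_row)
            for j in range(len(rows[0]))]

# Hypothetical risk score over (albumin, age, PT) in standardised units.
def toy_risk(z):
    albumin, age, pt = z
    return -1.2 * albumin + 0.8 * age + 0.5 * pt + 0.3 * age * pt

background = [[0.0, 0.0, 0.0], [1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]
patients = [[-2.0, 1.5, 1.0], [0.5, -0.5, 0.0]]
print(mean_abs_importance(toy_risk, patients, background))
```

Because the values are computed relative to each hospital's own data, the same variable can rank differently across hospital models, which is consistent with the differences reported in Figure 3.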

The purpose of this study was to construct a viable artificial intelligence model for predicting prognosis prior to surgery that could be used in the real world. Such a model should have the following characteristics: (1) it can be transferred between hospitals; (2) data generation and recording do not require additional labor; (3) its design is straightforward and lightweight; and (4) its accuracy is comparable to that of previous models. To create this model, we used only objective and quantitative data that were automatically imported from the electronic medical record system, which reduced interhospital variation, and we increased the data volume to improve accuracy. The results of this study show that predictive power does not decrease even when using only the minimum number of variables that can be automatically extracted from each hospital's electronic medical records, compared with previously proposed prediction models that require numerous clinical input variables.
Additionally, this model performed well when applied directly to other hospitals, indicating that it is transferable between hospitals. A disadvantage of existing machine learning prediction models is that they overfit the training data, making them inapplicable to other hospitals or necessitating retraining. Our model, by contrast, can be embedded directly in the hospital electronic medical record (EMR) system without additional processing, so that prognosis can be predicted with a single click. Even if the program is not embedded in the EMR, it uses only a few parameters, which enables real-time prognosis prediction in an outpatient clinic by entering these parameters into the program via the web.

In our study, we developed a prediction model using only 12 preoperative laboratory test variables, 3 demographic characteristics, and surgery-related information, namely the anesthesia method, emergency status, and surgery department. In addition, we developed the lab model using only the 12 preoperative laboratory test variables. We restricted modeling to such a small number of variables so that the model would rely only on objective information that can be extracted in the same way at different institutions, and would therefore be applicable across medical institutions. We expected that building a prediction model from this small set of objective clinical variables would yield a more robust model. Indeed, even though our models used so few variables, their prediction performance did not deteriorate compared with that reported in previous studies. This suggests that a machine-learning model trained only on objective and quantitative values from a sufficiently large cohort may be applied in other hospitals with prediction performance similar to that in the hospital where the model was trained.

The strength of our study is that it is the only study that collects large-scale data from four independent institutions to create and compare artificial intelligence models that predict postoperative 30-day mortality. In most previous studies, models were developed with data from a single center and validated with data from the same institution. Because data for external validation were absent, most previous prediction models suffer from overfitting, which makes it difficult to apply a model developed at one institution to another. Against this background, our work, which externally validates prediction models using multicenter data, is expected to serve as an important milestone in the development of generalized, robust models applicable to multiple hospitals.

One limitation of our study is the relatively low AUPRC value compared with the high AUROC value. The AUPRC scores of our models are in the range of 0.1–0.2. Another limitation is that we did not explore alternative technical approaches to improving the transferability of the model. Our study confirmed that obtaining as many datasets as possible increases the robustness of the model, but this is very difficult to realize in actual clinical practice.
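One general reason a high AUROC can coexist with a low AUPRC is that precision depends on how rare the outcome is, and a classifier that ranks patients at random has an expected AUPRC roughly equal to the event prevalence. The short sketch below works through Bayes' rule for positive predictive value. The sensitivity, specificity, and prevalence figures are hypothetical and are not taken from the study.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value (precision) from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same hypothetical operating point, evaluated at different event rates.
for prevalence in (0.5, 0.05, 0.005):
    precision = ppv(sensitivity=0.8, specificity=0.9, prevalence=prevalence)
    print(f"prevalence {prevalence:.3f}: precision {precision:.3f}")
```

At a fixed sensitivity and specificity, precision falls sharply as the event becomes rarer, so AUPRC values are best read against the outcome prevalence rather than against the 0 to 1 scale alone.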

Our results suggest that a robust artificial-intelligence prediction model applicable to multiple institutions can be built as a lightweight predictive model using only minimal preoperative information that can be automatically extracted at each hospital.


