Toward Clinically Useful AI: Helping Clinicians Know When to Trust Their Models

An important part of a clinician’s job is identifying patients who are likely to have a bad outcome without appropriate treatment. To accomplish this, healthcare providers often use predictive models to classify patients according to their level of risk. For example, clinicians use risk models to identify patients who have a high risk of death soon after a heart attack because many high-risk patients benefit from invasive interventions that can lower their subsequent risk of adverse outcomes.  As many invasive interventions carry their own risk of complications, it is desirable to preferentially use these therapies in patients who have the greatest risk.

Good risk models have high overall accuracy in large clinical datasets.  However, no model is accurate 100% of the time.  Indeed, errors in predicting patient risk can result in healthcare providers not administering lifesaving therapies to patients who need them.  It is important, therefore, for healthcare providers to have some insight into when a prediction for a given patient is likely to be incorrect.

Information from a dataset is used both to compute risk scores for new patients and to determine whether those risk scores should be trusted. Model performance is reduced in subgroups of patients whose predictions are deemed unreliable.

In this study, our goal was to develop a general framework for identifying when a given patient belongs to a group in which a predictive model is very inaccurate. We called predictions on patients who belong to these poorly performing groups “unreliable” because they correspond to misleading statements about a given patient’s risk.

We approached the problem as follows. Given a risk model developed using a particular training dataset, we constructed an alternative risk model using information from the same dataset and compared the predictions of the two models on each patient in a separate dataset. We then ranked the patients by how much the two models' predictions disagreed, identified the patients with the most disagreement, and evaluated the performance of the original risk model on this group as well as on the remaining patients. Our method can be computed for any clinical risk model, needs only summary statistics of the training data used to develop the original clinical risk model, and is suitable in the setting of large class imbalance, a situation often encountered in healthcare.

Using the Global Registry of Acute Coronary Events (GRACE) risk score, we demonstrated that the performance of the model in patients in the highest 1% of unreliability is close to random guessing, while the performance in the remaining patients remains robust and is consistent with previously published performance. Given that the original and alternative risk models are developed using the same data, we concluded that patients with high unreliability scores are not well represented in the dataset; consequently, the risk model should not be used for these patients. In a practical setting, we envision that the unreliability score could be integrated into existing risk calculators to inform clinicians when the risk outputs should not be used for a particular patient.
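
The ranking-and-evaluation step described above can be illustrated with a minimal Python sketch. It is not the study's implementation: it assumes the two models' predicted risks are already available as lists, uses the absolute difference between them as a stand-in for the unreliability score (the actual score may be defined differently), and measures performance with the area under the ROC curve. The function names `auc`, `flag_most_disagreeing`, and `compare_subgroups` are hypothetical.

```python
from math import ceil


def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive case scores higher than a randomly chosen negative case
    (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_most_disagreeing(p_original, p_alternative, top_fraction=0.01):
    """Return a boolean list marking the patients whose two predictions
    disagree the most. ASSUMPTION: disagreement is the absolute difference
    in predicted risk; this is an illustrative choice only."""
    disagreement = [abs(a - b) for a, b in zip(p_original, p_alternative)]
    n_flagged = max(1, ceil(top_fraction * len(disagreement)))
    # Sort patient indices from most to least disagreement.
    ranked = sorted(range(len(disagreement)), key=lambda i: -disagreement[i])
    flagged = set(ranked[:n_flagged])
    return [i in flagged for i in range(len(disagreement))]


def compare_subgroups(p_original, p_alternative, labels, top_fraction=0.01):
    """Evaluate the ORIGINAL model separately on the flagged patients
    and on everyone else."""
    flagged = flag_most_disagreeing(p_original, p_alternative, top_fraction)

    def subset_auc(keep):
        scores = [s for s, f in zip(p_original, flagged) if f == keep]
        ys = [y for y, f in zip(labels, flagged) if f == keep]
        return auc(scores, ys)

    return {
        "n_flagged": sum(flagged),
        "flagged_auc": subset_auc(True),
        "rest_auc": subset_auc(False),
    }


# Toy data (made up for illustration): the two models disagree sharply
# on the first two patients and agree on the rest.
p_orig = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
p_alt = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3]
outcome = [0, 1, 1, 0, 1, 0]
result = compare_subgroups(p_orig, p_alt, outcome, top_fraction=0.3)
print(result)
```

In this toy example the original model ranks the two flagged patients in the wrong order while ranking the remaining four correctly, which mirrors in miniature the pattern reported for the GRACE score. With real data the flagged fraction would be much smaller (for example 1%), and the subgroup would need enough events for the performance estimate to be meaningful.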
