Machine learning models to predict atherosclerotic cardiovascular disease risk in multiethnic, real-world populations

Despite numerous therapies and public health efforts, heart disease remains the leading cause of death in the United States. Atherosclerotic cardiovascular disease (ASCVD) refers to the buildup of plaque in the walls of the arteries, which can lead to heart attack, stroke, and death. In individuals who have not yet developed ASCVD, the goal of treatment is to prevent future ASCVD through lifestyle management, risk factor control (including weight, blood pressure, cholesterol levels, and avoiding smoking), and medications such as statins. These strategies, which aim to prevent events in individuals with no prior history of ASCVD, are termed “primary prevention”. To help decide how to treat patients for primary prevention, the American College of Cardiology (ACC) and American Heart Association (AHA) developed a risk calculator, termed the Pooled Cohort Equations (PCE), to calculate an individual’s 10-year risk of ASCVD.1 Under ACC/AHA guidelines, if a patient’s 10-year risk of ASCVD is intermediate (7.5% to <20%) or high (20% or greater), certain therapies are recommended to decrease this risk, particularly statins, which lower cholesterol levels and reduce the risk of future ASCVD events. The PCE are widely used in medical practice to risk-stratify patients and guide therapy.
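
To make the thresholds above concrete, here is a minimal sketch of how a 10-year risk estimate could be mapped to the categories mentioned in the text. The function name and the "below intermediate" label are our own illustrative choices, not part of the PCE or the guideline, and the risk is assumed to be expressed as a fraction between 0 and 1.

```python
def risk_category(ten_year_risk: float) -> str:
    """Map a 10-year ASCVD risk (fraction in [0, 1]) to a category.

    Thresholds follow the text: intermediate is 7.5% to <20%,
    high is 20% or greater. Everything under 7.5% is grouped under
    a single illustrative label, "below intermediate".
    """
    if not 0.0 <= ten_year_risk <= 1.0:
        raise ValueError("ten_year_risk must be a fraction between 0 and 1")
    if ten_year_risk >= 0.20:
        return "high"
    if ten_year_risk >= 0.075:
        return "intermediate"
    return "below intermediate"


if __name__ == "__main__":
    for risk in (0.05, 0.075, 0.199, 0.20):
        print(f"{risk:.1%}: {risk_category(risk)}")
```

In this sketch the lower bound of each category is inclusive, so a risk of exactly 7.5% counts as intermediate and exactly 20% counts as high.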

The PCE require a fixed set of input variables: age, gender, race/ethnicity, cholesterol levels, smoking status, blood pressure, history of diabetes, and current medications. They cannot be applied if any of these variables is missing, and each value must fall within a specified range; for example, total cholesterol must be between 130 and 320 mg/dL. These requirements may render a significant number of patients ineligible for the PCE: in a prior study, ~25% of the study cohort was ineligible for PCE use.2 Additionally, the PCE were derived from Non-Hispanic White and Black populations, and have shown inconsistent results when applied to other race/ethnicity groups, including Asian and Hispanic populations.3 We therefore aimed to overcome these gaps in PCE use by developing electronic health record (EHR)-trained machine learning models for broader ASCVD risk prediction in real-world, multiethnic patients.4
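
The eligibility rules described above (no missing inputs, values within range) can be sketched as a simple check. This is an illustration under stated assumptions, not the screening code used in our study: the field names are hypothetical, and the only range encoded is the total cholesterol range given in the text. Ranges for other variables would need to be added from the PCE specification.

```python
from typing import Any, List, Mapping

# Hypothetical field names for the PCE inputs listed in the text.
REQUIRED_FIELDS = (
    "age",
    "gender",
    "race_ethnicity",
    "total_cholesterol_mg_dl",
    "hdl_cholesterol_mg_dl",
    "smoking_status",
    "systolic_bp_mm_hg",
    "has_diabetes",
    "on_bp_medication",
)

# Only the range stated in the text is encoded here (assumption:
# other variable ranges are omitted rather than guessed).
VALID_RANGES = {
    "total_cholesterol_mg_dl": (130.0, 320.0),
}


def pce_ineligibility_reasons(record: Mapping[str, Any]) -> List[str]:
    """Return the reasons a patient record cannot be scored by the PCE.

    An empty list means the record passes the checks in this sketch.
    """
    reasons = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            reasons.append(f"missing: {field}")
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            reasons.append(f"out of range: {field}")
    return reasons


def is_pce_eligible(record: Mapping[str, Any]) -> bool:
    """True when no missing or out-of-range inputs were found."""
    return not pce_ineligibility_reasons(record)
```

A record with a total cholesterol of 350 mg/dL, or with no recorded HDL value, would be flagged as ineligible here. Patients like these are the ones that models trained directly on EHR data are meant to cover.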

Machine learning is used widely in several fields, and its promise for risk prediction in medicine is being increasingly studied.5 Modern EHRs provide access to large-scale data that can facilitate the development of machine learning models. In our study, we used several types of machine learning models, including random forests (RF), gradient boosted machines (GBM), extreme gradient boosted models (XGBoost), and logistic regression with either the standard L2 penalty (LRL2) or an L1 lasso penalty (LRLasso). We trained these models to predict ASCVD on data from over a hundred thousand patients, including a significant number of Asian and Hispanic participants. We tested their performance on an independent held-out set of patients, including patients who were ineligible for the PCE due to missing or out-of-range variables. We found the following:

  1. Approximately 48% of ASCVD events occurred in patients who were ineligible for the PCE. Overall, machine learning models performed as well as or better than the PCE while being applicable to a larger group of patients, including those who were ineligible for PCE use due to missing or out-of-range variables. GBM performance in the full cohort, including PCE-ineligible patients, was better than that of the PCE in the PCE-eligible cohort.
  2. Machine learning model performance did not change significantly when the models were restricted to PCE variables only, compared with when they had access to other variables from EHR data. This suggests that the PCE variables (including age, gender, race/ethnicity, blood pressure, cholesterol, diabetes, smoking) are the major predictors of ASCVD, and that adding EHR data beyond the PCE variables may not always improve ASCVD risk prediction.

Our results suggest that EHR-trained machine learning models may help overcome important ASCVD risk prediction gaps and may have greater applicability across diverse, real-world populations to guide risk-based treatment.


We thank all our co-authors for this study.


  1. Grundy, S. M. et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APhA/ASPC/NLA/PCNA Guideline on the Management of Blood Cholesterol: Executive Summary: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. J Am Coll Cardiol 73, 3168-3209, doi:10.1016/j.jacc.2018.11.002 (2019).
  2. Rana, J. S. et al. Accuracy of the Atherosclerotic Cardiovascular Risk Equation in a Large Contemporary, Multiethnic Population. J Am Coll Cardiol 67, 2118-2130, doi:10.1016/j.jacc.2016.02.055 (2016).
  3. Rodriguez, F. et al. Atherosclerotic Cardiovascular Disease Risk Prediction in Disaggregated Asian and Hispanic Subgroups Using Electronic Health Records. J Am Heart Assoc 8, e011874, doi:10.1161/JAHA.118.011874 (2019).
  4. Ward, A. et al. Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ Digit Med 3, 125, doi:10.1038/s41746-020-00331-1 (2020).
  5. Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349, 255-260, doi:10.1126/science.aaa8415 (2015).