Recent developments in artificial intelligence (AI) have made it easier to combine heterogeneous data modalities such as demographic data, electronic health records, molecular data and imaging data. Multi-modal and multi-scale data fusion offers potential benefits for studying complex diseases, because a single data modality is unlikely to be sufficient to achieve clinically useful predictive performance. Additionally, recent strides in AI have made it easier to represent, or embed, several routinely collected clinical data modalities, such as radiographic images, pathology slides and free-text clinical notes, so that they can be used more easily in downstream machine learning models. Taking advantage of these routinely collected biomedical data is the primary goal of our work in biomedical data fusion.
For example, quantitative analysis of radiographic images, including CT and MR imaging, has made great progress in the past decade, especially in oncology. Radiomics features, quantitative image features that characterize the texture, intensity or shape of lesions, have proven to be a great resource for biomarker discovery from radiographic images. More recently, deep learning with convolutional neural networks trained directly on radiographic images has further improved quantitative image modeling and is particularly powerful in applications such as lesion segmentation. Both radiomics and convolutional neural network modeling have been shown in numerous applications to complement biomedical decision support, and both can discover quantitative biomarkers that are helpful for the diagnosis or prognosis of complex diseases.
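To make the idea of an intensity-based radiomics feature concrete, here is a minimal sketch that computes a few first-order statistics over a lesion mask. It is an illustration only, not the feature extraction used in our study: the function name `first_order_features`, the toy 3x3 image and the mask are all hypothetical, and real radiomics toolkits compute many more features (including texture and shape) and usually discretize intensities before computing entropy.

```python
import math
from statistics import mean, pstdev


def first_order_features(image, mask):
    """Summarize the intensities of the pixels selected by a binary mask.

    `image` and `mask` are equally sized 2D lists; mask entries are 0 or 1.
    """
    values = [
        pixel
        for image_row, mask_row in zip(image, mask)
        for pixel, selected in zip(image_row, mask_row)
        if selected
    ]
    if not values:
        raise ValueError("mask selects no pixels")

    # Histogram entropy over the raw intensity values (no binning here,
    # which is a simplification compared with real radiomics pipelines).
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    n = len(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    return {
        "area": n,  # number of pixels inside the lesion mask
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
        "entropy": entropy,
    }


# Hypothetical toy slice and lesion mask, for illustration only.
toy_image = [[0, 10, 20], [30, 40, 50], [60, 70, 80]]
toy_mask = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
features = first_order_features(toy_image, toy_mask)
```

Each lesion is reduced to a short numeric vector like `features`, which is what makes it possible to combine imaging with other tabular data downstream.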
We took advantage of these developments to embark on a biomedical data fusion project to triage COVID19 patients. Thanks to two visiting scholars in my lab, graduate student Qinmei Xu, visiting from the Department of Medical Imaging, Jinling Hospital, Nanjing University School of Medicine, and Dr. Peiyi Xie, radiologist at The Sixth Affiliated Hospital of Sun Yat-sen University, we had access to a large cohort of more than 2000 COVID19 patients from 39 hospitals in China. Together with Stanford graduate students Xianghao Zhan and Yiheng Li, we applied our ideas in biomedical data fusion and quantitative imaging to integrate radiomic features extracted from CT images with clinical and lab data to predict the severity of disease in COVID19 patients. We used recent AI methods for automated lesion segmentation to identify the lesions in CT images automatically. We found that radiomics features extracted from COVID19 lesions in CT images significantly improved the prediction of ICU admission, mechanical ventilation and death compared to the radiologist's interpretation of these images. Next, we found that quantitative image features were essential to predict the three outcomes, and we also showed that adding lab and clinical data significantly improved performance, both with binary classification models and with time-to-event modeling.
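The integration step described above can be sketched as simple feature-level ("early") fusion: standardize each modality, concatenate the features per patient, and fit a binary classifier. This is a minimal sketch under stated assumptions, not the models from our study: logistic regression trained by gradient descent is swapped in as a generic classifier, and the helper names, feature choices and six patient records below are all hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def fuse(*modalities):
    """Early fusion: standardize each modality, then concatenate per patient."""
    blocks = [zscore_columns(block) for block in modalities]
    return [[v for part in parts for v in part] for parts in zip(*blocks)]


def predict_proba(weights, bias, x):
    """Predicted probability of the positive outcome for one fused vector."""
    logit = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-logit))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain batch gradient descent on the logistic loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        # Compute all errors first so every update uses the same predictions.
        errors = [predict_proba(weights, bias, xi) - yi for xi, yi in zip(X, y)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * xi[j] for e, xi in zip(errors, X)) / n
        bias -= lr * sum(errors) / n
    return weights, bias


# Hypothetical toy cohort: one row per patient in every modality.
radiomic = [[5, -700], [8, -680], [6, -690], [45, -450], [50, -400], [47, -430]]
clinical = [[34], [41], [38], [67], [72], [70]]  # e.g. age in years
labs = [[4], [9], [6], [80], [95], [88]]         # e.g. an inflammation marker
outcome = [0, 0, 0, 1, 1, 1]                     # e.g. 1 = ICU admission

fused = fuse(radiomic, clinical, labs)
weights, bias = fit_logistic(fused, outcome)
predicted = [int(predict_proba(weights, bias, x) >= 0.5) for x in fused]
```

In a real analysis the standardization statistics would be estimated on training patients only and performance would be reported on held-out cohorts; the toy data here are deliberately separable so the sketch only demonstrates the mechanics of fusing modalities.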
Recent reports have emphasized the lack of validation of AI models, in particular in the context of COVID19 research (see Roberts et al. Nature Machine Intelligence 2021). As that paper reported, validation of AI-based models is important, and many of the early studies have not been validated in external cohorts. Especially in the context of the pandemic, many studies reported in early 2020 used machine learning and AI to diagnose COVID19 patients or predict their prognosis without extensive validation, and it is unclear which of these studies are ready for clinical use. In our study, we were able to test our models in additional Chinese hospitals but not in cohorts outside of China. Several hurdles need to be overcome to validate models externally, including privacy and regulatory requirements (e.g. HIPAA and GDPR) that introduce additional steps before data can be shared across institutions and international borders. In addition, in the early stages of the pandemic few de-identified data sets were publicly available, especially for CT imaging. This has changed over time, with more efforts underway to make large imaging cohorts available, including a European COVID19 data initiative and the RICORD database supported by the Radiological Society of North America (RSNA). Another hurdle, especially in the context of prognostic modeling, is collecting follow-up data: even if federated or distributed learning is successfully implemented, validation remains hard if appropriate clinical outcome data are not collected in a harmonized way.
Now that the initial flood of COVID19 modeling efforts has been reported, validation and implementation efforts will need to follow for this work to have any chance of making an impact on this devastating disease. Even if this work does not directly affect the ongoing pandemic, the lessons learned will still be important for the study of other complex diseases. Particularly in the context of biomedical decision support, innovations in data fusion, deep learning, quantitative imaging, machine learning and distributed learning will prove invaluable across biomedicine. With the pandemic, biomedical research is at the forefront of public attention, and the expectation is that new methods can be implemented instantaneously. However, biomedical research is typically a slow-moving process, and the necessary resources have to be allocated to move AI models from bench to bedside.