Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations

This study presents an approach that removes the need of medical experts to build assisting tools for medical experts, opening the way for the exploitation of immense sources of data collected worldwide.
Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations

Digitization of healthcare data is changing the organization and the structure of digital workflows. Every year, an impressive amount of new healthcare data (including images, reports, etc) is produced worldwide. For example, only in 2020 over 50 petabytes (1 petabyte is 1000 terabytes) of data were produced worldwide. The large quantity of data helps medical experts to make diagnosis and to decide the best treatment to tackle diseases. However, the number of people dying because of cancer is still high. For example, about 10 million people died because of cancer in the world.
One of the reasons behind these deaths is the fact that the possibility of analyzing all the data collected is limited. The increasing amount of healthcare data does not scale with the limited number of medical experts available to analyze the data. The analysis of healthcare data requires time and expertise. The role of medical experts is not trivial: they analyze complex structures, such as tissue morphologies, to identify peculiar characteristics of diseases, such as cancer, in a context where the heterogeneity of structures is very high. This process can last around one hour per sample, saturating the capacity of hospitals to absorb the mole of new data and stretching the time needed to diagnose a disease and to define a treatment.
Recently, new technologies have developed to support medical experts in the analysis of healthcare data. In particular, deep learning algorithms emerge as the state-of-the-art algorithm for several tasks, such as image classifications. However, this technology mostly relies on annotated data. Deep learning methods are data-driven, therefore the performance and robustness of deep learning models mostly depend on the data. This requirement limits the development of algorithms to automatically analyze healthcare data since the data annotation process involves medical experts to analyze healthcare data. Therefore, only a small percentage of healthcare data are annotated by experts, limiting the potential that large amounts of data and data-driven algorithms may have on the medical domain. 
This study presents an approach aiming to exploit the potential of healthcare data from hospital workflows. The approach is designed to remove the need for human experts to annotate data: the reports corresponding to healthcare data (that can be CT scans, MRI, whole slide images) are automatically analyzed to create automatic annotations that can be used to train deep learning models. 
The method is tested on the classification of colon whole slide images, comparing the performance of two deep learning models with the same architecture, one trained using automatic made annotations and the other one using annotations made by medical experts.
The performance of the models is comparable, showing that automatic annotations generated from reports are as meaningful as the ones made by medical experts. Therefore, automatically made annotations can be used for training tools to assist medical experts.
The method is independent of the tissue and the type of images analyzed. Even if we selected colon histopathology data as use case study, it can be easily applied to data coming from different organs (lungs, prostate, head) and, most importantly, on different types of images, including CT-scans, MRIs, etc. This last point is the most important to stress since it allows to influence several healthcare domains, starting a virtuous cycle in the whole medical domain.
The implication of the results is disruptive: the need for medical experts for healthcare data analysis and annotation (until now a bottleneck) may be removed, leading to the possibility of exploiting data already available in the hospital informative systems all around the world. 
Therefore, petabytes of data are potentially available to build robust models, since they only need to be extracted from hospital databases and do not need any further analysis from experts. The exploitation of data collected from several hospitals may allow to create robust models, more accurate on predictions and less affected by data heterogeneity.

Please sign in or register for FREE

If you are a registered user on Nature Portfolio Health Community, please sign in