Deep learning is the state-of-the-art machine learning approach. The success of deep learning in many pattern recognition applications has brought excitement and high expectations that deep learning, or artificial intelligence (AI), can bring revolutionary changes in health care. Early studies of deep learning applied to lesion detection or classification have reported performance superior to that of conventional techniques and, in some tasks, even better than that of radiologists. The potential of applying deep-learning-based medical image analysis to computer-aided diagnosis (CAD), thus providing decision support to clinicians and improving the accuracy and efficiency of various diagnostic and treatment processes, has spurred new research and development efforts in CAD. Despite the optimism in this new era of machine learning, the development and implementation of CAD or AI tools in clinical practice face many challenges. In this chapter, we will discuss some of these issues and efforts needed to develop robust deep-learning-based CAD tools and integrate these tools into the clinical workflow, thereby advancing towards the goal of providing reliable intelligent aids for patient care.
Keywords: Machine learning, deep learning, artificial intelligence, computer-aided diagnosis, medical imaging, big data, transfer learning, validation, quality assurance, interpretable AI
Medical imaging is an important diagnostic tool for various diseases. In 1895, Roentgen discovered that x-rays could be used to look into the human body non-invasively, and x-ray radiography became the first diagnostic imaging modality soon after. Since then many imaging modalities have been invented, with computed tomography, ultrasound, magnetic resonance imaging, and positron emission tomography among the most commonly used, and increasingly complex imaging procedures have been developed. Image information plays a crucial role in decision making at many stages in the patient care process, including detection, characterization, staging, treatment response assessment, monitoring of disease recurrence, as well as guiding interventional procedures, surgeries, and radiation therapy. The number of images for a given patient case has increased dramatically, from a few two-dimensional (2D) images to hundreds with 3D imaging and thousands with 4D dynamic imaging. Application of multi-modality imaging further increases the amount of image data to be interpreted. The increasing workload makes it difficult for radiologists and physicians to maintain workflow efficiency while utilizing all the available imaging information to improve accuracy and patient care. With the advances in machine learning and computational techniques in recent years, the development of computerized methods to assist radiologists in image analysis and diagnosis has been recognized as both a need and an important area of research and development in medical imaging.
Attempts to use computers to automatically analyze medical images emerged as early as the 1960’s [1–4]. Several studies demonstrated the feasibility of applying computers to medical image analysis, but the work did not attract much attention, probably because of the limited access to high quality digitized image data and computational resources. Doi et al. in the Kurt Rossmann Laboratory at the University of Chicago began systematic development of machine learning and image analysis techniques for medical images in the 1980’s [5], with the goal of developing computer-aided diagnosis (CAD) as a second opinion to assist radiologists in image interpretation. Chan et al. developed a CAD system for detection of microcalcifications on mammograms [6] and conducted the first observer performance study [7] that demonstrated the effectiveness of CAD in improving breast radiologists’ detection performance of microcalcifications. The first commercial CAD system was approved by the Food and Drug Administration (FDA) for use as a second opinion in screening mammography in 1998. CAD and computer-assisted image analysis have been a major area of research and development in medical imaging in the past few decades. CAD methods have been investigated for various applications including disease detection, characterization, staging, treatment response assessment, prognosis prediction, and risk assessment for various diseases and with various imaging modalities. The work in the CAD field has been steadily increasing, as can be seen from the trend of publications in peer-reviewed journals found by a literature search in the Web of Science (Fig. 1).
Fig. 1. Literature search for publications in peer-reviewed journals by Web of Science from 1900 to 2019 using key words: ((imaging OR images) AND (medical OR diagnostic)) AND (machine learning OR deep learning OR neural network OR deep neural network OR convolutional neural network OR computer aid OR computer assist OR computer-aided diagnosis OR automated detection OR computerized detection OR Computer-aided detection OR automated classification OR computerized classification OR decision support OR radiomic) NOT (pathology OR slide OR genomics OR molecule OR genetic OR cell OR protein OR review OR survey).
Although the research in CAD has been increasing, very few CAD systems are used routinely in the clinic. One of the major reasons may be that CAD tools developed with conventional machine learning methods have not reached the high performance needed to meet physicians’ needs to improve both diagnostic accuracy and workflow efficiency. With the success of deep learning in the past several years in many machine learning applications such as text and speech recognition, face recognition, autonomous vehicles, and games such as chess and Go, there are high expectations that deep learning will bring a breakthrough in CAD performance and widespread use of deep-learning-based CAD, or artificial intelligence (AI), in various tasks in the patient care process. The enthusiasm has spurred numerous studies and publications in CAD using deep learning. In this chapter, we will discuss some issues and challenges in the development of deep-learning-based CAD in medical imaging, as well as considerations needed for the future implementation of CAD in clinical use.
CAD systems are developed with machine learning methods. The conventional machine learning approach to CAD in medical imaging uses image analysis methods to recognize disease patterns and distinguish different classes of structures on images, e.g., normal or abnormal, malignant or benign. CAD developers design image processing and feature extraction techniques based on domain knowledge to represent the image characteristics that can distinguish the various states. The effectiveness of the feature descriptors often depends on the domain expertise of the CAD developers and the capability of the mathematical formulations or empirical image analysis techniques that are designed to translate the image characteristics to numerical values. The extracted features are then used as input predictor variables to a classifier, and a predictive model is formed by adjusting the weights of the various features based on the statistical properties of a set of training samples to estimate the probability that an image belongs to one of the states. The conventional machine learning approach is limited in that human developers may not be able to translate the complex disease patterns into a finite number of feature descriptors even if they have seen a large number of cases from the patient population. The hand-engineered features may also not be robust against the large variations of normal and abnormal patterns in the population. The performance of the developed CAD system is therefore often limited in its discriminative power or generalizability, resulting in a high false positive rate at high sensitivity or vice versa.
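The conventional pipeline described above (hand-engineered features followed by a trained classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic patches stand in for regions of interest, the three descriptors (mean, standard deviation, mean gradient magnitude) are arbitrary examples of hand-engineered features, and logistic regression is only one of many possible classifiers. It is not the design of any particular CAD system.

```python
import numpy as np

def extract_features(patch):
    """Hand-engineered descriptors for one 2D patch (illustrative choices only)."""
    gy, gx = np.gradient(patch.astype(float))
    return np.array([
        patch.mean(),             # average brightness
        patch.std(),              # contrast
        np.hypot(gx, gy).mean(),  # mean edge strength
    ])

def make_patch(rng, abnormal, size=16):
    """Synthetic stand-in for a region of interest: noise, plus a bright blob if abnormal."""
    patch = rng.normal(0.0, 1.0, (size, size))
    if abnormal:
        yy, xx = np.mgrid[:size, :size]
        c = (size - 1) / 2
        patch += 3.0 * np.exp(-((yy - c) ** 2 + (xx - c) ** 2) / (2 * 2.0 ** 2))
    return patch

def train_logistic(X, y, lr=0.1, steps=2000):
    """Adjust the feature weights by gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

rng = np.random.default_rng(0)
labels = np.array([0, 1] * 100)  # hypothetical labels: 0 = normal, 1 = abnormal
feats = np.array([extract_features(make_patch(rng, bool(l))) for l in labels])

# Standardize the features, then fit the classifier weights on the training samples
X = (feats - feats.mean(axis=0)) / feats.std(axis=0)
w, b = train_logistic(X, labels)

# Estimated probability that each patch belongs to the abnormal class
prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
train_accuracy = np.mean((prob > 0.5) == labels)
print(f"training accuracy: {train_accuracy:.2f}")
```

The accuracy printed here is measured on the training samples themselves, so it says nothing about generalizability. The sketch also makes the limitation concrete: the classifier can only use what the three chosen descriptors capture.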
Deep learning has emerged as the state-of-the-art machine learning method in many applications. Deep learning is a type of representation learning method in which a complex multi-layer neural network architecture learns representations of data automatically by transforming the input information into multiple levels of abstraction [8]. For pattern recognition tasks in images, deep convolutional neural networks (DCNN) are the most commonly used deep learning networks. With a sufficiently large training set, a DCNN can learn to automatically extract relevant features from the training samples for a given task by iteratively adjusting its weights with backpropagation. A DCNN therefore discovers feature representations through training and does not require manually designed features as input. If properly trained with a large training set that is representative of the population of interest, the DCNN features are expected to be superior to hand-engineered features in that they have high selectivity and invariance [8]. Importantly, since the learning process is automated, deep learning can easily analyze thousands or millions of cases, more than even human experts may be able to see and memorize in their lifetime. Deep learning can therefore be more robust to the wide range of variations in features between the different classes to be differentiated, as long as the training set is large and diverse enough.
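The idea that convolutional weights are learned from training samples rather than designed by hand can be shown in its simplest form with a toy sketch. It trains a single 3×3 convolution kernel by gradient descent so that its output matches that of a fixed edge filter. The random images, the Sobel-like target filter, the learning rate, and the number of epochs are all assumptions chosen for illustration. A real DCNN stacks many such layers with nonlinearities and is trained on labeled images, with the same chain-rule gradient propagated through every layer.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Cross-correlation in 'valid' mode, the operation used in CNN layers."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def kernel_gradient(image, grad_out, kshape):
    """Gradient of the loss with respect to the kernel, given dLoss/dOutput."""
    g = np.zeros(kshape)
    oh, ow = grad_out.shape
    for a in range(kshape[0]):
        for b in range(kshape[1]):
            g[a, b] = np.sum(image[a:a + oh, b:b + ow] * grad_out)
    return g

rng = np.random.default_rng(0)
# Hypothetical "ground truth" feature detector the layer should discover
target_kernel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
images = rng.normal(size=(8, 12, 12))
targets = np.array([conv2d_valid(img, target_kernel) for img in images])

kernel = np.zeros((3, 3))  # the weights start with no designed structure
lr = 0.05
losses = []
for epoch in range(150):
    total = 0.0
    grad = np.zeros_like(kernel)
    for img, tgt in zip(images, targets):
        err = conv2d_valid(img, kernel) - tgt          # forward pass and error
        total += np.mean(err ** 2)                     # mean squared error loss
        grad += kernel_gradient(img, 2 * err / err.size, kernel.shape)
    kernel -= lr * grad / len(images)                  # iterative weight update
    losses.append(total / len(images))

print(np.round(kernel, 3))
```

After training, the learned kernel should closely match the target filter even though no one specified its values, which is the single-layer analogue of a DCNN discovering its own feature detectors.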
CNN can trace its origin to the neocognitron proposed by Fukushima et al. in the early 1980’s [9]. LeCun first trained a CNN by backpropagation to classify patterns of handwritten digits in 1990 [10]. CNN was used in many applications such as object detection and character and face recognition in the early 1990’s. Lo et al. first introduced CNN to the analysis of medical images in 1993 and trained a CNN for lung nodule detection in chest radiographs [11, 12]. Chan et al. applied CNN to microcalcification detection [13, 14] on mammograms in the same year and to mass detection in the following year [15–18]. Zhang et al. applied a similar shift-invariant neural network to the detection of clusters of microcalcifications in 1994 [19]. Although these early CNNs were not very deep, they demonstrated the pattern recognition capability of CNN in medical images.
Deep CNN was enabled by several important neural network training techniques developed over the years, including layer-wise unsupervised pretraining followed by supervised fine-tuning [20–22], use of the rectified linear unit (ReLU) [23, 24] as the activation function in place of sigmoid-type activation functions, pooling to improve feature invariance and reduce dimensionality [25], dropout to reduce overfitting [26], and batch normalization [27], which further reduces the risk of internal covariate shift, vanishing gradients, and overfitting and increases training convergence speed. These techniques allow neural networks with more and more layers, containing millions of weights, to be trained. In 2012, Krizhevsky et al. [28] proposed a CNN with five convolutional layers and three fully connected layers (named “AlexNet”) containing over 60 million weights and achieved breakthrough performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [29], which involved classifying photographic images into 1000 classes of everyday objects. AlexNet demonstrated the pattern recognition capability of the multiple layers of a deep structure. DCNNs with increasing depth have been developed since AlexNet. He et al. [30] proposed residual learning and showed that a residual network (ResNet) with 110 to 152 layers could outperform several other DCNNs; ResNet won the ILSVRC in 2015. Sun et al. [31] showed that the learning capacity of a DCNN increased with depth but that the capacity could be utilized only with sufficiently large training data.
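Three of the techniques named above can be illustrated numerically with a minimal sketch. It makes simplifying assumptions: the batch normalization omits the learned scale and shift parameters, the residual block uses a single ReLU layer as its residual function, and the depth of 20 layers is arbitrary. The function names are ours and are not taken from any library.

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU; exactly 1 for positive inputs, 0 otherwise."""
    return (np.asarray(x) > 0).astype(float)

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis (learned scale/shift omitted)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def residual_block(x, weight):
    """y = x + F(x), with F a single ReLU layer here for brevity."""
    return x + np.maximum(0.0, x @ weight)

depth = 20
# Largest possible activation-derivative factor accumulated across `depth` layers
sigmoid_factor = sigmoid_grad(0.0) ** depth  # 0.25 ** 20, vanishingly small
relu_factor = relu_grad(1.0) ** depth        # stays at 1 on the active path

rng = np.random.default_rng(0)
batch = rng.normal(5.0, 3.0, size=(64, 4))   # badly scaled activations
normalized = batch_norm(batch)               # zero mean, unit variance per feature

# With F(x) = 0 the block reduces to the identity, so extra layers need not hurt
identity_out = residual_block(batch, np.zeros((4, 4)))

print(f"sigmoid factor: {sigmoid_factor:.2e}, ReLU factor: {relu_factor:.1f}")
```

The comparison of the two factors shows why sigmoid-type activations make gradients vanish in deep stacks, and the identity case of the residual block hints at why very deep residual networks remain trainable.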
The success of deep learning or AI in personal devices, social media, self-driving cars, and games such as chess and Go has raised unprecedented expectations of deep learning in medicine. Deep learning has been applied to many medical image analysis tasks for CAD [32–34]. The most common areas of CAD application using deep learning include classification of disease and normal patterns, classification of malignant and benign lesions, and prediction of high risk and low risk patterns of developing cancer in the future. Other applications include segmentation and classification of organs and tumors of different types, and classification of changes in tumor size or texture for assessment of treatment response or prediction of prognosis or recurrence. Because relatively large public data sets are available for chest radiographs, thoracic CT, and mammograms, a large number of studies of lung diseases and breast cancer have been conducted using the public data sets. Deep-learning-based image analysis has also been applied to fundus images or optical coherence tomography for detection of eye diseases [35], and to histopathological images for classification of cell types [36]. Most of the studies reported very promising results, further boosting the hype of deep-learning-based CAD. This new generation of CAD is called AI, although these CAD tools still behave like a very complex mathematical model that memorizes information in its millions of weights and are far from being “intelligent”.
CAD or AI tools are expected to be useful decision support tools in medicine in the near future. Other than detection and characterization of abnormalities, applications such as pre-screening and triaging, cancer staging, treatment response assessment, recurrence monitoring, and prognosis or survival prediction are being explored. Although no CAD systems with new AI techniques have been subjected to large scale clinical trials to date, experiences from CAD use in screening mammography may provide some insights into what may be expected of CAD tools in the clinic [37].
The conventional machine-learning-based CAD for detection of breast cancer in screening mammography is the only CAD application in widespread clinical use to date. These systems have been shown to have sensitivity comparable to or higher than that of radiologists, especially for microcalcifications, but they also mark a few false positives per case on average [38]. Although the performance of CAD systems is moderate, they may detect lesions with characteristics different from those detected by radiologists. The complementary detections by the radiologist and CAD can improve the overall sensitivity when the radiologist reads with CAD. Studies have shown that radiologists’ accuracy was improved significantly when reading with CAD [5]. CAD systems were therefore approved by the FDA for use as a second opinion but not as a primary reader or pre-screener. Early clinical trials [39, 40] comparing single reading with CAD to double reading showed promising results. In the CADET II study, Gilbert et al. [39] conducted a prospective randomized clinical trial at three sites in the United Kingdom. A total of over 28,000 patients were included. The screening mammograms of each patient were independently read in two arms; one was single reading with CAD and the other was their standard practice of double reading. The experience of the single readers in the CAD arm was matched to that of the first readers in the double reading arm. Arbitration was used in cases of recall due to the second reader or CAD. They found that arbitration was performed in 1.3% of the cases in single reading with CAD. The average sensitivities in the two arms were comparable, at 87.2% and 87.7%, respectively. The recall rates at two centers were comparable in the two arms, 3.7% versus 3.6% and 2.7% versus 2.7%, respectively, but the third center had a significantly higher recall rate for single reading with CAD, 5.2% versus 3.8%. The overall recall rate therefore increased from 3.4% to 3.9% with single reading with CAD.
Gromet et al. [40] performed a retrospective review of the sensitivity and recall rate of single reading with CAD after CAD implementation, in comparison to those of double reading before CAD use as historical control, for the same group of nine radiologists in a single mammography facility. The first reading in their double reading protocol was also analyzed and treated as single reading without CAD. The study cohort contained over 110,000 screening examinations in each group. Arbitration by a third subspecialty radiologist was a part of their standard double reading protocol. A second radiologist was consulted for 2.1% of the cases interpreted by single reading with CAD, but the consult might or might not have been related to CAD marks. They reported that the sensitivity of single reading with CAD was 90.4%, higher than the sensitivities of either single reading alone (81.4%) or double reading (88.0%). The recall rate was 10.6% for single reading with CAD, slightly higher than the recall rate of single reading alone (10.2%) but lower than that of double reading (11.9%). These relatively well-controlled studies showed that single reading with CAD is potentially an alternative to double reading, with a gain in sensitivity but at the expense of increased recalls, which can be reduced by arbitration similar to that in double reading.
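To put the reported percentages on a common footing, the differences can be expressed both in absolute percentage points and as relative changes. The short sketch below does this arithmetic using only the figures quoted above from [39] and [40]; the helper function names are ours.

```python
def absolute_change(before, after):
    """Difference in percentage points."""
    return after - before

def relative_change(before, after):
    """Change as a percentage of the starting value."""
    return (after - before) / before * 100.0

# (before, after) pairs in percent, as quoted in the text
comparisons = {
    # CADET II [39]: overall recall rate rose from 3.4% to 3.9% with single reading with CAD
    "CADET II recall rate": (3.4, 3.9),
    # Gromet [40]: single reading alone vs. single reading with CAD
    "Gromet sensitivity": (81.4, 90.4),
    "Gromet recall rate": (10.2, 10.6),
}

for name, (before, after) in comparisons.items():
    print(f"{name}: {absolute_change(before, after):+.1f} points, "
          f"{relative_change(before, after):+.1f}% relative")
```

A change of half a percentage point in recall rate looks small in absolute terms, but relative to a baseline of 3.4% it is a noticeable increase in the number of women recalled, which is why recall rate is reported alongside sensitivity.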
Taylor et al. [41] conducted a meta-analysis of clinical studies comparing single reading with CAD or double reading to single reading alone. They compared the cancer detection rate per 1000 women screened (CDR) and the recall rate, and estimated the average odds ratios weighted by sample size over the studies in each group (Table 1). The results showed that double reading with arbitration improved the CDR without increasing the recall rate. Single reading with CAD for the matched studies increased the CDR but with a wide variation; however, without the benefit of arbitration, the recall rate increased significantly. The increase in recall rate for double reading without arbitration was more than twice that for single reading with CAD.
Table 1. Odds ratios (95% confidence interval) of increase in cancer detection rate and increase in recall rate obtained by comparison of single reading with CAD and double reading to single reading alone by Taylor et al. [41]
Reading method | Odds ratio of increase in cancer detection rate | Odds ratio of increase in recall rate |
---|---|---|
Single reading with CAD | | |
Matched (N=5) | 1.09 (0.92, 1.29) | 1.12 (1.08, 1.17) |
Unmatched (N=5) | 1.02 (0.93, 1.12) | 1.10 (1.08, 1.12) |
Double reading | | |
Unilateral (N=6) | 1.13 (1.06, 1.19) | 1.31 (1.29, 1.33) |
Mixed (N=3) | 1.07 (0.99, 1.15) | 1.21 (1.19, 1.24) |
Arbitration (N=8) | 1.08 (1.02, 1.15) | 0.94 (0.92, 0.96) |
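
For readers less familiar with the quantities in Table 1, the sketch below shows one common way to compute an odds ratio with a 95% confidence interval from event counts, and a sample-size-weighted average across studies. It rests on assumptions: the interval uses the standard log-scale (Woolf) approximation, the pooling is a plain weighted arithmetic mean, and the counts are hypothetical. The exact statistical procedure used in [41] is not described here and may differ.

```python
import math

def odds_ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Odds ratio of arm A versus arm B with a log-scale (Woolf) confidence interval."""
    a, b = events_a, total_a - events_a   # arm A: events, non-events
    c, d = events_b, total_b - events_b   # arm B: events, non-events
    odds_ratio = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

def weighted_mean_or(odds_ratios, sample_sizes):
    """Average of odds ratios weighted by study sample size."""
    return sum(o * n for o, n in zip(odds_ratios, sample_sizes)) / sum(sample_sizes)

# Hypothetical recall counts (recalled, screened) for two made-up studies:
# (with CAD recalled, with CAD screened, alone recalled, alone screened)
studies = [
    (450, 10000, 400, 10000),
    (1200, 20000, 1100, 20000),
]

results = [odds_ratio_ci(*s) for s in studies]
for or_, lo, hi in results:
    print(f"OR = {or_:.2f} ({lo:.2f}, {hi:.2f})")

pooled = weighted_mean_or([r[0] for r in results], [s[1] + s[3] for s in studies])
print(f"sample-size-weighted mean OR = {pooled:.2f}")
```

An interval that excludes 1, as for every recall-rate entry in Table 1, indicates a statistically significant change, whereas an interval that spans 1, as for the matched single reading with CAD detection rate, does not.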