Resource

Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning

Graphical Abstract

Authors: Daniel S. Kermany, Michael Goldbaum, Wenjia Cai, ..., M. Anthony Lewis, Huimin Xia, Kang Zhang

Correspondence: kang.zhang@gmail.com

In Brief: Image-based deep learning classifies macular degeneration and diabetic retinopathy using retinal optical coherence tomography images and has potential for generalized applications in biomedical image interpretation and medical decision making.

Highlights
• An artificial intelligence system using transfer learning techniques was developed
• It effectively classified images for macular degeneration and diabetic retinopathy
• It also accurately distinguished bacterial and viral pneumonia on chest X-rays
• This has potential for generalized high-impact application in biomedical imaging

Kermany et al., 2018, Cell 172, 1122–1131
February 22, 2018 © 2018 Elsevier Inc.
https://doi.org/10.1016/j.cell.2018.02.010

Resource

Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning

Daniel S. Kermany,1,2,14 Michael Goldbaum,2,14 Wenjia Cai,2,14 Carolina C.S. Valentim,2,14 Huiying Liang,1,14 Sally L. Baxter,2,14 Alex McKeown,3 Ge Yang,2 Xiaokang Wu,4 Fangbing Yan,4 Justin Dong,1 Made K. Prasadha,2 Jacqueline Pei,1,2 Magdalene Y.L. Ting,2 Jie Zhu,1,5 Christina Li,2 Sierra Hewett,1,2 Jason Dong,1 Ian Ziyar,2 Alexander Shi,2 Runze Zhang,2 Lianghong Zheng,6 Rui Hou,5 William Shi,2 Xin Fu,1,2 Yaou Duan,2 Viet A.N. Huu,1,2 Cindy Wen,2 Edward D. Zhang,1,2 Charlotte L. Zhang,1,2 Oulan Li,1,2 Xiaobo Wang,7 Michael A. Singer,8 Xiaodong Sun,9 Jie Xu,10 Ali Tafreshi,3 M. Anthony Lewis,11 Huimin Xia,1 and Kang Zhang1,2,4,12,13,15,*

1Guangzhou Women and Children's Medical Center, Guangzhou Medical University, 510005 Guangzhou, China
2Shiley Eye Institute, Institute for Engineering in Medicine, Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA 92093, USA
3Heidelberg Engineering, Heidelberg, Germany
4Molecular Medicine Research Center, State Key Laboratory of Biotherapy, The National Clinical Research Center of Senile Disease, West China Hospital, Sichuan University, Chengdu, China
5Guangzhou KangRui Biological Pharmaceutical Technology Company, 510005 Guangzhou, China
6YouHealth AI, 510005 Guangzhou, China
7Beihai Hospital, Dalian, 116021, China
8Department of Ophthalmology, University of Texas Health Science Center, San Antonio, TX 78229, USA
9Shanghai Key Laboratory of Ocular Fundus Diseases, Shanghai General Hospital, Shanghai JiaoTong University, 200080 Shanghai, China
10Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Beijing, China
11Qualcomm, San Diego, CA 92121, USA
12Guangzhou Regenerative Medicine and Health Guangdong Laboratory, 510005 Guangzhou, China
13Veterans Administration Healthcare System, San Diego, CA 92037, USA
14These authors contributed equally
15Lead Contact
*Correspondence: kang.zhang@gmail.com
https://doi.org/10.1016/j.cell.2018.02.010

SUMMARY

The implementation of clinical-decision support algorithms for medical imaging faces challenges with reliability and interpretability. Here, we establish a diagnostic tool based on a deep-learning framework for the screening of patients with common treatable blinding retinal diseases. Our framework utilizes transfer learning, which trains a neural network with a fraction of the data required by conventional approaches.
Applying this approach to a dataset of optical coherence tomography images, we demonstrate performance comparable to that of human experts in classifying age-related macular degeneration and diabetic macular edema. We also provide a more transparent and interpretable diagnosis by highlighting the regions recognized by the neural network. We further demonstrate the general applicability of our AI system for diagnosis of pediatric pneumonia using chest X-ray images. This tool may ultimately aid in expediting the diagnosis and referral of these treatable conditions, thereby facilitating earlier treatment and improved clinical outcomes.

INTRODUCTION

Artificial intelligence (AI) has the potential to revolutionize disease diagnosis and management by performing classification tasks difficult for human experts and by rapidly reviewing immense amounts of images. Despite this potential, the clinical interpretability and feasible preparation of AI models remain challenging.

The traditional algorithmic approach to image analysis for classification relied on (1) handcrafted object segmentation, followed by (2) identification of each segmented object using statistical classifiers or shallow neural computational machine-learning classifiers designed specifically for each class of objects, and finally (3) classification of the image (Goldbaum et al., 1996). Creating and refining multiple classifiers required many skilled people and much time and was computationally expensive (Chaudhuri et al., 1989; Hoover and Goldbaum, 2003; Hoover et al., 2000).

The development of convolutional neural network layers has allowed for significant gains in the ability to classify images and detect objects in a picture (Krizhevsky et al., 2017; Zeiler and Fergus, 2014). These are multiple processing layers to which image analysis filters, or convolutions, are applied. The abstracted representation of images within each layer is constructed by systematically convolving multiple filters across the image, producing a feature map that is used as input to the following layer. This architecture makes it possible to process images in the form of pixels as input and to give the desired classification as output. The image-to-classification approach in one classifier replaces the multiple steps of previous image analysis methods.

One method of addressing a lack of data in a given domain is to leverage data from a similar domain, a technique known as transfer learning. Transfer learning has proven to be a highly effective technique, particularly when faced with domains with limited data (Donahue et al., 2013; Razavian et al., 2014; Yosinski et al., 2014). Rather than training a completely blank network, transfer learning fixes the weights in the lower layers, which are already optimized to recognize the structures found in images in general, and retrains only the weights of the upper layers with backpropagation. The model can thereby learn the distinguishing features of a specific category of images, such as images of the eye, much faster, with significantly fewer training examples, and with less computational power (Figure 1).

Figure 1. Schematic of a Convolutional Neural Network
Schematic depicting how a convolutional neural network trained on the ImageNet dataset of 1,000 categories can be adapted to significantly increase the accuracy and shorten the training duration of a network trained on a novel dataset of OCT images. The locally connected (convolutional) layers are frozen and transferred into a new network, while the final, fully connected layers are recreated and retrained from random initialization on top of the transferred layers.
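To make the scheme in Figure 1 concrete, the following is a minimal sketch of such a fixed-feature-extractor setup in TensorFlow/Keras, using the Inception V3 architecture pretrained on ImageNet named in the STAR Methods; the input size, pooling choice, and output head shown here are illustrative assumptions, not the authors' released code.

```python
import tensorflow as tf

# Convolutional layers pretrained on ImageNet, used as a fixed feature extractor.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(299, 299, 3))
base.trainable = False  # lower-layer weights stay frozen (feed-forward only)

# The final, fully connected layer is recreated and retrained from random
# initialization on top of the transferred layers.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(4, activation="softmax"),  # CNV, DME, drusen, normal
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Only the small dense head is trained; every image simply flows forward through the frozen convolutional stack, which is what makes training fast even on modest hardware.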
In this study, we sought to develop an effective transfer learning algorithm to process medical images and provide an accurate and timely diagnosis of the key pathology in each image. The primary illustration of this technique involved optical coherence tomography (OCT) images of the retina, but the algorithm was also tested on a cohort of pediatric chest radiographs to validate the generalizability of the technique across multiple imaging modalities.

RESULTS

The primary application of our transfer learning algorithm was the diagnosis of retinal OCT images. Spectral-domain OCT uses light to capture high-resolution in vivo optical cross sections of the retina that can be assembled into three-dimensional volume images of living retinal tissue. It has become one of the most commonly performed medical imaging procedures, with approximately 30 million OCT scans performed each year worldwide (Swanson and Fujimoto, 2017). OCT imaging is now a standard of care for guiding the diagnosis and treatment of some of the leading causes of blindness worldwide: age-related macular degeneration (AMD) and diabetic macular edema. Almost 10 million individuals suffer from AMD in the United States, and each year, more than 200,000 people develop choroidal neovascularization, a severe blinding form of advanced AMD (Ferrara, 2010; Friedman et al., 2004; Wong et al., 2014). In addition, nearly 750,000 individuals aged 40 or older suffer from diabetic macular edema (Varma et al., 2014), a vision-threatening form of diabetic retinopathy that involves the accumulation of fluid in the central retina. The prevalence of these diseases will likely increase even further over time due to the aging population and the global diabetes epidemic. Fortunately, the advent and widespread utilization of anti-vascular endothelial growth factor (anti-VEGF) medications has revolutionized the treatment of exudative retinal diseases (Kaiser et al., 2007; Ferrara, 2010), allowing patients to retain useful vision and quality of life. OCT is critical to guiding the administration of anti-VEGF therapy by providing a clear cross-sectional representation of the retinal pathology in these conditions (Figure 2A), allowing visualization of individual retinal layers, which is impossible with clinical examination by the human eye or by color fundus photography.

Figure 2. Representative Optical Coherence Tomography Images and the Workflow Diagram
(A) (Far left) Choroidal neovascularization (CNV) with neovascular membrane (white arrowheads) and associated subretinal fluid (arrows). (Middle left) Diabetic macular edema (DME) with retinal-thickening-associated intraretinal fluid (arrows). (Middle right) Multiple drusen (arrowheads) present in early AMD. (Far right) Normal retina with preserved foveal contour and absence of any retinal fluid/edema.
(B) Workflow diagram showing the overall experimental design, describing the flow of optical coherence tomography (OCT) images through the labeling and grading process, followed by creation of the transfer learning model, which then underwent training and subsequent testing. The training dataset included only images that passed sufficient quality and diagnostic standards from the initially collected dataset.
See also Table S1.
Patient and Image Characteristics
We initially obtained 207,130 OCT images. Of these, 108,312 images (37,206 with choroidal neovascularization, 11,349 with diabetic macular edema, 8,617 with drusen, and 51,140 normal) from 4,686 patients passed initial image quality review and were used to train the AI system. The model was tested with 1,000 images (250 from each category) from 633 patients. Patient characteristics for each diagnosis category are listed in Table S1. After 100 epochs (iterations through the entire dataset), the training was stopped due to the absence of further improvement in both accuracy (Figure 3A) and cross-entropy loss (Figure 3B).

Figure 3. Plot Showing Performance in the Training and Validation Datasets Using TensorBoard
Accuracy is plotted against the training step (A), and cross-entropy loss is plotted against the training step (B) during the training of the multi-class classifier over the course of 10,000 steps. Plots were normalized with a smoothing factor of 0.6 to clearly visualize trends. The validation accuracy and loss show better performance, since images with more noise and lower quality were also included in the training set to reduce overfitting and help generalization of the classifier. Training dataset: orange. Validation dataset: blue. See also Figure S1.

Performance of the Model
We evaluated our AI system in diagnosing the most common blinding retinal diseases. The AI system categorized images with choroidal neovascularization and images with diabetic macular edema as "urgent referrals." These conditions demand relatively urgent referral to an ophthalmologist for definitive anti-VEGF treatment; if treatment is delayed, there is increased risk of bleeding, scarring, or other downstream complications that cause irreversible vision impairment. The system categorized images with drusen, which are lipid deposits present in the dry form of macular degeneration, as "routine referrals." Anti-VEGF medications are not indicated for dry macular degeneration; therefore, referral to an eye specialist for drusen is less urgent. Normal images were labeled for "observation."

In a multi-class comparison between choroidal neovascularization, diabetic macular edema, drusen, and normal, we achieved an accuracy of 96.6% (Figure 4), with a sensitivity of 97.8%, a specificity of 97.4%, and a weighted error of 6.6%. Receiver operating characteristic (ROC) curves were generated to evaluate the model's ability to distinguish urgent referrals (defined as choroidal neovascularization or diabetic macular edema) from drusen and normal exams. The area under the ROC curve was 99.9% (Figure 4).

We also trained a "limited model" classifying between the same four categories but using only 1,000 images randomly selected from each class during training, in order to compare transfer learning performance with limited data against results obtained with a large dataset. Using the same testing images, this model achieved an accuracy of 93.4%, with a sensitivity of 96.6%, a specificity of 94.0%, and a weighted error of 12.7%. The ROC curve distinguishing urgent referrals (i.e., distinguishing images with choroidal neovascularization or diabetic macular edema from normal images) had an area under the curve of 98.8%.
Binary classifiers were also implemented to compare each of choroidal neovascularization, diabetic macular edema, and drusen against normal using the same datasets, in order to determine a breakdown of the model's performance (Figure S1). The classifier distinguishing choroidal neovascularization images from normal images achieved an accuracy of 100.0%, with a sensitivity of 100.0% and a specificity of 100.0%. The area under the ROC curve was 100.0% (Figure S2A). The classifier distinguishing diabetic macular edema images from normal images achieved an accuracy of 98.2%, with a sensitivity of 96.8% and a specificity of 99.6%. The area under the ROC curve was 99.87% (Figure S2B). The classifier distinguishing drusen images from normal images achieved an accuracy of 99.0%, with a sensitivity of 98.0% and a specificity of 99.2%. The area under the ROC curve was 99.96% (Figure S2C).

Figure 4. Multi-class Comparison between Choroidal Neovascularization, Diabetic Macular Edema, Drusen, and Normal
(A) Receiver operating characteristic (ROC) curve for "urgent referrals" (CNV and DME detection) with human expert performance for comparison. The area under the ROC curve was 99.9%. The zoomed area shows that the most accurate model demonstrates a performance that rivals that of six human experts.
(B) Confusion table of the best model's classification of the validation image set. The model successfully scored all urgent referrals higher than observation.
(C) Weighted error results based on the penalties in Figure S4, depicting neural networks in gold and human experts in blue.
See also Figures S2, S3, and S4 and Table S2.

Comparison of the Model with Human Experts
An independent test set of 1,000 images from 633 patients was used to compare the AI network's referral decisions with the decisions made by human experts. Six experts with significant clinical experience in an academic ophthalmology center were instructed to make a referral decision on each test patient using only the patient's OCT images. Performance on the clinically most important decision, distinguishing patients needing urgent referral (those with choroidal neovascularization or diabetic macular edema) from normal patients, is displayed as a ROC curve, and this performance was comparable between the AI system and the human experts (Figure 4A).

Having established a standard expert performance evaluation system, we next compared the potential impact of patient referral decisions between our network and the human experts. The sensitivities and specificities of the experts were plotted on the ROC curve of the trained model, and the differences in diagnostic performance between the model and the human experts, measured by likelihood ratios, were determined to be statistically similar within a 95% confidence interval (Figure S3). However, the pure error rate does not accurately reflect the impact that a wrong referral decision might have on the outcome of an individual patient. To illustrate, a false-positive result occurs when a patient is normal or has drusen but is inaccurately labeled as an urgent referral; this can cause undue distress or unnecessary investigation for the patient and place extra burdens on the healthcare system. However, a false-negative result is far more serious, because in this instance, a patient with choroidal neovascularization or diabetic macular edema is not appropriately referred, which could result in irreversible visual loss. To account for these issues, weighted error scoring was incorporated during model evaluation and expert testing (Figure S4A). By assigning these penalty points to each decision made by the model and the experts, we computed the average error of each.
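As a minimal sketch of how such a weighted error can be computed, assuming the penalty scheme proposed in Figure S4A (an error score of 4 for urgent referrals scored as normal, 2 for urgent referrals scored as drusen, and 1 for all other incorrect answers), consider the following; the label encoding, function name, and reporting of the average penalty per decision are illustrative assumptions rather than the study's exact scoring code.

```python
# Penalty for predicting `pred` when the truth is `true`, per Figure S4A;
# correct answers cost 0, unlisted errors default to 1 below.
PENALTY = {
    ("CNV", "NORMAL"): 4, ("DME", "NORMAL"): 4,  # missed urgent referral
    ("CNV", "DRUSEN"): 2, ("DME", "DRUSEN"): 2,  # urgent referral scored as drusen
}

def weighted_error(true_labels, pred_labels):
    """Average penalty per referral decision."""
    total = 0.0
    for true, pred in zip(true_labels, pred_labels):
        if true != pred:
            total += PENALTY.get((true, pred), 1)  # other incorrect answers: 1
    return total / len(true_labels)

# Example: one missed urgent referral among four decisions.
print(weighted_error(["CNV", "DME", "DRUSEN", "NORMAL"],
                     ["CNV", "NORMAL", "DRUSEN", "NORMAL"]))  # 1.0
```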
The best convolutional neural network model yielded a score of 6.6% under this weighted error system. The weighted errors of the experts ranged from 0.4% to 10.5%, with a mean weighted error of 4.8% (Table S2). The exact breakdown of each expert's performance, correlating their predicted labels with the true labels, is depicted as confusion matrices in Figure S4B. As seen in Figure 4, the best model outperformed some human experts based on this weighted scale and on the ROC curve.

Occlusion Testing
We performed an occlusion test on 491 images to identify the areas contributing most to the neural network's assignment of the predicted diagnosis. This testing successfully identified the region of interest contributing the highest importance to the deep-learning algorithm in 94.7% of images (Figure 5A; see also Figure S5 for additional examples). Drusen were located correctly through occlusion testing in 100% of the images, while choroidal neovascularization yielded an accuracy of 94.0% and diabetic macular edema an accuracy of 91.0% (Table S3). Furthermore, the regions identified by occlusion testing were verified by human experts to be the most clinically significant areas of pathology.

Application of the AI System for Pneumonia Detection Using Chest X-Ray Images
To investigate the generalizability of our AI system in the diagnosis of common diseases, we applied the same transfer learning framework to the diagnosis of pediatric pneumonia. According to the World Health Organization (WHO), pneumonia kills about 2 million children under 5 years old every year and is consistently estimated as the single leading cause of childhood mortality (Rudan et al., 2008), killing more children than HIV/AIDS, malaria, and measles combined (Adegbola, 2012). The WHO reports that nearly all cases (95%) of new-onset childhood clinical pneumonia occur in developing countries, particularly in Southeast Asia and Africa. Bacterial and viral pathogens are the two leading causes of pneumonia (Mcluckie, 2009) but require very different forms of management. Bacterial pneumonia requires urgent referral for immediate antibiotic treatment, while viral pneumonia is treated with supportive care. Therefore, accurate and timely diagnosis is imperative. One key element of diagnosis is radiographic data, since chest X-rays are routinely obtained as standard of care and can help differentiate between different types of pneumonia (Figure S6). However, rapid radiologic interpretation of images is not always available, particularly in the low-resource settings where childhood pneumonia has the highest incidence and the highest rates of mortality. To this end, we also investigated the effectiveness of our transfer learning framework in classifying pediatric chest X-rays to detect pneumonia and, furthermore, to distinguish viral from bacterial pneumonia in order to facilitate rapid referrals for children needing urgent intervention.

We collected and labeled a total of 5,232 chest X-ray images from children, including 3,883 characterized as depicting pneumonia (2,538 bacterial and 1,345 viral) and 1,349 normal, from a total of 5,856 patients to train the AI system. The model was then tested with 234 normal images and 390 pneumonia images (242 bacterial and 148 viral) from 624 patients.
After 100 epochs (iterations through the entire dataset), the training was stopped due to the absence of further improvement in both loss and accuracy (Figures 6A and 6B). In the comparison of chest X-rays showing pneumonia versus normal chest X-rays, we achieved an accuracy of 92.8%, with a sensitivity of 93.2% and a specificity of 90.1%. The area under the ROC curve for detection of pneumonia versus normal was 96.8% (Figure 6E). Binary comparison of bacterial and viral pneumonia resulted in a test accuracy of 90.7%, with a sensitivity of 88.6% and a specificity of 90.9% (Figures 6C and 6D). The area under the ROC curve for distinguishing bacterial from viral pneumonia was 94.0% (Figure 6F).

DISCUSSION

In this study, we describe a general AI platform for the diagnosis and referral of two common causes of severe vision loss: diabetic macular edema and the choroidal neovascularization seen in neovascular AMD. By employing a transfer learning algorithm, our model demonstrated competitive performance in OCT image analysis without the need for a highly specialized deep-learning machine and without a database of millions of example images (STAR Methods). Moreover, the model's performance in diagnosing retinal OCT images was comparable to that of human experts with significant clinical experience with retinal diseases. When the model was trained with a much smaller number of images (about 1,000 from each class), it retained high performance in accuracy, sensitivity, specificity, and area under the ROC curve for achieving the correct diagnosis and referral, thereby illustrating the power of the transfer learning system to make highly effective classifications even with a very limited training dataset.

Although our AI platform was trained and validated using the Heidelberg Spectralis imaging system, the Digital Imaging and Communications in Medicine (DICOM) standards make the OCT images from different manufacturers (e.g., Zeiss and Optovue) reasonably consistent. The goal of this preliminary approach was to develop a system and demonstrate the soundness of the methods. Future studies could entail the use of images from different manufacturers in both the training and testing datasets so that the system will be universally useful. Moreover, the efficacy of the transfer learning technique for image analysis very likely extends beyond the realm of OCT images and ophthalmology. In principle, the techniques we have described here could potentially be employed in a wide range of medical images across multiple disciplines, and in fact, we provide a direct illustration of this wide applicability by demonstrating its efficacy in the analysis of chest X-ray images.
Figure 5. Occlusion Maps and Longitudinal Follow-up OCT Images Comparing Retinal Structural Changes before and after Anti-VEGF Therapy
(A) Occlusion maps highlighting areas of pathology in diabetic macular edema (left), choroidal neovascularization (middle), and drusen (right). An occlusion map was generated by convolving an occluding kernel across the input image. The occlusion map is created after prediction by assigning the softmax probability of the correct label to each occluded area. The occlusion map can then be superimposed on the input image to highlight the areas the model considered important in making its diagnosis.
(B and C) Horizontal cross-section OCT images through the fovea of patients with wet AMD (B) or diabetic retinopathy with macular edema (C) before and after three monthly intravitreal injections of bevacizumab. Both intraretinal and subretinal fluid (white arrows) lessened after treatment. Scar tissue of choroidal neovascularization remained (arrowheads). All visual acuity (VA) improved: 20/320 to 20/250 at 5 months (patient 1); 20/40 to 20/32 at 9 months (patient 2); 20/400 to 20/250 at 3 months (patient 3); 20/80 to 20/50 at 7 months (patient 4); 20/40 to 20/25 at 7 months (patient 5); and 20/32 to 20/25 at 7 months (patient 6).
See also Figure S5 and Table S3.

Occlusion testing was performed to identify the areas of greatest importance used by the model in assigning a diagnosis. The greatest benefit of an occlusion test is that it reveals insights into the decisions of neural networks, which are infamously known as "black boxes" with no transparency. Since this test was performed after training was completed, it demystified the algorithm without affecting its results. The occlusion test also confirmed that the network made its decisions using accurate distinguishing features, which can be shared with a healthcare professional. All areas containing drusen were recognized correctly in all images used for testing, while the diabetic macular edema and choroidal neovascularization occlusion tests occasionally did not present a clear point of interest. This is likely because the lesions and fluid pockets of choroidal neovascularization and diabetic macular edema are sometimes much larger than the occlusion window, while drusen tend to be smaller in size.

Figure 6. Plots Depicting Performance of Pneumonia Diagnosis Using Chest X-Ray Images in the Training and Validation Datasets Using TensorBoard
(A–F) Comparisons were made for pneumonia versus normal (A) with cross-entropy loss plotted against the training step (B), as well as comparisons between bacterial pneumonia and viral pneumonia (C) and the associated cross-entropy loss (D). Plots were normalized with a smoothing factor of 0.6 in order to clearly visualize trends. The area under the ROC curve for detecting pneumonia versus normal was 96.8% (E). The area under the ROC curve for detecting bacterial versus viral pneumonia was 94.0% (F). Training dataset: orange. Validation dataset: blue. See also Figure S6.

Although transfer learning allows the training of a highly accurate model with a relatively small training dataset, its performance would be inferior to that of a model trained from random initialization on an extremely large dataset of OCT images, since even the internal weights could then be directly optimized for OCT feature detection. However, in practice, a new convolutional neural network trained from random initialization, even with an unlimited supply of training data, would require weeks to achieve good accuracy, whereas the multi-class holdout model implemented using transfer learning finished training and testing on different data in under 2 hr. Each binary classification and the limited model converged to high accuracy in under 30 min. Since medical images are difficult to collect in the large amounts necessary to train a blank convolutional neural network, transfer learning using a model pre-trained on millions of various medical images would likely yield a more accurate model in much less time when retraining layers for other medical classifications.
The performance of our model depends highly on the weights of the pre-trained model. Therefore, its performance would likely be enhanced by pre-training on a larger ImageNet-scale dataset with more advanced deep-learning techniques and architectures. Further, the rapid progression and development of the field of convolutional neural networks outside of medical imaging would also improve the performance of our approach.

Finally, as mentioned earlier, we use OCT imaging as a demonstration of a generalized approach to medical image interpretation and subsequent decision making. Our framework effectively identified potential pathology on a tissue map to make a referral decision with performance comparable to (and sometimes even better than) that of human experts, enabling timely diagnosis of the two most common causes of irreversible severe vision loss. OCT is particularly useful in the management of retinal diseases because it has become critical to guiding anti-VEGF treatment for the intraretinal and/or subretinal fluid seen in many retinal conditions. This fluid often cannot be clearly visualized by the examiner's eyes or by color fundus photography. In addition, the OCT appearance often correlates well with visual acuity. The presence of fluid is typically associated with worse visual acuity, which improves once the fluid is resolved with anti-VEGF treatment (Figure 5B). As a testament to the value of this imaging modality, treatment decisions for exudative retinal diseases are now guided by OCT rather than by clinical examination or fundus photography, making this demonstration of AI-guided classification of images more clinically relevant than prior studies that have analyzed retinal fundus photographs, such as that from Gulshan et al. (2016). Given that OCT imaging has played such a crucial role in guiding treatment, extending the application of AI beyond diagnosis or classification of images and into the realm of making treatment recommendations is a promising area of future investigation.

Furthermore, our network represents a generalized platform that can potentially be applied to a wide range of medical imaging techniques (e.g., chest X-ray, MRI, computed tomography) to make clinical diagnostic decisions. We demonstrated this point by training our network on a dataset of chest X-ray images of pediatric pneumonia. Chest X-rays present a difficult classification task due to the relatively large number of variable objects, specifically the imaged areas outside the lungs that are irrelevant to the diagnosis of pneumonia. The resulting high-accuracy model suggests that this AI system has the potential to learn effectively from increasingly complicated images with a high degree of generalization using a relatively small repository of data. By demonstrating efficacy with multiple imaging modalities and a wide range of pathology, this transfer learning framework presents a compelling system for further exploration and analysis in biomedical imaging and for more generalized application to an automated community-based AI system for the diagnosis and triage of common human diseases. By providing our data and code in a publicly available database, we also hope that other biomedical researchers may use our work as a resource to improve the performance of future models and help drive the field forward.
This could facilitate screening programs and create more efficient referral systems in all of medicine, particularly in remote or low-resource areas, leading to a broad clinical and public health impact.

STAR★METHODS

Detailed methods are provided in the online version of this paper and include the following:

• KEY RESOURCES TABLE
• CONTACT FOR REAGENT AND RESOURCE SHARING
• EXPERIMENTAL MODEL AND SUBJECT DETAILS
  - Images from Human Subjects
• METHOD DETAILS
  - Image Labeling
  - Transfer Learning Methods
  - Expert Comparisons
  - Occlusion Test
• QUANTIFICATION AND STATISTICAL ANALYSIS
• DATA AND SOFTWARE AVAILABILITY

SUPPLEMENTAL INFORMATION

Supplemental Information includes six figures and three tables and can be found with this article online at https://doi.org/10.1016/j.cell.2018.02.010. A video abstract is available at http://dx.doi.org/10.1016/j.cell.2018.02.010#mmc2.

ACKNOWLEDGMENTS

This study was funded by the National Key Research and Development Program of China (2017YFC1104600), the National Natural Science Foundation of China (81771629 and 81700882), Guangzhou Women and Children's Medical Center, Guangzhou Regenerative Medicine and Health Guangdong Laboratory, the Richard Annesser Fund, the Michael Martin Fund, and the Dick and Carol Hertzberg Fund.

AUTHOR CONTRIBUTIONS

D.S.K., M.A.L., W.C., Justin Dong, C.C.S.V., G.Y., H.L., A.M., X. Wu, F.Y., J.Z., S.L.B., M.K.P., J.P., A.S., M.Y.L.T., C.L., S.H., Jason Dong, R.Z., L.Z., R.H., W.S., X.F., Y.D., V.A.N.H., I.Z., C.W., X. Wang, E.D.Z., C.L.Z., O.L., J.X., A.T., X.S., M.A.S., and H.X. collected and analyzed the data. K.Z. conceived the project. K.Z., D.S.K., M.G., and S.L.B. wrote the manuscript. All authors discussed the results and reviewed the manuscript.

DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: November 1, 2017
Revised: December 31, 2017
Accepted: February 1, 2018
Published: February 22, 2018

REFERENCES

Adegbola, R.A. (2012). Childhood pneumonia as a global health priority and the strategic interest of the Bill & Melinda Gates Foundation. Clin. Infect. Dis. 54 (Suppl 2), S89–S92.

Chaudhuri, S., Chatterjee, S., Katz, N., Nelson, M., and Goldbaum, M. (1989). Detection of blood vessels in retinal images using two-dimensional matched filters. IEEE Trans. Med. Imaging 8, 263–269.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2013). DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. Proceedings of the 31st International Conference on Machine Learning 32, 647–655.

Ferrara, N. (2010). Vascular endothelial growth factor and age-related macular degeneration: from basic science to therapy. Nat. Med. 16, 1107–1111.

Friedman, D.S., O'Colmain, B.J., Muñoz, B., Tomany, S.C., McCarty, C., de Jong, P.T., Nemesure, B., Mitchell, P., and Kempen, J.; Eye Diseases Prevalence Research Group (2004). Prevalence of age-related macular degeneration in the United States. Arch. Ophthalmol. 122, 564–572.

Goldbaum, M., Moezzi, S., Taylor, A., Chatterjee, S., Boyd, J., Hunter, E., and Jain, R. (1996). Automated diagnosis and image understanding with object extraction, object classification, and inferencing in retinal images. Proceedings of 3rd IEEE International Conference on Image Processing 3, 695–698.
Gulshan, V., Peng, L., Coram, M., Stumpe, M.C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., et al. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA 316, 2402–2410.

Hoover, A., and Goldbaum, M. (2003). Locating the optic nerve in a retinal image using the fuzzy convergence of the blood vessels. IEEE Trans. Med. Imaging 22, 951–958.

Hoover, A., Kouznetsova, V., and Goldbaum, M. (2000). Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Med. Imaging 19, 203–210.

Kaiser, P.K., Brown, D.M., Zhang, K., Hudson, H.L., Holz, F.G., Shapiro, H., Schneider, S., and Acharya, N.R. (2007). Ranibizumab for predominantly classic neovascular age-related macular degeneration: subgroup analysis of first-year ANCHOR results. Am. J. Ophthalmol. 144, 850–857.

Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2017). ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90.

Lee, C.S., Baughman, D.M., and Lee, A.Y. (2016). Deep Learning Is Effective for the Classification of OCT Images of Normal versus Age-Related Macular Degeneration. Ophthalmol. Retina 1, 322–327.

Mcluckie, A. (2009). Respiratory disease and its management, Volume 57 (Springer).

Razavian, A.S., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). CNN Features Off-the-Shelf: An Astounding Baseline for Recognition. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 512–519.

Rudan, I., Boschi-Pinto, C., Biloglav, Z., Mulholland, K., and Campbell, H. (2008). Epidemiology and etiology of childhood pneumonia. Bull. World Health Organ. 86, 408–416.

Swanson, E.A., and Fujimoto, J.G. (2017). The ecosystem that powered the translation of OCT from fundamental research to clinical and commercial impact [Invited]. Biomed. Opt. Express 8, 1638–1664.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.

Varma, R., Bressler, N.M., Doan, Q.V., Gleeson, M., Danese, M., Bower, J.K., Selvin, E., Dolan, C., Fine, J., Colman, S., and Turpcu, A. (2014). Prevalence of and risk factors for diabetic macular edema in the United States. JAMA Ophthalmol. 132, 1334–1340.

Wong, W.L., Su, X., Li, X., Cheung, C.M., Klein, R., Cheng, C.Y., and Wong, T.Y. (2014). Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. Lancet Glob. Health 2, e106–e116.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? NIPS'14 Proceedings of the 27th International Conference on Neural Information Processing Systems 2, 3320–3328.

Zeiler, M.D., and Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. Lect. Notes Comput. Sci. 8689, 818–833.

STAR★METHODS

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited Data
OCT and chest X-ray images and codes | https://data.mendeley.com/datasets/rscbjbr9sj/2 | N/A
Software and Algorithms
TensorFlow | https://www.tensorflow.org/ | N/A
ImageNet | www.image-net.org | N/A

CONTACT FOR REAGENT AND RESOURCE SHARING

Further information and requests for resources and classifiers should be directed to and will be fulfilled by the Lead Contact, Kang Zhang (kang.zhang@gmail.com). There are no restrictions on the use of the materials disclosed.
EXPERIMENTAL MODEL AND SUBJECT DETAILS

Images from Human Subjects
Optical coherence tomography (OCT) images (Spectralis OCT, Heidelberg Engineering, Germany) were selected from retrospective cohorts of adult patients from the Shiley Eye Institute of the University of California San Diego, the California Retinal Research Foundation, Medical Center Ophthalmology Associates, the Shanghai First People's Hospital, and the Beijing Tongren Eye Center between July 1, 2013 and March 1, 2017. All OCT imaging was performed as part of patients' routine clinical care. There were no exclusion criteria based on age, gender, or race. We searched local electronic medical record databases for diagnoses of choroidal neovascularization, diabetic macular edema, drusen, and normal to initially assign images. A horizontal foveal cut of each OCT scan was downloaded in a standard image format according to the manufacturer's software and instructions.

Chest X-ray images (anterior-posterior) were selected from retrospective cohorts of pediatric patients aged one to five years from Guangzhou Women and Children's Medical Center, Guangzhou. All chest X-ray imaging was performed as part of patients' routine clinical care.

Institutional Review Board (IRB)/Ethics Committee approvals were obtained. The work was conducted in a manner compliant with the United States Health Insurance Portability and Accountability Act (HIPAA) and was adherent to the tenets of the Declaration of Helsinki.

METHOD DETAILS

OCT examinations were interpreted to confirm a diagnosis, and referral decisions were made thereafter ("urgent referral" for diagnoses of choroidal neovascularization or diabetic macular edema, "routine referral" for drusen, and "observation only" for normal). The dataset represents the most common medical retina patients presenting and receiving treatment at all participating clinics. Chest X-ray examinations were interpreted to confirm a diagnosis, and referral decisions were made thereafter ("urgent referral" for diagnoses of bacterial pneumonia, "supportive care" for viral pneumonia, and "observation only" for normal).

Image Labeling
Before training, each image went through a tiered grading system consisting of multiple layers of trained graders of increasing expertise for verification and correction of image labels. Each image imported into the database started with a label matching the most recent diagnosis of the patient. The first tier of graders consisted of undergraduate and medical students who had taken and passed an OCT interpretation course review. This first tier conducted initial quality control and excluded OCT images containing severe artifacts or significant reductions in image resolution. The second tier of graders consisted of four ophthalmologists who independently graded each image that had passed the first tier. The presence or absence of choroidal neovascularization (active or in the form of subretinal fibrosis), macular edema, drusen, and other pathologies visible on the OCT scan was recorded. Finally, a third tier of two senior independent retinal specialists, each with over 20 years of clinical retina experience, verified the true labels for each image. The dataset selection and stratification process is displayed in a CONSORT-style diagram in Figure 2B. To account for human error in grading, a validation subset of 993 scans was graded separately by two ophthalmologist graders, with disagreements in clinical labels arbitrated by a senior retinal specialist.
For the analysis of chest X-ray images, all chest radiographs were initially screened for quality control by removing all low-quality or unreadable scans. The diagnoses for the images were then graded by two expert physicians before being cleared for training the AI system. In order to account for any grading errors, the evaluation set was also checked by a third expert.

Transfer Learning Methods
Using TensorFlow, we adapted an Inception V3 architecture pretrained on the ImageNet dataset (Szegedy et al., 2016). Retraining consisted of initializing the convolutional layers with loaded pretrained weights and retraining the final, softmax layer to recognize our classes from scratch. In this study, the convolutional layers were frozen and used as fixed feature extractors. The convolutional "bottlenecks" are the values of each training and testing image after it has passed through the frozen layers of the model; since the convolutional weights are not updated, these values are calculated once and stored in order to eliminate redundant computation and speed up training. The newly initialized network then takes the image bottlenecks as input and is retrained to classify our specific categories. Attempts at "fine-tuning" the convolutional layers, by unfreezing and updating the pretrained weights on our medical images using backpropagation, tended to decrease model performance due to overfitting (Figure 1).

The Inception model was trained on an Ubuntu 16.04 computer with two Intel Xeon CPUs and 256 GB of RAM, using an NVIDIA GTX 1080 GPU with 8 GB of memory for training and testing. Training of layers was performed by stochastic gradient descent in batches of 1,000 images per step using an Adam optimizer with a learning rate of 0.001. Training on all categories was run for 10,000 steps, or 100 epochs, since training of the final layers had converged by then for all classes. Holdout method testing was performed after every step, using a test partition containing images from patients independent of those represented in the training partition, by passing each image through the network without performing gradient descent or backpropagation, and the best-performing model was kept for analysis.

Expert Comparisons
In order to evaluate our model in the context of clinical experts, a validation set of 1,000 images (633 patients), independent of the patients in the training set, was used to compare our network's referral decisions with the decisions made by human experts. Weighted error scoring was used to reflect the fact that a false-negative result (failing to refer) is more detrimental than a false-positive result (making a referral when it was not warranted). Using these weighted penalty points, error rates were computed for the model and for each of the human experts.

Occlusion Test
Similar to the methods described by Lee et al. (2016) and Zeiler and Fergus (2014), an occlusion test was performed to identify the areas contributing most to the neural network's assignment of the predicted diagnosis. A blank 20 × 20 pixel box was systematically moved across every possible position in the image, and the probabilities of the disease were recorded. The highest drop in probability marks the region of interest that contributed the highest importance to the deep-learning algorithm (Figure 5A; see also Figure S5 for additional examples).
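The following is a minimal sketch of this occlusion procedure, assuming a trained Keras classifier `model` and a single preprocessed image array; the stride, occlusion value, and function name are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def occlusion_map(model, img, target_class, box=20, stride=10):
    """Slide a blank box x box patch over `img` (H, W, C) and record the
    model's softmax probability for `target_class` at each position."""
    h, w, _ = img.shape
    rows = (h - box) // stride + 1
    cols = (w - box) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = img.copy()
            y, x = i * stride, j * stride
            occluded[y:y + box, x:x + box, :] = 0.0  # blank occluding box
            probs = model.predict(occluded[np.newaxis, ...], verbose=0)
            heat[i, j] = probs[0, target_class]
    # Positions where the probability drops the most mark the region of
    # interest that contributed the highest importance to the prediction.
    return heat
```

Upsampling `heat` to the image size and superimposing it on the input yields the occlusion maps shown in Figure 5A.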
QUANTIFICATION AND STATISTICAL ANALYSIS

The 207,130 images collected were reduced to 108,312 OCT images (from 4,686 patients), which were used for training the AI platform. Another subset of 633 patients not in the training set was collected, based on a sample-size requirement of 583 patients to detect sensitivity and specificity at a 0.05 margin of error and 95% confidence. The test images (n = 1,000) were used to evaluate model and human expert performance.

Receiver operating characteristic (ROC) curves plot the true positive rate (sensitivity) versus the false positive rate (1 – specificity). ROC curves were generated from the classification probabilities of urgent referral versus otherwise and the true labels of each test image, using the ROC function of the Python scikit-learn library. The area under the ROC curve is a measure of overall performance: it is the probability that the classifier will rank a randomly chosen "urgent referral" instance higher than a randomly chosen normal or drusen instance. Accuracy was measured by dividing the number of correctly labeled images by the total number of test images. Sensitivity and specificity were determined by dividing the number of correctly labeled urgent referrals by the total number of urgent referrals and the number of correctly labeled non-urgent referrals by the total number of non-urgent referrals, respectively.
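As a minimal sketch of how these curves can be produced with the scikit-learn ROC functions named above (the toy labels and scores below are illustrative placeholders, not study data):

```python
from sklearn.metrics import roc_curve, auc

# Illustrative inputs: 1 = urgent referral (CNV or DME), 0 = drusen or normal;
# scores are the classifier's urgent-referral probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.95, 0.80, 0.40, 0.35, 0.10, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # false/true positive rates
print(auc(fpr, tpr))  # area under the ROC curve
```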
DATA AND SOFTWARE AVAILABILITY

All deep-learning methods were implemented using TensorFlow (https://www.tensorflow.org). ImageNet, a public database of images, can be found at https://www.image-net.org. Datasets of high-resolution JPEG OCT and chest X-ray images are deposited in the public Mendeley database (https://doi.org/10.17632/rscbjbr9sj.3).

Supplemental Figures

Figure S1. Plots Showing Binary Performance in the Training and Validation Datasets Using TensorBoard, Related to Figure 3
Comparisons were made for choroidal neovascularization (CNV) versus normal (A), diabetic macular edema (DME) versus normal (B), and drusen versus normal (C). Plots were normalized with a smoothing factor of 0.6 in order to clearly visualize trends. The validation accuracy and loss show better performance, since images with more noise and lower quality were also included in the training set to reduce overfitting and help generalization of the classifier. Training dataset: orange. Validation dataset: blue.

Figure S2. Receiver Operating Characteristic Curves for Binary Classifiers, Related to Figure 4
The corresponding areas under the ROC curve (AUROC) are 100% for choroidal neovascularization (CNV) versus normal (A), 99.87% for diabetic macular edema (DME) versus normal (B), and 99.96% for drusen versus normal (C). The straight vertical and horizontal lines in (A) and the nearly straight lines in (B) and (C) demonstrate that the binary convolutional neural network models have near-perfect classification performance.

Figure S3. Plots Depicting the Positive and Negative Likelihood Ratios with Their Corresponding 95% Confidence Intervals Marked, Related to Figure 4
(A) The positive likelihood ratio is defined as the true positive rate over the false positive rate, so that an increasing likelihood ratio greater than 1 indicates increasing probability that the predicted result is associated with the disease.
(B) The negative likelihood ratio is defined as the false negative rate over the true negative rate, so that a decreasing likelihood ratio less than 1 indicates increasing probability that the predicted result is associated with the absence of disease.
The confidence intervals show that the best trained model demonstrated statistically similar screening performance when compared to human experts.

Figure S4. Proposed Penalties for Incorrect Labeling during Weighted Error Calculations and Confusion Matrices of Experts Grading OCT Images, Related to Figure 4
(A) The penalties include an error score of 4 for "urgent referrals" scored as normal and an error score of 2 for "urgent referrals" scored as drusen. All other incorrect answers carry an error score of 1.
(B) The results for each of the six human experts are depicted here as confusion matrices comparing the true labels and the predicted labels (NORMAL, DRUSEN, CNV, DME) for each individual grader.

Figure S5. Occlusion Maps of Diabetic Macular Edema, Choroidal Neovascularization, and Drusen, Related to Figure 5
(Top) Diabetic macular edema (DME), (middle) choroidal neovascularization (CNV), and (bottom) drusen. Additional examples of occlusion test images, illustrating how an occluding kernel was convolved across the input image to identify the areas contributing to the algorithm's determination of the diagnosis.

Figure S6. Illustrative Examples of Chest X-Rays in Patients with Pneumonia, Related to Figure 6
The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification. Bacterial pneumonia (middle) typically exhibits a focal lobar consolidation, in this case in the right upper lobe (white arrows), whereas viral pneumonia (right) manifests with a more diffuse "interstitial" pattern in both lungs.