Simple Words over Rich Imaging: Accurate Brain Disease Classification via Language Model Analysis of Radiological Reports ========================================================================================================================= * Xin Gao * Meihui Zhang * Longfei Chen * Jun Qiu * Shanbo Zhao * Junjie Li * Tiantian Hua * Ying Jin * Zhiqiang Wu * Haotian Hou * Yunling Wang * Wei Zhao * Yuxin Li * Yunyun Duan * Chuyang Ye * Yaou Liu ## ABSTRACT Brain diseases exert profound detrimental effects on human health by affecting the central nervous system. Accurate automated diagnosis of brain diseases is imperative to delay the progression of illness and enhance long-term prognosis. However, existing image-based diagnostic approaches struggle to achieve satisfactory performance due to the high dimensionality of imaging data. Radiological reports, which are required in clinical routine to describe image findings, provide a more straightforward comprehension of the imaging data, yet they have been neglected in automated brain disease classification. In this work, we explore automated brain disease classification via radiological reports and language models and compare the results with conventional image-based methods. Specifically, in the report-based diagnostic approach, we fine-tune Pre-trained Language Models (PLMs) and Large Language Models (LLMs) based on the findings part of radiological reports to achieve disease classification. Four clinically relevant brain disease classification tasks were performed in our experiments, involving 12 datasets with a total number of 14,970 patients, including two independent validation sets. The best language model reached an average area under the receiver operating characteristic curve (AUC) of 84.75%, an average accuracy (ACC) of 79.48%, and an average F1-score of 79.45%. Compared with the best image-based model, it achieved an average improvement of 10.34%, 10.75%, and 9.95% in terms of AUC, ACC, and F1-score, respectively. The language model also outperformed junior radiologists by 9.47% in terms of ACC. Moreover, the report-based model exhibited better adaptability to missing image contrasts and cross-site data variability than image-based models. Together, these results show that brain disease classification via language model analysis of radiological reports can be more reliable than image-based classification, and our work demonstrates the potential of using radiological reports for accurate diagnosis of brain diseases. ## Introduction Brain diseases seriously threaten the health and wellness of millions of people worldwide1. Accurate diagnosis of brain diseases, such as genetic marker testing, pathological examination, and tumor grading, enables the formulation of personalized treatment plans, which can facilitate precise early interventions2. To assist physicians in further improving diagnostic and treatment efficiency, various automated diagnostic models based on deep learning (DL) are developed according to the demands of different scenarios, for example, towards providing timely diagnosis in emergencies3 or precise diagnosis in routine clinical settings4. Imaging examinations are pivotal tools for assessing the conditions of brain diseases. Currently, physicians typically scrutinize imaging data, such as 3D magnetic resonance imaging (MRI), to formulate diagnosis of brain diseases. Therefore, existing automated brain disease diagnostic methods largely rely on imaging information5. However, conventional image-based models struggle to produce satisfactory results. In particular, brain images are typically high-dimensional with 3D imaging consisting of multiple slices and contrasts, and they are thus complex to analyze6,7. It is challenging for image-based brain disease classification models to learn a sufficiently disease-relevant image representation, especially when key lesions are relatively small8,9. Moreover, optimal performance of an image-based model is restricted to a specific combination of input image contrasts, and the performance is also sensitive to cross-site domain shift when training and test data originate from different sites1,10. This compromises the applicability of image-based models in real-world scenarios, as there can be different choices of image acquisitions and data acquired at various hospitals. Therefore, it is imperative to develop innovative automated brain disease classification methods to achieve better diagnosis. Radiological reports, which always accompany radiological images as required by clinical routine, are written by radiologists as part of their workflow. The reports comprise concise distillation of crucial information extracted by radiologists from the images and can potentially be effective for diagnosis with their reduced dimensionality11,12. However, brain disease classification based on radiological reports is still unexplored by existing works. In this work, we investigated a more effective approach to brain disease classification by analyzing radiological reports with language models. In particular, as existing research has demonstrated that Pre-trained Language Models (PLMs) and Large Language Models (LLMs) have the capacity for understanding and analyzing textual information through pre-training on enormous text corpora13–18, both of them were adopted to take reports as input and output the disease classification results after task-specific fine-tuning. We performed experiments on four different clinically relevant brain disease classification tasks, which involved 12 datasets and a total number of 14,970 patients. The report-based model achieved excellent performance. Our best performing language model exhibited an average area under the receiver operating characteristic curve (AUC) of 84.75%, an average accuracy (ACC) of 79.48%, and an average F1-score of 79.45%. It achieved an average improvement of 10.34%, 10.75%, and 9.95% in terms of AUC, ACC, and F1-score, respectively, compared with the best image-based model, and outperformed junior radiologists by 9.47% and 9.49% in terms of ACC and F1-score, respectively. Furthermore, the results on test data without full image contrasts and external datasets show that the language model can better address missing image contrasts and cross-site data variability than image-based models. Our exploration yields a promisingly more effective strategy for accurate brain disease diagnosis in real-world clinical scenarios. ## Results ### Study design An overview of the study design is shown in Fig. 1. We used 12 in-house datasets for the experiments, as existing public datasets for brain diseases (such as ADNI19 and BraTS20) did not contain radiological reports and public datasets with paired images and reports (such as MIMIC-CXR21, IU X-Ray22, and CheXpert23) focused on 2D chest imaging23. A detailed description of the 12 datasets are given in Table 1. Four brain disease classification tasks were considered, which were isocitrate dehydrogenase (IDH) genotyping, 1p/19q co-deletion identification, World Health Organization (WHO) grading, and brain tumor classification. View this table: [Table 1.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T1) Table 1. The detailed description of the in-house datasets. The numbers of subjects are indicated for each dataset, each category, and the training/validation/test split. GBM: glioblastoma; PCNSL: primary central nervous system lymphoma. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/11/15/2024.11.13.24317214/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/F1) Figure 1. An overview of the proposed work. 12 in-house datasets were collected from three hospitals. Each dataset comprises the radiological report, 3D MRI scans, and gold standard diagnosis result for each patient. Automated brain disease classification was achieved by analyzing radiological reports with language models. The models were comprehensively evaluated on four tasks. The language models were compared with conventional image-based models and radiologists. For all datasets, the gold standard diagnosis results were available and obtained by pathological analysis. For the task of IDH genotyping, patients were stratified according to their IDH mutation status types, categorized as either wildtype or mutation. For 1p/19q co-deletion identification, patients were classified based on the absence or presence of co-deletion in chromosomes 1p and 19q. For WHO grading, patients were categorized based on the degree of tumor malignancy as either low-grade or high-grade. For brain tumor classification, patients were categorized based on the type of tumor pathology: *glioblastoma* (GBM) or *primary central nervous system lymphoma* (PCNSL). In four of the datasets (DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1), MRI scans were acquired with five different contrasts, including T1-weighted (T1w), T2-weighted (T2w), fluid-attenuated inversion recovery (FLAIR), T1 contrast-enhanced (T1c) and apparent diffusion coefficient (ADC) images, at Beijing Tiantan Hospital, and radiological reports were written by radiologists in the clinical routine for the images. These four datasets were used separately for the tasks of IDH genotyping, 1p/19q co-deletion identification, WHO grading, and brain tumor classification. Another dataset DX-IDH-1 was acquired at The First Affiliated Hospital of Xinjiang Medical University as an external dataset with all five MRI contrasts and reports, and it was concerned with IDH genotyping. Also, three datasets (DT-IDH-2, DT-IDH-3, and DT-IDH-4) were acquired at Beijing Tiantan Hospital without FLAIR, ADC, or both FLAIR and ADC images, respectively, and one dataset DH-IDH-1 was acquired at Huashan Hospital, Fudan University without T2w images. These datasets were concerned with IDH genotyping. Moreover, three datasets (DT-CI-2, DT-WHO-2, and DT-BTC-2) were acquired at Beijing Tiantan Hospital without FLAIR images, and they were separately used for 1p/19g co-deletion identification, WHO grading, and brain tumor classification. In the datasets with incomplete image contrasts, reports were written based on the available image contrasts, and these datasets were used for evaluating the classification performance when image contrasts were missing. Two kinds of language models, including PLMs and LLMs, were employed for brain disease classification. For each kind of language model, a general domain version and a specific pre-trained version for Chinese were adopted. Specifically, the selected PLMs were RoBERTa-base24 and Chinese RoBERTa25, and the selected LLMs were LLaMA3-8B26 and Baichuan2-13B27. The PLMs were fine-tuned with the training sets of DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1 for their respective tasks, together with the use of their validation sets. The LLMs were fine-tuned with a parameter-efficient tuning method low-rank adaptation (LoRA)28, where all training sets of DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1 were combined. For comparison, six image-based classification models were also applied. These methods included DeepRisk29 and 3D MedMNIST30, which took the MRI scans or a subset of slices as input and output the classification result, as well as 2D MedMNIST30, DenseNet31, ViT32, and Swin Trasnformer33, which first segmented the brain tumor and then made classification based on the image and segmentation results. The detailed description of the image-based models is given in “Competing image-based models”. In addition, three radiologists were invited to perform manual classification of brain diseases. The three radiologists had three, three, and ten years of clinical experience, respectively. To evaluate the classification performance, the following metrics were computed: area under the receiver operating characteristic curve (AUC), accuracy (ACC), F1-score, sensitivity (SEN), specificity (SPEC), positive predictive value (PPV), and negative predictive value (NPV). To investigate the impact of random factors, the mean and *standard deviation* (std) of the classification performance were computed based on multiple independent repeated runs. Each run was associated with a different random seed in model training. For each task, five repeated experiments were performed. In addition, based on the results of each repeated run, Student’s *t*-tests were performed to compare the AUC and ACC between report-based and image-based models. ### The language model outperformed conventional image-based models for brain disease classification across all evaluation tasks The language models and conventional image-based models first predicted the disease classification based on all five image contrasts for the four tasks using DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1. The comparison of the classification performance between the language model Chinese RoBERTa and conventional image-based models is summarized in Table 2. Chinese RoBERTa is shown here as it has achieved the best average performance among the language models (see “Comparison and analysis of different language models” and Extended Table E2 for the detailed comparison between language models). The fine-tuned language model consistently achieved better performance in all cases, with an average AUC of 0.848, ACC of 0.795, and F1-score of 0.795 across the four tasks. Compared with the average performance of the image-based models, the language model achieved noticeable improvements, with increases in AUC of 16.72%, 18.97%, 21.98%, and 32.27%, and in ACC of 18.47%, 15.04%, 19.58%, and 22.41% for the four tasks, respectively. In addition, compared with Swin Transformer, the best image-based model in three of four tasks in terms of AUC, our language model yielded an average ACC improvement of 10.75%, and an average AUC improvement of 10.34% across the four tasks. These results demonstrate the feasibility of using report-based language models for accurate classification of brain diseases, as well as its superiority over image-based models, which are currently the common practice. View this table: [Table 2.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T2) Table 2. Comparison of our fine-tuned language model Chinese RoBERTa with conventional image-based models across four different brain disease classification tasks using DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1. The best result is highlighted in bold. The means and stds are presented in the format of mean(±std). *p*-values are computed with Student’s *t*-tests to compare the AUC and ACC between Chinese RoBERTa and the image-based models. ### The language model achieved better accuracy than junior radiologists We selected the task of brain tumor classification to compare the classification performance of our fine-tuned models with radiologists, as the radiologists found the task more dependent on the visual cues from imaging and were more confident to make direct diagnosis. The comparison was based on the test set of DT-BTC-1. The radiologists were allowed to simultaneously view all five image contrasts and the corresponding radiological reports for diagnosis. The performance of the three radiologists was compared with the receiver operating characteristic (ROC) curve of Chinese RoBERTa in Figure 2a. The ACC of the two junior radiologists with three years of experience is lower than that of Chinese RoBERTa (0.821 in Table 2), and the ROC curve of Chinese RoBERTa is above the junior radiologists. These observations indicate that the report-based model outperformed the junior radiologists. Note that the ACC of the junior radiologists is higher than that of all image-based models (see Table 2). The senior radiologist made more accurate diagnosis than all image-based and report-based models. ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/11/15/2024.11.13.24317214/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/F2) Figure 2. Additional comparison between Chinese RoBERTa and radiologists, image-based models, and others report-based models. **a**, The comparison of the ROC curve of Chinese RoBERTa and three radiologists. The ACC of the radiologists are also indicated. **b**, The classification performance on the simulated test set for IDH genotyping with full image contrasts (left bar) and missing FLAIR images (right bar). The reduction (Delta) in AUC, ACC, and F1-score in the case of missing contrasts is also indicated. **c** and **d**, The performance comparison between Chinese RoBERTa and image-based models for the two independent validation sets DX-IDH-1 and DH-IDH-1, respectively. **e**, The classification performance of the four language models, including RoBERTa-base, Chinese RoBERTa, LLaMA3-8B, and Baichuan2-13B, on DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1. ### The language model better handled missing image contrasts than the image-based models Because of the different clinical conditions of patients and different levels of expertise among radiologists, it is common that for some patients not all image contrasts are acquired, which often poses challenges for conventional image-based models34. Therefore, we further evaluated the classification performance of the report-based and image-based models with DT-IDH-2, DT-IDH-3, DT-IDH-4, DT-CI-2, DT-WHO-2, and DT-BTC-2 in this case of missing image contrasts. The report-based Chinese RoBERTa model trained previously with full image contrasts was directly applied to the test data with missing contrasts. To apply the image-based model trained previously with full image contrasts, we synthesized the missing contrast based on acquired ones using the pGAN model35, and then the acquired and synthesized contrasts were jointly fed into the image-based model. The details about the pGAN model are described in “Datasets preparation and pre-processing”. The classification performance is presented in Table 3, where FLAIR images were missing in DT-IDH-2, DT-CI-2, DT-WHO-2, and DT-BTC-2, ADC images were missing in DT-IDH-3, and both ADC and FLAIR images were missing in DT-IDH-4. Our fine-tuned language model Chinese RoBERTa outperformed six image-based models for all metrics and datasets. In particular, it achieved an average AUC of 0.792, ACC of 0.745, and F1-score of 0.747 for the six datasets. Compared with the best image-based model Swin Transformer, the average improvements in AUC, ACC, and F1-score of the language model were 14.71%, 12.07%, and 11.55%, respectively. Compared with the average performance of all image-based models, for each individual dataset, the improvements in AUC were 21.96%, 18.34%, 31.35%, 25.47%, 28.54%, and 32.08%, respectively; the improvements in ACC were 21.64%, 10.07%, 13.94%, 23.99%, 21.67%, and 26.95%, respectively; the improvements in F1-score were 22.37%, 9.39%, 16.91%, 22.39%, 26.03%, and 26.05%, respectively. These results indicate the excellent capability of the report-based model for handling the challenging scenario of missing image contrasts. View this table: [Table 3.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T3) Table 3. Classification performance of Chinese RoBERTa and image-based models on DT-IDH-2, DT-IDH-3, DT-IDH-4, DT-CI-2, DT-WHO-2, and DT-BTC-2 for patients with missing image contrasts. The best result is highlighted in bold. Moreover, to further explore quantitatively the impact of missing image contrasts on the classification performance, we performed an extended experiment for the task of IDH genotyping. Specifically, we simulated a test set with missing image contrasts based on the 408 test patients with full image contrasts in DT-IDH-1. First, we removed the FLAIR images from the input of the image-based models. Second, we manually deleted the descriptions regarding FLAIR images in the radiological reports of these patients. The classification results and the reduction in AUC, ACC, and F1-score compared with the results achieved with full image contrasts are shown in Figure 2b. The performance of our language model was very close when image contrasts were missing and complete; however, the performance of the image-based models decreased noticeably with missing contrasts. These observations further show that the language model is more robust to missing image contrasts than image-based models. ### The language model better addressed cross-site data variability than image-based models Due to the difference in the scanning device, protocol, and parameters of images acquired at different sites/hospitals, conventional image-based models may generalize poorly to an unseen site different from the training data10. To assess the impact of cross-site image variability on the report-based and image-based models, additional experiments were performed on the two external datasets DX-IDH-1 and DH-IDH-1. Note that in DH-IDH-1 there was also a missing image contrast, where no T2w image was available and a pre-trained pGAN model synthesized the T2w image. The results are presented in Figures 2c and 2d. The language model achieved better classification performance than the image-based models on the unseen sites. Specifically, for DX-IDH-1/DH-IDH-1 the language model achieved an improvement in AUC, ACC, and F1-score of 20.64%/1.19%, 7.66%/9.19%, 8.76%/9.28%, respectively, compared with the best image-based model Swin Transformer. ### The data pre-processing cost for the language model was lower than that of the image-based models In addition to the benefit of better classification performance, the language model may also require less computational overhead for data pre-processing than the image-based models, as it is more convenient to pre-process texts than images. To demonstrate the advantage of the language model in terms of the pre-processing cost, we compared the time consumption for processing 3D brain imaging data versus radiological report data, and the results are shown in Extended Table E1. The image pre-processing procedures included N4 bias field correction, image co-registration, skull stripping, and optional tumor segmentation. For radiological reports, data pre-processing involved the removal of nonsensical characters (such as blank spaces, line breaks, and extraneous symbols), tokenization, padding and truncating. The pre-processing of reports required a much smaller amount of time compared to image pre-processing, which was over 6,000 times longer with tumor segmentation and 5,000 times longer without tumor segmentation. After pre-processing, the inference stage took about an average of 1.3×10−3 seconds per patient for the language models and 1.7×10−3 seconds for the image-based models. In comparison with the pre-processing time of imaging data, the inference time of both language and image-based models was considerably short and negligible. Thus, the lower time consumption of text pre-processing can allow more efficient and timely diagnosis in clinical emergency scenarios. ### Comparison and analysis of different language models The performance of each language model for brain disease classification based on the radiological reports is summarized in Figure 2e, where the results on DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1, with full image contrasts are presented. More detailed evaluation results are shown in Extended Table E2. The fine-tuned PLM Chinese RoBERTa exhibited the best performance on three out of four tasks, achieving the best average ACC of 0.795 and F1-score of 0.795 across four tasks. The best LLM was the fine-tuned Baichuan2-13B, achieving an average ACC of 0.769 and F1-score of 0.765. Note that the average performance of each language model across the four tasks outperformed or was at least close to the best image-based model Swin Transformer in terms of AUC, ACC, and F1-score. This shows that not only Chinese RoBERTa but in general language models tended to make more reliable brain disease classification than image-based models. ## Discussion The rapid development of language models has introduced new opportunities for completing practical clinical tasks through the use of medical text information, such as clinical notes14. However, there is a lack of studies on brain disease diagnosis via language model analysis of radiological reports, where existing approaches have been based on the brain imaging information36. In this work, we have explored advanced language models for classifying brain diseases based on radiological reports, aiming to provide a new paradigm for automated brain disease diagnosis in real clinical scenarios. The results indicate that fine-tuned language models outperform conventional image-based models in terms of classification reliability. Our work has offered a new and effective solution to accurate classification of brain diseases, and it has contributed to the exploration of advanced language models for clinical applications. The better performance of language models than image-based models can be attributed to the characteristics of the data. First, in terms of data size and dimensionality, natural language data is remarkably smaller than 3D brain image data. The reduction in data complexity makes it easier for the language model to learn the association between reports and diseases. Second, in terms of data content, the radiological reports summarize important information derived from the image, such as descriptions about the anatomical structures and anomalies. Such semantic information alleviates the difficulty in image data understanding. Different types of popular language models and how to train these models have been investigated in our work. The results shown in Extended Table E2 indicate that the models developed specifically for Chinese language processing, i.e., Chinese RoBERTa and Baichuan2-13B, are better than general multilingual models, RoBERTa-base and LLaMA3-8B, for processing Chinese radiological reports. This observation is consistent with existing studies25,27, where models trained on Chinese datasets acquire a better understanding of Chinese. In addition, we have further considered different numbers of parameters and two distinct fine-tuning methods, quantized low-rank adaptation (QLoRA)37 and LoRA for the Baichuan2 model. The results presented in Extended Table E4 advocate the joint use of larger parameter size of 13 billion and LoRA fine-tuning. Among the two kinds of language models, the PLM Chinese RoBERTa performed better than the LLMs (see Extended Table E2). However, the LLMs also have their potential benefits. First, it can provide more efficient solutions to disease classification, as a single pre-trained LLM can perform several different tasks simultaneously, where different PLMs are needed for different tasks. Beyond this, the LLMs can potentially benefit from even larger data volume by simultaneous training with other similar tasks. This can be observed in Extended Table E3, as the average LLM performance is noticeably improved with simultaneous fine-tuning based on all tasks. The LLM performance may be worth further exploration in future work when more tasks are take into account. Furthermore, the LLMs may better adapt to the various writing styles of radiological reports from different hospitals than PLMs, as shown in Extended Table E5, where the LLMs have better performance than the PLMs for the two external hospitals in terms of ACC and F1-score. This may be attributed to their extensive pre-training on a massive amount of natural language data, which allows better handling of cross-site text variations. In clinical practice, missing image contrasts are pretty common for patients due to their various conditions, and the problem has long been a serious challenge for real-world applications of image-based disease classification approaches1. In addition, it is also challenging to generalize an image-based classification model to an unseen site38, where the performance can degrade drastically due to domain shift39. Our results reveal that the language models are better than image-based models at addressing missing image contrasts and cross-cite data variability. The better ability of the language models at addressing missing image contrasts and cross-site variations may be attributed to the following reasons. First, the language models are not constrained by image contrasts and can focus on all available content in the reports for better performance, whereas the synthesised image contrasts may introduce biases for the image-based models. In addition, the cross-site changes in image and report data are different. For the radiological reports, the changes are generally in writing style, while the anatomical structures and anomalies of greater significance remain consistent. However, the changes in imaging data, including image quality, signal intensity, and others, can noticeably influence the effectiveness of image-based models. Our study has limitations. First, the sample size in the experiments could still be enlarged. As it may be costly to acquire the gold standard diagnosis results, each task included thousands of patients. Future work will further accumulate the data and increase the sample size by orders of magnitude for a more extensive evaluation. Such increase in the sample size may also improve the performance of language models. Second, the comparison between PLMs and LLMs may be limited by the sample size we have collected. LLMs tend to be more effective given a huge amount of data, and it is still unknown whether LLMs will outperform PLMs given substantially more training data. Finally, in the current model implementation, there is a lack of explainability about the model’s decision. It would be interesting to explore explainable artificial intelligence techniques for the language models to provide better confidence about the diagnosis results. In conclusion, our study highlights the potential of language models for accurate brain disease diagnosis based on radiological reports. The language model achieved better diagnostic performance than conventional image-based models. Furthermore, the language model is robust to imperfect data conditions, such as missing image contrasts and cross-site variations. Looking forward, our research can contribute to the improvement of diagnostic techniques for brain diseases and the exploration of application of language models in the medical context. ## Methods ### Datasets preparation and pre-processing Four different types of brain disease classification tasks were considered. We initially collected a total number of 17,507 patients and then filtered the patients based on the presence of image contrasts and quality of radiological reports. The detailed filtering procedure for each task is shown in Figure 3. For each task, patients with poor report quality or those with preoperative preparation reports were first excluded. Then, patients were selected according to their absence of image contrasts. In this work, patients with all five image contrasts were first selected, from which the training set were derived. For the remaining patients, only those with the most common types of missing image contrasts were considered, while those with any other types of missing image contrasts were excluded. For the task of IDH genotyping, patients that did not have FLAIR images only, ADC images only, or both FLAIR and ADC images from Beijing Tiantan Hospital were included. For the two external hospitals, patients from The First Affiliated Hospital of Xinjiang Medical University who had all five image contrasts were included, while patients from Huashan Hospital, Fudan University who did not have T2w images only were included. For the other three tasks, only patients with missing FLAIR images were kept. Our final datasets comprised 14,970 patients, each associated with MRI scans and the corresponding radiological reports. ![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/11/15/2024.11.13.24317214/F3.medium.gif) [Figure 3.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/F3) Figure 3. The flowchart of data filtering for the four brain disease classification tasks: (a) IDH genotyping, (b) 1p/19q co-deletion identification, (c) WHO grading, and (d) brain tumor classification. The final included patients for each task are indicated by black dotted boxes. For image data pre-processing, N4 bias field correction was first applied with a shrink factor of 2 with the Advanced Normalization Tools (ANTs) software40. Subsequently, all image contrasts were co-registered to the T1w image. The T1w image was affinely registered to the MNI152 template41 with linear interpolation. The transformation matrix was then applied to the other image contrasts so that they were all aligned with the MNI152 template. Finally, the brain mask was extracted for skull striping with ROBEX42. Each pre-processed image had a size of 256×256×256 with a resolution of 1 mm3. For report data pre-processing, nonsensical characters were first removed from each report. Then, the reports were tokenized and either padded or truncated to a maximum length of 256 to as model input. For image-based brain disease classification, additional tumor segmentation models and image synthesis models were obtained. The tumor segmentation was achieved with a nnU-Net model43, which was trained on 1500 patients from the BraTS2021 dataset44 for whole tumor segmentation. The missing image contrast synthesis was accomplished with a pGAN model35, which was individually trained for each missing contrast and each task with 100 patients with full image contrasts randomly selected from the training set of DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1. All models used in our experiments were trained with two NVIDIA RTX A6000 GPUs. ### The PLMs for brain disease classification PLMs, such as BERT-based models, have achieved excellent performance on multiple natural language processing tasks by pre-training on a diverse corpus of language understanding tasks and subsequently fine-tuning on specific tasks. Derivatives of BERT, such as RoBERTa24, have further advanced domain-specific natural language processing technologies. In this work, the general RoBERTa-base model24 with a 12-layer architecture and approximately 125 million parameters was adopted. Further, considering that our collected radiological reports were written in Chinese, the Chinese RoBERTa model, which employs whole word masking for Chinese word segmentation, has been taken into account. Specifically, we adopted the Chinese RoBERTa-wwm25 pre-trained on additional EXT data25. We fine-tuned the models in a supervised manner using the training and validation sets of DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1. The fine-tuning was performed for each task independently based on the radiological report and classification label of each patient, where the cross-entropy loss function45 was minimized. The Adam optimizer46 was used with a batch size of eight and 20 epochs for training convergence. Model selection was performed based on the smallest cross-entropy loss of the validation set. Taking the class imbalance of the training data into consideration, class weights were applied to the standard loss function to assign higher weights to minority classes and lower weights to majority classes. ### The LLMs for brain disease classification The generative LLMs, such as GPT47, T548, and LLaMA49, have acquired vast knowledge across multiple specialized domains through pre-training and have demonstrated efficient transfer learning capabilities for new tasks. In this work, we employed the generative LLMs to predict the classification of brain diseases. Due to constraints imposed by medical data privacy, only openly accessible base models were selected for our local deployment. Specifically, the latest version of LLaMA (version 3)26 with 8 billion parameters (referred to as LLaMA3-8B) was adopted, as it had been shown to outperform many openly accessible LLMs on common industry benchmarks. In addition, Baichuan (version 2)27 with 13 billion parameters (referred to as Baichuan2-13B) was adopted, as it had achieved the best performance on multiple benchmarks in both Chinese and English. Since the LLMs do not have category limitations on their outputs, we fine-tuned the LLMs using training data from all four tasks simultaneously. The training data combined the training sets of DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1, and different tasks were distinguished by different prompts. The detailed prompt for each task is shown in Table 4. Each prompt included three parts, the radiological report of each patient, role assignment as a radiologist, and a question asking the model to generate a single precise classification label. The parameter efficient fine-tuning method low-rank adaptation (LoRA)28 was used with a batch size of 4, gradient accumulation steps of 4, a rank of 16, an alpha value of 32, and a temperature value of 1. View this table: [Table 4.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T4) Table 4. Examples of the input prompt and output setting for fine-tuning the LLMs for each task. The contents within the brackets <> represent the radiological reports of different patients and their corresponding disease labels, while the other colored contents indicate task-specific prompt settings. Each model was trained for 3 epochs with an initial learning rate of 5e-5, which was adjusted according to a cosine annealing schedule. ### Competing image-based models We considered six conventional image-based classification models for comparison, including those specifically designed for brain disease classification and the latest widely used models in the general domain. The models included those that process imaging data in 2D, 2.5D, and 3D formats, and some of these classification models were aided by the tumor segmentation results. Their detailed description is given below * **DeepRisk**. The DeepRisk29 model is developed based on a 2D ResNet3450 backbone and several attention51 blocks. This model was designed to make predictions directly using whole-brain MRI scans without tumor segmentation. Specifically, we took eight equidistant slices from each MRI contrast and concatenated them for classification. * **2D & 3D MedMNIST**. The MedMNIST30 provides benchmarks for 2D and 3D biomedical image classification based on a 2D ResNet50 and a 3D ResNet52, respectively. For the 2D-based architecture **2D MedMNIST**, we took the slice with the largest tumor area from each MRI sequence and concatenated them for classification. For the 3D-based architecture **3D MedMNIST**, the whole-brain MRI scans were directly used for classification. * **DenseNet**. The DenseNet31 model, designed based on stacked dense blocks53, performed disease classification based on the images and tumor segmentation. In addition to the slice with the largest tumor area, its input concatenated eleven slices before and twelve slices after this slice for classification, along with the slice itself. * **ViT**. The pre-trained Vision Transformer (ViT)32 has learned image features from extensive imaging data, achieving breakthrough performance on a variety of vision-related tasks in the general domain. We adopted the ViT model of base size that was pre-trained on ImageNet-21k54 at a resolution of 224×224. The model input comprised the slice with the largest tumor area from each MRI contrast and subsequently concatenated them for classification. * **Swin Transformer**. Swin Transformer33 is a recently proposed Transformer-based vision model that uses shifted windows to capture local features in images, while also ensuring computational efficiency. We adopted Swin Transformer V2 of tiny size pre-trained on ImageNet-1k54 at a resolution of 256×256. The model input was identical to that of ViT in our experiments. ## Supporting information Supplementary Information of Simple Words over Rich Imaging: Accurate Brain Disease Classification via Language Model Analysis of Radiological Reports [[supplements/317214_file02.pdf]](pending:yes) ## Data Availability All data produced in the present study are available upon reasonable request to the authors. All data produced in the present work are contained in the manuscript. ## Author contributions statement Xin Gao wrote the code, conducted experiments, and wrote the manuscript. Jun Qiu, Tiantian Hua and Ying Jin interpreted the data and made diagnoses. Xin Gao, Longfei Chen, Shanbo Zhao, Zhiqiang Wu and Haotian Hou participated in model design and optimization. Junjie Li, Yunling Wang, Wei Zhao and Yuxin Li collected the data and provided access to it. Jun Qiu, Junjie Li, Xuzhu Chen, Yunyun Duan and Yaou Liu offered guidance in medical expertise. Meihui Zhang and Chuyang Ye conceived, designed and directed the study. All authors reviewed, edited and approved the manuscript. ## Supplementary materials Supplementary information is available for this paper. View this table: [Table E1.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T5) Table E1. The pre-processing time costs for image and report data. View this table: [Table E2.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T6) Table E2. The detailed comparison results of four language models across four tasks for patients with full image contrasts from datasets DT-IDH-1, DT-CI-1, DT-WHO-1, and DT-BTC-1. The best results are highlighted in bold. View this table: [Table E3.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T7) Table E3. The comparison results of fine-tuning the LLM Baichuan2-13B with all tasks simultaneously or with each task respectively. View this table: [Table E4.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T8) Table E4. The comparison results for the Baichuan2 model with different numbers of parameters and fine-tuning methods. The average performance was calculated from four tasks for each model. View this table: [Table E5.](http://medrxiv.org/content/early/2024/11/15/2024.11.13.24317214/T9) Table E5. The detailed comparison results of two kinds of language models, PLMs and LLMs, using two external datasets, DX-IDH-1 and DH-IDH-1, from The First Affiliated Hospital of Xinjiang Medical University and Huashan Hospital, Fudan University, for the task of IDH genotyping. ## Acknowledgements This study was supported by the Beijing Municipal Natural Science Foundation (7242273 & JQ20035), Fundamental Research Funds for the Central Universities (2022CX11008), Xiaomi Young Scholars Program, National Natural Science Foundation of China (81870958 & 81571631), and Special Fund of the Pediatric Medical Coordinated Development Center of Beijing Hospitals Authority (XTYB201831). * Received November 13, 2024. * Revision received November 13, 2024. * Accepted November 15, 2024. * © 2024, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. 1.Shoeibi, A. et al. Diagnosis of brain diseases in fusion of neuroimaging modalities using deep learning: A review. Inf. Fusion 93, 85–117 (2023). 2. 2.Cenek, M., Hu, M., York, G. & Dahl, S. Survey of image processing techniques for brain pathology diagnosis: Challenges and opportunities. Front. Robotics AI 5, 120 (2018). 3. 3.Kazmierczak, P. M. et al. Ultrafast brain magnetic resonance imaging in acute neurological emergencies: diagnostic accuracy and impact on patient management. Investig. Radiol. 55, 181–189 (2020). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31917761&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 4. 4.Coleman, M. R. et al. Towards the routine use of brain imaging to aid the clinical diagnosis of disorders of consciousness. Brain 132, 2541–2552 (2009). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/brain/awp183&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19710182&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000269963600023&link_type=ISI) 5. 5.Abd-Ellah, M. K., Awad, A. I., Khalaf, A. A. & Hamed, H. F. A review on brain tumor diagnosis from MRI images: Practical implications, key achievements, and lessons learned. Magn. resonance imaging 61, 300–318 (2019). 6. 6.Chen, L., Qiao, H. & Zhu, F. Alzheimer’s disease diagnosis with brain structural MRI using multiview-slice attention and 3D convolution neural network. Front. Aging Neurosci. 14, 871706 (2022). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35557839&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 7. 7.Tuan, T. A., Pham, T. B., Kim, J. Y. & Tavares, J. M. R. Alzheimer’s diagnosis using deep learning in segmenting and classifying 3D brain MR images. Int. J. Neurosci. 132, 689–698 (2022). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33045895&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 8. 8.Suh, C. et al. Development and validation of a deep learning–based automatic brain segmentation and classification algorithm for Alzheimer disease using 3D T1-weighted volumetric images. Am. J. Neuroradiol. 41, 2227–2234 (2020). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoiYWpuciI7czo1OiJyZXNpZCI7czoxMDoiNDEvMTIvMjIyNyI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDI0LzExLzE1LzIwMjQuMTEuMTMuMjQzMTcyMTQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 9. 9.Yamanakkanavar, N., Choi, J. Y. & Lee, B. MRI segmentation and classification of human brain using deep learning for diagnosis of Alzheimer’s disease: A survey. Sensors 20, 3243 (2020). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32517304&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 10. 10.Yao, Z. et al. Artificial intelligence-based diagnosis of Alzheimer’s disease with brain MRI images. Eur. J. Radiol. 165, 110934 (2023). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37354773&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 11. 11.Heo, T. S. et al. Prediction of stroke outcome using natural language processing-based machine learning of radiology report of brain MRI. J. personalized medicine 10, 286 (2020). 12. 12.Casey, A. et al. A systematic review of natural language processing applied to radiology reports. BMC medical informatics decision making 21, 179 (2021). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=34082729&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 13. 13.Guevara, M. et al. Large language models to identify social determinants of health in electronic health records. NPJ digital medicine 7, 6 (2024). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=38200151&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 14. 14.Li, T. et al. CancerGPT for few shot drug pair synergy prediction using large pretrained language models. npj Digit. Medicine 7, 40 (2024). 15. 15.Benary, M. et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw. Open 6, e2343689–e2343689 (2023). 16. 16.Blinov, P., Avetisian, M., Kokh, V., Umerenkov, D. & Tuzhilin, A. Predicting clinical diagnosis from patients electronic health records using BERT-based neural networks. In Artificial Intelligence in Medicine: 18th International Conference on Artificial Intelligence in Medicine, AIME 2020, Minneapolis, MN, USA, August 25–28, 2020, Proceedings 18, 111–121 (Springer, 2020). 17. 17.Liu, H. et al. Use of BERT (bidirectional encoder representations from transformers)-based deep learning method for extracting evidences in Chinese radiology reports: development of a computer-aided liver cancer diagnosis framework. J. medical Internet research 23, e19689 (2021). 18. 18.Zhang, T. et al. Radiologic, a healthcare model for processing electronic health records and decision-making in breast disease. Cell Reports Medicine 4 (2023). 19. 19.Jack Jr, C. R. et al. The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. J. Magn. Reson. Imaging: An Off. J. Int. Soc. for Magn. Reson. Medicine 27, 685–691 (2008). 20. 20.Menze, B. H. et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE transactions on medical imaging 34, 1993–2024 (2014). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1109/TMI.2014.2377694&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25494501&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 21. 21.Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. data 6, 317 (2019). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31831740&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 22. 22.Pavlopoulos, J., Kougia, V. & Androutsopoulos, I. A survey on biomedical image captioning. In Proceedings of the second workshop on shortcomings in vision and language, 26–36 (2019). 23. 23.Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, vol. 33, 590–597 (2019). 24. 24.Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arxiv:1907.11692 (2019). 25. 25.Cui, Y., Che, W., Liu, T., Qin, B. & Yang, Z. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, Lang. Process. 29, 3504–3514 (2021). 26. 26.AI@Meta. LLaMA 3 model card (2024). 27. 27.Yang, A. et al. Baichuan 2: Open large-scale language models. arXiv preprint arxiv:2309.10305 (2023). 28. 28.Hu, E. J. et al. LoRA: Low-rank adaptation of large language models. arXiv preprint arxiv:2106.09685 (2021). 29. 29.Zhang, Y., Wang, H., Zhang, D. & Wang, D. DeepRisk: A deep transfer learning approach to migratable traffic risk estimation in intelligent transportation using social sensing. In 2019 15th International Conference on Distributed Computing in Sensor Systems (DCOSS), 123–130 (IEEE, 2019). 30. 30.Yang, J. et al. MedMNIST v2-a large-scale lightweight benchmark for 2D and 3D biomedical image classification. Sci. Data 10, 41 (2023). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36658144&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 31. 31.Gao, P. et al. Development and validation of a deep learning model for brain tumor diagnosis and classification using magnetic resonance imaging. JAMA Netw. Open 5, e2225608–e2225608 (2022). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35939301&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 32. 32.Dosovitskiy, A. et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arxiv:2010.11929 (2020). 33. 33.Liu, Z. et al. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021). 34. 34.Zhou, T., Ruan, S. & Canu, S. A review: Deep learning for medical image segmentation using multi-modality fusion. Array 3, 100004 (2019). 35. 35.Dar, S. U. et al. Image synthesis in multi-contrast MRI with conditional generative adversarial networks. IEEE transactions on medical imaging 38, 2375–2388 (2019). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1109/TMI.2019.2901750&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30835216&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 36. 36.Yellu, R. R., Kukalakunta, Y. & Thunki, P. Deep learning-assisted diagnosis of Alzheimer’s disease from brain imaging data. J. AI Healthc. Medicine 4, 36–44 (2024). 37. 37.Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. Adv. Neural Inf. Process. Syst. 36 (2024). 38. 38.Fang, Y., Wang, M., Potter, G. G. & Liu, M. Unsupervised cross-domain functional MRI adaptation for automated major depressive disorder identification. Med. image analysis 84, 102707 (2023). 39. 39.Zhu, W., Sun, L., Huang, J., Han, L. & Zhang, D. Dual attention multi-instance deep learning for Alzheimer’s disease diagnosis with structural MRI. IEEE Transactions on Med. Imaging 40, 2354–2366 (2021). 40. 40.Avants, B. B. et al. Advanced normalization tools (ANTS). Insight j 2, 1–35 (2009). 41. 41.Fonov, V. S., Evans, A. C., McKinstry, R. C., Almli, C. R. & Collins, D. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage 47, S102 (2009). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S1053-8119(09)70884-5&link_type=DOI) 42. 42.Iglesias, J. E., Liu, C.-Y., Thompson, P. M. & Tu, Z. Robust brain extraction across datasets and comparison with publicly available methods. IEEE transactions on medical imaging 30, 1617–1634 (2011). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1109/TMI.2011.2138152&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21880566&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000294545400004&link_type=ISI) 43. 43.Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. methods 18, 203–211 (2021). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41592-020-01008-z&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33288961&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F15%2F2024.11.13.24317214.atom) 44. 44.Baid, U. et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arxiv:2107.02314 (2021). 45. 45.Zhang, Z. & Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. neural information processing systems 31 (2018). 46. 46.Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arxiv:1412.6980 (2014). 47. 47.Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. et al. Improving language understanding by generative pre-training (2018). 48. 48.Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. machine learning research 21, 1–67 (2020). 49. 49.Touvron, H. et al. LLaMA: Open and efficient foundation language models. arXiv preprint arxiv:2302.13971 (2023). 50. 50.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016). 51. 51.Vaswani, A. et al. Attention is all you need. Adv. neural information processing systems 30 (2017). 52. 52.Hara, K., Kataoka, H. & Satoh, Y. Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the IEEE international conference on computer vision workshops, 3154–3160 (2017). 53. 53.Iandola, F. et al. DenseNet: Implementing efficient convnet descriptor pyramids. arXiv preprint arxiv:1404.1869 (2014). 54. 54.Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (IEEE, 2009).