Abstract
Background Identifying visually and semantically similar radiological images in a database can facilitate the creation of decision support tools, teaching files, and research cohorts. Existing content-based image retrieval tools are often limited to searching by pixel-wise difference or vector distance of model predictions. Vision transformers (ViT) use attention to simultaneously take into account radiological diagnosis and visual appearance.
Purpose We aim to develop a ViT-based image retrieval framework and evaluate the algorithm on NIH Chest Radiographs (CXR) and NLST Chest CTs.
Materials and Methods The model was trained on 112,120 CXR and 111,955 CT images. For CXR, a ViT binary classifier was trained on 4 ground truth labels (Cardiomegaly, Opacity, Emphysema, No Finding) and ensembled to produce multilabel classifications for each CXR. For CT, a regression model was trained to minimize L1 loss on the continuous ground truth labels of patient weight. The ViT image embedding layer was treated as a global image descriptor, using the L2 distance between descriptors as a similarity measure. To qualitatively evaluate the model, five radiologists performed a reader performance study with random query images (25 CT, 25 CXR). For each image, they chose the 5 most similar images from a set of 10 images (the 5 closest and 5 furthest images from the query in model space). Inter-radiologist and radiologist-model agreement statistics were calculated.
Results The CXR model achieved nDCG@5 of 0.73 (p<0.001) and Cardiomegaly mAP@5 of 0.76 (p<0.001) among other results. The CT model achieved nDCG of 16.85 (p<0.001). The model prediction agreed with radiologist consensus on 86% of CXR samples and 79.2% of CT samples. Inter-radiologist Fleiss Kappa of 0.51 and radiologist-consensus-to-model Cohen’s Kappa of 0.65 were observed. A t-SNE of the CT model latent space was generated to validate similar image clustering.
Conclusion Our ViT architecture retrieved visually and semantically similar radiological images.
Summary Statement This study evaluates the efficacy of using ViT based image embeddings for CBIR tasks for CXR and CT images, finding that it performs well on visual and semantic recognition tasks.
Key Results
The CXR model achieved nDCG@5 of 0.73 (p<0.001) and Cardiomegaly mAP@5 of 0.76 (p<0.001) among other results for CXR.
The CT model achieved nDCG of 16.85 (p<0.001). The model prediction agreed with radiologist consensus on 86% of CXR samples and 79.2% of CT samples.
Inter-radiologist Fleiss Kappa of 0.51 and radiologist consensus to model Cohen’s Kappa of 0.65 were observed.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study did not receive any funding
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study used ONLY openly available human data that were originally located at: https://nihcc.app.box.com/v/ChestXray-NIHCC and https://cdas.cancer.gov/datasets/nlst/
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
Funding information: None
Data sharing statement: Data analyzed during the study were provided by a third party. Requests for data should be directed to the provider indicated in the Acknowledgements.
Abbreviations
- (CBIR)
- Content-Based Image Retrieval
- (ViT)
- Vision Transformer
- (CXR)
- Chest Radiograph
- (CT)
- Computed Tomography
- (mAP)
- Mean Average Precision
- (nDCG)
- Normalized Discounted Cumulative Gain
- (NIH)
- National Institutes of Health
- (NLST)
- National Lung Screening Trial