MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training =============================================================== * Chaoyi Wu * Xiaoman Zhang * Ya Zhang * Yanfeng Wang * Weidi Xie ## Abstract In this paper, we consider the problem of enhancing self-supervised visual-language pre-training (VLP) with medical-specific knowledge, by exploiting the paired image-text reports from the radiological daily practice. In particular, we make the following contributions: **First**, unlike existing works that directly process the raw reports, we adopt a novel report filter to extract the medical entities, avoiding unnecessary complexity from language grammar and enhancing the supervision signals; **Second**, we propose a novel entity embedding module by querying an external knowledge description base, to exploit the rich context of additional information that the medical domain affords, and implicitly build relationships between entities in the language embedding space; **Third**, we propose a novel Transformer-based fusion model for spatially aligning the entity description with visual signals at the image patch level only with self-supervised learning, thus enabling the ability for spatial grounding; **Fourth**, we conduct thorough experiments to validate the effectiveness of our proposed architecture, and benchmark on numerous public benchmarks e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning settings, our model has demonstrated strong performance compared with the former methods on disease classification and grounding. ## 1. Introduction With the rapid development of deep learning, numerous works have been proposed to facilitate computer-aided diagnosis in the medical field [16, 17, 39, 48]. Despite the tremendous progress, these models are normally trained to recognize or segment the structures that fall into a certain closed set of anatomical or disease categories, whenever a new disease comes to be of interest, a costly procedure for data annotation, model re-training, and ethics proof will be required, fundamentally limiting its practical values. As an alternative, recent research considers to train the model by exploiting a large number of multi-modal medical data, that is generated from daily clinical routine, for example, the most common example is the dataset of X-ray images with paired radiological reports [15, 25, 28]. This paper focuses on self-supervised vision-language representation learning in the medical domain, with the goal of zero-shot disease diagnosis (classification) and grounding. Undoubtedly, such tasks have also been widely investigated in the computer vision community, with significant progress made in the past years, for example, CLIP [43], ALBEF [30], BLIP [29], etc. However, to achieve such a goal in the medical domain, different challenges must be resolved, which requires research efforts from the community: *First*, data availability, training Foundation Models in computer vision normally require over millions of imagetext pairs at ease, while in the medical domain, only a few hundred thousand pairs are available [28]. The limited data amount challenges language models to understand the reports in free form [7]. *Second*, the problem considered in medical diagnosis is naturally fine-grained, that requires distinguishing between fine appearance details to understand the disease, as a consequence, domain knowledge is essential; *Third*, robustness is crucial, it is, therefore, preferable to have explainability, where diagnosis results come along with the visual grounding, to help radiologists to understand the system, and build trust between humans and machines. Unlike existing work in medical VLP (Vision-Language Pre-training) [7, 22, 40, 56] that naïvely matches raw reports with image scans, we propose a novel knowledge-enhanced visual-language model that takes medical prior into consideration and enables us to address the aforementioned challenges explicitly, as shown in Fig. 1: *First*, we propose a report filter to extract useful medical entities, and simplify each report into sets of triplets, denoted as {entity, position, exist}. Consequently, decomposing reports into entities leads to an effective representation of the reports with minimal information loss, enriching supervision signals at the detailed entity level; *Second*, we map these entities into fine-grained descriptions by querying a well-defined medical knowledge base, and compute the text embedding for these descriptions, to implicitly build relationships between entities; *Third*, we adopt a transformerbased architecture for aligning the image patches with entity descriptions, that simultaneously infer the likelihood of certain diseases and the visual evidence in the form of a spatial heatmap, *i.e*., providing grounding for explainability purpose. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/01/11/2023.01.10.23284412/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/F1) Figure 1. Our method mainly considers combining medical knowledge with VLP. **a**. shows the standard VLP flowchart which uses text-image retrieval as a proxy task. **b**. is our MedKLIP flowchart. We adopt a report filter module to decompose raw reports at entity level and further use knowledge descriptions to explain entities. Our model can realize zero-shot classification and grounding. We train the model on the most widely-used medical image-report dataset MIMIC-CXR [28] and rigorously evaluate on numerous public benchmarks, *e.g*., ChestXray14 [50], RSNA Pneumonia [44], SIIM-ACR Pneumothorax [1], COVIDx CXR-2 [41], COVID Rural [12, 47], and EdemaSeverity [8]. We get a state-of-the-art zero-shot classification and grounding performance on different diseases, spanning different image distributions, with further fine-tuning, our model still exceeds previous models significantly. ## 2. Related Work ### Vision-Language Pre-training Models Vision-Language Pre-training (VLP) models have achieved great success in natural scenarios. Generally, there are two typical structures for VLP models. One is two-stream methods [5,27,30]. The other is single-stream methods [10, 32]. These impressive results promote VLP methods in medical. Different from natural data, medical VLP suffers from a serious Lack of Data (224k [28] vs 400M [5]). Most medical VLP methods follow the two-stream methods [8, 22, 40, 56]. Con-VIRT [56] first proposed to use contrastive learning as a proxy task in medical. LoVT and GLoRIA then focus on improving the local alignment performance [8, 22]. BioViL notices the language pattern in reports is different from other natural texts and re-designs the language model used for medical VLP [7]. These works have greatly contributed to the development of this aspect. However, they still view medical texts and images as common natural data and use a classical pipeline to handle them, instead of leveraging the rich prior knowledge in medical. ### Medical Named-Entity-Recognition Models Various natural language processing (NLP) approaches have been proposed to extract useful information from radiology reports [25, 37, 42, 45]. These early methods considered only the disease, causing they lose a lot of information. Some Analysis tools [4,6] have also been developed to extract key clinical concepts and their attributes from biomedical text. Further state-of-the-art works [26, 51] are proposed to extract relationship between different entities without demand of pre-defined close disease set, retaining most of useful information with high accuracy. This progress inspires us a lot and provide a new perspective for VLP. However, how to take advantage of this Named-Entity-Recognition (NER) models have not been discussed sufficiently in VLP field. ### Medical Knowledge Enhanced Models Leveraging external medical knowledge to enhance deep learning models is not a new topic [52]. Depending the approaches of using medical knowledge, They can be classified into modelbased and input-based two types. In model-based types, the authors may imitate the radiological practice to design the model [20, 23, 31, 49] or change the model structure based on diagnostic patterns [11, 18, 38]. In input-based types, the knowledge is viewed as an extra input to calculate features [46, 53, 54] or to guide the final loss [9, 24]. Multi-task learning and attention mechanism [33, 34, 36] are often adopted to realize medical knowledge combination. However, most of these works are task-specific [52] because medical knowledge lacks an effective shared representation space across various diseases while we demonstrate that leveraging medical entity descriptions with text encoding has the potential to provide such a space. ## 3. Method In this section, we start by describing our considered problem scenario in Sec. 3.1, followed by the details of our proposed Transformer-based architecture in Sec. 3.2, including, the visual encoder, knowledge-enhanced text encoder, and the fusion module for aligning the visual and language signals. In Sec. 3.3, we describe the training procedure for our proposed model with the paired image-reports sourced from the daily routine X-ray scans. ### 3.1. Problem Scenario Assuming we are given a training set with *N* samples, *i.e*., 𝒟train = (𝒳1, *𝒯*1), …, (𝒳*N*, *𝒯**N*), where 𝒳*i*, *𝒯**i* refer to the X-ray image and its corresponding medical report generated in the daily routine scans, respectively, our goal is to train a visual-language model that enables us to diagnose the existence of certain diseases and localize the visual evidence spatially. Specifically, at inference time, we can freely ask the system to identify the likelihood of the patient getting a certain disease (may or may not be seen during training), with its visual description for the disease of interest: ![Formula][1] where 𝒳*i* ∈ ℝ*H*×*W* ×3 refers to an image sample from the **test set**, with *H, W* denoting height and width respectively. ŝ*i* ∈ [0, 1] refers to the inferred likelihood of the patient having a certain disease indicated by the input description, and ![Graphic][2] denotes a predicted spatial heatmap, with high activation on pixels that potentially provide the visual indication for such disease. In the following section, we will detail the individual components of our architecture, namely, the visual encoder, text encoder, and fusion module, and training them with the available training set (𝒟train). ### 3.2. Architecture In this section, we detail our proposed framework, consisting of three main components, namely, visual encoding, knowledge-enhanced language encoding, and fusion module, as shown in Fig. 2. Note that, we hereon only consider single sampled image-reports pair (𝒳*i*, *𝒯**i*), and ignore the subscript in notations for simplicity. ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/01/11/2023.01.10.23284412/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/F2) Figure 2. The whole framework of our method. The figure mainly contains the fourth module: *Visual Encoder, Knowledge-enhanced Language Encoding, Fusion Module. Knowledge-enhanced Language Encoding* contains *Text Encoder* and *Report Filter. Report Filter* extracts entities from the raw reports and *Text Encoder* further embeds them. *Visual Encoder* is used to encoder the input of visual modalities and *Fusion Module* is used for cross-modality interaction. The details of *Report Filter* can be found in the right sub-figure. A report is first filtered by a pre-trained filter and viewed as a set of triplets. The *“Position”* part is mixed with some negative positions for contrastive loss and the *“Exist”* part is used for CE loss. #### 3.2.1. Visual Encoding Given an X-ray image scan 𝒳 ∈ ℝ*H*×*W* ×3, we can compute the features with a visual backbone: ![Formula][3] *h, w, d* refer to the height, width, and feature dimension of the output feature map, in our case, we adopt a standard ResNet-50 as the visual backbone, and take the output from the 4th residual block. Note that, we make the such an architectural choice for a fair comparison with existing work [7,22,40,56], while other visual backbones, *e.g*., ViT [14], can equally be applied. #### 3.2.2. Knowledge-enhanced Language Encoding The goal of this module is to extract useful information from the text report, by incorporating medical domain knowledge. In particular, we design two stages, namely report filtering, and entity encoding. ##### Report Filtering To start with, we propose to condense the report and transform it into a set of entity triplets, *i.e*., removing the unnecessary complexity from language grammar, as shown in Figure 2 (right). In detail, we use a pretrained text filter [26, 55] to extract valuable information from the report, for example, the medical entities and their corresponding positions on the image. Specifically, given a report 𝒯 with a set of sentences, 𝒯= {*s*1, *s*2, …, *s**M*}, the filter independently operates on each of the sentences, and construct a number of triplets from the report, with the extracted entity (most are diseases), spatial position, and a label indicating the existence of the disease: ![Formula][4] where *t**j* represents the total number of entities contained in one sentence, with *n* = 0 indicating the special case that there is no valid entity. **Note that**, the position refers to the spatial position of the entity lying in an image, it is not to be confused with the positional embedding in Transformer. ##### Entity Encoding Here, we replace the entities by querying detailed visual descriptions from a medical-purpose knowledge base, for example, “Pneumonia”→ “It is a condition of the lung primarily affecting the small air sacs known as alveolar. It may present with opacities and pleural effusion and it can increase the diagnostic accuracy of lung consolidation”. Note that, such descriptions for the medical terminologies can easily be sourced from either existing educational textbooks or online resources12. Despite its simplicity, converting the entities into descriptions is crucial for more reliable and open-vocabulary diagnosis, as it further decomposes the medical entities into visual attributes that are shared by different diseases, encouraging the model to capture a deep understanding of the visual evidence. To encode the entity, we use the ClinicalBERT [3] as a pre-trained text encoder, to first compute the embedding for the entity description and position, and then adopt a linear MLP to flexibly project the embedding to desired dims: ![Formula][5] ![Formula][6] Each triplet has now been converted into {*e, p, l*}, *l* ∈{0, 1} denotes the existence of the entity. ##### Discussion We have made two major differences compared to the existing visual-language models in computer vision, *First*, the information in medical reports is often more condensed, normally describing the existence of abnormality and their positions in the image, thus, adopting the **filter** operation can avoid unnecessary complexity from grammar, while still retaining most of the useful information in reports. *Second*, entities tend to be medical terminologies that are only understandable to audiences with a medical background, enrich the encoding by visual descriptions can significantly help the model to capture a deep understanding of the visual evidence for diseases, specifically, for seen diseases, such shared visual attributes enable to build the implicit relationship, while for unseen diseases, their visual evidence may have already been well understood by processing the descriptions of the seen ones, as they tend to be shared among diseases. #### 3.2.3. Fusion Module After extracting all the entities and their corresponding positions from the reports, we select the top |*Q*| most commonly appearing entities in reports, and compute the textual embeddings for their corresponding descriptions, denoted as an entity set *Q* = {*e*1, *e*2, …, *e*|*Q*|}, and top |*P* | position embeddings as a position set *P* = {*p*1, *p*2, …, *p*|*P* |}. For a certain image, its computed visual representation and the entity set will be passed into a fusion module for alignment, consisting of multiple Transformer Decoder layers. We treat the entity set *Q* as Query, and the image features 𝒱 as Key and Value into the Transformer decoders, the outputs from the fusion module are further fed into two linear MLP layers, one is used for inferring the existence of the entity, and the other generates an embedding to indicate the entity’s spatial position: ![Formula][7] where ŝ ∈ℝ|*Q*| represents the existence prediction for each entity query, and ![Graphic][8] represents the prediction embedding of spatial position for the entities. ![Graphic][9] denotes the average of the cross-attention maps sourced from Transformer Decoder layers. The training procedure will be detailed in Sec. 3.3. ##### Discussion In contrast to the existing approaches [56] that aligns the reports with the entire image, our adopted Transformer decoder enables to compute correspondences between text and image at the patch level. Consequently, the image features 𝒱 are more suitable for downstream segmentation tasks and the average of the cross-attention map in each layers can be used directly for **zero-shot** grounding. ### 3.3. Training To train the proposed model, we leverage the corresponding triplets, for entities that are not mentioned in the considered report, we simply ignore them while computing loss. For simplicity, the following formulations are based on the assumption that the considered query do have corresponding triplets. In specific, for the existence prediction ŝ, we use binary cross-entropy with the existence label, denoted as ℒcls; to supervise the position prediction for each entity query, we adopt contrastive learning, randomly sample *M* position embeddings from the position set: ![Formula][10] where ⟨·, · ⟩ represents the inner product of two vectors and ℐ (·, ·) is a random index sampling function. *P* is unnormalized in calculation. The final loss is the sum of the two: ![Formula][11] where *α*1, *α*2 refer to two hyper-parameters controlling the ratio of the two losses, and we set them to be 1.0 by default. ## 4. Experiment In this section, we start by introducing the dataset used for experiments, *e.g*., pre-training, and various downstream datasets. Then we describe the implementation details and the considered baselines. ### 4.1. Pre-training Dataset **MIMIC-CXR v2 [19, 28]** consists of over 227k studies of paired image-report data, they are sourced from 65,379 patients at different scanning. Each study can have one or two images (different scan views), totaling 377,110 images. ### 4.2. Datasets for Downstream Tasks **ChestX-ray14 [50]** contains 112,120 frontal-view X-ray images of 30,805 unique patients, collected from the year of 1992 to 2015 by NIH(National Institutes of Health), with labels of 14 common diseases provided. We split the dataset into 0.8*/*0.1*/*0.1 for train/valid/test. **RSNA Pneumonia [44]** contains more than 260k frontalview chest X-rays with corresponding pneumonia opacity masks collected by RSNA (Radiological Society of North America). Commonly, it is treated as a classification tasks [7, 22]. We split the dataset into 0.6*/*0.2*/*0.2 for train/valid/test. **SIIM-ACR Pneumothorax [1]** contains more than 12k frontal-view chest X-rays with pneumothorax masks collected by SIIM-ACR (Society for Imaging Informatics in Medicine and American College of Radiology). Similarly to RSNA Pneumonia dataset, it can be both used as classification and segmentation tasks. We split the dataset into 0.6*/*0.2*/*0.2 for train/valid/test. **COVIDx CXR-2 [41] and COVID Rural [12, 47]** aim to evaluate on diagnosing COVID-19. COVIDx CXR-3 contains 29,986 images from 16,648 patients with COVID-19 classification labels. We use it as a classification dataset and split it into 0.7*/*0.2*/*0.1 for train/valid/test. Additionally, we use COVID Rural dataset for COVID-19 segmentation. It contains more than 200 chest X-rays with segmentation masks, and we split it into 0.6*/*0.2*/*0.2 for train/valid/test. **Edema Severity [8]** contains 6,524 examples from MIMIC-CXR with pulmonary edema severity labels (0 to 3, increasing severity) extracted from the radiology reports. Of these, 141 radiologists were examined by radiologists, and consensus was reached on severity level. It can be seen as a typical fine-grained classification task. We split the dataset into 0.6*/*0.2*/*0.2 for train/valid/test. ### 4.3. Implementation Details This section details the implementation for architecture, pre-training, zero-shot inference and fine-tuning procedure. #### Model architecture As input to the model, images are resized into 224 ×224 ×3. We use the first four layers of ResNet50 [21] as our visual backbone (Φvisual), and adopt a MLP layer to transform the output feature dimension into *d* = 256. As a result, the output feature maps from visual encoder is 𝒱∈ ℝ14×14×256. On the report side, we extract the entities with a pre-trained text filter, as described in [26], and compute the entity and position embedding with a pre-trained ClinicalBERT [2], its default embedding dim is *d*′ = 768. We obtain |*Q*| = 75 entities and |*P* |= 51 positions that most frequently appear in the reports, following [55]. We sample *M* = 7 negative positions for each entity to calculate contrastive loss for training entity position training. In the fusion module, We adopt 4 Transformer Decoder layers with 4 heads in each layer. #### Pre-training At this stage, both the filtering operation and language encoding use pre-trained networks, while the visual encoder and fusion module are trained end-to-end on the image-text pairs. We use AdamW [35] optimizer with *lr* = 1 ×10−4 and *lr*warm up = 1 ×10−5. We train on a GeForce RTX 3090 GPU with batch size 32 for 60 epochs. The first 5 epochs are set for warming up. #### Inference At inference time, given a test image, we aim to infer the existence of certain entity / disease, and ground their visual evidence. For those entities that have appeared at training time, we simply adopt the corresponding elements from the entity query set, while for those unseen ones, we replace the entity with a brief description, and treat that as an added query to the model. The existence output can be directly applied for classification and the average cross-attention of different layers in the transformerbased fusion module between specific query and the visual features are used for grounding. #### Fine-tuning For the downstream tasks, with large amount of training data, we can fine-tune the model end-to-end, with our pre-trained visual backbone as initialization. Specifically, for image classification task, we adopt ResNet50 [21] and initialize its first four layers with our pre-trained visual encoder. For image segmentation task, we use ResUNet [13] as backbone and initialize its encoder with our pre-trained image encoder. ### 4.4. Baselines We compare with various existing state-of-the-art medical image-text pre-train methods, namely, ConVIRT [56], GLoRIA [22] and BioViL [7]. Since ConVIRT and GLo-RIA are pre-trained on an in-house dataset, we re-train their models on MIMIC-CXR dataset for fair comparison. For BioViL, we use the officially released models by the authors. For zero-shot setting, we use the prompt as mentioned by BioViL [7]. For fine-tuning, we all use ResNet50 as the visual encoder as described in Sec. 4.3. ### 4.5. Metrics **AUC** refers to the area under the receiver operating characteristic (ROC) curve, that is commonly used for detection and binary classification tasks. **F1 and ACC** are used as supplementary metrics for classification tasks. Specifically, F1 comprehensively measures the recall and precision of the model, and ACC is the short of Accuracy. The final binary prediction threshold is chosen to maximise the F1 score. The ACC score is also calculated under this threshold. **Pointing Game** is used for evaluating the grounding performance. In specific, we extract the region with max response in the output heat-map, for one instance, if the region hit the ground-truth mask, it is considered a positive prediction, otherwise negative. Finally, accuracy can be calculated as the pointing game score. **Dice and IOU** are commonly used for segmentation tasks. For zero-shot segmentation, we search the segmentation threshold with 0.01 interval for all methods, and report the maximal Dice score for each model. **Precision and Recall** refer to the detection Precision and Recall. For medical, it is important that lesions are detected even without fine segmentation. Additionally, in some hard cases, especially for the zero-shot setting, Dice and IOU may be too strict to reflect the performance difference. Precision and recall scores can compensate for these. We choose the IOU threshold as 0.1 to calculate the scores. ## 5. Results In this section, we will report the experimental results. In general, we split the results into two parts: zero-shot setting and fine-tuning setting. In the zero-shot case (Sec. 5.1), we carry out the ablation study and compare it with the other SOTA image-text pre-train methods. We mainly consider classification and segmentation tasks; In the fine-tuning case (Sec. 5.2), we evaluate the model’s transferability by fine-tuning the model with 1%, 10%, and 100% data portion. Additionally, we also add a disease grading downstream task, which can be seen as a fine-grade classification task, showing that our pre-trained model can be transferred to the downstream tasks at ease. ### 5.1. Zero-shot In this section, we compare our method with the other state-of-the-art methods under zero-shot setting, classification, and grounding. Due to the space limitation, we include the entire ablation study in the supplementary material, referring to it for more details and analysis, and all comparisons here are made using our best model with position contrastive loss and entity description encoder. #### 5.1.1. Classification ##### Seen Diseases As shown in Tab. 1, we compare with existing methods on three widely-used datasets, demonstrating consistent performance improvement. Specifically, on pneumonia and pneumothorax datasets, despite the images being collected by different clinics with different diseases, our model improves the AUC score from 0.83 to 0.87 on RSNA pneumonia dataset and from 0.71 to 0.89 on SIIM-ACR pneumothorax dataset, as shown in Tab. 1. This shows that our method can better deal with the multi-center and multi-disease data distribution in medical. While on ChestX-ray14 dataset, we improve the average AUC scores from 0.69 to 0.77, we refer the reader to supplementary material for a more detailed comparison of 14 diseases. View this table: [Table 1.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T1) Table 1. Comparison with other state-of-the-art methods on zero-shot classification task. AUC, F1 and ACC scores are reported. For ChestX-ray14, the metrics all refer to the macro average on the 14 diseases. ##### Unseen Diseases In particular, we use covid-19 to evaluate the systems. **Note that**, our considered setting is different from existing approaches, where all entities have been exposed to the model at training time, and prediction can be made by a retrieval-type approach, *i.e*., compute the similarity between the image and the entity embedding by encoding the disease name with a language encoder [7], while we are considering a stricter setting for openset classification. Covid-19 is a new disease that only appeared in 2019, MIMIC-CXR reports collected in the year 2015 do not include any entity of covid-19, thus it requires the system to have the ability to diagnose truly unseen diseases. As shown in Tab. 2, existing approaches that only rely on disease name struggles to make the correct diagnosis. While with our proposed approach by introducing medical knowledge, *i.e*., using entity descriptions, our methods can understand the complex medical entity descriptions unseen in the training set, and significantly boost the performance 0.66 to 0.74 on AUC and from 0.59 to 0.70 on ACC. View this table: [Table 2.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T2) Table 2. Comparison with other state-of-the-art methods on zero-shot Covid-19 classification task. AUC, F1 and ACC scores are reported. *“Direct covid-19”* refers to directly use “Covid-19” to construct the prompt sentence while *“Covid-19 Description”* refers to replace the name “Covid-19” with its medical description. #### 5.1.2. Grounding In addition to the plain diagnosis, explainability can be equally critical in healthcare, improving the reliability and trustiness of the machine learning systems. Here, we consider providing explainability by grounding the abnormality in the prediction and compare against the existing approaches. Similarly, we split the diseases into seen and unseen ones, depending on whether their names have appeared in the medical reports. Specifically, “Pneumonia” and “Pneumothorax” are viewed as seen, and “Covid-19” is viewed as unseen. Due to the space limitation, we refer the reader to supplementary material for visualization results. ##### Seen Diseases We show the results for grounding on RSNA Pneumonia opacity and SIIM-ACR Pneumothorax collapse in Tab. 3. As shown in Tab. 3a, our proposed model surpasses existing approaches on all metrics, for example, we improve the pointing game score from 0.83 to 0.87, the detection Recall from 0.85 to 0.87, the detection precision from 0.50 to 0.64, the IOU from 0.30 to 0.32 and the Dice from 0.44 to 0.46. While on SIIM-ACR dataset (Tab. 3b), the pneumothorax region tends to be thin and narrow, localizing it can often be more challenging than that of opacity grounding [7], we thus only consider pointing game, recall, and precision. Similarly, our method can achieve significantly better performance than prior approaches. View this table: [Table 3.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T3) Table 3. Comparison with other state-of-the-art methods on zero-shot region grounding tasks. (a) shows the results on RSNA Pneumonia dataset. (b) shows the results on SIIM-ACR Pneumothorax dataset. The pneumothorax region tends to be thin and narrow and much more challenging for grounding, we thus only consider pointing game, recall, and precision. Our method can achieve better performance on different metrics, especially on the pointing game score. ConVIRT as the basic method proposed earliest can not realize this function. ##### Unseen Diseases We also conduct the zero-shot grounding experiment on the unseen disease, namely, Covid-19, as shown in Tab. 4. Our model has shown consistent improvements in all metrics, *e.g*., boosting the pointing game score from 0.40 to 0.58. One observation to be noticed is that, results in Tab. 4 are mostly consistent with those in Tab. 2, *i.e*., better classification results tend to lead to better grounding. Overall, our model with knowledge-enhanced language encoding has facilitated the visual encoder to learn underlying evidence relating to the diseases, therefore, leading to more interpretable representations than prior approaches. View this table: [Table 4.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T4) Table 4. Comparison with other state-of-the-art methods on zero-shot covid-19 opacity region grounding task. *“Direct covid-19”* refers to directly use “Covid-19” to construct the prompt sentence while *“Covid-19 Description”* refers to replace the name “Covid-19” with its medical description. Our method can achieve better performance on different metrics. ### 5.2. Fine-tuning In this section, we consider the fine-tuning scenario, with the pre-trained model as initialization, and trained end-to-end on the downstream tasks. We test on three different tasks, namely, classification, segmentation, and grading. In classification and segmentation, the test splits and metrics are the same as in the “zero-shot” section. Grading is a new task we introduce in fine-tuning setting, which can be seen as a fine-grained classification task. #### 5.2.1. Classification We experiment on four different datasets, using 1%, 10%, 100% of the data for fine-tuning, that is consistent with the existing work [7, 22, 56]. As shown in Tab. 5, our model has demonstrated substantial improvements in the AUC scores over the existing approaches across all datasets, reflecting that our pre-trained representation is of higher quality than existing models. We refer the readers to the supplementary material for more detailed comparison results. View this table: [Table 5.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T5) Table 5. Comparison of AUC scores with other state-of-the-art methods on fine-tuning classification task. The macro average of AUC scores on 14 diseases are reported for ChestX-ray14 dataset. #### 5.2.2. Segmentation In Tab. 6, we conduct fine-tuning experiments on three different diseases for segmentation. We pick 1%, 10%, 100% of the data for fine-tuning. For all three different diseases with different image distributions, our methods surpass the existing state-of-the-art methods by a large margin, especially under the low-data regime. View this table: [Table 6.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T6) Table 6. Comparison of Dice scores with other state-of-the-art methods on fine-tuning segmentation tasks. Three diseases are reported, and for each disease, three data portions, 1%, 10%, 100% are adopted to show the performance change under different data amounts. #### 5.2.3. Grading Besides diagnosis, grading the disease severity level also plays an important role. Here, we adopt our pre-trained features and train them for the multi-class classification task, with 0 to 3 representing different severity levels. As shown in Tab. 7, for each level, the AUC, F1, and ACC scores are calculated as one class against all other ones, for example, 0 vs {1, 2, 3}. Final macro average scores of four levels are computed. On the majority of severity levels, our method can achieve the best results. View this table: [Table 7.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T7) Table 7. Comparison with other state-of-the-art methods on fine-tuning edema severity grading multi-class classification task. AUC score is reported in the Table. “0,1,2,3” in the table represents the severity level and final macro average scores are reported. ## 6. Conclusion In this paper, we introduce a novel medical knowledge enhanced VLP model. First, we propose a report filter to extract useful medical entities with more useful supervision signals, simplifying complex raw reports with minimal information loss. Second, we translate entities into detailed medical descriptions and embed them with a text encoder enabling the network to understand complex medical expert-level knowledge. Finally, a transformer-based structure is proposed to do local region alignment. In experiments, We evaluate our method on different datasets under various settings. Our method shows strong zero-shot classification and grounding abilities, even facing unseen diseases. Besides, in fine-tuning setting, our method still outperforms state-of-the-art methods significantly, showing the superiority of our method. ## Data Availability All data produced in the present study are available upon reasonable request to the authors ## Supplementary ### A. The Entity Description Base and Position Set Tab. 8 shows the descriptions we used to translate different entities. We have kept 75 entities in query set *Q*, following [55]. “Tail abnorm obs” entity represents some tailed entities and “exluded obs” represents some entities useless for diagnosis. The last “covid-19” description row is only referred to for inference since it does not appear in pre-train reports. View this table: [Table 8.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T8) Table 8. The Entity description used for translate single entity name. The description can be easily found from the open website. Additionally, we keep 51 positive positions, following [55], to form the position set *P*, as {*trachea, left\_hilar, right\_hilar, hilar\_unspec, left\_pleural, right\_pleural, pleural\_unspec, heart\_size, heart\_border, left\_diaphragm, right\_diaphragm, diaphragm\_unspec, retrocardiac, lower\_left\_lobe, upper\_left\_lobe, lower\_right\_lobe\_middle\_right\_lobe, upper\_right\_lobe, left\_lower\_lung, left\_mid\_lung, left\_upper\_lung\_left\_apical\_lung, left\_lung\_unspec, right\_lower\_lung, right\_mid\_lung, right\_upper\_lung\_right\_apical\_lung, right\_lung\_unspec, lung\_apices, lung\_bases, left\_costophrenic\_right\_costophrenic, costophrenic\_unspec, cardiophrenic\_sulcus, mediastinal, spine\_clavicle, rib, stomach, right\_atrium, right\_ventricle, aorta, svc, interstitium, parenchymal, cavoatrial junction, cardiopulmonary, pulmonary, lung_volumes, unspecified, other*}. “Other” is used to reprepresnt some tailed positions. ### B. Details of Fusion Module In the transformer-based fusion module, the queries are first passed through a self-attention layer and then followed by a multi-head attention layer between the modified queries and image features. In each head, the image features are processed by a linear key head and a linear value head as key embeddings and value embeddings independently. The value is weightedadded based on the attention map, which is calculated by the soft-max dot product of the keys and queries. Finally, a feed-forward network gathers the vector of different heads resulting in the output of this layer. The output vector of the former layer is considered as the entity query vector of the next layer. In formulation, if denoting ![Graphic][12] as the i-th layer, the procedure is expressed as: ![Formula][13] ### C. Ablation Study Our final method mainly contains three key parts, transformer-based fusion module, position location contrastive loss (PosCL), and entity description encoder (DE). We gradually remove the modules to analyze their effectiveness. “w/o (DE)” refers to removing DE module and “w/o (PosCL+DE)” refers to only maintaining the fusion module with basic CE loss. Tab. 9 and Tab. 10 shows the quantitative results. View this table: [Table 9.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T9) Table 9. Ablation study on zero-shot classification task. AUC, F1 and ACC scores are reported. For ChestX-ray 14, the metrics all refer to the macro average on the 14 diseases. View this table: [Table 10.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T10) Table 10. Ablation study on zero-shot grounding tasks. (a) shows the results on RSNA Pneumonia dataset. (b) shows the results on SIIM-ACR Pneumothorax dataset. #### Transformer Decoder The lines about “w/o (PosCL + DE)” in tables demonstrate the performance of the basic model modified only by base CE loss. This model can exceed many former methods. This indicates the complex syntax will hurt the network to capture the useful entities significantly and our filtering operation combined with medical NER can greatly relieve the problem. #### Position Contrastive Loss The PosCL can significantly help the network to ground the abnormalities. As shown in the results by adding PosCL the classification results can be further improved, e.g., from 0.75 to 0.76 on AUC in ChestX-ray14 dataset. Besides classification, location contrastive loss can bring more gain in grounding. These results show position is another vital element in reports especially for grounding tasks. Our filtered triplets can conclude and clean the reports with little information loss and make the network learn the report information more straightforward. #### Entity Description Encoder By adding entity descriptions, we want to realize two goals. *First*, in addition to just learning from the image-report data, the network can actively learn the relationship between different entities based on the entity descriptions. As shown in tables, adding descriptions in most scenarios can help the network better understand the entity and bring gain to the final metric scores. *Second*,the description encoder enables our model to **handle openset new diseases**. Since the entity list is a close set during pre-training, our method will be only able to handle the seen diseases without DE, while, with a description encoder, our method can handle unseen diseases and understand complex medical disease knowledge. ### D. Detailed results on ChestX-ray14 We further show the detailed performance of 14 different diseases on ChestX-ray14 dataset. Tab. 11 shows the results on the zero-shot setting. Our method can exceed the former methods for most diseases. The radar Fig. 3Y shows more visually how our model compares with other solutions under the zero-shot setting. Our method can exceed the former methods for most diseases. Under 100% fine-tuning settings, we achieved similarly excellent results as shown in Tab. 12. View this table: [Table 11.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T11) Table 11. Comparison with other state-of-the-art methods on zero-shot ChestX-ray 14 diseases classification task. For each disease, AUC score is reported and the macro average AUC score is also reported. We use the first three letters to represent one disease but for “pneumonia” and “pneumothorax” we use the first two and the last letters. View this table: [Table 12.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/T12) Table 12. Comparison with other state-of-the-art methods on fine-tuning ChestX-ray 14 diseases classification task. For each disease, AUC score is reported and the macro average AUC score is also reported. We use the first three letters to represent one disease but for “pneumonia” and “pneumothorax” we use the first two and the last letters. ![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/01/11/2023.01.10.23284412/F3.medium.gif) [Figure 3.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/F3) Figure 3. The radar figure of our method and other methods of ChestX-ray14 14 diseases. AUC scores are reported and, as shown, our method exceeds the previous state-of-the-art on most diseases. ### E. Visualization Results Fig. 4 shows visualization results of our model on zero-shot grounding task. As shown in figure, the ground truth of “Pneumonia” is given by bounding box and generally related to a large area region. Thus the metrics on this are higher than other two datasets. Our network captures its regions very well. For “Pneumothorax”, its abnormality pattern is different from other diseases, which aim to capturing the collapsed part of the lung, rendering darker areas on the images rather than brighter opacity. Its ground-truth masks are generally thin and narrow while our network can still highlight its location. For “Covid-19”, its image textual was similar to “Pneumonia”, but since this is a totally new disease, grounding its regions is much more challenging. It requires the model to build relationships between them based on their complex definition and symptoms. The visualization results suggest that our model successfully achieve this, supporting that, for other unseen diseases, our model can also understand their complex descriptions. ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/01/11/2023.01.10.23284412/F4.medium.gif) [Figure 4.](http://medrxiv.org/content/early/2023/01/11/2023.01.10.23284412/F4) Figure 4. The visualization of zero-shot grounding results of our method. Each column represents the results on one disease and the left in it is the ground-truth and right is the heatmap predication of our model. The brighter the red on the figure, the more likely the model considering this region to be associated with abnormalities. ## Footnotes * {wtzxxxwcy02{at}sjtu.edu.cn, xm99sjtu{at}sjtu.edu.cn, ya_zhang{at}sjtu.edu.cn, wangyanfeng{at}sjtu.edu.cn, weidi{at}sjtu.edu.cn} * [https://chaoyi-wu.github.io/MedKLIP/](https://chaoyi-wu.github.io/MedKLIP/) * 1 Wikipedia [https://en.wikipedia.org/wiki/Wiki](https://en.wikipedia.org/wiki/Wiki) * 2 UMLS [6] [https://www.nlm.nih.gov/research/umls/index.html](https://www.nlm.nih.gov/research/umls/index.html) * Received January 10, 2023. * Revision received January 10, 2023. * Accepted January 11, 2023. * © 2023, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/) ## References 1. [1].Society for imaging informatics in medicine: Siim-acr pneumothorax segmentation. [https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation](https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation). 2019. 2, 5 2. [2]. Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapo-lis, Minnesota, USA, June 2019. Association for Com-putational Linguistics. 5 3. [3]. Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, WA Redmond, and Matthew BA McDermott. Publicly available clin-ical bert embeddings. NAACL HLT 2019, page 72, 2019. 4 4. [4]. Alan R Aronson and François-Michel Lang. An overview of metamap: historical perspective and re-cent advances. Journal of the American Medical In-formatics Association, 17(3):229–236, 2010. 2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1136/jamia.2009.002733&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=20442139&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F01%2F11%2F2023.01.10.23284412.atom) 5. [5]. Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, and Sri Lak-shmi. Contrastive language-image pre-training for the italian language. arXiv preprint arXiv:2108.08688, 2021. 2 6. [6].Olivier Bodenreider. The unified medical lan-guage system (umls): integrating biomedical termi-nology. Nucleic acids research, 32(Suppl 1):D267–D270, 2004. 2, 4 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkh061&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=14681409&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F01%2F11%2F2023.01.10.23284412.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000188079000061&link_type=ISI) 7. [7]. Benedikt Boecking, Naoto Usuyama, Shruthi Ban-nur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. In European conference on computer vision, pages 1–21, 2022. Official Implementa-tion: [https://github.com/microsoft/hi-ml/tree/main/hi-ml-multimodal](https://github.com/microsoft/hi-ml/tree/main/hi-ml-multimodal). 2, 3, 5, 6, 7, 8, 17 8. [8]. Geeticka Chauhan, Ruizhi Liao, William Wells, Jacob Andreas, Xin Wang, Seth Berkowitz, Steven Horng, Peter Szolovits, and Polina Golland. Joint modeling of chest radiographs and radiology reports for pulmonary edema assessment. In International Conference on Medical Image Computing and Computer-Assisted In-tervention, pages 529–539. Springer, 2020. 2, 5 9. [9]. Sihong Chen, Jing Qin, Xing Ji, Baiying Lei, Tianfu Wang, Dong Ni, and Jie-Zhi Cheng. Automatic scor-ing of multiple semantic attributes with multi-task feature leverage: a study on pulmonary nodules in ct images. IEEE transactions on medical imaging, 36(3):802–814, 2016. 2 10. [10]. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text represen-tation learning. In European conference on computer vision, pages 104–120. Springer, 2020. 2 11. [11]. Hui Cui, Yiyue Xu, Wanlong Li, Linlin Wang, and Henry Duh. Collaborative learning of cross-channel clinical attention for radiotherapy-related esophageal fistula prediction from ct. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 212–220. Springer, 2020. 2 12. [12]. Shivang Desai, Ahmad Baghal, Thidathip Wong-surawat, Piroon Jenjaroenpun, Thomas Powell, Shay-maa Al-Shukri, Kim Gates, Phillip Farmer, Michael Rutherford, Geri Blake, et al. Chest imaging repre-senting a covid-19 positive rural us population. Scien-tific data, 7(1):1–6, 2020. 2, 5 13. [13]. Foivos I Diakogiannis, François Waldner, Peter Cac-cetta, and Chen Wu. Resunet-a: A deep learn-ing framework for semantic segmentation of remotely sensed data. ISPRS Journal of Photogrammetry and Remote Sensing, 162:94–114, 2020. 5 14. [14]. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020. 3 15. [15]. Jared A Dunnmon, Alexander J Ratner, Khaled Saab, Nishith Khandwala, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew P Lungren, Daniel L Rubin, et al. Cross-modal data programming enables rapid medical ma-chine learning. Patterns, 1(2):100019, 2020. 1 16. [16]. Bradley J Erickson, Panagiotis Korfiatis, Zeynettin Akkus, and Timothy L Kline. Machine learning for medical imaging. Radiographics, 37(2):505, 2017. 1 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1148/rg.2017160130&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28212054&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F01%2F11%2F2023.01.10.23284412.atom) 17. [17]. Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebas-tian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. nature, 542(7639):115–118, 2017. 1 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature21056&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28117445&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F01%2F11%2F2023.01.10.23284412.atom) 18. [18]. Leyuan Fang, Chong Wang, Shutao Li, Hossein Rabbani, Xiangdong Chen, and Zhimin Liu. Attention to lesion: Lesion-aware convolutional neural network for retinal optical coherence tomography image classification. IEEE transactions on medical imaging, 38(8):1959–1970, 2019. 2 19. [19]. Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, phys-iotoolkit, and physionet: components of a new re-search resource for complex physiologic signals. cir-culation, 101(23):e215–e220, 2000. 5 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTQ6ImNpcmN1bGF0aW9uYWhhIjtzOjU6InJlc2lkIjtzOjExOiIxMDEvMjMvZTIxNSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzAxLzExLzIwMjMuMDEuMTAuMjMyODQ0MTIuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 20. [20]. Ivan Gonzalez-Diaz. Dermaknet: Incorporating the knowledge of dermatologists to convolutional neural networks for skin lesion diagnosis. IEEE journal of biomedical and health informatics, 23(2):547–559, 2018. 2 21. [21]. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 5 22. [22]. Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vi-sion, pages 3942–3951, 2021. Official Implementa-tion: [https://github.com/marshuang80/gloria](https://github.com/marshuang80/gloria). 2, 3, 5, 6, 7, 8, 17 23. [23]. Xin Huang, Yu Fang, Mingming Lu, Fengqi Yan, Jun Yang, and Yilu Xu. Dual-ray net: automatic diagno-sis of thoracic diseases using frontal and lateral chest x-rays. Journal of Medical Imaging and Health Infor-matics, 10(2):348–355, 2020. 2 24. [24]. Sarfaraz Hussein, Kunlin Cao, Qi Song, and Ulas Bagci. Risk stratification of lung nodules using 3d cnn-based multi-task learning. In International con-ference on information processing in medical imaging, pages 249–260. Springer, 2017. 2 25. [25]. Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Mark-lund, Behzad Haghgoo, Robyn Ball, Katie Shpan-skaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial in-telligence, volume 33, pages 590–597, 2019. 1, 2 26. [26]. Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, Curtis Langlotz, et al. Radgraph: Extracting clinical entities and re-lations from radiology reports. In Thirty-fifth Con-ference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. 2, 3, 5 27. [27]. Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text su-pervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 2 28. [28]. AEWP Johnson, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng. Mimic-cxr database. PhysioNet10, 13026:C2JT1Q, 2019. 1, 2, 5 29. [29]. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and gener-ation. arXiv preprint arXiv:2201.12086, 2022. 1 30. [30]. Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language represen-tation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021. 1, 2 31. [31]. Liu Li, Mai Xu, Xiaofei Wang, Lai Jiang, and Han-ruo Liu. Attention based glaucoma detection: a large-scale database and cnn model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10571–10580, 2019. 2 32. [32]. Liunian Harold Li, Mark Yatskar, D. Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. 2 33. [33]. Xiaomeng Li, Xiaowei Hu, Lequan Yu, Lei Zhu, Chi-Wing Fu, and Pheng-Ann Heng. Canet: cross-disease attention network for joint diabetic retinopathy and di-abetic macular edema grading. IEEE transactions on medical imaging, 39(5):1483–1493, 2019. 2 34. [34]. Qing Liao, Ye Ding, Zoe L Jiang, Xuan Wang, Chunkai Zhang, and Qian Zhang. Multi-task deep convolutional neural network for cancer diagnosis. Neurocomputing, 348:66–73, 2019. 2 35. [35]. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018. 5 36. [36]. Gabriel Maicas, Andrew P Bradley, Jacinto C Nasci-mento, Ian Reid, and Gustavo Carneiro. Training med-ical image analysis systems like radiologists. In In-ternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 546–554. Springer, 2018. 2 37. [37]. Matthew BA McDermott, Tzu Ming Harry Hsu, Wei-Hung Weng, Marzyeh Ghassemi, and Peter Szolovits. Chexpert++: Approximating the chexpert labeler for speed, differentiability, and probabilistic output. In Machine Learning for Healthcare Conference, pages 913–927. PMLR, 2020. 2 38. [38]. Masahiro Mitsuhara, Hiroshi Fukui, Yusuke Sakashita, Takanori Ogata, Tsubasa Hirakawa, Takayoshi Yamashita, and Hironobu Fujiyoshi. Em-bedding human knowledge into deep neural network via attention map. arXiv preprint arXiv:1905.03540, 2019. 2 39. [39]. Yasuhide Miura, Yuhao Zhang, Emily Tsai, Curtis Langlotz, and Dan Jurafsky. Improving factual com-pleteness and consistency of image-to-text radiology report generation. In Proceedings of the 2021 Confer-ence of the North American Chapter of the Associa-tion for Computational Linguistics: Human Language Technologies, pages 5288–5304, 2021. 1 40. [40].Philip Müller, Georgios Kaissis, Congyu Zou, and Daniel Rückert. Joint learning of localized representa-tions from medical images and reports. arXiv preprint arXiv:2112.02889, 2021. 2, 3 41. [41]. Maya Pavlova, Naomi Terhljan, Audrey G Chung, Andy Zhao, Siddharth Surana, Hossein Aboutalebi, Hayden Gunraj, Ali Sabri, Amer Alaref, and Alexan-der Wong. Covid-net cxr-2: An enhanced deep convolutional neural network design for detection of covid-19 cases from chest x-ray images. Frontiers in Medicine, 9, 2022. 2, 5 42. [42]. Yifan Peng, Xiaosong Wang, L. Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu. Neg-bio: a high-performance tool for negation and un-certainty detection in radiology reports. AMIA Sum-mits on Translational Science Proceedings, 2018:188, 2018. 2 43. [43]. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas-try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 1 44. [44]. George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with ex-pert annotations of possible pneumonia. Radiology. Artificial intelligence, 1(1), 2019. 2, 5 45. [45]. Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pa-reek, Andrew Y Ng, and Matthew Lungren. Com-bining automatic labelers and expert annotations for accurate radiology report labeling using bert. In Pro-ceedings of the 2020 Conference on Empirical Meth-ods in Natural Language Processing (EMNLP), pages 1500–1519, 2020. 2 46. [46]. Jiaxing Tan, Yumei Huo, Zhengrong Liang, and Lihong Li. Expert knowledge-infused deep learning for automatic lung nodule detection. Journal of X-ray Science and Technology, 27(1):17–35, 2019. 2 47. [47]. Haiming Tang, Nanfei Sun, and Yi Li. Deep learning segmentation model for automated detection of the opacity regions in the chest x-rays of the covid-19 positive patients and the application for disease severity. medRxiv preprint, 2020. 2, 5 48. [48]. Joseph J Titano, Marcus Badgeley, Javin Schefflein, Margaret Pain, Andres Su, Michael Cai, Nathaniel Swinburne, John Zech, Jun Kim, Joshua Bederson, et al. Automated deep-neural-network surveillance of cranial images for acute neurologic events. Nature medicine, 24(9):1337–1341, 2018. 1 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41591-018-0147-y&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F01%2F11%2F2023.01.10.23284412.atom) 49. [49]. Kun Wang, Xiaohong Zhang, Sheng Huang, Feiyu Chen, Xiangbo Zhang, and Luwen Huangfu. Learning to recognize thoracic disease in chest x-rays with knowledge-guided deep zoom neural networks. IEEE Access, 8:159790–159805, 2020. 2 50. [50]. Xiaosong Wang, Yifan Peng, L. Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017. 2, 5 51. [51]. Joy T Wu, Nkechinyere Nneka Agu, Ismini Lourentzou, Arjun Sharma, Joseph Alexander Paguio, Jasper Seth Yao, Edward Christopher Dee, William G Mitchell, Satyananda Kashyap, Andrea Giovannini, et al. Chest imagenome dataset for clinical reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. 2 52. [52]. Xiaozheng Xie, Jianwei Niu, Xuefeng Liu, Zhengsu Chen, Shaojie Tang, and Shui Yu. A survey on incorporating domain knowledge into deep learning for medical image analysis. Medical Image Analysis, 69:101985, 2021. 2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.media.2021.101985&link_type=DOI) 53. [53]. Yutong Xie, Yong Xia, Jianpeng Zhang, Yang Song, Dagan Feng, Michael Fulham, and Weidong Cai. Knowledge-based collaborative deep learning for benign-malignant lung nodule classification on chest ct. IEEE transactions on medical imaging, 38(4):991–1004, 2018. 2 54. [54]. Wenkai Yang, Juanjuan Zhao, Yan Qiang, Xiaotang Yang, Yunyun Dong, Qianqian Du, Guohua Shi, and Muhammad Bilal Zia. Dscgans: Integrate domain knowledge in training dual-path semi-supervised con-ditional generative adversarial networks and s3vm for ultrasonography thyroid nodules classification. In International conference on medical image computing and computer-assisted intervention, pages 558–566. Springer, 2019. 2 55. [55]. Ke Yu, Shantanu Ghosh, Zhexiong Liu, Christopher Deible, and Kayhan Batmanghelich. Anatomy-guided weakly-supervised abnormality localization in chest x-rays. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 658–668. Springer, 2022. 3, 5, 14, 15 56. [56]. Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare, 2022. Highest Starred Implementation: [https://github.com/edreisMD/ConVIRT-pytorch](https://github.com/edreisMD/ConVIRT-pytorch). 2, 3, 4, 5, 6, 8, 17 [1]: /embed/graphic-2.gif [2]: /embed/inline-graphic-1.gif [3]: /embed/graphic-4.gif [4]: /embed/graphic-5.gif [5]: /embed/graphic-6.gif [6]: /embed/graphic-7.gif [7]: /embed/graphic-8.gif [8]: /embed/inline-graphic-2.gif [9]: /embed/inline-graphic-3.gif [10]: /embed/graphic-9.gif [11]: /embed/graphic-10.gif [12]: /embed/inline-graphic-4.gif [13]: /embed/graphic-20.gif