Abstract
Chest X-ray (CXR) is a conventional diagnostic tool for cardiothoracic assessment, boasting a high degree of costeffectiveness and versatility. However, with an increasing number of scans to be evaluated by radiologists, they can suffer from fatigue which might impede diagnostic accuracy and slow down report generation. We describe a prototype computer-assisted diagnosis (CAD) pipeline employing computer vision (CV) and Natural Language Processing (NLP) trained on the publicly available MIMICCXR dataset. We perform image quality assessment, view labelling, segmentation-based cardiomegaly severity classification, and use the output of the severity classification for large language model-based report generation. Four certified radiologists assessed the output accuracy of the CAD pipeline. Across the dataset composed of 377,100 CXR images and 227,827 free-text radiology reports, our system identified 0.18% of cases with mixedsex mentions, 0.02% of poor quality images (F1=0.81), and 0.28% of wrongly labelled views (accuracy 99.4%), furthermore it assigned views for 4.18% of images which have unlabelled views. For binary cardiomegaly classification, we achieve state-of-the-art performance of 95.2% accuracy. The inter-radiologist agreement on evaluating the report’s semantics and correctness for radiologistMIMIC is 0.62 (strict agreement) and 0.85 (relaxed agreement) similar to the radiologist-CAD agreement of 0.55 (strict) and 0.93 (relaxed). Our work found and corrected several incorrect or missing metadata annotations for the MIMIC-CXR dataset, and the performance of our CAD system suggests performance on par with human radiologists. Future improvements revolve around improved text generation and the development of CV tools for other diseases.
Introduction
More than a century has passed since its discovery, still, chest X-ray (CXR) remains a staple non-invasive diagnostic tool in medical institutions across the globe owing to its versatility, inexpensiveness, and convenience (1). Thoracic anatomical structures including the lungs and heart are visualised and distinguished on a 2D film based on radiodensity variations, enabling the detection of common cardiothoracic conditions such as pneumonia, pneumothorax and tuberculosis (2). The procedure costs only a fraction of more sophisticated imaging techniques such as computer tomography (CT) or magnetic resonance imaging (MRI) and can be completed in mere minutes, making it an ideal starting point in thoracic imaging (3).
Despite its popularity, CXR faces its own challenges in its application. Unlike CT and MRI, CXR reduces the 3D pleural space to a singular plane, leading to structural overlaps between anatomical features, with over 40% of the lung parenchyma obscured by the ribs and mediastinum (4). Depending on the type and site of a lesion, the visible differences between the lesion and surrounding structures may be extremely subtle or even indistinguishable (5, 6).
Apart from technical limitations, human factors also complicate CXR interpretation. Radiologists are subject to cognitive biases, distractions, and fatigue, which negatively impact their diagnostic acumen and brood errors, potentially leading to patient harm (7, 8). Furthermore, the ease of performing a CXR has inadvertently led to widespread over-prescription, with the average US radiologist workload increasing by 30% to 70% in under three decades, contributing to significant radiologist burnouts (1, 9, 10). In less privileged regions of the world, severe radiologist shortages are often observed with practising radiologists struggling to meet demand (11). It is therefore perhaps unsurprising that even with numerous technical advancements and error-mitigating strategies developed over the years, studies have repeatedly failed to identify any improvement in radiology interpretive error rates since 1949, which stands at an uncomfortable 3% to 4% of reports, a sizable fraction of which is comprised of CXR (7, 8, 12).
To remedy this chronic deficiency, exploratory efforts have been made to incorporate artificial intelligence (AI) technology into CXR interpretation in recent years (5). The underlying rationale is that, unlike their human counterparts, AI readers are immune to fatigue, distractions, and may suffer less from certain types of cognitive biases based on the training methods, allowing for higher accuracy and consistency (5). Additionally, AI could significantly relieve the burden of radiologists, especially in times of crisis such as the COVID-19 pandemic, reducing the backlog and improving patient outcome (13, 14).
Due to the sheer complexity of the CXR interpretation, most modern approaches employ deep learning methods, such as training convoluted neural network (CNN) models (5, 15, 16). Nevertheless, attempts to implement AI in radiology reading tend to diverge in their philosophy. Some aim to develop AI-augmented diagnostic tools to assist human radiologist interpretation, while others are concerned with creating a system to automate the entire process. For instance, the CAD system CheXGAT (17) helps radiologists detect and discern between fourteen chest conditions, while AI-CenterNet CXR (18) is a fully automated system for detecting and localising eight thoracic conditions. Similarly, some projects may pursue a comprehensive model for widespectrum usage such as the ones above, while others narrow their focus to one or few diseases (5, 19). COVID-Net CXR2 and Covid-MANet, CNN models developed for detecting COVID-19 pneumonia, are examples of the latter (13, 14).
The present study produces a focused CAD system for the automatic detection and reporting of cardiomegaly in CXR. Cardiomegaly refers to abnormal or pathological enlargement of the heart, which is often a manifestation of dilated or hypertrophic cardiomyopathies (20–22). While the aetiology of both cardiomyopathies is highly complex and not yet fully elucidated, it is known that both could lead directly to heart failure syndrome which carries a poor prognosis (20, 23, 24). Severe cardiomegaly on CXR is immediately apparent to experienced radiologists on a Posterior-Anterior (PA) CXR, while more ambiguous cases may require the calculation of cardiothoracic ratio (CTR) using a PA CXR (Fig 1). Any value above a threshold (often set between 0.45∼0.55) possibly indicates cardiomegaly (20, 25, 26). The process can be time-consuming and contain a degree of subjectiveness due to variations in thresholds used and measurements taken, and thus may benefit from the implementation of CAD systems.
Several CAD systems have been developed for cardiomegaly detection and can be categorised into classification-based and segmentation-based approaches (27). Several transfer learning models were developed, with one classifying normal vs cardiomegaly CXR images with an AUC of 0.87, which is close to the average of similar approaches (28– 30). U-Net-based segmentation models are utilised to automatically calculate CTR for determining cardiomegaly presence and are considered a feasible option with the advantage of higher explainability over classification-based methods (31, 32). Nevertheless, the existing cardiomegaly CAD systems share many of the same limitations. The datasets used for model creation and testing are often limited in size due to the rarity of pixel-level labelled CXR data, restricting model generalisability (27, 32). The final output of these models is either directly binary, or CTR ratios which are then used to derive binary decisions (presence or absence of cardiomegaly) (27, 33). Without the capability of determining the severity of heart enlargement or producing report-styled sentences mimicking those written by human radiologists, the capacity to which these CAD systems can automate cardiomegaly detection and reporting is limited.
The current study is inspired by existing segmentation-based cardiomegaly CAD systems and improves upon them by enhancing generalisability and introducing missing features crucial to realising full automation in cardiomegaly detection and reporting. MIMIC-CXR, a massive dataset containing 377,100 CXR images and 227,827 reports, is selected as the main dataset. A ResNet50 model is first trained using the dataset for filtering CXR images suitable for cardiomegaly detection. Lung and heart segmentations and CTR calculation are subsequently performed on suitable images with varying degrees of cardiomegaly reported. Statistical techniques are then applied to determine the CTR threshold for different severity levels of cardiomegaly, ranging from borderline to severe. Finally, natural language processing (NLP) and large language models (LLM) are leveraged to produce accurate, believable cardiomegaly report sentences based on machine predictions. If fully ensembled, the CAD system only requires an input stream of CXR images and provides full automation from image quality assessment to report sentence generation, subjective to double checking by human radiologists, as illustrated by the flowchart (Fig2). A cohort of four radiologists were invited to assess the output quality of CAD output sentences as human experts.
We hypothesise that coupling the output of computer vision models with large language models can generate plausible expert-level radiology report text. Specifically, we aim to:
1) assess the quality of the MIMIC-CXR dataset, including both images and reports, to prepare it for developing computer vision models, 2) Utilise CXR segmentation algorithms to derive features relevant to sub-tasks, 3) create a computer vision model for cardiomegaly classification using labels extracted from reports using natural language processing tools, 4) combine the output of the computer vision classification model with large language models to generate plausible medical text, and finally 5) use expert evaluation of the generated reports to assess the quality of the generated text.
Materials and Methods
MIMIC-CXR dataset
The MIMIC-CXR database (36) was selected as the primary source of CXR data in this study. It is publicly available with credentialed access via PhysioNet (37) and includes 377,100 CXR images in both JPG and DICOM format and 227,827 corresponding free-text radiology reports, all anonymised. The images and corresponding text reports are stored across 10 folders (p10-p19). Additionally, the database contains CSV files with image metadata and feature annotations acquired with CheXpert (38) and NegBio (39). In each feature column, 1 represents a positive mention of the feature, 0 represents negative mentions, and −1 represents ambiguous mentions. A blank value indicates the absence of any mention of the feature.
All CXR images in JPG format and their corresponding free text reports, alongside metadata and annotation CSV files, were directly downloaded from PhysioNet using command prompts.
The data user agreement of MIMIC-CXR prohibits sharing images from the dataset. Images from MIMIC-CXR that are already in the public domain also fall under this agreement. The MIMIC-CXR team have confirmed that this policy applies to all data therefore no disclosure is permitted (private communication 30 August 2024). We therefore have replaced the images with similar images from the NIH Chest X-ray dataset (34) (for frontal PA images) and from (35) (for AP and lateral images, under a CC BY-NC 3.0 license). In each of the figures (where relevant) we have indicated the original MIMIC ID that the inference is made on so that those with credentialised access to MIMIC-CXR can inspect the original data. The report snippets shown here are all synthetically generated, however their context mimics the original MIMIC-CXR report text. MIMIC-CXR report IDs are disclosed in the relevant captions to allow readers with MIMICCXR data access to inspect the original images and reports.
Generation of poor quality images
All images from p10 and p11 folders were scrutinised manually to identify ‘abnormal’ images which are those that cannot be properly interpreted by radiologists and would warrant retaking the image. These images contain severe cropping, distortion, and/or absence of thoracic structures. Upon positive identification, these images were then used to generate 500 additional ‘artificial’ abnormal images by applying synthetic and augmentation techniques to imitate common types of abnormal images from normal images (see examples in Fig 3A). The ‘genuine’ abnormal images were set aside for testing (see below).
Computer vision models
Two classification models were trained on the MIMIC-CXR dataset, one to classify normal from poor quality images, and a subsequent model to classify frontal from lateral images as shown in Fig 2A to perform an image quality control.
Classification
For the normal versus poor quality image classification model, 500 frontal including both PA and AP views as well as 500 lateral images were selected at random for the normal group from the p10 and p11 folders to form the training set, where each image was manually checked.
For the frontal versus lateral classification model, 1,000 images from each group were selected from the entire MIMICCXR dataset (p10-p19) through cross-referencing with view labels reported in the metadata CSV file. Similarly, each image was manually checked for labels being correctly assigned. The test set consists of another 2,000 images (equally split) selected at random, and manually checked, without overlapping patients with the training set.
For both models, the training image datasets were randomly split into training and validation sets at a ratio of 80:20, enabling real-time model training performance monitoring. The ResNet50 model (40) was loaded without pre-trained weights, and all images were resized to 256×256 pixels. Data augmentation was incorporated into training data loading which randomly introduced contrast jitter and rotations to all images for boosting data variability, model versatility and robustness. Multiple combinations of training parameters and hyper-parameters were experimented with for performance optimisation in a grid search-like process with manual tuning. For both training and validation sets, the batch size was set to 64, with cross-entropy loss and Adam optimiser (learning rate of 0.001). 20 epochs were scheduled, with an early stopping condition if both training and validation accuracy surpassed 99% at the end of an epoch. The trained model states were saved as .pth (PyTorch-trained model) files. Each model was initialised 10 times with different random states and results are represented as median with interquartile range.
The trained models were subsequently evaluated on the independent test sets. Performance was gauged by overall accuracy, as well as precision and recall, with confusion matrices constructed. To understand the rationale of the model in classifying images, Grad-CAM (41) was used to create attention heatmaps to visualise the focus areas of each model, providing insight into what features it considers distinctive to each class, as well as catching instances where the model may predict based on non-relevant features.
Segmentation and cardiothoracic ratio calculation
Good quality frontal CXR images with cardiomegaly mentions in their associated reports were then grouped based on the severity descriptors extracted from text reports. For each frontal image, segmentation of the lungs and heart was carried out using the inbuilt default U-Net-based segmentation model in the Chest X-ray Anatomy Segmentation (CXAS) package (42) (v0.0.9), generating lungs and heart masks in the PNG format.
The Maximal Horizontal Diameter (MHD) of the lungs and heart masks were used to calculate the computer vision-based CTR (CV-CTR). The MHD was calculated by counting the number of mask pixels in each row of a given mask image using opencv-python (v4.9.0.80) and numpy (v1.26.2) packages, and subsequently taking the 95th percentile value to eliminate potential outliers caused by sub-optimal segmentation.
If either the lung or heart MHD of an image was 0, it was excluded from further analysis as segmentation had likely failed for that image. The CV-CTR for each image was then calculated by dividing the heart MHD by the lung MHD (Fig 1). The CV-CTRs were subsequently grouped by severity descriptors from the reports.
Additionally, another set of ratios was calculated using the total areas, instead of HMD, of heart and lung masks. Areas were measured by counting the total number of mask pixels within lung and heart masks, and the Cardiothoracic Area Ratio (CAR) was calculated by dividing heart by lung area.
Natural language processing tools
Extracting data from reports
CXR reports containing either positive or negative mentions of cardiomegaly and its synonyms were filtered by selecting only reports labelled 1 or 0 in the ‘cardiomegaly’ column in the CheXpert CSV, thereby rejecting any reports with ambiguous mentions or no mentions concerning cardiac size. For each selected report, NLP methods, including text tokenisation and keyword searching, were applied using the NLTK (v3.8.1) package to extract sentence(s) describing cardiomegaly or cardiac size. Cardiomegaly mentions were extracted from a total of 137,424 reports through this process, roughly 60% of all MIMICCXR dataset reports.
All adjectives and adverbs from extracted cardiomegaly mentions were identified and ranked based on frequency. From the top 50 most frequent adjectives and adverbs respectively, severity descriptors (terms that describe the severity of a condition) that appeared in at least 100 reports were selected for further analysis. Adjectives and adverbs sharing the same stem (such as mild and mildly) were merged. Cardiomegaly mentions files containing these severity descriptors were grouped by keyword matching, with any files containing two or more different severity descriptors being discarded.
Comorbidity analysis
Comorbidity analysis was performed for each report using annotations of 14 conditions in the CheXpert CSV. The conditions were ‘Atelectasis’, ‘Cardiomegaly’, ‘Consolidation’, ‘Edema’, ‘Enlarged Cardiomediastinum’, ‘Fracture’, ‘Lung Lesion’, ‘Lung Opacity’, ‘No Finding’, ‘Pleural Effusion’, ‘Pleural Other’, ‘Pneumonia’, ‘Pneumothorax’, and ‘Support Devices’. Any positive mention of the condition in the report (labelled ‘1’ in the CSV) was considered an occurrence. Ochiai similarity (Eqn. 1) was calculated between conditions to analyse their degree of association (where a is the number of co-occurrences of conditions A and B, b the number of single occurrence of condition A, and c the number of single occurrence of condition B). We used Ochiai similarity instead of co-occurrence frequency as different conditions had varying occurrence frequencies. The Ochiai similarities among the 14 conditions were used to construct a graph of co-occurrences using the NetworkX package (v3.1).
Report text generation
Previously extracted cardiomegaly mentions were regrouped with severity categories. The mentions were pre-processed to strip periods and all letters were converted to lowercase to account for minor variations in expressions. Within each severity category, the most frequently used sentences for cardiomegaly reporting were counted and ranked. Expressions that mention other conditions irrelevant to cardiomegaly and/or patient history were filtered out. The ten most frequent filtered expressions from each category were given to MedQBot on Poe (https://poe.com/medqbot), an LLM originally intended for medical exam preparation, with a prompt to generate ten similarly concise, report-styled, cardiomegaly expressions. The prompt used is listed below, where [severity group] was replaced by each of the adjectives or adverbs identified above:
Here are some CXR report sentences describing [severity group] cardiomegaly. Based on this set of sentences, generate similarly concise and report-styled sentences describing [severity group] sentences.
The generated sentences were acquired on 13 March 2024. They were manually checked for accuracy, style and relevance, with unsuitable sentences (e.g. too verbose, unrelated, inaccurate, temporal) being discarded.
Human evaluation
To evaluate our cardiomegaly CAD pipeline performance, four radiologists (three with 1 to 5 years of experience, and one with 10+ years of experience) were invited to participate in a small-scale assessment using the online Qualtrics questionnaire. A total of 40 items were prepared, each consisting of a frontal CXR image, a report snippet of the findings section, and a side-by-side question box where the radiologists indicate to what degree they agree with each sentence from the snippet based on the image provided. The radiologists were able to provide reasoning for any disagreement in a free-text box. Each item was seen by two radiologists who were blinded to each other’s answers and were not informed whether the sentences were machinegenerated or from the original MIMIC report. Each radiologist evaluated a similar number of sentences, with the same ratio of changed and unchanged sentences.
The questions were divided into four categories (Table 1) to ensure an even mix of original and machine-generated sentences, either related or unrelated to cardiomegaly. The machine-generated cardiomegaly sentences were randomly selected from the pre-curated list as described before based on the predicted severity group by the computer vision model. For question categories 2 and 4, non-cardiomegaly sentences were rewritten and paraphrased with MedQBot with the following prompt on 19 March 2024:
For the following paragraph taken from a CXR report, generate a new paragraph by paraphrasing each sentence from the given paragraph, while maintaining concision and standard CXR reporting style.
Each radiologist was assigned 20 items (each containing multiple statements to assess), with each radiologist receiving roughly the same number of items in each listed category. The assignment process also ensured that each question was answered by exactly two radiologists. The radiologists were not informed of the design or pipeline of this study nor the composition of the question set or that of their assigned questions. Their responses were individually downloaded as CSV from Qualtrics and pre-processed to integrate into a single data frame using pandas (v2.1.4) with paired responses for each question, complete with question category and cardiomegaly relevance.
To assess inter-rater agreement we computed the total agreement by comparing their results for each sentence. We used two agreement metrics: strict agreement which consisted of four distinct categories (‘Agree’, ‘Partially Agree’, ‘Disagree’, and ‘Unsure’) and relaxed agreement which merged ‘Agree’ and ‘Partially Agree’ into a single category.
Results
Data quality assessment
We first reviewed the data quality for several aspects relevant to this study. The MIMICCXR dataset is one of the largest publicly available chest X-ray radiology datasets, thus containing a considerable degree of data variability. Consequently, only a subset of the MIMIC-CXR dataset is suitable for developing cardiomegaly detection CAD systems.
Dataset filtering for downstream applications is complicated by a host of factors. As mentioned before, frontal CXR is best suited for cardiomegaly detection, while lateral images are rarely used and should thus be excluded. Furthermore, the MIMIC-CXR dataset is not error-proof. We found that 421 patient reports contain mismatched sex mentions (Table 2). View labels of images were occasionally missing or incorrect, for instance, by mislabelling a lateral image as PA/AP. In addition, a smaller proportion of images suffer from quality issues severe enough to entirely prevent analysis. To address this, the CAD system needs to first identify and flag unsuitable images to prevent further processing and analysis while alerting human radiologists. This is accomplished by leveraging two ResNet50 classification models as detailed below.
ResNet50 classification sub-models and metadata quality
As mentioned, a two-step hierarchical model was developed to classify CXR images by quality and view for dual purposes. In direct relation to the CAD system itself, the model provides a filtering step to only allow suitable images to be assessed by downstream cardiomegaly detection algorithms. Additionally, the model can also be utilised to help correct mislabelling and flag flawed images within the MIMIC-CXR dataset, or potentially any similar CXR datasets. 22 abnormal images were identified from p10 and p11 subfolders and categorised into five groups based on similar features (Fig 3A). For each category, 100 new images were generated with synthetic and augmentation techniques. Together, they constituted an abnormal training subset of 500 images to train the first model to classify normal vs abnormal images. This model demonstrated high overall accuracy on a test set with 20 real bad-quality images from the MIMIC-CXR and 1,000 normal images with 85% recall for poor quality class (Fig 3B). The second model, which differentiates between frontal and lateral CXR, exhibited much better performance, with 99.4% accuracy on the test set with 1,000 frontal and lateral images each (Fig 3D).
For both sub-models (see Fig 2), model explainability was explored through mapping model attention with Grad-CAM (Fig 3C, E). Model attention for sub-model 1 is relatively localised with a region of high intensity on the left. For submodel 2, model attention is diffused when frontal images are encountered, covering most of the image with higher attention zones covering both lungs. When presented with a lateral image, model attention is markedly more focused and often localised to the region just below the diaphragm.
Cardiomegaly severity classification
Extraction of cardiomegaly mentions using natural language processing
Gaining a comprehensive understanding of the cardiomegaly landscape in the MIMIC-CXR dataset is paramount to CAD development. ‘Normal’ is the most common descriptor for heart size with 10,883 cases, followed by ‘Mild’ (4,651 cases), ‘Borderline’ (4,172 cases) and ‘Moderate’ (2,392 cases) when describing positive cardiomegaly. Generally, descriptor frequency decreases with increasing perceived severity. ‘Mild to moderate’, ‘Moderate to severe’ and ‘Severe’, are less commonly used, each with less than a thousand cases (Figure 4A).
CXR auto-segmentation and cardio-thoracic ratio calculation
After CV-CTR calculation, severity descriptors were ranked by median CV-CTR and CAR. For each pair of neighbouring descriptors, t-tests were performed with Šidák correction to identify statistically synonymous severity descriptors which were then merged to form severity categories.
Threshold values separating each severity level were calculated separately for both CV-CTR and CAR. For the former, t-tests were performed for every value between 0 and 1 at intervals of 0.01 with CVCTR data of each severity group, and each value was assigned to the severity group with the greatest p-value, forming ratio intervals. For CAR, t-tests are performed in a similar manner with the range of tested values set between 0 to 0.5 instead. A correlation test between the two sets of ratios was also carried out to determine whether the ratios are correlated on a case-by-case basis.
For each descriptor, CV-CTR was calculated for all corresponding PA CXR images with the computer vision model (Fig 4A). Distributions appear to be normally distributed, although some are slightly skewed, such as the ‘moderate-tosevere’ category. Data spread is generally bigger in higher severity groups. The ranking of descriptors by median is mostly in line with linguistic expectations. 93.8% of mild cardiomegaly cases were found to have CV-CTR over the clinical threshold of 0.5 based on the computer vision model, with upwards to over 98% in severe cardiomegaly, indicating high recall of the CAD system for cardiomegaly cases.
Data-driven model for severity classification
Significant overlaps are observed between the distribution of neighbouring descriptors. To identify statistically synonymous severity descriptors, multiple t-tests were carried out between CTR distributions of each pair of neighbouring descriptors, and any neighbouring descriptors with non-significant differences were merged into groups, forming a total of seven severity categories (Fig 4A).
CV-CTR intervals for each severity group were subsequently established for each severity category (Fig 4B). The upper threshold for ‘normal’ CTR (0.51) is close to 0.5, mirroring the clinical threshold for cardiomegaly diagnosis. The ratio intervals for normal and severe categories are much larger than intermediates to encompass extreme values. The number of cases falling into each severity interval varies slightly compared to the reference distribution of cases based on original report descriptors.
Using previously established severity group CV-CTR intervals, a confusion matrix was constructed to evaluate heart size predictions of the model against MIMIC ground truths (Fig 4C). At both ends of the severity spectrum, the model performs relatively well in accurately determining cardiomegaly severity. Intermediate severity categories, such as ‘moderate’, saw a drop in accuracy in severity classification, although the model is still highly accurate for binary classification (normal vs cardiomegaly) across all categories excluding ‘borderline’.
The severity classification based on CV-CTR is similar to the cardiothoracic area ratio (CAR), defined here as the ratio of the total heart area and lung area as presented in PA CXR images. We found the distribution pattern of CV-CAR to be highly similar to that of CV-CTR, and the two are found to be positively linearly correlated (R2=0.82), with 90% of residuals within ±0.09, and with the CV-CAR showing similar patterns across the 7 severities (Supplementary Materials, Figure 7).
Report sentence generation using a medical large language model
Once a severity category prediction has been made by the computer vision model, it is used as an input argument for report sentence generation using LLMs. A sentence is picked at random from the pre-curated list of generated sentences of the corresponding severity category to constitute the final report sentence. Fig 5 illustrates three examples of report sentence generation based on cardiomegaly severity predictions. Model prediction for the CXR image in Fig 5A matches that of the original report, and as such a sentence resembling the original is generated. Model prediction for the CXR image in Fig 5B differs from the original by one level of severity and thus is regarded as a close match, while a mismatch is found for the CXR image in Fig 5C.
Human evaluation of radiology reports
Inter-rater agreement
We calculated the strict and relaxed agreement in three different settings: (1) between our radiologists (strict: 0.487, relaxed: 0.864), (2) between our radiologists and the original MIMIC radiologists on unchanged sentences (strict: 0.624, relaxed: 0.845), and (3) between our radiologists and machine-generated sentences (strict: 0.545, relaxed: 0.929).
Notably, the machine-generated sentences achieved a strict agreement of 0.545, which was 0.089 lower than the maximum strict agreement between our radiologists and the MIMIC radiologists (0.634). In contrast, the machinegenerated sentences achieved the highest relaxed agreement (0.929), surpassing the agreement between our radiologists and the MIMIC radiologists by 0.084.
Comorbidity mentions
We evaluated the relationship between other mentions in the reports and cardiomegaly. We show here the most common co-occurrences across all reports, and for those with co-occurrences with cardiomegaly (Fig 6A) we visualise the CV-CTR in Fig 6B. Between 53% (lung lesion) and 90% (edema) of patients had these as comorbidities with presence of cardiomegaly.
Discussion
MIMIC-CXR Dataset Quality Control
The presence of quality issues, such as abnormal images and mislabelling within the MIMIC-CXR dataset, is an unexpected finding despite the relative rarity. Regardless of the underlying causes, if left unchecked and unbeknownst to researchers, they could pose significant challenges to ensuring data quality and reliability in future studies utilising the dataset.
Current literature involving MIMIC-CXR, however, rarely mentions the dataset quality shortcomings, an odd predicament given the popularity of the dataset. The original journal describing the MIMIC-CXR dataset does contain a brief notice of the variability of images caused by image rotation, poor patient positioning and secondary collimation, yet did not comment on the presence of severe image quality issues nor instances of mislabelling (36). Out of five studies published over the past two years involving the use of MIMIC-CXR images for creating deep learning models, none has acknowledged any aspects of quality issues within the dataset (43–47).
The current study sought to ascertain the extent of selected quality issues within MIMIC-CXR and rectify them for the benefit of future studies using the same dataset. Manual identification of bad-quality images or wrong view labels is possible, but time-consuming considering the overwhelming number of images involved. ResNet50 model 1 from the study was used to identify potentially abnormal images, from which a total of 83 images with severe quality issues were annotated after human cross-checking. The same principle was similarly applied to correcting mislabelled views using model 2, with 1,054 total corrections. Additionally, model 2 was utilised to classify images previously without view labels with human crosscheck. The augmented image metadata dataset represents an improvement over the original, serving to maximise the exclusion of images of bad quality or incorrect views in datasets used in future studies.
Conversely, the ResNet50 models themselves may still have room for improvement. Judging from model attention heatmaps, model 1 frequently displays a stronger focus around the edge of the image as opposed to inner zones, which may limit its ability to identify certain types of abnormal images. The hierarchical nature of the models makes model 1 a potential weak spot, as its moderate false negative rate increases the chance of an abnormal image continuing through the CAD system unchecked. However, since abnormal images are rare by nature, the overall impact is likely limited.
Cardiothoracic Segmentation and Cardiomegaly Detection
Our computer vision model, developed as part of the CAD system, exhibited good performance for binary cardiomegaly classification. As it was not a machine learning model and instead relied on statistical methods, the need for curating a training set is eliminated, and all images within the MIMICCXR dataset can potentially serve as part of the test set. It has an overall accuracy of 95.2% (0.95 AUC) by using all images with unequivocal heart size mentions from MIMICCXR as the test set, utilising MIMIC-CXR report diagnosis as ground truth and 0.51 CV-CTR as the threshold for cardiomegaly (borderline cases are excluded as they do not fit into either category). The CAD model has a false negative rate of 6.6%, indicating that it captures the majority of cardiomegaly cases, while its false positive rate is 3.2% is reflective of high precision of cardiomegaly detection. Note that these numbers may not fully reflect the model’s true accuracy, as the model is not yet tested on other datasets. Also, not all cardiomegaly diagnoses from MIMIC-CXR are necessarily correct, as evidenced later. In terms of overall accuracy, our model proves superior to nearly all existing classificationbased (28) and segmentation-based (27, 30, 31) models (Table 3). The test set used in this study (n=19,144) also vastly outnumbers those used in similar studies (131 to 600) by magnitudes in terms of total images included.
Furthermore, while these past studies have developed similar CAD systems for binary cardiomegaly detection, few have gone to lengths to implement functions for differentiating between different levels of severity. In contrast, when radiologists report findings of cardiomegaly, it is often standard practice to attach a severity descriptor. The current study is unique in that it identified the most common descriptors used, grouped synonyms based on statistical analysis and derived CV-CTR intervals for each severity category, which makes the CAD output more informative and is indispensable for downstream report sentence generation.
Our work also explores the subjectivity in severity attribution in clinical practice. Without any universal guidelines for descriptor selection, radiologists rely on individual judgment to pick the most appropriate descriptors. This is a process that permits a significant degree of subjectivity, which is reflected by the wide distributions of CV-CTR across different descriptors and severity categories (Fig 4B). While fluctuations in segmentation and computer vision model performance certainly do contribute to the overlaps, natural subjectivity and personal biases of radiologists are also likely factors.
Furthermore, responses gathered from the radiologist questionnaire revealed 4 cases where the responding radiologist rejected the original MIMIC-CXR judgement of cardiomegaly, which constitutes 12.5% of all original cardiomegaly sentences assessed. In nearly half (47.5%) of all cardiomegaly sentences included in the questionnaire, the two assessing radiologists disagreed in their response agreement levels. Indeed, the strict agreement between assessing radiologists is only 0.4878 (0.864 relaxed), indicating a considerable degree of subjectivity in CXR interpretation and reporting. Nonetheless, a point of positivity lies in the fact that the upper threshold of normal CV-CTR determined by statistical analysis (0.51) is nearly identical to the currently used clinical threshold (0.5), indicating a high level of adherence in clinical practice.
In addition, studies have shown that CTR may have limited diagnostic power when compared to cardiac MRI, the gold standard for cardiac chamber enlargement detection (48, 49). Therefore, we have explored the CAR as an alternative metric for cardiomegaly detection. While not used clinically, the concept considers the definition and presentation of cardiomegaly and has the advantage of considering two dimensions of the heart as opposed to a single dimension used for CTR calculation. Despite these advantages, we did not observe a major difference between CTR and CAR in terms of alignment with cardiomegaly severity. However, the reliability of the CAD system may be improved by cross-checking CV-CTR and CV-CAR derived from the same images, flagging instances where a significant mismatch is found between predicted severity by both methods.
The study also examined correlations between cardiac enlargement and various comorbidities. The mention of edema appeared to be the strongest predictor for enlarged cardiac size, with CV-CTR greater than 0.5 in over 90% of cases, only marginally less than mentioning of cardiomegaly itself. As such, the presence edema may be strongly predictive of the co-occurrence of cardiomegaly. This reflected the outcome of the co-occurrence analysis showing a frequent cooccurrence between the two conditions (Fig 6B).
LLM Sentence Generation and CAD Performance Evaluation
The inclusion of an automatic report sentence generation module is another feature less commonly found in other cardiomegaly CAD systems, the addition of which further enhances user experience by removing the need for manual text input. Compared to more diverse and complex conditions, the reporting of cardiomegaly has the advantage of being quite consistent and uncomplicated, which enables LLMs to imitate their style effectively.
Another strength of the study is the inclusion of radiologist assessment of the performance of the CAD system, a rare inclusion with parallels only found in one other study (30). Whereas most other work in the space of report generation use natural language generation metrics to evaluate the agreement between generated text and original reports (50–52). While the assessment is limited in scale, it nevertheless helped validate the system’s feasibility. The current CAD system reached an average agreement of S:0.545/R:0.929 with our radiologists’ verdicts, comparable with S:0.624/R:0.845 between our radiologists and MIMIC, suggesting that the CAD system’s performance may be non-inferior, or share a comparable level of subjectivity, to trained radiologists. Our findings indicate that our machinegenerated sentences were on par with the MIMIC radiologists in assessing the severity of cardiomegaly and the semantics used to generate the report. In fact, disagreements appear to be mostly linguistic rather than diagnostic, as evidenced by the distribution of reasons for disagreement cited, which can be improved by curating a better set of generated sentences.
Limitations and Future Work
While the preliminary results offer much encouragement, the study remains a proof-ofprinciple at this stage, owing to the lack of large-scale trials involving clinical validation. It is also yet to be tested on datasets other than MIMIC-CXR. Furthermore, the caseby-case performance of the computer vision model is dependent on accurate anatomical segmentations, and as such is vulnerable to errors arising from aberrant segmentations. It is especially a concern among CXR images where the cardiac silhouette and/or part of the lungs is obscured by conditions such as pulmonary edema and pleural effusion, which can significantly diminish the reliability of segmentations obtained with CXAS. A more versatile segmentation model optimised for challenging CXR images could greatly benefit the performance of the model.
Additionally, since PA is the recommended view for cardiomegaly detection, AP CXR images have been excluded from analysis, despite the close resemblance of AP CXR to PA CXR. While AP CXR has the inclination to exaggerate cardiac size (53), they can, nevertheless, serve as an initial screening for determining the need for further investigation. For instance, one study has recommended a CTR ratio of 0.6 as the threshold when using AP CXR for increased specificity (54). The CAD’s usage scenarios may therefore be expanded by including AP CXR in follow-up studies.
During the preparatory phase in LLM sentence generation, it was noted that a significant portion of cardiomegaly mentions includes references to past time points, examples include ‘mild cardiomegaly is unchanged’ and ‘severe cardiomegaly has improved’. The study did not explore the possibility of generating sentences capturing multiple time points. However, it may be feasible by allowing the model to calculate CTR and make a severity prediction for both the past and present CXR in question, using any difference as an indicator of change to generate more dynamic sentences.
Conclusion
The present study has developed the essential framework and components for the assembly of a highly automated and reliable CAD system for cardiomegaly detection using PA CXR images. The system holds several advantages, including a high degree of cost-effectiveness, explainability in detection, and efficiency in operation. Input CXR images are first screened to exclude and flag instances unsuitable for cardiomegaly detection. By calculating CTR using a computer vision model from deep learning-derived heart and lung segmentations, high overall accuracy in detecting cardiomegaly and assigning appropriate severity labels is achieved. Furthermore, the model is capable of generating human-like, report-styled sentences based on the output of the computer vision model. Performance metrics indicate that the ability of the model to generate accurate cardiomegaly report sentences based on PA CXR is likely to be non-inferior to human radiologists, although larger trials are necessary to confirm this judgment. With further optimisation, the novel CAD system could prove to be a welcome addition to the radiology toolkit, representing an accessible and cost-effective method to offer radiologists a second perspective to validate their own assessments.
Data Availability
All data produced in the present work are contained in the manuscript.
AUTHOR CONTRIBUTIONS
Conceptualisation, J.M.P. and M.G.S.; Methodology, T.Z., K.X., A.D.L. and J.M.P.; Formal Analysis and Investigation, T.Z., K.X. and W.S.; Supervision, J.M.P., A.D.L., M.G.S., K.L.R. and M.B.M.; Writing – Original Draft: T.Z., K.X., A.D.L. and J.M.P.; Writing – Review & Editing: J.M.P., A.D.L., K.L.R., and M.G.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by Health Data Research (HDR) UK and the Medical Research Council (MRC) via an UKRI Rutherford Fund Fellowship to J.M.P. (MR/S004033/1). A.D.L. and J.M.P. are additionally supported by the Horizon Europe project CoDiet. The CoDiet project is funded by the European Union under Horizon Europe grant number 101084642. CoDiet research activities taking place at Imperial College London is supported by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (grant number 101084642). J.M.P. is supported by the MRC project GI-tools (MR/V012452/1). The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Supplementary Materials
Acknowledgement
We gratefully acknowledge the support of Dr Catherine Payne, Dr Abi Redwood, Dr Liam Roebuck and Dr John Taylor (listed alphabetically) in scoring the reports. We would also like to express our gratitude to Dr Jamie Campbell and Dr Joanna Davis for their time and assistance in gathering expert feedback. We thank Dr Tom Pollard and Dr Alistair Johnson for their advice and clarifications on the MIMIC-CXR data user agreement that prohibits disclosure of data (images, report snippets). The NIH Clinical Center was the provider of the de-identified images of chest xrays shown in this manuscript, and is available through the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC.