Automated quality control of T1-weighted brain MRI scans for clinical research: methods comparison and design of a quality prediction classifier

Gaurav Bhalerao; Grace Gillis; Mohamed Dembele; Sana Suri; Klaus Ebmeier; Johannes Klein; Michele Hu; Clare Mackay; Ludovica Griffanti

doi:10.1101/2024.04.12.24305603

Abstract

Introduction T1-weighted MRI is widely used in clinical neuroimaging for studying brain structure and its changes, including those related to neurodegenerative diseases, and as anatomical reference for analysing other modalities. Ensuring high-quality T1-weighted scans is vital as image quality affects reliability of outcome measures. However, visual inspection can be subjective and time-consuming, especially with large datasets. The effectiveness of automated quality control (QC) tools for clinical cohorts remains uncertain. In this study, we used T1w scans from elderly participants within ageing and clinical populations to test the accuracy of existing QC tools with respect to visual QC and to establish a new quality prediction framework for clinical research use.

Methods Four datasets acquired from multiple scanners and sites were used (N = 2438, 11 sites, 39 scanner manufacturer models, 3 field strengths – 1.5T, 3T, 2.9T, patients and controls, average age 71 ± 8 years). All structural T1w scans were processed with two standard automated QC pipelines (MRIQC and CAT12). The agreement of the accept-reject ratings was compared between the automated pipelines and with visual QC. We then designed a quality prediction framework that combines the QC measures from the existing automated tools and is trained on clinical datasets. We tested the classifier performance using cross-validation on data from all sites together, also examining the performance across diagnostic groups. We then tested the generalisability of our approach when leaving one site out and explored how well our approach generalises to data from a different scanner manufacturer and/or field strength from those used for training.

Results Our results show significant agreement between automated QC tools and visual QC (Kappa=0.30 with MRIQC predictions; Kappa=0.28 with CAT12’s rating) when considering the entire dataset, but the agreement was highly variable across datasets. Our proposed robust undersampling boost (RUS) classifier achieved 87.7% balanced accuracy on the test data combined from different sites (with 86.6% and 88.3% balanced accuracy on scans from patients and controls respectively). This classifier was also found to be generalisable on different combinations of training and test datasets (leave-one-site-out = 78.2% average balanced accuracy; exploratory models = 77.7% average balanced accuracy).

Conclusion While existing QC tools may not be robustly applicable to datasets comprised of older adults who have a higher rate of atrophy, they produce quality metrics that can be leveraged to train a more robust quality control classifiers for ageing and clinical cohorts.

Introduction

Large big brain MRI datasets hold immense value for well-powered statistical analyses and cross-cohort investigations (Madan, 2022). The emergence of open science initiatives and platforms for sharing data has made it possible to combine data from multiple sites and studies (Markiewicz et al., 2021; Wilkinson et al., 2016). With the emergence of comprehensive neuroimaging pipelines (e.g., UK Biobank, Human Connectome Project, etc.), it is now feasible to obtain imaging derived outcome measure on other datasets, including clinical populations (Littlejohns et al., 2020; Van Essen et al., 2013). In the ageing and dementia space there is a wealth of clinical datasets, made available through initiatives such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and Dementias Platform UK (DPUK) (Bauermeister et al., 2020; Petersen et al., 2010). The aggregation of neuroimaging data obtained from clinical populations not only increases sample sizes but also facilitates the generation of reproducible and generalisable outcome measures, thus paving the way for innovative approaches in detecting brain biomarkers (Khanna et al., 2018; Van Horn & Toga, 2009). A substantial focus of neuroimaging research revolves around enhancing automated pipelines to produce reliable and relevant outcome measures from extensive datasets (Esteban et al., 2019; Frazier-Logue et al., 2022; Notter et al., 2023; Sherif et al., 2014). However, analysing large-scale datasets requires robust automated pipelines to ensure the generation of consistent measures across varied datasets. Despite the benefits, dealing with clinical datasets pose an additional challenge in big data analysis due to higher heterogeneity, motion artefacts, and disease-related factors like atrophy or other abnormalities (Andre et al., 2015; Nárai et al., 2022). Consequently, the critical task arises of identifying useable scans for processing through the automated pipelines to obtain reliable results.

While the MRI protocol may vary across datasets, a core component is a structural T1-weighted (T1w) scan. T1w MRI is used to examine brain structures, assess brain volume changes, and detect abnormalities, for example those associated with neurodegenerative diseases. It is also used as anatomical reference for the analysis of other structural and functional imaging modalities, as it provides detailed anatomical information. The initial and crucial step in brain imaging analysis involves assessing the quality of T1w MRI scans. The effectiveness of subsequent steps, such as multimodal registration and morphometry estimation, relies heavily on the quality of these scans. Traditionally, researchers visually inspect scans before analysis, but this practice isn’t always feasible when dealing with large datasets. Removing too many scans after quality assessment can decrease the sample size, while including poor-quality scans can introduce biases into the resulting outcomes (Gilmore et al., 2021).

Several automated approaches have been developed for quality control (QC) on T1w brain MRI scans (Hendriks et al., 2023). Various rule-based QC approaches have been proposed considering the image background to assess scan quality e.g. using measures such as - distortion (Woodard & Carley-Spencer, 2006), noise and ghosting artifacts (Gedamu et al., 2008), derived from image background (Mortamet et al., 2009), etc. Other rule-based QC approaches considered the image foreground to assess quality of the scans (Jang et al., 2018; Osadebey et al., 2018). Several automated machine learning approaches have been proposed, which extract quality measures from the images and are trained using visual QC labels to predict scan quality (pass or fail)(Alfaro-Almagro et al., 2018; Esteban et al., 2017; Pizarro et al., 2016). Various other studies used deep learning approaches to classify the scans as pass or fail using the entire image instead of specific quality measures (Bottani et al., 2022; Keshavan et al., 2019). Tools for brain morphometric analysis like Computational Anatomy Toolbox (CAT12) also offer quality control ratings based on tissue segmentation to evaluate scan quality (Gaser et al., 2022). While current automated QC tools are valuable, they are usually designed using data from healthy and/or young population or optimised for a specific dataset or type of scanner. To perform successful quality control in large clinical datasets, it is important to establish a framework that offers broader applicability across various clinical cohorts, age range and scanner types.

In this study, we tested two existing automated QC tools: MRIQC and CAT12. MRIQC is an open-source tool, offering an extensive array of metrics for evaluating quality on raw T1w images (based on noise, information theory, and specific artifacts), and it has become a standard reference in numerous studies (Chen et al., 2023; Elliott et al., 2023; Lorenzini et al., 2022). CAT12 is widely utilized in the field and encompasses a variety of quality control options (based on noise contrast, inhomogeneity contrast, resolution) applicable to images processed within the tissue segmentation pipeline (Besteher et al., 2022; Hahn et al., 2022; Sakreida et al., 2022). To classify the scans into pass or fail, MRIQC additionally provides a pre-trained supervised classifier which can be utilised to predict the quality of scans. In contrast CAT12 provides image quality ratings for each measure which can be used to determine usable or unusable scans from the analysis. Due to their wide use and broad range of comprehensive measures available in both tools from raw and tissue-segmented scans, we selected these tools as good candidates to perform QC on clinical datasets. We first tested the agreement between MRIQC and CAT12 with visual quality inspection on a large sample of clinical research data (N = 2438) from an extensive spectrum of datasets spanning ageing and neurodegenerative cohorts. We studied the relationship between the QC metrics produced by the two tools and tested the tools’ performance when adjusting the accept-reject threshold. We then proposed a new classification framework which uses a combination of QC metrics from both automated tools as features and visual QC as gold standard. We tested the generalisability of the proposed classifier on various test datasets that differed in terms of population and scanner. Finally, by looking at the distribution of QC measures that contributed most to the higher classification accuracy, we explored how they could be used to inform data harmonisation strategies. The code is openly available, and the proposed classifier will be made accessible on the DPUK data portal, to support future clinical research studies.

Methods

Data & visual QC of T1w brain scans

Structural T1w brain images from 4 clinical research datasets (N = 2438) acquired on 39 scanners from three different manufacturers (Siemens, Philips, GE) were used: 1) Oxford Brain Health Clinic (BHC) (Griffanti et al., 2022) [age range: 65 - 101 years], 2) Oxford Parkinson’s Disease Centre (OPDC) (Griffanti et al., 2020) [age range: 39 - 116 years], 3) Whitehall II imaging study (Filippini et al., 2014) [age range: 60 – 85 years], 4) Alzheimer’s Disease Neuroimaging Initiative (ADNI) (Petersen et al., 2010) [age range: 55 - 92 years]. Information on scanner, manufacturing model, counts, acquisition matrix and voxel size for these datasets is provided in Table 1.

View this table:

Table 1. Dataset-wise and scanner-wise counts of T1w scans of datasets used in this study

Oxford Brain Health Clinic - BHC (N=160)

The Oxford BHC is a joint clinical-research service for memory clinic patients which offers high-quality assessments not routinely available, including a multimodal brain MRI scan (Griffanti et al., 2022). Images are acquired on a Siemens 3T Prisma scanner using a protocol matched with the UK Biobank imaging study (Miller et al., 2016). The visual quality ratings were obtained from the dataset owners. These images were originally rated into low, medium, high quality. We categorised medium and high-quality images into accept label and low-quality images into reject label.

Oxford Parkinson’s Disease Centre Discovery Cohort - OPDC (N=383)

The OPDC study aims to identify biomarkers of Parkinson’s disease for early detection and progression. The dataset includes multimodal brain MRI data (acquired on a 3T Siemens Verio scanner) along with deep longitudinal clinical phenotyping in patients with Parkinson’s, at-risk individuals, and healthy elderly volunteers (Griffanti et al., 2020). For this dataset, the visual ratings were not available from dataset owners hence each image was visualised and rated into low, medium, and high quality by one rater. The medium and high-quality images were grouped into accept category and low-quality images were in reject category.

Whitehall II imaging sub-study (N=775)

The Whitehall II study is a longitudinal study of British civil servants to explore the factors affecting brain health and cognitive ageing (Filippini et al., 2014). In this dataset, 552 scans were acquired on a Siemens Verio 3T scanner (referred as Whitehall1 in the manuscript – protocol details in ((Filippini et al., 2014) and 223 scans on a Siemens Prisma 3T (referred as Whitehall2 in the manuscript – protocol details in (Zsoldos et al., 2020)). We treated the data from these two scanners separately in all the analyses for our work. The visual quality ratings (accept and reject) were obtained from the dataset owners.

Alzheimer’s Disease Neuroimaging Initiative - ADNI (N=1120)

The ADNI (adni.loni.usc.edu) was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). In this study we included all the baseline T1w brain images from ADNI 1,2,3 and GO (first run in each session). Due to the highly variable numbers of scans for each scanner, we grouped data from the same manufacturer and field strength together, for a total of 7 ADNI sites. The visual quality ratings were available on a scale from 1 (excellent quality) to 4 (unusable). Upon careful inspection of the quality description, we decided to label images with a rating of 1 or 2 into the accept category and those with a rating of 3 or 4 into the reject category.

T1w processing in automated tools

All the images were named and organised in Brain Imaging Data Structure (BIDS) (Gorgolewski et al., 2016) and defaced (to preserve the privacy of individuals) before processing.

MRIQC pipeline

MRIQC is an open-source pipeline that extracts image quality metrics (IQMs) from structural (T1w and T2w) and functional MRI data (Esteban et al., 2017). It uses modular sub-workflows from neuroimaging software toolboxes such as FSL (Jenkinson et al., 2012), ANTs (Avants BB et al., 2013)and AFNI at the background (Cox & Hyde, 1997). MRIQC also provides a random forest classifier (mriqc_clf) pre-trained on 1102 T1w scans (17 sites) from the Autism Brain Imaging Data Exchange (ABIDE) dataset. The classifier generates probability value for each scan (range 0 - 1) and any scan with probability more than or equal to 0.5 (default threshold) is categorised to reject label.

Each defaced T1w brain image was processed in MRIQC pipeline (singularity version 0.15.1). The list of image quality metrics (IQMs) and their description are provided in Table 2 (a detailed explanation can be found in the user manual of MRIQC). From each image 68 metrics were extracted. We used MRIQC’s random forest classifier (mriqc_clf) and labelled images into binary accept and reject labels.

View this table:

Table 2: List of MRIQC image quality metrics

CAT12 pipeline

CAT12 (Computational Anatomy Toolbox) is an extension of SPM12 covering diverse morphometric methods to provide computational anatomy (Gaser et al., 2022). CAT12 provides a retrospective QC framework for empirical quantification of image quality parameters.

Each defaced T1w brain image was processed in CAT12 segmentation pipeline (standalone version r2042 running on v93 of MATLAB compiler runtime). The surface processing option was enabled during the segmentation. Post segmentation, CAT12 generates a segmentation report for each image and provides image quality ratings (IQRs) based on noise, resolution, bias and aggregates these ratings to weighted IQR [range A+ (excellent) to F (unacceptable/failed)]. Additionally, (for the proposed classifier work) we also considered additional quality measures which are not provided in the CAT12 visualisation report but saved in the output of segmentation (named as, cat_<subjdirname>.mat). The description of all the quality measures is provided in Table 3, (a detailed explanation can be found in the user manual of CAT12). From each image 36 quality measures were extracted (Pravesh Parekh, 2021). To label the images into accept and reject quality, each image with weighted image quality rating (IQR) of C minus and below (selected as ‘default threshold’) was labelled into reject class.

View this table:

Table 3. List of CAT12 image quality measures

Comparison of MRIQC and CAT12 quality measures

We first compared the quality measures between the two automated tools. The quality measures derived from MRIQC and CAT12 were correlated using Pearson’s correlation. The correlation analysis was conducted in MATLAB2022b (The MathWorks Inc. (2022). MATLAB Version: 9.13.0.2105380 (R2022b), 2022).

Comparison of ratings between automated tools and visual QC

We calculated the percentage of scans that would pass QC and compared the agreement between visual QC, MRIQC classifier predictions (default threshold, MRIQC(D)) and CAT12’s weighted IQR (default threshold, CAT12(D)) using Kappa coefficient of inter-rater reliability (IRR) (Landis & Koch, 1977; McHugh, 2012). Three comparisons were performed: 1) CAT12 ratings vs. MRIQC ratings, 2) CAT12 ratings vs. visual QC ratings, 3) MRIQC ratings vs. visual QC ratings.

Further, we explored the effect of changing the labelling threshold from MRIQC classifier and CAT12’s weighted IQR. We investigated this by changing the CAT12’s weighted IQR threshold to – 1) strict (CAT12 (-)): any scan with weighted IQR rating C and below were labelled to reject category, 2) lenient (CAT12 (+)): any scan with weighted IQR rating D+ and below were labelled to reject category. Similarly, for the MRIQC classifier we changed the threshold of acceptance to – 1) strict (MRIQC (-)): scans with probability equal to or more than 0.4 were labelled to reject category, 2) lenient (MRIQC (+)): scans with probability equal to or more than 0.6 were labelled to reject category. We then re-calculated the Kappa coefficient for the above three comparisons. The Kappa coefficient was calculated using IRR package in R (Matthias Gamer et al., 2019; R Core Team, 2022).

Proposed QC classifier

In this section we present our proposed QC classifier. The primary model (combined data model) was trained and tested on a mix of data from multiple datasets and sites. We then tested the generalisability of our classification framework in a leave-one-site-out approach and in cases where training and test data differ in terms of field strength and/or scanner manufacturer.

Combined data model

Data and classifiers

We designed a binary QC classifier which combines the MRIQC and CAT12 quality measures as features. Binary visual QC ratings were used as target. For the combined data model, we first randomly divided our entire sample (N = 2438) into 80% training (N = 1955) and 20% test data (N = 483). The data was divided ensuring fair representation of target labels, sites, and proportion of patients and controls (when applicable) among both the training and test datasets. The site-wise and label-wise split for training and test datasets is provided in Table 4. We tested three options for the underlying machine learning classification: support vector machine, random forest, and random under-sampling boost.

View this table:

Table 4. Site-wise split of training and test data for the combined data model

Support vector machine (SVM) is one of the most common supervised classifiers, simple to train for hyperparameters, effectively handles high dimensional data and less prone to overfitting than non-linear classifiers (Cortes & Vapnik, 1995). We used the ‘fitcsvm’ implementation in MATLAB (MATLAB Version 9.14.0.2239454 (R2023a), 2023). Two hyperparameters were optimised in nested cross-validation (CV): box constraint (0.01, 0.1, 1, 10, 100, 1000) and Kernel function (linear, radial basis function). The remaining hyperparameters were maintained at their default settings. Random forest (RF) is a supervised classifier robust to outliers and non-linear data, faster to train and handles unbalanced classes in the data (as in our data ‘reject’ class samples are substantially lower than ‘accept’ class) (Breiman, 2001). We used the ‘fitcensemble’ implementation in MATLAB (MATLAB Version 9.14.0.2239454 (R2023a), 2023). Two hyperparameters were optimised in nested CV: Maximal number of decision splits (10,50) and number of ensemble learning cycles (10, 50, 100). The remaining hyperparameters were maintained at their default settings. We selected random under-sampling boost (RUS) as third classifier due to its ease of implementation, effective handling of imbalanced classes, rapid processing speed, and reduced computational complexity (Seiffert et al., 2008). It is a supervised classifier that under samples the majority class labels in the training process to balance the minority class. Given the imbalance of classes in our data, we used random under-sampling to avoid skewing towards the majority class (accept) and improve the detection of the minority class (reject) in our datasets. We used the ‘fitcensemble’ implementation in MATLAB (MATLAB Version 9.14.0.2239454 (R2023a), 2023). Three hyperparameters were optimized in nested CV: Maximal number of decision splits (10, 50), number of ensemble learning cycles (10, 50, 100), and learning rate for shrinkage (0.01, 0.1). The remaining hyperparameters were maintained at their default settings.

Nested cross validation approach

The classifiers were trained in a nested cross validation (CV) framework consisting 5 outer folds and 3 inner folds (See Figure 1). In the training phase, within every CV iteration, the features were standardized for each site separately. During the feature standardization of test data, only the mean and standard deviation from train data were applied to avoid data leakage. Within the CV, features were ranked using multiple filter-type feature selection methods (ReliefF, Chi-square, Minimum Redundancy Maximum Relevance, class separability criteria – t-test, entropy, Bhattacharya distance, Wilcoxon, Receiver Operating Characteristics). The ranks were then aggregated using robust ranking aggregation (Kolde et al., 2012). For each feature size (iterative; 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 104 (all)), the classifier was trained on the inner fold’s train data and tested on the inner fold’s test data for the grid of hyperparameters. For each outer CV iteration and for each feature size, the classification performance was averaged over all the inner CV folds and the combination of hyperparameters achieving the best performance were chosen. Finally, for each feature size the outer cross validation iteration was executed with the chosen combination of hyperparameters from the inner folds, models were re-trained, and tested on the outer test data. To get precise estimates of model’s performance, we ran a total 100 iterations of the nested CV in the training phase and obtained the best combination of hyperparameters for each feature size for each classifier. In the final model design, we aggregated feature ranks from all outer cross validation folds across 100 iterations and derived a final ranking of the features. The final model was then trained by using all the training data with the best combination of hyperparameters for each feature size and feature ranking across 100 CV iterations.

Figure 1. Nested cross validation workflow for training the QC classifier. The model was trained for 5 outer folds and 3 inner folds. The hyperparameters of the model were optimised on the inner test data and the combination giving the best performance (balanced accuracy) were selected for the outer folds. The nested cross validation process was repeated for 100 times. The best performing hyperparameters for each feature size and feature ranks across 100 iterations were used to train the final model and tested on the hold-out data.

Assessment and comparison of prediction performance

The final model’s performance on the test data was assessed by balanced accuracy (Eq. 1 – 3).

True positive (TP) – the model correctly predicts the accept label; True negative (TN): correctly predicts reject label; False positive (FP) – wrongly predicts accept label for a scan that should be rejected; False negative (FN) – wrongly predicts reject label a scan that should be accepted.

The choice to use balanced accuracy as our primary metric is based on the fact that our datasets have imbalance in the accept and reject classes and we are interested in both classes being predicted well for unseen datasets.

For each classifier (SVM, RF, RUS) we selected the feature size that gave the best performance. We then compared the prediction performance of the three optimised classifiers with each other and with MRIQC and CAT12. This comparison of prediction performance was done for - 1) combined test data (N = 483), 2) test data categorised by site (see Table 4 for number of scans in each site in each class), 3) patients and controls separately within the test data (see Table 5 for number of scans for in each class), 4) each sub-category of diagnosis within the test data (see Table 5 for number of scans in each class for each category).

View this table:

Table 5. Diagnosis group-wise number of scans in accept and reject labels

Feature importance

We investigated the distribution of the top 10 ranked features derived from the final combined data model by employing kernel density and scatter plots. To ascertain potential statistical variations in the distribution among sites, we conducted a two-sample Kolmogorov-Smirnov test. Subsequently, to address multiple comparisons, we applied the Bonferroni correction to obtain adjusted p-values.

Leave-one-site out models

To further validate our approach, we created leave-one-site out CV models using the best performing classifier on the combined data (among SVM, RF and RUS) to see how well our training workflow generalises to an unseen site. From this classifier, we extracted the combinations of hyperparameters, feature ranking and feature size at which the best performance was observed on the combined test data. These parameters were then used to re-train classifier on data from remaining sites while keeping each site as test data. Finally, the classification performance on each test site was assessed, comparing them against MRIQC and CAT12, and against the best performance of a combined data model on each site in the test data. The split of data for training and testing for each model is provided in Table 6.

View this table:

Table 6. Train and test split for leave-one-site-out models.

Exploratory models

Finally, we explored how well our approach generalises when the model is trained on data from one field strength and/or manufacturer and tested on data from other field strengths/manufacturers. These exploratory models were designed to test:

Generalisability across field strength: the majority of the datasets were acquired on 3T scanners (N = 1967) hence we trained the model on data from all 3T scanners and tested on the data from 1.5T field strengths.
Generalisability across manufacturer: the majority of the datasets were acquired from Siemens scanners (N=1928) hence we trained the model combining Siemens data from all field strengths and tested on data from other manufacturers.
Generalisability across manufacturer and field strength: the majority of the data is from 3T Siemens scanners (N = 1743) hence we trained the model only from 3T Siemens scanner data and tested on the remaining data.

The data split for training and test data is provided in Table 7. Similar to the leave-one-site out models, we chose the best performing classifier (among SVM, RF and RUS) on the combined test data and re-trained and tested the classifier for three different cases. As explained above, the training process used hyperparameter combinations, feature ranking and feature size at which the best performance was observed on the combined test data. The classification performances were assessed for each model, comparing them against MRIQC and CAT12, and against the performance of the combined data model on the test data from each case individually.

View this table:

Table 7. Training and test data split for exploratory models

Results

Comparison of quality measures (CAT12 vs MRIQC)

We analysed the correlation between CAT12 quality measures and MRIQC IQMs (Figure 2) to explore both common and distinct metrics within these tools. We observed various statistically significant correlation coefficients between pairs of measures from these automated tools. For instance, CAT12’s resolution measure exhibited significant correlation with MRIQC’s summary-based metrics derived from tissues, FWHM, image size, image spacing, and overlap of CSF with tissue probability maps (TPM). The absolute volume of tissues measured by CAT12 demonstrated significant correlations with MRIQC’s intra-cranial volume fraction of tissues and overlap of tissue classes with TPM. The relative intensity of background in CAT12 exhibited a significant correlation with MRIQC measures encompassing noise-based metrics, measures tied to specific artifacts (such as image nonuniformity), as well as other parameters like size, spacing, FWHM, and residual partial volumes of tissues. Relative intensities from CSF and GM in CAT12 were significantly correlated with summary measures from background, CSF, and GM in MRIQC. Summary measures derived from WM in MRIQC significantly correlated with CAT12’s resolution, relative intensity of CSF, and absolute volumes of GM and WM tissues. Similarly, CAT12’s relative contrast showed a significant correlation with summary measures from CSF and GM in MRIQC. On the other hand, non-significant or low correlations (below ±0.5) suggest that the two tools are also capturing unique information about the image. Measures falling in this category for CAT12 are noise, mean intensity from tissues, and surface measures while for MRIQC are QI1, QI2 (targeting specific artifacts), EFC, FBER (informed by information theory). For detailed correlation coefficients, p-values, and the upper and lower bounds of a 95% confidence interval for each pair of measures, refer to Supplementary material.

Figure 2. Correlation plot between MRIQC IQMS (columns) and CAT12 quality measures (rows). MRIQC generated total 68 IQMs and from CAT12 we extracted 36 quality measures.

Comparison of ratings between automated tools and visual QC

Percentage of scans passing QC

We first compared the percentage of scans that passed (accept category) QC using visual QC, MRIQC, and CAT12. The results are reported in Table 8. Overall, CAT12 showed the highest percentage of accepted scans compared to visual QC and MRIQC. MRIQC showed a more similar percentage of accepted scans to visual QC overall, but with over 8% difference in 3 datasets (ADNI 3T GE, ADNI 1.5T Philips and BHC)

View this table:

Table 8. Percentage of scans passing visual QC and QC from automated tools

Classification agreement

We computed Kappa coefficient to measure the agreement between the automated tools and with visual QC (Figure 3). A detailed table of Kappa coefficients, associated p-values, and percentage agreement for all the pairs of ratings is provided in supplementary material.

Figure 3. Kappa coefficient values comparing the agreement of ratings between visual QC and automated tools. The colours indicate the lowest (red), medium (yellow) and highest values (green) in each dataset. See supplementary material for details.

Automated tools vs visual QC

When evaluating the agreement on all datasets together (Figure 3 panel l), MRIQC and visual QC showed higher value of Kappa coefficient (k=0.3) than CAT12 and visual QC (k=0.28). However, when looking at each dataset separately, in some cases the agreement was higher between CAT12 and visual QC (panels a, b, c, f, h, k, Kappa between 0.27 and 0.59), while other datasets showed higher Kappa coefficient between MRIQC and visual QC (panels d, e, g, i, j, Kappa between 0.26 and 0.51). Notably, ADNI 1.5T Philips and 2.9T Siemens showed no agreement between CAT12 and visual QC (panels e, g).

CAT12 vs MRIQC ratings

For all datasets together, we found significant agreement between the ratings from CAT12 and MRIQC ratings (k=0.23). When considered each dataset separately, Whitehall2 dataset showed the highest agreement (k=0.54). Notably, some datasets in ADNI (3T GE, 1.5T Siemens, 1.5T Philips, 2.9T Siemens) showed no agreement or worse than expected agreement (zero or negative values of Kappa coefficient in Figure 4 panels a, f, e, g).

Figure 4. Kappa coefficient values comparing the agreement of ratings between visual QC and automated tools after adjusting the acceptance thresholds of automated tools. The comparison with default thresholds is also provided for ease of comparison. The colours indicate the lowest (red), medium (yellow) and highest values (green) in each panel. See supplementary material for details.

Impact of threshold on classification agreement

Given that the inter-rater reliability did not show consistency across datasets on which tool produced more similar ratings to visual QC using their default threshold (0.28 for CAT12, 0.30 for MRIQC), we explored the effect of using a lenient and stricter threshold of acceptance on the automated tools. The percentage of accepted scans upon adjusting the threshold are provided in Table 9. A detailed table of Kappa coefficients, associated p-values, and percentage agreement for all the pairs of ratings is provided in Supplementary material.

View this table:

Table 9. Percentage of accepted scans after adjusting acceptance thresholds for MRIQC and CAT12 (‘-‘ for strict threshold and ‘+’ for lenient threshold).

Visual QC vs automated tools

We recalculated the Kappa coefficient values after adjustment of thresholds to find the agreement between the automated tools and visual QC ratings (see Figure 4). When looking at the Kappa coefficient values from all datasets together (panel l), we found that the agreement between visual QC and CAT12 ratings improved after applying a strict threshold to CAT12. We found a similar effect when looking at each dataset separately on most of our datasets (panels b, d, e, h, I, j). For example, the lack of agreement between visual QC ratings and default threshold ratings of CAT12 (k =0) for ADNI 1.5T Philips dataset improved significantly (k=0.43) after applying strict threshold to CAT12 ratings. However, some datasets did not show any improvement in Kappa coefficient after adjusting thresholds (panels a, c, g, k).

For all datasets together, we did not see any improvement in Kappa coefficient when comparing visual QC with changed thresholds in MRIQC ratings. Some datasets showed increased agreement after applying lenient threshold to MRIQC ratings (panels b, c, e, g, h, k). For example, the significant agreement between visual QC and default threshold ratings of MRIQC in Whitehall2 (k = 0.52) was further improved (k = 0.72) after applying lenient threshold to MRIQC ratings. Only ADNI 1.5T GE dataset showed significantly improved value of Kappa coefficient after applying strict threshold to MRIQC ratings (from k = 0.25 to k= 0.5). The rest of the datasets did not show any improvement upon adjusting the threshold of MRIQC ratings (panels d, f, i, j).

CAT12 vs MRIQC ratings

For all datasets together and each dataset separately, the Kappa coefficient significantly improved between MRIQC default threshold ratings and CAT12 ratings after applying a strict threshold (from k=0.23 to k=0.31). Most of the datasets showed similar effect of improvement between default threshold ratings of MRIQC and CAT12 ratings after applying strict threshold (panels c, d, e, h, i, j). For Whitehall2 and ADNI 3T Philips, the Kappa coefficient improved between default ratings of CAT12 and MRIQC ratings after applying lenient threshold. Only for ADNI 3T GE, the Kappa coefficient value between default threshold ratings of CAT12 and MRIQC (from k =-0.06 to 0.47) after applying strict threshold to MRIQC ratings. Notably, ADNI 1.5T Siemens and ADNI 2.9T Siemens did not show any improvement upon adjustment of thresholds, showing zero agreement.