Abstract
The excitation/inhibition (E/I) ratio has been shown to be imbalanced in individuals diagnosed with autism (AT) or schizophrenia (SZ), relative to neurotypically developed controls (TD). However, the degree of E/I imbalance overlap between SZ and AT has not been extensively compared. Our main objectives were (1) to quantify group differences in the E/I ratio between TD, AT, and SZ, (2) to assess the potential of the E/I ratio for differential diagnosis, and (3) to verify the replicability of our findings in a second, independently-acquired dataset. For each participant, we computed the Hurst exponent (H), an indicator of the E/I ratio, from the timecourses of 53 independent components covering the entire brain. Using Random Forest (RF), we ran a classification analysis using the largerof the two datasets (exploratory dataset; 519 TD, 200 AT, 355 SZ) to determine which of the 53 H would yield the highest performance in classifying SZ and AT. Next, taking the ten most important H from the exploratory dataset and the clinical and phenotypic information collected in the replication dataset (55 TD, 30 AT, 39 SZ), we used RF to compare classification performance using five feature sets: (a) H only; (b) Positive and Negative Syndrome Scale (PANSS) and the Autism Diagnostic Observation Schedule (ADOS) only; (c) PANSS, ADOS, Bermond–Vorst Alexithymia Questionnaire (BVAQ), Empathy Quotient (EQ), and IQ; (d) H, PANSS and ADOS; (e) H, PANSS, ADOS, BVAQ, EQ and IQ. Classification performance using H only was higher in the exploratory dataset (AUC = 84%) compared to the replication dataset (AUC = 72%). In the replication dataset, the highest classification performance was obtained when combining H with PANSS, ADOS, BVAQ, EQ and IQ (i.e., model e; AUC = 83%).
1. Introduction
Autism (AT) and schizophrenia (SZ) have been recognized as independent diagnoses since the 1970s (APA, 2013). While AT is primarily characterized by distinct patterns of social communication, and repetitive behaviors, a SZ diagnosis consists of positive (e.g., hallucinations, delusions) and negative (e.g., social withdrawal) symptoms. They also show different onset trajectories (i.e., AT becomes apparent in early childhood, while SZ onset usually occurs in late adolescence or early adulthood), albeit a progression from AT to SZ cannot be excluded (Hsu 2022). However, the heterogeneity of both diagnostic categories (Benkarim et al., 2022; Segal et al., 2023) and their phenotypic overlap (Kästner et al., 2015) can hinder accurate psychiatric diagnosis. More precisely, AT and SZ co-occur in approximately 4% of cases (Lai et al., 2019), and share both social (Oliver et al., 2020) and sensory-motor functioning patterns (Du et al., 2021). Differential diagnosis is additionally hindered by the fact that the common clinical observational and interviews to assess diagnosis-specific symptoms, such as the Autism Diagnostic Observation Schedule (ADOS) and the Positive and Negative Syndrome Scale (PANSS), do not have good specificity (Bastiaansen et al., 2011; Trevisan et al., 2020). The overlap between SZ and AT has led to inquire about the underlying neural mechanisms, and whether these might aid in differential diagnosis (Horien et al., 2022). A recent international machine learning competition aimed to classify AT and typically developed (TD) showed that fMRI data can yield a classification accuracy of ∼ 80% (Traut et al., 2022). Classification accuracies based on structural or functional MRI data can be just as high when distinguishing SZ from TD (de Filippis et al-2022). However, the challenge is greater when tackling diagnosis overlap of heterogeneous nosological categories, such as AT and SZ, that share both genetic variants and neuroimaging patterns (Moreau et al., 2021).
One potential brain-based marker is the excitation/inhibition (E/I) ratio, that has been shown to be different in both AT and SZ compared to TD. This is based on the concerted activity of mostly glutamatergic (i.e., excitatory) and GABAergic (i.e., inhibitory) neurons. The former are the most numerous and project throughout the entire brain, while the latter are fewer and synapse locally (for a comprehensive account of excitatory and inhibitory activity balance in the human brain, see Tatti et al., 2017). A way to estimate the E/I ratio in humans based on non-invasive measures, such as resting state fMRI (rsfMRI), is by computing the Hurst (H) exponent from the acquired timeseries; an increased H values indicates a decreased E/I ratio, and vice-versa (for a review see Campbell & Weber, 2022).
In AT, Rubenstein & Merzenich (2003) first hypothesized that observed sensory processing patterns may result from an increased E/I ratio. Among the evidence they cite is the fact that parietal and cerebellar areas show ∼50% less glutamic acid decarboxylase (GAD), the enzyme that synthesizes the inhibitory neurotransmitter γ - aminobutyric acid (GABA) in AT compared to TD (Fatemi et al., 2002). Additionally, cortical minicolumns, which are functional units composed of GABAergic and glutamatergic neurons processing thalamic inputs, are smaller and more numerous in AT compared to TD (Casanova et al., 2002). A more recent summary specifically points towards the impact of reduced inhibition on cortical and hippocampal functioning in AT (Sohal & Rubenstein, 2019). Whether this E/I imbalance is mainly due to excessive excitatory activity, or deficient inhibitory activity, is not entirely clear (Dickinson, Jones & Milne, 2016; Ford & Crewther, 2016), but recent evidence points to the E/I imbalance in AT being caused by concomitant effect (see Canitano & Palumbi, 2021). Finally, direct evidence for the contribution of an E/I imbalance in AT comes from a study using bumetanide (i.e., a selective NKCC1 chloride importer antagonist, which decreases depolarizing GABA action, to reduce the E/I ratio) in a large cohort of AT children (Juarez-Martinez et al., 2023). These authors reported a decrease in repetitive behaviors following a 91-day bumetanide trial. Another direct link between sensory processing and GABAergic activity in AT has been provided by Huang et al. (2023). These authors used arbaclofen, a GABA type B receptor agonist, to show that auditory repetition suppression was negatively impacted by the drug in TD, but improved in AT.
In SZ, the E/I ratio has also been reported to be imbalanced compared to TD. Post-mortem and genetic evidence (Anticevic & Lisman, 2017), and computational modeling revealed that this imbalance causes hyperconnectivity in association brain areas (Yang et al., 2016). In addition, a review has shown a link between an E/I imbalance and aberrant internal sensory processing in SZ, such as hallucinations (Jardri et al., 2016). Finally, dopamine appears to be crucial in maintaining the E/I balance by modulating the excitability of glutamate and GABAergic neurons, thus contributing crucially to the E/I ration in SZ (Purves-Tyson et al., 2021). Dopaminergic activity, in concerted action with glutamatergic and GABAergic activity, when disrupted, can directly impact memory function and prefrontal connectivity in SZ (please see Winterer & Weinberger, 2004, for an extended account).
Supporting evidence in favor of an E/I imbalance in AT and SZ has been corroborated by animal models and post-mortem human studies, as well as by experimental, genetic, and magnetic resonance spectroscopy studies (MRS) (for a comprehensive review, please see Dickinson et al., 2016). It has been proposed that there are common neuronal pathways underlying E/I imbalance in both AT and SZ (Canitano & Pallagrosi, 2017; Foss-Feig et al., 2017), and that this relies in turn on shared genotype (Gao & Penzes, 2015). However, given the substantial heterogeneity in both AT and SZ (e.g., Segal et al., 2023), it is difficult to ascertain to which extent there is overlap between the two diagnoses.
In recent years, various machine learning approaches have been employed to improve differential diagnosis of mental disorders (Cho et al., 2019). Among these, interpretable models, such as Random Forest (RF), have become increasingly popular due to their transparency, as opposed to the traditional “black box” methods, such as support vector machine (Murdoch et al., 2019; Rudin, 2019). A trade-off between high interpretability and high accuracy is usually considered when opting for a particular classification approach, as highly accurate classifiers usually provide little if any transparency. Among interpretable approaches, RF stands out as an algorithm that can provide both high accuracy and reasonable interpretability (Bhattacharya, 2022). In addition, it can also be used for feature selection based on feature importance in an out-of-sample classification, as we did in the current project (Speiser et al. (2019).
To assess the role of clinical vs. E/I ratio data in classifying AT and SZ, we used two independent datasets and five distinct sets of features (i.e., classification models) comprising phenotypic and clinical assessment scores, the E/I ratio (as indexed by the H exponent) of multiple brain areas, or both. To quantify the E/I ratio based on rsfMRI timeseries, we computed the H of 53 predefined functional brain areas, based on the Neuromark templates (Du et al., 2020). The H exponent has been refined as a reliable computational approximation of synaptic E/I based on extensive physiological and in silico studies, (e.g., Trakoshis et al., 2020). For the clinical features we focused on core symptoms assessments: ADOS — measuring AT-related social and communication functioning, and PANSS — measuring SZ-related positive (e.g. delusions, hallucinations) and negative (e.g., social withdrawal) symptoms, and general psychopathology (e.g., attention deficits), an IQ estimate, and two social cognitive measures: EQ — measuring empathy, and BVAQ — measuring alexithymia, both of which have been shown to be different in AT and SZ compared to TD (van’t Wout et al., 2007; Warrier et al., 2018; Kinnaird, Stewart & Tchanturia, 2019). Two rsfMRI datasets were used to test replicability. The first dataset included data from publicly available repositories of either AT or SZ data, and included a relatively large dataset (Du et al., 2022). The second dataset was collected on-site, and included the above-mentioned clinical and phenotypic measures in a smaller dataset. We therefore only tested the out-of-sample replicability of the E/I-based classification model. We believe the use of both datasets holds important advantages. The replication dataset, while consisting of fewer participants, contains both rsfMRI and phenotypic and clinical data. In addition, the AT, SZ and TD in this dataset were collected in the same setting, which precludes the risk of site-related confounds. The larger, exploratory dataset was obtained by sourcing datasets from different online repositories, namely the Autism Brain Imaging Data Exchange (ABIDE I and II) for AT, and the Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP), the Center for Biomedical Research Excellence (COBRE), the Maryland Psychiatric Research Center (MPRC), and the Function Biomedical Informatics Research Network (FBIRN) for SZ. These datasets had been acquired at various sites with different scanning parameters, and clinical and phenotypical data was also not uniformly available across sites and clinical groups. While this prevented us for doing a full exploration of all the classification models that we were able to test using the smaller dataset, it allowed us to, 1) reduce model complexity and increase model stability (Schultz et al., 2022) in the smaller replication dataset by using only the most important H features (i.e., brain regions) from the larger exploratory dataset, and 2) demonstrate the replicability of the results of this model.
2. Methods
2.1. Participants
Two independent datasets were used in the current project. An exploratory dataset (Exploratory), based on several publicly-available online datasets (described below), and an internally-collected replication dataset (Replication).
For the Exploratory dataset, we analyzed 519 TD (362 males & 157 females; mean age = 28.49 ± 7.68), 200 AT (180 males & 20 females; mean age = 24.74 ± 6.6), and 355 SZ (245 males & 110 females; mean age = 30.91 ± 7.95) from the previously preprocessed and harmonized dataset used in Du et al. (2022). The participants in Du et al. (2022) had been selected from several data repositories: the AT from ABIDE1 and ABIDE2, and the SZ from BSNIP, COBRE, FBIRN, and MPRC_PK. From the dataset of Du et al. (2022), we chose a subset of participants that closely matched the age (18-35 y.o.) and intelligence quotient (IQ) of the Replication dataset. Note that an estimated IQ > 75 criterion was chosen because it was the inclusion threshold in the BSNIP dataset, where no IQ values were recorded. Because some of the data from the Replication dataset had been previously submitted to data repositories (e.g., ABIDE1) from which the dataset of Du et al. (2022) had been drawn, we ensured that no participants were included in both the Exploratory and Replication datasets, by excluding them from the Exploratory dataset.
For the Replication dataset, participants were recruited via the Olin Neuropsychiatry Research Center (ONRC) and the Yale University School of Medicine and underwent resting state fMRI (rsfMRI) scanning for the current study. We discarded participants with head motion > 10 mm, and those with incomplete phenotypic assessment information, resulting in a final dataset consisting of 55 TD (26 males & 29 females; mean age = 23.86 ± 3.65), 30 AT (25 males & 5 females, mean age = 22.33 ± 3.78), and 39 SZ (31 males & 8 females, mean age = 25.66 ± 3.53). The Replication dataset has been previously used by Hyatt et al. (2020, 2021) and Rabany et al. (2019), and the exclusion criteria we used here were the same: intellectual disability (i.e., estimated IQ < 80), neurological disorder (e.g., epilepsy), current drug use as indicated by pre-scanning interview and urine test, incompatibility with MRI safety measures (e.g., metal implants), and a history of psychiatric diagnoses in TD.
2.2. Clinical and phenotypical assessment
Clinical and phenotypic data was collected from the Replication dataset, but was not consistently available for the Exploratory dataset. The clinical assessment focused on assessing psychotic and autistic features. The severity of psychotic symptoms was assessed using the Positive and Negative Syndrome Scale (PANSS; Kay et al., 1987) in both the AT and SZ group. The PANSS scores can be interpreted along three subscales: positive symptoms, reflecting the severity of hallucinations and delusions; negative symptoms, reflecting the severity of blunted affect and anhedonia, and a general subscale quantifying other psychopathology such as poor attention and lack of insight. The ADOS, module 4 (Lord et al., 2000) was administered to all participants and the total score was used in this study to confirm or rule out an autism diagnosis and quantify autistic social communication characteristics. The Structured clinical interview for DSM-IV-TR Axis I disorders (SCID; First & Gibbon, 2004) to confirm a SZ diagnosis and the absence of any Axis I diagnoses in TD. Estimated IQ was calculated for the entire dataset using the Vocabulary and Block Design subtests of the Wechsler Scale of Adult Intelligence-III (WAIS-III; Wechsler, 1997; Sattler and Ryan, 1999). Additionally, all participants completed the Empathizing Quotient (EQ; Wakabayashi, Baron-Cohen & Wheelwright, 2006) which measures general empathy including both the affective and cognitive empathy components; the Bermond–Vorst Alexithymia Questionnaire (BVAQ; Vorst & Bermond, 2001), whose subscores are computed along five distinct dimensions: “verbalizing” reflects one’s propensity to talk about one’s feelings; “identifying” reflects the extent to which one is able to accurately define one’s emotional states; “analyzing” quantifies the extent to which one seeks to understand the reason for one’s emotions; “fantasizing” quantifies one’s tendency to day-dream, and “emotionalizing” reflects the extent to which a person is emotionally aroused by emotion-inducing events. Descriptive statistics and group comparisons of the clinical and phenotypic data are given in Table 1.
2.3. Imaging data acquisition and preprocessing
The preprocessing steps of the rsfMRI data, using the SPM toolbox, of the Exploratory dataset were extensively described in Du et al. (2022). In short, the first few volumes were discarded, then rigid-body motion correction and slice-timing correction were performed. Finally, the data were normalized, resampled to 3 mm3 isotropic voxels, and smoothed with a 6 mm FWHM Gaussian kernel. Prior to preprocessing, the effects of age, gender, site acquisition, and interactions between age and site, and gender and site were regressed from the gray matter volumes of each voxel, to ensure between-site harmonization; this procedure was detailed in Du et al. (2022).
For the Replication dataset, rsfMRI scans lasted 7.5 min and were collected using a Siemens Skyra 3 T scanner at the ONRC. Participants lay still, with eyes open, while fixating a centrally presented cross. Blood oxygenation level dependent (BOLD) signal was obtained with a T2*-weighted echo planar imaging (EPI) sequence: TR = 475 ms, TE = 30 ms, flip angle = 60 deg, 48 slices, multiband (8), interleaved slice order, 3 mm3 voxels. Neuroimaging data were preprocessed using SPM8 (www.fil.ion.ucl.ac.uk/spm/software/spm8/). Each dataset was realigned to the first T2* image using the INRIAlign toolbox (https://www-sop.inria.fr/epidaure/Collaborations/IRMf/INRIAlign.html), coregistered to their corresponding high signal-to-noise single-band reference image (sbREF; Glasser et al., 2013), spatially normalized to the Montreal Neurological Institute (MNI) standard template (Friston et al., 1995), and spatially smoothed (6 mm3). Finally, framewise displacement (FD) motion parameters were computed according to the FSL library algorithm (Jenkinson et al., 2012), and the mean FD value for each run was used as a covariate in group analyses.
2.4. Data analysis
For both the Exploratory and Replication dataset, we ran a fully automated independent component analysis (ICA) on the preprocessed fMRI data using the Group ICA for fMRI Toolbox (GIFT v4.0c; https://trendscenter.org/software/gift/; Calhoun et al., 2001) to define functional brain regions. The 53 replicable independent component (IC) templates from the NeuroMark pipeline (Du et al., 2020) were used to estimate participant-specific, spatially-independent components using a spatially-constrained ICA algorithm (Du et al., 2018). A complete list of the NeuroMark IC templates, arranged into seven functional domains, and peak MNI coordinates for each IC template are given in Table 2 and illustrated in Supplement Figure 1. After detrending and despiking using 3dDespike (AFNI, 1995), we extracted one Hurst exponent (H), an estimate of the E/I ratio, from each of the resulting 53 IC timecourses of each participant.
We estimated H using the nonfractal MATLAB toolbox (https://github.com/wonsang/nonfractal; You et al., 2012). Specifically, we used the function bfn_mfin_ml.m with the “filter” argument set to “haar” and the “ub” (upper bound) and “lb” (lower bound) arguments set to [1.5,10] and [-0.5,0], respectively, as previously recommended by Trakoshis et al. (2020).
All the other statistical analyses were performed with R 4.1.1. These included a one-way analysis of covariance (ANCOVA), Tuckey post-hoc, and two-sided two-sample Welch t tests.
Classification/Discriminant Analyses
A crucial aspect of classification algorithms in neuroimaging is sample size. It has been shown that larger sample sizes lead to more accurate estimates when brain-behavior relationships are investigated via traditional statistical approaches (Marek et al., 2022). While this seems to be generally true also for machine learning (Schultz et al., 2022), the relationship between sample size and classification accuracy does not appear to be entirely linear (e.g., Flint et al., 2021). For this reason, in the current project, we established the initial classification accuracy of our brain-based classification model in the largest of our two datasets.
First, using the Exploratory dataset, we classified the AT and SZ participants using a random forest (RF) algorithm, as implemented in the Interpretable Artificial Intelligence (IAI) toolbox (https://www.interpretable.ai/) and accessed through R 4.2.1 (R Core Team, 2018). The feature set consisted of the 53 H exponents of each participant. We ran 100 randomized sample splits and averaged the model performance metrics that we obtained for each of the splits to obtain a final classification performance index (i.e., area under the curve/AUC, sensitivity and specificity). From each group, with each new split, 50% of the data was randomly allocated to the test, and the rest to the train group. We used the AT sample as reference for calculating sensitivity and specificity. Following the classification, we selected the 10 ICs with the highest feature importance of H (figure/table x) as a simplified feature set for use in RF classification of the Replication dataset in the next step. Next, using the Replication dataset, we classified the AT and SZ participants using the same algorithm and toolbox, with 100 splits and 50% randomized allocation of data into the train and test groups. Five models were used for the RF classification in this case, containing the following features: (a) E/I model: the H only values of the 10 ICs ranked as most important by the RF classification in the Exploratory dataset; (b) symptoms only model: PANSS 3 factor scores and ADOS total scores; (c) symptoms and cognitive model: PANSS, ADOS, EQ, BVAQ, and IQ scores; (d) E/I and symptoms model: the 10 H from model (a) plus the PANSS and ADOS scores, and (e) E/I, symptoms and cognitive model: the 10 H from model (a) plus the PANSS, ADOS, EQ, BVAQ, and IQ scores. Similar to the previous step, we used the AT sample as reference to calculate sensitivity and specificity, and each final model performance index (i.e., AUC, sensitivity and specificity) was obtained by averaging the respective model performance metric over the 100 sample splits. Misclassification for each participant for each model was calculated as the ratio of number of times each participant was misclassified, to the total times s/he was allocated to a test set.
3. Results
3.1. Group differences in demographic, clinical and phenotypic assessment
In the Exploratory dataset, there were significant group differences differences on age (F (2) = 42.45, p < .001), and sex distribution (χ2 (2) = 35.156, p <.001).
Data for the Replication dataset, including statistical tests, are presented in Table 1. There were significant differences in estimated IQ, age, sex and FD, and therefore these parameters were used as covariates in further group analyses. Regarding core symptom assessments, the AT and SZ groups did not significantly differ in their social and communication skills, as indicated by the ADOS scores, but the PANSS scores on all three domains (i.e., positive and negative symptoms, and general psychopathology) were significantly elevated in SZ compared to AT. For social tasks, BVAQ-Fantasizing was significantly decreased in AT compared to SZ. Empathy was significantly decreased in both AT and SZ compared to TD, and in AT compared to SZ.
3.2. Group differences in H in the Exploratory dataset
The ANOVA results testing group differences in H values are given in Table 2. While all areas showed significant group differences, the largest effect size (i.e., η2 >= 0.06) was found for: the paracentral lobule (i.e., ICs no. 10 and 13; belonging to the Sensorimotor domain), the calcarine gyrus, middle occipital gyrus, fusiform gyrus, and inferior occipital gyrus (i.e., ICs no. 17, 18, 20, 23; in the Visual network), the insula, and the superior, middle and right inferior frontal gyrus (i.e., ICs no. 27, 30, 31, 35; in Cognitive control domain), the precuneus and anterior cingulate cortex (i.e., IC no. 43, 44, 47; Default Mode network), and one area of the Cerebellar domain (i.e., IC no. 50).
3.3. Group differences in H in the Replication dataset
The ANCOVA results showing group differences in H values from each component are given in Table 3. The areas most sensitive to overall group differences, after controlling for age, sex, IQ, and FD, were the left and right postcentral gyrus and paracentral lobule (i.e., IC no. 9, 11, and 10 respectively, part of the Sensorimotor network), and the calcarine gyrus, middle occipital gyrus, middle temporal gyrus, inferior occipital gyrus, and lingual gyrus (i.e., IC no. 17, 18, 19, 23 and 24 part of the Visual network). The supplementary motor area (i.e., IC no. 34), associated with the Cognitive Control domain, also yielded significant group differences.
3.4. Classification accuracy and feature importance in the Exploratory dataset
Using the complete 53 Hurst feature set, the classification was: AUC = 84%, Sensitivity = 65%, and Specificity = 83% (Figure 1 A; please note that sensitivity and specificity are calculated in relation to AT).
Next, we used the ten ICs with the highest Hurst feature importance from the RF classification of the Exploratory dataset to simplify the feature set for RF classification of the Replication dataset. We selected these ten Hurst ICs as follows: the precuneus and the anterior cingulate cortex (i.e., ICs 43, 44, 47, and 48) from the Default Mode Network, the superior frontal gyrus (i.e., IC 35) of the Cognitive Control domain, the paracentral lobule and the precentral gyrus (ICs 10 and 14), from the Visual domain, and the middle occipital gyrus, the inferior occipital gyrus, and the lingual gyrus (i.e., ICs 18, 23, and 24), from the Sensorimotor domain (Supplement Figure 2).
3.5. Classification results in the Replication dataset
In the Replication dataset, we obtained the following classification performance, outlined in Figure 1A: For model (a), using the reduced Hurst feature set (10 ICs), AUC = 72%, Sensitivity = 64%, and Specificity = 67%. For model (b), using the PANSS and ADOS as features, AUC = 78%, Sensitivity = 65%, and Specificity = 73%. For model (c), using the PANSS, ADOS, EQ, IQ, and BVAQ as features, AUC = 76%, Sensitivity = 62%, and Specificity = 76%. For model (d), using the reduced Hurst feature set in combination with PANSS and ADOS, AUC = 81%, Sensitivity = 67%, and Specificity = 76%. Finally, for model (e), using the reduced Hurst feature set in combination with PANSS, ADOS, EQ, IQ, and BVAQ, AUC = 83%, Sensitivity = 68%, and Specificity = 79%.
In the Replication dataset, for each classification instance, we inspected the scaled contribution of each feature included in each of the five models. In order to account for the different number of features included in each classification model, we scaled the raw feature importance values (Supplement Figure 3) by multiplying them by the number of features included in that respective model (Supplement Figure 4). As can be seen in Supplement Figure 4, the precentral gyrus (IC 14), the inferior occipital gyrus (IC 23), the lingual gyrus (IC 24), and the precuneus (IC 48) are the most important H features overall. The most important symptom scores are the ADOS total, followed by PANSS positive and negative. However, the most important feature overall was the estimated IQ, which prompted our concern that the IQ might bias our classification output and misleadingly appear as being more important than the Hurst features in model (e), given that our groups were not matched for IQ (Table 1). We therefore repeated the classification for models (c) and (e) and found that performance remained virtually unchanged (i.e., AUC = 0.75, Sensitivity = 0.6, Specificity = 0. 76, and AUC = 0.83, Sensitivity = 0.68, Specificity = 0. 81 respectively). Following the elimination of the IQ feature from this last run of models (c) and (e), feature importance also remained unchanged compared to that shown in Supplement Figure 4.
Finally, inspection of misclassified participants (Figure 1 B), showed that on average, SZ were increasingly more accurately classified as we moved from models (a) through (e), while AT were more accurately classified when Hurst features were included (i.e., models a, d, and e).
4. Discussion
The current paper assessed the feasibility of using the E/I ratio, as estimated by the H exponent, to distinguish between AT and SZ. We had two main goals: (1) perform an out-of-sample replication of the classification model using a reduced set of brain-based features (i.e., the H exponent from the ten most important brain components), and (2) compare the classification accuracy when different sets of clinical, phenotypic and imaging features were combined.
We first explored group differences between the three groups, in each independent dataset (Tables 2 and 3). The most consistent findings across both datasets reflected a significantly reduced H (i.e., increased E/I) in SZ compared to TD in most areas of the cerebellar domain, the bilateral postcentral gyrus and paracentral lobule, from the sensorimotor domain, and all but one area of the visual domain. Given that levels of glutamate and GABA have been reported to vary inconsistently across SZ, and moreover, to be impacted by both medication and disease progression (see, e.g., Foss-Feig et al., 2017, for an in-depth overview), we believe that the replicability of these group differences are all the more notable. What is more, this may suggest that despite the aforementioned medication and disease progression related neurotransmitter variations in SZ, a persistently elevated E/I ratio remains, predominantly in visual processing brain areas.
Group differences were less consistent between TD and AT, and the effects mostly displayed opposite directions in the two independent datasets. Namely, in the Exploratory dataset, significantly larger H (i.e., reduced E/I) were found in AT compared to TD in the precuneus, the cerebellum, the frontal cortex, the left inferior parietal lobule, some areas of the visual domain (i.e., calcarine, fusiform, inferior occipital, and middle temporal gyrus). In the Replication dataset, the supplementary motor area showed significantly reduced H (i.e., increased E/I) in AT compared to TD. We believe that one potential explanation that could account for this reduced replicability in this case could be the increased heterogeneity of AT which may simply not have been sufficiently captured given the limited size of the Replication dataset. Indeed, Foss-Feig et al. (2017) also suggest that E/I variations appear to be more heterogenous in AT than SZ. Another aspect that could contribute to these inconsistent findings might be due to sex differences, which we were unable to assess in the current project given insufficient female AT participants. Previously, Trakoshis et al. (2020) reported that the H of the ventromedial prefrontal cortex was significantly elevated in female AT, but significantly decreased in male AT compared to TD. Thus, a more extensive exploration of sex differences in larger samples of AT could give a clearer picture on this aspect.
A direct comparison of H in AT v. SZ revealed no significant differences in the Replication dataset, but showed a consistent pattern of significantly increased H (i.e., decreased E/I) in AT compared to SZ, in all but six of the 53 brain areas. These findings however require further validation given the heterogeneity in data collection between these two groups in the Exploratory dataset.
Classification performance of AT and SZ, using the H values of all 53 brain areas, in the Exploratory dataset, was very good (AUC = 84%), with especially high specificity for SZ (83%). A similar trend was maintained when using a reduced classification model comprising the ten most important H-based features in the Replication dataset (Figure 1.A), though classification performance was overall reduced, as was to be expected in an out-of-sample replication, and given the limited size of the Replication dataset. We further explored how augmenting the feature set in the Replication dataset could improve classification performance. We found that while using PANSS and ADOS resulted in increased classification performance compared to using H alone, performance increased further when combining H, PANSS and ADOS, and was the highest when using H, PANSS, ADOS, as well as IQ, EQ and BVAQ. This steady increase in performance appeared to be mostly driven by a steady decrease in misclassified SZ (Figure 1.B). Due to some concerns that unmatched IQ between AT and SZ might be biasing these results, we re-ran these classification models without IQ as a feature, and obtained extremely similar results. We may therefore conclude that: (1) classification performance based on E/I, as estimated by H, is substantial and replicable across independent datasets, and (2) classification performance is the highest when H and clinical assessment are combined, resulting in notably decreased misclassification instances in both AT and SZ (Figure 1.B). Interestingly, AT were more frequently misclassified when only clinical assessment data were used, compared to both H only and H combined with clinical assessment (Figure 1.B). Given that both AT and SZ in the Replication dataset were acquired using the same protocol and testing environment, we surmise that inherent AT heterogeneity could best explain these different trajectories in classification performance.
While the Exploratory dataset that we used in the first step of our classification analysis offered a satisfactory amount of data, it does have the limitation that these data were collected using a variety of protocols and acquisition sites. On the other hand, while the Replication dataset offered us the possibility to test different classification models, given that identical clinical assessment protocols were used for both AT and SZ, it offered a rather small number of participants. Nevertheless, we believe that especially given these limitations, the replicated results are noteworthy.
Another limitation which we were unable to address using the currently available datasets were sex differences. This is especially true for AT, given the historic bias which has resulted in currently available samples being overwhelmingly dominated by male participants. Finally, future systematic and longitudinal studies could further clarify how H may be influenced by age (of onset), disease progression and medication in SZ.
Data Availability
Most of the data in the current study has already been submitted to the NDAR public repository.
Supplement
Footnotes
We have added a second, much larger sample size, that enabled us to verify the replicability of our initial results. We also finally opted to using Random Forest instead of Optimal Classification Trees.