Abstract
Background Autism spectrum disorder (ASD) represents a panel of conditions that begin during the developmental period and result in impairments of personal, social, academic, or occupational functioning. Early diagnosis is directly related to a better prognosis. Unfortunately, the diagnosis of ASD requires a long and exhausting subjective process.
Objective To review the state of the art in automated autism diagnosis.
Methods In February 2022, we searched multiple databases and several sources of grey literature for eligible studies. We used an adapted version of the QUADAS-2 tool to assess the risk of bias in the studies. A brief report of the methods and results of each study is presented. Data were synthesized for each modality separately using the Split Component Synthesis (SCS) method. We assessed heterogeneity using the I2 statistic and evaluated publication bias using trim and fill tests combined with ln DOR. Confidence in cumulative evidence was evaluated using the GRADE approach for diagnostic studies.
Results We included 344 studies with 186020 participants (51129 are estimated to be unique) covering nine different modalities in this review, of which 232 reported sufficient data for meta-analysis. The area under the curve was in the range of 0.71-0.90 for all the modalities. The studies on EEG data provided the best accuracy, with the area under the curve ranging from 0.85 to 0.93.
Conclusions The literature is rife with bias and methodological/reporting flaws. Recommendations are provided for future research to provide better studies and fill in the current knowledge gaps.
Background
Rationale
Autism spectrum disorder (ASD) represents a panel of conditions that begin during the developmental period and result in impairments of personal, social, academic, or occupational functioning. It is currently estimated to affect about 1 in 44 children in the U.S. [1] and 1 in 100 children worldwide [2]. Furthermore, estimates show that about 2.2% of all adults in the U.S. have ASD [3]. ASD is associated with a high economic burden. The cost of caring for Americans with autism reached $268 billion in 2015 and is estimated to rise to $460 billion by 2025, as individuals with ASD require special healthcare, special education, and government assistance [4]. Individuals with ASD usually have a low quality of life as they have difficulties living independently and maintaining social relationships. They also have problems gaining employment: fewer than half of young adults with ASD maintain a job, a rate even lower than that of ex-convicts, who achieve a 75% employment rate [5]. Furthermore, individuals with ASD experience substantially higher rates of mood disorders than the general population, contributing to lower quality of life and higher mortality through suicide [6].
ASD is associated with verbal and nonverbal communication, social interactions, and behavioral complications [7]. Clinical manifestations typically occur in the 2nd–3rd year of life and persist through adulthood [8]. Evidence suggests that early diagnosis is directly related to a better prognosis [9]. Different health systems follow different clinical pathways. In some countries, children are screened using pre-diagnosis screening methods, and potential cases are referred to specialized services by general practitioners [7]. In such cases, a lack of training and technical knowledge may lead to a late diagnosis. In some other countries, professionals directly approach the patient and their parents, often without prior screening [10], which usually happens when the patient develops significant signs and symptoms.
The diagnosis of ASD requires a long and exhausting subjective process. Clinicians must conduct a clinical assessment of the patient’s developmental age based on a variety of factors (e.g., behavior excesses, communication, self-care, social skills) [7] or a combination of parent/caregiver information, ideally assessed by standardized instrument, with results from standardized direct observation and other additional, independent information from school teachers, partner or public authorities [11]. ASD cannot yet be diagnosed with objectively specific and sensitive biomarkers.
Machine learning provides novel opportunities for human behavior research and clinical translation. With recent advances in this area, complex, sophisticated predictive models can be developed quickly from labeled data. Machine learning can be applied to data from individuals with ASD to offer new diagnostic biomarkers. ASD diagnosis can be framed as a classification problem in machine learning, in which a classifier is built from the data of individuals to classify participants as ASD or typically developed (TD).
Previously, there have been other reviews [12]–[16] with a similar aim, but we believe all those reviews had severe limitations and flaws. First, almost none of the previous reviews investigated all the modalities used for this subject. Most of them focused on brain imaging methods, with hardly any mention of other existing methods such as facial recognition models, electroencephalograms, and eye-tracking systems. Also, their search strategies were seriously flawed. Two reviews [12], [16] used methodological search filters, which are not recommended for diagnostic test accuracy reviews because appropriate subject headings are lacking and database indexers use the available ones inconsistently [17]. Three other reviews [13]–[15] did not report their search strategies or reported them in insufficient detail.
When comparing the number of studies we found with the numbers in those reviews, it is evident that the flaws in their searches resulted in missing a substantial amount of the available evidence. Additionally, those reviews either did not assess the risk of bias in the included studies or did so ambiguously. Another problem with some of the previous reviews was mixing the diagnostic accuracy results of different assessment modalities into one meta-analysis. This is a classic example of mixing apples and oranges in a meta-analysis, as was stated by Sharpe in 1997 [18]. Other limitations included the brief presentation of data from the included studies, ignoring heterogeneity and not investigating its possible sources, not evaluating the risk of publication bias, and not assessing the overall confidence in the cumulative evidence. Taking all these factors into consideration, this study aims to present a comprehensive, in-depth systematic review (with meta-analyses when feasible) of the specifications and diagnostic accuracy of the current state-of-the-art automated ASD diagnostic or classification models to help future researchers in their efforts and reveal the current gaps in knowledge, pitfalls, and promises.
Clinical role of the index test
In April 2018, the Food and Drug Administration (FDA) approved the first artificial intelligence (AI)-based system for clinical use. The system, named IDx-DR, was an AI model with a retinal camera device that was developed to detect diabetic retinopathy in adults who have diabetes [19]. Since then, the FDA has evaluated dozens of AI-based health-related models, with at least nine achieving a de novo pathway clearance and three getting pre-market approval [20], [21]. Nevertheless, most algorithms are not yet refined enough to substitute for a clinician’s judgment. It would be preferable for patients and clinicians alike if they had a simple explanation of how a classification algorithm determines a particular label for a case [22].
Thus, when these models are used in a clinical setting, the problem is not only to provide the correct decision; one should also be able to describe how the model reached that decision. Unfortunately, the decision-making process of most of these algorithms happens in a “black box” manner. There are few justifiable reasons behind those decisions, at least by the current state of clinical knowledge. Other challenges for delivering clinical impact with AI systems include legal issues, logistical difficulties in implementation, and sociocultural considerations [23]. Thus, at the moment, we propose these models should only be considered a low-cost, accessible, non-invasive method for screening children to identify those at risk of developing ASD. If they provide sufficient sensitivity, they may be used as a screening test. Generally, screening tests should provide reasonable sensitivities, since test-negatives will not be tested by more specific tests, but they may have lower specificities. However, if they provide sufficient specificity, they might be used as add-on tests to ascertain the diagnosis and differentiate it from other conditions with similar symptoms.
Summary of artificial intelligence model pipelines
Although studies on the diagnosis or classification of clinical outcomes using AI might have developmental details different from one study to another, they usually follow a similar pipeline. This pipeline generally includes (a) data preprocessing and wrangling, (b) feature engineering, (c) feature scaling and selection, and (d) model training and evaluation. This pipeline is presented in Figure 1.
Data preprocessing and wrangling
In health-related tasks, datasets can consist of a variety of modality types. These data usually also contain different sources of nuisance variation, which can interfere with optimal model training. Thus, data cleaning is typically necessary to ensure or enhance performance. Although modern deep learning models do not require ideal datasets, preprocessing and wrangling techniques can substantially increase a model’s performance. Preprocessing techniques usually take advantage of statistical methods to improve data quality in different ways. For instance, in the case of neuroimaging data, slice-timing correction, head-motion correction, and susceptibility distortion correction address particular artifacts, while co-registration and spatial normalization are concerned with signal localization [24].
Feature engineering
While data preprocessing includes cleaning and preparing the raw data, feature engineering creates the actual features used for model training. In classification problems, a feature refers to any measurable property extracted from raw data that carries information about class membership. For instance, measures such as gray matter volume, white matter volume, or mean diffusivity may be used as features in neuroimaging data. The feature engineering process may include various techniques, such as numerical transformations, category encoding, clustering, and principal component analysis [25].
Feature scaling and selection
Feature scaling is the process of normalizing the range of independent variables or features. In contrast, feature selection is the selection of a subset of relevant features for use in model development. Together, these steps shorten training time, improve data compatibility with a learning model class, and benefit the model by avoiding the curse of dimensionality [26]. Feature selection methods are divided into three categories: filter methods such as the t-test, wrapper methods such as recursive cluster elimination, and embedded methods such as the least absolute shrinkage and selection operator (Lasso).
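As an illustration of the scaling and filter-selection steps described above, the following is a minimal sketch using only the Python standard library. The function names (`zscore_columns`, `welch_t`, `select_top_k`), the choice of z-score scaling and Welch's t statistic, and the toy data are assumptions made for this example; they are not taken from any of the included studies.

```python
import math
from statistics import mean, stdev


def zscore_columns(rows):
    """Scale every feature (column) to zero mean and unit sample standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, stats)]
        for row in rows
    ]


def welch_t(a, b):
    """Welch's t statistic for two independent samples (no p-value is computed)."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))


def select_top_k(rows, labels, k):
    """Filter-style selection: rank features by |t| between the two classes, keep k."""
    scores = []
    for j in range(len(rows[0])):
        group1 = [r[j] for r, y in zip(rows, labels) if y == 1]
        group0 = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((abs(welch_t(group1, group0)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


# Hypothetical toy data: 6 participants, 3 features; label 1 = ASD, 0 = TD.
X = [
    [5.1, 0.9, 10.0],
    [4.9, 1.1, 12.0],
    [5.3, 1.0, 11.0],
    [2.0, 1.0, 10.5],
    [2.2, 0.8, 11.5],
    [1.9, 1.2, 12.5],
]
y = [1, 1, 1, 0, 0, 0]

X_scaled = zscore_columns(X)
selected = select_top_k(X_scaled, y, k=2)
```

In practice, both the scaling parameters and the selected feature subset should be derived from the training data only, so that information from the test data does not leak into the model.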
Model training and evaluation
The training process is the optimization of model parameters based on the training dataset to find better representations of the corresponding class. This process sometimes leads to overfitting. Overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data and may fail to fit additional data or predict future observations reliably” [27]. To test for possible overfitting and evaluate the model’s generalizability, it should be further validated on a dataset different than the one used for the training. Different validation methods exist, such as k-fold cross-validation (CV), leave-one-out cross-validation (LOOCV), and external validation methods [28].
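The internal validation schemes mentioned above differ only in how the samples are partitioned. The sketch below shows one way to generate the partitions in plain Python; the function names are hypothetical, and the splits are deliberately unshuffled and unstratified to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold CV over range(n_samples)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def loocv_indices(n_samples):
    """Leave-one-out CV is k-fold CV with k equal to the number of samples."""
    return k_fold_indices(n_samples, n_samples)


folds = list(k_fold_indices(10, 5))   # five folds, two test samples each
loo = list(loocv_indices(4))          # four folds, one test sample each
```

Every sample appears in exactly one test fold, so each prediction is made by a model that never saw that sample during training. External validation goes one step further by testing on data from a different source altogether.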
Objectives
To review the current state-of-the-art automated ASD diagnostic or classification models and present the gaps in knowledge, pitfalls, and promises.
Methods
The design and methods used for this review complied with the Centre for Reviews and Dissemination’s (CRD’s) Guidance for Undertaking Reviews in Healthcare [29] and are reported in line with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA) [30].
Eligibility criteria
Population: individuals with autism spectrum disorder (ASD) regardless of age, sex, and ethnicity
Index test: AI models and systems
Target condition:
- Autism spectrum disorder (ASD) as defined by the Diagnostic and Statistical Manual of Mental Disorders-IV (DSM-IV), DSM-V, and International Statistical Classification of Diseases and Related Health Problems 11 (ICD-11)
- Childhood autism, atypical autism, and Asperger syndrome as defined by ICD-10
- Autism as defined by Autism Diagnostic Observation Schedule (ADOS), Autism Diagnostic Interview-Revised (ADI-R), The Childhood Autism Rating Scale (CARS), or Gilliam Autism Rating Scale (GARS)
- We did not consider participants at risk for autism as eligible for this review.
Comparator: typically developed (TD) individuals
Reference standard: diagnosis made by a trained physician, psychiatrist, or other eligible experts based on diagnostic criteria mentioned above
Study design: cross-sectional design, including both single-gate (cohort type) and two-gates (case-control type) designs
We assessed the eligibility of study reports irrespective of their language, publication status, and publication date. We did not include studies that used AI models for the questionnaire-based diagnosis of ASD because those studies mostly use machine and deep learning methods to tailor available diagnostic questionnaires, which is outside the scope of this review.
Information sources
In February 2022, we searched Embase, MEDLINE (Ovid), APA PsycINFO (Ovid), IEEE Xplore, Scopus, and Web of Science Core Collection for eligible studies. Additionally, we searched gray literature using OpenGrey, Center for Research Libraries Online Catalogue (CRL), and Open Access Theses and Dissertations (OATD) to find any unpublished potentially relevant studies. We also carried out a ‘snowball’ search through citation searching (forward and backward citation tracking) using Scopus on all included studies in this review to identify further eligible studies or study reports. As a final step, we checked the references of the reviews with a similar subject identified through our search to see if other potentially eligible studies existed.
Our search strategy is reported in line with the PRISMA search extension [31]. No restriction or search filter was used. Free-text terms and keywords were identified using the MeSH Browser [32] website and PubMed PubReMiner [33] word frequency analysis tool. The search strategy was reviewed using the Peer Review of Electronic Search Strategies (PRESS) [34] guideline. The Multimedia Appendix 1: Search strategy file presents a detailed report of the search strategy.
Study selection
Citations identified from the literature searches and reference list checking were imported to EndNote [35], a citation management software. Duplicates were identified and removed using EndNote’s de-duplication tool. Next, the remaining records were imported into Rayyan QCRI [36], a web-based application that employs natural language processing, artificial intelligence, and machine learning technologies to speed up the screening of titles and abstracts of records. Another step of de-duplication was conducted in Rayyan. In this process, duplicates were identified using Rayyan’s automatic de-duplication feature, with the similarity threshold set to 0.85, then manually re-checked and removed. Four reviewers in two teams then independently reviewed the titles and abstracts of the first 50 records. Inter-rater reliability, calculated using Cohen’s kappa, was 0.87, interpreted as almost perfect agreement. The same two teams then independently screened the titles and abstracts of the retrieved records. Two other reviewers were consulted to make the final decision in cases of disagreement. Afterward, the full text of all potentially eligible records was retrieved. Records from the same study were linked to avoid including data from the same study more than once. Next, the same two teams independently screened full-text studies for inclusion. A study was included when both teams assessed it as satisfying the inclusion criteria. In cases of disagreement, two other reviewers were consulted.
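For readers unfamiliar with the agreement statistic used during screening, the following is a minimal sketch of Cohen's kappa for two raters. Treating each team's consensus decision as a single rater, as well as the function name and the example decisions, are assumptions made for illustration only.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of records")
    n = len(rater_a)
    # Observed proportion of records on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for 10 records ("I" = include, "E" = exclude).
team_1 = ["I", "I", "I", "E", "E", "E", "E", "E", "E", "E"]
team_2 = ["I", "I", "E", "E", "E", "E", "E", "E", "E", "E"]
kappa = cohens_kappa(team_1, team_2)
```

In this toy example the raters agree on 9 of 10 records (observed agreement 0.90), chance agreement is 0.62, and kappa is therefore (0.90 - 0.62) / (1 - 0.62), approximately 0.74.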
Data collection process
A data extraction form was developed. The form was pilot-tested by four reviewers using five randomly selected studies. Inter-rater reliability was calculated again using Cohen’s kappa to be 0.93 (almost perfect agreement). After holding discussions to resolve discrepancies, the same reviewers used the form independently to extract data from eligible studies. Extracted data were compared, with any differences being resolved through further discussion. We tried to contact the study authors in cases of missing or unclear data. Extracted data included:
● Study identifiers and design: study name, location, dates, design (single-gate; two-gates), and funding sources.
● Characteristics of dataset: public availability of the dataset, inclusion and exclusion criteria, sample size, gender, and age.
● Performance and validation methods: a qualitative description of pre-processing methods, feature extraction and selection methods, modality parameters, a qualitative description of the AI algorithms, validation methods, and the reference standard used.
● Model evaluation metrics.
In cases where one study evaluated several algorithms and presented different results, we only extracted the data for the model with the best accuracy.
Risk of bias
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) [37] tool addresses four domains: 1) Patient selection; 2) Index test(s); 3) Reference standard; 4) Flow and timing. Each domain is assessed regarding the risk of bias and concerns regarding applicability to the review question.
Since this review aims to evaluate the diagnostic accuracy of artificial intelligence algorithms, some revisions to the QUADAS-2 were deemed necessary. Thus, some modifications were made through discussion and consensus among the authors. The patient selection domain in such studies mostly concerns the datasets used for training and testing the model. This domain’s signaling questions and assessments were revised to cover those issues. The index test domain was divided into two sections, one evaluating the potential biases in the modality used and the other assessing the AI algorithms in the study. For question eight (validation of the algorithm), judgments were based on the results of a previous study [38], which evaluated the validation methods of studies aiming at automated autism diagnosis and found that for small sample sizes (<60), leave-one-out cross-validation (LOOCV) is the least biased internal validation method.
In contrast, for studies with a sample size of >60, k-fold cross-validation is the optimal choice. Also, external validation and data-split methods are considered at low risk of validation method bias regardless of the sample size. For the reference standard, we assessed whether participants were diagnosed based on valid criteria or by an expert. Finally, the flow and timing domain was judged not to apply to machine learning methodologies. No overall summary score was calculated. A detailed summary of the adapted version of the tool is presented in Table 1.
The tool was piloted before use. Four reviewers independently evaluated 10% of the included studies to calculate inter-rater reliability using Cohen’s kappa, which was 0.78. After re-training and holding discussions to resolve discrepancies, the same four reviewers evaluated all the included studies independently. They recorded supporting information and justifications for judgments of risk of bias for each domain (low, unclear, high, or not applicable). We discussed disagreements and resolved discrepancies through discussion.
Diagnostic accuracy measures
We used the data from the two-by-two tables to calculate true-positives (TP), true-negatives (TN), false-positives (FP), false-negatives (FN), sensitivity (Se), specificity (Sp), and accuracy for the best-performing algorithm in each study. When any of the TP, FP, TN, or FN values was 0, 0.5 was added to prevent zero cell counts. Our primary accuracy measure for the meta-analyses was the diagnostic odds ratio (DOR):
$$\mathrm{DOR} = \frac{TP \times TN}{FP \times FN} = \frac{Se / (1 - Se)}{(1 - Sp) / Sp}$$
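The measures above can be computed from a single two-by-two table as in the following sketch. The function name is hypothetical, and applying the 0.5 correction to all four cells whenever any cell is zero is an assumption; the text does not state whether the correction was applied to every cell or only to the zero cell.

```python
import math


def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy, and DOR from a two-by-two table."""
    # Assumption: if any cell is zero, add 0.5 to all four cells
    # so that the diagnostic odds ratio stays finite.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (cell + 0.5 for cell in (tp, fp, fn, tn))
    se = tp / (tp + fn)                    # proportion of ASD cases detected
    sp = tn / (tn + fp)                    # proportion of TD cases correctly ruled out
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    dor = (tp * tn) / (fp * fn)            # diagnostic odds ratio
    return {"Se": se, "Sp": sp, "accuracy": accuracy, "DOR": dor, "lnDOR": math.log(dor)}


# Hypothetical study: 50 ASD and 50 TD participants.
metrics = diagnostic_metrics(tp=40, fp=10, fn=10, tn=40)

# Hypothetical study with a zero cell (no false positives).
metrics_zero_cell = diagnostic_metrics(tp=20, fp=0, fn=5, tn=25)
```

For the first table, Se, Sp, and accuracy are all 0.80 and the DOR is (40 × 40) / (10 × 10) = 16.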
Synthesis of results
We summarized diagnostic test accuracy by creating a two-by-two table for the best-performing algorithm in each study. When two-by-two tables were unavailable, we tried to calculate the metrics by putting the available data in equations.
Meta-analyses were conducted using R version 4 [39] and the ‘SCSmeta’ function [40]. Hierarchical bivariate random-effects models are commonly used for the meta-analysis of diagnostic test accuracy studies. These models can consider both within- and between-study variability and threshold effects [41]. An issue with bivariate models is that the inputs into the model are the study-specific pairs of Se and Sp. These pairs can demonstrate heterogeneity across studies due to systematic differences, implicit dissimilarity in test thresholds, or both. Another issue is that some of the between-study variability could be due to some degree of threshold variability, and while the bivariate approach takes the negative correlation between Se and Sp into account when modeling Se/Sp pairs, such a correlation may also be artifactual because of systematic error (study biases), spectrum effects, or implicit variations in thresholds when tests are interpreted differently. We therefore used the more robust Split Component Synthesis (SCS) method [40]. The SCS estimator for the DOR, Se, and Sp is less biased and has a smaller mean squared error than the bivariate model estimator. For this purpose, we first estimated the logit Se, logit Sp, and ln DOR for each study. We estimated the summary ln DOR and its variance using the IVhet model [42]. Then we generated a centered ln DOR for each study. We fitted two ordinary least squares regression models: one for the logit Se (dependent variable) on the centered ln DOR (independent variable) and the other for the logit Sp (dependent variable) on the centered ln DOR (independent variable). The intercept in each linear regression model corresponds to the summary logit Se and summary logit Sp, respectively.
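The steps just described can be sketched in a few lines of Python. This is an illustrative reading of the text, not the ‘SCSmeta’ implementation: the function names, the DerSimonian-Laird estimate of between-study variance inside the IVhet variance, the large-sample variance of ln DOR (sum of reciprocal cell counts), and the toy data are all assumptions, and confidence intervals are omitted.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


def ols(x, y):
    """Ordinary least squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def ivhet(effects, variances):
    """Inverse-variance weighted mean with a heterogeneity-inflated variance."""
    w = [1 / v for v in variances]
    total = sum(w)
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / total
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    c = total - sum(wi ** 2 for wi in w) / total
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)   # assumed DerSimonian-Laird
    var = sum((wi / total) ** 2 * (vi + tau2) for wi, vi in zip(w, variances))
    return pooled, var


def scs_summary(tables):
    """tables: list of (TP, FP, FN, TN) tuples with no zero cells."""
    l_se, l_sp, ln_dor, var = [], [], [], []
    for tp, fp, fn, tn in tables:
        l_se.append(logit(tp / (tp + fn)))
        l_sp.append(logit(tn / (tn + fp)))
        ln_dor.append(math.log((tp * tn) / (fp * fn)))
        var.append(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    pooled, pooled_var = ivhet(ln_dor, var)
    centered = [d - pooled for d in ln_dor]
    intercept_se, _ = ols(centered, l_se)   # summary logit Se
    intercept_sp, _ = ols(centered, l_sp)   # summary logit Sp
    return {
        "lnDOR": pooled,
        "lnDOR_var": pooled_var,
        "Se": inv_logit(intercept_se),
        "Sp": inv_logit(intercept_sp),
    }


# Hypothetical two-by-two tables (TP, FP, FN, TN) for four studies.
studies = [(45, 8, 5, 42), (30, 12, 10, 28), (18, 3, 2, 17), (60, 20, 15, 55)]
summary = scs_summary(studies)
```

One consequence of this construction is that the summary logit Se and summary logit Sp add up exactly to the summary ln DOR, because each study's logit Se and logit Sp sum to its ln DOR and the two regressions share the same centered predictor.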
Summary Receiver Operating Characteristics (SROC) curves were generated for each modality based on parameter estimates extracted from the SCS model. The SROC curves were specified by the summary Se and Sp intersection point, its 95% CIs, and the confidence limits of the Se and 1-Sp. Individual study Se/Sp pairs were indicated on the plot as circles with size proportional to the inverse of the study’s variance in DOR. The area under the curve (AUC) was calculated using the following formula:
We quantified heterogeneity with I2 and τ2 statistics. I2 quantifies inconsistency across studies to assess the impact of heterogeneity on the meta-analysis [43]. I2 is interpreted as follows: 0% to 40%: might not be important; 30% to 60%: may represent moderate heterogeneity; 50% to 90%: may represent substantial heterogeneity; 75% to 100%: represents considerable heterogeneity.
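As a worked illustration of these two statistics, the sketch below computes Cochran's Q, I2, and τ2 from study-level effects (for example, ln DOR values) and their variances. The function name and the use of the DerSimonian-Laird estimator for τ2 are assumptions; the text does not specify which τ2 estimator was used.

```python
def heterogeneity(effects, variances):
    """Return (Q, I2 in percent, tau2) for study effects and their variances."""
    w = [1 / v for v in variances]
    total = sum(w)
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / total
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    # I2: share of total variation attributable to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Assumed DerSimonian-Laird moment estimator of between-study variance.
    c = total - sum(wi ** 2 for wi in w) / total
    tau2 = max(0.0, (q - df) / c)
    return q, i2, tau2


# Hypothetical ln DOR values from three studies with equal variances.
q, i2, tau2 = heterogeneity([1.0, 2.0, 3.0], [0.25, 0.25, 0.25])
```

Here every study has weight 4, the pooled estimate is 2.0, Q = 8 on 2 degrees of freedom, I2 = (8 - 2) / 8 = 75%, and τ2 = 6 / 8 = 0.75.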
To investigate the possible role of overfitting on the accuracy results of the included studies, we performed a meta-regression analysis. In this analysis, the accuracy of each study was plotted against its sample size. We also conducted sensitivity analyses on modalities with at least five studies at low risk of bias in all domains to evaluate the robustness of our results.
Reporting bias assessment
The impact of publication bias has been understudied in the context of DTA systematic reviews. One possible reason is that the statistical significance of treatment efficacy is well-defined in clinical trials, whereas statistical significance is less intuitively defined for measures of diagnostic test accuracy. Another reason is that the odds ratio is expected to be large in diagnostic studies. Applying tests for funnel plot asymmetry in such circumstances will likely result in publication bias being incorrectly indicated by the test far too often [44]. Nevertheless, empirical evidence has demonstrated that smaller studies report greater diagnostic test accuracy; thus, ignoring the impact of publication bias when conducting a meta-analysis can lead to overestimating the diagnostic test accuracy [45].
A contour-enhanced funnel plot was designed to assess publication bias in our review, with ln DOR on the x-axis and the inverse of the standard error on the y-axis. To check for potential unpublished studies on the plot, we used the trim and fill method [46] combined with ln DOR, which a simulation study showed to be superior to other funnel plot asymmetry tests for DTA reviews [47]. We also performed a statistical test for funnel plot asymmetry using the Deeks method [48].
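The regression underlying the Deeks asymmetry test can be sketched as follows: ln DOR is regressed on the inverse square root of the effective sample size (ESS), weighted by ESS, and a slope far from zero suggests small-study effects. The function names and toy data are hypothetical, and the sketch returns only the fitted coefficients; the significance test on the slope is omitted.

```python
import math


def wls(x, y, w):
    """Weighted least squares fit of y on x; returns (intercept, slope)."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def deeks_regression(tables):
    """tables: list of (TP, FP, FN, TN) tuples with no zero cells."""
    x, y, w = [], [], []
    for tp, fp, fn, tn in tables:
        n_asd, n_td = tp + fn, fp + tn
        ess = 4 * n_asd * n_td / (n_asd + n_td)   # effective sample size
        x.append(1 / math.sqrt(ess))
        y.append(math.log((tp * tn) / (fp * fn)))
        w.append(ess)
    return wls(x, y, w)


# Hypothetical two-by-two tables (TP, FP, FN, TN) for four studies.
studies = [(45, 8, 5, 42), (30, 12, 10, 28), (18, 3, 2, 17), (60, 20, 15, 55)]
intercept, slope = deeks_regression(studies)
```

A positive slope in this parameterization means that studies with smaller effective sample sizes tend to report larger ln DOR values.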
Confidence in cumulative evidence
The confidence in cumulative evidence for each synthesis was assessed using the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) approach for diagnostic tests and strategies [49], which takes into account six considerations: study design, risk of bias, inconsistency, indirectness, imprecision, and publication bias. Factors designed for upgrading the evidence (dose effect, plausible bias, large effect, and confounding) are not yet well developed, and thus we did not consider them in our review. The only exception is the ‘large effect’ domain, which was considered in this review. We decided that a substantial likelihood of disease associated with test results (sensitivity and specificity more than 0.85) would increase the quality of evidence. Four reviewers rated the certainty of the evidence for the outcome as high, moderate, low, or very low. We resolved any discrepancies by consensus.
Results
Overview of results
This research aimed to review the current state-of-the-art automated ASD diagnostic or classification models and present the gaps in knowledge, pitfalls, and promises. First, we have introduced a summary of the study selection process, followed by the characteristics of the included studies, including sample sizes, participants’ age, the modalities used, the sources of data (datasets), the classifiers implemented, and the evaluation metrics utilized by the included studies. Next, we have presented the results for assessing the risk of bias in the studies. It is immediately followed by the results of the syntheses, which include synthesis based on modality, synthesis based on feature set, meta-regression analysis, and sensitivity analysis. Finally, we have evaluated the reporting biases in the included studies and confidence in the cumulative evidence. A summary overview of the results, alongside a chart of the flow of the study, is presented in Figure 2.
Study selection
For a detailed summary of the flow of the studies, see the PRISMA flow diagram presented in Figure 2. For this review, we identified 12567 records in our primary search. After removing duplicates, we screened the titles and abstracts of 8245 records. Another 6748 records were excluded at that stage, and 1497 records remained for full-text assessment. We excluded 1154 studies following the evaluation of the full text of the records. In the end, we included 344 studies [50]–[393] with 186020 participants (following the removal of overlapping participants in public databases, 51129 are estimated to be unique) in this review. Three studies reported results for two different modalities separately, resulting in 347 results being included in this review.
Study characteristics
A summary of the characteristics of the contributing studies is provided in Multimedia Appendix 2: Characteristics of included studies. It includes the parcellation atlas used by each study (if applicable), the feature selection methods used, the sample sizes, the validation method for the algorithm, the classifier and the algorithm specifications, the evaluation metrics of the algorithm for diagnosing or classifying ASD individuals, and the potential sources of bias in each study. Readers interested in designing new AI algorithms for diagnosing or classifying ASD individuals are encouraged to consult these data, as they give an overview of the methods used so far and their effectiveness, which can guide the next steps required to achieve better results and help avoid duplicate efforts.
Sample sizes
The median sample size was 152.5 participants (interquartile range 58–871). The smallest sample size was 12, and the largest was 17614.
Age of participants
The age group of participants overlapped in most of the included studies. Thirteen studies included toddlers (<2 years old), 143 included children and adolescents (2-18 years old), 30 included adults (>18 years old), and 139 included participants of all ages. Also, 23 studies did not report the age of the participants.
Modality
Our investigations revealed that 155 studies used the data from resting-state functional magnetic resonance imaging (rs-fMRI), 53 used structural MRI (S-MRI), 48 used electroencephalography (EEG), 21 used eye-tracking, 14 used task-based fMRI (T-fMRI), 11 used diffusion-weighted imaging/diffusion tensor imaging (DWI/DTI), and 10 used facial recognition. Also, 24 studies used data from more than one modality, and 14 used other miscellaneous modalities (functional near-infrared spectroscopy, kinematic features, response to name, video analysis, photo shooting, vocal analysis, and positron emission tomography).
Dataset
We observed that 187 studies used the data stored in Autism Brain Imaging Data Exchange (ABIDE) initiative [394], [395] (a dataset that has aggregated functional and structural brain imaging data of ASD participants collected from laboratories around the world), 15 used the data from the National Database for Autism Research (NDAR) [396] (a dataset with data at all levels of biological and behavioral organization (molecules, genes, neural tissue, behavioral, social and environmental interactions) of ASD participants), and 141 studies used other sources.
Classifier
Our investigations revealed that 124 studies used a support vector machine (SVM) classifier, 48 used convolutional neural networks (CNN), 24 used fully-connected neural networks (FCNN), 17 used graph neural networks (GNN), 15 used logistic regression (LR), 14 used a combination of an autoencoder and softmax layer (AE + Softmax), and ten used a k-nearest neighbors model (KNN). Also, 50 studies used ensemble algorithms, while 48 used other miscellaneous models. The choice of classifier usually depends heavily on the data structure. As a result, we sought to identify the type of classifiers used for each modality. See Figure 3 for the results.
Most of the included modalities were classified using SVMs, as they are easy-to-use and powerful conventional classifiers. The only exceptions were the facial recognition modality (which almost entirely used CNNs) and the T-fMRI modality (which mainly used CNNs).
Evaluation metrics
The median sensitivity was 80.8% (interquartile range 71.2%-89.7%). The lowest sensitivity was 24.8%, and the highest was 100%. Unfortunately, 116 studies did not report the sensitivity.
The median specificity was 80.9% (interquartile range 71.0%-90.6%). The lowest specificity was 33.0%, and the highest was 100%. Unfortunately, 117 studies did not report the specificity.
The median accuracy was 80.0% (interquartile range 70.9%-90.0%). The lowest accuracy was 50.0%, and the highest accuracy was 100%.
Risk of bias
Figure 4 shows the judgments for risk of bias and concerns regarding applicability for each domain across the included studies.
Patient Selection
For this domain, 174 studies (51%) were judged to be at high risk of bias, while 73 (21%) were at unclear risk. Only 96 studies (28%) were considered to be at low risk of bias in patient selection. The three main issues with studies at high/unclear risk were:
Unclear data source and/or participants’ characteristics (age, sex, etc.)
ASD and TD groups not being matched for age or sex
Unclear inclusion/exclusion criteria
Index Test - Modality
For this domain, 47 studies (14%) were judged to be at high risk of bias, while 16 studies (5%) were considered to be at unclear risk. Also, this domain was not applicable (N/A) for 32 studies (9%). Most facial recognition studies were at high risk for this domain due to not performing data preprocessing/cleaning.
Index Test - AI Algorithm
For this domain, 75 studies (22%) were judged to be at high risk of bias, primarily due to inappropriate validation methods. In contrast, 24 (7%) were considered to be at unclear risk due to insufficient reports of the characteristics of the classifying algorithm used.
Reference Standard
For this domain, 25 studies (7%) were judged to be at high risk of bias because they did not specify if participants were diagnosed using valid criteria or with the help of an expert. Also, 29 studies (8%) mentioned that an expert diagnosed participants but did not specify which criteria were used for the diagnosis.
Results of syntheses
Synthesis based on modality
Two hundred thirty-two studies provided sufficient data (sensitivity and specificity values or other diagnostic measures which allowed us to calculate them ourselves) for SCS analysis. Results of the meta-analyses are presented in SROC curves in Figure 5 and Table 2.
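As an illustration of how sensitivity and specificity can be recovered from other reported measures, the following minimal sketch derives them, together with accuracy and the natural log of the diagnostic odds ratio (ln DOR), from a 2x2 table. The function names, the example counts, and the 0.5 continuity correction for empty cells are assumptions made for this sketch and are not taken from the review's own analysis code.

```python
import math


def diagnostic_measures(tp, fp, fn, tn):
    """Return sensitivity, specificity, accuracy, and ln DOR from 2x2 counts."""
    sensitivity = tp / (tp + fn)   # proportion of ASD cases correctly flagged
    specificity = tn / (tn + fp)   # proportion of controls correctly cleared
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    cells = [tp, fp, fn, tn]
    # Assumed continuity correction: add 0.5 to every cell if any cell is zero,
    # so that the odds ratio stays finite.
    if 0 in cells:
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    ln_dor = math.log((a * d) / (b * c))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "ln_dor": ln_dor,
    }


def counts_from_rates(sensitivity, specificity, n_asd, n_td):
    """Back-calculate (tp, fp, fn, tn) from reported rates and group sizes."""
    tp = round(sensitivity * n_asd)
    tn = round(specificity * n_td)
    return tp, n_td - tn, n_asd - tp, tn


if __name__ == "__main__":
    # Hypothetical study: 100 ASD and 100 TD participants.
    table = counts_from_rates(0.8, 0.9, n_asd=100, n_td=100)
    print(table, diagnostic_measures(*table))
```

A study that reports only accuracy cannot be converted this way, because a single number does not determine the four cells of the table.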
Synthesis based on the feature set
In this synthesis, 232 studies that provided sensitivity and specificity were included. We performed statistical synthesis for the feature set used in the studies if there were at least five studies for the feature set. Data were presented as accuracy range in cases with fewer than five studies. See Multimedia Appendix 3: Results of SCS meta-analyses for the results of the syntheses.
Results of meta-regression analysis
All studies were included in the meta-regression analysis to investigate the effect of sample size on the reported accuracy measures. The results of this synthesis are presented in Figure 1. The slope β1 was -0.002, indicating a negative association between the sample size of the studies and their reported accuracy. A t-test showed that this association was statistically significant (p<.001). The intercept β0 was 81.0, while R² was 0.003.
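To make the size of this effect concrete, the sketch below evaluates the fitted line with the reported coefficients (β0 = 81.0, β1 = -0.002) and shows a plain least-squares fit of accuracy on sample size. It assumes an unweighted simple linear regression, which may differ from the model actually fitted in the review, and the three data points used to exercise the fit are hypothetical.

```python
def predicted_accuracy(sample_size, intercept=81.0, slope=-0.002):
    """Accuracy (%) predicted by the fitted line: intercept + slope * n."""
    return intercept + slope * sample_size


def ols_fit(xs, ys):
    """Unweighted least squares with one predictor.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return intercept, slope, r_squared


if __name__ == "__main__":
    for n in (100, 1000):
        print(n, round(predicted_accuracy(n), 1))
    # Hypothetical points lying exactly on the reported line.
    print(ols_fit([0, 500, 1000], [81.0, 80.0, 79.0]))
```

Under these coefficients, the predicted accuracy is about 80.8% for a study with 100 participants and about 79.0% for a study with 1000 participants. An R² of 0.003 means that, under this linear model, sample size accounts for roughly 0.3% of the variance in reported accuracy.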
Results of sensitivity analyses
Thirty-six studies (5 modalities) provided sufficient data to be included in the sensitivity analyses. The results of the sensitivity analyses are presented in Table 3.
Reporting biases
The contour-enhanced funnel plot for the studies that reported both sensitivity and specificity is presented in Figure 6.
Visual inspection revealed an asymmetrical plot, with missing studies in the lower-left portion. Although the statistical test was not significant (p=.115), this does not rule out the presence of publication bias.
Confidence in cumulative evidence
Confidence in cumulative evidence was judged to be low for rs-fMRI, very low for S-MRI, high for EEG, moderate for eye tracking, moderate for T-fMRI, low for DWI/DTI, low for facial recognition, high for other modalities, and very low for multimodal results.
Discussion
Summary of main findings
This research offers the most extensive review of the automated diagnosis or classification of ASD to date. In this systematic review, we screened 12567 records and included 344 studies. For a summary report of the results, see Table 4.
As our results suggest, EEG data may be the most reliable source for establishing models for automated diagnosis or classification of ASD. However, it should be noted that most of the included studies on EEG used in-house datasets, which are usually more homogeneous than the data used in studies on other modalities such as rs-fMRI and S-MRI, where most studies used the more heterogeneous publicly available datasets like ABIDE. Nevertheless, we encourage future research to work more on EEG data, considering the limited number of studies currently available for this modality. Unfortunately, at the moment, no large publicly available dataset exists for EEG of ASD patients, which in turn makes future research on this modality less feasible.
Another finding observed in our results was the high accuracy of the studies that utilized facial recognition models. Although the results of these studies look promising, one should be careful in interpreting these results because most of the studies on this modality used the Kaggle dataset, where most of the images are downloaded from websites and Facebook pages related to autism and thus, might not be considered a very reliable source of data.
We also found other interesting results across the included studies. These results are presented in Highlight box 1.
Highlight box 1: Results of highlighted research. DNN: Deep Neural Network, DWI/DTI: Diffusion-Weighted Imaging/Diffusion Tensor Imaging, EEG: Electroencephalography, FC: Functional Connectivity, FS: Feature Selection, ROI: Region of Interest, rs-fMRI: Resting-state functional Magnetic Resonance Imaging, S-MRI: Structural Magnetic Resonance Imaging.
The study of A Ronicko 2020 [85] found that the classification of ASD using rs-fMRI data improves with full correlation FC compared to partial correlation.
The study of Abdolzadegan 2020 [219] on EEG data found that the use of the DBSCAN algorithm for artifact removal significantly improves classification results.
The study of Ahmed 2020 [199] proposed a single-volume image generator that can produce 2D three-channel images from a 4D fMRI image. They proposed that the generated 2D images promptly represent the brain regions activated during the task performed by the patients.
The study of Aslam 2021 [252] proposed an EEG-based ASD classification processor that targets a patch-form factor sensor that may be used for long time monitoring in a wearable environment.
The study of Cancino 2021 [103] evaluated the impact on the performance of a given model when using different preprocessing strategies on rs-fMRI, namely global signal regression (GSR), bandpass filtering (BPF), BPF + GSR, and none. They concluded that the best preprocessing strategy depends on the classification model used.
The study of Demirhan 2018 [311] concluded that when using S-MRI data, the minimal-Redundancy Maximal-Relevance (mRMR) algorithm was more successful in FS than the ReliefF algorithm when the selected number of the features was the same.
The study of Eill 2019 [381] found that rs-fMRI data was significantly more informative than S-MRI and DWI/DTI data for the classification of ASD and non-ASD participants.
The study of Fan 2021 [302] proposed a novel federated deep learning framework for multi-site 3D brain MRI images that aggregates learned features from different sites without transferring raw data, ensuring the security of subject information, while also holding to high accuracy performance.
The study of Ferrari 2020 [329] introduced a new workflow to deal with confounders and outliers in medical data which allows for finding generalizable patterns even if the dataset is limited.
The study of Georges 2020 [341] investigated a large pool of existing FS techniques for boosting feature reproducibility within a dataset. They introduced FS-Select, a method capable of identifying the best FS method to discover the most reproducible and reliable subset of features.
The study of Graa 2019 [307] introduced a multi-view learning-based data proliferator that enables the classification of imbalanced multi-view representations by generating synthetic data for each view to handle imbalanced data and mapping all original and proliferated views into a shared subspace where their distributions are aligned for the target classification task.
The study of Gupta 2020 [129] proposed a new measure called ambivert degree that considers the node’s degree as well as connection weights in rs-fMRI data to identify nodes with both high degree and high connection weights as hubs, leading to significantly higher classification accuracy with significantly fewer trainable weights compared to using FC features.
In the study of Hu 2020 [102], an interpreting method was proposed which could explain a trained FCNN model with a precise linear formula for each input instance and identify decision features of the model which contributed most to the classification task.
In the study of Huang 2019 [65], a novel sparse low-rank constrained multi-templates data-based method for ASD diagnosis or classification using rs-fMRI data was proposed, which simultaneously performs FS and adaptive local structure learning.
In the study of Huang 2019 [71], novel multiple network-based frameworks for rs-fMRI modality were introduced to enhance the representation of FC networks by fusing the common and complementary information conveyed in multiple networks.
The study of Huang 2021 [174] found that using causal connectivities for FC data improves diagnostic accuracy significantly compared with using correlations and partial correlations.
The study of Jiao 2020 [87] utilized capsule networks to build classifiers for classifying ASD participants and stratifying them into groups with distinct FC patterns.
In the study of Jun 2019 [162], a novel method was proposed that directly models the regional temporal BOLD fluctuations in a stochastic manner and estimates the dynamic characteristics in the form of likelihoods. They also transformed the learned weight coefficients of the model into activation patterns, from which it was possible to identify the ROIs that are highly associated with ASD and TD groups.
The study of Karampasi 2020 [193] compared classification performance using different FS methods, namely Local Learning-based Clustering FS (LLCFS), Infinite FS (InfFS), minimal-Redundancy Maximal-Relevance (mRMR), and Laplacian Score. They concluded that the best FS method depends highly on the classification model. They also introduced two novel features, namely the Haralick texture features and the Kullback-Leibler divergence.
In the study of Kazeminejad 2020 [150], graph theory was used to extract features from the positive correlation matrix (only the positive values of the original correlation matrix), the absolute value of the correlation matrix, and the anti-correlation matrix (only the negative correlation values). After fitting a model to those data, they concluded that graph features extracted from the anti-correlation matrix led to the highest accuracy, suggesting that anti-correlations should not be discarded, as they may include useful information that would aid the classification task.
The study of Kernbach 2018 [108] proposed a transdiagnostic hierarchical Bayesian modeling framework for rs-fMRI data which combined Indian Buffet Processes and Latent Dirichlet Allocation to find shared endo-phenotypes of default mode dysfunction in attention deficit hyperactivity disorder (ADHD) and ASD.
The study of Leming 2021 [301] introduced a novel technique of deriving symmetric similarity matrices from regional histograms of grey matter volumes estimated from S-MRI as features.
In the study of Li 2017 [363], a new approach for estimating FCs by remodeling Pearson’s correlations as an optimization problem was suggested, which provided a way to incorporate biological/physical priors into the FCs.
The study of Li 2020 [62] proposed a federated learning approach, where a decentralized iterative optimization algorithm is implemented and shared local model weights are altered by a randomization mechanism, ensuring that private information cannot be recovered from the model gradients or weights.
The study of Li 2021 [346] designed novel ROI-aware graph convolutional layers for fMRI data, which contained ROI-selection pooling layers that highlight salient ROIs making it possible to infer which ROIs were important for prediction.
The study of Lu 2021 [391] for the first time integrated genomic data with rs-fMRI data for the classification of ASD.
The study of Ma 2021 [159] found that phase synchrony-based classification models of fMRI data outperformed static FC-based models.
The study of Manoharan 2021 [222] on EEG data found that cosine metric and FAN (fixed amount of nearest neighbors) outperformed Euclidean metric and threshold respectively as the distance metric and the neighborhood selection strategy.
The study of Naghashzadeh 2021 [156] exploited the transfer function perturbation (TFP) method to estimate the instantaneous phase and envelope features from rs-fMRI data for the classification task. They concluded that phase features had a significantly lower correlation than envelope features.
The study of Okamoto 2021 [111] implemented a model based on invariant information clustering (IIC) which improved the performance of the leave-one-site-out cross-validation technique.
In the study of Payabvash 2019 [387], tract-based spatial statistics (TBSS) were used for voxel-wise comparison and co-registration of edge density (ED) maps in addition to conventional DTI metrics, resulting in an improvement of model predictive power.
The study of Rabany 2019 [173] utilized dynamic FC measures from rs-fMRI data in a model to classify and study the overlap in neuropathology between schizophrenia and ASD.
The study of Reiter 2021 [120] used different ASD sample compositions stratified by gender and severity score for the classification task. Their findings suggested that model performance varies significantly with the sample composition.
The study of Sadiq 2022 [196] used the non-oscillatory brain connectivity (NOC) method instead of Pearson’s correlation coefficients (PCC) to distinguish subtypes of ASD (namely autistic disorder, Asperger’s disorder, and pervasive developmental disorder-not otherwise specified) from healthy controls. They found that the use of NOC measures significantly improved the performance of the model compared to using PCC.
In the study of Sartipi 2018 [186], ROIs from rs-fMRI data were decomposed using the double-density dual-tree discrete wavelet transform into time-frequency sub-bands and the generalized autoregressive conditional heteroscedasticity (GARCH) model was used for feature extraction from those sub-bands. Extracted features were used in a classification model and achieved satisfactory results.
The study of Shahamat 2020 [315] proposed a novel genetic algorithm-based brain masking (GABM) method for visualization of the knowledgeable regions used by a 3D-CNN trained on the fMRI FC data.
The study of Shi 2020 [128] proposed a novel FS method based on the minimum spanning tree (MST) to seek neuromarkers from rs-fMRI data, which significantly improved the performance of the classification task.
The study of Soussia 2018 [340] used high- and low-order relationships between morphological regions in the S-MRI data as features instead of the typical brain connectomes and found that they improve classification performance.
In the study of Xing 2018 [112], a new convolutional neural network with element-wise filters (CNN-EW) for FC data was proposed that gives a unique weight to each edge of the brain network which may reflect the topological structure information more realistically.
The study of Yang 2019 [86] implemented several classical machine learning classifiers (such as support vector machines, logistic regression, and ridge) on seven different brain atlases (CC400, CC200, AAL, HOA, EZ, TT, and Dosenbach) for rs-fMRI data and found that the most promising brain atlas was Craddock 400 (CC400).
The study of Yang 2020 [81] implemented DNNs with four different hidden layer configurations using the four different pipeline datasets from the ABIDE repository (CPAC, CCS, DPARSF, and NIAK). Their results indicated that the dataset preprocessed by using the CPAC (Configurable Pipeline for the Analysis of Connectomes) pipeline achieved the highest accuracy, recall, and precision.
In the study of Yang 2021 [337], three new models (2D CAM, 3D CAM, and 3D Grad-CAM) were proposed which help with the interpretability of the classification task on S-MRI data.
The study of Yao 2019 [144] proposed a multi-scale triplet graph convolutional network (MTGCN) for rs-fMRI data that utilizes multiple atlases to partition the brain into ROIs and also takes into account the underlying high-order (e.g., triplet) association among subjects.
In the study of Zhang 2020 [202], the fixel-based analysis (FBA) method was implemented on DTI data for the first time to classify ASD participants.
The study of Zhang 2020 [334] introduced path signature (PS) features for rs-fMRI data, which can capture the dynamic longitudinal information of the brain development for ASD Identification.
The study of Zhao 2018 [182] proposed a multi-level, high-order FC network representation as an alternative to the pairwise correlation between ROIs in fMRI data that can capture complex interactions among brain regions.
In the study of Zhao 2020 [158], a central-moment method was proposed to extract temporal-invariance properties contained in either low- or high-order dynamic FC networks.
In the study of Zhuang 2019 [89], invertible networks were used to help interpret the classification model’s decisions on fMRI data, presenting new insights into model interpretation techniques.
In general, we believe state-of-the-art AI algorithms might be used to ascertain diagnosis by differentiating ASD from other conditions with similar symptoms. These tools might also be used as a screening test before the actual diagnosis by the clinician. However, we do not suggest using these tools as a replacement for a clinician’s judgment, as stakeholders would prefer a simple explanation for how an algorithm arrives at its classification of a particular case. Unfortunately, the decision-making process of these algorithms happens in a “black box” manner. We found little effort in the included studies to explain the features their algorithms specifically used to reach their decisions and the justifiable reasons behind them. Thus, we suggest future studies try to present saliency maps or class activation maps (CAM) to visualize the features used by their models to classify ASD individuals and to show to what extent those decisions are in line with the available clinical knowledge. Some other methods to address this issue are provided elsewhere [397]–[402].
The inappropriate use of feature reduction methods in the literature was also an issue. It must be noted that feature reduction steps should be fitted on the training data only; involving the test data in these steps may induce overfitting [403]. Very few studies seem to consider this issue. Some other critical challenges of using automated tools to achieve clinical impact are highlighted in the paper of Kelly et al. in 2019 [23]. Regrettably, very few attempts to address these challenges were observed in the included studies. Most notably, a considerable proportion of the studies reported their results in only one measure (accuracy) without providing other essential measures such as sensitivity and specificity.
Another primary concern with these studies is the cost and convenience of the modalities used. A simple brain MRI could cost between $1600 and $8400 in the U.S. [404], while an fMRI scan could cost the patient around $3500 to $5200 [405]. An EEG usually costs less than MRI and fMRI scans, with an estimated range of $200 to $700 for a standard EEG, but if extended monitoring is required, the cost could go as high as $3000 [406]. However, ASD patient cooperation in performing an EEG is often poor, requiring sedation, which decreases the convenience of the procedure [407].
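Returning to the feature reduction issue raised above, the following minimal sketch contrasts selecting features on the full dataset before cross-validation with selecting them inside each training fold. Everything in it is an assumption made for illustration: the data are pure random noise, features are ranked by the absolute difference of class means, and the classifier is a simple nearest-centroid rule. None of it reproduces a method from the included studies.

```python
import random


def select_top_k(X, y, k):
    """Rank features by absolute difference of class means; return top-k indices."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def centroid_predict(X_train, y_train, X_test, features):
    """Nearest-centroid classifier restricted to the selected features."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in features]

    c0, c1 = centroid(0), centroid(1)
    preds = []
    for row in X_test:
        d0 = sum((row[j] - c) ** 2 for j, c in zip(features, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(features, c1))
        preds.append(1 if d1 < d0 else 0)
    return preds


def cross_validate(X, y, k, n_folds, leak):
    """k-fold accuracy; leak=True selects features on ALL data (test included)."""
    idx = list(range(len(X)))
    folds = [idx[i::n_folds] for i in range(n_folds)]
    leaked = select_top_k(X, y, k) if leak else None
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Without leakage, selection only ever sees the training fold.
        feats = leaked if leak else select_top_k(X_tr, y_tr, k)
        preds = centroid_predict(X_tr, y_tr, [X[i] for i in fold], feats)
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: 40 participants, 500 noise features, no real signal.
    X = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(40)]
    y = [i % 2 for i in range(40)]
    print("selection on all data:", cross_validate(X, y, 10, 5, leak=True))
    print("selection per training fold:", cross_validate(X, y, 10, 5, leak=False))
```

Because the features carry no signal, an unbiased estimate should sit near chance (50%). Selecting features on the full dataset tends to push the cross-validated accuracy well above that level, which is the kind of optimistic bias the warning above refers to.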
Our results also indicate that overfitting is a serious issue in the included studies, as the meta-regression revealed that sample size was significantly correlated with accuracy results across the studies. A recent paper by Ying 2019 [408] offers a variety of solutions to address the overfitting problem. These include the early-stopping strategy, network-reduction strategy, data expansion, and use of regularization terms and dropout technique. Also, linear classifiers are less prone to overfitting than non-linear classifiers due to their lower complexity [116]. Such strategies should be strongly considered in future studies, especially with small sample sizes.
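Two of the strategies listed above, a regularization term and early stopping, can be sketched for a linear classifier as follows. This is a minimal illustration under stated assumptions: a logistic regression trained by batch gradient descent, an L2 penalty with strength `l2`, and a patience-based stopping rule on a held-out validation set. The function names, hyperparameter values, and toy data are hypothetical and are not drawn from any included study.

```python
import math


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, X, y):
    """Mean binary cross-entropy of weights w on (X, y)."""
    eps = 1e-12
    total = 0.0
    for row, label in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(X)


def fit_logistic(X_train, y_train, X_val, y_val,
                 lr=0.1, l2=0.01, max_epochs=500, patience=10):
    """Gradient descent with an L2 penalty and early stopping.

    Returns (best_weights, best_validation_loss, epochs_run).
    """
    w = [0.0] * len(X_train[0])
    best_w, best_loss, waited = list(w), float("inf"), 0
    epochs_run = 0
    for _ in range(max_epochs):
        epochs_run += 1
        grad = [0.0] * len(w)
        for row, label in zip(X_train, y_train):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - label
            for j, xj in enumerate(row):
                grad[j] += err * xj
        # The l2 * wj term shrinks weights toward zero (regularization).
        w = [wj - lr * (g / len(X_train) + l2 * wj) for wj, g in zip(w, grad)]
        val_loss = log_loss(w, X_val, y_val)
        if val_loss < best_loss - 1e-6:
            best_w, best_loss, waited = list(w), val_loss, 0
        else:
            waited += 1
            if waited >= patience:  # validation loss stopped improving
                break
    return best_w, best_loss, epochs_run


if __name__ == "__main__":
    # Toy data: first column is a constant bias term, second is one feature.
    X_tr, y_tr = [[1, -1], [1, -2], [1, 1], [1, 2]], [0, 0, 1, 1]
    X_va, y_va = [[1, -1.5], [1, 1.5]], [0, 1]
    print(fit_logistic(X_tr, y_tr, X_va, y_va))
```

The weights returned are those with the lowest validation loss seen during training, not the final ones, so the validation set must be kept separate from the test set used for the reported performance.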
A significant flaw in the literature was that most studies relied on only two class labels (ASD and Non-ASD) for data labeling, assuming that both cases and controls are well-defined entities. This approach ignores that ASD can be broken down into subgroups such as typical autism, low/high function autism, idiopathic autism, and monogenic autism. This approach also ignores that the diagnostic criteria for ASD are composed of behavioral symptoms that may overlap with other mental disorders, such as attention deficit hyperactivity disorder (ADHD).
Overall completeness and applicability
There were no significant concerns regarding completeness and applicability in this review. Assessments of indirectness using the GRADE tool confirmed this.
Quality of the evidence
Unfortunately, the currently available evidence is rife with bias, especially in the patient selection domain. Future studies should use more rigorous methods to obtain robust results. Also, we could not include a considerable proportion of the studies in the meta-analyses, as these studies reported only a single evaluation metric (mainly accuracy). We tried contacting the authors for additional data but were unsuccessful in most cases. Including those studies could have had a significant effect on our results.
Potential biases in the review process
There was a considerable risk of publication bias, which lowered our confidence in the results.
Implications for research
Our recommendations for future research are summarized in Highlight box 2. The most prominent weakness in the literature was the need for better methodological and reporting quality. For example, in most studies, we could not assess the risk of bias for at least one domain simply due to inadequate reporting. We strongly suggest that future researchers consider using standard reporting guidelines such as MINIMAR [409], CLAIM [410], and STARD-AI [411], which are specifically developed for reporting AI studies in healthcare.
We also detected a similar pattern across the literature: most studies implemented an AI algorithm on the data of ASD individuals and reported the results of the classification task, but very few studies tried to explain which features their algorithms used or how justifiable those features were for the classification task. As discussed above, providing a reasonable justification for the algorithm’s decisions is as important as its results for such a technology to enter the clinical setting. Future studies should consider addressing this issue. Also, future studies should consider deriving fairer decisions from their AI models by (1) pre-processing: transforming the original dataset so that the underlying discrimination towards some groups is removed; (2) in-processing: adding a penalization term in the objective function or imposing a fairness-relevant constraint; and (3) post-processing: further recomputing the results from predictors to improve fairness. More on this subject can be found elsewhere [412].
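As one possible instance of the pre-processing route, the sketch below computes per-sample weights with a reweighing scheme, in which each sample receives the weight P(group) × P(label) / P(group, label). After weighting, group membership and class label are statistically independent in the training data. The technique is chosen here for illustration only and is not described in the source. The group coding ("M"/"F") and the example counts are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Per-sample weights that decouple group membership from class label.

    weight(g, y) = P(g) * P(y) / P(g, y), estimated from the sample itself.
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


if __name__ == "__main__":
    # Hypothetical sample: 6 male and 2 female participants, label 1 = ASD.
    groups = ["M"] * 6 + ["F"] * 2
    labels = [1, 1, 1, 1, 0, 0, 1, 0]
    for g, y, w in zip(groups, labels, reweighing_weights(groups, labels)):
        print(g, y, w)
```

The resulting weights can be passed to any learner that accepts sample weights. Combinations that are over-represented relative to independence (here, male participants with ASD) receive weights below 1, and under-represented combinations receive weights above 1.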
Future studies must also consider the overfitting problem by addressing data-level and model-level issues. This includes using bigger training data, dropout technique, L1/L2 regularization, and batch normalization [413]. The results of the studies must also be reported in more than just one evaluation metric (sensitivity, specificity, recall, precision, etc.).
Another prominent issue across the literature was the use of datasets that do not optimally reflect the biological diversity of patients, with gender imbalance being the most frequent example. Current publicly available datasets contain primarily male patients, which may result in male-biased models. A similar pattern was observed for participants’ functioning, where most data came only from high-functioning participants. Thus, the development of more balanced datasets seems to be a necessity. An ideal dataset should be Findable, Accessible, Interoperable, and Reusable (FAIR) [414]. More on the requirements for an ideal medical imaging dataset in the era of AI, including the essential metadata required, is discussed elsewhere [415].
We also encourage future researchers to evaluate and compare the biomarkers used to differentiate cases from controls by different models within the same modality, as such data could be of enormous value in finding markers specific to ASD patients and also help illuminate the “black-box” decision-making nature of these models.
Studies should also try to break ASD and non-ASD groups into smaller subgroups like “mild autism” and “severe autism” or “low function autism” and “high function autism.”
Finally, a gap in the knowledge is human-machine collaboration. As AI algorithms can complement, but not replace, physicians in most aspects of medicine, future models should be studied and implemented as an integral part of a complete healthcare system. Indeed, scientists, physicians, patients, regulatory agencies, and health insurance providers need to create a healthcare system that can learn and adapt as it develops [416]. At the same time, AI researchers should recognize the limits of their models to prevent their overuse and misuse, which could otherwise sow distrust and cause patient harm [417].
Highlight box 2: Recommendations for future research.
Use standard reporting guidelines (such as MINIMAR, CLAIM, and STARD-AI).
Use saliency maps, class activation maps (CAM), invertible networks, or other methods of model interpretation to avoid black-box dilemmas in AI modeling.
Use the early-stopping strategy, network-reduction strategy, data expansion, regularization terms, dropout technique, or other techniques to avoid model overfitting on data.
Report evaluation performance in more than just one metric (e.g., sensitivity, specificity, positive predictive value, etc.).
Use datasets that are balanced for the most important potential confounders (e.g., gender, age, IQ, ASD subtype, etc.).
Develop biologically diverse datasets to address the gender imbalance and functioning imbalance problems of current publicly available datasets (e.g., ABIDE).
Develop publicly available datasets (like ABIDE for fMRI) for other modalities, especially EEG data as its results are promising.
Avoid aggregating data from different subtypes of ASD into one general “ASD class”.
Evaluate the model performance on a separate dataset unseen to the model (i.e., test dataset).
Avoid applying feature reduction methods to test data as it may induce overfitting.
Declarations
Availability of data and materials: To access the data of the studies, contact their respective authors. Review data are available as appendix files.
Competing interests: The authors declare that they have no competing interests.
Funding: Not funded
Authors’ contributions:
Coordination of the review: AV, MM, ANA, AHM
Designing the study: AV, MM, ANA, SHS, IMO, FA, AHM
Developing the protocol: AV, MM, ANA, AHM
Performing the search: AV, ZMG, AM, MG
Study selection: AV, MM, RA, SHH, MST, ZMG
Data extraction: AV, RA, SHH, MST, ZMG
Assessing the risk of bias in included studies: AV, MM, SHH, MST, RA, ZMG, SF
Analysis of data: AV
Interpretation of the results: AV, MM, AHM
Assessing the confidence in cumulative evidence: AV, SHH, MST, RA, ZMG
Writing the review: AV, MM, ANA, SHS, IMO, FA, AHM
Correspondence: AHM
Acknowledgments
None
Registration: PROSPERO registration ID: CRD42021262575, CRD42021262825, CRD42021262831.
Amendments: Initially, we planned to research three modalities (EEG, rs-fMRI, and S-MRI) and publish the results in three papers. However, we decided to extend our research after posting our protocol and before conducting our search to present all the results in one comprehensive article.
Footnotes
This is the first draft of the manuscript.