Abstract
Reading ability is a complex skill requiring multiple proficiencies (e.g., phonological awareness, decoding, and comprehension). Reading ability has genetic and environmental components that create the potential for significant gene-gene and gene-environment interactions, but the evidence for these interactions is limited. We used data from the Avon Longitudinal Study of Parents and Children and the Genes, Reading and Dyslexia Study to assess the contributions of genetic and demographic features to a continuous latent reading ability score. We then used this score as the phenotype on which to predicate genome-wide single nucleotide polymorph screening, followed by feature selection using an elastic net analysis. Results from the elastic net models showed that genetic and demographic features predicted reading ability for both cohorts. Five single nucleotide polymorphisms were associated with latent reading in the Avon Longitudinal Study of Parents and Children, as well as in the Genes, Reading and Dyslexia cohorts. For both cohorts, larger vocabularies were positively associated with latent reading ability. Genes within the neuron migration pathway were overrepresented in the Avon Longitudinal Study of Parents and Children cohort. We provide support that genes involved in early brain development have an impact on latent reading ability performance. Our findings also indicate high generalizability of genetic findings between cohorts, using our approach.
Efficient and adequate reading is the result of several neurological processes and pathways 1, 2 that enable critical component skills, which include phonological awareness 3, 4, decoding 5, 6, fluency 7, 8, and comprehension. There is substantial evidence that reading ability has a strong genetic component 9–11. Understanding the genetics of reading ability will help to identify key developmental periods and underlying molecular processes. Although reading ability has a high heritability, ranging from 40 to 60% 12, the current knowledge about specific genes and genetic variants for reading is limited by small datasets and the traditional statistical approaches that have been used for genetic analysis. The overall goal of this study was to examine genetic associations with reading ability using an alternative statistical approach to identify genes not previously implicated by other approaches.
Reading development and performance has a complex etiology involving genetics, as well as environmental and demographic factors. Past research has established genetic contributions to reading disability (i.e., difficulties learning to read which cannot be explained by neurological or sensorial conditions, including dyslexia and other subtypes) 13–15 and performance on quantitative traits used to determine reading disability status (e.g., nonword reading 16) or correlated with reading (e.g., multivariate rapid automatized naming/rapid alternating stimulus 17). Recent investigations involving reading disability and related tasks have provided evidence for polygenicity, gene-gene interactions, 18 and functional biological pathways 19, 20. For example, KIAA0319/TTRAP and DYX1C1 interact with GRIN2B in children with a reading disability when performing a short-term memory task 21 and DCDC2/KIAA0319 interact to diminish single word reading, nonword reading, spelling, phoneme deletion, and comprehension 22. Additionally, pathway analysis suggests the effects of functional mechanisms such as neuron migration, neurite outgrowth, cortical morphogenesis, and ciliary structure and function 23 on reading. Reading development has also been linked to a number of other factors including biological sex, birth weight, gestational weeks, mother’s highest education, and language ability24. Mascheretti and colleagues 21 suggest that birth weight and gestational weeks are potential environmentally-related factors for dyslexia and that environmental factors may interact with each other and genetic risk. For example, they suggested that teacher quality and parental education may interact with genetic risk for dyslexia, either exacerbating or providing protection against dyslexia. Gu and colleagues 25 reported that two non-coding single nucleotide polymorphisms (SNPs; rs3779031, rs987456) within CNTNAP2 interact with environmental factors in females but not males. In females, scheduled reading time interacted with rs987456 to reduce the risk of dyslexia. In summary, past research has revealed that the genetic contributions to reading development and performance are a complex system with multiple genes involved, as well as gene-gene and gene-environment interactions.
The most significant limitation to genetic studies of reading performance, reading disability, and dyslexia have been the small size of cohorts available for study. However, our review of published studies exploring genetic association with reading disability or reading ability identified two additional limitations. We searched the Genome Wide Association Study (GWAS) catalog for publications on reading or dyslexia and reviewed the included studies in Carrion- Castillo and colleagues 12 (see Table 1 for identified articles). Past research has (1) represented reading as either case-control (3/15 published studies, Table 1) or single task performance (6/15 published studies) and (2) relied on one-SNP-at-a-time statistical approaches (12 out of 15 published studies), examining reading disability or reading and language performance. In 7 out of 15 association studies, only one or two measures of reading or reading-related tasks were used to assign affected status as a binary variable or as a continuous variable.
Additionally, 12 out of 15 studies published to date used the classical one-SNP-at-a-time to identify common variants. Neither of these conceptualizations fully captures the complexity of reading and neither accounts for possible measurement error (i.e., the difference between the “true” score and what we can observe or measure).
There are several methods to overcome these limitations. One method to account for possible measurement error is to create and use a latent reading ability score as the phenotype for genetic analysis. Three studies 26–28 have used either principal components (PCs) or a regression based composite score as their phenotype; however, these methods are not always adaptable across datasets or even iterations within a dataset. In contrast, confirmatory factor analysis can be used to create a latent reading ability score using multiple measures and is adaptable across datasets with different reading measures. For the statistical analyses, assessing genetic associations one at a time can result in data loss, especially in the small sample sizes that characterize the genetics of reading research, since multiple test correction methods must be applied. This statistical bottleneck is slowing the identification of potentially relevant genes and SNPs. It is possible to address this limitation by employing additional statistical models, such as elastic net, in conjunction with genome-wide association, to increase the number of informative SNPs. Due to these limitations, previous studies may have missed important genetic factors that contribute to or protect against reading impairments.
There is limited knowledge of how environmental and demographic factors interact with genetic factors because to date only three studies have examined gene-environment interactions in reading 25, 29, 30. Due to constraints imposed by research design and statistical analysis, few studies have integrated genetic, environmental, and demographic data within the same analysis 19. Beyond statistical constraints, another limiting factor in understanding the interactions between genes and environmental-demographic features is the demographic similarity between the most frequently used cohorts. Cohorts that capture a wider range of environmental-demographic features must be included, so that findings are generalizable beyond samples representing European descent. Increasing numbers of demographically varied cohorts are being utilized, as a result of recruitment of understudied groups within genetics 17. Because our understanding of the genetics of reading is limited by prior statistical and cohort constraints, we do not know how many genes are relevant for understanding the genetics of reading, which biological pathways are crucial, or how including environmental and demographic factors influence genetic associations. By combining machine learning (i.e., statistical models that learn from data) and confirmatory analytical methods, we can further our understanding of the genetics of reading ability and contribute to the refinement of the field’s hypotheses. We have previously developed methods for combining statistical and machine learning approaches with biological domain knowledge to study the association between genetic and environmental factors and disease 31–33, and have applied them to study various disorders19, 34–43, including those involving reading ability 19.
In this study, we sought to address two limitations in the research exploring genetic contributions to reading ability: the focus on a reduced phenotype and over-reliance on certain statistical methods. We hypothesized that (1) the elastic net model would generate more replicated genetic markers associated with latent reading ability between datasets than traditional methods, (2) informative genetic markers would be overrepresented in certain biological processes, and (3) we would find informative positive and negative associations. Our approach was to use confirmatory factor analysis to create a latent reading ability score and combine machine learning with genome-wide association study (GWAS) approaches. We studied a robust and well-known population dataset, the Avon Longitudinal Study of Parents and Children (ALSPAC) 44, to test our initial hypotheses and then replicated our findings in an ethnically/racially diverse dataset, the Genetics of Reading and Dyslexia (GRaD) Study.
Methods
We obtained ethical approval for this study from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committee(s) (Arizona State University Institutional Review Board).
Datasets
Table 2 provides descriptive information for the ALSPAC and GRaD cohorts. Table 3 reports reading data for the ALSPAC and GRaD cohorts, including sample size for each variable, mean, and standard deviation.
ALSPAC
Our discovery cohort was the ALSPAC. The ALSPAC is a population-based birth cohort which has been extensively described in various studies 44–47. The total sample size was 15,454 pregnancies, resulting in 15,589 fetuses, 14,901 of which were alive at 1 year of age. For this study, we used data from 8,071 participants who had participated in at least one of the ALSPAC “Focus at” sessions. These Focus sessions began at age 7 with Focus at 7 and collected behavioral and physiological data not easily assessed via questionnaire. We used data from Focus at 7, Focus at 8, and/or Focus at 9 and for whom genetic data were available. Measures included parent surveys and clinical data. The study website contains details for all the data that are available through a fully searchable data dictionary (http://www.bristol.ac.uk/alspac/researchers/our-data/). The inclusion criteria for this study were: (1) no diagnosis of autism spectrum disorder, (2) normal hearing status at Focus at 7, (3) nonverbal intelligence greater than 72 standard score on the Weschler Intelligence Scales for Children 48, (4) reading data from two Focus sessions, and (5) genetic data. Lastly, for twin-pairs one child was randomly selected for analysis to achieve data independence, which resulted in 186 children being removed from the analysis.
GRaD
Following discovery analyses in the ALSPAC, we replicated our procedures using the GRaD Study database 49. The GRaD Study is a multi-site, case-control study of reading disability in minority youth in the USA, Canada and Puerto Rico. Phenotype information and DNA were collected for 1,435 participants. To be included in the GRaD cohort, participants had to be African American or Hispanic American between the ages of 8 and 15 years with typical or disordered reading. Exclusion criteria were age outside the target range; non-minority race/ethnic status; foster care placement; preterm birth (<36 weeks); prolonged stay in NICU; history of diagnosed or suspected significant developmental delay, behavioral problems, serious emotional/psychiatric disturbances, chronic neurological condition, vision loss, or hearing loss; and frequent school absences. We used data from all participants in the GRaD cohort for our analyses. We had complete behavioral data for 1,409 participants to create the reading ability score and 1,341 participants who passed genomic quality control for final data analyses.
Measures
ALSPAC
We used behavioral and demographic measures, including reading, language, and nonverbal intelligence measures, collected between the ages of 7 and 9. Reading skill was measured during Focus at 7 and Focus at 9 using a combination of word reading, spelling, and connected text tasks. At Focus at 7 years, children completed the single word reading subtest on the Wechsler Objective Reading Dimensions 50 and an experimenter derived spelling task 51. Nonword repetition was assessed at Focus at 8 52. At Focus at 9, children completed single word reading, nonword reading 53, and spelling tasks similar to the ones presented during Focus at 7 years, but with new words/items. Additionally, at age 9, children completed the Neale Analysis of Reading Ability (NARA; Neale, 1997), which provided scores for reading rate, accuracy, and reading comprehension. The Wechsler Intelligence Scale for Children (WISC; Wechsler et al. 1992) (Focus at 8) yielded an estimate of nonverbal intelligence. Receptive language was assessed using the Wechsler Objective Language Dimensions Language Comprehension subtest (WOLD; Rust, 1996). Vocabulary was measured using the Wechsler Intelligence Scales for Children vocabulary subtest (WISC; Wechsler, Golombok, & Rust, 1992)
We selected biological sex, birth weight, maternal education, child ethnicity, bilingual language status, hearing function, and Attention Deficit Hyperactivity Disorder (ADHD) status, as our demographic measures. Biological sex and birth weight in grams were reported at birth. Maternal education was obtained at 32-weeks’ gestation and measures the highest degree the mother had obtained by that point: Vocation, certificate of secondary education, O-levels, A-levels, or College degree. Child’s ethnicity was reported by mothers at 32 weeks’ gestation. Bilingual language status was obtained via parent report at Focus at 8. Hearing functioning was measured via bone conduction at Focus at 7. ADHD) status was determined at age 7 using parent and teacher questionnaires.
GRaD
We used reading, language, and demographic data from the GRaD. All data was collected during one or two sessions. Reading was measured using multiple indicators, including the Test of Word Reading Efficiency (TOWRE; Torgesen et al. 2012), Woodcock- Johnson Tests of Achievement (3rd Edition; WJ-III; Woodcock et al. 2001), Clinical Test of Phonological Processing (CTOPP; Wagner et al. 2013), and Standardized Reading Inventory (SRI; Newcomer 1986). The TOWRE provided two timed measures of reading fluency – Sight Word Efficiency and Phonemic Decoding Efficiency. The TOWRE measures a child’s ability to quickly and accurately read real words and pseudowords. The WJ-III provides three measures of reading – Letter-Word Identification, Word Attack, and Spelling. Letter-Word Identification and Word Attack are untimed analogous tasks to those of the TOWRE test, in which children read lists of known and pseudowords. Similar to the ALSPAC spelling task, the WJ-III Spelling task requires children to correctly spell an orally presented word. The CTOPP-Blending Words requires children to combine phonemes into a word. This task taps into phonological awareness. The Standardized Reading Inventory provides a comprehension and word recognition score. Receptive vocabulary was measured using the Peabody Picture Vocabulary Test 60.
We selected parent reported ADHD status, Hispanic/Latino status, child’s race, and maternal education (either birth or, if relevant, adoptive) as demographic measures. ADHD status was defined as having received a psychiatric diagnosis of ADHD. Maternal education was obtained via parent questionnaire and was defined as (1) less than 7 years of school, (2) 7 to 9 years of school, (3) 10 to 11 years of school, (4) high school diploma or GED, (5) associate degree, trade, or business school, (6) bachelor’s degree, and (7) professional or advanced degree.
Genotyping
ALSPAC
ALSPAC samples were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andme, subcontracting the Wellcome Trust Sanger Institute, (Cambridge, UK) and the Laboratory Corporation of America (Burlington, NC, USA). The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (identity by descent (IBD) < 0.8). Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed. SNPs with a minor allele frequency of < 1%, a call rate of < 95%, or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as a proportion of IBD > 0.1. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 participants and 500,527 SNPs passed these quality control filters.
GRaD
Saliva was collected using Oragene-DNA kits and DNA was then extracted with prepIT-L2P (OG-500; DNA Genotek Inc, Ottawa, Ontario, Canada). Participants were genotyped for >2 million SNPs using the Illumina Infinium Omni2.5-8 BeadChip at the Yale Center for Genome Analysis (Orange, CT, USA). Initial genotyping quality control and SNP genotyping calls were conducted using GenomeStudio (Illumina, San Diego, CA, USA) and standard Infinium genotyping data analysis parameters were used to optimize genotyping accuracy. Individuals were removed if they were missing more than 3% of their genotypes (n = 39), if there were discrepancies between reported and inferred sex based on X chromosome heterozygosity (n = 52), and if IBD > 0.125 using REAP (n = 10) . SNPs were removed from downstream analyses if they had minor allele frequency of less than 5% (n = 926,457), missingness greater than 5% (n = 22,849), Hardy-Weinberg equilibrium p<0.0001 (n = 116,259) or were not autosomal (n = 60,551). 1,331 participants and 1,265,623 SNPs passed these quality control filters.
Statistical Analysis
Creating reading ability score
To create the latent reading ability score, we used confirmatory factor analysis (CFA) to load all of the reading variables onto a single factor. We used lavaan 62 to fit and assess the reading ability model. Current practice is to use several model fit criteria instead of relying on a single measure. We assessed model fit using a combination of absolute, parsimonious, and comparative indices of model fit 63. To determine goodness of fit, we evaluated (a) Tucker-Lewis Index (TLI), (b) root mean square error approximation (RMSEA), and (c) standardized root mean square residual (SRMR). We used the TLI to assess comparative or incremental fit. The TLI is a non-normed fit index that is analogous to the r-squared coefficient, with penalties for added parameters. Like the r-squared coefficient, higher values indicate better fit, with the traditional cutoff value for good fit at 0.90. Our index of parsimony was RMSEA. RMSEA ranges from 0 to 1 with values of < 0.08 representing acceptable fit and values < 0.05 representing close fit 64. We report the 90% confidence interval for the RMSEA and the p-value for the closeness of fit test, which tests the null hypothesis that RMSEA is <= 0.05; this test should result in a nonsignificant p-value. Our index of absolute fit was SRMR, which represents the squared difference between observed and predicted correlations and for which values < 0.08 are considered acceptable.
After assessing the fit of the model, we extracted the lambda values associated with manifest paths that exceeded 0.20. The following equation was used to approximate an individual’s reading ability score: wherein each lambda was multiplied by the corresponding z-score for a measure and the products were summed together. By using this method, we can approximate the latent construct of reading ability instead of relying on a single reading measure. After approximating the reading ability, we assessed the distribution of scores and normalized if necessary. We used the reading ability score as the phenotype for SNP screening and as the outcome variable for the elastic net model.
Imputation for missing data
We used multiple imputation methods to deal with missing data. For the multiple imputation, we used mice 65. For each imputation, we estimated scores for missing reading measures, fit the reading ability model to the imputed dataset, and approximated reading ability scores. We averaged reading ability scores for the five imputations and determined the variability. We used the average reading ability score as the phenotype for SNP screening and as the outcome variable for the elastic net model.
SNP screening
To constrain the high-dimensionality of the dataset, we used genome- wide association (GWA) to screen SNPs prior to multivariate modeling. GWA was completed in PLINK 66 and performed chromosome-by-chromosome based on criteria used in prior studies 67. We selected up to 100 SNPs based on uncorrected p-values and SNPs with an FDR-BH of 0.1 or less to be included in the subsequent multivariate modeling. Because the GRaD dataset contains more than one ethnic group, we included 10 ancestry principal components. We used the standard settings in PLINK.
Machine learning
There are many options for machine learning. We selected a procedure that would allow for correlated features, more features than subjects, multiple data types, and multiple feature selection. We performed multivariate modeling using an elastic net model to link reading ability with the SNPs that survived the screening step, as well as demographic, environmental, and behavioral covariates. An elastic net is a regularized regression model that enables simultaneous feature selection (in our case, variables are the SNPs) in a high-dimensional setting 68, 69. It adds two regularization terms to the loss function of an ordinary regression model: one L1-norm regularization whose effect is to force the regression coefficients of small effects to be exactly zero, thus enabling feature (e.g., SNP) selection; a second L2-norm regularization term insures highly correlated SNPs are selected. There are two tuning parameters corresponding to the two regularization terms to balance with the loss function. Tuning parameter selection is typically done using cross validation (see below).
We used 5-fold cross-validation to determine the best tuning parameters. In this process, the sample is split into five random groups, four of which are used to train the model and one for testing. This splitting repeats until every “fold” has served as the test set. Cross-validation was performed 10 times to select tuning parameters. After the best tuning parameters were identified, the model was refit using all the data to generate coefficients.
In addition to SNP only modes, we ran SNP plus demographic feature models. For ALSPAC, we included nonverbal IQ, vocabulary, receptive language score, ADHD status, birth weight, bilingual language status, and mother’s highest education in the multivariate model. For GRaD, we selected ADHD status, Hispanic/Latino status, child’s race, mother’s highest education (either birth or, if relevant, adoptive), and vocabulary. However, for GRaD, we were only able to use child’s race and vocabulary score (measured by Peabody Picture Vocabulary), due to large missingness for the other variables (ADHD missing = 266, Hispanic/Latino missing = 138, mother’s education missing = 215). We conducted this analysis to determine the impact of SNPs on reading when controlling for demographic, environmental, and behavioral contributions.
Pathway enrichment and network analysis
We mapped informative SNPs from the elastic net to genes using g:SNPense on g:Profiler 70. After mapping SNPs to genes, we performed enrichment analysis using g:GOSt on g:Profiler. g:Profiler was selected over similar tools because recent comparisons of the available tools showed that g:Profiler has the most up- to-date repository of pathways and draws from multiple curated sources (e.g., KEGG, Reactome). However, enrichment analysis alone only identifies those pathways that are overrepresented in a gene list but it cannot delineate how these pathways interact. Therefore, we used Cytoscape to explore how the pathways were connected 71. Cytoscape performs network analysis on biological pathways and produces visualizations and network statistics.
Results
Latent Reading Ability Score Creation
ALSPAC
We fitted a single factor model to the reading variables after z-score transformation. Two measures of goodness of fit indicated highly acceptable fit (TLI = 0.925, SRMR = 0.033), whereas chi-square ( χ2 (df = 27) = 1307.97, p < .001) and RMSEA (RMSEA = 0.124, 95% confidence interval = 0.118, 0.131, p = < .001) did not meet the guidelines for fit. All manifest paths were significant and had standardized path values over 0.70 (range = 0.745, 0.929). There were no negative variances, and the model explained a large amount of variance in reading performance (average variance = 0.667). These results indicate that the single factor model fit the data and could be used to create a reading ability score.
After computing the latent reading ability score, we investigated the distribution. Because of the nature of the score, the mean was zero and the standard deviation was one. Reading ability scores ranged from -3.46 to 2.07 and inspection of the plot indicated slight right skew. We investigated the reading ability scores for children with a reading disorder compared to children with typical reading ability. A reading disorder was defined as failing three of the reading variables. Children with a reading disorder had lower scores compared to children with typical reading (Supplemental Figure 1). These results indicated that the latent reading ability score functioned as desired, by capturing the full range of reading ability. We proceeded with the genetic analyses using this latent reading ability score. Additionally, we imputed the latent reading ability score for children for whom reading data were missing using multiple imputation.
GRaD
We fitted a single factor model using TOWRE, CTOPP-Blending Words, WJ-III Word Reading, WJ-III Word Attack, WJ-III Spelling, and SRI – Passage Comprehension, and SRI WR measures. Twenty-six children had incomplete data. We used robust modeling to account for missing data. Two goodness-of-fit metrics were highly acceptable (TLI = 0.909, SRMS = 0.031), whereas chi-square (χ2 (df = 20) = 730.21, p < .001) and RMSEA (RMSEA = 0.159, 95% confidence interval = 0.149, 0.168, p < .001) did not indicate good fit. The standardized path values ranged from 0.528 (CTOPP – Blending Words) to 0.913 (SRI WR). The model accounted for a large amount of variance in reading ability (average variance = 0.754). There were no negative variances. These results indicate that the single factor model fit the data and could be used to create a latent reading ability score.
We inspected the distribution of the latent reading ability score, which had a mean of 0 and standard deviation of 1. There was no evidence of non-normality in the histogram (Supplemental Figure 1).
ALSPAC
Genome-wide association and multivariate modeling
We performed a GWAS predicated on latent reading ability and imputed reading ability scores in the ALSPAC cohort. The latent reading ability analysis will be called “Analysis_1” and the imputed reading ability analysis will be called “Analysis_2” henceforth. After multiple test correction, Analysis_1 had four significant SNPs and Analysis_2 had 24 significant SNPs when using an FDR-BH of 0.05. Analysis_2 had an additional 67 SNPs with an FDR-BH of less than 0.1, all on chromosome 17 (Supplemental Table 1). Table 4 contains the list of SNPs that were significant after multiple test correction and genomic information for Analysis_1 and Analysis_2.
For the elastic net model, we used 149 SNPs for Analysis_1 and 250 SNPs for Analysis_2. Seventy-one and 67 SNPs were identified as informative (i.e., had a beta value greater than the absolute value of 0.01) for Analysis_1 and Analysis_2, respectively. Table 5 reports genomic information and beta weights from the elastic net model for informative SNPs. Supplemental Table 1 provides the lists of SNPs used in both Analysis_1 and Analysis_2. Across the two analyses, there were seven SNPs commonly selected. For Analysis_1, 24 SNPs were positively associated with latent reading ability (i.e., predicted better reading) and 47 SNPs were negatively associated with latent reading ability (i.e., predicted worse reading). For Analysis_2, 17 SNPs were positively associated with imputed reading ability, 50 were negatively associated. SNPs were selected from across the genome with the majority on chromosome 6 for Analysis_1 and on chromosome 15 for Analysis_2.
We used a linear regression model to assess the fit of Analysis_1 and Analysis_2. Analysis_1 fit the data significantly better than a null model (F (1, 83) = 8.22, p < .0001) and explained roughly 14 percent of the variance in reading ability (Adjusted R2 = 0.141). The positively associated SNPs were located within 16 genes with the majority representing intron variants. The negatively associated SNPs were located within 21 genes with most variant effects being intronic. SNPs mapped to DNAAF4 had positive and negative associations with reading ability. RAPGEF2 and GRIN2B were negatively associated with reading ability.
Analysis_2 fit the data significantly better (F (1, 75) = 10.07, p < .0001) than a null model and explained roughly 14 percent of the variance in the reading score (Adjusted R2 = 0.138). The positively associated SNPs were located within 11 genes while the negatively associated SNPs were located within 24 genes. Most of these variants were intron or non-coding variants for both positive and negative associations.
We compared the results from both analyses to identify which SNPs replicated internally and mapped the replicated SNPs to genes. The seven SNPs common to both Analysis_1 and Analysis_2 mapped to six genes, which were KIAA0319 (chr 6), FOXP2 (chr 7), DRD2 (chr 11), CYP19A1 (chr 15), DNAAF4 (chr 15), and ATP2C2 (chr 16). These SNPs were the SNPs included from previous genome-wide studies on dyslexia and reading traits, thus replicating previous research.
Pathway and network analysis
For Analysis_1, there were two significantly overrepresented biological pathways (GO:0010996, response to auditory stimulus, p = .0315; KEGG:04015, Rap1 signaling pathway, p = .02395). For Analysis_2, we replicated the significance of response to auditory stimulus (GO:0010996, p = 0.0186), but not the Rap1 signaling pathway. We were unable to perform the network analysis due to the limited number of pathways identified.
Addition of demographic features
We investigated the impact of nonverbal IQ, birthweight, mother’s highest education, vocabulary at age 8, receptive language, and bilingual language status in the presence of genetic features. Our elastic net model selected all the demographic features as informative in the presence of genetic features. It yielded positive beta weights for nonverbal IQ, vocabulary, mother’s highest education, and bilingual language status, indicating that higher values on these features were associated with higher latent reading ability scores. Birthweight had a negative association with latent reading ability; however, further examination of this relationship did not indicate a negative trend but rather a near zero correlation (Pearson’s r = 0.017). Examining the beta weights indicate that birthweight and bilingual language status had the lowest associations with latent reading ability, which may indicate that these factors were selected due to their relationship with other factors in the model and not directly with latent reading ability. Relationships between demographic features and latent reading ability are presented in Supplemental Figure 2.
The addition of demographic features also increased the number of SNPs selected as informative, with the model selecting 35 positively associated SNPs and 52 negatively associated SNPs. Twenty-seven of the positively associated SNPs mapped to 19 genes across the genome, including DCDC2, DNAAF4, GRIN2B, and KIAA0319. Thirty-four of the negatively associated SNPs mapped to 23 genes across the genome, including DCDC2, FOXP2, RAPGEF2, SNTG1, and DNAAF4.
We fit a standard linear regression model using the selected features from the elastic net. A model with demographic features fit the data significantly better (F (1, 106) = 16.28, p < .0001) than a null model and explained roughly 33 percent of the variance in the reading score (Adjusted R2 = 0.329). Additionally, comparing the model with demographic features to Analysis_1 without these features, showed that the model with demographic features fit the data significantly better (F (1, 23) = 38.68, p < .001). There were no “significantly” overrepresented biological pathways, although neuron migration had the lowest adjusted p-value (p = 0.269).
Replication in GRaD
We performed a GWAS using the reading score generated by confirmatory factor analysis within the GRaD dataset. No SNPs remained significant after multiple test correction.
For the elastic net model, we used the top 100 SNPs before multiple test correction and 12 SNPs from the ALSPAC Analysis_1 list for which we were able to find matches in the GRaD dataset. Forty-eight SNPs were positively associated with latent reading ability and 36 SNPs were negatively associated with latent reading ability. After removing markers that did not begin with “rs”, there were 40 SNPs positively and 34 SNPs negatively associated with latent reading ability. The 40 positive SNPs mapped to 18 genes, while the 34 negative SNPs mapped to 15 genes. Overall, these SNPs mapped to 33 unique genes.
We also included biological sex, child’s ethnicity/race, and vocabulary score within the model. Including the demographic features increased the number of SNPs selected as informative to 91, with 52 positively associated and 41 negatively associated. After removing SNPs that did not begin with “rs”, the positively associated SNPs mapped to 18 genes, while the negatively associated SNPs mapped to 16 genes. Table 6 presents the 55 informative SNPs that mapped to known genes for the analysis that included demographic features. Biological sex, child’s race, and vocabulary scores had small associations with latent reading ability. Girls had higher latent reading ability scores than boys (Cohen’s d = 0.097). Children who were African American had a small difference on their latent reading ability scores compared to children who were not African American (Cohen’s d = 0.14) Additionally, higher vocabulary scores were related to higher latent reading ability scores (Pearson’s r = 0.709). See Supplemental Figure 2 for a visual depiction of latent reading ability score and demographic feature relationships.
We replicated five SNPs between GRaD and ALSPAC analyses. These SNPs were rs79439102 (LSAMP), rs7681750 (RAPGEF2), rs10046 (CYP19A1), rs77641439 (DNAAF4), and rs12606138 (NEDD4L). Of the replicated SNPs, three had the same direction in the GRaD and ALSPAC cohorts (rs77641439, rs79439102, and rs10046), whereas the other two (rs12606138 and rs7681750) had positive associations in the GRaD cohort but negative associations in the ALSPAC cohort. Additionally, RAPGEF2:rs7681750 was not considered informative after adding demographic features for the ALSPAC cohort, although a different SNP on RAPGEF2 was selected in both ALSPAC analyses (RAPGEF2:rs55703414). There were no significantly overrepresented pathways using the genes from the GRaD results.
Discussion
We investigated the genetic contributions to reading ability using a combination of confirmatory factor analysis, data imputation, GWAS, multivariate elastic net models, and pathway analysis. We were able to overcome a common limitation of genetic studies of reading ability, namely lack of significant findings after multiple testing correction, and identified several genes that are informative for reading ability. We replicated multiple genetic associations, from previous studies, across our complete data and imputed analyses, and between the ASLPAC and GRaD cohorts. We identified novel SNPs within known loci and novel loci. Our results support the polygenic understanding of reading 14 and suggest that several domain general genes are involved in reading development. We also demonstrated that certain demographic features are associated with reading ability alongside genetic features.
Genetic Associations
Our ALSPAC results consistently selected SNPs from previous literature from DNAAF4, RAPGEF2, DCDC2, KIAA0319, FOXP2, GRIN2B, SNTG1, SUCLA2, and others as informative to understanding reading (dis)ability. We replicated SNPs between the ALSPAC and GRaD cohorts from LSAMP (limbic system associated membrane protein; chr3q13.31), RAPGEF2 (Rap guanine nucleotide exchange factor 2; chr 4q32.1), CYP19A1 (cytochrome P450 subfamily A member 1; chr15q21.2), DNAAF4 (Dynein axonemal assembly factor 4; chr15q21.3), and NEDD4L (Ubiquitin protein ligase NEDD4-like; chr18q21.31). These genes all have some role in brain and neuron development and have been implicated in related cognitive processes. CYP19A1:rs10046 and DNAAF4:rs77641139 are located within DYX1C1 (15q15.1 to 15q21.3), the first susceptibility locus for dyslexia, and both genes have been linked to reading and language disorders 72. CYP19A1 and DNAAF4 help to regulate estrogen, with CYP19A1 regulating estrogen signaling, and DNAAF4 influencing neuronal differentiation, survival, and plasticity by regulating estrogen receptors. Disruptions in CYP19A1 in animal models have resulted in cortical disorganization 73. DNAAF4:rs77641439 is within 11bp of the previously identified DNAAF4:rs57809907, suggesting that it was not selected due to a false positive but rather because this region influences reading ability. Variants for CYP19A1:rs10046 and DNAAF4:rs77641139 can affect regulatory elements of these genes as they can alter the mRNA either in the 3’ UTR or through nonsense mediated decay, providing a possible method by which they influence our reading ability phenotype. SNPs from RAPGEF2 and NEDD4L have also been previously associated with reading disability 19, 74 and related cognitive skills (e.g., working memory)75. Like CYP19A1 and DNAAF4, RAPGEF2 and NEDD4L are expressed in the brain and are involved in functional biological pathways that help with brain development. RAPGEF2 helps with the formation of connections for the corpus callosum, anterior commissure, and the hippocampal commissure during brain development. NEDD4L is a ubiquitin protein ligase which binds and regulates membrane proteins to aid in internalization and turnover. However, the SNPs replicated are both within the introns for these genes, so how exactly they influence reading ability is unclear. Lastly, we identified a novel SNP in LSAMP. LSAMP mediates selective neuron growth and axon targeting, which contributes to the guidance of axons. SNPs from LSAMP have been associated with self-reported educational attainment and mathematical ability74, two skills related to reading ability. LSAMP:rs79439102 is an intron variant, so we are uncertain how it might influence reading ability; although, there are a growing number of intron variants associated with behavioral phenotypes (for example see 76) or it may have been selected because it is correlated with another SNP.
All of these genes are implicated in brain development, specifically in neural organization and neuron connections and many have a more general role in development. There is growing awareness that domain general genes have an important role to play in the genetics of reading 77, 78. Additionally, four out of five of these genes play a role in the neuron migration pathway, which is often hypothesized to be causal to reading disabilities 79, 80. These findings build on prior work which showed that neuron migration was overrepresented in genetic findings for typical and atypical readers 19, 81. There is still much to understand about how the selected SNPs influence reading ability through gene expression and function over time, during brain development.
We replicated two novel SNPs from a known loci and three novel loci. Our analysis was able to identify these novel SNPs and loci because of the elastic net model and latent reading ability score. One potential concern is that these novel SNPs/loci may have been selected because of less stringent criteria for p-values than in previous studies. Machine learning methods, such as elastic net, bypass significance testing thus providing a way to overcome false positives. In other studies, using similar analysis methods, researchers input several thousand SNPs. Here, we utilized less than 200 per analysis. Additionally, not all SNPs with the same or similar p-value were selected as informative in our analysis. For example, only two SNPs from AC104041.1 were selected as informative, although six SNPs from this gene were inputted into the model with an FDR p-value of .035. This demonstrates that the elastic net model is informed more by how the inputted features work together than by significance testing. Our latent reading ability score is what enabled us to replicate results across datasets. This score was better able to represent the same construct (i.e., reading) than a single task across participants, and reduced measurement error, which could have influenced SNP selection.
Demographic Features
We were able to replicate the importance of several demographic features in predicting reading ability. Our results demonstrated that higher vocabulary, better receptive language, higher mother’s education, and higher nonverbal IQ scores were associated with better reading ability. These associations have been identified by several research studies 19, 24, 82. Our previous study using a case-control design suggested that these features may function as protective factors against developing a reading disorder 19; however, more research is needed to examine the interaction between genetic and demographic features. The inclusion of demographic features more than doubled the amount of variance explained in reading ability, from 14% to 33%, although there was still a significant amount of unexplained variance. This finding suggests that there are other important features associated with reading ability that we were unable to include in our model. Additional features could include nonlinguistic measures, such as finger tapping, rhythm judgement, and visual attention, where previous studies suggest a link between motor abilities83, auditory perception84, 85, and visual abilities86, 87 with reading and dyslexia.
Limitations and Future Directions
The strengths of this study are the statistical procedures to overcome multiple test correction loss in small sample sizes, ability to compensate for missing data, and external replication in an ethnically/racially diverse sample. Despite these strengths, there are a few notable limitations, including the relatively small sample sizes of the cohorts for genetic analyses, as well as key statistical limitations. The ALSPAC sample is one of the largest samples for investigating behavioral genetics in children, but it is still considered small by genetics standards. This is especially true, as evidenced by the work of large consortiums in recent years. The sample size limitation is unavoidable, and a common limitation for genetic studies examining reading ability. Partly, this is due to the cost involved in resources and infrastructure required to collect in-person behavioral data. Despite this limitation, the ALSPAC cohort is a rich database with a large enough sample size for developing hypotheses. We were able to diminish this limitation by using imputation to increase the amount of useable data and by employing a replication dataset. Future research studies can use methods such as online data collection, meta-GWA across samples, and data pooling to address the sample size issue.
Our statistical limitations include inconsistent goodness of fit indices for our latent reading ability, inability to determine the influence of all selected SNPs or investigate interactions, and interpretation of directionality. Our chi-square and RMSEA indices were not within the “acceptable” range. This finding was not unexpected because these estimates are often misestimated when a sample size is greater than 100 (chi-square) or there are less than ten observed variables (i.e., RMSEA). Because our sample sizes exceed 1000 and we used eight variables per cohort, we must rely on TLI and SRMR, both of which were within acceptable limits. Therefore, we can be reasonably confident that our model fit the data adequately. We cannot explain the role of every selected SNP, because many SNPs did not map to genes. For example, rs2957954 had a high beta weight in Analysis_1 but is an intergenic marker. Therefore, we cannot adequately explain the role this SNP might have on reading ability. There is some evidence that such intergenic markers might have a role in the evolution and development of communication in humans 88, so this is an avenue for future research. Although all of the informative SNPs were non-coding or non-functional in our queried databases, it is, nevertheless, possible that some could cause alternative splicing or regulate gene expression of other genes. We could not investigate interactions between SNPs or between SNPs and demographic features, because the elastic net model we used only considers cumulative effects. Future research studies should investigate gene-gene and gene-environment interactions to better understand how these sets of features work together for reading development. A final limitation to the elastic net model is that the directionality of the selected features may not map exactly onto the relationship in the data. Directionality in the elastic net is determined by the algorithm for best fit to the data, but deeper exploration outside of the model is needed to understand the true relationship. In general, the directionality in the model mirrors actual directionality (e.g., positive beta weight for vocabulary from the model and true positive association); however, there are a few instances where the direction the elastic model selected was not reflected within the data (e.g., birth weight and reading; child’s race in GRaD and reading). Therefore, our results provide guidance for what factors are important to consider, but all relationships should be explored in greater depth.
Conclusions
Our findings reinforce the understanding that reading ability is a complex disorder with genetic and demographic associations. For both cohorts, we found that SNPs from more than 20 genes were selected as informative. This suggests that reading ability is influenced by several genetic factors, each contributing a small effect. This polygenic hypothesis is becoming a prominent hypothesis for understanding reading ability. Although the genetic associations are each individually small, together they influence the development of brain structures and connections between neurons, resulting in good or poor reading. We identified three novel loci using our methods and these loci should be further explored in other datasets. In addition to these genetic markers, child characteristics such as vocabulary, nonverbal IQ, and language, and external factors, such as maternal educational attainment, also predict reading ability. Therefore, reading ability is not only the product of genetic markers, but also additional skills and factors, all of which could potentially serve as targets for early interventions to improve reading. Our findings provide evidence for genetic markers that replicate in ethnically/racially diverse samples, expanding our understanding of the genetics of reading beyond English speaking, European-Caucasian children. Therefore, our findings suggest that there is large generalizability of genetic factors for reading across research cohorts.
Data Availability
Please note that the study website contains details of all the data that are available through a fully searchable data dictionary (http://www.bristol.ac.uk/alspac/researchers/our-data/). Data from this study are available through ALSPAC upon approval by the executive board. Summary level data for the GRaD Study will be made available upon request.
Declarations
Not Applicable.
Funding
The UK Medical Research Council Wellcome Trust Grant (ref: 217065/Z/19/Z) and the University of Bristol provide core support for ALSPAC. This publication is the work of the authors and they will serve as guarantors for the contents of this paper. A comprehensive list of grants funding (PDF, 459KB, http://www.bristol.ac.uk/alspac/external/documents/grant-acknowledgements.pdf) is available on the ALSPAC website. GWAS data was generated by Sample Logistics and Genotyping Facilities at Wellcome Sanger Institute and LabCorp (Laboratory Corporation of America) using support from 23andMe. We acknowledge funding from the National Institutes of Health (Grant ref: R01 NS043530 awarded to JRG; P50 HD027802 awarded to JRG), and the Manton Foundation for support of the GRaD Study. The first author was supported by a National Institutes of Health F32 postdoctoral training grant (1F32HD089674-01A1; PI: HSL).
Competing Interests
The authors have no competing interests to declare related to this paper.
Ethics Approval
We obtained ethical approval for this study from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committee(s) (Arizona State University Institutional Review Board). Informed consent and assent were obtained by research staff for each dataset. The GRaD dataset also had ethical approval from all recruitment sites (University of Colorado-Boulder, University of Denver, Tufts University, University of New Mexico, Kennedy Krieger Institute/John Hopkins University, Hospital for Sick Children-Toronto, and Yale University).
Consent to Participate
Not Applicable.
Consent to Publish
Not Applicable
Availability of Data and Material
Please note that the study website contains details of all the data that are available through a fully searchable data dictionary (http://www.bristol.ac.uk/alspac/researchers/our-data/). Data from this study are available through ALSPAC upon approval by the executive board. Summary level data for the GRaD Study will be made available upon request.
Code Availability
Code is available in Supplemental A.
Authors Contributions
HSL conceived and designed the study under the mentorship of JL and VD, analyzed and interpreted the data, drafted the manuscript, and revised it based on feedback from JL, VD, JG, and GRaD Consortium. HSL approved the final version of the manuscript on behalf of all the authors.
Acknowledgements
The authors are extremely grateful to all the families who took part in this study, the midwives for their help in recruiting them, and the entire ALSPAC team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists and nurses. We are indebted to the members of the GRaD Study: the participants and their families, the teams who performed recruiting and testing, as well as the project managers at each site. We would also like to thank the Yale W.M. Keck Biotechnology Resource Laboratory’s DNA Sequencing Resource for Sanger sequencing services. Thank you to Xiaonan Liu for his help with data analyses on the F32 project.
Footnotes
There was a typo in an author name
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.
- 33.↵
- 34.↵
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.↵
- 44.↵
- 45.
- 46.
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.↵
- 61.
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.