Abstract
Genome-wide association studies (GWAS) have identified numerous genetic variants associated with Alzheimer’s disease (AD) phenotypes. However, how these variants contribute to the etiology of AD remains largely elusive. Recent advances in genomic large language models (LLMs) have revolutionized regulatory genomic prediction tasks, offering new opportunities to interpret the genetic variation observed in personal genomes. In this study, we propose epiBrainLLM, a novel computational framework that leverages a genomic LLM to enhance our understanding of the causal pathways from genotypes to brain measures to AD-related clinical phenotypes. Our framework first converts the personal DNA sequence into a diverse set of genomic and epigenomic features using a pretrained genomic LLM and then uses these features to predict phenotypes. Across various experimental settings, our results demonstrate that incorporating pretrained genomic LLMs significantly improves association analysis compared to using genotype information alone. We conclude that our proposed framework provides a novel perspective for understanding the regulatory mechanisms underlying AD etiology, potentially offering insights into complex disease mechanisms beyond AD.
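The two-stage idea described in the abstract (personal sequence to predicted features, then features to phenotype association) can be illustrated with a toy sketch. This is not the epiBrainLLM implementation: `StandInSequenceModel` is a hypothetical placeholder with frozen random weights standing in for a pretrained genomic LLM, the variants and phenotype are synthetic, and a per-feature Pearson correlation is used here only as a simple stand-in for the association step.

```python
import math
import random

BASES = "ACGT"


def apply_variants(reference, variants):
    """Build a personal sequence by substituting SNVs given as {position: alt_base}."""
    seq = list(reference)
    for pos, alt in variants.items():
        seq[pos] = alt
    return "".join(seq)


class StandInSequenceModel:
    """Hypothetical placeholder for a pretrained genomic LLM (frozen random weights)."""

    def __init__(self, seq_len, n_features, seed=0):
        rng = random.Random(seed)
        # One weight per (feature, position, base); never updated after construction.
        self.weights = [
            [{b: rng.gauss(0.0, 1.0) for b in BASES} for _ in range(seq_len)]
            for _ in range(n_features)
        ]

    def predict(self, seq):
        """Map a DNA string to one score per feature; unknown bases contribute 0."""
        scale = math.sqrt(len(seq))
        return [
            math.tanh(sum(w.get(base, 0.0) for w, base in zip(feat, seq)) / scale)
            for feat in self.weights
        ]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def run_demo(n_individuals=50, seq_len=40, n_features=8, seed=1):
    rng = random.Random(seed)
    reference = "".join(rng.choice(BASES) for _ in range(seq_len))
    model = StandInSequenceModel(seq_len, n_features)
    features, phenotype = [], []
    for _ in range(n_individuals):
        # Stage 1: personal sequence -> predicted feature vector.
        positions = rng.sample(range(seq_len), 3)
        variants = {p: rng.choice(BASES) for p in positions}
        f = model.predict(apply_variants(reference, variants))
        features.append(f)
        # Synthetic phenotype driven by feature 0 plus noise (illustration only).
        phenotype.append(2.0 * f[0] + rng.gauss(0.0, 0.1))
    # Stage 2: test each predicted feature for association with the phenotype.
    correlations = [
        pearson([f[j] for f in features], phenotype) for j in range(n_features)
    ]
    return features, phenotype, correlations
```

Calling `run_demo()` returns the per-individual feature vectors, the synthetic phenotype, and one correlation per feature. In a real analysis the placeholder model would be replaced by an actual pretrained sequence model and the correlation by a properly adjusted association test.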
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
The work of Q.L. was supported by NIH grant K99HG013661. The work of W.Z. and W.H.W. was partially supported by NIH grants R01HG010359 and R01HG007735. The work of L.L. was supported by NSF grant CIF-2102227 and NIH grants R01AG061303 and R01AG062542.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The ethics committee/IRB of the Alzheimer's Disease Neuroimaging Initiative (ADNI) gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data availability
The raw data in this study can be accessed through the Alzheimer’s Disease Neuroimaging Initiative (ADNI; https://adni.loni.usc.edu/). WGS data processed with the BWA & GATK HaplotypeCaller pipeline were downloaded from the “Genetic Data” category. sMRI images were downloaded from the “Image Collections” category under the “ADNI1:Complete 1Yr 1.5T” entry. The annotation data, including gene expression and histone modification data, were originally obtained from the ENCODE database and visualized with the WashU Epigenome Browser (https://epigenomegateway.wustl.edu/).