PT - JOURNAL ARTICLE
AU - Espin-Garcia, Osvaldo
AU - Craiu, Radu V.
AU - Bull, Shelley B.
TI - Two-phase sample selection strategies for design and analysis in post-genome wide association fine-mapping studies
AID - 10.1101/2021.05.15.21257266
DP - 2021 Jan 01
TA - medRxiv
PG - 2021.05.15.21257266
4099 - http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266.short
4100 - http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266.full
AB - Post-GWAS analysis, in many cases, focuses on fine-mapping targeted genetic regions discovered at GWAS-stage; that is, the aim is to pinpoint potential causal variants and susceptibility genes for complex traits and disease outcomes using next-generation sequencing (NGS) technologies. Large-scale GWAS cohorts are necessary to identify target regions given the typically modest genetic effect sizes. In this context, two-phase sampling design and analysis is a cost-reduction technique that utilizes data collected during phase 1 GWAS to select an informative subsample for phase 2 sequencing. The main goal is to make inference for genetic variants measured via NGS by efficiently combining data from phases 1 and 2. We propose two approaches for selecting a phase 2 design under a budget constraint. The first method identifies sampling fractions that select a phase 2 design yielding an asymptotic variance covariance matrix with certain optimal characteristics, e.g. smallest trace, via Lagrange multipliers (LM). The second relies on a genetic algorithm (GA) with a defined fitness function to identify exactly a phase 2 subsample. We perform comprehensive simulation studies to evaluate the empirical properties of the proposed designs for a genetic association study of a quantitative trait. We compare our methods against two ranked designs: residual-dependent sampling and a recently identified optimal design.
Our findings demonstrate that the proposed designs, GA in particular, can render competitive power in combined phase 1 and 2 analysis compared to alternative designs while preserving type 1 error control. These results are especially apparent under the more practical scenario where design values need to be defined a priori and are subject to misspecification. We illustrate the proposed methods in a study of triglyceride levels in the North Finland Birth Cohort of 1966. R code to reproduce our results is available at github.com/egosv/TwoPhase_postGWAS.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This research is supported by funding from the Canadian Institutes of Health Research: CIHR Operating Grant MOP-84287 (RVC, SBB), CIHR Training Grant GET-101831 (OE-G); and the Ontario Institute for Cancer Research (OICR) through funding provided by the Government of Ontario (OE-G). OE-G has been a fellow trainee of the OICR Biostatistics Training Initiative and CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) - CIHR Training Grant in Genetic Epidemiology and Statistical Genetics.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: n/a
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
The data/analyses presented in the current publication are based on the use of study data downloaded from the dbGaP web site, under study id phs000276.v2.p1. https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000276.v2.p1