Two-phase sample selection strategies for design and analysis in post-genome wide association fine-mapping studies
==================================================================================================================

* Osvaldo Espin-Garcia
* Radu V. Craiu
* Shelley B. Bull

## Abstract

Post-GWAS analysis, in many cases, focuses on fine-mapping targeted genetic regions discovered at GWAS-stage; that is, the aim is to pinpoint potential causal variants and susceptibility genes for complex traits and disease outcomes using next-generation sequencing (NGS) technologies. Large-scale GWAS cohorts are necessary to identify target regions given the typically modest genetic effect sizes. In this context, two-phase sampling design and analysis is a cost-reduction technique that utilizes data collected during phase 1 GWAS to select an informative subsample for phase 2 sequencing. The main goal is to make inference for genetic variants measured via NGS by efficiently combining data from phases 1 and 2. We propose two approaches for selecting a phase 2 design under a budget constraint. The first method identifies sampling fractions that select a phase 2 design yielding an asymptotic variance covariance matrix with certain optimal characteristics, e.g. smallest trace, via Lagrange multipliers (LM). The second relies on a genetic algorithm (GA) with a defined fitness function to identify exactly a phase 2 subsample. We perform comprehensive simulation studies to evaluate the empirical properties of the proposed designs for a genetic association study of a quantitative trait. We compare our methods against two ranked designs: residual-dependent sampling and a recently identified optimal design. Our findings demonstrate that the proposed designs, GA in particular, can render competitive power in combined phase 1 and 2 analysis compared to alternative designs while preserving type 1 error control. These results are especially apparent under the more practical scenario where design values need to be defined *a priori* and are subject to mispecification. We illustrate the proposed methods in a study of triglyceride levels in the North Finland Birth Cohort of 1966. R code to reproduce our results is available at [github.com/egosv/TwoPhase_postGWAS](https://github.com/egosv/TwoPhase_postGWAS).

## 1 Introduction

Genome-wide association studies (GWASs) have become well-established untargeted approaches for identifying genetic loci that influence the aetiology of complex diseases and traits. Single-nucleotide polymorphisms (SNPs) genotyped using GWAS arrays typically lack any known biological function. Consequently, in post-GWAS studies, identifying causal variants and susceptibility genes in GWAS-identified regions of association is the next important step for researchers. Identified variants and genes can become instrumental in personalized medicine from diagnosis and intervention to drug development and other forms of therapy.

Recent advances in next-generation sequencing (NGS) technologies allow investigators to sequence the entire human genome at the base-pair level, but, the costs of whole genome sequencing are relatively high in comparison to GWAS analysis. Targeted sequencing, which identifies all variants in a region with high-confidence, can be cost effective when fine mapping a genetic region identified at GWAS stage. Indeed, high-density sequence variants in the targeted region are typically in linkage disequilibrium (LD) with strongly associated SNPs from GWAS, making the latter good candidates as auxiliary covariates for subsample selection. Thus, two-phase sampling design and analysis[1;2] emerges as a suitable cost-reduction technique in the post-GWAS context. The main goal of this strategy is to make inference on incompletely-observed sequencing data. At phase 1, GWAS data are collected for everyone in the study. At phase 2, sequencing data are collected only in a subsample of the phase 1 sample. The subsample is selected based on phase 1 information alone (outcome, auxiliary SNPs), making the sequence data missing-by-design in the non-selected individuals.

While the majority of the literature in two-phase sampling designs concentrates on effect estimation and hypothesis testing, relatively less attention has been paid to phase 2 sample selection. Specifically, most of the work examining optimal designs has focused on case-control studies[3;4], in which for example, a *balanced* design (equal sample distribution across strata) has been recommended as near optimal[5]. Typically, in the design of case-control studies, optimization is performed to determine sampling fractions across predefined strata subject to a budget constraint on the phase 2 sample size[6;7;8]. Another approach, described in Zhao *et al*., seeks to optimize the sampling fraction (*ρ*) under simple random sampling considering asymptotic relative efficiency of the maximum likelihood estimators from the one-versus two-phase designs[9]. More recently, Tao *et al*. derived general optimal designs of two-phase studies paying special attention to continuous, binary, and time-to-event outcomes[10]. Specifically, Tao *et al*. demonstrate the relationship between their optimal design (hereafter referred to as TZL) and previously proposed (ranked) designs such as outcome-dependent and residual-dependent sampling (ODS and RDS, respectively).

In this report, we propose two approaches for two-phase sample selection in post-GWA fine-mapping studies. The resulting designs are valid and their implementation is available for all distributions in the exponential family. The first approach, LM, extends and adapts previous work primarily developed for case-control studies by solving a constrained optimization problem via Lagrange multipliers using numerical methods. The second approach, GA, exploits the advantages of genetic algorithms (GAs) for discrete optimization with fixed-subsets. To the best of our knowledge, this work introduces a novel usage of GAs in the context of selecting phase 2 designs.

In the next section we introduce a maximum likelihood framework for design and analysis of two-phase studies, and define the two approaches to select a phase 2 subsample. In addition, we contrast the proposed designs (LM and GA) with ranked designs (ODS, RDS, and TZL). In Section 3 we conduct simulation studies of a quantitative trait (QT) to evaluate the performance of the proposed designs against ranked designs under the ideal scenario in which all design quantities are known in advance. In Section 4, we assess a more practical scenario where the design values are misspecified using simulated data with realistic LD patterns from the 1000 Genomes Project. Our results show that the proposed designs, GA in particular, achieve competitive power against alternative designs under various scenarios. Additionally, in Section 5, we illustrate these methods in an application to the North Finland Birth Cohort of 1966. We conclude with a discussion of the advantages and challenges of the studied approaches as well as potential avenues of future research.

## 2 Phase 2 Sample Selection under Maximum Likelihood

### 2.1 Two-phase designs in fine-mapping studies

Let *Y* be the trait of interest and *G* be a (potentially causal) sequence variant located in a genomic region identified by GWAS test results. By design, variants in the region of interest are ascertained in only a fraction of individuals. Consequently, two-phase studies consist of a GWAS in phase 1 from which a subsample of individuals is selected; in phase 2, fine-mapping sequence data are collected for the subsample and combined analysis is performed using information from phases 1 & 2. In this post-GWAS setting, the trait data (*Y*) and the GWAS-SNP (*Z*), are observed for every subject in the study. The two-phase design aims to select a subset of informative subjects based on available data in the GWAS, namely (*Y, Z*). Of note, *Z* can be either an observed or imputed genotype, in the latter case the purpose might be to verify the association with sequencing data. Inference on the missing-by-design sequence variants is conducted using all available data. We define the missing indicator *R**i* = 𝟙{*i* ∈ *S*2}, *i* = 1, …, *N* where *N* is the number of individuals in the entire phase 1 cohort and *S*2 represents the set of ![Graphic][1]</img> subjects selected into the phase 2 subsample. We let ![Graphic][2]</img> denote the set of (*N* − *n*) subjects in the GWAS study who were unselected for phase 2. We specify the selection model for the *i*th subject as *π**i*(***ψ***) = *π*(*Y**i*, *Z**i*; ***ψ***) = Pr(*R**i* = 1|*Y**i*, *Z**i*; ***ψ***), where ***ψ*** is a vector that characterizes the distribution of the inclusion probabilities. To operationalize the selection, (*Y, Z*) can be stratified into *K* disjoint groups, {1}, …, {*K*}, such that *π**i*(***ψ***) = *π**k*(***ψ***) for all (*Y**i*, *Z**i*) ∈ {*k*}; that is, all subjects in the *k*th stratum have equal selection probabilities. *R**i* is designed to be conditionally independent of *G**i* given *Y**i* and *Z**i*, i.e. the phase 2 selection mechanism dictated by *R**i* is completely determined by *Y**i* and *Z**i*, making *G**i* missing at random[11].

### 2.2 Maximum likelihood formulation

Let *f**β*(*y*|*g, z*) be the parametric relationship between (*G, Z*) and *Y* indexed by ***β***. Here, *f**β*(*y*|*g, z*) corresponds to a probability function in the exponential family with E[*Y* |*g, z*; ***β***] = *µ*(*g, z*; ***β***) = *h*−1(*β* +*β*1*g* +*β**z**z*), where *h*(*·*) denotes the link function. We denote 𝒢, *Ƶ* as the sets of uniquely observed values of *G* (in *S*2) and *Z* (in ![Graphic][3]</img>). Let Pr(*G, Z*) be the joint probability function of *G* and *Z* given by the discrete probabilities *p**g,z*, *g* ∈𝒢 and *z* ∈ *Ƶ*, which is left unspecified and define ***p*** = {*p**g,z*}*g*∈𝒢,*z*∈*Ƶ*. We consider here the nonparametric estimation of the joint distribution of *G* and *Z*, with support on the Cartesian product between 𝒢 and *Ƶ*. Considering the above, we define the observed-data likelihood following previous literature[12;13;14] as ![Formula][4]</img> 

In (1), the proportionality arises since estimation of (***β, p***) does not involve *π**i*’s. Thus the log-likelihood is ![Formula][5]</img> 

We note that additional phase 1 covariates ***X***, e.g. genomic principal components, can be introduced into the parametric model, i.e. *f**β*(*y**i*|*g**i*, *z**i*, ***x****i*) with corresponding ![Graphic][6]</img>, by assuming *G* and ***X*** are conditionally independent given *Z*. For simplicity, the formulation above considers *G* as a single variable, however, this can be extended to a vector with the respective considerations as illustrated in Section 4 below.

If we denote *θ* = (***β′***, ***p′***) ′, then, under regularity conditions, the limiting distribution of the maximum likelihood estimator ![Graphic][7]</img> follows asymptotically ![Graphic][8]</img> [15], where Φ = (***β′***, ***p′***, ***ψ′***) ′ and 𝕁 (Φ) is the expected information matrix, which is a function of the full parameter set Φ as the expectation is taken with respect to (*R, Y, G, Z*); note that *G* is observed in *S*2 and missing by design in ![Graphic][9]</img> [8]. The derivation of the expected information matrix is shown in Appendix A.

### 2.3 Post-GWAS analysis under maximum likelihood

Note that the likelihood in equation (1) is most useful at the design stage when no phase 2 subsample has been identified nor have any data been collected. However, once these items are available, the following re-expression is typically used: ![Formula][10]</img> 

The formulation above has been amply studied[13;14;16;17]. Estimates can be obtained via the EM algorithm[18;19;20] and the corresponding asymptotic variance covariance matrix is computed via the Louis’ method[21]. In fine-mapping, the aim is to identify and prioritize potential causal variants in a genetic region of interest to allow for follow-up replication and functional studies[22]. This can be achieved under the proposed maximum likelihood (ML) as follows: first, the genetic effect of each variant in the targeted region is estimated and tested individually (single-variant analysis); second, genetic effects are estimated and tested in multivariable models (conditional on strongest single-variant signals). Conditional analysis serves to identify independent signals in the region and to unmask associations that may have been missed in single-variant analysis. These steps are detailed in Section4.

### 2.4 Selecting phase 2 designs

In post-GWAS fine-mapping studies that target an identified genomic region, the costs of sequencing can make it unfeasible or inefficient to sequence all subjects available in phase 1, restricting the number of individuals in *S*2 (*n*). Here we propose two approaches to select a phase 2 design under a budget constraint and flexible optimality criteria using Lagrange multipliers or genetic algorithms. In addition, we discuss on the specification of such optimality criteria and compare the proposed methods against another class of widely used phase 2 sample selection strategies, the so-called ranked designs.

#### 2.4.1 Lagrange multipliers (LM)

Following previous ideas[8;23;24], we first propose to obtain a phase 2 design for regional fine-mapping studies by minimizing the following expression ![Formula][11]</img>  where Λ(*·*) is an optimality criterion, *λ* is a Lagrange multiplier accounting for the budget constraint and *N**k* is the number of subjects in phase 1 belonging to the *k*th stratum.

Here, we formulate the approach specifically for the ML framework described in Sections 2.2 and 2.3. For our purposes, ***β*** and ***p*** are design quantities, thus, they need to be specified *a priori* leaving the *π*s to be determined from phase 1 data alone. Note that this approach aims to find selection probabilities, ![Graphic][12]</img>, that minimize equation (3) for allocating the phase 2 sample across strata {*k*}, *k* = 1, …, *K*. The vector ***p*** can be interpreted in terms of the (joint) genotyping distribution between *G* and *Z*, which can be easily specified according to well-established genetic principles, e.g. Hardy-Weinberg equilibrium (HWE), or by external data such as the 1000 Genomes Project. Thus, the expected effect size of the sequence variant, *β*1, becomes the primary parameter to specify.

#### 2.4.2 Genetic algorithms (GA)

Genetic algorithms are designed to mimic nature’s evolutionary process, in which the fittest members of a population are selected to pass on their genetic information. GAs are powerful tools to optimize a fitness measure/objective function, Λ(*·*); overviews can be found in Holland[25] and Whitley[26]. This optimization technique is suitable for a discrete solution space and is performed through a stochastic search by building an initial population of candidate solutions that evolves generationally through pairing, mating, recombining and mutating the candidate solutions. In our case, these candidate solutions correspond to vectors of the form ***R*** = (*R*1, …, *R**N*) with ![Graphic][13]</img> and *R**i* ∈ {0, 1} for all *i*, i.e. vectors of indicator variables denoting whether the *i*th subject is selected for phase 2. The reasoning behind GA implementation in the context of phase 2 sample selection is twofold: 1) it provides a suitable framework for discrete optimization, and 2) it has proven to be an efficient strategy to find a fittest member in large search spaces (2*N* possibilities in this case)[26]. These appealing features of GAs come along with some challenges, namely that there are no clear convergence criteria, tuning parameters need to be specified, and they can be computationally expensive when the objective function is hard to calculate.

Nevertheless, the GA approach brings novelty to the field as it nullifies the uncertainty brought by the sampling variability introduced when utilizing stratum-specific selection probabilities. This is achieved by selecting a vector ***R***∗ that characterizes a unique phase 2 subsample and optimizes the fitness measure. Furthermore, GA can forgo strata definition since the search can be agnostic to specific strata configurations. In GA, the budget constraint can be introduced by a so-called cardinality constraint, which consists of selecting a subset of a required size (*n*)[27]. This constraint guarantees that the phase 2 sample is exactly of size *n* as opposed to methods that depend on selection probabilities, which introduce some variation into the achieved phase 2 sample size.

To date, there are several implementations for GAs available in the R statistical language[28] namely packages *GA, genalg, kofnGA, mcga, mco*, and *NMOF*. Of these, only *kofnGA* is specifically designed for fixed-size subset selection with a flexible specification of the objective/fitness function, see Wolters for a detailed explanation of the package[29]. In any GA, it is important to consider the set of control parameters necessary to implement decision rules at each step. Specifically, *kofnGA* requires the user to specify the objective/fitness function (i.e. Λ(*·*)), the subset size (*n*), and the number of candidates (*N*) while additional control parameters related to algorithmic design have default (but adjustable) settings. These control parameters are: *population size* (ℳ), *number of generations* (ℋ), *size of selection tournament* (𝒯), *mutation rate* (𝓇), and *number of elites* (ℰ). A pseudo-algorithm relating standard GA terminology with the two-phase design along with a description of the steps where the control parameters are used is presented in Algorithm 1.

Algorithm 1
Pseudo-algorithm of the implemented genetic algorithm 

![Figure1](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/18/2021.05.15.21257266/F1.medium.gif)

[Figure1](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/F1)

The population is the pool of candidate solutions at each iteration from which the fittest members, i.e. members with optimal Λ(*·*) values, will be ultimately selected. The number of generations denotes the number of iterations the algorithm will run for. The size of selection tournament determines the number of members of the population selected to produce the next generation. The mutation rate determines the probability at which random swaps in the indexes of the candidate solutions occur in the population. Lastly, elites are the fittest members in a given generation that get to be kept in the next generation.

The algorithm parameters can be tuned by the user in accordance with the problem at hand. For simplicity, parameters ℋ and ℰ can be reformatted as proportions of the population size (ℳ). Because *kofnGA* does not implement a stopping rule, the algorithm iterates for as many times as specified by the provided number of generations (ℋ). To accelerate convergence, we set 𝒯 and ℰ at high levels (= 0.90 × ℳ) as suggested by Walters[29]. This approach may diminish the search improvements derived from mutations, relying more heavily on the initial population and number of generations. Therefore, instead of setting a completely random initial population (pop in Algorithm 1), we initialize it with an equal number of samples with top 20 performers (based on Λ(*·*)) out of 100 draws of each of the balanced, combined, and LM sampling designs plus the RDS design. This strategy guarantees that GA has at least the same performance as the RDS design.

Additional considerations for the optimization strategies in LM and GA as well as further details on the balanced and combined designs can be found in Sections S2 and S3, respectively (Online Supplementary Material). It is also worth noting that the proposed approaches are feasible in any context where *Y* |*G, Z* can be modeled within the exponential family, and both *G* and *Z* are discrete (or easy to discretize) covariates.

#### 2.4.3 Specifying an optimality criterion

There are several ways to define a functional Λ(*·*) as the optimality criterion/fitness measure, mostly grounded in experimental design[30]. In this report, we explore three criteria: A-optimality, D-optimality and parameter-specific. Each criterion focuses on different features of the variance-covariance matrix (VCM), 𝕁 (Φ)−1 (Table 1). The parameter-specific criterion is optimal to identify designs with minimum variance when testing a single parameter. Similarly, A- and D-criteria would be optimal to identify designs with minimum average variance across all parameters and minimum volume of the confidence ellipsoid, respectively.

View this table:
[Table 1:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/T1)

Table 1: 
Description of the three optimality criteria evaluated. 𝕁 (Φ)−1 denotes the variance-covariance matrix.

In the outlined post-GWAS setting, the focus lies on testing a single parameter, *β*1. Thus, the parameter specific criterion is the most natural choice. However, when multiple parameters are of interest, i.e. *β*1 is a vector, A-optimality may be preferred if the parameters of interest are loosely correlated, whereas D-optimality may be preferred when the parameters are strongly correlated. This makes intuitive sense when there are two (or more) estimators, for example, ![Graphic][14]</img> and ![Graphic][15]</img> but ![Graphic][16]</img> is of interest. Then ![Graphic][17]</img>, thus, involving off-diagonal elements of the VCM. We evaluate potential differences in the choice of optimality criterion in Section 4.

#### 2.4.4 Ranked designs

Recently, Tao *et al*. proposed general optimal designs for phase 2 studies[10]. In this section, we aim to describe their approach, summarize their findings, and draw comparisons with the proposed designs: LM and GA.

Despite their names, the outcome-dependent sampling (ODS) and residual-dependent sampling (RDS), as defined by Tao *et al*., are not sampling designs in the classical sense given the fact that their specification is independent of any sampling mechanism[10]. Indeed, this is also true for the TZL design. We refer to them as ranked designs because they are defined in terms of ordered quantities: outcome/residuals/scaled residuals for ODS, RDS, and TZL, respectively. Tao *et al*. show that the scaling factor in TZL is given by Var(*G*|*Z*)1*/*2, which is unknown at design stage and thus needs to be specified prior to phase 2. An intuition on why this scaling factor is important for the optimal design, provided by Tao *et al*., is that *G* is harder to be retrieved by *Z* when Var(*G*|*Z*) is large and thus these subjects need to be oversampled[10].

The ranked designs achieve a given phase 2 sample size by selecting an equal number of subjects from each of the top and bottom rankings of the outcome/residuals/scaled residuals. This particularity makes them appealing for a few reasons: 1) the ranked designs are unique, 2) the selection is intuitive and can be performed quickly, and 3) for QTs, no stratification on the outcome is required. Tao *et al*. show that the TZL design reduces to the RDS design when *G* and *Z* are independent (and RDS reduces to ODS when *Y* and *Z* are independent); in Sections 4 and 5, we investigate the effect of misspecifying Var(*G*|*Z*).

There are five main underlying differences between the proposed designs (LM and GA) and the ranked designs, particularly TZL:

1.  LM depends on the stratification strategy undertaken for the outcome while none of the ranked design depends directly on outcome stratification for QTs. On the other hand, GA can, in principle, be performed without defining any stratification, however, selection of initial values may depend on values drawn from LM or other designs to accelerate convergence, which could introduce some dependency on a chosen stratification.

2.  LM provides optimal sampling fractions and thus a sample must be drawn accordingly, subjecting this design to sampling variability. GA avoids this by selecting a unique solution, ***R***∗, with optimal Λ(*·*) value. However, given the stochastic nature of the genetic algorithm, this solution is approximate and varies at each run unless a random seed is specified. In contrast, the ranked designs are not subject to sampling variation.

3.  TZL and the proposed designs (under the parameter-specific criterion) seek to minimize the variance of *β*1. In the case of TZL, this is achieved by maximizing the inverse of the efficiency bound for estimating *β*1 with one observation in their Theorem 1. Note that the proof of this theorem relies on the assumption that *Y* and *G* are approximately independent given *Z*, which is justified when the effect of *G* on *Y* is small, i.e. *β*1 = *o*(1). LM and GA, on the other hand, do not depend on the small *β*1 assumption and can be thus implemented in more general settings.

4.  Related to the point above, LM and GA rely on an empirical approximation to the information matrix whereas the variance considered in TZL uses an exact expected information under the working assumption of *β*1 = *o*(1). This defines a trade-off between the generality of LM and GA and the increase in efficiency of TZL when the assumption is justified.

5.  LM and GA can optimize general functions beyond 𝕁 (Φ)−1 through Λ(*·*), whereas the results in TZL are mostly concerned with ![Graphic][18]</img> (or functions thereof).

It remains unclear what constitutes a good stratification for LM; intuitively, LM should approximate TZL as the number of strata approaches the phase 1 sample size. However, a theoretical proof is beyond the scope of this paper. To circumvent the sampling variability issue in LM, one could draw a predetermined number of subsamples and select the one with optimal Λ(*·*) value. Regarding whether restricted/unrestricted values of *θ* are preferred, Tao *et al*. show that TZL performs well for alternatives close to the null. Although these alternatives are typical for genetic studies, a more comprehensive comparison for alternatives farther away for the null is warranted. Lastly, LM and GA can, in fact, approximate the optimization strategy in TZL by utilizing ![Graphic][19]</img> instead of 𝕁 (Φ)−1 in the objective function, where ![Graphic][20]</img> and 𝔽 = 𝕁 (Φ) is partitioned with respect to *β*1 as ![Graphic][21]</img>. It is worth noting that the dimension of 𝕍1 corresponds to that of the subspace determined by the null hypothesis of interest. In the simplest case of *β*1 being a scalar, 𝕍1 is also a scalar. In LM and GA higher dimensions can be easily accommodated by specifying a different Λ(*·*) on 𝕍1, to obtain say A- or D-optimal designs.

## 3 Simulation Studies

In this section, we describe the data generation steps, analysis plan, and report the results of an initial set of simulations. The main objective is to compare the statistical power of the proposed phase 2 designs, LM and GA, in a post-GWAS fine-mapping scenario by testing for the effect of *G* (the missing-by-design variable), i.e. H : *β*1 = 0. In addition, we compare LM and GA against two ranked designs: TZL and RDS. Of note, we exclude ODS from these studies given the known indirect association between *Z* and *Y* at GWAS stage. Estimates and standard errors are constructed following equation (2) in Section 2.3. For comparability with RDS and TZL, we use ![Graphic][22]</img> in the parameter-specific optimality criterion for LM and GA as described in Section 2.4.4 given that this variance estimate does not depend on the assumption of *β*1 = *o*(1). Additional numerical studies comparing LM and GA against alternative heuristic designs utilizing 𝕁 (Φ)−1 in the optimization are found in Section S3, Online Supplementary Material.

### 3.1 Data generation

We assume a data generating mechanism similar to Espin-Garcia *et al*. [17]. Briefly, for a phase 1 sample size (*N*), and given values for minor allele frequencies (MAFs), *q**G* and *q**z* and the linkage disequilibrium (LD), quantified through the Pearson correlation coefficient, *r*, we simulate two variants on the same haplotype under Hardy-Weinberg equilibrium (HWE): *G*1 and *Z*. Here, *q**G* and *q**Z* are the frequencies of the less common allele in the population for *G*1 and *Z*, respectively, whereas LD is the level of correlation between them. Notably, since the actual allele frequencies cannot be negative and the additive linkage disequilibrium coefficient D is constrained, not all combinations of *r, q**G* and *q**Z* can occur.

The trait of interest is then generated as *Y* = *β* + *β*1*G*1 + *ε*, where *ε* ∼𝒩 (0, *σ*2). We note that in this setting, as opposed to Section 2.2, *Z* is assumed to be conditionally independent of *Y* given *G*. This simulation setup aims to resemble a more realistic scenario in which the GWAS-SNP, *Z*, is not causal itself but rather is in linkage disequilibrium with the causal variant, *G*.

To imitate the GWAS setting in each dataset, we test *γ**z*, the genetic effect of *Z*, for association in the regression model *Y* = *γ* +*γ**z**Z* and only keep replicates that meet a suggestive genome-wide significance criterion of *p <* 1 × 10−5 for the hypothesis H : *γ**z* = 0. Lastly, to study type 1 error (T1E) under this data-generation mechanism, we simulate another SNP, *G* independently from *Z* and *G*1 with MAF *q**G*.

Strata for *Y* are defined by discretizing the trait values into a three-category variable, *Y**st* = {*T*1, *T*2, *T*3}, according to fixed cut points (*C*1, *C*2) as the percentiles (40,60) of a normal distribution with mean *µ* = 2 and variance *σ*2 = 1, so that under the null, Pr(*Y < C*1) = Pr(*Y > C*2) = 0.4. Strata for the biallelic GWAS SNP, *Z*, are defined by considering *Z* as a three-category variable corresponding to genotypes, (*AA, Aa, aa*) and coded by the number of copies of the minor allele (*a*), i.e. *Z* = 0, 1, 2 (additive association).

Of note, stratification by *Y**st* and *Z* is only employed for optimization in LM and for visualization to compare the distribution of selected individuals under other designs.

### 3.2 Assessing the phase 2 designs

The first set of evaluations consists of the following. We specify a phase 1 sample size of *N* = 5000 and a phase 2 sample size of *n* = 540, 1500, 2500, i.e. 0.108, 0.25, and 0.50 of the phase 1 data, respectively. We draw 1250 replicates for each combination of simulation parameters *q**G* = 0.2, *q**Z* = 0.3, *r* = 0.75, *β* = 2, *σ*2 = 1, and *β*1 ∈ ⟨0.1 + 0.2*j*|*j* = 0, …, 3⟩. We evaluate the performance of the the proposed designs against two ranked designs, RDS and TZL, across three statistical tests (Wald, likelihood ratio (LR) and score). Since *β*1 is a scalar, the comparison against ranked designs only examines a parameter-specific criterion as a consequence from considering ![Graphic][23]</img> in the objective function, as mentioned in Section 2.4.4.

To compare power, we assess the ratio of the empirical power of each design over that of the complete data case (relative empirical power, rEP). In addition, estimation efficiency is compared via relative asymptotic and empirical standard error (rASE and rESE, respectively) of ![Graphic][24]</img> for each design over that of the complete data. We deem these measures provide a better reflection of the design performance compared to studies that benchmark against simple random sampling. Note that as these ratios become closer to 1 (100%), the better the studied designs are able to recover the performance of the complete data analysis.

The specification of ***β****des*, the design regression parameters, corresponds to ![Graphic][25]</img>. Here ![Graphic][26]</img> and ![Graphic][27]</img> denote the maximum likelihood estimates (MLEs) from GWAS, i.e. the MLEs for *Y* = *γ* + *γ*1*Z*. Similarly, we specify ***p****des*, the design haplotype distribution between *G* and *Z*, under HWE by estimating *q**Z* from the phase 1 sample and designating *q**G* and *r* to be the equal to their generating values. These design values are used to determine LM, GA, and TZL designs but not RDS, which is agnostic to these design quantities.

Although correct specification of the design quantities is hardly ever attainable, the settings above allow us to evaluate the true type 1 error/power of the studied designs. We discuss in the next section how to proceed in practice when the true design values are unavailable. Moreover, by specifying the regression parameters under the null hypothesis, i.e. ![Graphic][28]</img>, the design problem greatly simplifies to only specify values for ***p****des*.

### 3.3 Results

The ranked designs can be specified without strata definitions, however, for visualization and comparison purposes, we plot the distribution of RDS and TZL according to the predetermined strata. When comparing LM and GA against the ranked designs for the smallest and largest studied genetic effects (*β*1 = 0.1 and *β*1 = 0.7), we observe: (1) LM, GA and TZL vary considerably across values of *β*1 and *n*, (2) LM displays more unstable strata distribution when compared against GA and ranked designs, especially for the smaller phase 2 sample sizes (*n* = 540, 1250), (3) GA follows closely the RDS design especially when *n* = 540, 1250, (4) LM and GA reach an approximately equal strata distribution when *n* = 2500, and (5) as expected, the strata distribution of the RDS design remains practically unchanged between genetic effect sizes and phase 2 sample sizes (Figure 1).

![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/18/2021.05.15.21257266/F2.medium.gif)

[Figure 1:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/F2)

Figure 1: 
Mosaic plots with the average strata sizes across replicates for the proposed designs against ranked designs across phase 2 sample sizes, *n* = 540, 1250, 2500, under the parameter-specific criterion. Averages were taken from the resulting designs in the main simulation study for the two most extreme values of *β*1 (0.1,0.7).

At *α* =1%, type 1 error (T1E) rates demonstrate well controlled values across the three tests in most cases. For LM, slightly anti-conservative T1E rates under the Wald test and conservative rates under the score test are observed when *n* = 540 and approximate the nominal rate as *n* increases(Table 2). Closer inspection of the *p*-value distribution under LR displays no gross departure from the expected uniform distribution (Figure S1). Overall, we observed the LR statistics showed better behavior compared to score and Wald statistics even under small sample sizes specially under LM design (Table 2). Empirical bias ![Graphic][29]</img> is well centered around zero overall and decreases as *n* increases for all designs when the true value for *β*1 is small (*<* 0.3) (Figure 2). However, for larger values of *β*1 (≥ 0.5), LM and TZL show biased estimates when *n* = 540, 1250 and deteriorate as *β*1 increases (Figure 2). All designs show relatively close agreement between (r)ASE and (r)ESE across values of *β*1 and *n* (Tables S1-S2). TZL shows values of rASE and rESE closer to 1, with GA second, RDS third, and LM coming last. GA, RDS designs exhibit adequate coverage while the coverage for LM and TZL worsens as *β*1 increases for *n* = 540, 1250 (Table S3).

View this table:
[Table 2:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/T2)

Table 2: 
Type 1 error (T1E) (*α* = 1%) across studied designs, phase 2 sample sizes (*n* = 540, 1250, 2500) and statistical tests under a parameter-specific criterion. Each entry represents 11250 replicates pooled across empirical null scenarios. The rest of the simulation parameters correspond to *q**G* = 0.2, *q**Z* = 0.3, *r* = 0.75, *β* = 2, *σ*2 = 1, *N* = 5000. The complete data T1E is 1.16 for Wald/LR tests and 1.14 for the score test. To further evaluate test validity under the studied sample sizes, we plot histograms of the observed LR test *p*-values in Figure S1.

![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/18/2021.05.15.21257266/F3.medium.gif)

[Figure 2:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/F3)

Figure 2: 
Boxplots for the distribution of the bias across genetic effect estimates ![Graphic][30]</img> in the studied designs under a parameter-specific criterion. Row facets denote different true *β*1 values (0, 0.1, 0.3, 0.5, 0.7).

Power curves under the LR test at *α* = 1 × 10−8 level, show that TZL consistently demonstrates the highest power across values of *n* with GA second, RDS in the third place and LM having the lowest power. Notably, all designs reach similar power when *n* = 2500 (Figure S2). Interestingly, not all methods show power increases at the same rate due to the differences in efficiency across designs. Additional simulations for larger phase 1 sample size (*N* = 10000) and similar selection fractions (*n/N* = 0.10, 0.25, 0.50) result in analogous type 1 error and power results among designs, suggesting that testing performance is contingent upon sampling fraction and not phase 2 sample size (Section S4.1, Online Supplementary Material). Under the LR test, the rEP is highest for the TZL across values of *β*1 and *n*. GA shows higher power than RDS across virtually all scenarios while LM comes last when *n* = 540, 1250 but reaches similar power to GA when *n* = 2500 (Table 3).

View this table:
[Table 3:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/T3)

Table 3: 
Relative empirical power (rEP), calculated as the ratio of the empirical power of each studied design over that of the complete data, across studied designs, phase 2 sample sizes, and effect sizes under the LR test (*α* = 1 × 10−8). Phase 1 sample size is *N* = 5000 whereas phase 2 sample size is *n* = 540, 1250, 2500. These results exclude *β*1 *>* 0.5 since power had already reached 100%. The rest of the simulation parameters correspond to *q**G* = 0.2, *q**Z* = 0.3, *r* = 0.75, *β* = 2, and *σ*2 = 1.

Besides the additional simulations on different phase 1 and 2 sample sizes, we also studied the influence of different specifications of the joint distribution of *G* and *Z*. In summary, these results are analogous to what was reported above with GA showing competitive power when compared against alternative designs (Section S4.2, Online Supplementary Material).

## 4 Two-phase Study Design in Practice

For LM, GA and TZL designs, specifying different design quantities, ![Graphic][31]</img>, will lead to different phase 2 subsamples. Little attention has been paid to the practical considerations entailed in choosing a study design. One practical strategy is to make an educated guess for the design quantities; another is to consider a range of plausible values. Though adaptive/sequential designs may be feasible in some circumstances[31], in the post-GWAS setting processing data by batch may be operationally inefficient. In addition, although the sequential strategy will provide more precise design parameters, it will not necessarily aid in solving the withstanding issue of selecting a unique phase 2 sample given that multiple more precise estimates will be potentially identified. Therefore, we propose a strategy that relies only on phase 1 data to select a unique phase 2 sample when a range of effect sizes, allele frequencies, and LD values can be considered at design stage.

### 4.1 A grid search procedure to select a unique phase 2 subsample

It is likely that there will be uncertainty about the specification of the effect size (***β****des*) and haplotype distribution (***p****des*) at design stage, so we must consider a range of probable values and define a grid of intermediate points inside this range. Let {*θ**h*}, *h* = 1, …, *H* be the set of probable values or design quantities of interest. Each design quantity *θ**h* will yield an optimal phase 2 subsample, P2S(*h*), for the second stage. Thus, to select an unique design under the set {*θ**h*}*h*=1,…,*H* we propose the following procedure, which is motivated by robustness considerations.

1.  Given an optimality criterion, Λ(*·*), for each *h*
    
    *   - obtain a phase 2 subsample, namely P2S(*h*), via LM/GA or otherwise by optimizing Λ(*θ**h*).
    
    *   - given P2S(*h*), calculate ![Graphic][32]</img> for *h′* = 1, …, *H*
    
    *   - compute ![Graphic][33]</img>, where *κ* is a summary function, e.g. mean or median

2.  select the P2S(*h*) with minimum *κ*(*h*)

This procedure will identify a unique design from the ones generated using alternative specifications {*θ**h*}. To better understand the proposed procedure, let us assume that we are interested in comparing two designs, PS2(1) and PS2(2) which are optimal when the design values are *θ*1 or *θ*2, respectively. In order to select the best design, we adopt a criterion based on robustness. In other words, we are interested in determining which one of these two designs exhibits an overall superior performance when *θ* differs from its generating design value. To this end, we compute the fitness function for each design and each *θ**h′* with *h′* ≠ *h* and select the design that achieves the best average (or median) performance. The formal description simply extends this principle to comparing *H* designs and selecting the most robust one. Ultimately, regional sequencing data will be collected for the subsample from the resulting design alone. Once data are collected, statistical fine-mapping can be conducted following the analytic strategy described in Section 2.3.

### 4.2 Simulation under realistic LD patterns

The purpose of this simulation study is to evaluate the studied designs when the values of *θ* = (***β′, p′***) ′ are unknown and a range of values for *θ**des* is considered instead. In this study, we generate data under a scenario where multiple “causal variants” and a realistic LD structure from a targeted region were considered. Specifically, we select four loci in chromosome 16 as causal variants, ***G*** = (*G*1, …, *G*4) with corresponding effect sizes ![Graphic][34]</img> in hg19 positions 56989830, 56993324, 56994990, 56995236, and designate rs247617 (pos. 56990716) as the GWAS SNP (hereafter all positions are truncated to the last 5 digits). We then generate a QT, *Y*, across 500 replicates following ![Graphic][35]</img>, where *β* = 2, and *ε ∼*𝒩 (0, *σ*2 = 1). Details of the data generation are provided in Section S5, Online Supplementary Material.

### 4.3 Selecting phase 2 samples under prespecified sets of design quantities

Since multiple causal variants are assumed in this section, we ascertain the performance of the studied designs under alternative optimization criteria in addition to the parameter-specific criterion, specifically under A- and D-optimality. This allows *β*1 to be treated as a vector at design stage, that is ![Graphic][36]</img>, with **** being a zero vector. As before, ![Graphic][37]</img> correspond to the MLEs of the regression model *Y* = *γ* + *γ**z**Z*, i.e. GWAS MLEs. In the simulated data, ![Graphic][38]</img> range (1.68 − 1.78), ![Graphic][39]</img> range (0.097 − 0.176) across the 500 replicates.

As mentioned in Section 2.4.4, it is straightforward to modify the optimality criteria for LM and GA. For TZL, no specific results were provided under alternative optimality criteria. However, Tao *et al*. discussed that as Var(***G***|*Z*) becomes a matrix when ***β***1 becomes a vector, it was sufficient to replace the scaling factor Var(*G*|*Z*)1*/*2 (when *G* is a scalar) with Λ[Var(***G***|*Z*)][10].

Additionally, for LM and GA, the optimization is performed using ![Graphic][40]</img>, as this approach showed best performance in the first simulation study when the true values are close to the null hypothesis (Section 3.2). A unique design is then selected for LM, GA and TZL considering multiple (mispecified) values of ***p****des* following 4.1. We also considered RDS in this simulation study, however, since RDS does not depend on *θ**des* in any way, it was not determined using the outlined procedure 4.1. For each replicate, we select a phase 2 data of size *n* = 1250, 2500. The specification of ***p****des* under parameter-specific, A-, and D-optimality criteria is described below.

#### 4.3.1 Parameter-specific criterion

Under this criterion, *β*1 is a scalar. Thus, we specify ***p****des* = *P* (*G, Z*) = *p**gz* assuming each of the resulting combinations between the following: ![Formula][41]</img>  where *Q*1, *Q*2, *Q*3 denote the first, second, and third quartiles of *q**G* (MAF) or *r**Z,G* (LD between *Z* and *G*) across the 29 sequence variants in the region, e.g. ![Graphic][42]</img> is the median MAF value across seq-SNPs in the fine-mapped region while ![Graphic][43]</img> denotes the 25th percentile across correlation values between the GWAS-SNP, *Z*, and the seq-SNPs, *G*.

#### 4.3.2 A- and D- optimality criteria

Under these criteria, ***β***1 is assumed to be a vector of size 2. Thus, we specify ![Graphic][44]</img> assuming each of the resulting combinations between the following: ![Formula][45]</img>  as before, *Q*1 and *Q*3 denote the first and third quartiles of *q**G* (MAF), *r**Z,G* (LD between *Z* and *G*) or *r**G,G′* (LD between *G* and *G′*) across the 29 sequence variants in the region.

In the simulated data, ![Graphic][46]</img>; and ![Graphic][47]</img>. The average sample distribution of the resulting phase 2 designs across the 500 replicates is portrayed via mosaic plots (Figures S4 and S5). GA, RDS, and TZL designs show rather similar distributions across optimality criteria especially when *n* = 2500. LM, on the other hand, selects only from the extremes of the distribution for common heterozygous (*Z* = 0) for *n* = 1250. In addition, the intersection of the subsamples taken across each design/optimality criterion combination is presented via upset plots for a single replicate (Figures S3 and 3). These plots show that almost a third of the phase 2 subsamples are common among GA, RDS and TZL designs and optimality criteria when *n* = 1250. Notably, the number of common subsamples jumps to about a half among GA, RDS and TZL and optimality criteria when *n* = 2500.

![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/18/2021.05.15.21257266/F4.medium.gif)

[Figure 3:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/F4)

Figure 3: 
Upset plot for a single replicate in the realistic simulation to quantify the intersection sizes across studied designs and optimality criteria when *n* = 2500. Each bar denotes the size of a given intersection highlighted in the x-axis, i.e. the number of subjects common among designs. The matrix in the x-axis corresponds to each optimality criterion (parameter-specific/A-/D-optimality) and design (LM/GA/TZL) combination as well as RDS.

### 4.4 Single-variant fine-mapping analysis

Once the phase 2 sample is selected in a given replicate, we perform a region scan using the 29 variants, i.e. we test for association one variant at a time, across each phase 2 sample size (*n* = 1250, 2500). To decrease the collinearity between *G* and *Z* in the model, we treat the GWAS SNP, *Z* (rs247617), as a (3-level) categorical covariate at design and analysis stages. We summarize the point estimates, asymptotic standard errors, and empirical power rates (under the LR test) for the region scans across replicates for each design and optimality criteria (Tables S4-4).

View this table:
[Table 4:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/T4)

Table 4: 
Empirical power rates at significance level *α* = 0.05*/*29 for causal variants only and *n* = 2500 across 500 replicates in realistic fine-mapping simulation single-variant analysis. Base pair positions (pos.) marked with ∗ denote causal variants whereas the ones marked with *†* denote hitchhikers. The remaining ones are non-causal. Positions are truncated to the last 5 digits.

GA, RDS and TZL show similar results in terms of estimation and power across different values of *n* and optimality criteria (for GA and TZL) whereas LM exhibits considerably lower power (Tables S6-4). We observe similar distributions of the LR test *p*-value (in − log10 scale) across replicates for the studied designs with the exception of some outliers; LM shows the smallest (− log10) *p*-values compared to the other designs (Figures S6 and 4). Lastly, no optimality criterion shows consistently best estimation nor power, suggesting that no specific criteria substantially improves overall performance when the design values, *θ**des*, are misspecified.

![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/18/2021.05.15.21257266/F5.medium.gif)

[Figure 4:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/F5)

Figure 4: 
Boxplots of the (− log10) *p*-values across 500 replicates in the fine-mapping simulation single-variant analyses for a phase 2 sample size of *n* = 2500 across optimality criteria (for LM, GA, and TZL only): parameter-specific, A- and D-optimality in each row facet. Each column facet corresponds to the complete data analysis and studied designs respectively. The dashed line corresponds to a Bonferroni-corrected significance threshold of *α* = 0.05*/*29.

It is obvious that in most cases the mean of the estimate for causal sequence variants does not correspond with its true value being both over- and under- estimated (Tables S4-S5). In fact, this discrepancy occurs even for the complete data case, which is unsurprising considering the unaccounted variation resulting from the single-variant analysis.

The power to detect association (at *α* = 0.05*/*29) is above 80% in the complete data for 3 out of 4 causal variants: 89830, 94990, 95236. The power for the remaining causal variant (93324) is almost zero (Table S6). This decrease in power is likely due to the high LD between this variant and the GWAS SNP, *Z* (*r* = 0.94 and D′ = 0.95) which is already included in the regression, thus, diluting its signal. Moreover, there are additional non-causal variants that display power above 80% in the complete data analysis (Table S6). These so-called “hitchhiker variants” achieve significant association as a consequence of their LD with causal variants. The performance of the studied designs for the hitchhiker variants resembles the complete data analysis and its ranking is similar to the one shown with the causal variants. These results indicate that single-variant analysis does not distinguish well between causal and hitchhiker SNPs in complete nor two-phase analysis.

A common strategy to identify potential causal variants from hitchhikers consists of adjusting for the most significant variant (or variants) in the region and performing a new -conditional-scan (i.e. one variant at a time) fitting the following model: ![Graphic][48]</img>, where *G* denotes a variant in the region, *G**top* is the most significant locus from the single-variant analysis, and *Z* is the GWAS SNP treated as a (3-level) categorical variable. This approach aims to discover independent signals in the region. Results for this conditional analysis can be found on Section S6, Online Supplementary Material.

## 5 Application in the Northern Finland Birth Cohort of 1966

We illustrate the methods outlined in Sections 2.4 and 4 using the Northern Finland Birth Cohort of 1966 (NFBC1966), which is a longitudinal, prospective birth cohort constituted by women and their offspring from the two northernmost provinces in Finland: Oulu and Lapland. Comprehensive phenotypic, lifestyle and demographic data were collected after birth via questionnaires and clinical evaluations on the offspring at years 1, 7, 14-16 and 31. The NFBC1966 aims to study genetic, biological, social or behavioural risk factors associated with the onset of different diseases as well as morbidity and mortality derived from adverse events such as pre-term birth and intrauterine growth retardation[32;33]. In particular, as part of an NHLBI-sponsored project designed to characterize the genetic determinants of metabolic and cardiovascular diseases, special attention was paid to a selected list of heritable quantitative traits related to cardiovascular diseases or type 2 diabetes. These traits are body mass index (BMI), high density lipoproteins (HDL), low density lipoproteins (LDL), triglycerides (TG), glucose (GLU), insulin (INS), C-reactive protein (CRP), systolic blood pressure (SBP) and diastolic blood pressure (DBP).

We focus on 5402 subjects for which genotype information was collected using the Illumina Infinium platform, which is comprised by 346,590 SNPs (after standard quality control). In addition to the genotype information, custom targeted sequencing (CTS) was collected for 4511 of them (83.5%) as part of a series of re-sequencing studies to deepen the understanding of genotypic variation on metabolic traits[34]. The CTS data contain the coding sequence and 5’ and 3’ untranslated regions of 78 genes, which were selected based on previous GWAS meta-analyses of cardiovascular diseases. Details of these regions can be found in Service *et al*. [34]. The purpose of this illustration is to optimally select a subsample of subjects for targeted sequencing study (phase 2) and fine-mapping to locate potential causal variants using the methods outlined in the previous sections.

### 5.1 Phase 2 subsample selection

We first identify GWAS-SNPs by performing genome-wide associations on the available quantitative traits. Although genome-wide scans on these very same metabolic traits for the NFBC1966 have been previously carried out in Sabatti *et al*., our analyses differ in a couple of aspects: (1) the sample size we utilized is slightly larger because additional subjects were genotyped at a later time and (2) we perform multiple linear regression adjusting by the SexOCPG covariate described in Sabatti *et al*., which is a composed categorical variable determined by sex, oral contraceptive use and pregnancy status[33].

We center attention on one trait, log-transformed TG (*Y*), as its GWAS has few peaks that identify only two genetic regions for further study: GCKR in chromosome 2 and LPL in chromosome 8 (Figure S7). We comment on the challenges of more complex GWAS scenarios in the discussion. We locate one SNPs in each of these regions that meet the usual genome-wide significance threshold (5 × 10−8): rs1260326 (chr2:27730940, ![Graphic][49]</img>, s.e. ![Graphic][50]</img>, *p* = 5.67 × 10−11, MAF= 35.7%) and rs10096633 (chr8:19830921, ![Graphic][51]</img>, s.e. ![Graphic][52]</img>, *p* = 3.24 × 10−9, MAF= 9.7%). Due to missing data on the TG values, the available subjects for the genome scan was 5300. Of these, the number of subjects with both GWAS and CTS data is *N* = 4493, which is the phase 1 sample size considered for phase 2 analyses. In addition to the GCKR and LPL regions used in the phase 2 selection, we consider another region for analysis: APOA5, to illustrate the correspondance of the two-phase design and analysis with the complete data approach pursued in Service *et al*. [34].

Using the two identified GWAS-SNPs, we select phase 2 subsamples under three of the previously described designs: GA, RDS and TZL. We drop LM as it showed the worst performance in simulations. The phase 2 sample size is specified to be approximately 25%, or 50% of the phase 1 sample size *(n* = 1123, 2246). To define *Z*, we use all allele combinations of the GWAS-SNPs rs1260326 and rs10096633, which results in a nine-category variable. Considering that no optimality criterion performed best in Section 4, we deem it appropriate to assume *β*1 is a scalar and use a parameter-specific criterion, which greatly simplifies the specification of the design quantities, particularly ***p****des*. Phase 2 subsample selection is performed separately per each phase 2 sample size. In each case, a set of design quantities is defined as follows: First, ![Graphic][53]</img>, where ![Graphic][54]</img> and ![Graphic][55]</img> correspond to the MLEs from the following regression model based on phase 1 data: ![Graphic][56]</img>, where ***X*** is a vector of additional covariates including SexOCPG and the first four genetic principal components (PC1-4). Second, for ***p****des* and given that MAF and LD values are unavailable *a priori*, we postulate the following ranges for these design quantities: *q**G* ∈ ⟨0.05 + 0.05*j*|*j* = 0, 1, …, 6⟩ and ***r*** = {0.0, *±*0.17, *±*0.33, *±*0.50}.

For visualization purposes, we categorize TG (*Y**st*) into 3 groups corresponding to commonly used blood test ranges, i.e. normal (*<* 150 mg/dL), borderline high (150 − 199 mg/dL) and high (≥ 200 mg/dL). Notably, the groups in *Y**st* are asymmetrical with respect the middle *Y**st* stratum. On the other hand, the distribution of the nine-category variable determined by the two GWAS-SNPs (rs1260326 and rs10096633) has a small number of subjects for some categories due to the relatively low MAFs of SNPs rs10096633, which differs largely from the simulations (Figure S8). Consequently, the phase 2 subsample distributions tend to not select subjects from those categories. Notably, GA, RDS, and TZL show similar category distribution across phase 2 sample sizes (Figure S9). The proportion of subjects that are common among designs in the phase 2 subsamples is above 95% between pairs of GA, RDS, and TZL across phase 2 sample sizes (Figure S10).

### 5.2 Fine-mapping analysis

CTS data for genes GCKR, LPL and APOA5 were downloaded from the NCBI’s dbGAP repository according to their GRCh37.p13 location *±*5kbps. Since aligned reads were available, we performed variant calling using the GotCloud pipeline developed by the Center for Statistical Genetics at the University of Michigan[35]. Sequence data were analyzed in two ways. First, we used linear regression with complete data, i.e. subjects with both genotyping and CTS data (*N* = 4493), and second via the ML approach described above for each studied design: GA, RDS, and TZL (*n* = 1123, 2246). For the ML analysis, the nine-category variable defined by the GWAS SNPs was used as auxiliary variable *Z*. All analyses were adjusted by the GWAS-SNPs (rs1260326 and rs10096633), SexOCPG and PC1-PC4 as covariates.

Our main interest lies in gauging the performance of the two-phase designs with respect to the complete data analysis, evaluating both estimation and hypothesis testing. For estimation, we focus on three sequence variants reported in Table 2 of Service *et al*. [34]: rs268, rs2266788, and rs3135506 (Table 5). For these variants, GA, RDS and TZL show similar results with improving performance as *n* increases. Comparisons of association estimates in Beta-Beta plots show similar spread estimates across designs (Figure S11). Similarly, region plots of association signals for complete data and ML analyses across the studied designs indicate that all designs tend to display results closer to those of the complete data case as *n* increases with no overwhelmingly better design (Figure 5). In addition to the region scans performed for analyzing common variants, we demonstrate that rare variants can be investigated under a two-phase design via burden tests in Section S7, Online Supplementary Material.

View this table:
[Table 5:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/T5)

Table 5: 
Estimation and testing results for analyzing (log-transformed) triglyceride levels across 3 sequence SNPs from the fine-mapping analysis in the NFBC66.

![Figure 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/18/2021.05.15.21257266/F6.medium.gif)

[Figure 5:](http://medrxiv.org/content/early/2021/05/18/2021.05.15.21257266/F6)

Figure 5: 
Region plots of the NFCBB66 CTS data for the ML analyses under the studied designs compared to the complete data analysis across different phase 2 sample sizes, *n* = 1123, 2246. Column facets show the region plot for each loci of interest. Row facets show different phase 2 sample sizes.

## 6 Discussion

In this report, we propose and evaluate state-of-the-art sample selection strategies for two-phase designs in the context of post-GWAS fine-mapping studies. We pay special attention in the comparison against a recently proposed optimal design, TZL. Our first set of simulations, considering a parameter-specific criterion, shows a clear advantage of TZL under the strong assumption of correctly specified design values (***p****des* or Var(*G*|*Z*)). On the other hand, TZL demonstrates biased estimation when the phase 2 sample size/non-missing fraction is small (*n/N* = 0.10, 0.25) and the effect sizes are farther from the null *β*1 ≥ 0.5, with improvements noticed when *n* increases. These results are aligned with those obtained under LM, which reinforces our initial belief that LM is a crude approximation to TZL. Moreover, LM is only competitive when *n* = 2500, suggesting that the sampling variability introduced by this method can only be overcome under larger non-missing fractions. In contrast, GA demonstrates unbiased estimation and competitive power (often larger than RDS) across all simulation settings. The appeal of GA lies in its generality as it can be extended to other settings beyond linear, logistic and Cox models while also avoiding uncertainty associated with sampling. Thus, GA provides an alternative approach to obtain efficient and robust two-phase designs across a wider range of settings, including large effect sizes.

Additionally, we investigate the use of different variances considered in the optimization: 𝕁−1, and 𝕍−1, which are, respectively, the inverse of the Fisher information matrix and the variance-covariance matrix of the most powerful test under the null hypothesis (Sections 3.2 and S3, Online Supplementary Material). The results support the use of 𝕍−1 for the parameter-specific criterion when the effect size is close to the null, as expected.

Correct design values specification is never attainable in practice. Hence, our second set of simulations evaluates a more realistic setting for which we propose a grid search approach to select a unique phase 2 design assuming a set of (mispecified) design values is available. In this scenario, GA, RDS and TZL have comparable performance with LM falling behind. GA is advantageous as it can be easily applied to general Λ(*·*) functions. Thus, apart from the parameter-specific criterion, we examine two additional criteria to select a phase 2 design: A- and D-optimality. Notably, based on our simulations we found no evidence in favour of any particular optimality criterion. However, the parameter-specific criterion may be preferred as designating its design values requires fewer assumptions. A- and D-optimality criteria were chosen because they have been amply explored in the literature, possess solid roots in experimental designs, and have natural connection with hypothesis testing. Nonetheless, other criteria may be better suited for optimizing power, for instance, maximizing the non-centrality parameter of the likelihood ratio or score test *χ*2 statistic may improve the power performance of the phase 2 designs.

An important observation from the simulations is that although the best performing designs have higher relative efficiency compared to less favorable designs (e.g. combined or simple random sampling) across all phase 2 sample sizes, this improvement does not automatically translate to a closer agreement with the complete data analysis. That is, power performance is contingent upon non-missing fraction and not necessarily phase 2 sample size itself. This finding is consistent throughout our investigations, where both parameter estimates and *p*−values achieve similar values as in the complete data analysis only when the phase 2 sample size is half of the phase 1 sample size (*n/N* = 0.5). Thus, careful evaluation of the statistical power of the phase 2 design needs to be considered in advance.

The competitive performance of GA notwithstanding, there are other considerations in implementing this algorithm. For instance, we provide in all simulations an ad-hoc approach to initialize the population of possible solutions to accelerate convergence. In general, giving a particular initialization is not necessary, however, a completely random initial population may need a larger number of generations (ℋ) to achieve good performance. In addition, given the stochastic nature of the search, there are few guarantees that the final solution has indeed reached a global as opposed to a local optimum. It is also worth mentioning that the tuning parameter settings for the proposed GA are intended as a guideline only and do not replace a more careful evaluation in specific problems. Walters suggests iterative calls, which consist of running the GA multiple times, so that the final population in each run serves as the initial population in subsequent runs[29].

The budgetary constraint implicitly assumes that the cost of sequencing samples is the same for all study samples. This assumption may be relaxed as sequencing costs may vary due to location, tissue availability or number of samples. Thus, extensions to consider differential costs are yet to be considered; one example of such approach under tracing study designs can be found in Moon *et al*. [36]. Another important issue that deserves further investigation in terms of budget constraints (or otherwise) is use of differential sequencing depths across samples.

Further investigation regarding selection of optimal phase 2 subsample under a set of loosely defined design values, *θ* = (***β′***, ***p′***) ′ ∈ Θ, is warranted. Indeed, beyond the proposed grid search approach, alternative means to select a phase 2 subsample across ranges of *θ* are possible, e.g. via min-max approaches[37]. However, this selection problem may also be addressed under a Bayesian framework for which a prior (joint) distribution for ***β*** and ***p*** needs to be specified[3;38;39;40]. The appeal of this approach is that it may better incorporate the uncertainty in the design values for selecting an phase 2 subsample although at the expense of computational complexity.

Beyond the feasibility and applicability of the proposed methodologies in practice, the illustration on the amply studied NFBC66 raises some additional questions on considerations posed in the design and analysis of two-phase post-GWAS fine-mapping studies. First, our rare-variant analysis shows no association in either of the studied regions. This result is not that surprising for a couple of reasons: the limited sample size in the NFBC66 and the low correlation between the GWAS SNPs and the computed genetic score. Additionally, the burden test assumes the same direction across all variants, which is a limitation in various settings. Thus, further investigation is required for variance component rare-variant tests, as the properties of these test statistics have not been thoroughly studied under the proposed missing by design scenario. Second, current practice in the field involves the imputation of GWAS data using high-quality reference panels such as the TOPMed Imputation Panel[41]. In principle, one can use the imputed variants to construct a suitable auxiliary variable, *Z*, to select a phase 2 subsample using the proposed framework. Alternative methods that accommodate differences between genotyped and imputed data for subjects not selected for phase 2 sequencing have been discussed[42;43]. Hence, comparisons between these methods and the approach undertaken in this article can be also evaluated. Lastly, in a similar vein, methodological extensions for situations when multiple loci are pinpointed by GWAS and/or multiple traits of equal interest are collected remain as topics of future work. A starting point in this direction may involve the calculation and application of polygenic risk scores to inform the phase 2 subsampling.

Another issue deserving further investigation is the influence of the phase 2 sample selection in the variant calling pipeline. For simplicity, in this application, variant calling was performed per each loci on all available CTS samples. However, even though genotype likelihoods are typically inferred by sample, population-specific filters such as MAF may change with the design. Thus, sensitivity analyses of these filters can be additionally explored.

We emphasize that although this report focuses on a normally distributed continuous trait, all the derivations apply in the context of generalized linear models within the exponential family. Furthermore, we are engaged in the development of an R package for this general case. Results of this research and the accompanying software aim to support investigators decision-making pertaining to study design, evaluation and analysis of two-phase studies. These tools can serve to make more efficient use of limited budgetary resources for data acquisition and analysis.

Two-phase study designs can be sought in other contexts. In particular, their use in a variety of ‘omics problems is broadly relevant as new and more costly technologies continue to arise. Beyond the case of fine-mapping where causal variants from GWAS-identified regions can be pinpointed at a fraction of the cost, this approach can be extended for phase 2 variables that are not categorical, for example, methylation, gene expression or other ‘omics measurements. Additionally, methods that introduce functional knowledge to further inform the inference (possibly through Bayesian methods) deserves further investigation.

## Supporting information

Supplemental information [[supplements/257266_file02.pdf]](pending:yes)

## Data Availability

The data/analyses presented in the current publication are based on the use of study data downloaded from the dbGaP web site, under study id phs000276.v2.p1.

[https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study\_id=phs000276.v2.p1](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000276.v2.p1) 

## Author contributions

OE-G wrote a first draft of this manuscript and subsequent revisions, performed simulations and analyzed the NFBC66 data. RVC and SBB provided statistical guidance and methodological support. All authors contributed in the design of the numerical experiments, overall analysis plan, as well as reviewing and approving the final version of the manuscript.

## Financial disclosure

This research is supported by funding from the Canadian Institutes of Health Research: CIHR Operating Grant MOP-84287 (RVC, SBB), CIHR Training Grant GET-101831 (OE-G); and the Ontario Institute for Cancer Research (OICR) through funding provided by the Government of Ontario (OE-G). OE-G has been fellow trainee of OICR Biostatistics Training Initiative and CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) - CIHR Training Grant in Genetic Epidemiology and Statistical Genetics.

## Data Availability

The data/analyses presented in the current publication are based on the use of study data downloaded from the dbGaP web site, under phs000276.v2.p1.

## Conflict of interest

The authors declare no potential conflict of interests.

## Supporting information

The supporting information file contains 7 additional sections, 24 tables, and 27 figures.

## A Derivation of the information matrix

This section derives the observed and expected information matrices, which allow us to compute the optimality criterion. The expected information matrix in particular is called repeatedly during the optimization performed in both LM and GA approaches.

### A.1 Log-likelihood

Considering the observed-data likelihood in (1), the log-likelihood takes the form ![Formula][57]</img>  where ***p*** = (*p*1, …, *p**J*) is the vector of probabilities corresponding to the *J* unique pairs (*g**j*, *z**j*) in 𝒢 × *Ƶ*.

### A.2 Score equations

Taking the first derivative of (4) with respect to ***β*** and ***p*** for a single observation, we have ![Formula][58]</img>  for the exponential family case, ![Graphic][59]</img>, where *V* (*µ**i*) is the variance function, *a*(*ϕ*) is the dispersion parameter and ![Graphic][60]</img> is the profile weight.

### A.3 Fisher Information Matrix

By definition, the Fisher information matrix (FIM) is given by ![Graphic][61]</img>.

Note that ![Formula][62]</img>  where the sums are taken with respect to all observed values of (*g, z*) and Pr(*Y* = *y, G* = *g, Z* = *z*) = Pr(*y, g, z*) = *f**β*(*y*|*g, z*)*p**g,z*.

Thus, considering that the score equations in subsection A.2 can be rewritten as ![Formula][63]</img>  the cross-products are defined as follows ![Formula][64]</img> 

After taking expected values on both sides and by equation (5), we have ![Formula][65]</img> 

The expression above can be approximated empirically especially when *G* and *Z* are discrete. However, under the null hypothesis of interest, i.e. *β*1 = 0, *Y* and *G* are conditionally independent given *Z*. Thus, letting ![Graphic][66]</img>, the expressions above simplify to: ![Formula][67]</img>  where ![Graphic][68]</img>, and ω*i,j* = *Pr*(*G* = *g**j* | *z**i* = *z**j*). Additionally, ![Formula][69]</img>  here, ![Graphic][70]</img>,is the *l;m* entry of the outer product ![Graphic][71]</img> where ![Graphic][72]</img> is replaced by E[*g*2 |*z**i*] and ![Graphic][73]</img>.

Therefore, under the null hypothesis of interest, the Fisher information matrix can be expressed as ![Formula][74]</img>  

## Acknowledgments

The NFBC1966 Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with the Broad Institute, UCLA, University of Oulu, and the National Institute for Health and Welfare in Finland. This manuscript was not prepared in collaboration with investigators of the NFBC1966 Study and does not necessarily reflect the opinions or views of the NFBC1966 Study Investigators, Broad Institute, UCLA, University of Oulu, National Institute for Health and Welfare in Finland and the NHLBI.

Computations were performed on the Niagara supercomputer at the SciNet HPC Consortium and Galen, the HPC facility at the Lunenfed-Tanenbaum Research Institute (LTRI). SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; Ontario Research Fund - Research Excellence; and the University of Toronto. The LTRI HPC facility is supported by the Canada Foundation for Innovation. OE-G would like to thank Prof. Olli Saarela and Prof. Richard J. Cook for constructive discussions and feedback. Authors thank the anonymous reviewers who helped enrich this work.

*   Received May 15, 2021.
*   Revision received May 15, 2021.
*   Accepted May 18, 2021.


*   © 2021, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  [1].Walker AM. Anamorphic analysis: sampling and estimation for covariate effects when both exposure and disease are known.. Biometrics 1982; 38(4): 1025–1032.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2307/2529883&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=7168792&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1982QF89700015&link_type=ISI) 

2.  [2].White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology 1982; 115(1): 119–128.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=7055123&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1982MY30800013&link_type=ISI) 

3.  [3].Karvanen J, Kulathinal S, Gasbarra D. Optimal designs to select individuals for genotyping conditional on observed binary or survival outcomes and non-genetic covariates. Computational Statistics & Data Analysis 2009; 53(5): 1782–1793. doi: 10.1016/J.CSDA.2008.02.010
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/J.CSDA.2008.02.010&link_type=DOI) 

4.  [4].Schaid DJ, Jenkins GD, Ingle JN, Weinshilboum RM. Two-phase designs to follow-up genome-wide association signals with DNA resequencing studies. Genetic Epidemiology 2013; 37(3): 229–238. doi: 10.1002/gepi.21708
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/gepi.21708&link_type=DOI) 

5.  [5].Breslow NE, Chatterjee N. Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Appl. Statist 1999; 48(4): 457–468. doi: 10.1111/1467-9876.00165
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/1467-9876.00165&link_type=DOI) 

6.  [6].Reilly M. Optimal sampling strategies for two-stage studies.. American Journal of Epidemiology 1996; 143(1): 92–100.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/oxfordjournals.aje.a008662&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=8533752&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1996TN42400011&link_type=ISI) 

7.  [7].McNamee R. Optimal design and efficiency of two-phase case-control studies with error-prone and error-free exposure measures. Biostatistics 2005; 6(4): 590–603. doi: 10.1093/biostatistics/kxi029
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biostatistics/kxi029&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=15860543&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 

8.  [8].McIsaac MA, Cook RJ. Response-dependent two-phase sampling designs for biomarker studies. Canadian Journal of Statistics 2014; 42(4): 268–284. doi: 10.1002/cjs.11228
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/cjs.11228&link_type=DOI) 

9.  [9].Zhao Y, Lawless JF, McLeish DL. Design and relative efficiency in two-phase studies. Journal of Statistical Planning and Inference 2012; 142(11): 2953–2964. doi: 10.1016/j.jspi.2012.04.013
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jspi.2012.04.013&link_type=DOI) 

10. [10].Tao R, Zeng D, Lin DY. Optimal Designs of Two-Phase Studies. Journal of the American Statistical Association 2020; 115(532): 1946–1959. doi: 10.1080/01621459.2019.1671200
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/01621459.2019.1671200&link_type=DOI) 

11. [11].Little RJA, Rubin DB. Statistical analysis with missing data. Hoboken, N.J: Wiley. 2nd ed ed. 2002.
    
    
12. [12].Robins JM, Hsieh F, Newey W. Semiparametric Efficient Estimation of a Conditional Density with Missing or Mismeasured Covariates. J. R. Statist. Soc. B 1995; 57(2): 409–424.
    
    
13. [13].Lawless JF, Kalbfleisch JD, Wild CJ. Semiparametric methods for response-selective and missing data problems in regression. J. R. Statist. Soc. B 1999; 61(2): 413–438. doi: 10.1111/1467-9868.00185
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/1467-9868.00185&link_type=DOI) 

14. [14].Zhao Y, Lawless JF, McLeish DL. Likelihood methods for regression models with expensive variables missing by design. Biometrical Journal 2009; 51(1): 123–136. doi: 10.1002/bimj.200810487
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/bimj.200810487&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19197954&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 

15. [15].Zhang Z, Rockette HE. Semiparametric maximum likelihood for missing covariates in parametric regression. Annals of the Institute of Statistical Mathematics 2006; 58(4): 687–706. doi: 10.1007/s10463-006-0047-7
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s10463-006-0047-7&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000242496700003&link_type=ISI) 

16. [16].Lin DY, Zeng D, Tang ZZ. Quantitative trait analysis in sequencing studies under trait-dependent sampling.. Proceedings of the National Academy of Sciences of the United States of America 2013; 110(30): 12247–52. doi: 10.1073/pnas.1221713110
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTEwLzMwLzEyMjQ3IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDUvMTgvMjAyMS4wNS4xNS4yMTI1NzI2Ni5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 

17. [17].Espin-Garcia O, Craiu RV, Bull SB. Two-phase designs for joint quantitative-trait-dependent and genotype-dependent sampling in post-GWAS regional sequencing. Genetic Epidemiology 2018; 42(1): 104–116. doi: 10.1002/gepi.22099
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/gepi.22099&link_type=DOI) 

18. [18].Ibrahim JG. Incomplete Data in Generalized Linear Models. Journal of the American Statistical Association 1990; 85(411): 765–769. doi: 10.1080/01621459.1990.10474938
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/01621459.1990.10474938&link_type=DOI) 

19. [19].Ibrahim JG, Chen MH, Lipsitz SR. Monte Carlo EM for Missing Covariates in Parametric Regression Models. Biometrics 1999; 55(2): 591–596. doi: 10.1111/j.0006-341X.1999.00591.x
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.0006-341X.1999.00591.x&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=11318219&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000081320800036&link_type=ISI) 

20. [20].Zhang Z, Rockette HE. An EM algorithm for regression analysis with incomplete covariate information. Journal of Statistical Computation and Simulation 2007; 77(2): 163–173.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/10629360600565202&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000242216500006&link_type=ISI) 

21. [21].Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society 1982; 44: 226–233.
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1982PF14100009&link_type=ISI) 

22. [22].Schaid DJ, Chen W, Larson NB. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics 2018; 19(8): 491–504. doi: 10.1038/s41576-018-0016-z
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41576-018-0016-z&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29844615&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 

23. [23].Reilly M, Pepe MS. A Mean Score Method for Missing and Auxiliary Covariate Data in Regression Models. Biometrika 1995; 82(2): 299–314.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biomet/82.2.299&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1995RH60200006&link_type=ISI) 

24. [24].Whittemore A, Halpern J. Multi-stage sampling in genetic epidemiology.. Statistics in Medicine 1997; 16(1-3): 153–67.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/(SICI)1097-0258(19970130)16:2<153::AID-SIM477>3.0.CO;2-7&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=9004389&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1997WA62500011&link_type=ISI) 

25. [25].Holland JH. Genetic Algorithms. Scientific American 1992; 267(1): 66–73.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/scientificamerican0792-66&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1992HY64800012&link_type=ISI) 

26. [26].Whitley D. A genetic algorithm tutorial. Statistics and Computing 1994; 4(2): 65–85. doi: 10.1007/BF00175354
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/BF00175354&link_type=DOI) 

27. [27].Ruiz-Torrubiano R, García-Moratilla S, Suárez A. Optimization Problems with Cardinality Constraints . 2010
    
    
28. [28].R Core Team . R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2016.
    
    
29. [29].Wolters MA. A Genetic Algorithm for Selection of Fixed-Size Subsets with Application to Design Problems. Journal of Statistical Software 2015; 68(Code Snippet 1). doi: 10.18637/jss.v068.c01
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.18637/jss.v068.c01&link_type=DOI) 

30. [30].Emery AF, Nenarokomov AV. Optimal Experiment Design. Measurement Science and Technology 1998; 9: 864–876.
    
    
31. [31].McIsaac MA, Cook RJ. Adaptive sampling in two-phase designs: A biomarker study for progression in arthritis. Statistics in Medicine 2015; 34(21): 2899–2912. doi: 10.1002/sim.6523
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/sim.6523&link_type=DOI) 

32. [32].Järvelin MR, Sovio U, King V, et al. Early Life Factors and Blood Pressure at Age 31 Years in the 1966 Northern Finland Birth Cohort. Hypertension 2004; 44(6): 838–846. doi: 10.1161/01.HYP.0000148304.33869.ee
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1161/01.HYP.0000148304.33869.ee&link_type=DOI) 

33. [33].Sabatti C, Service SK, Hartikainen AL, et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics 2009; 41(1): 35–46. doi: 10.1038/ng.271
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.271&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19060910&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000262085300014&link_type=ISI) 

34. [34].Service SK, Teslovich TM, Fuchsberger C, et al. Re-sequencing Expands Our Understanding of the Phenotypic Impact of Variants at GWAS Loci. PLoS Genetics 2014; 10(1): e1004147. doi: 10.1371/journal.pgen.1004147
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pgen.1004147&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24497850&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 

35. [35].Jun G, Wing MK, Abecasis GR, Kang HM. An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data.. Genome Research 2015; 25(6): 918–25. doi: 10.1101/gr.176552.114
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NjoiZ2Vub21lIjtzOjU6InJlc2lkIjtzOjg6IjI1LzYvOTE4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDUvMTgvMjAyMS4wNS4xNS4yMTI1NzI2Ni5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 

36. [36].Moon NC, Zeng L, Cook RJ. Tracing studies in cohorts with attrition: Selection models for efficient sampling. Statistics in Medicine 2018; 37(15): 2354–2366. doi: 10.1002/sim.7646
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/sim.7646&link_type=DOI) 

37. [37].Sitter RR. Robust Designs for Binary Data. Biometrics 1992; 48(4): 1145–1155.
    
    
38. [38].Chaloner K, Verdinelli I. Bayesian Experimental Design : A Review. Statistical Science 1995; 10(3): 273– 304.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1214/ss/1177009939&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1995UN93700003&link_type=ISI) 

39. [39].Zhou X, Joseph L, Wolfson DB, Bélisle P. A Bayesian A-Optimal and Model Robust Design Criterion. Biometrics 2003; 59(4): 1082–1088.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=14969488&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F18%2F2021.05.15.21257266.atom) 

40. [40].Karvanen J, Vanhatalo J, Auranen K, Kulathinal S, Mäntyniemi S. Optimal design of observational studies: overview and synthesis. 2016; [arxiv.org/abs/1609.08347](http://arxiv.org/abs/1609.08347).
    
    
41. [41].Taliun D, Harris DN, Kessler MD, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 2021; 590(7845): 290–299. doi: 10.1038/s41586-021-03205-y
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-021-03205-y&link_type=DOI) 

42. [42].Hu YJ, Li Y, Auer PL, Lin DY. Integrative analysis of sequencing and array genotype data for discovering disease associations with rare mutations. Proceedings of the National Academy of Sciences 2015; 112(4): 201406143. doi: 10.1073/pnas.1406143112
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMDoiMTEyLzQvMTAxOSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzA1LzE4LzIwMjEuMDUuMTUuMjEyNTcyNjYuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

43. [43].Wong KY, Zeng D, Lin DY. Robust Score Tests With Missing Data in Genomics Studies. Journal of the American Statistical Association 2019; 114(528): 1778–1786. doi: 10.1080/01621459.2018.1514304
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/01621459.2018.1514304&link_type=DOI)

 [1]: /embed/inline-graphic-1.gif
 [2]: /embed/inline-graphic-2.gif
 [3]: /embed/inline-graphic-3.gif
 [4]: /embed/graphic-1.gif
 [5]: /embed/graphic-2.gif
 [6]: /embed/inline-graphic-4.gif
 [7]: /embed/inline-graphic-5.gif
 [8]: /embed/inline-graphic-6.gif
 [9]: /embed/inline-graphic-7.gif
 [10]: /embed/graphic-3.gif
 [11]: /embed/graphic-4.gif
 [12]: /embed/inline-graphic-8.gif
 [13]: /embed/inline-graphic-9.gif
 [14]: /embed/inline-graphic-10.gif
 [15]: /embed/inline-graphic-11.gif
 [16]: /embed/inline-graphic-12.gif
 [17]: /embed/inline-graphic-13.gif
 [18]: /embed/inline-graphic-14.gif
 [19]: /embed/inline-graphic-15.gif
 [20]: /embed/inline-graphic-16.gif
 [21]: /embed/inline-graphic-17.gif
 [22]: /embed/inline-graphic-18.gif
 [23]: /embed/inline-graphic-19.gif
 [24]: /embed/inline-graphic-20.gif
 [25]: /embed/inline-graphic-21.gif
 [26]: /embed/inline-graphic-22.gif
 [27]: /embed/inline-graphic-23.gif
 [28]: /embed/inline-graphic-24.gif
 [29]: /embed/inline-graphic-25.gif
 [30]: F3/embed/inline-graphic-26.gif
 [31]: /embed/inline-graphic-27.gif
 [32]: /embed/inline-graphic-28.gif
 [33]: /embed/inline-graphic-29.gif
 [34]: /embed/inline-graphic-30.gif
 [35]: /embed/inline-graphic-31.gif
 [36]: /embed/inline-graphic-32.gif
 [37]: /embed/inline-graphic-33.gif
 [38]: /embed/inline-graphic-34.gif
 [39]: /embed/inline-graphic-35.gif
 [40]: /embed/inline-graphic-36.gif
 [41]: /embed/graphic-11.gif
 [42]: /embed/inline-graphic-37.gif
 [43]: /embed/inline-graphic-38.gif
 [44]: /embed/inline-graphic-39.gif
 [45]: /embed/graphic-12.gif
 [46]: /embed/inline-graphic-40.gif
 [47]: /embed/inline-graphic-41.gif
 [48]: /embed/inline-graphic-42.gif
 [49]: /embed/inline-graphic-43.gif
 [50]: /embed/inline-graphic-44.gif
 [51]: /embed/inline-graphic-45.gif
 [52]: /embed/inline-graphic-46.gif
 [53]: /embed/inline-graphic-47.gif
 [54]: /embed/inline-graphic-48.gif
 [55]: /embed/inline-graphic-49.gif
 [56]: /embed/inline-graphic-50.gif
 [57]: /embed/graphic-18.gif
 [58]: /embed/graphic-19.gif
 [59]: /embed/inline-graphic-51.gif
 [60]: /embed/inline-graphic-52.gif
 [61]: /embed/inline-graphic-53.gif
 [62]: /embed/graphic-20.gif
 [63]: /embed/graphic-21.gif
 [64]: /embed/graphic-22.gif
 [65]: /embed/graphic-23.gif
 [66]: /embed/inline-graphic-54.gif
 [67]: /embed/graphic-24.gif
 [68]: /embed/inline-graphic-55.gif
 [69]: /embed/graphic-25.gif
 [70]: /embed/inline-graphic-56.gif
 [71]: /embed/inline-graphic-57.gif
 [72]: /embed/inline-graphic-58.gif
 [73]: /embed/inline-graphic-59.gif
 [74]: /embed/graphic-26.gif