A probabilistic graphical model for estimating selection coefficient of nonsynonymous variants from human population sequence data
==================================================================================================================================

* Yige Zhao
* Guojie Zhong
* Jake Hagen
* Hongbing Pan
* Wendy K. Chung
* Yufeng Shen

## Abstract

Accurately predicting the effect of missense variants is important in discovering disease risk genes and clinical genetic diagnostics. Commonly used computational methods predict pathogenicity, which does not capture the quantitative impact on fitness in humans. We developed a method, MisFit, to estimate missense fitness effect using a graphical model. MisFit jointly models the effect at a molecular level (*d*) and a population level (selection coefficient, *s*), assuming that in the same gene, missense variants with similar *d* have similar *s*. We trained it by maximizing probability of observed allele counts in 236,017 European individuals. We show that *s* is informative in predicting allele frequency across ancestries and consistent with the fraction of *de novo* mutations in sites under strong selection. Further, *s* outperforms previous methods in prioritizing *de novo* missense variants in individuals with neurodevelopmental disorders. In conclusion, MisFit accurately predicts *s* and yields new insights from genomic data.

## Main

Missense variants, which cause single amino acid changes in proteins, are the most common type of variant in protein-coding regions. They are major contributors to genetic risk of developmental disorders1-3, cancer, and other diseases. However, as missense variants have a potentially broad range of functional impact but generally a low chance of recurrence, most missense variants identified in cohorts or clinical sequencing have uncertain effect4-9. Deep mutational scanning (DMS) assays can assist with interpretation of missense variants10-31, but there is limited scalability as different proteins have different and multifaceted functions that require different functional assays. Therefore, computationally predicting the effect of missense variants is important to support the scale required for novel disease gene discovery and interpretation.

Although many methods have been developed to predict variant effects, there is a long-standing ambiguity of the concepts used to describe variant effect. We adopt a set of definitions32 to explain the related causes and consequences specifically for different aspects of missense variant effect (Supplementary Fig. 1). At the molecular level, we define the effect (*d*) as change of abundance, localization, or function of a protein. At organism level, a damaging variant (with larger *d*) is defined as pathogenic if it increases the risk of diseases or conditions. Pathogenic variants are often the focus in human genetic studies and clinical testing. Databases like ClinVar9 and HGMD33 have curated pathogenic variants, which are used as the training labels in supervised methods, such as CADD34, REVEL35, M-CAP36, gMVP37, VEST38, MetaSVM39, MVP40 and MPC41. Although these methods have proven helpful, they usually suffer from inconsistent performance across genes, since most of the curated pathogenic variants are from only a few thousand genes that are well-established as disease-associated. We suggest that *predicted pathogenicity* is an uncertain aggregation of variant functional effect, gene risk, and even disease properties. Our knowledge of gene to disease association is incomplete and in fact, identification of new associations is a primary goal of predicting variant effect in genetic studies. Therefore, we seek other metrics for describing missense effects in prediction that can be quantified without knowing gene-disease associations.

One such metric is selection coefficient (*s*) which quantifies the fitness effect of variants in a population. A pathogenic variant is usually subject to negative selection in human populations. Although *s* of a variant depends on the penetrance of the variant to various conditions and the total fitness effect of the conditions, the consequence of *s*, especially of heterozygotes, can be observed in allele frequencies in human populations42. It is therefore theoretically feasible to estimate *s* without knowing any traits with which the variant is associated. Biobank-scale genomic sequencing efforts4-8 have generated a large number of human population genome sequences that enable estimation on heterozygous selection coefficient of gene-aggregated protein truncating variants (PTVs)43-46. However, estimation for missense variants is much more challenging because we cannot reasonably assume all or most of missense variants in one gene have the same selection coefficient. Existing prediction of selection for individual variants45 does not directly utilize protein context, and is still based on a very small sample size.

Here we describe a new method, MisFit, to jointly predict molecular effect and human fitness effect of missense variants through a probabilistic graphical model. We aimed to estimate selection coefficient for variants under moderate to strong negative selection. In the model, the molecular effect depends on amino acid change in the protein context, and heterozygous selection coefficient depends on molecular effect of the variant and gene-level importance in selection in human populations. We trained the model using population genome data without pathogenicity labels and evaluated it using deep mutational scan readout data and *de novo* and inherited variants in developmental disorders.

## Results

### Using Poisson-Inverse-Gaussian distribution to model allele counts in human populations

The distribution of allele counts in population sequencing samples is determined by heterozygous selection coefficient (*s*), mutation rate (*v*) and number of chromosomes (*n*). To infer *s*, we first need to model the probability of observed allele counts *p*(*m*|*s*; *v, n*). Allele frequency *q* of a variant at equilibrium state equals *q* = *v*/*s*, and therefore the allele count *m* follows a Poisson distribution with an expectation *nv*/*s*. When taking genetic drift into account, the distribution has strong overdispersion. Nei’s model47 describes allele count as a Negative Binomial distribution with an additional parameter, effective population size *N**e*. However, as there was exponential growth in recent generations48,49, *N**e* is not a constant, and there is no closed form to describe *p*(*m*|*s*; *v, n*). Here, we used a long-tailed distribution, Inverse-Gaussian (IG) distribution, to approximate the distribution of *q*, which results in a Poisson-Inverse-Gaussian (PIG) distribution of *m*. The parameters associated with the PIG distribution are functions of *s, v, n*, which are optimized prior to MisFit training steps by simulated allele frequencies given dense grids of *s, v* and European effective population size history (Methods). We are mainly interested in those rare variants with relatively large *s*; therefore, we chose to optimize the distribution for the recent exponential population growth. In this way, we were able to easily obtain a tractable gradient to *s* with a time complexity independent to *n*.

To investigate how to approximate allele frequency distribution in a finite and expanding population, we performed a simulation based on a demographic history model of European population48. Given *v* and *s*, we sampled each generation by a Wright-Fisher process (Methods). We set the final effective population size to 1.5 million, as it best fits the distribution of observed sample allele counts of rare synonymous C-to-T variants in methylated CpG sites with high roulette50 mutation rate (*v* > 10−7 per generation) (Supplementary Fig. 2). This final population size is smaller than recent work45,46 (5 million), which is optimized for all variants with gnomAD mutation rate (with an lower average *v* ∼ 6 × 10−9).

We fitted the PIG model parameters (Supplementary Fig. 3) based on simulated variants under different settings of *v* ∈ [10−9, 3 × 10−7], *s* ∈ [10−6, 1]. When *s* is small, random drift makes the distribution of allele counts more resemble a Negative Binomial distribution with small *N**e*. When *s* is large, the distribution is closer to a Poisson distribution, as these variants are likely to emerge recently when the effective population size is large (**Fig. 1a,b**). The PIG model fits the simulated results better than other simple distribution models in all ranges (Supplementary Fig. 4).

![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/13/2023.12.11.23299809/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2023/12/13/2023.12.11.23299809/F1)

Figure 1. Poisson-Inverse-Gaussian (PIG) distribution with adjusted parameters to approximate allele count distribution.
**a** mean and **b** variance of sample allele frequency under different population genetics models, including our PIG model and Negative Binomial model with different effective population size. Diploid sample size is 200K. **c** The accuracy of MLE estimation of *s*. Here *s* is a categorical variable of {0.00001, 0.0001, 0.001, 0.01, 0.1, 1}. Accuracy is measured by the proportion that the estimated categorical *s* equals the simulated in 400 simulated groups. Each group contains a certain number of variants (x-axis) with same *s*. Solid lines are samples from a single population, while dashed lines are samples from two populations (half of the indicated number for each population).

### Feasibility of estimating selection coefficient for a group of variants

Given the generally low mutation rate at 10−8, the highest probability usually lies at 0 count regardless of *s* (Supplementary Fig. 4), so it is nearly impossible to precisely estimate *s* for individual single nucleotide variants only using allele counts. We therefore investigated the feasibility of estimating *s* for a group of variants with similar *s*. We aggregated certain numbers of sites simulated from the same *s* as a group (**Fig. 1c**). We investigated whether the maximum-likelihood-estimation (MLE) for the whole group based on the PIG model is consistent with the simulation condition.

For deleterious variants of *s* > 0.01 with high mutation rate, the accuracy is high. More variants to aggregate and higher mutation rate always helps with better estimation. The PIG model does not provide good performance for *s* < 10−4, because randomly including or excluding a common variant in the group can significantly change the joint likelihood. Notably, increasing sample size in a single population only helps with variants under strong selection (*s* > 0.01) (Fig. 1c, Supplementary Fig. 5). The over-dispersion of allele counts for milder variants mostly comes from the uncertainty of allele frequency (the long-tailed distribution of *q*) due to genetic drift, rather than from sampling (the Poisson distribution given *nq*). Adding samples from another population improves accuracy more than from the same population. Based on the results, we implicitly group missense variants by the degree of damage (*d*) in the same gene in the MisFit model.

### MisFit model structure and training process

We describe the architecture of MisFit in **Fig. 2**. The degree of damage (*d*) for each single amino acid substitution depends on the protein sequence and structure context. Here we used the embeddings (*x*) from the last layer in the masked protein language model, ESM-2 (650M)51 to capture the protein sequence and structure context. We added additional transformer blocks and fully-connected dense layers to generate a distribution of *d* (Supplementary Fig. 6b). Rescaling and normalization of *d* by a gene-level, species-averaged selection strength gives out probabilities of each amino acid at the position. The heterozygous selection coefficient (*s*) depends on *d* and the gene-level selection strength in the human population. Here we modeled *s* in the logit scale as linear to *d*. We set a global prior for the maximum missense selection coefficient for each gene (*s**gene*, the value of *s* when *d* equals 1). (Methods). Finally, probability of generating allele count *n* given *s* in given by the PIG model as previously described.

![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/13/2023.12.11.23299809/F2.medium.gif)

[Figure 2.](http://medrxiv.org/content/early/2023/12/13/2023.12.11.23299809/F2)

Figure 2. MisFit model for estimating molecular and fitness effect.
a) Overview of MisFit model. MisFit\_D is learned from ESM-2 (650M) protein embedding, and generates probability of amino acid in orthologues. Heterozygous s is linear to d in logit scale, with gene-level maximum from a global prior. MisFit_S is a point estimate of s per variant, which maximizes the allele count likelihood in population samples. b) MisFit model in view of a probabilistic generative process.

In the first stage, we trained the model to estimate parameters in transformer and dense layers, to maximize the log likelihood with allele counts, amino acid in orthologues52,53 (Methods) and ESM-2 zero-shot prediction. During this training stage the model is mainly trained by all possible missense SNVs in 4,073 constrained genes (missense z score > 2 or gnomAD6,7 pLI > 0.5) well covered with both mammal sequence alignment and human population sequence, because damaging variants in these genes are expected to be under relatively strong selection, and thus the difference in molecular effect can result in a broad range of selection coefficient for training. Allele counts in 236,017 samples are used in training, including 145,103 UKBB8 unrelated European samples and 90,914 gnomAD7 Non-Finnish European population samples. Then we updated *s**gene* for all genes.

In the second stage, with the estimated *d* and *s**gene*, we performed variational inference to approximate the posterior distribution of *s* for each missense SNV. Here, *s* is learned by training another dense neural network as functions of *d* and population data, to enable efficient inference in one forward-pass. (Supplementary Fig. 6b, Methods)

### Comparison of gene-level constraint

MisFit-estimated *s**gene* quantifies gene-level selection strength on missense variants. Commonly used metrics for such information include gnomAD missense z score and o/e. Though *s**gene* for each gene generally correlates well with both metrics (Supplementary Fig. 7), they represent different aspects. Missense z score is effectively the significance level assuming a Poisson distribution from the expected number of variants. Thus, when a gene is short, missense z score tends to have a small absolute value. *s**gene* and o/e directly represent the degree of constraint, although the uncertainty for short genes might be large.

We also used the model to estimate *s* for protein truncating SNVs (PTVs) in each gene using the same population samples. We assume all PTVs in a gene have the same *s* and the gene-level *s**PTV* has a logit-normal distribution prior (Method). The estimated heterozygous selection coefficient is highly correlated with previous results44-46 (Supplementary Fig. 8) for 5,876 genes with *s**PTV* > 0.01.

PTVs mainly decrease protein levels by nonsense mediated decay. As most of the missense variants are hypomorphs with partial loss of function, *s**gene* and *s**PTV* are highly correlated (**Fig. 3a**). However, some variants can be damaging through mechanisms other than loss of function. We highlighted risk genes with known genetic modes54 (**Fig. 3b-e**, Supplementary Table 1). Autosomal recessive genes are least intolerant of PTVs compared with other genes associated with dominant inheritance. Haploinsufficient genes are under strong selection on PTVs. Genes with dominant negative effects are likely to be under strong selection on missense and PTVs. Notably, for gain-of-function, a subset of genes are only constrained on missense but not on PTVs. For example, several germline missense variants in oncogene *KRAS* lead to Noonan syndrome by hyperactivation of the protein55. The gene-level selection on missense variants is significantly higher than PTVs (*s**gene* = 0.21, *s**PTV*, = 0.0013). Similar properties are observed for other RAS proteins with gain-of-function variants (*HRAS*: *s**gene* = 0.067, *s**PTV* = 0.0015, *NRAS*: *s**gene* = 0.26, *s**PVT* = 0.0076).

![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/13/2023.12.11.23299809/F3.medium.gif)

[Figure 3.](http://medrxiv.org/content/early/2023/12/13/2023.12.11.23299809/F3)

Figure 3. Gene-level missense selection Sgene compared with selection coefficient of protein-truncating variants.
**a** the distribution of all genes (black contour) with Pearson correlation coefficient. **b-e** in genes harboring variants of known mechanisms (red contours): **b** autosomal recessive (n = 641) **c** happloinsufficient (n = 66) **d** gain-of-function (n = 69) **e** dominant negative (n = 54).

### MisFit is predictive of allele counts of ultrarare variants in different populations

As MisFit_S is able to predict *s* with amino acid resolution (Supplementary Fig. 9), we asked how informative MisFit_S is to predict allele frequency of rare variants in a population of different ancestries. We extracted 215,138 positions without observed missense variants or with ultra-rare (sample allele frequency < 5 × 10−6) missense variants and high mutation rate (*v* > 10−7) in 4,073 constrained genes of the training set (UKBB and gnomAD NFE, 236,017 samples). We binned the variants by estimated MisFit_S and analyzed the counts in a second population of a different ancestry, which is gnomAD African/African American (AFR) with 28,872 individuals. Putative variants in these positions would have emerged very recently, and their allele frequencies are relatively independent between the two populations. As expected, the proportion of variants with 0-count in gnomAD AFR samples is positively correlated with MisFit_S (Supplementary Fig. 10a). The opposite trend is observed for variants with 10 times higher allele frequency in AFR (Supplementary Fig. 10b). To assess which part of the model helps with prediction of *s*, we built several models with fewer and simpler components. In the baseline model (model 0), *s* is estimated from only the mutation rate and allele counts with a global prior. In this set of high mutation rate variants, allele count is informative as shown in the stepwise curve caused by allele counts of 0, 1, 2 in the training set. However, the difference in absolute value of selection is subtle (Supplementary Fig. 10c). Adding the gene-level selection (model 1) in the model largely smooths the estimation and outputs a wider range of *s*. Using ESM-2 zero-shot score to infer probability of damage (model 2) further helps the prediction, indicated by a greater slope of the monotonic increase, but is not as good as the full MisFit model, which uses the ESM-2 embedding.

### Comparison of selection coefficient with de novo fraction

Next, we evaluate whether MisFit_S approximates heterozygous selection coefficient *s* in absolute scale. We obtained missense *de novo* (16,876 cases, 5,750 controls) and inherited variants (6,507 cases, 2,992 controls) from an autism spectrum disorder study1 (Supplementary Table 2). MisFit_S of *de novo* variants are significantly higher than inherited variants (Supplementary Fig. 11). We binned the variants based on MisFit_S and normalized the counts as per individual (**Fig. 4a**). The difference between cases and controls is significant for *de novo* missense variants for strongly deleterious variants (MisFit_S > 0.01), but is subtle for inherited variants, even if limiting the data to known autism genes (Supplementary Fig. 12).

![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/13/2023.12.11.23299809/F4.medium.gif)

[Figure 4.](http://medrxiv.org/content/early/2023/12/13/2023.12.11.23299809/F4)

Figure 4. *de novo* or inherited missense variants binned by of MisFit_S.
**a** Count of *de novo* or inherited missense variants. **b** The proportion of *de novo* to all variants in autism dataset. Error bars show 95% confidence intervals.

In a new generation when selection has not occurred, *de novo* variants are expected to take up a proportion that is equivalent to *s* when *s* is relatively large44. We aggregated the variants by their selection coefficient and calculated the fraction from the *de novo* variants. The *de novo* ratio in autism cases is consistent with MisFit_S (**Fig. 4b**), indicating the accuracy of estimated *s* in absolute scale. In controls, highly deleterious *de novo* variants are depleted. Similar results are observed for protein-truncating variants of high-confidence estimated by MisFit (Supplementary Fig. 13), consistent with a previous study44.

### Utility of selection coefficient for analysis of de novo variants in developmental disorders

In addition to the autism data, we obtained *de novo* variants from studies of neurodevelopmental disorders (NDD, most individuals have global developmental delay or intellectual disability) 56 (31,565 cases) (Supplementary Table 2). Previous studies1,3,56 have shown that a substantial fraction of *de novo* missense variants in these cases are risk variants for NDD. Autism and NDD are relatively common conditions with early-onset phenotypes. Autism has a prevalence approaching 0.02857, and selection on autism is around 0.758. Thus, highly penetrant risk variants are not likely to be transmitted into the next generation, resulting in a high selection coefficient. As expected, *de novo* variants in cases have a higher MisFit_D and MisFit_S than controls (**Fig. 5**). We compared our results with other missense variant effect prediction methods34-37,51,59-61. Although there is no ground truth to know which variants actually increase disease risk, we could calculate the enrichment of variants under different thresholds, which is the ratio of number of variants in cases to what is expected in controls (Methods). Among variants ranked in the top 2 to 10 percentiles, MisFit_S reached a higher enrichment ratio (**Fig. 6**) than any other method.

![Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/13/2023.12.11.23299809/F5.medium.gif)

[Figure 5.](http://medrxiv.org/content/early/2023/12/13/2023.12.11.23299809/F5)

Figure 5. Distribution of MisFit\_D and MisFit\_S for *de novo* variants in autism and NDD datasets.
NDD: neurodevelopmental disorders.

![Figure 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/13/2023.12.11.23299809/F6.medium.gif)

[Figure 6.](http://medrxiv.org/content/early/2023/12/13/2023.12.11.23299809/F6)

Figure 6. Enrichment ratio of *de novo* missense variants under different thresholds for multiple prediction methods.
Thresholds of MisFit_S are annotated. The solid horizontal line is the overall enrichment ratio including all variants, the dashed line stands for no enrichment.

We then derived the precision-recall-proxy curves (Supplementary Fig. 14, Methods) by the excess number of variants under thresholds. MisFit_S outperforms other methods in high precision range, reaching a precision of 0.70 and 0.89 for autism and NDD, respectively, at MisFit_S = 0.03. The next best methods are AlphaMissense61 and gMVP37. The estimated precision can serve as weights or informative priors in statistical methods like DeNovoWEST56 or extTADA62-64 to improve the power in risk gene discovery. The selection coefficients estimated by baseline methods with fewer components are also informative in enrichment of *de novo* variants but are inferior to MisFit (Supplementary Fig. 15-16).

### MisFit identifies damaging variants consistent with deep mutational scan data

MisFit-estimated *d* (MisFit_D) is about the molecular effect of missense variants, which can be partly measured by deep mutational scanning (DMS) experiments. We compared MisFit_D with published methods34-37,51,59-61 on predicting damaging variants in DMS for individual genes. First, we collected functional readout scores from 32 DMS assays in 26 genes11-17,19-31 with 44,100 single amino acid substitutions (Supplementary Table 3). We calculated the Spearman correlation between the functional scores and computational scores (**Fig. 7a**). MisFit_D has a similar performance with ESM and AlphaMissense.

![Figure 7.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/13/2023.12.11.23299809/F7.medium.gif)

[Figure 7.](http://medrxiv.org/content/early/2023/12/13/2023.12.11.23299809/F7)

Figure 7. Performance in predicting damaging variants in deep mutational scanning assays and cross-gene consistency.
a) Spearman correlation coefficient of predicted scores with functional scores from deep mutational assays. Mean is annotated in red. b) AUROC of predicting confidently labeled damaging or benign variants in deep mutational assays. Mean is annotated in red. c) MCC in each gene with a global threshold that achieves best MCC in the combined dataset. Mean is annotated in red. d) Sensitivity in different genes when setting a threshold to achieving a global sensitivity of 0.5 (dashed) in the combined dataset. Standard deviation is annotated in blue. For b-d, different assays of same gene are combined so that variants with a damaging label in any of the assays will be regarded as damaging.

As raw functional readouts from these experiments could be noisy, we further restricted the sets to variants in 13 genes with DMS annotated binary labels or with a bi-modal functional score distribution (Supplementary Fig. 17). For the latter, we labeled damaging variants by two-component Gaussian Mixture models for each assay independently (Methods, Supplementary Table 4). For genes with multiple DMS assays, we combined these datasets and label a variant as damaging if it is damaging in any one of the assays. The average area under ROC curve (AUROC) for MisFit_D still approaches the state-of-art performance (**Fig. 7b**).

In some genetic analysis, we often set a heuristic and fixed threshold across all genes when selecting possibly damaging variants. To evaluate the performance under this setting, we combined the DMS assays across genes, and tested the performance in the combined dataset. Since the labels are unbalanced, we define the optimal threshold as that which achieves the highest Matthew’s correlation coefficient (MCC) in the combined dataset. When setting this optimal threshold for classification, we calculated the MCC in each individual gene. MisFit_D remains effective, meaning that the prediction is consistently informative across genes (Fig. 7c). MisFit_D is intended to quantify the degree of damage solely based on variant-level property, and we expect it to be distributed similarly across genes. In contrast, selection coefficient (MisFit_S) is by nature determined by both variant- and gene-level properties and should not have the same range in different genes. Supplementary Fig. 18 shows gene-specific score distribution and optimal threshold.

Finally, we investigated the distribution of sensitivity in different genes (Supplementary Fig. 19). Sensitivity is only related to the damaging variants in the dataset. Deep mutational scanning assays are usually designed to evaluate only one aspect of gene function, so the identified damaging variants could be more reliable, while benign ones may disrupt the protein in some other ways not evaluated by the assays. Under a threshold achieving a global sensitivity of 0.5, MisFit_D has a low variance across genes (Fig. 7d). Overall, unsupervised methods (MisFit, ESM1b and ESM2) have lower variance of sensitivity across genes than supervised methods (gMVP, REVEL, and AlphaMissense).

## Discussion

We developed a probabilistic graphical model, MisFit, to estimate the fitness effect of missense variants using large population sequencing data. Selection coefficient (*s*) is a quantitative measurement of fitness effect that can be informed by allele frequency in human populations, but it is very difficult to estimate for individual variants. MisFit addresses this issue by modeling it as a sigmoid-shaped function of the molecular effect *d* of a variant with a gene-specific prior, and jointly modeling *d* as a non-linear function, approximated by deep neural networks, of its protein sequence context. We trained the model using large sets of population sequencing data without any label of pathogenicity. The estimated *s* is highly correlated with frequency of ultra-rare variants in an independent population. Its value is consistent with theoretical expectation of the proportion of *de novo* mutations among observed variants in a population.

Previous efforts in estimating gene-specific6,7,65 or sub-genic41,66 regional constraints of missense variants showed the feasibility of using human population data to identify coding regions that are under strong selection, but these methods are heuristic and do not estimate the effect of individual variants. MisFit is based on population genetics models, representing an improved approach of using large-scale population sequencing datasets for estimating variant effect. Additionally, the effect of a variant at organism and population levels is a combination of how the variant alters the protein and how the protein is involved in key biological processes relevant to human traits and diseases. Two variants with the same degree of damage to protein function may have different effects at the organism and population levels if they occur in different genes. Methods that predict pathogenicity by supervised learning are confounded by gene-level properties, as shown by the large variance of classification accuracy across genes given a fixed threshold evaluated by deep mutational scan data. MisFit’s graphical model is designed to untangle the gene-variant confounding. As a result, MisFit\_D has a more consistent scale across genes assessed by mutational scan data; and MisFit\_S, as a natural combination of variant and gene properties, has superior performance in prioritizing *de novo* variants in studies of developmental disorders that have strong negative consequence in fitness.

In a longer timescale across species, negative selection is manifested in conservation among homologous sites. Some unsupervised models, such as ESM51,67,68 and EVE60, predict amino acid probabilities using representation learning based on massive amounts of protein sequences or multiple sequence alignment of homologous proteins. Those alleles used in training are effectively neutral to become nearly fixed in the corresponding species69. When further taking phylogenic history into account, observed sequences are correlated, but the distribution may deviate from the stationary distribution of fitness landscape. Although these models are empirically effective70,71, resolution of estimating relatively large *s* is poor as all deleterious variants are likely to be depleted from the collection of wild-type sequences.

The resolution in estimating relatively large *s* is especially important for analysis of rare variants in genetic studies of early onset conditions. If we assume an early onset condition is the main trait under selection for a risk variant, then the selection of the variant could be approximated as prevalence × relative risk × selection of the condition. Thus, risk variants in conditions with high prevalence and low fecundity, such as intellectual disability and autism58, tend to have large selection coefficient. This explains why MisFit shows superior performance in prioritizing *de novo* variants in autism or NDD datasets. Additionally, the fitness effects of protein truncating variants and missense variants estimated by MisFit using the same data are directly comparable in a quantitative way. This could improve the power of identifying new risk genes and help characterize genetic etiology of human diseases.

We used the embeddings from a protein language model (ESM-2) to represent protein sequence context as the input for the non-linear function that predicts the effect at molecular level. ESM-2 embeddings implicitly capture protein structure information51. Explicitly representing protein structure features as input 61,53 may improve prediction by better capturing residue interactions and critical sites.

Finally, based on the simulation results, the accuracy of MisFit in estimating mildly deleterious variants (*s* < 0.001) is limited. Random drift of these variants causes significant dispersion of allele frequency. Merely increasing the sample size of the same population does not help with the estimation. On the other hand, including diverse populations with different continental ancestry in training would improve the accuracy, as ancestral effective population size increases and variance from genetic drift decreases. We expect that the sample size of non-European individuals with genome sequencing will increase substantially in the near future from ongoing efforts such as gnomAD7, All of Us5, GenomeAsia72, and the Three Million African Genomes project 73. We will be able to use these data to improve estimation of fitness effect of variants under moderate selection in the future.

## Methods

### Simulation based on European effective population size history

We simulated the distribution of allele frequency based on the history of effective population size of European population for 10,000 generations. We obtained the European effective population size history from the Schiffels and Durbin model48. We smoothened the data by setting a growth rate for each period and adjusted the final effective population size to 1.5 million, which is most consistent with distribution of observed allele counts of rare synonymous variants with high roulette mutation rate (*v* > 10−7). We assume no linkage and the same background mutation rate of the average mutation rate, no positive selection effect, and each locus obtains one type of mutation at most. We simulated the evolution of alleles with dense grids of mutation rates *v* ∈ [10−9, 3 × 10−7], and selection coefficients *s* ∈ [10−6, 1]. For a given mutation rate and selection coefficient, we simulated 100,000 independent sites. The simulation follows the Wright-Fisher process considering mutation, drift and selection. We set a backward mutation rate of *v* = 10−8. Suppose the effective population size at *t**th* generation is *N**t*, we have ![Formula][1]</img>  ![Formula][2]</img>  *f**t*−1 and *q**t*−1 are the pre-selection and post-selection allele frequency in the previous generation, and *f*′*t*. is allele frequency in zygotes after introducing new mutations. Here, 2*s* (clipped at 1 if *s* > 0.5) is homozygous selection coefficient by fixing a dominance factor of 0.5. Then we sample population allele counts in the new generation by a binomial distribution: ![Formula][3]</img>  ![Formula][4]</img>  In the latest generation, sample allele counts *m* within sample allele number *n* drawn from population could be regarded as a Hypergeometric distribution. ![Formula][5]</img>  As we have *m**final* ∼ *Binomial*(2*N**final*, *f*′*final*), this is equivalent to ![Formula][6]</img>  Considering the age of sequencing samples (UK Biobank and gnomAD) are relatively old, the observed alleles are already subject to selection. We therefore used the adjusted post-selection allele frequency for training the model. ![Formula][7]</img>  To investigate how a second population with a different genetic ancestry can help with estimation, we simulated a pseudo-population with the same European population size history. Here, *q* is kept same for both populations at the beginning, and then evolves independently for the recent *N**r* generations. We set *N**r* to be 2,000 based on the split time of European and Africa population. In this way, the final *q* in two populations are partially correlated.

### Modeling allele counts

Assuming infinite effective population size, allele frequency *q* at equilibrium state is deterministic given mutation rate *v* and heterozygous selection coefficient *s*. ![Formula][8]</img>  Therefore, the allele count *m* in samples with allele number *n* follows a Poisson distribution: ![Formula][9]</img>  Although the formula gives us an overview of the relationship between expected *m* and *s*, there is a substantial overdispersion of *m* caused by random drift effect. Taking random drift into account, Nei’s model47 describes *q* as a Gamma distribution. ![Formula][10]</img>  *N**e* is the effective population size. Then we have a Negative Binomial distribution for *m*. ![Formula][11]</img>  However, the real *N**e* is not constant. There has been exponential population growth in all major continental populations. We used an Inverse Gaussian model with adjusted parameters *μ**IG*, *λ**IG* to describe the distribution of allele frequency. Inverse Gaussian distribution can model a very long tail while keeping the probability density at 0 to be 0 (In contrast, Gamma distribution may give out infinity density at 0). More importantly, the likelihood function *p*(*m*|*s*; *v, n*) should have a tractable gradient to *s*. Then *m* follows a Poisson Inverse Gaussian (PIG) distribution: ![Formula][12]</img>  ![Formula][13]</img>  μ*IG* and *λ**IG* are Inverse Gaussian mean and shape respectively. For each setting of *v, s*, we used the simulated allele frequency *q*s*im* to estimate ![Graphic][14]</img>. Then we fit functions *μ**IG* = *f*1(*v, s*) and *λ**IG* = *f*2(*v, s*). Specifically, log *μ**IG* is a softminus over s′ = log it (*s*) and linear over log *v*, while log *λ**IG* is quadratic to log *v* (Supplementary Fig. 3). The likelihood of PIG distribution is calculated by Bessel function of second kind.

### Data used in training and testing

#### Proteins and variants

We limit the gene set to 18,708 protein-coding genes. One protein sequence is selected per gene, based on the following order: 1. Uniprot canonical isoform; 2. Corresponding to the transcript of ‘MANE select’; 3. Corresponding to Ensembl canonical transcript (usually the longest). Among them, 18,605 have available population sequencing data for missense variants and 16,623 for protein-truncating variants. All possible single nucleotide variants in the coding region +-2 bp of the selected transcripts are annotated. For protein-truncating variants, ‘stop-gained’, ‘splice_donor’ and ‘splice_acceptor’ variants are further annotated by LOFTEE6, and only high-confidence (HC) ones are used in training or genetic analysis.

#### Population sequence data

We used the allele counts from UKBB unrelated European population (145,103 exomes from November 2020 release) and gnomAD Non-Finnish European population (56,885 exomes of v2.1.1 plus 34,029 genomes of v3.1.2) sequencing data. We set the allele number (sample size) for positions without observed variants by the allele number of the nearest position with observed variant in the same exon, to account for sequencing depth variation. Same was done for gnomAD African / African American population (8,128 exomes plus 20,744 genomes) in analysis. Variants that do not pass RF, InbreedingCoeff or AS_VQSR filtering, or is located in low-complexity-region (annotated by gnomAD), are excluded in training and analysis. Site-specific mutation rate was mainly obtained from roulette50 mutation rates. Variants on sex chromosomes do not have available roulette mutation rates, so we used gnomAD6 mutation rates based on 3-mer context and methylation level, and calibrated them to an average of 10−8 in consistent with roulette. During training, mutation rate and allele count are added across all single nucleotide variants that lead to the same amino acid change.

#### Protein sequence embeddings

Protein sequence embeddings are extracted from the last layer of ESM-2 (650M) model for each 600 AA length fragments (overlapping 200 AA if longer than 600 AA). The zero-shot prediction of ESM-2 comes from logits value in the last layer of ESM-2 and further renormalized to 20 amino acids excluding other tokens.

#### Mammalian homologues

Homologous variants used in training include: a. 21.8 million alternative amino acids in multiple sequence alignment in 465 mammals from Zoonomia Project52; b. 2.9 million alternative amino acids in 233 primate species from primateAI-3D53.

#### Deep mutational scanning assays

We selected 32 deep mutational scanning assays from literature (Supplementary Table 3). Several experiments provide classification of damaging or benign variants in the publications. For the remaining experiments, we model the functional scores (usually as log enrichment or depletion) by a two-component Gaussian mixture for each experiment. Amino acid substitutions with probability of damaging > 0.75 are defined as damaging and that < 0.25 are defined as benign. We selected experiments with bimodal score distribution, of which the confident damaging + benign variants make up more than 90% of all variants. In total, 13 genes with damaging / benign labels were selected for evaluating AUROC and MCC (Supplementary Table 4). If there are multiple assays for the same gene (*CYP2C9, PTEN, VKORC1*), we took the union of damaging labels as positives.

### MisFit model architecture and parameters

In a basic model, for a gene *i*, the maximum heterozygous selection coefficient for missense variants is denoted as *s**i*. In our model settings, *s**i* is transformed into logit scale ![Graphic][15]</img> to facilitate numeric computation. For each variant *k* at position *j, d**ijk* is assumed to be a random variable of logit-normal distribution. ![Formula][16]</img>  ![Formula][17]</img>  For variant-level heterozygous selection coefficient ![Graphic][18]</img>, we assume it’s linear to *d**ijk* (ranging from 0 to 1), where the minimum is set to *log it*(10−4) and the maximum is *s**i*′. ![Formula][19]</img>  First, we trained these global prior parameters using all missense variants in all genes. In each training step, ![Graphic][20]</img> is sampled from the corresponding priors. Next, for each ![Graphic][21]</img>, we also use Normal distribution as the variational inference family and learn the variational parameters ![Graphic][22]</img>. ![Formula][23]</img>  In this part ![Graphic][24]</img> are optimized to maximize ![Graphic][25]</img> and minimize ![Formula][26]</img>  *KL*() stands for Kullback–Leibler divergence.

Finally, we perform variational inference on posterior ![Graphic][27]</img>. ![Formula][28]</img>  In order to retrieve ![Graphic][29]</img> in one forward pass, they are modeled as functions (corresponding to dense layers *NN*2 in stage 2 in Supplementary Fig. 6c) of ![Graphic][30]</img>.

In the full MisFit model, ![Graphic][31]</img> are replaced by priors learned from protein embeddings *x*, which are different for each amino acid change. ![Formula][32]</img>  Additionally, ![Graphic][33]</img> along with amino acid substitution rate and gene-level constraint, is used to generate the probability of existence for mammal variants as a part of the loss function to utilize conservation among mammals.

And then correspondingly, ![Formula][34]</img>  ![Formula][35]</img>  MisFit scores are defined as follows:

MisFit_D: ![Graphic][36]</img>, the mean of ![Graphic][37]</img> transformed back to original scale.

MisFit_Sgene: ![Graphic][38]</img>, the mean of ![Graphic][39]</img> transformed back to original scale.

MisFit_S: ![Graphic][40]</img>, the derived ![Graphic][41]</img> when using the point estimate of ![Graphic][42]</img>, and the point estimate of *d**ijk*. Specifically, ![Graphic][43]</img> is the mean of ![Graphic][44]</img>, and ![Graphic][45]</img> is the mean of ![Graphic][46]</img>. Then the value is transformed back to the original scale.

For protein truncating variants, the model is simplified as follows. ![Formula][47]</img>  ![Formula][48]</img>  And we define:

MisFit_SPTV: ![Graphic][49]</img>, the posterior mean of ![Graphic][50]</img> transformed back to original scale (for protein truncating variants).

In our model, the random variable *s* are all represented in logit scale, and our point estimate of *s* is also inferred in logit scale then transformed back to the original scale. This eases the calculation and potentially limits the systematic bias (Supplementary Note).

### Model training

MisFit contains 4.4M parameters in total. Training of MisFit involves several stages:

Stage 0: train a baseline model for future comparison. Also, use the results to initialize ![Graphic][51]</img>, and set ![Graphic][52]</img> as prior in MisFit.

Stage 1: a. train the prior functions of damage ![Graphic][53]</img> using all missense variants 13,406 genes well covered with both mammal sequence alignment and human population sequence; b. train both ![Graphic][54]</img> and ![Graphic][55]</img> using 4,073 constrained genes (gnomAD6,7 missense z score > 2 or pLI > 0.5); c. further infer ![Graphic][56]</img> for all genes.

Stage 2: train ![Graphic][57]</img> using all missense variants in all genes for posterior inference.

Stage 1 training takes around 10 hours on 2 NVIDIA A40 GPUs.

### Enrichment of de novo variants and estimated precision-recall

*De novo* missense variants in 4 previous genetic studies are used for analysis (Supplementary Table 2). Given a score threshold (to enrich disease risk variants), the number of selected variants is *m*1 and *m* in cases and controls respectively. These numbers are normalized by number of synonymous variants ![Graphic][58]</img> and ![Graphic][59]</img> to calculate the enrichment ratio. ![Formula][60]</img>  Sensitivity (recall approximate) is estimated by the total number of excess of variants comparing cases and control. ![Formula][61]</img>  Precision is estimated by ![Formula][62]</img>  

## Supporting information

Supplementary Notes and Figures [[supplements/299809_file02.pdf]](pending:yes)

Supplementary Tables [[supplements/299809_file03.xlsx]](pending:yes)

## Data Availability

All data produced in the present work are contained in the manuscript

[https://github.com/ShenLab/MisFit](https://github.com/ShenLab/MisFit) 

## Data availability

The code for model training and analysis could be found at [https://github.com/ShenLab/MisFit](https://github.com/ShenLab/MisFit). Prediction results, including MisFit\_D, MisFit\_S for missense variants, and MisFit_Sgene, MisFit_SPTV for each gene, are available at Dropbox.

## Acknowledgements

We thank gnomAD and UK biobank investigators for making genomic data available. We thank the SPARK study and especially the thousands of families participating in SPARK. This work is supported by NIH grants (R35GM149527, R01GM120609, and P50HD109879), Simons Foundation (SFARI #1019623), and Columbia Precision Medicine Pilot grants program. We thank Itsik Pe’er, Molly Przeworski, Mohammed AlQuraishi, Adi Garg, Alan Tian, Haicang Zhang, Na Zhu, Chang Shu, and Lu Qiao for helpful discussions.

*   Received December 11, 2023.
*   Revision received December 11, 2023.
*   Accepted December 13, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at [http://creativecommons.org/licenses/by-nd/4.0/](http://creativecommons.org/licenses/by-nd/4.0/)

## References

1.  1.Zhou, X. et al. Integrating de novo and inherited variants in 42,607 autism cases identifies mutations in new moderate-risk genes. Nature Genetics 54, 1305–1319 (2022).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35982159&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

2.  2.Jin, S.C. et al. Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nature Genetics 49, 1593–1601 (2017).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3970&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28991257&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

3.  3.Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature13908&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25363768&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000344631400039&link_type=ISI) 

4.  4.Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature19057&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27535533&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000381804900026&link_type=ISI) 

5.  5.The “All of Us” Research Program. New England Journal of Medicine 381, 668–676 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMsr1809937&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31412182&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

6.  6.Karczewski, K.J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-020-2308-7&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32461654&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

7.  7.Siwei, C. et al. A genome-wide mutational constraint map quantified from variation in 76,156 human genomes. bioRxiv, 2022.03.20.485034 (2022).
    
    
8.  8.Sudlow, C. et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Medicine 12, e1001779 (2015).
    
    
9.  9.Landrum, M.J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research 44, D862–D868 (2016).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkv1222&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26582918&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

10. 10.Fowler, D.M. & Fields, S. Deep mutational scanning: a new style of protein science. Nature Methods 11, 801–807 (2014).
    
    
11. 11.Bandaru, P. et al. Deconstruction of the Ras switching cycle through saturation mutagenesis. eLife 6, e27810 (2017).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.7554/eLife.27810&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28686159&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

12. 12.Weile, J. et al. A framework for exhaustively mapping functional missense variants. Molecular Systems Biology 13, 957 (2017).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoibXNiIjtzOjU6InJlc2lkIjtzOjk6IjEzLzEyLzk1NyI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzEyLzEzLzIwMjMuMTIuMTEuMjMyOTk4MDkuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

13. 13.Findlay, G.M. et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature 562, 217–222 (2018).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-018-0461-z&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30209399&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

14. 14.Heredia, J.D. et al. Mapping Interaction Sites on Human Chemokine Receptors by Deep Mutational Scanning. The Journal of Immunology 200, 3825–3839 (2018).
    
    
15. 15.Kotler, E. et al. A Systematic p53 Mutation Library Links Differential Functional Impact to Cancer Mutation Pattern and Evolutionary Conservation. Molecular Cell 71, 178-190.e8 (2018).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.molcel.2018.06.012&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29979965&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

16. 16.Matreyek, K.A. et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nature Genetics 50, 874–882 (2018).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-018-0122-z&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29785012&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

17. 17.Mighell, T.L., Evans-Dutson, S. & O’Roak, B.J. A Saturation Mutagenesis Approach to Understanding PTEN Lipid Phosphatase Activity and Genotype-Phenotype Relationships. The American Journal of Human Genetics 102, 943–955 (2018).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ajhg.2018.03.018&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

18. 18.Esposito, D. et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biology 20, 223 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s13059-019-1845-6&link_type=DOI) 

19. 19.Jiang, R.J. M.Sc., University of Toronto (Canada) (2019).
    
    
20. 20.Chan, K.K. et al. Engineering human ACE2 to optimize binding to the spike protein of SARS coronavirus 2. Science 369, 1261–1265 (2020).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEzOiIzNjkvNjUwOC8xMjYxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMTIvMTMvMjAyMy4xMi4xMS4yMzI5OTgwOS5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 

21. 21.Chiasson, M.A. et al. Multiplexed measurement of variant abundance and activity reveals VKOR topology, active site and human variant impact. eLife 9, e58026 (2020).
    
    
22. 22.Jones, E.M. et al. Structural and functional characterization of G protein– coupled receptors with deep mutational scanning. eLife 9, e54895 (2020).
    
    
23. 23.Suiter, C.C. et al. Massively parallel variant characterization identifies NUDT15 alleles associated with thiopurine toxicity. Proceedings of the National Academy of Sciences 117, 5394–5401 (2020).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMToiMTE3LzEwLzUzOTQiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8xMi8xMy8yMDIzLjEyLjExLjIzMjk5ODA5LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

24. 24.Sun, S. et al. A proactive genotype-to-patient-phenotype map for cystathionine beta-synthase. Genome Medicine 12, 13 (2020).
    
    
25. 25.Amorosi, C.J. et al. Massively parallel characterization of CYP2C9 variant enzyme activity and abundance. The American Journal of Human Genetics 108, 1735–1751 (2021).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ajhg.2021.07.001&link_type=DOI) 

26. 26.Jia, X. et al. Massively parallel functional testing of MSH2 missense variants conferring Lynch syndrome risk. The American Journal of Human Genetics 108, 163–175 (2021).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ajhg.2020.12.003&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

27. 27.Weile, J. et al. Shifting landscapes of human MTHFR missense-variant effects. The American Journal of Human Genetics 108, 1283–1300 (2021).
    
    
28. 28.Erwood, S. et al. Saturation variant interpretation using CRISPR prime editing. Nature Biotechnology 40, 885–895 (2022).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41587-021-01201-1&link_type=DOI) 

29. 29.Roychowdhury, H. & Romero, P.A. Microfluidic deep mutational scanning of the human executioner caspases reveals differences in structure and regulation. Cell Death Discovery 8, 7 (2022).
    
    
30. 30.Gersing, S. et al. A comprehensive map of human glucokinase variant activity. Genome Biology 24, 97 (2023).
    
    
31. 31.Warren van, L. et al. Systematically testing human HMBS missense variants to reveal mechanism and pathogenic variation. bioRxiv, 2023.02.06.527353 (2023).
    
    
32. 32.MacArthur, D.G. et al. Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476 (2014).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature13127&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24759409&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000334741600026&link_type=ISI) 

33. 33.Stenson, P.D. et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Human Genetics 133, 1–9 (2014).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s00439-013-1358-4&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24077912&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

34. 34.Rentzsch, P., Witten, D., Cooper, G.M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Research 47, D886–D894 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gky1016&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

35. 35.Ioannidis, N.M. et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. The American Journal of Human Genetics 99, 877–885 (2016).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ajhg.2016.08.016&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27666373&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

36. 36.Jagadeesh, K.A. et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nature Genetics 48, 1581–1586 (2016).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3703&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27776117&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

37. 37.Zhang, H., Xu, M.S., Fan, X., Chung, W.K. & Shen, Y. Predicting functional effect of missense variants using graph attention neural networks. Nature Machine Intelligence 4, 1017–1028 (2022).
    
    
38. 38.Carter, H., Douville, C., Stenson, P.D., Cooper, D.N. & Karchin, R. Identifying Mendelian disease genes with the Variant Effect Scoring Tool. BMC Genomics 14, S3 (2013).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/1471-2164-14-3&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23819870&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

39. 39.Dong, C. et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Human Molecular Genetics 24, 2125–2137 (2014).
    
    
40. 40.Qi, H. et al. MVP predicts the pathogenicity of missense variants by deep learning. Nature Communications 12, 510 (2021).
    
    
41. 41.Kaitlin, E.S. et al. Regional missense constraint improves variant deleteriousness prediction. bioRxiv, 148353 (2017).
    
    
42. 42.Fuller, Z.L., Berg, J.J., Mostafavi, H., Sella, G. & Przeworski, M. Measuring intolerance to mutation in human genetics. Nature Genetics 51, 772–776 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-019-0383-1&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

43. 43.Cassa, C.A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nature Genetics 49, 806–810 (2017).
    
    
44. 44.Weghorn, D. et al. Applicability of the Mutation–Selection Balance Model to Population Genetics of Heterozygous Protein-Truncating Variants in Humans. Molecular Biology and Evolution 36, 1701–1710 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/molbev/msz092&link_type=DOI) 

45. 45.Agarwal, I., Fuller, Z.L., Myers, S.R. & Przeworski, M. Relating pathogenic loss-of function mutations in humans to their evolutionary fitness costs. eLife 12, e83172 (2023).
    
    
46. 46.Tony, Z., Jeffrey, P.S., Hakhamanesh, M. & Jonathan, K.P. Bayesian estimation of gene constraint from an evolutionary model with gene features. bioRxiv, 2023.05.19.541520 (2023).
    
    
47. 47.Nei, M. The frequency distribution of lethal chromosomes in finite populations. Proceedings of the National Academy of Sciences 60, 517–524 (1968).
    
    [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czo4OiI2MC8yLzUxNyI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzEyLzEzLzIwMjMuMTIuMTEuMjMyOTk4MDkuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

48. 48.Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nature Genetics 46, 919–925 (2014).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3015&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24952747&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

49. 49.Tennessen, J.A. et al. Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science 337, 64–69 (2012).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjExOiIzMzcvNjA5MC82NCI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzEyLzEzLzIwMjMuMTIuMTEuMjMyOTk4MDkuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

50. 50.Seplyarskiy, V. et al. A mutation rate model at the basepair resolution identifies the mutagenic effect of Polymerase III transcription. bioRxiv, 2022.08.20.504670 (2023).
    
    
51. 51.Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1126/science.ade2574&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36927031&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

52. 52.Christmas, M.J. et al. Evolutionary constraint and innovation across hundreds of placental mammals. Science 380, eabn3943 (2023).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1126/science.abn3943&link_type=DOI) 

53. 53.Gao, H. et al. The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8153 (2023).
    
    
54. 54.Gerasimavicius, L., Livesey, B.J. & Marsh, J.A. Loss-of-function, gain-of-function and dominant-negative mutations have profoundly different effects on protein structure. Nature Communications 13, 3895 (2022).
    
    
55. 55.Schubbert, S. et al. Germline KRAS mutations cause Noonan syndrome. Nature Genetics 38, 331–336 (2006).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng1748&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=16474405&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000235589600015&link_type=ISI) 

56. 56.Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-020-2832-5&link_type=DOI) 

57. 57.Maenner, M.J. et al. Prevalence and Characteristics of Autism Spectrum Disorder Among Children Aged 8 Years - Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2020. MMWR Surveill Summ 72, 1–14 (2023).
    
    
58. 58.Power, R.A. et al. Fecundity of Patients With Schizophrenia, Autism, Bipolar Disorder, Depression, Anorexia Nervosa, or Substance Abuse vs Their Unaffected Siblings. JAMA Psychiatry 70, 22–30 (2013).
    
    
59. 59.Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics 50, 1161–1170 (2018).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

60. 60.Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-021-04043-8&link_type=DOI) 

61. 61.Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).
    
    
62. 62.He, X. et al. Integrated Model of De Novo and Inherited Genetic Variants Yields Greater Power to Identify Risk Genes. PLOS Genetics 9, e1003671 (2013).
    
    
63. 63.Nguyen, H.T. et al. Integrated Bayesian analysis of rare exonic variants to identify risk genes for schizophrenia and neurodevelopmental disorders. Genome Medicine 9, 114 (2017).
    
    
64. 64.Fu, J.M. et al. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nature Genetics 54, 1320–1331 (2022).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35982160&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

65. 65.Petrovski, S., Wang, Q., Heinzen, E.L., Allen, A.S. & Goldstein, D.B. Genic Intolerance to Functional Variation and the Interpretation of Personal Genomes. PLOS Genetics 9, e1003709 (2013).
    
    
66. 66.Havrilla, J.M., Pedersen, B.S., Layer, R.M. & Quinlan, A.R. A map of constrained coding regions in the human genome. Nature Genetics 51, 88–95 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-018-0294-6&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30531870&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

67. 67.Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences 118, e2016239118 (2021).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxODoiMTE4LzE1L2UyMDE2MjM5MTE4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMTIvMTMvMjAyMy4xMi4xMS4yMzI5OTgwOS5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 

68. 68.Joshua, M. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv, 2021.07.09.450648 (2021).
    
    
69. 69.Ohta, T. The Nearly Neutral Theory of Molecular Evolution. Annual Review of Ecology and Systematics 23, 263–286 (1992).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1146/annurev.es.23.110192.001403&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1992JZ28100011&link_type=ISI) 

70. 70.Eli, N.W., Alan, N.A., Jonathan, F. & Debora, S.M. Non-identifiability and the Blessings of Misspecification in Models of Molecular Fitness. bioRxiv, 2022.01.29.478324 (2023).
    
    
71. 71.Verkuil, R. et al. Language models generalize beyond natural proteins. bioRxiv, 2022.12.21.521521 (2022).
    
    
72. 72.Wall, J.D. et al. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-019-1793-z&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F13%2F2023.12.11.23299809.atom) 

73. 73.Wonkam, A. Sequence three million genomes across Africa. Nature 590, 209–211 (2021).

 [1]: /embed/graphic-8.gif
 [2]: /embed/graphic-9.gif
 [3]: /embed/graphic-10.gif
 [4]: /embed/graphic-11.gif
 [5]: /embed/graphic-12.gif
 [6]: /embed/graphic-13.gif
 [7]: /embed/graphic-14.gif
 [8]: /embed/graphic-15.gif
 [9]: /embed/graphic-16.gif
 [10]: /embed/graphic-17.gif
 [11]: /embed/graphic-18.gif
 [12]: /embed/graphic-19.gif
 [13]: /embed/graphic-20.gif
 [14]: /embed/inline-graphic-1.gif
 [15]: /embed/inline-graphic-2.gif
 [16]: /embed/graphic-21.gif
 [17]: /embed/graphic-22.gif
 [18]: /embed/inline-graphic-3.gif
 [19]: /embed/graphic-23.gif
 [20]: /embed/inline-graphic-4.gif
 [21]: /embed/inline-graphic-5.gif
 [22]: /embed/inline-graphic-6.gif
 [23]: /embed/graphic-24.gif
 [24]: /embed/inline-graphic-7.gif
 [25]: /embed/inline-graphic-8.gif
 [26]: /embed/graphic-25.gif
 [27]: /embed/inline-graphic-9.gif
 [28]: /embed/graphic-26.gif
 [29]: /embed/inline-graphic-10.gif
 [30]: /embed/inline-graphic-11.gif
 [31]: /embed/inline-graphic-12.gif
 [32]: /embed/graphic-27.gif
 [33]: /embed/inline-graphic-13.gif
 [34]: /embed/graphic-28.gif
 [35]: /embed/graphic-29.gif
 [36]: /embed/inline-graphic-14.gif
 [37]: /embed/inline-graphic-15.gif
 [38]: /embed/inline-graphic-16.gif
 [39]: /embed/inline-graphic-17.gif
 [40]: /embed/inline-graphic-18.gif
 [41]: /embed/inline-graphic-19.gif
 [42]: /embed/inline-graphic-20.gif
 [43]: /embed/inline-graphic-21.gif
 [44]: /embed/inline-graphic-22.gif
 [45]: /embed/inline-graphic-23.gif
 [46]: /embed/inline-graphic-24.gif
 [47]: /embed/graphic-30.gif
 [48]: /embed/graphic-31.gif
 [49]: /embed/inline-graphic-25.gif
 [50]: /embed/inline-graphic-26.gif
 [51]: /embed/inline-graphic-27.gif
 [52]: /embed/inline-graphic-28.gif
 [53]: /embed/inline-graphic-29.gif
 [54]: /embed/inline-graphic-30.gif
 [55]: /embed/inline-graphic-31.gif
 [56]: /embed/inline-graphic-32.gif
 [57]: /embed/inline-graphic-33.gif
 [58]: /embed/inline-graphic-34.gif
 [59]: /embed/inline-graphic-35.gif
 [60]: /embed/graphic-32.gif
 [61]: /embed/graphic-33.gif
 [62]: /embed/graphic-34.gif