PT - JOURNAL ARTICLE
AU - Lopez-Pineda, Arturo
AU - Vernekar, Manvi
AU - Grau, Sonia Moreno
AU - Rojas-Muñoz, Agustin
AU - Moatamed, Babak
AU - Michael Lee, Ming Ta
AU - Nava-Aguilar, Marco A.
AU - Gonzalez-Arroyo, Gilberto
AU - Numakura, Kensuke
AU - Matsuda, Yuta
AU - Ioannidis, Alexander
AU - Katsanis, Nicholas
AU - Takano, Tomohiro
AU - Bustamante, Carlos D.
TI - Validating and automating learning of cardiometabolic polygenic risk scores from direct-to-consumer genetic and phenotypic data: implications for scaling precision health research
AID - 10.1101/2022.03.01.22271722
DP - 2022 Jan 01
TA - medRxiv
PG - 2022.03.01.22271722
4099 - http://medrxiv.org/content/early/2022/03/08/2022.03.01.22271722.short
4100 - http://medrxiv.org/content/early/2022/03/08/2022.03.01.22271722.full
AB - Introduction: A major challenge to enabling precision health at a global scale is the mismatch between those who enroll in state-sponsored genomic research and those suffering from chronic disease. More than 30 million people have been genotyped by direct-to-consumer (DTC) companies such as 23andMe, AncestryDNA, and MyHeritage, providing a potential mechanism for democratizing access to medical interventions and thus catalyzing improvements in patient outcomes as the cost of data acquisition drops. However, most of these data are sequestered in the initial provider network, where the scientific community can neither access nor validate them. Here, we present a novel geno-pheno platform that integrates heterogeneous data sources and applies what is learned to common chronic conditions, including type 2 diabetes (T2D) and hypertension.

Methods: We collected genotype data from a novel DTC platform to which participants upload their genotype data files; participants were invited to answer general health questionnaires regarding cardiometabolic traits over a period of 6 months.
Quality control, imputation, and genome-wide association studies (GWAS) were performed on this dataset, and polygenic risk scores (PRS) were built in a case-control setting using the BASIL algorithm.

Results: We collected data on N=4,550 participants (389 cases / 4,161 controls) for T2D, with cases defined as those who reported being affected or previously affected, and on N=4,528 participants (1,027 cases / 3,501 controls) for hypertension. Of 272 variants, 164 showed the same effect direction as previously reported genome-wide significant findings in Europeans. The PRS models achieved an AUC of 0.68, which is comparable to previously published PRS models obtained with larger datasets that include clinical biomarkers.

Discussion: DTC platforms have the potential to invert research models of genome sequencing and phenotypic data acquisition. Quality control (QC) mechanisms successfully enabled traditional GWAS and PRS analyses. The direct participation of individuals has shown the potential to generate rich datasets that enable the creation of cardiometabolic PRS models. More importantly, federated learning of PRS from reuse of DTC data provides a mechanism for scaling precision health care delivery beyond the small number of countries that can afford to finance these efforts directly.

Conclusions: The genetics of T2D and hypertension have been studied extensively in controlled datasets, and various PRS have been developed. We developed predictive tools for both phenotypes trained with heterogeneous genotypic and phenotypic data generated outside of the clinical environment and show that our methods can recapitulate prior findings with fidelity.
From these observations, we conclude that it is possible to leverage DTC genetic repositories to identify individuals at risk of debilitating diseases based on their unique genetic landscape so that informed, timely clinical interventions can be incorporated.

Competing Interest Statement: ARM, SMG, MTML, ALP, CDB, AI, MNA and NK are employees of or consultants to Galatea Bio. MV, KN, YM, and TT are employees of Genomelink. CDB, IA, and NK are shareholders of Galatea Bio stock. CDB, KN, YM, and TT are shareholders of Genomelink stock. The remaining authors declare that there is no conflict of interest regarding the publication of this article.

Funding Statement: This research is based on results obtained from a project, JPNP19001, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

Author Declarations:

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below: This study was approved by the institutional review board (IRB) at WCG IRB (https://www.wcgirb.com/) under IRB tracking number protocol number 20201332.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes

The data that support the findings of this study are available to qualified researchers at non-profit institutions upon entering into an agreement with Genomelink. All information will be shared subject to the above criterion upon request via info{at}genomelink.io.
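
Note on the reported effect-direction concordance: the abstract's Results give 164 of 272 variants with the same effect direction as earlier findings. One way a reader might put that count in context is an exact one-sided sign test against a 50% chance baseline. The sketch below is illustrative only and is not an analysis from the paper. It assumes the variants are independent (linkage disequilibrium may violate this) and that chance concordance is exactly 0.5. The function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(concordant: int, total: int) -> float:
    """Exact one-sided binomial tail P(X >= concordant) for X ~ Binomial(total, 0.5).

    Assumption (not from the source): variants are treated as independent
    coin flips under the null hypothesis of no directional agreement.
    """
    # Sum the binomial coefficients for the upper tail, then divide by 2**total.
    tail = sum(comb(total, k) for k in range(concordant, total + 1))
    return tail / 2 ** total


# Counts taken from the Results section of the abstract.
concordant, total = 164, 272

fraction = concordant / total
p_value = sign_test_p_value(concordant, total)

print(f"concordant fraction: {fraction:.3f}")
print(f"one-sided sign-test p-value: {p_value:.2e}")
```

A small p-value here would only indicate that the agreement in direction is unlikely under the simplified null stated above; it says nothing about effect sizes or about the AUC of the PRS models.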