PT - JOURNAL ARTICLE
AU - Yang, Xi
AU - PourNejatian, Nima
AU - Shin, Hoo Chang
AU - Smith, Kaleb E
AU - Parisien, Christopher
AU - Compas, Colin
AU - Martin, Cheryl
AU - Flores, Mona G
AU - Zhang, Ying
AU - Magoc, Tanja
AU - Harle, Christopher A
AU - Lipori, Gloria
AU - Mitchell, Duane A
AU - Hogan, William R
AU - Shenkman, Elizabeth A
AU - Bian, Jiang
AU - Wu, Yonghui
TI - GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records
AID - 10.1101/2022.02.27.22271257
DP - 2022 Jan 01
TA - medRxiv
PG - 2022.02.27.22271257
4099 - http://medrxiv.org/content/early/2022/02/28/2022.02.27.22271257.short
4100 - http://medrxiv.org/content/early/2022/02/28/2022.02.27.22271257.full
AB - There is an increasing interest in developing massive-size deep learning models in natural language processing (NLP) - the key technology to extract patient information from unstructured electronic health records (EHRs). However, there are limited studies exploring large language models in the clinical domain; the current largest clinical NLP model was trained with 110 million parameters (compared with 175 billion parameters in the general domain). It is not clear how large-size NLP models can help machines understand patients’ clinical information from unstructured EHRs. In this study, we developed a large clinical transformer model – GatorTron – using >90 billion words of text and evaluated it on 5 clinical NLP tasks including clinical concept extraction, relation extraction, semantic textual similarity, natural language inference, and medical question answering. GatorTron is now the largest transformer model in the clinical domain that scaled up from the previous 110 million to 8.9 billion parameters and achieved state-of-the-art performance on the 5 clinical NLP tasks targeting various healthcare information documented in EHRs.
GatorTron models perform better in understanding and utilizing patient information from clinical narratives in ways that can be applied to improvements in healthcare delivery and patient outcomes.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This study was partially supported by a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-2018C3-14754), a grant from the National Cancer Institute, 1R01CA246418 R01, grants from the National Institute on Aging, NIA R56AG069880 and R21AG062884, and the Cancer Informatics and eHealth core jointly supported by the UF Health Cancer Center and the UF Clinical and Translational Science Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding institutions.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes.
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: IRB (202100049) of the University of Florida gave approval for this work as exempt. The approval includes but is not limited to HIPAA waiver to enroll. Peter Iafrate, IRB Chairman, University of Florida.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes.
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes.
All data produced are available online at N2C2: https://n2c2.dbmi.hms.harvard.edu/data-sets