PT - JOURNAL ARTICLE
AU - Goli, Rohan
AU - Komatineni, Keerthana
AU - Alluri, Shailesh
AU - Hubig, Nina
AU - Min, Hua
AU - Gong, Yang
AU - Sittig, Dean F.
AU - Rennert, Lior
AU - Robinson, David
AU - Biondich, Paul
AU - Wright, Adam
AU - Nøhr, Christian
AU - Law, Timothy
AU - Faxvaag, Arild
AU - Weaver, Aneesa
AU - Gimbel, Ronald
AU - Jing, Xia
TI - Keyphrase Identification Using Minimal Labeled Data with Hierarchical Contexts and Transfer Learning
AID - 10.1101/2023.01.26.23285060
DP - 2024 Jan 01
TA - medRxiv
PG - 2023.01.26.23285060
4099 - http://medrxiv.org/content/early/2024/11/18/2023.01.26.23285060.short
4100 - http://medrxiv.org/content/early/2024/11/18/2023.01.26.23285060.full
AB - Background: Interoperable clinical decision support system (CDSS) rules provide a pathway to interoperability, a well-recognized challenge in health information technology. Building an ontology facilitates creating interoperable CDSS rules, and the ontology can be built by identifying keyphrases (KPs) in the existing literature. Ontology construction is traditionally a manual effort by human domain experts, and recent natural language processing techniques, such as KP identification, can be a critical automatic complement to that effort. However, KP identification requires human expertise, consensus, and contextual understanding for data labeling.
Methods: This paper presents a semi-supervised KP identification framework (bidirectional long short-term memory-based encoders and a conditional random field-based decoder, BiLSTM-CRF) that uses minimal human-labeled data and is based on hierarchical attention over the documents (i.e., at the word, sentence, and abstract levels) and domain adaptation. We created synthetic labels for initial training and human-labeled data for fine-tuning.
We also tested different options during NLP preprocessing and ML training to optimize the ML pipeline.
Results: Our method outperforms the prior neural architectures by learning through synthetic labels for initial training, document-level contextual learning, language modeling, and fine-tuning with limited gold standard label data. After comparison, we found that the BIO encoding schema performed slightly better than Blue, and that domain adaptation techniques can improve the quality of synthetic labels. In addition, document-level context, a pre-trained language model (LM), and pre-trained word embeddings (WE) all contributed to better model performance in our tasks. Adding 2 to 4 human-labeled documents for every 100 synthetically labeled documents improves model performance without exhausting human-labeled documents too quickly.
Conclusions: To the best of our knowledge, this is the first functional framework for the CDSS sub-domain to identify KPs that is trained on limited human-labeled data. It contributes to general natural language processing (NLP) architectures in areas such as clinical NLP, where manual data labeling is challenging and lightweight deep learning models play an important role in real-time KP identification as a complement to human experts' effort.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: The work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM138589 and partially under P20GM121342. This work has also benefited from research training resources and the intellectual environment enabled by the NIH/NLM T15 South Carolina Biomedical Informatics and Data Science for Health Equity (SC BIDS4Health) research training program (T15LM013977).
The content is solely the authors' responsibility and does not necessarily represent the official views of the National Institutes of Health.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
The code is available on GitHub: https://github.com/xjing16/cdss4pcp_nlpml_pipeline.
NLP: Natural language processing
CDSS: Clinical decision support system
HDE: Human domain expert
BiLSTM: Bidirectional long short-term memory
BiLM: Bidirectional language model
CRF: Conditional random field
GS: Gold standard
KP: Keyphrase
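
Illustrative sketch (not taken from the authors' repository): the abstract mentions the BIO encoding schema and a ratio of 2 to 4 human-labeled documents per 100 synthetically labeled documents. The Python below shows one way these two ideas could look in practice. The function names `bio_encode` and `mix_training_set`, the exact-match labeling rule, and the default ratio of 3 are assumptions made for illustration only; the actual pipeline may differ.

```python
import random
from typing import List, Sequence


def bio_encode(tokens: Sequence[str],
               keyphrases: Sequence[Sequence[str]]) -> List[str]:
    """Tag each token as B (begins a keyphrase), I (inside one), or O (outside).

    Hypothetical helper: uses case-insensitive exact matching of token
    sequences, which is an assumption, not the authors' labeling rule.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Match longer phrases first so a shorter phrase cannot claim their tokens.
    for phrase in sorted(keyphrases, key=len, reverse=True):
        target = [w.lower() for w in phrase]
        n = len(target)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            # Only label spans that match and are not already labeled.
            if lowered[start:start + n] == target and all(tags[i] == "O" for i in span):
                tags[start] = "B"
                for i in span[1:]:
                    tags[i] = "I"
    return tags


def mix_training_set(synthetic: Sequence, human: Sequence,
                     human_per_100: int = 3, seed: int = 0) -> list:
    """Add `human_per_100` human-labeled documents per 100 synthetic ones.

    Hypothetical helper: the default of 3 is an assumed value inside the
    2 to 4 range reported in the abstract.
    """
    n_human = min(len(human), (len(synthetic) * human_per_100) // 100)
    rng = random.Random(seed)
    mixed = list(synthetic) + rng.sample(list(human), n_human)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    tokens = "Clinical decision support rules improve interoperability".split()
    phrases = [["clinical", "decision", "support"], ["interoperability"]]
    for token, tag in zip(tokens, bio_encode(tokens, phrases)):
        print(f"{token}\t{tag}")
```

Tag sequences of this form are the usual targets for a sequence labeler such as a BiLSTM-CRF; the mixing helper only controls how many human-labeled documents enter each batch of synthetic ones.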