Abstract
Importance Many clinically significant health conditions are frequently underreported, underdiagnosed or recorded only in unstructured textual health records, even though they carry critical information for patient assessment, care and prognosis.
Objective To determine whether deep learning-based natural language processing employed for named entity recognition can effectively identify health conditions, such as incontinence, falls, mobility limitations and loneliness, in unstructured textual electronic health records. The identified conditions were further used to predict all-cause mortality.
Design This cohort study utilized electronic health records from public primary, secondary, tertiary, long-term and home care from 2010 to 2022, providing up to 12 years of follow-up. The named entity recognition task to identify incontinence, falls, mobility limitations, and loneliness was implemented using Google’s Bidirectional Encoder Representations from Transformers deep learning model pre-trained for the Finnish language. Diagnostic codes for incontinence and falls were collected for comparisons.
Setting Retrospective electronic health records across the Central Finland wellbeing services county.
Participants Structured summary data and 10.6 million free-text entries from 102,525 patients aged 50 to 80 years at baseline.
Exposure Incontinence, falls, mobility limitations and loneliness were considered as exposures.
Main Outcomes and Measures The performance of the named entity recognition (NER) models was evaluated by precision, recall and F1 scores benchmarked against human ratings. Cox regression models were used to compare how well falls and incontinence onsets identified by NER and by diagnostic codes predicted all-cause mortality.
Results The deep learning model demonstrated excellent performance with recall, precision and F1 scores of 0.86, 0.88, and 0.87 for falls; 0.84, 0.78, and 0.81 for incontinence; 0.86, 0.84, and 0.85 for mobility limitations and 0.91, 0.84, and 0.87 for loneliness, respectively. Compared to diagnostic codes, named entity recognition identified greater numbers of falls (31,987 vs 4,090) and incontinence (7,059 vs 3,873) onsets and yielded greater hazard ratios: 1.31 vs 1.04 for falls and 1.99 vs 0.65 for incontinence.
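As a reading aid, the reported F1 scores can be cross-checked against the reported recall and precision using the standard definition of F1 as their harmonic mean. The sketch below is illustrative only: the dictionary and function names are hypothetical, and it does not reproduce the authors' evaluation pipeline, which may compute the metrics from entity-level counts.

```python
# Cross-check of reported scores. Values are copied from the Results;
# the names REPORTED, f1_score and recompute_f1 are illustrative assumptions.

# condition -> (recall, precision, reported F1)
REPORTED = {
    "falls": (0.86, 0.88, 0.87),
    "incontinence": (0.84, 0.78, 0.81),
    "mobility limitations": (0.86, 0.84, 0.85),
    "loneliness": (0.91, 0.84, 0.87),
}


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recompute_f1(reported: dict) -> dict:
    """Recompute F1 per condition, rounded to two decimals like the abstract."""
    return {
        name: round(f1_score(precision, recall), 2)
        for name, (recall, precision, _reported_f1) in reported.items()
    }


if __name__ == "__main__":
    recomputed = recompute_f1(REPORTED)
    for name, (_recall, _precision, reported_f1) in REPORTED.items():
        status = "consistent" if recomputed[name] == reported_f1 else "differs"
        print(f"{name}: reported {reported_f1}, recomputed {recomputed[name]} ({status})")
```

Because the inputs are already rounded to two decimals, agreement here only shows that the reported triples are internally consistent; it says nothing about how the underlying counts were obtained.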
Conclusions and Relevance Deep learning-based named entity recognition models reliably identified incontinence, falls, loneliness and mobility limitations in free-text medical records, presenting new opportunities to use unstructured clinical data to identify vulnerable patients and apply the method in research applications.
Question Can deep learning-based natural language processing (NLP) identify health conditions, such as incontinence, falls, mobility limitations and loneliness, in unstructured electronic health records (EHRs)?
Findings The results of this cohort study demonstrate that a deep-learning NLP model can effectively identify incontinence, falls, mobility limitations and loneliness in textual EHR data. This approach also results in improved mortality prediction compared to available diagnostic codes.
Meaning NLP approaches could be used to identify underreported and underdiagnosed health conditions in textual EHR data, enabling identification of vulnerable and at-risk patients.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was funded by grants (no. 349335 and 349336) from the Research Council of Finland, the Instrumentarium Science Foundation, the Sigrid Juselius Foundation and the Research Council of Finland funding to Tampere University for strategic profiling in health data science (PROFI-6, 2021-2026) and Samfundet Folkhalsan.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The Human Sciences Ethics Committee of the University of Jyväskylä has certified these conditions pertinent to our study and stated that an ethical review is not required.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
Electronic health records used in this study are personally identifiable and hence not publicly available. Our code and pipelines, implemented as Python Jupyter notebook scripts, are available at: https://github.com/jakelin212/frailty_nlp_ner