A Supervised Text Classification System Detects Fontan Patients in Electronic Records with Higher Accuracy than ICD Codes

Y Guo; MA Al-Garadi; WM Book; LC Ivey; FH Rodriguez; CL Raskind-Hood; C Robichaux; A Sarker

doi:10.1101/2023.03.01.23286659

Abstract

Background The Fontan operation palliates single ventricle heart defects and is associated with significant morbidity and premature mortality. Native anatomy varies; thus, Fontan cases cannot always be identified by International Classification of Diseases, Ninth and Tenth Revision, Clinical Modification (ICD-9-CM and ICD-10-CM) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing (NLP) based machine learning (ML) models that use patients' free text notes to automatically detect Fontan cases, and to compare their performance with ICD code based classification.

Methods and Results We included free text notes of 10,935 manually validated patients, of whom 778 (7.1%) were Fontan and 10,157 (92.9%) non-Fontan patients, from two large, diverse healthcare systems. Using 5-fold cross validation, we trained and evaluated multiple ML models, namely support vector machines (SVM) and a transformer based model for language understanding named RoBERTa (2 versions), for automatically identifying Fontan cases based on free text notes. To optimize classifier performances, we experimented with different text representation techniques, including a sliding window strategy to overcome the length limit imposed by RoBERTa. We compared the performances of the ML models to ICD code based classification using the F₁ score metric. The ICD classification model, SVM, and RoBERTa achieved F₁ scores of 0.81 (95% CI: 0.79-0.83), 0.95 (95% CI: 0.92-0.97), and 0.89 (95% CI: 0.88-0.85) for the positive (Fontan) class, respectively. SVM obtained the best performance (p<0.05), and both NLP models outperformed ICD code based classification (p<0.05). The novel sliding window strategy improved performance over the base RoBERTa model (p<0.05) but did not outperform SVM. ICD code based classification tended to have more false positives compared to both NLP models.
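The sliding window idea mentioned above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the window size of 512 tokens, the stride of 256, and the choice to aggregate window-level scores by taking the maximum are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[str], window: int = 512,
                    stride: int = 256) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most `window` tokens.

    `window` and `stride` are hypothetical defaults, not values from the paper.
    """
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("require 0 < stride <= window")
    if len(tokens) <= window:
        # Short notes fit in a single window.
        return [list(tokens)]
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window reached the end of the note
        start += stride
    return chunks


def aggregate_max(window_probs: Sequence[float]) -> float:
    """Combine per-window positive-class probabilities into one patient-level score.

    Taking the maximum is an assumed aggregation rule: a note is scored as
    positive if any single window looks positive.
    """
    return max(window_probs)
```

In use, each window would be passed to a length-limited classifier separately, and the per-window probabilities would then be combined with a rule such as `aggregate_max` before thresholding.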

Conclusions Our proposed NLP models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes. Since the sensitivity of ICD codes is high but the positive predictive value is low, it may be beneficial to apply ICD codes as a filter prior to applying NLP/ML to achieve optimal performance.
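The suggestion to use ICD codes as a filter ahead of the NLP model can be sketched as a two-stage rule: a patient is predicted positive only if the ICD filter flags them and the text classifier agrees. The Python sketch below is a toy illustration under stated assumptions. The patient records, the `icd_flag` and `note` field names, and the keyword-matching stand-in for the NLP classifier are all hypothetical and do not come from the study.

```python
from typing import Callable, Dict, List, Sequence


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def two_stage_predict(patients: List[Dict],
                      icd_filter: Callable[[Dict], bool],
                      nlp_classifier: Callable[[Dict], bool]) -> List[bool]:
    """Run the NLP classifier only on patients that pass the ICD filter."""
    return [icd_filter(p) and nlp_classifier(p) for p in patients]


# Hypothetical toy records; not real patient data.
patients = [
    {"icd_flag": True, "note": "s/p Fontan completion", "label": True},
    {"icd_flag": True, "note": "single ventricle, Glenn only", "label": False},
    {"icd_flag": False, "note": "routine visit", "label": False},
    {"icd_flag": True, "note": "extracardiac Fontan in 2005", "label": True},
]


def icd_filter(p: Dict) -> bool:
    return p["icd_flag"]


def keyword_classifier(p: Dict) -> bool:
    # Stand-in for a trained text classifier (assumption for illustration).
    return "fontan" in p["note"].lower()


labels = [p["label"] for p in patients]
icd_only = [icd_filter(p) for p in patients]
combined = two_stage_predict(patients, icd_filter, keyword_classifier)

f1_icd_only = f1_score(labels, icd_only)
f1_combined = f1_score(labels, combined)
```

On this toy data the ICD filter alone produces one false positive, which the second stage removes. The numbers are only illustrative of the mechanism and say nothing about real-world performance.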

Competing Interest Statement

The authors have declared no competing interest.

Clinical Trial

N/A

Funding Statement

Centers for Disease Control and Prevention Cooperative Agreement, Congenital Heart Defects Surveillance across Time And Regions (CHD STAR) Grant/Award Number: CDC-RFA-DD19-1902.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Not Applicable

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Emory University Institutional Review Board (IRB) approved the study on August 26, 2020 (IRB# STUDY00001030) and included a complete waiver of HIPAA authorization as well as waiver of informed consent

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Not Applicable

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Not Applicable

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Not Applicable

Footnotes

Author CRediT statement

Yuting Guo: conceptualization, methodology, investigation, data curation, writing original draft and review and editing, analysis
Mohammed Al-Garadi: conceptualization, methodology, investigation, data curation
Wendy M. Book: conceptualization, methodology, investigation, resources, data curation, writing original draft and review and editing, supervision, funding acquisition
Lindsey C. Ivey: methodology, data curation, editing draft, supervision
Fred H. Rodriguez III: methodology, data curation, editing draft, supervision
Cheryl L. Raskind-Hood: conceptualization, methodology, draft editing
Chad Robichaux: data curation
Abeed Sarker: conceptualization, methodology, investigation, data curation, writing original draft and review and editing, preparation, supervision, project administration

Data Availability

As per Emory University policy, patient data will not be shared with researchers outside the Emory Health system. Collaborators interested in joining the studies will be reviewed on a case by case basis and will only be allowed to access relevant data following approval from Emory University IRB. All code associated with the natural language processing and machine learning experiments is publicly available (see GitHub link in paper).

https://github.com/yguo0102/Fontan_classification

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.