Abstract
Objective Early identification of bipolar disorder (BD) provides an important opportunity for timely intervention. In this study, we aimed to develop machine learning models using large-scale electronic health record (EHR) data including clinical notes for predicting early-onset BD.
Method Structured and unstructured data were extracted from the longitudinal EHR of the Mass General Brigham health system. We defined three cohorts aged 10 – 25 years: (1) the full youth cohort (N=300,398); (2) a sub-cohort defined by having a mental health visit (N=105,461); (3) a sub-cohort defined by having a diagnosis of mood disorder or ADHD (N=35,213). By adopting a prospective landmark modeling approach that aligns with clinical practice, we developed and validated a range of machine learning models including neural network-based models, across different cohorts and prediction windows.
Results We found the two tree-based models, Random forests (RF) and light gradient-boosting machine (LGBM), achieving good discriminative performance across different clinical settings (area under the receiver operating characteristic curve 0.76-0.88 for RF and 0.74-0.89 for LGBM). In addition, we showed comparable performance can be achieved with a greatly reduced set of features, demonstrating computational efficiency can be attained without significant compromise of model accuracy.
Conclusion Good discriminative performance for early-onset BD is achieved utilizing large-scale EHR data. Our study offers a scalable and accurate method for identifying youth at risk for BD that could help inform clinical decision making and facilitate early intervention. Future work includes evaluating the portability of our approach to other healthcare systems and exploring considerations regarding possible implementation.
Competing Interest Statement
Dr. Smoller is a member of the Scientific Advisory Board of Sensorium Therapeutics (with equity), and has received grant support from Biogen, Inc. He is PI of a collaborative study of the genetics of depression and bipolar disorder sponsored by 23andMe for which 23andMe provides analysis time as in-kind support but no payments.
Funding Statement
This study was supported in part by a gift from the Ryan Licht Sang Bipolar Foundation and NIMH R01MH118233 (JWS).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
IRB of Mass General Brigham gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data availability
Protected Health Information restrictions apply to the availability of the clinical data here, which were used under IRB approval for use only in the current study. As a result, this dataset is not publicly available.