Continuous-Time and Dynamic Suicide Attempt Risk Prediction with Neural Ordinary Differential Equations
=======================================================================================================

* Yi-han Sheu
* Jaak Simm
* Bo Wang
* Hyunjoon Lee
* Jordan W. Smoller

## ABSTRACT

Suicide is one of the leading causes of death in the US, and the number of attributable deaths continues to increase. Risk of suicide-related behaviors (SRBs) is dynamic, and SRBs can occur across a continuum of time and locations. However, current SRB risk assessment methods, whether conducted by clinicians or through machine learning models, treat SRB risk as static and are confined to specific times and locations, such as following a hospital visit. Such a paradigm is unrealistic as SRB risk fluctuates and creates time gaps in the availability of risk scores. Here, we develop two closely related model classes, Event-GRU-ODE and Event-GRU-Discretized, that can predict the dynamic risk of events as a continuous trajectory based on Neural ODEs, an advanced AI model class for time series prediction. As such, these models can estimate changes in risk across the continuum of future time points, even without new observations, and can update these estimations as new data becomes available. We train and validate these models for SRB prediction using a large electronic health records database. Both models demonstrated high discrimination performance for SRB prediction (e.g., AUROC > 0.92 in the full, general cohort), serving as an initial step toward developing novel and comprehensive suicide prevention strategies based on dynamic changes in risk.

## INTRODUCTION

More than 700,000 people die by suicide worldwide annually, according to the World Health Organization1. In the US, suicide rates continue to increase despite calls for and efforts in suicide prevention2. One of the cornerstones of effective suicide prevention is identifying individuals at high risk for suicide-related behaviors (SRBs) to enable early intervention. Healthcare settings provide an important opportunity for risk assessment as most people who attempt or die by suicide are seen by a healthcare provider in the preceding weeks3. In recent years, the application of statistical and machine learning models to healthcare data have shown promise in improving prediction of suicide-related behavior4–10. Nevertheless, whether based on clinician evaluations alone or in conjunction with newer data-driven approaches4,5, current assessment methods are largely confined to specific time points and settings, such as following a healthcare visit, and treat risk as a static estimate over a given prediction window (Figure 1(a)). Such approaches are intrinsically limited given the dynamic nature of SRB risk, potentially compromising efforts to improve prevention. The precision of static risk estimates will decay with increasing intervals since the time point of prediction, progressively misaligning estimated and actual risk levels. In addition, in the absence of frequent healthcare visits, gaps in the availability of SRB risk estimates would occur, which could contain periods of high risk (i.e., the “uncovered time” depicted in Figure 1(a)). Ideally, an effective SRB risk estimate tool would provide a continuous-time, dynamic trajectory of risk that covers all future time points and can be updated with ongoing observations (Figure 1(b)).

![Figure 1](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/27/2024.02.25.24303343/F1.medium.gif)

[Figure 1](http://medrxiv.org/content/early/2024/02/27/2024.02.25.24303343/F1)

Figure 1 
Schematic diagram for (a) static SRB risk predictions. In the static paradigm, risks of SRB are modeled as uniform probabilities for a pre-defined prediction window (Δt) evaluated at a specific time point, usually following a hospital visit (e.g., t0, t1). Δt may vary in length based on the query. The estimated static risk is carried over through the entire prediction window to represent the SRB risk at any given point until the end of the prediction window (e.g., between t and t+Δt). This approach is vulnerable to errors due to its inherent mismatch with the dynamic nature of SRBs, and fails to provide comprehensive coverage of estimated risk throughout the entire trajectory. On the x-axis: time since the first hospital visit; on the y-axis: actual (blue) and modeled (orange) SRB risk (cumulative incidence) for a given prediction window. t: the time of the first hospital visit; t1: a subsequent hospital visit when new information regarding a life event is recorded; Δt1: the corresponding prediction window queried; Error: the discrepancies between estimated and actual risks; tLE: the time a life event occurred, potentially affecting SRB risk; Uncovered time: periods when an estimated suicide risk is not available; (b) An ideal continuous-time and dynamic SRB risk prediction paradigm. In this paradigm, risks are dynamically modeled to fluctuate naturally, i.e., for every time point, the model produces an estimated SRB risk for a given prediction window following that time point, thus reducing errors associated with static approach. The continuous-time approach also provides coverage of the entire time trajectory (i.e., no “uncovered time”). tupdate: the time when new information, such as a life event, is recorded and incorporated into the risk model. Notably, if data from sources outside healthcare settings is available, tupdate can occur independently of hospital visits, allowing for more timely updates based on the information source. Similarly, equating the timing of SRB events with the recording of diagnosis codes can introduce inaccuracies in the modeled trajectories. Further details are discussed in the Discussion section.

In this study, we aimed to develop and validate a first iteration of continuous-time, dynamic risk prediction models for SRBs. We utilize data from a large-scale electronic health records (EHR) database, encompassing more than 1.7 million patients with a wide range of demographic and clinical features, to employ a recent advance in artificial intelligence (AI): Neural Ordinary Differential Equations (Neural ODEs) 11. Neural ODEs are a type of deep learning model that perceives the sequential evolution of data as continuous-time differential equations, allowing for more flexible and efficient modeling of complex, dynamic systems. Specifically, our models leverage GRU-ODE12, an innovation that extends the original Neural ODE framework by building on Gated Recurrent Units (GRUs)13, a specialized type of recurrent neural network. GRU-ODE is particularly well-suited for modeling the irregularly spaced and sparse longitudinal trajectories such as those captured in her data. We further adapt GRU-ODE and develop two closely related model classes for event-based predictions (e.g., a recorded SRB), which we term “Event-GRU-ODE” and “Event-GRU-Discretized.” In this framework, each patient’s SRB risk at each time step is modeled as a non-linear transformation of a latent, continuous, and dynamic trajectory. Conceptually, such a trajectory can be interpreted as a series of abstract vectors representing the evolution of mental states), enabling the estimation of time-varying probabilities of SRBs within any specified future time window. These models have an “interleaving” design – i.e., they consist of two components— one that models a base latent trajectory of risk, and another that updates the trajectory based on information acquired from new observations. Both Event-GRU-ODE and Event-GRU-Discretized are capable of producing continuous-time predictions, with the only difference being that the former imposes an additional continuity assumption on the modeled latent trajectory – i.e., in the absence of new information, the change in the modeled value of the latent trajectory cannot exceed a threshold with respect to each time interval (see Methods for details). In short, these models can estimate continuous changes in risk across the continuum of future time points, even in the absence of new observations, and can be updated when new observations become available. By extending SRB risk estimates across the continuum of time, this approach effectively addresses one of the major limitations of current SRB risk assessment methods and could serve to inform more responsive interventions that are better able to address the complex nature of suicide risk.

## RESULTS

### Patient Cohort Characteristics

Data were obtained for 1,706,417 patients from the Research Patient Data Registry (RPDR)14, the EHR data repository of the Mass General Brigham (MGB) Healthcare system. From this patient set, 1,536,179 were randomly sampled for model development (80% of all patients for training, and 10 % for validation/hyperparameter tuning), and 170,238 patients (10%) were included as a hold-out test set. Table 1 summarizes the demographic composition of the data sets. Overall, the patients were predominantly females (i.e., 58% females in both training and test sets), and a majority self-identified as White (77.8% in the training/validation sets, and 77.6% in the test set) and were in the age range of 45-65 years old (Table 1 and Supplementary Figure 1). As shown in Supplementary Figure 2, most patients had fewer than 50 healthcare encounters within our “study time frame” (i.e., between Jan. 1, 2016 and Dec. 31, 2019).

View this table:
[Table 1](http://medrxiv.org/content/early/2024/02/27/2024.02.25.24303343/T1)

Table 1 Demographic breakdown of the development (training and validation) and test sets.

Our models continuously generate predictions on a daily basis for each patient, with each prediction varying in its temporal distance from both the patient’s first recorded observation and the last observation before the prediction was made. We illustrate the distribution of the number of months between the prediction timepoint and (1) the first observation (representing the total length of information available for the prediction in question) and (2) the last available observation in Supplementary Figure 3. Most predictions were made with less than 2 years of data from the first observation and less than 5 months from the last observation. See below for the effect of length of observed patient history on model performance.

### Main Model Performance Metrics

Table 2 shows the prediction performance of the Event-GRU-ODE and Event-GRU-Discretized models. In general, both models achieved excellent discrimination (area under the receiver operator curve, AUROC > 0.9) across different prediction windows (ROC curves are shown in Supplementary Figure 4). All model metrics were reported at a fixed 95% specificity. The highest AUROCs were recorded with a 1-month prediction window (Event-GRU-ODE=0.940, Event-GRU-Discretized=0.942), while the highest area under the precision-recall curve (AUPRC) and positive predictive value (PPV) were observed using a 1.5-year prediction window, for both models (AUPRC=0.089 and PPV=0.016 for Event-GRU-ODE; AUPRC=0.089 and PPV=0.017 for Event-GRU-Discretized). All AUPRCs and PPVs were less than 0.09 and 0.02, respectively, due to the low SRB prevalence in the data (0.01% - 0.12%). Although we observed the lowest PPV with the shortest prediction window, those classified as high risk were 15 times (RR=15.16) more likely to have an SRB than those classified as low risk. While both models had nearly identical AUROC, AUPRC and PPV, the Event-GRU-Discretized model achieved slightly better results in sensitivity and relative risk (at 95% specificity) than Event-GRU-ODE, with longer prediction windows (e.g., 1 year or 1.5 years).

View this table:
[Table 2](http://medrxiv.org/content/early/2024/02/27/2024.02.25.24303343/T2)

Table 2 
Model performance comparison between Event-GRU-ODE (i.e., “ODE” in the table) and Event-GRU-Discretized (i.e., “Discretized” in the table), for the five prediction windows (Pred. win.). Performance metrics are AUROC, AUPRC, positive predictive value (PPV), sensitivity and relative risk (RR = PPV/prevalence), reported at 95% specificity.

### Effect of Time Length of Observed Patient History on Model Performance

Figure 3 plots model metrics for the Event-GRU-ODE model as a function of the time length of observed patient history, smoothed by the locally weighted scatterplot smoothing (LOWESS) method and bootstrapped 95% confidence intervals. In general, performance tended to be better with longer observed patient history trajectories (and, therefore, more accumulated information), except for the AUPRC curve using the 1-month prediction window. We observe wider confidence intervals around the LOWESS fits towards the end of the follow-up period and with longer prediction windows. This widening may be attributed to a combination of factors: patients moving in and out of the hospital system, resulting in not all being followed for the entire study period; and the exclusion of time steps from the model training when their prediction windows extend beyond the study period. Both of these factors contribute to a reduction in the amount of data available for training during the later time steps, relative to the initial records of each patient. The aggregated prediction plots for the discretized model are shown in Supplementary Figure 5.

### Subgroup Analysis by Clinical Settings and Demographic Factors of Interest

Figure 4 presents model metrics for Event-GRU-ODE stratified by three different clinical settings: (1) “General,” including all available patients and visits; (2) “Psych ED,” for which predictions were made at the time of an emergency department visits involving psychiatric evaluation/consultation; and (3) “Psych Inpatient,” for which predictions were made at time of a psychiatric inpatient admission. AUPRC and PPV were significantly higher for the “Psych ED” and “Psych Inpatient” cohort compared to the “General” cohort, presumably due to the increased prevalence of SRB events among the two sub-cohorts the across different prediction windows (2.43% - 11.38% for “Psych ED”, and 0.78% - 7.24% for “Psych Inpatient”). The highest PPV (42.2%) was achieved for the Psych ED setting, with a prediction window of 1.5 years.

Event-GRU-ODE performance metrics stratified by important patient demographic characteristics (i.e., gender, self-reported race, age, income, and public payor for healthcare) are shown in Supplementary Figures 6-10. In general, AUROCs demonstrated modest variation across different demographic groups but were good (AUROC > 0.8) in all settings except for longer prediction windows among those with age < 20. There was greater variation in AUPRCs which were highest among individuals whose self-reported race was Black (for prediction windows of 6 months or more), and also tended to be higher among individuals aged 20-60 years and those gender was male. The discrete model version demonstrated similar characteristics in both subgroup analyses (data not shown).

### Effect of Training Data Size on Model Performance

The size of available training data can be a limiting factor in settings with constrained resources. Theoretically, if the continuity assumption underlying the Event-GRU-ODE model holds true, this assumption should benefit model performance when training data is limited. To examine the overall effect of training data size on performance across both model versions, as well as the impact of the continuity assumption (demonstrated by contrasting the ODE and discretized models), we present in Table 3 the results (in AUROC and AUPRC) from training both models with various data sample sizes (i.e., 1/8, 3/8, 5/8, and 8/8 or 100% of the training set). Additionally, we plot the comparison of prediction performance (in AUROC, AUPRC, and PPV at 95% specificity) between the ODE and discretized models for the 3- month prediction in Figure 2. We choose the 3-month (i.e., 90-days) prediction window due to the clinical utility of shorter-term prediction intervals. In general, model performance was only modestly reduced when training data was 3/8 (37.5%) of the full training dataset. We observe a more significant decline in model performance when the training data size is reduced from 3/8 to 1/8 of all available training data. However, discriminative metrics were still robust (AUROC > 0.84 for both model classes) under the 1/8 training data setting. Of note, Event-GRU-ODE achieved higher AUPRC and PPV (a model metric which have been shown to be more informative than AUROC when base prevalence is low15,16) compared to Event-GRU-Discretized with smaller training data sizes, i.e., with only 3/8 or 1/8 of training data. Performance plots for the other prediction windows are shown in Supplementary Figures 11-14.

![Figure 2](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/27/2024.02.25.24303343/F2.medium.gif)

[Figure 2](http://medrxiv.org/content/early/2024/02/27/2024.02.25.24303343/F2)

Figure 2 
Model performance (in (a) AUROC; (b) AUPRC; and (c) PPV at 95% specificity) comparison between the two models trained with different proportions of the training data (1 fold, 3 folds, 5 folds and 8 folds, i.e., 1/8, 3/8, 5/8, and all of the training set), using a 3-month prediction window.

View this table:
[Table 3](http://medrxiv.org/content/early/2024/02/27/2024.02.25.24303343/T3)

Table 3 AUROC and AUPRC scores of models trained using different proportions of the data set (1 fold, 3 folds, 5 folds and 8 folds, which are 1/8, 3/8, 5/8 and 100% of the training set, respectively).

## DISCUSSION

In this study, we present a continuous-time, dynamic modeling approach for predicting SRB risks. Unlike current assessment methods, this approach accommodates the time-varying nature of SRB risks and, by generating risk estimates across the entire time trajectory, it mitigates gaps in risk score availability. We utilized data from a large-scale EHR database containing data from multiple medical centers and hospitals, including features extracted from unstructured clinical notes, to develop and validate two model classes (Event-GRU-ODE and Event-GRU-Discretized) based on an advanced neural network architecture - Neural ODEs. Both model classes achieved high discrimination performance across different clinical settings. For example, in the general setting, AUROCs were between 0.93-0.94, and relative risks (the modeled risk divided by baseline prevalence during the designated prediction window) ranged from 13.2 and 15.2 across different prediction windows at 95% specificity. Such performance is better or comparable to those observed in previous analyses based on conventional static risk estimates by our group and others4,6–10,17–19.

With respect to analyses by clinical setting, the results largely tracked the base rate of SRBs. As expected, AUROCs were generally higher and PPVs lower for the “General” (all patients) cohort than the inpatient or ED sub-cohorts where base rates are higher. Stratified analyses by demographics found good discrimination across groups (AUROCs > 0.80) while AUPRCs were more variable (again, consistent with varying base rates of the outcome). For prediction windows of 6 months or longer, AUPRCs tended to be highest among those who identified as Black and lowest among those who identified as Asian.

While the two model classes achieved similar performance across various metrics when all training data were used, Event-GRU-ODE attained higher AUPRC and PPVs than its discretized counterpart for the majority of configurations in the settings where the size of training data is limited. This implies that the continuity prior assumption12 (see Methods) may be beneficial in such settings.

While these proof-of-concept analyses focused solely on EHR data for model development, a key advantage of our model classes lies in their flexible architecture, including their natural incorporation of a time dimension. This flexibility facilitates the easy integration of additional data modalities, such as genetic data or other biomarkers, and enables updates to the modeled future risk trajectory based on new incoming information at any point in time. This is crucial because, although a risk trajectory generated at a certain point can theoretically cover all future times, it may diverge from reality if it extends too far into the future without incorporating updated information. Sources of new information can include future healthcare visits, data from digital sensors, or outputs from mobile devices, which may themselves be structured as time series. Crucially, updating the modeled risk trajectory with incoming information at a specific time step requires only the modeled latent state vector from the previous time step, the relevant new information content, and the model parameters. This approach eliminates the need for access to the original data utilized to generate the risk trajectory up to the current time point, which may contain protected health information, thereby mitigating privacy concerns when updating predictions outside healthcare facilities.

Our results should be interpreted with several limitations in mind. Firstly, in this initial iteration of continuous-time SRB prediction models, our models rely solely on EHR data, with limited use of information contained within clinical notes (i.e., only utilizing the presence of specific extracted terms). A natural next step would be to refine the precision of event timing (including SRBs), potentially through the application of natural language processing techniques (e.g., large language models20–22) in combination with the information contained in clinical notes. Moreover, integrating sources of information beyond the healthcare system could further ensure timely updates of the modeled trajectory and potentially enhance model performance. Secondly, this initial study on continuous-time SRB prediction models has not yet explored their generalizability in depth. Lastly, model interpretation in time series modeling is an evolving field. While applying standard interpretation techniques, such as SHAP (Shapley Additive Explanations)23, is technically possible, it often leads to complex and non-straightforward interpretations. Due to this complexity, we didn’t include an analysis of feature importance.

Effective suicide prevention strategies necessitate a collaborative approach between the healthcare system and the broader community, including schools, workplaces, and families. The long-term goal of developing risk prediction tools that can adapt to dynamic changes in SRB risk, integrate new information continuously, and offer timely assessments, is to enhance the design of suicide prevention strategies. For example, data from community settings could be integrated into real-time risk assessments for more timely delivery of support or interventions. Alternatively, healthcare visits might be strategically scheduled during periods of predicted spikes in risk, facilitating better planning around times of the year or following specific life events that may elevate risk.

In conclusion, leveraging neural ODE—a recent advance in deep learning for time-series analysis—we have developed continuous-time and dynamic risk prediction models to assess the risk of SRB using EHR data. We validated their effectiveness across various prediction windows, with robust performance across diverse demographics and clinical settings. Our results confirm the feasibility of continuous-time and dynamic prediction for SRB and highlight the value of continued innovation in this area to enable more accurate and comprehensive SRB risk stratification methods.

## METHODS

### Data source and study population

The data used for this study were derived from RPDR14 , the research EHR data registry of the MGB healthcare system located in Massachusetts, USA. The RPDR is a centralized data registry that collects and compiles clinical information from multiple EHR systems within MGB, which includes more than 7 million patients with over 3 billion records seen across more than 8 hospitals, including two major teaching hospitals14. In this study, we used data between the period of Jan. 1, 2016 and Dec. 31, 2019 (the “study time frame”) and included all patients who had at least 3 visits and a minimum of 30 days of medical record, and were older than age 10 at the time of their first medical record entry within this time period. We chose the start date to minimize the impact of data heterogeneity caused by different recording practices due to the hospital system’s transition from ICD (International Classification of Disease)-9 to ICD-10.

### Suicide-related behavior definition

For the purpose of this study, we adopt a list of ICD codes, previously developed and validated8,24, as the definition of SRB. These codes were shown to have to be valid for capturing intentional self-harm by extensive chart review by expert clinicians, a PPV of greater than 70%24. An “SRB event” is then defined by the occurrence of any of the codes in the list.

### Predictors

We include two types of predictors: (1) *covariates*, including the demographic variables (total of 7, a full list is provided in Table 1). These enter the model at the beginning of the trajectories; and (2) *observations*, which are recorded for each hospital encounter during the patient trajectory, including: a. diagnosis codes, which were mapped from ICD codes25,26 to PheWAS codes27 which have been shown to capture clinically meaningful concepts28 (total 1,871 features); b. medications, which were mapped to RXNORM codes29 (total 1,712 features); and c. a set of previously defined features derived using natural language processing (NLP)30,31, which we named as “NLP CUIs” (i.e., concept unique identifiers32), indicating the presence of mental health-related concepts as defined in the Unified Medical Language System (UMLS32) in the clinical notes (total of 2,488 features). Observations were ordered temporally and enter the model as inputs according to their recorded date relative to the date of the first entry of each patient (see Figure 5 for a schematic illustration of the data setup). Note that while we chose a time resolution of one day as the basic unit for a time step, any time resolution can theoretically be chosen given that data are available (the smaller the time step, the closer the approximation to continuity).

![Figure 3](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/27/2024.02.25.24303343/F3.medium.gif)

[Figure 3](http://medrxiv.org/content/early/2024/02/27/2024.02.25.24303343/F3)

Figure 3 
Aggregated prediction performance (in (a) AUROC and (b) AUPRC) by the length of observed history (smoothed by LOWESS) since the beginning of every patient trajectory, for Event-GRU-ODE.

![Figure 4](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/27/2024.02.25.24303343/F4.medium.gif)

[Figure 4](http://medrxiv.org/content/early/2024/02/27/2024.02.25.24303343/F4)

Figure 4 
Model performance by clinical settings, namely “General” (general setting, green), “Psych ED” (psychiatric emergency department, blue) and “Psych Inpatient” (psychiatric inpatients, orange), for Event-GRU-ODE. The percentages in the parentheses along the x-axis indicate the prevalence of SRB in each setting.

![Figure 5](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/27/2024.02.25.24303343/F5.medium.gif)

[Figure 5](http://medrxiv.org/content/early/2024/02/27/2024.02.25.24303343/F5)

Figure 5 
Schematic diagram of the data setup for one hypothetical patient. Static covariates (demographic observations), which may include any combination of diagnoses, medications, and NLP CUIs, are entered into the model at time t0 (i.e., the time of the first patient entry). Subsequent observations are ordered sequentially over time, with the exact time steps (days after the first record) preserved. Labels are created for each time step, considering the presence of a SRB record. For any time step tA within the interval where data is available, its associated prediction Window A, starting from tA+1, is assigned a label value of 1 if SRB is recorded within that period. Note that tA is not bounded to the dates of where there are recorded entries. With the trained model, predictions can be made for any future time step (tB, tC, …) occurring after the last recorded entry at tf (note that labels are not available for future times and the probability of their occurrences are the targets for predictions (red circle with question mark)).

Continuous-valued features (e.g., age range at the time follow-up started) were discretized by binning, and all features were converted to one-hot encoding before entering the models.

### Prediction task definition

Based on the SRB definition, we formulate the suicide risk prediction task as follows: ![Formula][1]</img>  where Δ*t* is the length of the “prediction window” and *eτ* is 1 if an event (i.e., suicide-related behavior) was recorded at time *τ*. This means *Yt* (i.e., the prediction label) is set to 1 if an event happens within the next Δ*t* time (i.e., a “prediction window”). In this study, we look at prediction windows of 1 month, 3 months, 6 months, 1 year, and 1.5 years.

Prediction labels are created for each patient, for each day, from the date of the first visit up to the date of the last visit, minus the length of the prediction window to ensure the full length of prediction window is observed. Of note, to compensate for the fact that events recorded toward the end of patient history would contribute fewer labels under this scheme, a “buffering time” of 30 days is appended to the date of the last visit, during which the label of the last visit will be carried over up to the 30th day after the last visit.

### Event-GRU-ODE model

We model a patient trajectory as a *D*-dimensional stochastic process ***Y***(*t*) whose dynamics is driven by a stochastic differential equation (SDE): ![Formula][2]</img>  where d***W***(t) is a Wiener process. The EHR records in a patient trajectory typically consist of sparse and irregularly observed longitudinal data, which makes modeling the SDE directly very difficult.

We follow the idea from statistical mechanics, that instead of modeling SDE directly, one models the time evolution of the probability density function of the observables. Therefore, we propose to use a latent variable process ***h***(*t*) that can be mapped to the mean and covariance of the target variables of interest, i.e., the SRB labels ***Y***t (see “Prediction task definition”). In detail, we model the probabilities of a SRB event happening between time period *t* and *t* + Δ*t*, in the form of a Bernoulli distribution. This means that the latent process ***h***(*t*), varying in time, is mapped to the parameter *p*(*t*) of the Bernoulli distribution, which determines its mean and variance varying in time. The dynamics of the underlying latent process are governed by ordinary differential equations (ODEs), whose parameters are learned from data.

Specifically, we model the mapping from the observed variables to the hidden variable with “Event-GRU-ODE” as described below, and use a feed-forward neural network to map the hidden variable ***h***(*t*) to the output probabilities (i.e., the parameter *p* of Bernoulli distribution of the random variable). This SDE formulation reflects the evolving uncertainty of the patient’s future condition regarding SRB.

#### Model components

To model the latent variable ***h***(*t*), used for predicting the probabilities of SRB events, and incorporate incoming health-related observations, such as diagnosis and measurements, we propose a two-mode system, consisting of: (1) *Continuous time propagation* of the latent variable ***h***(*t*); and (2) *Discrete updates* that happen each time a new observation *xt* is received at time *t*. Each time there is an observation that triggers the discrete update the system switches to a new state of the ODE (which still has the same parameters and dynamics as before).

#### Continuous time propagation

For the *continuous time propagation*, we employ recently proposed methods in the field of Neural ODEs, specifically GRU-ODE33. This means when there are no observations, the latent variable follows an ODE whose trajectory is controlled by the GRU-ODE: ![Formula][3]</img>  where *Wc* are the weights of the neural networks of GRU-ODE (learned end-to-end with other components of the model).

To derive GRU-ODE we start with the standard classic GRU13, given by its equations for update gate *z*, reset gate r and candidate hidden vector ![Graphic][4]</img>: ![Formula][5]</img>  

![Formula][6]</img> ![Formula][7]</img> ![Formula][8]</img> 
The key idea is to observe that the last equation, which combines the candidate hidden vector ![Graphic][9]</img> and the previous hidden vector ***h****t*−1, can be rewritten as a difference equation for ***h****t*: ![Formula][10]</img>  Taken to the limit of Δ*t* → 0 the difference equation gives an ODE with negative feedback on the current hidden state ***h***(*t*): ![Formula][11]</img>  GRU-ODE has several useful properties, such as stability due to negative feedback and Lipschitz continuity with respect to time with constant *K* = 2, which helps make the model robust to numerical errors as well as effective even with smaller data sample size33.

#### Initial hidden state

The initial hidden state, ***h***(0), is computed from the static covariates of the patient using a multilayer feed-forward neural network.

#### Discrete update

The continuous-time GRU-ODE component propagates the latent vector up to the observation time (*t*) where the model then updates the hidden state ***h***(*t*) to a new value ***h***′ by using a discrete network, which is a standard GRU13: ![Formula][12]</img>  where *xt* is the observation at time *t* and *Wd* are the weights of the update GRU network and a multilayer feed-forward network that preprocesses *xt*. After this update the system switches back to continuous-time propagation (GRU-ODE) starting from the state ***h***′(*t*).

Finally, we use an output network *f*(*h*(*t*); *Wo*) that maps the hidden state *h*(*t*) to the SRB event probabilities *p*(*t*) for the time *t*. Using *cross-entropy loss* on the outputs of the network and feeding in observations *xt* we can train Event-GRU-ODE in fully *end-to-end manner*.

### Event-GRU-Discretized model

The Event-GRU-ODE model makes the assumption that between two observations (e.g., two healthcare encounters) the latent vector ***h*** is propagating by following an ODE, forcing the trajectory to be continuous in time.

We can relax this continuity assumption by allowing the time propagation to be updated by the discrete GRU. This propagation is still carried out in iterative fashion using time step size as in Event-GRU-ODE. Relaxing the continuity assumption could allow the model to be more flexible and might affect performance. We name this model “Event-GRU-Discretized.”

### Model training and evaluation

We implemented and evaluated two models, namely “Event-GRU-ODE” and “Event-GRU-Discretized”, for the continuous time suicide risk prediction task, with 5 different prediction windows (i.e., 1 month, 3 months, 6 months, 1 year and 1.5 years). We randomly split the data into 10 “folds,” each contains data from 10% of all patients. Among them, 8 folds (80%) were used as a training set, one fold (10%) as a hold-out validation set (10%) for tuning all the model hyperparameters, and one fold (10%) as a hold-out test set for model evaluation (10%). During training, models were optimized using the Adam optimizer34. More detailed training procedures, including hyperparameter settings, are provided in Supplementary Note A.

For model performance evaluation, we report area under the receiver operator characteristic curve (AUROC), area under Precision-Recall curve (AUPRC), as well as positive predictive value (PPV) and sensitivity with specificity set to 0.95.

In order to evaluate how well Event-GRU-ODE makes its predictions as a function of the length of observed patient history trajectories, we aggregated all the model predictions made on each day after the first EHR entry of each patient, measured model performance (in AUROC and AUPRC) along the time trajectory, and plotted the model metrics over time for the study time frame. To increase the signal-to-noise ratio, plots were smoothed by the LOWESS method with its smoothing window (i.e., the fraction parameter) set to be 0.3. We also performed bootstrap resampling to add the estimated 95% confidence intervals around each LOWESS fit on the plot.

In addition, we assessed potential differences in model performance across various patient subgroups. First, we defined three patient groups of interest based on clinical settings to stratify prediction results and evaluate model performance in each patient population: (1) “General” representing the general outpatient cohort, which encompasses the full study population and their complete trajectories within the study period; (2) “Psych ED” for predictions during emergency department visits involving psychiatric evaluation/consultation within the study period; (3) “Psych Inpatient” for predictions made during psychiatric inpatient admissions within the study period. Furthermore, we conducted stratified analyses on key demographic groups (gender, race, age group, income level by ZIP code, whether the patient has a public insurance payor) to examine potential heterogeneity in model performance across different subgroups.

Lastly, to further compare the two models and their capabilities to learn from fewer samples, we trained each model using all the training data as well as randomly sampling 5/8, 3/8 and 1/8 (i.e., 5, 3, and one fold, respectively) of the training data.

## Supporting information

Supplementary materials [[supplements/303343_file02.pdf]](pending:yes)

## DATA AVAILABILITY

Protected Health Information restrictions apply to the availability of the clinical data here, which were used under IRB approval for use only in the current study. As a result, this dataset is not publicly available.

## AUTHOR CONTRIBUTIONS

Y.-H.S. conceived of the study; Y.-H.S. and J.W.S. acquired the data for the study; J.S. developed the models utilized in the study; Y.-H.S., B.W., and H.L. curated data and performed analyses of the study; J.W.S., supervised analyses of the study; J.W.S. provided computational resources for the study; Y.-H.S., J.S., and B.W. wrote the first draft; and all authors revised the manuscript and contributed to interpretation of the analyses of the study.

## COMPETING INTERESTS

Dr. Smoller is a member of the Scientific Advisory Board of Sensorium Therapeutics (with equity), and has received grant support from Biogen, Inc. He is PI of a collaborative study of the genetics of depression and bipolar disorder sponsored by 23andMe for which 23andMe provides analysis time as in-kind support but no payments. The authors have no other conflict of interests to disclose.

## ACKNOWLEDGEMENTS

Drs. Smoller, Sheu, and Mr. Lee were supported in part by NIMH R01 MH117599.

*   Received February 25, 2024.
*   Revision received February 25, 2024.
*   Accepted February 27, 2024.


*   © 2024, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## REFERENCES

1.  1.World Health Organization Fact Sheet: Suicide. [https://www.who.int/news-room/fact-sheets/detail/suicide](https://www.who.int/news-room/fact-sheets/detail/suicide).
    
    
2.  2.Curtin, S. C., Garnett, M. F. & Ahmad, F. B. Provisional Estimates of Suicide by Demographic Characteristics: United States, 2022. Vital Statistics Rapid Release (2023).
    
    
3.  3.Ahmedani, B. K. et al. Variation in patterns of health care before suicide: A population case-control study. Prev. Med. 127, 105796 (2019).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 

4.  4.Nock, M. K. et al. Prediction of Suicide Attempts Using Clinician Assessment, Patient Self-report, and Electronic Health Records. JAMA Netw. Open 5, e2144373 (2022).
    
    
5.  5.Simon, G. E. et al. Reconciling statistical and clinicians’ predictions of suicide risk. Psychiatr. Serv. 72, 555–562 (2021).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 

6.  6.Sheu, Y.-H. et al. An efficient landmark model for prediction of suicide attempts in multiple clinical settings. Psychiatry Res. 323, 115175 (2023).
    
    
7.  7.Simon, G. E. et al. Predicting suicide attempts and suicide deaths following outpatient visits using electronic health records. Am. J. Psychiatry 175, 951–960 (2018).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 

8.  8.Barak-Corren, Y. et al. Validation of an Electronic Health Record-Based Suicide Risk Prediction Modeling Approach Across Multiple Health Care Systems. JAMA Netw. Open 3, e201262 (2020).
    
    
9.  9.Bayramli, I. et al. Temporally informed random forests for suicide risk prediction. J. Am. Med. Inform. Assoc. 29, 62–71 (2021).
    
    
10. 10.Walsh, C. G., Ribeiro, J. D. & Franklin, J. C. Predicting risk of suicide attempts over time through machine learning. Clinical Psychological Science 5, 457–469 (2017).
    
    
11. 11.Chen, R. T. Q., Rubanova, Y., Bettencourt, J. & Duvenaud, D. K. Neural Ordinary Differential Equations. Advances in Neural Information Processing Systems (2018).
    
    
12. 12.De Brouwer, E., Simm, J., Arany, A. & Moreau, Y. GRU-ODE-Bayes: Continuous Modeling of Sporadically-Observed Time Series. Advances in Neural Information Processing Systems (2019).
    
    
13. 13.Cho, K. et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1724–1734 (Association for Computational Linguistics, 2014). doi:10.3115/v1/D14-1179.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3115/v1/D14-1179&link_type=DOI) 

14. 14.Nalichowski, R., Keogh, D., Chueh, H. C. & Murphy, S. N. Calculating the benefits of a Research Patient Data Repository. AMIA Annu. Symp. Proc. 1044 (2006).
    
    
15. 15.Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10, e0118432 (2015).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0118432&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25738806&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 

16. 16.Ozenne, B., Subtil, F. & Maucort-Boulch, D. The precision--recall curve overcame the optimism of the receiver operating characteristic curve in rare diseases. J. Clin. Epidemiol. 68, 855–859 (2015).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jclinepi.2015.02.010&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 

17. 17.Tsui, F. R. et al. Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts. JAMIA Open 4, ooab011 (2021).
    
    
18. 18.Walsh, C. G. et al. Prospective Validation of an Electronic Health Record-Based, Real-Time Suicide Risk Model. JAMA Netw. Open 4, e211428 (2021).
    
    
19. 19.Walsh, C. G., Ribeiro, J. D. & Franklin, J. C. Predicting suicide attempts in adolescents with longitudinal clinical data and machine learning. J. Child Psychol. Psychiatry 59, 1261–1270 (2018).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/jcpp.12916&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 

20. 20.Chen, Z., et al. [2311.16079] MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. arXiv (2023).
    
    
21. 21.Agrawal, M., Hegselmann, S., Lang, H. & Kim, Y. Large language models are few-shot clinical information extractors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022).
    
    
22. 22.Toma, A. et al. Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding. arXiv (2023) doi:10.48550/arxiv.2305.12031.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.48550/arxiv.2305.12031&link_type=DOI) 

23. 23.Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (2017).
    
    
24. 24.Barak-Corren, Y. et al. Predicting suicidal behavior from longitudinal electronic health records. Am. J. Psychiatry 174, 154–162 (2017).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 

25. 25.ICD - ICD-9-CM - International Classification of Diseases, Ninth Revision, Clinical Modification. [https://www.cdc.gov/nchs/icd/icd9cm.htm](https://www.cdc.gov/nchs/icd/icd9cm.htm).
    
    
26. 26.ICD- 10 - CM International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). [https://www.cdc.gov/nchs/icd/icd-10-cm.htm](https://www.cdc.gov/nchs/icd/icd-10-cm.htm).
    
    
27. 27.Bastarache, L. Using Phecodes for Research with the Electronic Health Record: From PheWAS to PheRS. Annu. Rev. Biomed. Data Sci. 4, 1–19 (2021).
    
    
28. 28.Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nbt.2749&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24270849&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 

29. 29.Liu, S., Wei Ma, Moore,  R., Ganesan,  V. & Nelson, S. RxNorm: Prescription for electronic drug information exchange. IT Prof. 7, 17–23 (2005).
    
    
30. 30.Castro, V. et al. Semi-automated Dictionary Curation of Symptoms and Events Preceding Suicide Attempts in Clinical Notes. in (AMIA Informatics Summit, 2020).
    
    
31. 31.Goryachev, S., Castro, V., Gainer, V. & Murphy, S. PhenoNLP: Phenotype-centric Natural Language Processing on Over 200 Million Clinical Notes. in (AMIA Informatics Summit, 2020).
    
    
32. 32.Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–70 (2004).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkh061&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=14681409&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.25.24303343.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000188079000061&link_type=ISI) 

33. 33.De Brouwer, E., Simm, J., Arany, A. & Moreau, Y. GRU-ODE-Bayes: Continuous Modeling of Sporadically-Observed Time Series. Advances in Neural Information Processing Systems (2019).
    
    
34. 34.Kingma, D. & Ba, J. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR) (2015).

 [1]: /embed/graphic-9.gif
 [2]: /embed/graphic-10.gif
 [3]: /embed/graphic-11.gif
 [4]: /embed/inline-graphic-1.gif
 [5]: /embed/graphic-12.gif
 [6]: /embed/graphic-13.gif
 [7]: /embed/graphic-14.gif
 [8]: /embed/graphic-15.gif
 [9]: /embed/inline-graphic-2.gif
 [10]: /embed/graphic-16.gif
 [11]: /embed/graphic-17.gif
 [12]: /embed/graphic-18.gif