Points-Based Models for Predicting Severe Disease related to Covid-19 with the SARS-CoV-2 Omicron Variant
=========================================================================================================

* Yatir Ben Shlomo
* Jacob G. Waxman
* Galit Shaham
* Doron Netzer
* Ben Reis
* Ran Balicer
* Noa Dagan

## Abstract

**Introduction** Throughout the SARS-CoV-2 pandemic, resources for various aspects of patient care have been limited, necessitating risk-stratification. The need for good risk-stratification tools has been enhanced by the availability of new Covid-19 therapeutics that are effective at preventing severe disease among high-risk patients if given promptly following SARS-CoV-2 infection. We describe the development of two points-based models for predicting the risk of deterioration to severe disease from an Omicron-variant SARS-CoV-2 infection.

**Methods** We developed two logistic regression-based models for predicting the risk of severe Covid-19 within a 21-days follow-up period among Clalit Health Services members aged 18 and older, with confirmed SARS-CoV-2 infection from December 25, 2021 to March 16, 2022. In the first model, aimed for the use of healthcare providers, the model coefficients were linearly transformed into integer risk points. In the second model, a simplified version designed for self-assessment by the general public, the risk points were further scaled down to smaller numbers with less variability across risk factors.

**Results** 613,513 individuals met the inclusion criteria, of which 1,763 (0.287%) developed the outcome. The AUROC estimates for both models were 0.95, although the ‘full’ model demonstrated more granular risk-stratification capabilities (77 vs. 27 potential thresholds on the test set). Both models proved effective in identifying small subsets of the population enriched with individuals who ended up deteriorating. For example, prioritizing the top 1%, 5% or 10% individuals in the population for interventions with the full model results in coverage of 36%, 68% or 83% (respectively) of the individuals that actually end up deteriorating. Risk point count increased with age, number of chronic conditions and previous hospitalizations, and decreased with recent vaccination and infection.

**Discussion** The models presented, one more expressive and one more accessible, are transparent and explainable models applicable to the general population that can be used in the prioritization of Covid-19-related resources, including therapeutics.

## Introduction

Despite widespread and successful vaccination campaigns, novel variants and waning of vaccine-induced immunity are perpetuating the SARS-CoV-2 pandemic. (1) Novel Covid-19 therapeutics are a limited resource due to their cost and the challenging logistics of administering them promptly following infection diagnosis, delivering the medication directly to the homes of patients with confirmed infections. As a result, there is a need for risk prioritization that will facilitate identification of the patients most at need for these treatments on a daily basis.

The development of prediction models requires balancing the complexity or expressivity of the model, on the one hand, with its explainability or transparency on the other hand. Higher expressivity is generally associated with higher performance and greater variability of risk estimations. This usually translates to a model with high granularity of risk cut-offs that can be used to classify the population into increased or regular risk groups. At the extreme end of the expressivity spectrum are machine-learning models that can be based on an extremely large number of variables.

However, these models are “black-boxes”, and this poses a challenge when prioritizing sensitive resources that requires models to be explainable and transparent (“white-boxes” that can be easily understood). Transparency was a central theme in the early discussions surrounding prioritization of Covid-19 vaccines (2). Due to the expressivity-explainability tradeoff, developers must recognize the importance of these properties within the context of the model’s designated use.

Almost two years into the Covid-19 pandemic, Clalit Health Services (CHS), a healthcare organization responsible for the care of over half of the Israeli population, has deployed and used Covid-19 prediction models in both extremes of the of the expressivity-explainability continuum(3,4). For the purpose of prioritization of sensitive, limited resources such as Covid-19 therapeutics, that are available in various and changing amounts, it was recognized that there is a need to develop models that will be explainable and transparent, while providing good performance with enough granularity to enable allocation of resources dynamically across various risk cut-offs. A points-based scoring system that augments and simplifies a more expressive raw model has the benefit of being both adequately expressive, while maintaining transparency and explainability, representing a middle point in the expressivity-explainability continuum.

This paper describes the development and performance of two points-based logistic regression-derived models for the allocation of Covid-19 therapeutics on a daily basis. Although both models are transparent and provide good performance, they still differ in the level of tradeoff between explainability and granularity. The ‘full’ model, targeted for use by healthcare providers, has the highest possible fidelity with the underlying raw logistic regression model, maximizing its risk-stratification capabilities. The ‘simplified’ model, which aims to enable self-assessment by the general public – an important tool for policymaker-public communication, maximizes ease of calculation at the expense of expressivity and granularity.

## Methods

### Data source

The Models were trained with data from CHS, the largest integrated payer-provider healthcare organization in Israel. The CHS data repositories include over 4.8 million members, accounting for approximately 53% of the Israeli population. These data repositories contains both inpatient and outpatient data, which enables CHS to ascertain a full medical history of its members. COVID-19 related data such as diagnostic test results, hospitalizations and disease severity, are collected centrally by the Israeli Ministry of Health (MOH) and sent daily to the respective health care provider.

### Study population

The derivation population used for developing the models included all individuals aged 18 and above, who were identified as covid-19 positives (excluding borderline cases) and did not receive any medication for preventing severe COVID-19 outcome, during December 25, 2021 through March 16, 2022. This period marks the Omicron surge in Israel.

### Outcomes and model variables

The outcome of interest was severe Covid-19, defined according to the Israeli ministry of health criteria (including events of Covid-19-related deaths). The follow-up period for determining the outcome was 21 days from the date of confirmed infection.

The outcome was modeled using a list of known risk factors and prognostic variables. The following variables were introduced into the model: demographic factors; known or possible risk factors for severe Covid-19 as defined by the Centers of Diseases Control and Prevention(5); amount of hospitalizations in recent years as a marker for burden and stability of chronic conditions; and immunization status (vaccination and previous infection). Due to evidence of waning immunity over time(6), the effect of Covid-19 vaccination was included as a variable quantifying the time that has elapsed since the individual’s latest vaccine dose (second, third or fourth).

### Statistical Analysis

The outcome was modeled using logistic regression. The final decision of which variables to include in the model was based on statistical significance of effects; preventing multi-collinearity; ease of determining the value of each variable for self-assessment of risk scores by CHS members; and fairness considerations.

Multi-collinearity status was evaluated by calculating the Variable Inflation Factor (VIF) for each covariate in the model. The model was fitted on a random subset of 80% of the dataset (the train set), while the remaining 20% was kept for model evaluation (the test set).

#### Full Points Model

The raw model coefficients were transformed to risk points (integer values) via multiplication by a constant factor and rounding the result. This constant factor was chosen such that the smallest statistically significant raw model coefficient, in absolute value, was transformed to a single risk point. The final risk score is calculated by summation of risk points of all the model variables. Since certain conditions, such as recent vaccination, decrease the predicted risk of the outcome, the resulting model has the potential to assign a negative risk score for the young, healthy, vaccinated individuals. In order to prevent negative scores, the associated risk points of all levels of the age group variable (including the reference level) was increased accordingly by a constant number, such that the minimal possible risk score generated by the model is zero points.

#### Simplified Points Model

In order to enable easy and accurate self-assessment of risk, an additional model was derived by including an additional processing step, in which raw model coefficients were divided by 3 prior to their being rounded to integer values. Resulting values (after division by 3) were capped to a minimal value of 1 (or -1 for negative coefficients) in order to prevent a variable from being associated with 0 risk points. Accordingly, risk points 0.5 - 4.49 were simplified to 1, risk points 4.5 – 7.49 were simplified to 2 and so forth, with similar transformation occurring for negative risk points.

#### Missing values

Other than obesity status, no variables had missing values. For individuals with missing BMI (required to define the obesity status), imputation was performed with obesity status set to 0.

#### Evaluation of Model Performance

Both the full and simplified points models were evaluated on the test dataset. The models were evaluated for area under the receiver operator curve (AUROC). In addition, for each potential threshold (i.e. each unique risk score the model generated) the following metrics were calculated: sensitivity, positive predictive value (PPV) and lift.

#### Sensitivity Analysis

The exclusion of patients who received Covid-19 therapy might introduce bias in estimating the risk for the highest risk patients (who are more likely to receive this therapy). To examine the effect on the resulting model, a separate analysis included these patients as a sensitivity analysis.

## Results

A total of 613,513 individuals met the inclusion criteria: aged 18 and above, a confirmed SARS-CoV-2 infection between December 25, 2021 and March 16, 2022, and continuous CHS membership (Supplementary Figure 1). A total of 1.1% had missing BMI value that was imputed as no obesity.

1,763 (0.287%) individuals developed the outcome during the follow-up period. The dataset used for training had 490,810 observations, 122,703 observations were set aside for the purpose of model evaluation. Baseline characteristics of the dataset with stratification by outcome are described in Table 1.

View this table:
[Table 1:](http://medrxiv.org/content/early/2022/06/28/2022.06.27.22276907/T1)

Table 1: 
Study population, stratified by outcome

In preliminary analysis the following variables were considered, but excluded from the models due to small or non-significant effect: pregnancy, hypertension, smoking status, and asthma. The binary variable ‘sex’ was removed due to fairness considerations. The categorical variable ‘number of vaccine doses’ was removed to prevent collinearity with the variable of ‘months from last Covid-19 vaccine dose’. The final model variables include: age group, obesity, active cancer, chronic kidney disease (CKD), diabetes of any type, heart disease)including cerebrovascular disease), neurological disease, liver disease, chronic obstructive pulmonary disease, immunosuppression, number of hospitalizations in the 3 years prior to the index date (as a categorical variable: 0, 1-2, 3-5, 6 or more), previous Covid-19 infection (as a binary variable), months from last Covid-19 vaccination dose (provided that at least 2 doses were administered, as a categorical variable: unvaccinated, 1-6, 7-10, 11 or more). The highest VIF value for the final model variables was 2.2, indicating that there was no evidence of any further multi-collinearity.

The logistic regression model coefficients are specified in supplementary Table S1. Age is the single most important predictive risk factor for developing a severe outcome. Indeed, a clear, monotonically increasing, association between age and the outcome is evident. The most important prognostic variables following age were: number of hospitalizations and the time from last vaccination dose. Both variables show clear monotonic trends with the outcome.

The final full and simplified point-based models are described in Table 2. Test set performance metrics are presented on Tables 3 and 4, showing the predicted risk associated with each risk score, as well as the percent of population above each possible risk score serving as a threshold, for the full and simplified models respectively. The full model demonstrated more granular risk-stratification capabilities, offering 77 potential thresholds compared to 27 potential thresholds of the simplified model. For dynamic resource allocation that includes prioritizing shares of up to 20% of the population, the full models offers risk-thresholds with maximal increase of 2% between thresholds. The simplified model offers numerous thresholds as well, but the maximal increase in this range is 4%, offering less granularity.

View this table:
[Table 2:](http://medrxiv.org/content/early/2022/06/28/2022.06.27.22276907/T2)

Table 2: 
Full and simplified point models – allocation of points per risk factor

View this table:
[Table 3:](http://medrxiv.org/content/early/2022/06/28/2022.06.27.22276907/T3)

Table 3: 
Test set performance metrics of the full model for all possible risk scores (thresholds)

View this table:
[Table 4:](http://medrxiv.org/content/early/2022/06/28/2022.06.27.22276907/T4)

Table 4: 
Test set performance metrics of the simplified model for all possible risk scores (thresholds)

The test-set AUROC of both the full and simplified models was 0.95. Tables 3 and 4 also present the test set performance metrics per each potential model threshold. Both models proved effective in identifying small subsets of the population that held high proportion of those who ended up deteriorating – for example, the full model offers prioritizing 1%, 5% or 10% of the population for interventions, capturing 36%, 68% or 83% of the high-risk patients.

### Sensitivity Analysis

To verify that excluding patients who received Covid-19 therapy did not introduce significant bias in the risk estimated by the models, we compared the resulting models when these individuals were included in the analysis to the models in the main analysis in which these individuals were excluded. The models are summarized in supplementary Table S2.

The differences are small and mostly amount to a single risk point difference in some of the variables. Discrepancies larger than a single risk points occurred only once, in the risk points for age group 30-39, where only 25 events are observed and the estimate is not statistically significant. The changes in test-set AUROC for the resulting models is 0.002 (0.948 vs 0.950).

## Discussion

We developed two related point-based models for adults, aged 18 years old and above, that accurately predict the risk of deterioration from infection with the Omicron variant of SARS-CoV-2 to severe Covid-19. Although both models are transparent and offer good granularity, they still vary in their level of simplicity, with the more complex model suitable for use by healthcare providers and the simpler one targeted for use by the general public. Whilst the more complex model is more expressive (as would be expected) and allows for more granular risk stratification, both models show very high levels of discrimination (with AUROC = 0.95 for both models).

Prioritization has played a critical role in this pandemic, and over the course of it CHS employed risk-stratification tools in numerous settings. For example, clinically-high-risk individuals, as defined by a predictive model, were targeted for proactive outreach by family care practitioners and nurses, with the aim of helping them minimize their risk. They did this by making these individuals aware of their risk and offering services such as home delivery of their regular medications to help them reduce their exposure risk(4). In addition, the predictive modes were used to prioritize patients for inpatient care once they were infected (in situations where it would have been clinically justifiable for them to be managed in the community). More recently, there was a need for prioritization of eligible candidates for novel Covid-19 therapeutics. Existing tools were either not transparent enough or did not offer enough granularity of thresholds(3,4)

A new targeted approach for optimizing the daily allocation of Covid-19 therapeutics was needed. Therefore, we leveraged CHS’s rich EMR to develop up-to-date and accurate models that take into account both vaccination status and known risk factors, all within the current variant ‘landscape’ predominated by Omicron. We put an emphasis on models that would be both transparent and suitable expressive. Most existing models for evaluating risk of severe Covid-19 were developed and published in the beginning of the pandemic and are not suitable for the prioritization of Covid-19 therapeutics(7) – they are either not transparent in a manner that allows for manual risk evaluating, do not take into account features such as Covid-19 vaccines or prior SARS-CoV-2 infection, or are based on a population of patients that were admitted to a hospital rather than the general population of infected individuals.

Our study is subject to a number of limitations. First, our analysis excludes individuals that received a Covid-19 therapy since high-risk individuals successfully treated with these therapeutics might be wrongly classified by the model as low risk and not prioritized correctly. This may bias our predictions towards underestimation since the highest risk patients were more likely to receive these preventative treatments. However, due to the limited supply, challenging logistics of administering these drugs, and varying adherence rates, the study cohort still held substantial representation of high-risk patients. In a sensitivity analysis, which included these treated subjects, the results were not significantly different in terms of the final point assignment and performance metrics (Supplementary Table S2).

Second, the model was trained on data acquired during the period in which the Omicron variant was predominant in Israel – a variant that has been shown to result in lower rates of severe outcomes relative to previous variants. Hence, the model might not be generalizable to novel future variants. Third, rare conditions associated with a high risk of deterioration to severe Covid-19 may be missed by the model due to lack of the required sample size for being allotted a point. This concern is applicable to all risk prediction models or risk factor lists. This concern could be mitigated somewhat by the inclusion of risk points for all-cause hospitalization, as well as enabling diversion of resources to individuals classified as low risk at the treating physician’s discretion.

In conclusion, we describe the development of two point-based models assessing risk of deterioration to severe Covid-19 or Covid-19-related death in individuals infected with the Omicron variant. These models are highly transparent, while maintaining good performance and sufficient granularity, and can aid in the daily prioritization of Covid-19.

## Supporting information

TRIPOD checklist [[supplements/276907_file03.pdf]](pending:yes)

Supplemental data [[supplements/276907_file04.pdf]](pending:yes)

## Data Availability

Owing to data privacy regulations, the raw data for this study cannot be shared.

## Footnotes

*   ** Joint senior authors

*   Received June 27, 2022.
*   Revision received June 27, 2022.
*   Accepted June 28, 2022.


*   © 2022, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1.  1.Statement on the tenth meeting of the International Health Regulations (2005) Emergency Committee regarding the coronavirus disease (COVID-19) pandemic [Internet]. [cited 2022 Mar 31]. Available from: [https://www.who.int/news/item/19-01-2022-statement-on-the-tenth-meeting-of-the-international-health-regulations-(2005)-emergency-committee-regarding-the-coronavirus-disease-(covid-19)-pandemic](https://www.who.int/news/item/19-01-2022-statement-on-the-tenth-meeting-of-the-international-health-regulations-(2005)-emergency-committee-regarding-the-coronavirus-disease-(covid-19)-pandemic)
    
    
2.  2.World Health Organization. WHO SAGE values framework for the allocation and prioritization of COVID-19 vaccination, 14 September 2020. Geneva: World Health Organization; 2020.
    
    
3.  3.Barda N, Riesel D, Akriv A, Levy J, Finkel U, Yona G, et al. Developing a COVID-19 mortality risk prediction model when individual-level data are not available. Nat Commun. 2020 Sep 7;11(1):4439.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F06%2F28%2F2022.06.27.22276907.atom) 

4.  4.Dagan N, Barda N, Riesel D, Grotto I, Sadetzki S, Balicer R. A score-based risk model for predicting severe COVID-19 infection as a key component of lockdown exit strategy. medRxiv. 2020 May 23;
    
    
5.  5.Certain Medical Conditions and Risk for Severe COVID-19 Illness | CDC [Internet]. [cited 2020 Dec 9]. Available from: [https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/people-with-medical-conditions.html](https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/people-with-medical-conditions.html)
    
    
6.  6.Goldberg Y, Mandel M, Bar-On YM, Bodenheimer O, Freedman L, Haas EJ, et al. Waning immunity of the BNT162b2 vaccine: A nationwide study from Israel. medRxiv. 2021 Aug 30;
    
    
7.  7.Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020 Apr 7;369:m1328.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNjkvYXByMDdfMi9tMTMyOCI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIyLzA2LzI4LzIwMjIuMDYuMjcuMjIyNzY5MDcuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9)