Development and utilization of an intelligent application for aiding COVID-19 diagnosis

Zirui Meng; Minjin Wang; Huan Song; Shuo Guo; Yanbing Zhou; Weimin Li; Yongzhao Zhou; Mengjiao Li; Xingbo Song; Yi Zhou; Qingfeng Li; Xiaojun Lu; Binwu Ying

doi:10.1101/2020.03.18.20035816

ABSTRACT

Background COVID-19 has been spreading globally since emergence, but the diagnostic resources are relatively insufficient.

Results In order to effectively relieve the resource deficiency of diagnosing COVID-19, we developed a machine learning-based diagnosis model on basis of laboratory examinations indicators from a total of 620 samples, and subsequently implemented it as a COVID-19 diagnosis aid APP to facilitate promotion.

Conclusions External validation showed satisfiable model prediction performance (i.e., the positive predictive value and negative predictive value was 86.35% and 84.62%, respectively), which guarantees the promising use of this tool for extensive screening.

BACKGROUND

Since the outbreak of Corona Virus Disease 2019(COVID-19) in Wuhan, in December 2019, the epidemic has spread rapidly all over the world, which consequently brought great challenges to global public health(1). As of March 14, a total of 142,539 cases had been confirmed worldwide, with many more requiring tests(2). Currently, accurate nucleic acid detection plays a pivotal role in diagnosis and prevention of COVID-19, and reverse transcription polymerase chain reaction is the main method for nucleic acid detection(3). However, it is not feasible to use this time-and labor-consuming approach for screening a large growing number of suspected patients and asymptomatic infected patients(4). Rapid and universal screening method is essential to save medical resources, improve diagnosis efficiency and avoid cross-infection.

Taking advantage of the emerging machine learning technique, which enables analysis of the existing multidimensional data by applying appropriate algorithms for feature expression and classification, we have potential to improve the accuracy of diagnosis. For instance, in a recent study published in Nature, deep learning-based statistical model was used to improve breast cancer treatment and survival through identifying breast cancer patients with a high risk of long-term recurrence(5). In this study, we analyzed a variety of basic laboratory examinations indicators that related to host reactions to generate a free COVID-19 diagnosis aid APP. Then, the accuracy of this app was validated in an independent cohort(6).

METHODS

Patients and information collection

We have complied with all relevant ethical regulations for work with human subjects and the studies were approved by the Clinical Trials and Biomedical Ethics Committee of West China Hospital, Sichuan University. Through the West China hospital, a regional medical center, we collected the information of confirmed COVID-19 cases from various regional medical institutions, between 20th December 2019 and 10th February 2020. Diagnostic criteria were: RT-PCR of respiratory or blood samples was positive for SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) nucleic acid; or viral gene sequencing of respiratory or blood sample was highly homologous with SARS-CoV-2. Controls were patients collected in the same site and the same time period, who were free of COVID-19 but had a diagnosis of viral pneumonia. Results of laboratory examinations on admission were collected which included age, gender and 35 indicators of whole blood count, coagulation test and biochemical examination. In addition, the independent validation cohort consisted of samples collected prospectively from one single medical center, using similar data collection strategy.

Feature selection and model construction

Factors under consideration are age, sex and indicators of basic laboratory examinations(see Table 1). Using 70% samples from the derivation cohort (i.e., the training set), we first performed feature selection using Least Absolute Shrinkage and Selection Operator (LASSO) to select the valuable indicators of COVID-19. The main idea of LASSO for feature selection is to construct a first-order penalty function, which can shrink the regression coefficient (b) of each variable within a certain range, and eliminate the feature with a coefficient of 0, and finally obtain an optimally refined model(7). The penalty terms for LASSO are: sum (abs(b)) <= t. Then these indicators were used to construct a primitive model by multivariate logistic regression. The parameters were further adjusted through internal validation (i.e., the other 30% samples, the testing set) to obtain the optimized final model.

View this table:

Table 1. Candidate of the model.

Evaluation

In external validation cohort, we compared the predicted results, with the follow-up results to evaluate the diagnostic validity by positive predictive values, and negative predictive values. Finally, to facilitate further use in clinic setting, all model parameters and code were encapsulated as a visual application(APP), COVID-19 Diagnosis Aid APP.

Statistical analysis

Continuous variables are represented by the median (upper and lower quartiles). Categorical variables are expressed in terms of frequency. LASSO algorithm was used for feature selection and the model was constructed by multivariate logistic regression. The diagnostic performance of the model was assessed by AUC. Calibration curves and Hosmer-Lemeshow test were used to evaluate the degree of overestimation or underestimation of the model. DCA were used to measure the net clinical benefits. The LASSO algorithm was performed by “glmmet” package. The logistic regression model was constructed by “glm” package. All statistical analyses were completed using R 3.5.0 version.

RESULTS

Participants and Clinical Characteristics

A total of 620 samples were included in the derivation cohort, among which 431 samples (211 COVID-19 vs. 220 Control) were in the training set and 189 samples (91 COVID-19 vs. 98 Control) were in testing set by simple randomization. The frequency of COVID-19 in the training set (48.96%) was not significantly different from that in the testing set (48.14%). We included 145 samples in the independent validation cohort.

Model development and evaluation

Through Lasso regression screening and Multivariate logistic regression, 9 representative variables (age, Activated Partial Thromboplastin Time, Red Blood Cell Distribution Width-SD, Uric Acid, Triglyceride, Serum Potassium, Albumin/globulin, 3-Hydroxybutyrate, Serum Calcium) with good identification value were selected and constructed an optimized diagnostic model(see Figure 1). According to the suggestive information from the model, among 145 samples that used for validation, 80 samples were estimated as COVID-19 and 65 samples were considered COVID-19-free. Compared with the status determined by nucleic acid detection (the golden standard), we found 69 real COVID-19 samples were correctly identified by the app (69/80, positive predictive value =86.25%), while 55, out of the 65 COVID-19-free samples, were confirmed to be negative (55/65, negative predictive value =84.62%). The area under curve (AUC) of the model were 0.890 and 0.872 in the testing set and independent validation cohort, respectively(see Figure 2)(8). The calibration curve performed well and Hosmer-Lemeshow test results P value was much greater than 0.05(see Figure 3). The DCA quantitatively demonstrated high clinical net benefit over the entire probability threshold(see Figure 4)(9).

Figure 1.

(a) Coefficient diagram of LASSO variables. Each curve in the figure represents the trajectory of the coefficient of an independent variable. The ordinate is the value of the coefficient. The lower abscissa, λ, is the parameter that controls the severity of the penalty. The upper abscissa is the number of non-zero coefficients in the model under the penalty parameter. (b) Adjustment parameters in the LASSO model. The lambda is screened by 10 folds cross-validation. A dashed vertical line is drawn at one standard error (1-SE standard) of the minimum and minimum standards. Lambda.1se corresponds to a model with good performance but the fewest number of arguments.

Figure 2. Receiver operator characteristic curve.

The AUC were 0.890 and 0.872 in derivation and validation cohort respectively.

Figure 3. Calibration curves.

The 45° shaded line represents the ideal prediction and the prediction probability is consistent with the actual observation probability. The Solid line represents the actual prediction of the model. The stapled histogram on the bottom line represents the distribution of patients’ predicted probability. Abbreviations: Dxy = Somer’s D rank correlation, R2 = Nagelkerke-Cox-Snell-Maddala-Magee R-squared index, D = Discrimination index, U = Unreliability index, Q = Quality index, Emax = maximum absolute difference in predicted and calibrated propabilities, S:z = Spiegelhalter Z-Test, S:p = two-tailed p-value of the Spiegelhalter Z-test.

Figure 4. Decision curve analysis.

The horizontal axis is the threshold probability of occurrence of COVID-19. The vertical axis shows the clinical benefits that patients may gain or lose using the APP. Dotted line: prediction model. Solid line: all patients were COVID-19. Horizontal line: all patients were not COVID-19.

Construction of application

The model was encapsulated as an APP, COVID-19 Diagnosis Aid APP. The risk of COVID-19 can be obtained by inputting the required indicators into the APP. It will be online in the android store and IOS APP Store soon for facilitating the further application(see Figure 5).

Figure 5. APP schematic.

The model can show the probability of COVID-19 after inputing the corresponding index into the the APP.

DISCUSSION

According to the latest epidemiological information of the World Health Organization(WHO), SARS-CoV-2 is spreading worldwide with an intensifying situation, and many countries including South Korea, Italy, and the United States have experienced outbreaks(10). Timely diagnosis, isolation and treatment are critical to control the global spread of COVID-19. However, nucleic acid testing is not always feasible or fully affordable in all regions. We therefore need more ways to realize the early and accurate community control and screening of suspicious people. To our knowledge, this is the first COVID-19 Diagnosis Aid APP based on laboratory tests for extensive screening. It has good accuracy and stability, which can help medical institutions to predict patients’ situation in advance and accurately invest the limited healthcare resources, so as to effectively control the spread of the epidemic.

Based on the optimized diagnostic model developed in current study, users can obtain the probability of COVID-19 according to the simple, easily available and reliable indicators from laboratory examinations. Given satisfiable model prediction performance (i.e., AUC was 0.872; positive predictive value and negative predictive value was reach 86.25% and 84.62%, respectively in independent validation cohort), the application of this app can provide reliable suggestion for further assessment of the index person, which has potential to reduce anxiety of public, as well as unnecessary hospital visit and nucleic acid testing.

It is undeniable that nucleic acid detection is one of the most reliable standards for the diagnosis of COVID-19, but we should also be aware of its shortcomings and deficiencies in practice. Nucleic acid testing requires professional technical platforms and physicians, and false negative results due to multiple reasons cannot be completely avoided(11). Instead, laboratory indicators that identified in our study are accessible, and can be widely used in community hospitals by family doctors to complete the early stage of suspected population screening.

On the other hand, based on the technological revolution brought by the Internet, this APP software installed on personal mobile can not only help to discover COVID-19 in the most convenient way, but also timely monitor dynamic trends of the prevalence of the contagious disease through data transmission to the Disease Control Center or professional medical institution(12, 13). This provides real-time data with predictive value, which can be reliable basis for more accurate health and epidemic prevention policies. Furthermore, this APP software is continuously upgradeable, with the possibility of extended scope of application. For example, we have added the data acquisition function which enabled the documentation of epidemiological and clinical features and a personalized reminder for necessary revisits. Also, with futher implementation of communication function, it is possible to establish a direct pre-connection between individuals and medical centers, which may benefit patients by avoiding unnecessary hospital visits or shorten waiting time spend before admission (14).

CONCLUSION

COVID-19 Diagnosis Aid APP can efficiently and accurately calculate the infection probability through simple and easily-obtained laboratory examination results. It will promote the screening of a large number of suspected people, save limited medical resources, optimize the diagnosis process, and it can constantly learn, adapt and upgrade, which are vital to control the global spread of COVID-19.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Ethics approval and consent to participate

The protocol of this study was approved by the West China Hospital, Sichuan University Medical Ethics Committee and conformed to the principles of the Declaration of Helsinki.

The registry number for clinical trial

ChiCTR2000030542

Consent for publication

Not applicable

Data Availability Statement

The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

Competing interests

The authors declare that they have no competing interests

Funding

This study was supported by Science and Technology Department of Sichuan Province “2020YFS0004” and Science and Technology Project of West China Hospital “HX-2019-nCov-066”

Authors’ contributions

Zirui Meng analyzed the data and was a major contributor in writing the manuscript. Minjin Wang designed the APP framework and was a major contributor in writing the manuscript. Huan Song analyzed the data and was a major contributor in writing the manuscript. Shuo Guo analyzed the data. Yanbing Zhou analyzed the data. Weimin Li analyzed patients’ medical records. Yongzhao Zhou analyzed patients’ medical records. Mengjiao Li collected the data. Xingbo Song verified the performance of the app. Yi Zhou verified the performance of the app. Qingfeng Li collected the data. Xiaojun Lu designed the functions of the app. Binwu Ying proposed research ideas and design research programs. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Beijing Zhongben Tech.co,.Ltd. for their technical contributions to this study.

REFERENCE

1.↵
Z W, JM M. Characteristics of and Important Lessons From the Coronavirus Disease 2019 (COVID-19) Outbreak in China: Summary of a Report of 72□314 Cases From the Chinese Center for Disease Control and Prevention. JAMA. 2020.
2.↵
Data as reported by national authorities by 10 AM CET 14 March 2020 [Available from: https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200314-sitrep-54-covid-19.pdf?sfvrsn=dcd46351_6.
3.↵
VM C, O L, M K, R M, A M, DKW C, et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Euro surveillance : bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin. 2020;25(3).
4.↵
T K, SM J, NM L, R K, K H, T M, et al. Communicating the Risk of Death from Novel Coronavirus Disease (COVID-19). Journal of clinical medicine. 2020;9(2).
5.↵
OM R, SJ S, JA S, SF C, JL C-J, M C, et al. Dynamics of breast-cancer relapse reveal late-recurring ER-positive genomic subgroups. Nature. 2019;567(7748):399–404.
OpenUrl CrossRef
6.↵
F S, N S, F S, Z Z, J S, H L, et al. Emerging Coronavirus 2019-nCoV Pneumonia. Radiology. 2020:200274.
7.↵
Tibshirani R. The lasso method for variable selection in the cox model. Stat Med. 1997;16(4):385–95.
OpenUrl CrossRef PubMed Web of Science
8.↵
Muntner P, Colantonio LD, Cushman M, Goff DC, Jr.., Howard G, Howard VJ, et al. Validation of the atherosclerotic cardiovascular disease Pooled Cohort risk equations. JAMA. 2014;311(14):1406–15.
OpenUrl CrossRef PubMed Web of Science
9.↵
Vickers AJ, Elkin EB. Decision curve analysis: A novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–74.
OpenUrl CrossRef PubMed Web of Science
10.↵
S BS, P R, Y S, A M, C C, A S, et al. First cases of coronavirus disease 2019 (COVID-19) in France: surveillance, investigations and control measures, January 2020. Euro surveillance : bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin. 2020;25(6).
11.↵
T A, Z Y, H H, C Z, C C, W L, et al. Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases. Radiology. 2020:200642.
12.↵
CS W, MR T, J B, TP M-T, K H, D P, et al. Taking connected mobile-health diagnostics of infectious diseases to the field. Nature. 2019;566(7745):467–74.
OpenUrl
13.↵
EV G, A S, ND C. Using the Internet to identify infectious-disease outbreaks. The New England journal of medicine. 2004;351(24):2558–9.
OpenUrl CrossRef PubMed
14.↵
JT W, K L, GM L. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. Lancet (London, England). 2020.