ABSTRACT
Background COVID-19 has been spreading globally since emergence, but the diagnostic resources are relatively insufficient.
Results In order to effectively relieve the resource deficiency of diagnosing COVID-19, we developed a machine learning-based diagnosis model on basis of laboratory examinations indicators from a total of 620 samples, and subsequently implemented it as a COVID-19 diagnosis aid APP to facilitate promotion.
Conclusions External validation showed satisfiable model prediction performance (i.e., the positive predictive value and negative predictive value was 86.35% and 84.62%, respectively), which guarantees the promising use of this tool for extensive screening.
BACKGROUND
Since the outbreak of Corona Virus Disease 2019(COVID-19) in Wuhan, in December 2019, the epidemic has spread rapidly all over the world, which consequently brought great challenges to global public health(1). As of March 14, a total of 142,539 cases had been confirmed worldwide, with many more requiring tests(2). Currently, accurate nucleic acid detection plays a pivotal role in diagnosis and prevention of COVID-19, and reverse transcription polymerase chain reaction is the main method for nucleic acid detection(3). However, it is not feasible to use this time-and labor-consuming approach for screening a large growing number of suspected patients and asymptomatic infected patients(4). Rapid and universal screening method is essential to save medical resources, improve diagnosis efficiency and avoid cross-infection.
Taking advantage of the emerging machine learning technique, which enables analysis of the existing multidimensional data by applying appropriate algorithms for feature expression and classification, we have potential to improve the accuracy of diagnosis. For instance, in a recent study published in Nature, deep learning-based statistical model was used to improve breast cancer treatment and survival through identifying breast cancer patients with a high risk of long-term recurrence(5). In this study, we analyzed a variety of basic laboratory examinations indicators that related to host reactions to generate a free COVID-19 diagnosis aid APP. Then, the accuracy of this app was validated in an independent cohort(6).
METHODS
Patients and information collection
We have complied with all relevant ethical regulations for work with human subjects and the studies were approved by the Clinical Trials and Biomedical Ethics Committee of West China Hospital, Sichuan University. Through the West China hospital, a regional medical center, we collected the information of confirmed COVID-19 cases from various regional medical institutions, between 20th December 2019 and 10th February 2020. Diagnostic criteria were: RT-PCR of respiratory or blood samples was positive for SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) nucleic acid; or viral gene sequencing of respiratory or blood sample was highly homologous with SARS-CoV-2. Controls were patients collected in the same site and the same time period, who were free of COVID-19 but had a diagnosis of viral pneumonia. Results of laboratory examinations on admission were collected which included age, gender and 35 indicators of whole blood count, coagulation test and biochemical examination. In addition, the independent validation cohort consisted of samples collected prospectively from one single medical center, using similar data collection strategy.
Feature selection and model construction
Factors under consideration are age, sex and indicators of basic laboratory examinations(see Table 1). Using 70% samples from the derivation cohort (i.e., the training set), we first performed feature selection using Least Absolute Shrinkage and Selection Operator (LASSO) to select the valuable indicators of COVID-19. The main idea of LASSO for feature selection is to construct a first-order penalty function, which can shrink the regression coefficient (b) of each variable within a certain range, and eliminate the feature with a coefficient of 0, and finally obtain an optimally refined model(7). The penalty terms for LASSO are: sum (abs(b)) <= t. Then these indicators were used to construct a primitive model by multivariate logistic regression. The parameters were further adjusted through internal validation (i.e., the other 30% samples, the testing set) to obtain the optimized final model.
Evaluation
In external validation cohort, we compared the predicted results, with the follow-up results to evaluate the diagnostic validity by positive predictive values, and negative predictive values. Finally, to facilitate further use in clinic setting, all model parameters and code were encapsulated as a visual application(APP), COVID-19 Diagnosis Aid APP.
Statistical analysis
Continuous variables are represented by the median (upper and lower quartiles). Categorical variables are expressed in terms of frequency. LASSO algorithm was used for feature selection and the model was constructed by multivariate logistic regression. The diagnostic performance of the model was assessed by AUC. Calibration curves and Hosmer-Lemeshow test were used to evaluate the degree of overestimation or underestimation of the model. DCA were used to measure the net clinical benefits. The LASSO algorithm was performed by “glmmet” package. The logistic regression model was constructed by “glm” package. All statistical analyses were completed using R 3.5.0 version.
RESULTS
Participants and Clinical Characteristics
A total of 620 samples were included in the derivation cohort, among which 431 samples (211 COVID-19 vs. 220 Control) were in the training set and 189 samples (91 COVID-19 vs. 98 Control) were in testing set by simple randomization. The frequency of COVID-19 in the training set (48.96%) was not significantly different from that in the testing set (48.14%). We included 145 samples in the independent validation cohort.
Model development and evaluation
Through Lasso regression screening and Multivariate logistic regression, 9 representative variables (age, Activated Partial Thromboplastin Time, Red Blood Cell Distribution Width-SD, Uric Acid, Triglyceride, Serum Potassium, Albumin/globulin, 3-Hydroxybutyrate, Serum Calcium) with good identification value were selected and constructed an optimized diagnostic model(see Figure 1). According to the suggestive information from the model, among 145 samples that used for validation, 80 samples were estimated as COVID-19 and 65 samples were considered COVID-19-free. Compared with the status determined by nucleic acid detection (the golden standard), we found 69 real COVID-19 samples were correctly identified by the app (69/80, positive predictive value =86.25%), while 55, out of the 65 COVID-19-free samples, were confirmed to be negative (55/65, negative predictive value =84.62%). The area under curve (AUC) of the model were 0.890 and 0.872 in the testing set and independent validation cohort, respectively(see Figure 2)(8). The calibration curve performed well and Hosmer-Lemeshow test results P value was much greater than 0.05(see Figure 3). The DCA quantitatively demonstrated high clinical net benefit over the entire probability threshold(see Figure 4)(9).
Construction of application
The model was encapsulated as an APP, COVID-19 Diagnosis Aid APP. The risk of COVID-19 can be obtained by inputting the required indicators into the APP. It will be online in the android store and IOS APP Store soon for facilitating the further application(see Figure 5).
DISCUSSION
According to the latest epidemiological information of the World Health Organization(WHO), SARS-CoV-2 is spreading worldwide with an intensifying situation, and many countries including South Korea, Italy, and the United States have experienced outbreaks(10). Timely diagnosis, isolation and treatment are critical to control the global spread of COVID-19. However, nucleic acid testing is not always feasible or fully affordable in all regions. We therefore need more ways to realize the early and accurate community control and screening of suspicious people. To our knowledge, this is the first COVID-19 Diagnosis Aid APP based on laboratory tests for extensive screening. It has good accuracy and stability, which can help medical institutions to predict patients’ situation in advance and accurately invest the limited healthcare resources, so as to effectively control the spread of the epidemic.
Based on the optimized diagnostic model developed in current study, users can obtain the probability of COVID-19 according to the simple, easily available and reliable indicators from laboratory examinations. Given satisfiable model prediction performance (i.e., AUC was 0.872; positive predictive value and negative predictive value was reach 86.25% and 84.62%, respectively in independent validation cohort), the application of this app can provide reliable suggestion for further assessment of the index person, which has potential to reduce anxiety of public, as well as unnecessary hospital visit and nucleic acid testing.
It is undeniable that nucleic acid detection is one of the most reliable standards for the diagnosis of COVID-19, but we should also be aware of its shortcomings and deficiencies in practice. Nucleic acid testing requires professional technical platforms and physicians, and false negative results due to multiple reasons cannot be completely avoided(11). Instead, laboratory indicators that identified in our study are accessible, and can be widely used in community hospitals by family doctors to complete the early stage of suspected population screening.
On the other hand, based on the technological revolution brought by the Internet, this APP software installed on personal mobile can not only help to discover COVID-19 in the most convenient way, but also timely monitor dynamic trends of the prevalence of the contagious disease through data transmission to the Disease Control Center or professional medical institution(12, 13). This provides real-time data with predictive value, which can be reliable basis for more accurate health and epidemic prevention policies. Furthermore, this APP software is continuously upgradeable, with the possibility of extended scope of application. For example, we have added the data acquisition function which enabled the documentation of epidemiological and clinical features and a personalized reminder for necessary revisits. Also, with futher implementation of communication function, it is possible to establish a direct pre-connection between individuals and medical centers, which may benefit patients by avoiding unnecessary hospital visits or shorten waiting time spend before admission (14).
CONCLUSION
COVID-19 Diagnosis Aid APP can efficiently and accurately calculate the infection probability through simple and easily-obtained laboratory examination results. It will promote the screening of a large number of suspected people, save limited medical resources, optimize the diagnosis process, and it can constantly learn, adapt and upgrade, which are vital to control the global spread of COVID-19.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Ethics approval and consent to participate
The protocol of this study was approved by the West China Hospital, Sichuan University Medical Ethics Committee and conformed to the principles of the Declaration of Helsinki.
The registry number for clinical trial
ChiCTR2000030542
Consent for publication
Not applicable
Data Availability Statement
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.
Competing interests
The authors declare that they have no competing interests
Funding
This study was supported by Science and Technology Department of Sichuan Province “2020YFS0004” and Science and Technology Project of West China Hospital “HX-2019-nCov-066”
Authors’ contributions
Zirui Meng analyzed the data and was a major contributor in writing the manuscript. Minjin Wang designed the APP framework and was a major contributor in writing the manuscript. Huan Song analyzed the data and was a major contributor in writing the manuscript. Shuo Guo analyzed the data. Yanbing Zhou analyzed the data. Weimin Li analyzed patients’ medical records. Yongzhao Zhou analyzed patients’ medical records. Mengjiao Li collected the data. Xingbo Song verified the performance of the app. Yi Zhou verified the performance of the app. Qingfeng Li collected the data. Xiaojun Lu designed the functions of the app. Binwu Ying proposed research ideas and design research programs. All authors read and approved the final manuscript.
Acknowledgements
The authors thank Beijing Zhongben Tech.co,.Ltd. for their technical contributions to this study.