KDClassifier: Urinary Proteomic Spectra Analysis Based on Machine Learning for Classification of Kidney Diseases

Wanjun Zhao; Yong Zhang; Xinming Li; Yonghong Mao; Changwei Wu; Lijun Zhao; Fang Liu; Jingqiang Zhu; Jingqiu Cheng; Hao Yang; Guisen Li

doi:10.1101/2020.12.01.20242198

Abstract

Background By extracting the spectrum features from urinary proteomics based on an advanced mass spectrometer and machine learning algorithms, more accurate reporting results can be achieved for disease classification. We attempted to establish a novel diagnosis model of kidney diseases by combining machine learning with an extreme gradient boosting (XGBoost) algorithm with complete mass spectrum information from the urinary proteomics.

Methods We enrolled 134 patients (including those with IgA nephropathy, membranous nephropathy, and diabetic kidney disease) and 68 healthy participants as a control, and for training and validation of the diagnostic model, applied a total of 610,102 mass spectra from their urinary proteomics produced using high-resolution mass spectrometry. We divided the mass spectrum data into a training dataset (80%) and a validation dataset (20%). The training dataset was directly used to create a diagnosis model using XGBoost, random forest (RF), a support vector machine (SVM), and artificial neural networks (ANNs). The diagnostic accuracy was evaluated using a confusion matrix. We also constructed the receiver operating-characteristic, Lorenz, and gain curves to evaluate the diagnosis model.

Results Compared with RF, the SVM, and ANNs, the modified XGBoost model, called a Kidney Disease Classifier (KDClassifier), showed the best performance. The accuracy of the diagnostic XGBoost model was 96.03% (CI = 95.17%-96.77%; Kapa = 0.943; McNemar’s Test, P value = 0.00027). The area under the curve of the XGBoost model was 0.952 (CI = 0.9307-0.9733). The Kolmogorov-Smirnov (KS) value of the Lorenz curve was 0.8514. The Lorenz and gain curves showed the strong robustness of the developed model.

Conclusions This study presents the first XGBoost diagnosis model, i.e., the KDClassifier, combined with complete mass spectrum information from the urinary proteomics for distinguishing different kidney diseases. KDClassifier achieves a high accuracy and robustness, providing a potential tool for the classification of all types of kidney diseases.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was funded by grants from the National Natural Science Foundation of China (grant no. 31901038), the China Postdoctoral Science Foundation (2019M653438), the Post-Doctoral Research Foundation of West China Hospital of Sichuan University (2018HXBH062), and the 1.3.5 Project for Disciplines of Excellence, West China Hospital, Sichuan University (ZYGD18014, CJQ).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Diagnosis and pathological examinations of the kidney diseases were conducted at the Department of Nephrology, Sichuan Provincial People's Hospital. Informed consent was obtained from the patients. The study protocol was approved by the Medical Ethics Committee of the Sichuan Provincial People's Hospital and West China Hospital.

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

The mass spectra data on the urinary proteomics were deposited into the ProteomeXchange Consortium through the PRIDE partner repository using the dataset identifier PXD018996.

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.