Reviewer #1 (Public Review):
In this valuable study, authors Sabanayagam and colleagues used multiple ML models on longitudinal data from a cohort of Chinese, Malay and Indian participants with diabetes to identify predictors for incident DKD.
The study involves a large multi-ethnic cohort of Asian patients with diabetes and uses machine learning methods to predict the 6-year risk of incident CKD. The final study cohort included 1365 patients and 339 features. The authors tested multiple ML methods to identify which provided the best prediction accuracy based on a selected set of features.
Strengths:
The study is interesting and timely, as efforts are needed to develop prognostic methods for incident chronic kidney disease in patients with diabetes. The strengths of the study are the diversity of its cohort and the impressive breadth of covariates, ranging across demographic, lifestyle, socioeconomic, physical, laboratory, retinal imaging, genetic, and blood metabolomics profiles.
An important consideration when assessing the predictive risk of disease progression is to cover all possible risk components, including environmental, metabolic, and physiological factors and social determinants of health, which the authors have done very well.
The authors also did not restrict their analysis by selecting a single algorithm upfront, which strengthens the scientific process and helps avoid bias in the outcome.
The authors take a data-driven approach, recursively eliminating features that may not contribute to statistically significant results. For a dataset of this size, this is a logical way to approach the analysis.
The authors acknowledge the limitation of not having a validation dataset, which is important to address in the scientific process.
Shortcomings:
However, the study has a few shortcomings which, when addressed or clarified, should help strengthen and streamline the analysis.
1. Statistical significance versus clinical significance:
The authors appear to use recursive feature elimination to derive a set of top features for each ML algorithm, selecting from a varied feature set. However, they may need to pay closer attention to what the features that emerge as significant are alluding to. For example, the authors seem to have dropped the datasets containing the genetic and imaging parameters: D = B + Genetic parameters and F = B + Imaging parameters + Blood metabolites + Genetic parameters.
They cite the low performance of the ML models as the reason for dropping these features but do not elaborate on whether they investigated the causes of the drop in performance.
They state this in the manuscript with no citation:
(line 82) "Similarly, genetic abnormalities in diabetes have also been shown to increase the risk of DKD."
... which makes it difficult to assess which of the 76 SNPs were associated with CKD, in which population, and to what extent.
Similarly, the authors note that they have previously found imaging features to be associated with CKD:
We and several others have previously shown that retinal microvascular changes including retinopathy, vessel narrowing, or dilation, and vessel tortuosity were associated with CKD [6, 7].
However, they also drop the dataset that includes the imaging features, citing poor model performance, with no investigation beyond that.
2. The authors discuss the advantage of ML approaches in overcoming the restrictive assumptions of traditional linear models. However, when considering their covariates, they might also want to examine the clinical association between some of the selected features. For example, BMI, HbA1c, duration of diabetes, and systolic BP may not be entirely independent of each other (especially in the context of influencing one another and driving diabetes), and multi-collinearity may need to be looked into.
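One possible way to examine the multi-collinearity raised in point 2 is the variance inflation factor (VIF), which regresses each covariate on the remaining ones. The sketch below is only an illustration under stated assumptions: the data are synthetic, the variable names (`bmi`, `hba1c`, `duration`, `sbp`) and the helper `variance_inflation_factors` are hypothetical, and nothing here reflects the authors' actual data or pipeline.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (n_samples x n_features).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from an ordinary
    least-squares regression of column j on all other columns.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Synthetic data for illustration only (NOT from the study).
rng = np.random.default_rng(0)
n = 500
duration = rng.normal(10, 5, n)                      # hypothetical diabetes duration
hba1c = 7 + 0.1 * duration + rng.normal(0, 0.2, n)   # deliberately tied to duration
sbp = rng.normal(135, 15, n)                         # generated independently
bmi = rng.normal(26, 4, n)                           # generated independently

X = np.column_stack([bmi, hba1c, duration, sbp])
vifs = variance_inflation_factors(X)
for name, v in zip(["bmi", "hba1c", "duration", "sbp"], vifs):
    print(f"{name}: VIF = {v:.2f}")
```

In this toy setup the two deliberately correlated covariates show inflated VIFs while the independent ones stay close to 1. Commonly quoted rules of thumb flag VIFs above roughly 5 to 10, although any threshold is a judgment call.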
3. The following sections seem to require citations:
No citation:
59: As CKD is asymptomatic till more than 50% of kidney function decline, early detection of individuals with diabetes who are at risk of developing DKD may facilitate prevention and appropriate intervention for DKD.
Elaborate on rationale (what is challenging?) and citation needed:
62 Early identification of individuals at risk of developing CKD in type 2 diabetes is challenging. Therefore, characterization of new biomarkers is urgently needed for identifying individuals at risk of progressive decline of eGFR and timely intervention for improving outcomes in DKD.
Citation needed or rationale needs to be backed up:
Machine learning methods using 'Big data', or multi-dimensional data may improve prediction as they have less restrictive statistical assumptions compared to traditional regression models which assume linear relationships between risk factors and the logit of the outcomes and absence of multi-collinearity among explanatory variables.
Citation:
81 Similarly, genetic abnormalities in diabetes have also been shown to increase the risk of DKD.
Citation:
The detailed methodology of the SEED has been published elsewhere.
Citation:
Malay ethnicity has been identified to be a high-risk group for CKD by several studies conducted in Singapore.
4. The authors are attempting to rationalize their findings rather than challenge them to improve the robustness of their analysis. It would strengthen the analysis if they could rule out explanations other than the one they provide, or perform additional analyses that support their claim:
While black ethnicity was a risk factor for CKD in the meta-analysis, in our study, we found Chinese and Malay ethnicity to be at higher risk of developing incident DKD compared to Indian ethnicity. One reason for the Indian ethnicity to be at lower risk of developing DKD could be Indian ethnicity being a high-risk group for diabetes, they may be well aware of the risk, and comply with screening, medication, etc. that could reduce their risk of developing DKD.
5. Following up on the above point, the authors use SDOH (social determinants of health) to identify prognostic risk factors for incident CKD in patients with diabetes without considering what the model may be indicating about ethnicity versus socioeconomic status. It would be good to examine the association of SDOH metrics with ethnicity to see whether the ethnic populations at higher risk for CKD could be disadvantaged due to socioeconomic factors; if so, this needs to be mentioned in the analysis.
6. EN vs other models: The authors claim that EN has much better results than other models in a study where the entire cohort consists of patients with diabetes possibly progressing towards CKD. Usually, risk models assume that disease progresses along a certain trajectory. However, multiple trajectories may exist due to heterogeneity of the disease, and non-linear relationships between features and disease outcome might also influence this. This is what ML models can specifically address over traditional linear models. However, the pathophysiological progression from diabetes to CKD is not as non-linear as assumed, since heterogeneity in disease at that stage (~CKD stage 4) is primarily low and non-additive effects are most likely negligible, which also explains why EN and then LASSO perform so much better than the other models. This needs to be addressed by the authors in the paper.
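As a rough illustration of the check suggested in point 5, the association between a categorical SDOH metric and ethnicity could be summarized with a chi-square statistic and Cramér's V computed from a contingency table. This is a minimal sketch, not the authors' method: the helper `cramers_v`, the group labels, and the counts are all placeholders invented for the example, and a p-value would additionally require a chi-square distribution function from a statistics library.

```python
import numpy as np


def cramers_v(table):
    """Return (chi-square statistic, Cramér's V) for a contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence of rows and columns.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    scale = min(table.shape) - 1
    v = np.sqrt(chi2 / (n * scale))
    return chi2, v


# Placeholder counts (NOT study data): rows are hypothetical groups A, B, C;
# columns are hypothetical low / middle / high income categories.
counts = np.array([
    [120, 200, 180],
    [150, 130, 70],
    [90, 110, 100],
])

chi2, v = cramers_v(counts)
print(f"chi-square = {chi2:.2f}, Cramér's V = {v:.3f}")
```

Cramér's V ranges from 0 (no association) to 1 (perfect association). A non-trivial value would suggest that ethnicity and the SDOH metric carry overlapping information, in which case the ethnicity effect reported by the model may partly reflect socioeconomic differences.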
I hope that addressing these points will help strengthen the paper and streamline it while also making the analysis and the outcomes clinically and statistically significant.