Abstract
Background Traditional clinical assessments often lack individualization, relying on standardized procedures that may not accommodate the diverse needs of patients, especially in early stages where personalized diagnosis could offer significant benefits. We aim to provide a machine-learning framework that addresses the individualized feature addition problem and enhances diagnostic accuracy for clinical assessments.
Methods Individualized Clinical Assessment Recommendation System (iCARE) employs locally weighted logistic regression and Shapley Additive Explanations (SHAP) value analysis to tailor feature selection to individual patient characteristics. Evaluations were conducted on synthetic and real-world datasets, including early-stage diabetes risk prediction and heart failure clinical records from the UCI Machine Learning Repository. We compared the performance of iCARE with a Global approach using statistical analysis on accuracy and area under the ROC curve (AUC) to select the best additional features.
Findings The iCARE framework enhances predictive accuracy and AUC metrics when additional features exhibit distinct predictive capabilities, as evidenced by synthetic datasets 1-3 and the early diabetes dataset. Specifically, in synthetic dataset 1, iCARE achieved an accuracy of 0·999 and an AUC of 1·000, outperforming the Global approach with an accuracy of 0·689 and an AUC of 0·639. In the early diabetes dataset, iCARE shows improvements of 1·5-3·5% in accuracy and AUC across different numbers of initial features. Conversely, in synthetic datasets 4-5 and the heart failure dataset, where features lack discernible predictive distinctions, iCARE shows no significant advantage over global approaches on accuracy and AUC metrics.
Interpretation iCARE provides personalized feature recommendations that enhance diagnostic accuracy in scenarios where individualized approaches are critical, improving the precision and effectiveness of medical diagnoses.
Funding This work was supported by startup funding from the Department of Psychology at the University of Kansas provided to A.A., and the R01MH125740 award from NIH partially supported J.M.G.’s work.
1. Background
Clinical assessment is the ongoing process of gathering information about a patient and constructing an increasingly comprehensive conceptualization of their health and needs (e.g., for diagnosis, prognosis, or treatment planning). A critical task in clinical assessment is selecting the next piece of information to collect about the patient to maximize information gain. Given the unique nature of each patient’s condition, it is essential to recognize that there are often no one-size-fits-all solutions. This need for personalization is especially high when symptom presentation and treatment effectiveness are heterogeneous across individuals; examples include oncology, psychiatry, and the treatment of chronic diseases such as diabetes, cardiovascular disease, and neurodegenerative disorders.1–3 We can also find an example from the study of dementia where the informativeness of APOE ε4 as one of the best predictors of dementia varies by race.4–6 Although useful, achieving personalization in clinical practice is challenging. Personalization requires massive data, raising privacy concerns and the potential misuse of sensitive information.7 In this paper, we will discuss how the framework of individualized feature selection from machine learning (ML) can be used to efficiently guide the task of personalization in clinical assessment.
Feature selection is the process of identifying and prioritizing the most relevant and informative input variables (i.e., features) that will optimize model performance, interpretability, and generalization while minimizing model complexity and overfitting.8 Overfitting occurs when a model becomes too complex, capturing noise in addition to the signal, which causes it to fail to generalize to unseen data (e.g., novel patients or new observations of known patients). It is important to reduce overfitting so that the model performs well in real-world scenarios.9 This is usually achieved by using popular techniques like sequential forward selection (SFS) or backward elimination, which iteratively add or remove features to see their effect on model performance.10–13 However, these traditional techniques lack individualization, resulting in every patient being given the same recommendation. Personalized feature selection, on the other hand, places patients at the center of the decision-making process, taking into account each individual patient’s unique characteristics and recognizing that different patients may need different thresholds for diagnosis.14 This problem definition aligns with the aim of personalized clinical assessment recommendations, where the goal is to tailor the choice of the next test based on the unique characteristics of each patient.
Recent studies in individualized feature selection have begun to address this gap by developing methods that personalize the selection of features based on individual patient data. For instance, a study on wearable electroencephalogram (EEG) monitoring platforms uses linear discriminant analysis (LDA) and the least absolute shrinkage and selection operator (LASSO) method to select discriminative features tailored to each subject’s seizure patterns.15 However, this approach does not focus on dynamic and iterative feature addition and is highly specific to EEG data. Additionally, an unsupervised personalized feature selection framework tailors feature selection to each instance in high-dimensional data.31 However, our objective is to tackle a supervised individualized feature selection problem. Additionally, a framework employing fixed prediction models, local feature explainers, and ensembles of imputed samples provides flexible risk estimation for samples with missing features.16 This framework relies heavily on imputations and uses a single fixed prediction model. On the other hand, we want to create a framework that provides an individualized model directly without the need for imputations or a singular prediction model.
We propose a general framework that recommends which features to obtain next for each patient, promoting a more accurate diagnosis through personalization. Taking inspiration from locally weighted learning, iCARE leverages patient-specific data to tailor the selection of clinical assessments for individualized healthcare recommendations used in diagnosis.17–19 Our approach utilizes a locally weighted model tailored to each patient, which was analyzed using a feature explainer, to dynamically adapt feature selection strategies based on each patient’s unique characteristics. The iCARE framework relies on three main components: (1) a sample weight calculation module, (2) an ML model trained on weighted samples, and (3) a feature explainer for the generated models. We analyzed the framework using synthetic datasets to show its personalization capability and also compared it with a traditional approach on both synthetic and real-world datasets. We hypothesized that our framework would provide more accurate diagnoses than the traditional approaches.
2. Methods
2.1. Framework Architecture
Figure 1 provides an overview of the architecture of our iCARE framework. The architecture consists of an input processing module identifying missing features of incoming patients. A similarity calculation module is then used to calculate similarity scores between incoming patients and patients in the pool of known cases. This pool of known cases comprises labeled data, which includes values for predictive features such as age, sex, and test results, along with an outcome label indicating whether the individual is sick or not sick. It can be created from any data source representing known past cases with relevant features. Using these weights, a weighted logistic regression is trained using the pool of known cases. The weights assigned to each sample reflect its relevance to the novel patient’s profile, allowing for personalized model training. The trained model is then analyzed using Shapley Additive Explanations (SHAP) to quantify the importance of individual features in the locally trained logistic regression model.20 SHAP values are based on cooperative game theory in which a prediction is broken down to show how each feature influenced the outcome of a model. Finally, the feature recommendation module will take the explanations and produce a recommendation. It evaluates whether the feature is present in the patient’s initial feature set. If any significant feature is missing, the framework recommends its inclusion to further enhance predictive accuracy.
2.2. Experimental Design
Figure 2 provides an overview of the experiment to compare the performance of the iCARE recommendation against a global feature recommendation (i.e., Global) strategy. Initially, we define a set of initial features using the least important feature. The dataset was split into a pool of known cases and test cases. With the procedure applied before, the test cases will have only the initial features, simulating conditions where patients don’t have all the informative features. From here, we generated a global recommendation and an individualized recommendation. The global recommendation is done by training a logistic regression on the pool of known cases and analyzing it with SHAP. The feature with the highest SHAP value is selected for recommendation. On the other hand, the individualized recommendation uses the iCARE framework. We then evaluate the recommendation and repeat this process 100 times.
To evaluate the recommendations, we append the pool of known cases and a single case with the recommended feature value from the initial dataset. We then train a logistic regression using the pool of known cases and predict the outcome for the single case using the model. In addition to this, we also define the locally weighted (LW) procedure, which just uses a weighted logistic regression on this step instead of a regular logistic regression. We repeat this process until all test cases receive the predicted outcome. We then collect this prediction and calculate the accuracy and AUC (Area Under the Receiver Operating Characteristic Curve) metrics. These metrics were then averaged over 100 iterations.
This experiment was repeated using a different number of initial features. We selected the least informative feature as it represents a realistic scenario where incoming patients will more likely have less informative features. This iterative approach allowed us to assess the impact of the model performance across the various frameworks on different initial available features.
2.3. Dataset
We evaluate our framework with both synthetic and real-world datasets. The synthetic datasets were created to simulate ideal and non-ideal scenarios. The real-world datasets utilized in this study were obtained from the UCI Machine Learning Repository, specifically the early-stage diabetes risk prediction, heart failure clinical records, and heart disease dataset.21–23 We provide the code to generate the synthetic dataset, as well as the details on preprocessing steps for real-world datasets in the supplementary materials.
2.4. Statistical Analysis
We performed t-tests (⍺=0.05) on the accuracy and AUC to assess the statistical significance of the performance differences between the four frameworks. To account for familywise error and reduce the risk of Type I errors, we applied Holm-adjusted p-values to the results of these multiple comparisons. Using Holm-adjusted p-values provides a more conservative and reliable measure of statistical significance compared to the standard p-values obtained from the t-test.
3. Findings and Interpretation
3.1. Reasoning Process of the Framework
The iCARE framework is grounded in the principle of localized learning and feature importance analysis to generate personalized clinical recommendations. A locally weighted logistic regression model trained using weighted patient samples from the repository of known cases focuses on learning similar patients. Due to this, iCARE will excel in scenarios where patients with similar profiles benefit from similar recommendations. Given an incoming patient with available features and a selection of potential features to be recommended, iCARE will be able to recommend the best feature given that the available features are informative of the predictiveness of the added features. For example, if in the dataset, groups of people aged below 50 benefit from additional feature A, and those above 50 benefit from additional feature B, iCARE will be able to capture this information from age (i.e., available feature) and recommend the appropriate feature (i.e., feature A or B) to an incoming patient that will give the best information gain.
We created synthetic datasets 1-3 to simulate ideal scenarios and confirm our hypothesis on the reasoning process of iCARE. Synthetic dataset 1 represents the most ideal scenario, characterized by two additional features exhibiting predictive power over different regions of the initial features value space, as shown in Figure 3. Conversely, synthetic dataset 2 illuminates the necessity for sample-weighted inference (as indicated by LW) when confronted with non-linear predictive regions highlighted in Figure 424,25. Furthermore, synthetic dataset 3 serves as a testament to the robustness of our framework, particularly in scenarios involving overlapping regions on the initial features value space that can be seen in Figure 5.
We created synthetic Datasets 4-5 to simulate hypothetical non-ideal scenarios. Synthetic Dataset 4, depicted in Figure 6, simulates a non-ideal scenario where both additional features are equally useful (i.e., the available feature does not give information about the predictiveness of the additional features). Notably, both the left and right graphs showcase identical predictive regions. This visualization emphasizes scenarios where both features share the same predictive power in the same region. In synthetic dataset 5, represented in Figure 7, we created a scenario where only one additional feature out of the rest is useful. This visualization emphasizes scenarios where one feature dominates others regarding predictive strength. The iCARE framework is expected to have similar performance to a global feature selection, highlighting no added benefit from personalization.
3.2. Performance on Synthetic Dataset
In Figure 8, we provide the comparison between the different approaches on the synthetic datasets 1-3. In synthetic dataset 1, where two additional features exhibit predictive power over distinct regions, the iCARE frameworks are expected to perform significantly better than the Global frameworks. As expected, we obtain statistically significant (⍺=0.05) differences in iCARE versus Global metrics and iCARE+LW versus Global+LW metrics using t-test, with the iCARE performing better than its Global counterpart, confirming the framework’s capability to provide the best recommendation when additional features’ predictive capabilities are clearly distinguishable given the initial feature values. Similarly, in synthetic dataset 2, characterized by non-linear predictive regions, the iCARE frameworks, especially when incorporating locally weighted inference (LW), are expected to outperform their non-LW counterparts. Statistical significance (⍺=0.05) across all comparisons can be found on our t-test, notably for iCARE versus iCARE+LW and Global versus Global+LW metrics, unseen in synthetic datasets 1 and 3. Furthermore, in synthetic dataset 3, featuring overlapping regions with identical predictive power for both features, both iCARE frameworks are expected to perform slightly better than the Global framework. The actual results align with this expectation, demonstrating the framework’s ability to make accurate recommendations even in cases where features exhibit similar predictive capabilities. Similar to synthetic dataset 1, statistical significance (⍺=0.05) can be observed on our t-test when comparing iCARE with the Global framework. These results confirm the hypothesis of our framework’s ability to give the best recommendation in cases where the additional features’ predictive capabilities can be clearly distinguished, given the initial feature values.
In Figure 9, we provide the comparison between the different approaches on the synthetic datasets 4-5. For synthetic dataset 4, characterized by features sharing the same predictive power in the space of the initial feature value, we expected little to no difference when comparing iCARE versus Global frameworks. The actual outcome confirms this expectation, as both iCARE and Global frameworks exhibit similar performance. Similarly, for synthetic dataset 5, where there is only one useful feature, we expected a similar outcome to synthetic dataset 4. As predicted, the actual results show little variation between iCARE and Global frameworks. We observed some variances in performance; however, this can primarily be attributed to the use of locally weighted inference (i.e., LW) rather than inherent differences in the iCARE framework itself.
Furthermore, synthetic datasets 4 and 5 revealed no statistical significance for comparisons between iCARE and iCARE+LW versus Global and Global+LW metrics, which aligned with our hypothesized outcomes. The complete result of the statistical test can be seen in Table 1. These findings further confirm the hypothesis of our framework’s ability to give the best recommendation in cases where the additional features’ predictive capabilities can be clearly distinguished, given the initial feature values.
3.3. Performance on Real-World Dataset
In extending our evaluation to real-world scenarios, we scrutinize the performance of our framework on datasets representative of clinical contexts. Specifically, we assess its effectiveness in predicting outcomes in early diabetes and heart failure datasets, leveraging a range of personalized recommendations of features to enhance predictive accuracy and AUC metrics. In the experiment on the early diabetes dataset using three initial features, we observe that personalization leads to increased Accuracy and AUC, as seen in Figure 10. The superiority of iCARE models is shown to be statistically significant, as shown in Table 2. The three initial features that were used in this experiment are age, gender, and obesity status. Using a global approach, the feature that is recommended the majority of the time is polydipsia (i.e., excessive thirst; 75/100 iterations). It suggests that, on average, polydipsia might be more informative across the entire population when combined with age, gender, and obesity status. However, when using iCARE, two features are recommended: Polyuria (Frequent Urination) and Polydipsia. On average, Polyuria is recommended for 68% of patients, and Polydipsia is recommended for 32% of patients. The prominence of the Polyuria recommendation suggests that Polyuria might provide more relevant or discriminative information for certain patients. Polydipsia and Polyuria are both classic symptoms of diabetes.26,27 The framework’s recommendation pattern suggests variability in symptom presentation and importance among different patients. The higher recommendation of Polyuria suggests that for many patients, this symptom may be an earlier or more pronounced indicator of diabetes than Polydipsia.
In contrast, the iCARE framework does not yield substantial benefits on the heart failure dataset, as shown in Figure 11. We observed overlapping error bars in both accuracy and AUC metrics across different feature spaces in this dataset. In some instances (e.g., accuracy for the number of features = 4), iCARE models even underperform compared to their Global counterparts, highlighting the limitations of the approach in specific contexts. The statistical test in Table 3 shows that this difference in performance is statistically significant. This finding highlights a similar outcome to synthetic dataset 4, where it shows no added benefit when the additional features to be recommended have no distinct predictive capabilities, as well as synthetic dataset 5, where only one additional feature is useful as seen in synthetic dataset 5.
3.4. Comparison with Other Frameworks
To further evaluate the effectiveness of iCARE, we compared its performance in feature selection against the imputation-based explanation-guided (Eguided) feature selection method.16 As shown in Figure 12, iCARE achieved a +6% higher F1 score on average in the Early Diabetes Dataset and +10.6% higher F1 score on average in the Heart Disease Dataset over Eguided. The Heart Failure dataset highlighted before showed an example where personalized feature selection is not needed. The result showed consistency to our previous results where both iCARE and Eguided failed to perform better than a global feature selection. However, iCARE performed on average +2.2% higher in F1 score than Eguided on this dataset. These results suggest that iCARE generally provides superior performance on these datasets compared to Eguided, though a few important considerations must be noted. First, the Eguided framework was originally evaluated on a much larger dataset, comprising 100,000 samples and 252 features, while our datasets are considerably smaller. Additionally, in the original Eguided study, XGBoost was employed as the prediction model, whereas we used logistic regression for both, given that iCARE relies on logistic regression. This difference in model selection may influence the relative performance of Eguided, as XGBoost could provide additional performance benefits in larger or more complex datasets. Overall, these findings reinforce the potential of iCARE as a robust feature selection framework for personalized and dynamic feature recommendations, particularly in clinical datasets of smaller scale.
4. Discussion
4.1. Importance of Sample Weighing
Sample weighing was utilized in the sample calculation model to create a weighted logistic regression model. This weighted model emphasizes patients with characteristics similar to those of incoming patients. The weighing strategy allows SHAP to be locally sensitive to the context of the current incoming patient, enabling feature importance rankings to be customized to the individual context rather than global trends. Sample weights have previously been used to address various challenges. For instance, a recent study proposed a weighted undersampling scheme for Support Vector Machines (SVM) to improve classification performance in dealing with imbalanced data sets.28 This method assigned different weights to the majority of samples based on their distance to the hyperplane, akin to how iCARE assigns weights based on patient similarity. Another research study focused on personalized diagnosis for Alzheimer’s Disease, utilizing subject-specific classifiers iteratively refined through reweighting of training data.29 Although not aimed at addressing the feature recommendation problem, the rationale for employing sample weighting remains relevant, as it serves to prioritize key subjects. Overall, incorporating sample weights in iCARE enables personalized feature rankings that can navigate diverse patient populations and complex clinical scenarios.
4.2. SHAP as Feature Importance Measure
Within the iCARE framework, SHAP values play a pivotal role in selecting the most important features for personalized feature addition. We use SHAP to quantify the importance of individual features within the locally trained logistic regression model. By assigning importance values to each feature for a specific prediction, SHAP facilitates understanding the factors influencing the model’s output. In the context of iCARE, SHAP integration with a weighted classifier presents a novel approach to personalized feature recommendation. This combination allows for the prioritization of features based on their impact on the current patient’s prediction. While SHAP has been previously employed to measure feature importance, its integration within the framework of a weighted classifier for personalized recommendation distinguishes iCARE as a novel and impactful approach to healthcare decision support systems.30
4.3. Preliminary Examination of Dataset
As shown in the experiment, not every dataset requires personalization. The heart failure dataset in our experiment does not benefit from our iCARE framework. We used two procedures to determine whether a dataset is suitable for personalization. The fastest dataset analysis method is to use SHAP value analysis on a ceiling model. If the analysis reveals multiple important features that contribute to the model’s predictions, it suggests that the dataset may benefit from personalization. While the presence of multiple important features increases the likelihood of benefiting from personalization, it does not guarantee it. Another approach involves leveraging a pool of known cases to cross-validate the performance of the personalized model, similar to how we test our framework. While this method is slower compared to SHAP value analysis, it directly assesses the performance of personalized models through statistical testing on performance metrics accuracy and AUC values. This method confirms whether personalization is beneficial and allows us to predict how much performance gain can be expected from personalization.
4.4. Limitations and Future Directions
The iCARE framework has several limitations. First, it currently lacks a mechanism to determine whether a dataset warrants personalized feature recommendation automatically. This reliance on a naive dataset evaluation approach necessitates multiple experimental iterations, which may not be feasible in all scenarios. Future work could focus on developing robust criteria or indicators to assess the need for personalization more efficiently. Second, iCARE involves training a locally weighted model for every incoming patient, which may not be suitable for machine learning models requiring extensive training time or scenarios necessitating numerous rapid inferences. iCARE also requires a pool of known cases with a complete set of features to provide individualized feature recommendations. Lastly, iCARE assumes that the initial features available are informative of the predictive space for potential additional features, an assumption that may not always hold true. Future research should explore methods to comprehensively assess the informativeness of initial features to enhance the framework’s effectiveness. Addressing these limitations and advancing research in these directions could further enhance the capabilities and applicability of the iCARE framework, ultimately contributing to improved personalized clinical assessments and decision-making in healthcare settings.
5. Conclusion
The iCARE system addresses the challenge of personalized feature selection in clinical assessments by dynamically tailoring the selection of clinical tests based on each patient’s unique characteristics. The framework excels over a global feature selection framework in predictive accuracy, especially in cases where the initial features are informative of the predictiveness of the added features. Although personalization might not be needed in all cases, iCARE provides a flexible framework that can be applied using other machine learning algorithms. We believe that with further testing, this general framework can be applied in various fields, extending its utility beyond clinical assessments.
Declaration of generative AI use
During the preparation of this work the author used ChatGPT in order to improve the writing for clarity and proofreading. After using this tool the authors reviewed and edited the content as needed and take full responsibility for the content in the publication.
Data Availability
All synthetic data generation procedure are contained in the supplementary document. The real-world datasets are available online at https://doi.org/10.24432/C5VG8H, https://doi.org/10.24432/C5Z89R, and https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset.
https://doi.org/10.24432/C5VG8H
https://doi.org/10.24432/C5Z89R
https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
Footnotes
Section 3: Findings and Interpretation has a new result, Comparisons with Other Frameworks, which compares the feature selection of iCARE with an existing framework using the F1 metric.