ABSTRACT
Background Maximal oxygen uptake (VO2 max), an indicator of cardiorespiratory fitness (CRF), requires exercise testing and, as a result, is rarely ascertained in large-scale population-based studies. Non-exercise algorithms are cost-effective methods to estimate VO2 max, but the existing models have limitations in generalizability and predictive power. This study aims to improve the non-exercise algorithms using machine learning (ML) methods and data from U.S. national population surveys.
Methods We used the 1999-2004 data from the National Health and Nutrition Examination Survey (NHANES), in which a submaximal exercise test produced an estimate of the VO2max. We applied multiple supervised ML algorithms to build two models: a parsimonious model that used variables readily available in clinical practice, and an extended model that additionally included more complex variables from Dual-Energy X-ray Absorptiometry (DEXA) and standard laboratory tests. We used Shapley additive explanation (SHAP) to interpret the new model and identify the key predictors. For comparison, existing non-exercise algorithms were applied unmodified to the testing set.
Results Among the 5,668 NHANES participants included in the final study population, the mean age was 32.5 years and 49.9% were women. Light Gradient Boosting Machine (LightGBM) had the best performance among the supervised ML algorithms evaluated. Compared with the best existing non-exercise algorithm that could be applied in NHANES, the parsimonious LightGBM model (RMSE: 8.51 ml/kg/min [95% CI: 7.73-9.33]) and the extended model (RMSE: 8.26 ml/kg/min [95% CI: 7.44-9.09]) significantly reduced the error by 12% and 15%, respectively (P<0.01 for both).
Conclusion Our non-exercise ML model provides a more accurate prediction of VO2 max for NHANES participants than existing non-exercise algorithms.
What is Known
Although cardiorespiratory fitness is recognized as an important marker of cardiovascular health, it is not routinely measured because of the time and resources required to perform exercise tests.
Non-exercise algorithms are cost-effective alternatives to estimate cardiorespiratory fitness, but the existing models are restricted in generalizability and predictive power.
What the Study Adds
We improve non-exercise algorithms for cardiorespiratory fitness prediction using advanced ML methods and a more comprehensive and representative data source from U.S. national population surveys.
More health factors that are associated with cardiorespiratory fitness are newly identified.
Nationally representative estimates for cardiorespiratory fitness in the U.S. over the recent 20 years are generated.
INTRODUCTION
Cardiorespiratory fitness (CRF) refers to the integrated capacity of the circulatory and respiratory systems to transport oxygen from the atmosphere to the mitochondria to perform physical work.1,2 Low cardiorespiratory fitness is an important risk factor for cardiovascular disease, all-cause mortality, and mortality rates attributable to various cancers, especially of the breast and colon/digestive tract.2-9 Improvements in CRF are associated with reduced mortality risk.10
The gold standard measurement of CRF is a direct measurement of the maximal oxygen uptake (VO2max) using cardiopulmonary exercise testing (CPX), which combines conventional maximal exercise testing with ventilatory expired gas analysis.11-14 When the instrumentation and trained personnel to perform CPX are not available, a widely accepted alternative is to estimate the VO2max indirectly from the heart rate response to submaximal exercise.11,15 Although CRF is recognized as an important marker of cardiovascular health, population health surveys and cohort studies do not routinely measure CRF because of the time and resources required to perform either of these exercise tests.
As an alternative, many investigators have developed non-exercise-based algorithms to estimate CRF without performing a maximal or submaximal test. This approach uses information typically available in healthcare settings to provide an inexpensive and rapid way of estimating CRF.8,9,16-28 While these non-exercise algorithms are useful, they have several limitations. First, most of these algorithms were developed based on small population cohorts. The performance of these algorithms is poor when applied to other groups. Second, these algorithms were typically developed with only a limited number of health indicators available as candidate variables for prediction and could not consider a wider range of other factors that may be associated with CRF. Third, most of the previous non-exercise algorithms used linear models, and do not account for nonlinearities and interactions among the predictors.29 Addressing these limitations may improve the performance of the non-exercise algorithms and provide more accurate estimates of VO2max for research and clinical practice.
One promising approach for addressing the last limitation is machine learning (ML). Over the past decade, numerous ML applications have appeared in the latest clinical literature, especially for outcome prediction models.30 Given that the performance and utility of prediction models depend on the data source and methods used, a large and growing body of studies has sought to boost the existing clinical prediction models with the evolving ML approaches applied to a larger data set of a broader range of parameters.31,32
Accordingly, we sought to improve the non-exercise algorithms for CRF prediction with ML methods and a more comprehensive and representative data source from U.S. national population surveys. Using data from the National Health and Nutrition Examination Survey (NHANES) and state-of-the-art ML algorithms, we aimed to overcome the limitations of the existing non-exercise algorithms and develop an approach to predict VO2max more accurately without the need for an exercise test.
METHODS
This cross-sectional study received an exemption for review from the Institutional Review Board at Yale University because NHANES data are publicly available and de-identified. The study was reported following the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines.33
Data source
Data from NHANES were used for model development, validation, and prediction in this study. The NHANES is a series of cross-sectional, weighted, multi-stage sampled surveys that provide nationally representative estimates of the non-institutionalized US population. Survey participants received in-home interviews, followed by standardized physical examinations conducted in mobile examination centers, and laboratory tests using blood and urine specimens provided by participants during the physical examination. Since 1999, NHANES has become a continuous program conducted in 2-year cycles. However, due to the coronavirus disease 2019 (COVID-19) pandemic, data collection for the latest NHANES 2019-2020 cycle was suspended in March 2020 and the data collected from 2019 to March 2020 were combined with data from the NHANES 2017-2018 cycle to form a nationally representative sample of NHANES 2017-March 2020 pre-pandemic data.34
Study population
We used data from 3 cycles conducted from 1999-2000 through 2003-2004 for model development and validation because CRF was measured only for that period in NHANES. Participants for CRF measurements were selected based on age (age eligibility for CRF measurement in NHANES: 12-49 years), medical conditions, medications, physical limitations, heart rate, and blood pressure. The screening was done before the treadmill test using questions in the household interview, questions administered by the physician in the NHANES Mobile Examination Center (MEC), and measurements of heart rate and blood pressure. The list of exclusion criteria used for the CRF component in NHANES is provided in the supplemental material (Supplemental Table S1). We further limited our study population to participants aged 16-49 years because of the age restriction in the NHANES physical activity data section.
Study outcome
Our primary outcome of interest was VO2max. NHANES measures VO2max (ml/kg/min) with a submaximal exercise test. Based on gender, age, body mass index, and self-reported level of physical activity, participants were assigned to one of eight treadmill test protocols. Heart rate was monitored continuously using an automated monitor with four electrodes connected to the thorax and abdomen of the participant and was recorded at the end of the warm-up, each exercise stage, and each minute of recovery. VO2max was then estimated by extrapolation using measured heart rate responses to known levels of exercise workloads in the CRF exam, assuming the relation between heart rate and oxygen consumption is linear during exercise.11 Detailed descriptions of the protocol are provided in the NHANES Cardiovascular Fitness Procedure Manual.35
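As a rough illustration of the extrapolation step, the sketch below fits a straight line through two (heart rate, oxygen consumption) points from submaximal stages and extends it to an assumed maximal heart rate. The age-predicted maximum of 220 minus age, the stage values, and the function name are illustrative assumptions; the actual NHANES computation is specified in the procedure manual and may differ.

```python
def extrapolate_vo2max(hr_stage1, vo2_stage1, hr_stage2, vo2_stage2, age):
    """Linearly extrapolate VO2 (ml/kg/min) to an assumed maximal heart rate."""
    if hr_stage2 == hr_stage1:
        raise ValueError("stage heart rates must differ to define a slope")
    # Slope of the assumed linear heart rate / oxygen consumption relation.
    slope = (vo2_stage2 - vo2_stage1) / (hr_stage2 - hr_stage1)
    # Assumption of this sketch: age-predicted maximal heart rate (beats/min).
    hr_max = 220 - age
    # Extend the line from the second stage up to the assumed maximum.
    return vo2_stage2 + slope * (hr_max - hr_stage2)


# Hypothetical 30-year-old participant with two treadmill stages (made-up values).
vo2max_example = extrapolate_vo2max(120, 20.0, 150, 30.0, age=30)
```

Because the estimate depends entirely on the linearity assumption and on the assumed maximal heart rate, any departure from either propagates directly into the extrapolated value.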
Candidate variables
To develop a data-driven model, we considered a variety of predictors available in NHANES, including variables from the home interview, physical examination, and laboratory data sections. Variables used in this study were selected based on potential associations with overall health status, cardiovascular health, the respiratory system, and metabolic systems, according to a systematic review.2 The selected variables were then examined for continuity, that is, whether they were consistently available, or could at least be estimated from other consistently available variables, throughout the specified NHANES data collection period (1999-March 2020). This was important for the model’s generalizability and sustainability in NHANES. The exceptions were daily alcohol intake (not available in 2019-March 2020) and Dual-Energy X-ray Absorptiometry (DEXA) data (not available in 2007-2010 and 2019-March 2020). DEXA data in NHANES contain information about body composition, such as the fat percentage of different body parts, which is important to the prediction of CRF.4,16,24,25,28 Finally, a total of 49 variables from different data sections were selected for the final feature set (Supplemental Table S4). We applied a log-transformation to the reported weekly moderate-intensity physical activity time to account for its skewed distribution. All 49 selected variables in the final study population had a missing rate of less than 10%, except for albuminuria (10.4%). Detailed methods for addressing missing data are described in the Statistical Analysis section.
Existing non-exercise algorithms
We exhaustively searched for non-exercise algorithms that were developed in previous studies16,18,23-25,28 or presented in systematic reviews2,36 as baseline models for both validation and comparison in this study. The existing algorithms were selected based on the following two criteria: (1) The non-exercise algorithm targeted a relatively large and general population in the U.S. (2) The non-exercise algorithms used the variables that are available or can be approximated in NHANES, and the algorithm itself can be applied in the context of NHANES. These selected non-exercise algorithms were applied and evaluated as they were, i.e., using their original parameters or coefficients. Detailed information on the final algorithms considered is reported in Table 1.
Machine learning algorithms
We used multiple types of supervised machine learning algorithms for the modelling of VO2max using the candidate variables. The algorithms were K-Nearest Neighbors (KNN),37 Least Absolute Shrinkage and Selection Operator (LASSO),38 Support Vector Regression (SVR),39 Random Forest (RF),40 and Gradient Boosting decision tree41 (GBDT) family which included Extreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM).42,43 The detailed descriptions of these algorithms are listed in Supplemental Table S3.
Model development
We built two types of models, the parsimonious model, and the extended model, according to the range of selected feature sets. The parsimonious model was trained on only variables that are easily available in normal clinical settings, mostly from the demographics, interviews, and regular physical examination. The extended model was fitted using all the selected candidate variables, including standard lab test results and DEXA data. We developed two types of models because the parsimonious model may be more applicable in common practice, while the extended model may have better predictive power and could shed light on the unexplored associations between CRF and other potential factors.
We split the final study population (N=5668) into a training set (80%) and a validation set (20%). Given the continuous outcome VO2max, each ML model was trained for a regression task using root mean square error (RMSE) as the loss function. Optimal hyperparameters (parameters whose values are used to control the learning process) were tuned using a random search of 100 iterations with 5-fold cross-validation on the training set. Specifically, for each set of hyperparameters chosen at random, the training set was first randomly partitioned into 5 equal subsets (folds), and each fold was left out in turn as the hold-out fold. For each hold-out fold, we trained the ML algorithm with the specific set of hyperparameters on the remaining four folds and assessed its performance on the hold-out fold. The optimal set of hyperparameters was the one with the minimum RMSE averaged across the 5 hold-out folds. For each ML algorithm, a final model was obtained by retraining the algorithm on the entire training set with the optimal hyperparameters, and the performance of this final model was assessed on the validation set.
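The tuning procedure can be sketched in a few lines. The example below is a minimal, standard-library-only illustration on synthetic data: it uses a one-feature distance-weighted nearest-neighbour regressor in place of the algorithms named in this study, and the hyperparameter ranges, data-generating function, and all names are assumptions made for illustration.

```python
import math
import random


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def knn_predict(train, x, k, power):
    """Distance-weighted k-nearest-neighbour regression on one feature."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    weights = [1.0 / (abs(xi - x) + 1e-9) ** power for xi, _ in nearest]
    return sum(w * yi for w, (_, yi) in zip(weights, nearest)) / sum(weights)


def cv_rmse(data, k, power, n_folds=5):
    """Average hold-out RMSE over the folds for one hyperparameter set."""
    folds = [data[i::n_folds] for i in range(n_folds)]  # data is pre-shuffled
    scores = []
    for i, holdout in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        preds = [knn_predict(train, x, k, power) for x, _ in holdout]
        scores.append(rmse([y for _, y in holdout], preds))
    return sum(scores) / n_folds


rng = random.Random(0)
# Synthetic data: one predictor and a noisy nonlinear outcome (not NHANES data).
data = [(x, 40 - 0.3 * x + 5 * math.sin(x / 5) + rng.gauss(0, 2))
        for x in (rng.uniform(0, 50) for _ in range(200))]
rng.shuffle(data)
split = int(0.8 * len(data))  # 80% training, 20% validation
train_set, validation_set = data[:split], data[split:]

# Random search: 100 randomly drawn hyperparameter sets, each scored by 5-fold CV.
candidates = [(rng.randint(1, 30), rng.uniform(0.0, 2.0)) for _ in range(100)]
best_k, best_power = min(candidates, key=lambda c: cv_rmse(train_set, c[0], c[1]))

# Final model: use the whole training set with the selected hyperparameters
# and assess it once on the held-out validation set.
val_pred = [knn_predict(train_set, x, best_k, best_power) for x, _ in validation_set]
validation_rmse = rmse([y for _, y in validation_set], val_pred)
```

The validation set plays no role in choosing the hyperparameters, so the final RMSE is not biased by the search itself.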
Model evaluation and model application
We evaluated each final model’s performance based on the Root-Mean-Square Error (RMSE) and the coefficient of determination (R2), with 95% confidence intervals determined using bootstrapping with 1000 replicates from the validation set. Then we applied our optimal ML prediction models to NHANES cycles from 2005 onward, in which CRF was not measured. Specifically, we used the parsimonious model to predict VO2max for each eligible NHANES subject (Supplemental Table S2) from 2005 through 2020 and applied our extended model to eligible participants from 2005-2006 and 2011-2018 (DEXA data were not available for 2007-2010 and 2019-March 2020). Finally, we generated the corresponding predicted nationally representative estimates for the different periods and compared them to the exercise test estimates from 1999-2004.
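A bootstrap confidence interval for RMSE can be sketched as follows. The percentile method, the synthetic observed and predicted values, and the function names are assumptions of this illustration; the text above does not state which bootstrap interval was used.

```python
import math
import random


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def bootstrap_rmse_ci(y_true, y_pred, n_replicates=1000, seed=0):
    """Percentile 95% CI for RMSE from resampling (observed, predicted) pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        stats.append(rmse([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[n_replicates * 25 // 1000]       # 2.5th percentile
    upper = stats[n_replicates * 975 // 1000 - 1]  # 97.5th percentile
    return lower, upper


# Synthetic validation data (not NHANES): outcomes and noisy predictions.
rng = random.Random(1)
observed = [rng.gauss(40, 9) for _ in range(300)]
predicted = [y + rng.gauss(0, 8) for y in observed]

point_estimate = rmse(observed, predicted)
ci_low, ci_high = bootstrap_rmse_ci(observed, predicted)
```

Resampling observed and predicted values as pairs keeps each participant's prediction error intact within every replicate.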
Feature importance
We assessed feature importance using SHAP44,45 (Shapley Additive Explanation) values to identify the relative contribution of a feature to the final prediction. SHAP values are based on Shapley values, which are derived from coalitional game theory and attempt to fairly distribute the prediction among all the features.46 SHAP generalizes the Shapley values to predictive modeling and quantifies the contribution that each feature brings to the prediction made by the model.
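To make the idea of fairly distributing a prediction concrete, the sketch below computes exact Shapley values for a three-feature toy model by averaging each feature's marginal contribution over all feature orderings. The toy model, the baseline values, and the convention of replacing absent features with baseline values are assumptions of this illustration; tree-based SHAP implementations use a different, faster algorithm.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance, averaged over feature orderings.

    Features absent from a coalition are set to their baseline value,
    which is a simplifying assumption of this sketch.
    """
    names = list(instance)
    contributions = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # add this feature to the coalition
            value = predict(current)
            contributions[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in contributions.items()}


def toy_model(f):
    """Hypothetical VO2max model with an interaction term (not the fitted model)."""
    return 50.0 - 0.2 * f["age"] - 0.1 * f["pulse"] + 0.02 * f["activity"] * (60 - f["age"])


baseline = {"age": 32.0, "pulse": 70.0, "activity": 0.0}
person = {"age": 45.0, "pulse": 80.0, "activity": 10.0}
phi = shapley_values(toy_model, person, baseline)
```

The values in `phi` sum to the difference between the prediction for `person` and the prediction at `baseline`, which is the property that lets per-feature contributions be read as shares of one prediction.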
Statistical Analysis
We assigned equal weight to each NHANES participant in our final study population when developing the models; thus, we did not consider the sampling structure of the survey data in the phase of model development. To deal with missingness, we adopted two strategies according to the algorithms we used. For LASSO, KNN, SVR, and RF, we imputed the missing data using multivariate imputation by chained equations.47 For XGBoost and LightGBM in the GBDT family, we took advantage of their built-in functionality, which ignores missing values when evaluating a split as a decision tree grows and then allocates them to whichever side reduces the loss the most, to address missingness without separate imputation. We also applied the GBDT models to the imputed data to compare the two strategies of handling missing values in this study. Categorical features were converted into a binary format using one-hot encoding (i.e., variables with k categories were transformed into k binary indicator variables).
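The built-in handling of missing values in gradient-boosted trees can be illustrated with a simplified split evaluation. The sketch below uses the sum of squared errors as the loss and tries sending all rows with a missing feature value to each side of a candidate split in turn; the actual libraries work with gradient statistics and binned features, so this is a conceptual approximation with made-up data and hypothetical names.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_missing_direction(rows, threshold):
    """Pick the side for missing feature values that minimises squared error.

    rows: list of (feature_value_or_None, outcome) pairs.
    """
    left = [y for x, y in rows if x is not None and x <= threshold]
    right = [y for x, y in rows if x is not None and x > threshold]
    missing = [y for x, y in rows if x is None]
    # Evaluate the split twice: missing rows routed left, then routed right.
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right


# Toy rows: (waist circumference in cm or None, VO2max in ml/kg/min); values are made up.
rows = [(70, 48.0), (75, 46.0), (80, 44.0), (100, 34.0), (105, 32.0),
        (None, 33.0), (None, 35.0)]
direction, loss = best_missing_direction(rows, threshold=90)
```

In this toy example the rows with missing waist circumference have outcomes close to those of the high-waist group, so routing them to the right yields the lower loss.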
While generating nationally representative estimates, we incorporated the corresponding strata and weights for the estimation as per NHANES requirement. Participants’ weights were pooled and divided by the number of years studied, following the NHANES guidance. We reported descriptive statistics of participant characteristics, in the nationally representative form, as the mean (SD) or mean (range) for continuous variables and as the percentage for categorical variables, as appropriate. All statistical tests were two-sided, with a level of significance of 0.05. Data analysis was performed using R (version 4.0.1) and Python (version 3.8.5).
RESULTS
Baseline Characteristics
From NHANES 1999-2004, we included 5,668 participants in our final study population for model development. Among the nationally representative estimates of the baseline characteristics, the mean (SD) age was 32.5 (10.0) years; 49.9% were female, 69.5% were White, 15.4% were Hispanic, 10.6% were Black, and 5% belonged to the other race and ethnicity group. Additionally, 21.8% of the study population lacked health insurance coverage, 30.7% had a low family income level, 18.0% were unemployed or not in the labor force, and 55.7% had more than a high school education. The observed values of VO2max had a mean (range) of 40.1 (18.2-132.1) ml/kg/min.
Model comparison
Among the existing non-exercise algorithms assessed in the validation set, Jackson’s sex-specific algorithms28 in general had the best performance: the version using a 5-level physical activity score and BMI provided the best combination of RMSE (9.71 [95% CI: 8.94-10.62]) and R2 (0.25 [95% CI: 0.17-0.26]). Jackson’s sex-specific algorithms were followed by the models from Matthews23 and Jurca24, whose RMSEs were slightly over 10. Other methods had relatively poor performances, with RMSEs greater than 11. Detailed information for the existing non-exercise models is listed in Table 1.
As for the newly developed parsimonious ML models, the LightGBM model with the algorithm’s default solution for the missingness performed the best with an RMSE of 8.51 [95% CI: 7.73-9.33] and an R2 of 0.28 [95% CI: 0.23-0.33] on the validation set. Among the extended ML models that include all candidate variables, the LightGBM model with the algorithm’s default solution for the missingness also outperformed the others with an RMSE of 8.26 [95% CI: 7.44-9.09] and an R2 of 0.32 [95% CI: 0.26-0.38] on the validation set (Supplemental Figure S5). The parsimonious and the extended LightGBM models significantly reduced the error by 12% (P<0.001) and 15% (P<0.001), respectively, compared with the best existing non-exercise model verified in this study (Figure 2). Although the extended LightGBM model performed significantly better than the parsimonious one (P<0.05), the agreement between these two models was strong: the Bland-Altman plot (Supplemental Figure S6) showed no significant systematic difference. We also investigated the distributions of the differences between the observations and the predictions provided by either the optimal ML models or the existing non-exercise algorithms. Based on the density curves of those distributions (Supplemental Figure S7), the optimal ML models in general had smaller biases than the existing algorithms, and there were no systematic or consistent differences in bias among the optimal ML models. Thus, the parsimonious model, though not as accurate as the extended model, can be a reliable alternative in most cases where the extended model is not applicable. Other newly developed ML models in this study also generally outperformed the existing non-exercise algorithms (Table 1, Figure 2, Supplemental Figure S5).
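The reported error reductions can be reproduced from the RMSE point estimates given above, assuming they were computed relative to the best existing algorithm's RMSE of 9.71 on the validation set. That assumption, and the names used, belong to this sketch only.

```python
def relative_reduction(baseline_rmse, model_rmse):
    """Fractional reduction in RMSE relative to a baseline model."""
    return (baseline_rmse - model_rmse) / baseline_rmse


best_existing = 9.71  # best existing non-exercise algorithm (validation set)
parsimonious = 8.51   # parsimonious LightGBM model
extended = 8.26       # extended LightGBM model

# Rounded percentage reductions relative to the best existing algorithm.
parsimonious_pct = round(100 * relative_reduction(best_existing, parsimonious))
extended_pct = round(100 * relative_reduction(best_existing, extended))
```

Under this assumption the parsimonious model corresponds to a reduction of about 12% and the extended model to about 15%.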
Feature importance
We selected the top 16 relevant features for each optimal ML model based on the overall feature importance scores provided by SHAP. The SHAP analysis on the optimal parsimonious LightGBM model revealed that, in order of importance, gender, waist circumference, the 60-second resting pulse rate, weekly total moderate-intensity activity time, BMI, age, height, race and ethnicity, family poverty income ratio, highest education level, daily alcohol intake, diastolic blood pressure, weight, weight difference with last year, systolic blood pressure, and smoking status are the key predictors of VO2max. The extended LightGBM model’s SHAP values additionally identified the averaged leg percent fat, averaged arm percent fat, trunk percent fat, calcium, total cholesterol, creatinine, leg bone mineral density (BMD), potassium, BUN (blood urea nitrogen), and trunk lean as important features (Figure 3). Specifically, considering only the continuous features, the percent fat of leg, arm, and trunk, the 60-second resting pulse rate, age, calcium, poverty income ratio, total cholesterol, diastolic blood pressure, creatinine, waist circumference, BMI, weight, and systolic blood pressure are negatively associated with the predicted VO2max: the higher the value of these features, the lower the predicted VO2max. In contrast, the weekly total moderate-intensity activity time, BUN, trunk lean, height, and weight loss in one year are positively associated with the predicted VO2max (Figure 3, Supplemental Figure S9, S10).
Prediction for NHANES
The extended and the parsimonious model predictions had similar distributions to the exercise test measurements from 1999-2004 (Supplemental Figure S8). Sex-specific nationally representative estimates of the mean VO2max were reported in Figure 4. Different types of estimates were relatively consistent across NHANES cycles, and no significant trends or changes in VO2max at the national scale were found.
DISCUSSION
The ML models developed in this study outperformed existing non-exercise algorithms for predicting VO2max. Such models can be applied to NHANES cycles that lacked the CRF component to make nationally representative estimates of cardiorespiratory fitness. In addition, this model may be useful in clinical situations where there is a need for an estimate of the VO2max in the absence of an exercise test.
The algorithm developed in this study advances the field in several ways. First, we used the NHANES data, a national data source, to develop our models. Models built on NHANES data are therefore expected to generalize well, which allows us to provide nationally representative estimates. Most previous studies depended on local data or were biased towards specific subgroups, with poor external validity. For instance, the NASA study population was only representative of the NASA/Johnson Space Center workforce,19,20 and the ACLS study consisted mostly of white participants of higher socioeconomic status (SES).8,9
Second, we used a more comprehensive set of variables to develop the models and applied state-of-the-art ML algorithms to address non-linear associations and interactions among the predictors. The comprehensive set of variables explored in this study came from both the NHANES interviews and the physical examination sections, which included demographic, socioeconomic, and health-related questions, medical and physiological measurements, and laboratory tests administered by highly trained medical personnel. By using ML algorithms, our models had stronger predictive power, and they were more efficient to train. Our models significantly outperformed the previous methods, which had a limited number of predictors and used mostly linear models.
Third, we applied the improved models to generate nationally representative estimates of CRF, thereby providing new insights into cardiorespiratory health in the US population. NHANES CRF data were measured in 1999-2004 only, so national data on fitness in the U.S. have been lacking since 2005. We demonstrated, in NHANES, the feasibility of predicting the discontinued measurement of VO2max with a model generated from cycles that had VO2max (in the CRF component) recorded. The predicted outcomes were consistent with the real measurements in the form of nationally representative estimates, so they could be considered reliable approximations for those cycles that lacked VO2max data.
Our findings have important clinical and public health implications. First, using accessible national survey data and ML algorithms, we demonstrated the feasibility of developing reliable prediction models of CRF and comprehensively evaluating the CRF status of the U.S. population. These models can be calibrated and generalized to other populations to predict CRF. Moreover, we provide a useful pipeline for researchers and policymakers to leverage existing national data to predict important population health variables that were not consistently available, either because they were missing in earlier cycles or because they were discontinued in later periods. Second, the nationally representative estimates of the predicted VO2max, along with the ones from the real measurements, provided insights into the cardiorespiratory fitness status of the US population over the recent 20 years. These results will enable a better understanding of overall cardiovascular health and offer a reference for the formation of relevant public health guidelines and policies. Another implication was the newly discovered factors that correlated with CRF. Besides the commonly acknowledged predictors, we found family income, BUN, blood pressure, cholesterol, leg bone mineral density, trunk lean, and creatinine to be important in the CRF prediction. Regarding fat percent for different parts of the body (leg, trunk, arm), we found that the averaged leg percent fat had the highest relevance to CRF. The disparity in fat percent of different body parts as risk factors is worthy of future research.
Our study has several limitations. First, the extended models built in this study were cross-sectional. Although they furnished reliable estimates of CRF at the population level, they could not assess changes in CRF. The effect of time was neglected when developing the models in this study, leaving time interactions unevaluated. Second, the available CRF outcomes in NHANES were derived from the submaximal exercise test, which is not the optimal way of measuring CRF. In a submaximal exercise test, rather than being directly measured, VO2max is calculated with the assumption that the relation between heart rate and oxygen consumption is linear during exercise. Therefore, the models and the corresponding predictions may be subject to the errors or violations introduced by the indirect nature of the submaximal exercise test. Third, although the models built in this study were expected to have better generalizability, they were still limited by the exclusion criteria and selected variables in NHANES. The exclusion criteria NHANES adopted narrowed the study population to relatively healthy, younger (16-49 years) people who qualified for the submaximal exercise test, which constrains the use of the model in the elderly population. Also, the various kinds of NHANES variables used in this study might not be available in other healthcare settings. Thus, the actual feasibility of using our models outside the scope of NHANES remains to be determined by external validation, and extra attention to the study population is required both when applying the models and when interpreting their results. Finally, we only considered CRF as a continuous outcome in this study, while clinical measurements are more useful as prognostic indicators when a specified level of the parameter being measured identifies a threshold of increased risk for adverse health outcomes.
Currently, there is no complete consensus on a level of CRF that classifies an asymptomatic individual as high risk, nor is there agreement as to what level of CRF is sufficient in the context of health and disease prevention. We did not add information on this important topic, but such insight can be gained by examining the association between the VO2max predicted by the model and other significant health outcomes in future NHANES studies. In all, additional work is needed to assess the applicability of this approach in primary care and other settings, to verify the validity of non-exercise estimates of CRF as predictors of health outcomes, and to establish a target level of CRF for primary prevention, placing CRF levels into clinical relevance like guidelines for blood pressure or body mass index.
In summary, we presented ML models built from NHANES data to predict VO2max. We demonstrated the viability of using these models as a proxy for direct measurements via exercise testing. They can make more accurate predictions and more reliable nationally representative estimates of CRF than the previously published models.
Data Availability
All data produced are available online at https://wwwn.cdc.gov/nchs/nhanes/Default.aspx
Funding
None.
Disclosures
In the past three years, Harlan Krumholz received expenses and/or personal fees from UnitedHealth, Element Science, Aetna, Reality Labs, Tesseract/4Catalyst, F-Prime, the Siegfried and Jensen Law Firm, Arnold and Porter Law Firm, and Martin/Baughman Law Firm. He is a co-founder of Refactor Health and HugoHealth, and is associated with contracts, through Yale New Haven Hospital, from the Centers for Medicare & Medicaid Services and through Yale University from Johnson & Johnson. Bobak Mortazavi received expenses and/or personal fees from HugoHealth, as a consultant. Dr. Khera receives support from the National Heart, Lung, and Blood Institute of the National Institutes of Health under award, 1K23HL153775, and is a founder of Evidence2Health, a precision health and digital health analytics platform. The other co-authors report no potential competing interests.
Supplemental Material
Tables S1–S4
Figure S5-S9
References 37-43, 48-49
ACKNOWLEDGEMENTS
Non-standard Abbreviations and Acronyms
- CRF: Cardiorespiratory fitness
- VO2max: Maximal oxygen uptake
- CPX: Cardiopulmonary exercise testing
- ML: Machine learning
- NHANES: National Health and Nutrition Examination Survey
- STROBE: Strengthening the Reporting of Observational Studies in Epidemiology
- COVID-19: Coronavirus disease 2019
- MEC: Mobile Examination Center
- KNN: K-Nearest Neighbors
- LASSO: Least Absolute Shrinkage and Selection Operator
- SVR: Support Vector Regression
- RF: Random Forest
- GBDT: Gradient Boosting decision tree
- XGBoost: Extreme Gradient Boosting
- LightGBM: Light Gradient Boosting Machine
- SHAP: Shapley additive explanation