ABSTRACT
Background Maximal oxygen uptake (VO2 max), an indicator of cardiorespiratory fitness (CRF), requires exercise testing and, as a result, is rarely ascertained in large-scale population-based studies. Non-exercise algorithms are cost-effective methods to estimate VO2 max, but the existing models have limitations in generalizability and predictive power. This study aims to improve the non-exercise algorithms using machine learning (ML) methods and data from U.S. national population surveys.
Methods We used the 1999-2004 data from the National Health and Nutrition Examination Survey (NHANES), in which a submaximal exercise test produced an estimate of VO2 max. We applied multiple supervised ML algorithms to build two models: a parsimonious model that used variables readily available in clinical practice, and an extended model that additionally included more complex variables from Dual-Energy X-ray Absorptiometry (DEXA) and standard laboratory tests. We used Shapley additive explanation (SHAP) to interpret the new model and identify the key predictors. For comparison, existing non-exercise algorithms were applied unmodified to the testing set.
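As an illustration of the idea behind SHAP, the sketch below computes exact Shapley values for a single prediction by enumerating feature coalitions. Everything in it is an assumption made for illustration: the `toy_model` function, its coefficients, the feature names, and the single reference point `baseline` are hypothetical and are not taken from the study. The SHAP library averages over a background dataset and uses tree-specific algorithms for models such as LightGBM; this brute-force version only shows the underlying attribution rule.

```python
from itertools import combinations
from math import factorial


def toy_model(features):
    """Hypothetical VO2 max predictor (ml/kg/min); coefficients are illustrative only."""
    age, bmi, female = features["age"], features["bmi"], features["female"]
    return 60.0 - 0.25 * age - 0.5 * bmi - 6.0 * female - 0.01 * age * female


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value
    (a single reference point, which is a simplifying assumption).
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical participant and reference point (illustrative values)
x = {"age": 45.0, "bmi": 30.0, "female": 1.0}
baseline = {"age": 32.5, "bmi": 25.0, "female": 0.0}
phi = shapley_values(toy_model, x, baseline)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.3f} ml/kg/min")
```

The contributions sum to the difference between the prediction for `x` and the prediction for `baseline`, which is the additivity property that makes SHAP values easy to read as per-feature effects.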
Results Among the 5,668 NHANES participants included in the final study population, the mean age was 32.5 years and 49.9% were women. Light Gradient Boosting Machine (LightGBM) performed best among the supervised ML algorithms evaluated. Compared with the best existing non-exercise algorithms that could be applied in NHANES, the parsimonious LightGBM model (root mean squared error [RMSE]: 8.51 ml/kg/min [95% CI: 7.73-9.33]) and the extended LightGBM model (RMSE: 8.26 ml/kg/min [95% CI: 7.44-9.09]) significantly reduced the error by 15% and 12%, respectively (P<0.01 for both).
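The quantities reported above (an RMSE, its 95% confidence interval, and a relative error reduction against a reference algorithm) can be sketched as follows. This is a minimal illustration on synthetic data, not the study's analysis: the abstract does not state how the confidence intervals were obtained, so the percentile bootstrap used here is an assumption, and the function names and simulated values are hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def bootstrap_rmse_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the RMSE (assumed method, not from the paper)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample participants with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(rmse([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def relative_reduction(rmse_new, rmse_ref):
    """Fraction by which the new model lowers the RMSE of the reference."""
    return (rmse_ref - rmse_new) / rmse_ref


# Synthetic test set: simulated VO2 max values and two sets of predictions
rng = random.Random(42)
y_true = [rng.gauss(40.0, 9.0) for _ in range(300)]
pred_new = [t + rng.gauss(0.0, 8.0) for t in y_true]   # stand-in for an ML model
pred_ref = [t + rng.gauss(0.0, 10.0) for t in y_true]  # stand-in for an existing algorithm

rmse_new = rmse(y_true, pred_new)
rmse_ref = rmse(y_true, pred_ref)
ci_new = bootstrap_rmse_ci(y_true, pred_new)
print(f"RMSE (new): {rmse_new:.2f} [95% CI: {ci_new[0]:.2f}-{ci_new[1]:.2f}]")
print(f"Relative reduction vs reference: {relative_reduction(rmse_new, rmse_ref):.1%}")
```

Resampling participants rather than residuals keeps each observed value paired with its prediction, which is the usual choice when evaluating on a held-out testing set.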
Conclusion Our non-exercise ML model provides a more accurate prediction of VO2 max for NHANES participants than existing non-exercise algorithms.
What is Known
Although cardiorespiratory fitness is recognized as an important marker of cardiovascular health, it is not routinely measured because of the time and resources required to perform exercise tests.
Non-exercise algorithms are cost-effective alternatives to estimate cardiorespiratory fitness, but the existing models are restricted in generalizability and predictive power.
What the Study Adds
We improve non-exercise algorithms for cardiorespiratory fitness prediction using advanced ML methods and a more comprehensive and representative data source from U.S. national population surveys.
We identify additional health factors associated with cardiorespiratory fitness.
We generate nationally representative estimates of cardiorespiratory fitness in the U.S. over the past 20 years.
Competing Interest Statement
In the past three years, Harlan Krumholz received expenses and/or personal fees from UnitedHealth, Element Science, Aetna, Reality Labs, Tesseract/4Catalyst, F-Prime, the Siegfried and Jensen Law Firm, Arnold and Porter Law Firm, and Martin/Baughman Law Firm. He is a co-founder of Refactor Health and HugoHealth, and is associated with contracts, through Yale New Haven Hospital, from the Centers for Medicare & Medicaid Services and through Yale University from Johnson & Johnson. Bobak Mortazavi received expenses and/or personal fees from HugoHealth as a consultant. Dr. Khera receives support from the National Heart, Lung, and Blood Institute of the National Institutes of Health under award 1K23HL153775 and is a founder of Evidence2Health, a precision health and digital health analytics platform. The other co-authors report no potential competing interests.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study used (or will use) ONLY openly available human data that were originally located at: https://wwwn.cdc.gov/nchs/nhanes/Default.aspx
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
All data produced are available online at https://wwwn.cdc.gov/nchs/nhanes/Default.aspx
Non-standard Abbreviations and Acronyms
- CRF: Cardiorespiratory fitness
- VO2max: Maximal oxygen uptake
- CPX: Cardiopulmonary exercise testing
- ML: Machine learning
- NHANES: National Health and Nutrition Examination Survey
- STROBE: Strengthening the Reporting of Observational Studies in Epidemiology
- COVID-19: Coronavirus disease 2019
- MEC: Mobile Examination Center
- KNN: K-Nearest Neighbors
- LASSO: Least Absolute Shrinkage and Selection Operator
- SVR: Support Vector Regression
- RF: Random Forest
- GBDT: Gradient Boosting Decision Tree
- XGBoost: Extreme Gradient Boosting
- LightGBM: Light Gradient Boosting Machine
- SHAP: Shapley additive explanation