ABSTRACT
Background Global targets to reduce salt intake have been proposed but their monitoring is challenged by the lack of population-based data on salt consumption. We developed a machine learning (ML) model to predict salt consumption based on simple predictors, and applied this model to national surveys in low- and middle-income countries (LMICs).
Methods Pooled analysis of WHO STEPS surveys. We used 19 surveys with spot urine samples for the ML model derivation and validation; we developed a supervised ML regression model based on: sex, age, weight, height, systolic and diastolic blood pressure. We applied the ML model to 49 new STEPS surveys to quantify the mean salt consumption in the population.
Results The pooled dataset in which we developed the ML model included 45,152 people. Overall, there were no substantial differences between the observed (8.1 g/day (95% CI: 8.0-8.2 g/day)) and ML-predicted (8.1 g/day (95% CI: 8.1-8.2 g/day)) mean salt intake (p= 0.065). The pooled dataset where we applied the ML model included 157,699 people; the overall predicted mean salt consumption was 8.1 g/day (95% CI: 8.1-8.2 g/day). The countries with the highest predicted mean salt intake were in Western Pacific. The lowest predicted intake was found in Africa. The country-specific predicted mean salt intake was within reasonable difference from the best available evidence.
Conclusions A ML model based on readily available predictors estimated daily salt consumption with good accuracy. This model could be used to predict mean salt consumption in the general population where urine samples are not available.
Funding Wellcome Trust (214185/Z/18/Z)
INTRODUCTION
The association between high sodium/salt intake and high blood pressure, a major risk factor of cardiovascular diseases (CVD), is well-established.1–3 More than 1.7 million CVD deaths were attributed to a diet high in sodium in 2019, with ∼90% of these deaths occurring in low- and middle-income countries (LMICs).4, 5 Consequently, salt reduction has been included in international goals: the World Health Organization (WHO) recommendation of limiting salt consumption to <5 g/day,2 and the agreement by the WHO state members of a 30% relative reduction in mean population salt intake by 2025.6 Because available evidence suggests that sodium/salt consumption is higher than the global targets,7–9 we need timely and consistent data of sodium/salt consumption in the general population to track progress of salt reduction targets.
Global efforts have been made to produce comparable estimates of sodium/salt intake for all countries.7 Similarly, researchers have summarized all the available evidence in specific world regions.8, 9 Although the global endeavor was based on the gold standard method to assess sodium/salt intake (i.e., 24-hour urine sample), their estimates were up to 2010.7 Therefore, robust and comparable sodium/salt intake estimates for all countries lack for the last ten years. The regional endeavors summarized population-based evidence, yet they conducted study-level meta-analyses in which the original studies could have followed different laboratory methods, and they did not study all countries in the region. Therefore, comparability across studies could be limited and evidence lacks for many countries. Finding a method to estimate sodium/salt consumption in national samples leveraging on available data is needed to update and complement the existing evidence.7–10 However, quantifying sodium/salt intake based on 24-hour urine samples is costly and burdensome, limiting its use in population-based studies or national health surveys. As an alternative, equations have been developed to estimate sodium/salt intake based on spot urine (SU) samples.11–14 However, these equations have been used in few WHO STEPS and other national health surveys,15 leaving several countries without data to quantify the local sodium/salt consumption because they do not have access to SU samples.16
If we could (accurately) estimate sodium/salt intake based on variables that are routinely available in national health surveys (e.g., weight or blood pressure), mean sodium/salt intake in countries that currently lack urine data (i.e., 24-hour or spot) could be computed using these available predictors. Advanced analytic techniques like machine learning (ML) could make accurate predictions, and inform about the population sodium/salt intake in LMICs. We developed a ML predictive model to estimate salt intake using routinely available variables in WHO STEPS surveys, and we applied this ML model to WHO STEPS surveys without urine data to compute the mean salt intake in the general population.
METHODS
Study design
This is a data pooling ML analysis of individual-level data of WHO STEPS surveys. First, we downloaded 19 WHO STEPS surveys with SU samples; these surveys were used for the training, validation and testing the ML model. These 19 WHO STEPS surveys represented 17 LMICs; two countries contributed with two surveys: Bhutan 2014 and 2019 as well as Mongolia 2013 and 2019. Second, we downloaded 49 new WHO STEPS surveys which had the variables included in the ML prediction model (see Variables section), but did not have SU samples. The ML model herein developed was applied to these 49 surveys to estimate the mean salt consumption in the population. All WHO STEPS surveys herein analysed were national (i.e., sub-national surveys were excluded).
Rationale
We hypothesized that a ML model could accurately predict salt consumption at the individual level, to then inform the overall mean in the underlying population. In addition, we endeavoured to develop a ML model with simple predictors; that is, variables that are routinely available in national health surveys contrary to urine sample that are seldom collected in national health surveys in LMICs. If the model were indeed accurate, then it could be applied to national surveys without urine samples but with the relevant predictors to inform about the mean salt consumption in the target population. These model-driven estimates could be preliminary, until a national health survey is conducted to study mean salt consumption with urine samples (ideally 24-hour urine samples).
Variables
The predictors we used in the ML model were: sex, age (years), weight (kg), height (m), systolic blood pressure (SBP, mmHg) and diastolic blood pressure (DBP, mmHg).
The WHO STEPS surveys collect anthropometric and three blood pressure measurements. These are taken by trained fieldworkers following a standard protocol.15 We used measured weight and height to compute the body mass index (BMI, kg/m2). We used the mean SBP and mean DBP of the second and third blood pressure measurements (i.e., the first blood pressure measurement was discarded).
The outcome was salt intake as per the INTERSALT equation.11 We chose this equation because it has been used by some WHO STEPS surveys. There is a specific INTERSALT equation for each sex, and they both include the following variables: age (years), BMI (kg/m2), SU sodium (mmol/L), and SU creatinine (mmol/L).11 We used the following sex-specific formulas.
Where the subscript SU indicates spot urine, Na is sodium, Cr is creatinine and BMI is body mass index. Because some STEPS surveys had SU creatinine in mg/dl, these values were multiplied by 0.00884 to obtain SU creatinine in mmol/L. No conversion was needed for sodium in SU samples because all WHO STEPS surveys herein included already had urinary sodium in mmol/L. The INTERSALT equation computes 24-hour sodium intake, which is then divided by 17.1 to obtain the salt intake in grams per day (g/d).11 For descriptive purposes, we also computed salt intake based on the Kawasaki,12 Toft13 and Tanaka14 equations.
Analysis
Data preparation
Our complete-case analysis was restricted to men and non-pregnant women aged between 15 and 69 years. We dropped participants with implausible BMI levels (outside the range 10-80 kg/m2) or with implausible weight (outside the range 12-300 kg) or height records (outside the range 1.00-2.50 m). Participants with SBP outside the range 70-270 mmHg were discarded, and so were participants with DBP outside the range 30-150 mmHg. We excluded records with SU creatinine <1.8 or >32.7 mmol/L for males and <1.8 or >28.3 for females.17, 18 In addition, we excluded participants with estimated salt intake (using the 4 equations) above or below three standard deviations from the equation-specific mean (Supplementary Figure 1).19 After completing data preparation, observations were randomly assigned from the pooled dataset (100%) into three datasets for the ML analysis: training data (50%), test data (30%) and validation data (20%).
Machine learning modelling
Our research aim was a regression problem where we had a known outcome attribute (salt consumption at the subject level). Therefore, we planned a supervised ML regression analysis. Details about the modelling process are available in the Extended Methods (Supplementary Material pp. 03-06). In brief, we designed a work pipeline with five steps. First, data analysis, where we dropped missing observations, we explored the available data to choose scaling and transformation methods to secure all variables were in the same scale or units, and we also planned transformations for categorical variables (e.g., one-hot encoding). Second, feature importance analysis, where we investigated the contribution of each predictor to the regression model through methods like Random Forest and Recursive Feature Elimination. The aim of this second step was to exclude any predictor that would not contribute to the regression model. Notably, all predictors (see Variables section) chosen following expert knowledge were kept in the analysis (i.e., the feature importance analysis did not suggest the exclusion of any predictor). Third, data processing, having explored the available data (first step in the work pipeline), we implemented different scaling and transformation methods (e.g., Box-Cox, Principal Component Analysis and polynomial features). Fourth, data modelling, where we implemented ten ML algorithms: i) Linear Regression (LiR); ii) Hubber Regressor (HuR); iii) Ridge Regressor (RiR); iv) Multilayer Perceptron (MLP); v) Support Vector Regressor (SVR); vi) k-Nearest Neighbors (KNN); vii) Random Forest (RF); viii) Gradient Boost Machine (GBM); ix) Extreme Gradient Boosting (XBG); and x) a customized neural network. All these ML algorithms performed similarly, so the decision to choose one was postponed to the fifth (last) step in the work pipeline. Up to this point, we used the training and validations datasets. Five, forecasting of the predicted attribute in new data (i.e., data not used for model training); in this step we used the test dataset to choose the model that yielded predictions closest to the observed salt intake. Results comparing the observed and the predicted salt intake were computed in the test dataset alone. For each country we ran a paired t-test between the observed and predicted salt consumption, where a difference was deemed significant at a p <0.05. We also computed the absolute difference between the observed and predicted salt intake. We chose the RF algorithm because it showed the mean difference closest to zero in both sexes combined (observed – predicted = 0) (Supplementary table 1, Supplementary figure 2). All summary estimates (e.g., mean salt intake) were computed accounting for the complex survey design of the WHO STEPS surveys.
Application of the developed ML model
Having developed the ML model following the steps above described, we applied the model to 49 WHO STEPS national surveys which did not have urine samples but included the predictors in the ML model (see Variables section). In each of these 49 surveys we computed the mean daily salt intake accounting for the complex survey design. These surveys were pre-processed following the same procedures described in the Data preparation section.
Ethics
We did not seek approval by an Institutional Review Board. We used individual-level survey data which do not include any personal identifiers.
Role of the funding source
The funder had no role in the study design, analysis, interpretation or decision to publish. The authors are collectively responsible for the accuracy of the data. The arguments and opinions in this work are those of the authors alone, and do not represent the position of the institutions to which they belong.
RESULTS
Study population for model derivation and validation
The pooled dataset included 45,152 people from 17 LMICs in 19 WHO STEPS surveys (i.e., two countries, Bhutan and Mongolia, had 2 surveys) conducted between 2013-2019 (Supplementary table 2). Overall, the mean age was 38.6 (95% confidence interval (95% CI): 38.1-39.0) years and the proportion of women was 52.1%. The mean SBP was 122.0 mmHg (95% CI: 121.4-122.6 mmHg) and mean DBP was 79.2 mmHg (95% CI: 78.8-79.6 mmHg). The mean weight was 59.2 kg (95% CI: 58.8-59.7 kg) and the mean height was 1.59 m (95% CI: 1.58-1.59 m).
Observed and predicted mean salt intake during the ML model derivation and validation
In the test dataset including 19 WHO STEPS surveys, the observed mean salt intake computed as per the INTERSALT equation was 8.1 g/day (95% CI: 8.0-8.2 g/day) across countries. The observed salt intake was higher in men (8.9 g/day; 95% CI: 8.7-9.0 g/day) than in women (7.4 g/day; 95% CI: 7.3-7.4 g/day). Across countries, the predicted mean salt intake was 8.1 g/day (95% CI: 8.1-8.2 g/day). Men had a higher predicted mean salt intake (8.9 g/day; 95% CI: 8.8-8.9 g/day) than women (7.4 g/day; 95% CI: 7.4-7.5 g/day). Overall, there were no substantial differences between the observed and predicted estimates (p=0.065). Results for each survey are presented in Figure 1 and Supplementary table 3.
In men across all countries in the test dataset including 19 WHO STEPS surveys representing 17 LMICs, the mean difference between observed and predicted mean salt intake was 0.02 g/day (p=0.277). Across all surveys, the positive mean difference farthest from zero was 1.39 g/day (Malawi, p<0.001), and the negative mean difference farthest from zero was −0.74 g/day (Lebanon, p=0.227). The mean difference closest to zero was 0.03 g/day (Armenia, p=0.787) (Supplementary table 4).
In women across all countries in the test dataset including 19 WHO STEPS surveys representing 17 LMICs, the mean difference between the observed and predicted mean salt intake was −0.02 g/day (p=0.124). The positive mean difference farthest from zero was 1.02 g/day (Malawi, p<0.001) and the negative mean difference farthest from zero was in −0.71 g/day (Brunei Darussalam, p<0.001). The mean difference closest to zero was −0.07 g/day (Armenia, p=0.343) (Supplementary table 4).
None of the LMICs herein analysed, regardless of the method of sodium intake assessment (i.e., observed or predicted), showed a mean salt intake below the WHO recommended level of <5 g/day (Figure 1, Supplementary table 3). The same occurred for the mean salt intake estimates using the Kawasaki, Toft and Tanaka formulas (Supplementary table 5).
Implementation of the developed ML model to predict salt consumption in 49 LMICs
The pooled dataset where we applied the ML model included 157,699 people from 49 LMICs in 49 WHO STEPS surveys conducted between 2004-2018 (Supplementary table 6). Overall, the mean age was 37.7 (95% CI: 37.2-38.1) years and the proportion of women was 48.8%. The mean SBP was 123.8 mmHg (95% CI: 123.1-124.4 mmHg) and mean DBP was 79.1 mmHg (95% CI: 78.7-79.6 mmHg). The mean weight was 60.6 kg (95% CI: 60.2-61.0 kg) and the mean height was 1.61 m (95% CI: 1.61-1.61 m).
Across the 49 LMICs, the predicted mean salt intake was 8.1 g/day (95% CI: 8.1-8.2 g/day), and it was higher in men (8.9 g/day; 95% CI: 8.8-8.9 g/day) than in women (7.4 g/day; 95% CI: 7.3-7.4 g/day). None of the LMICs herein analysed, regardless of sex, showed a predicted mean salt intake below the WHO recommended level of <5 g/day (Figure 2, Supplementary table 7).
In men, the countries with the highest predicted mean salt intakes were Nauru and American Samoa (both with 11.1 g/day), Cook Islands (11.0 g/day) and Niue (10.7 g/day); remarkably, three of these countries (Nauru, Cook Islands and Niue) are in Western Pacific. In contrast, the lowest predicted mean salt intake in men were in Ethiopia (8.3 g/day), Eritrea and Timor-Leste (both with 8.4 g/day), and United Republic of Tanzania, Botswana and Uganda (all with 8.6 g/day); remarkably, all of these countries except Timor-Leste are in Africa.
In women, the countries with the highest predicted mean salt intakes were American Samoa (9.0 g/day), Nauru (8.8 g/day), and Cook Islands, Samoa and Tuvalu (all with 8.6 g/day); the latter four countries are in Western Pacific. Conversely, the lowest predicted mean salt intake in women were in Ethiopia and Eritrea (both with 6.9 g/day), Timor-Leste (7.0 g/day), and Vietnam (7.1 g/day); the first two countries (out of four) are in Africa.
DISCUSSION
Main findings
This work leveraged on 19 national health surveys and readily available predictors to develop a ML model to predict salt consumption; this model was then applied to national surveys in 49 LMICs. The RF ML algorithm yielded the predictions closest to the observed salt intake: the mean difference between predicted and observed salt consumption across surveys was 0.02 g/day in men and −0.02 g/day in women. We used this novel ML model to predict salt consumption in 49 LMICs, where the mean salt consumption ranged from 8.3 g/day (Ethiopia) to 11.1 g/day (Nauru) in men; these numbers in women ranged from 6.9 g/day (Ethiopia and Eritrea) to 9.0 g/day (American Samoa). This work aimed to elaborate on novel analytical tools to predict salt consumption where national surveys have not collected this information limiting their ability to keep track of mean sodium consumption in the general population. Pending external independent validation, our model could be used in monitoring frameworks of salt consumption because most countries do not collect sodium samples in their national health surveys. Our model could contribute to the global surveillance of salt consumption, a relevant cardio-metabolic risk factor.1–3
Public health implications
ML models have been used extensively to predict relevant clinical outcomes (e.g., mortality) and epidemiological indicators (e.g., forecasting COVID-19 cases).20–24 Furthermore, ML algorithms have proven to be useful for understanding complex outcomes (e.g., identifying clusters of people with diabetes) based on simple predictors (e.g., BMI) in nationally-representative survey data.25–27 Our work complements the current evidence on ML algorithms by demonstrating its use in a relevant field: population salt consumption. In so doing, we delivered a pragmatic tool which could be used to inform the surveillance of salt consumption in countries where national surveys do not objectively collect this information (e.g., SU samples). Moreover, this work provided preliminary evidence to update the global estimates of population-based sodium consumption,7 by informing about the mean sodium consumption in 49 LMICs. Our results suggest that mean salt consumption is above the WHO recommended level in all the 49 LMICs herein analysed, and it was the highest among LMICs in Western Pacific, and the lowest among LMICs in Africa. This finding, which is consistent with a global work,7 calls for urgent actions to reduce salt consumption in these 49 LMICs, especially those in Western Pacific.
We do not believe that our -or any other-ML model should replace a comprehensive population-based nationally representative health survey with 24-hour or spot urine samples. However, until such surveys are available in many LMICs and periodically conducted, we could suggest using an estimation approach to shed lights about the mean salt consumption in the population. Our ML model seems to be a reasonably good alternative, and could become a pragmatic tool for surveillance systems that keep track of sodium consumption in accordance with global goals.2, 6
Research in context
A global effort provided mean sodium/salt consumption estimates for 187 countries in 1990 and 2010;7 they used 24-hour urine samples and dietary reports from surveys conducted in 66 countries, including LMICs. Unfortunately, their results were until 2010 and many LMICs were not included. Our results advanced this evidence by providing more recent salt consumption estimates, because most of the surveys in which we applied our ML model were conducted after 2010 (Supplementary table 6).
Compared to the global estimates for the same LMICs in 2010,7 our mean salt consumption estimates were very similar. For example, our 2010 mean salt consumption estimates for Cambodia, Eritrea and the Gambia were 8.0 g/day, 7.2 g/day and 8.3 g/day, whereas the estimates by Powles et al. were 11.0 g/day, 5.9 g/day and 7.7 g/day (Supplementary table 8).7 We further compared our estimates for surveys conducted between 2007 and 2013 (+/- 3 years around 2010) with the 2010 estimates provided by Powles et al.,7 and our results were also within reasonable difference. The largest differences were in Kyrgyzstan (8.7 by our ML model versus 13.4 by Powles et al.7), as well as in Comoros (8.4 versus 4.27) and Rwanda (8.0 versus 4.07). It appears that our predictions were higher than those provided by Powles et al.7 in countries with presumably low salt consumption (e.g., Comoros, Rwanda); conversely, in countries with presumably high salt consumption (e.g., Kyrgyzstan), our predictions revealed smaller estimates than those by Powles et al.7 (Supplementary table 8). These differences could be explained by the fact that our ML model was developed based on SU samples rather than 24-hour urine samples as Powles et al. did. Strong evidence indicates that estimates based on SU may overestimate salt intake at lower levels of consumption and underestimate salt intake at higher levels of consumption.28
In addition to the global work by Powles et al.,7 there are other reports from some specific LMICs. For example, a survey conducted between 2012 and 2016 with 24-hour urine samples in Fiji and Samoa showed that the mean salt consumption was 10.6 g/day and 7.1 g/day, respectively.17 The estimates from our ML model for Fiji (2011) and Samoa (2013) suggested that the mean salt consumption was 8.9 g/day and 9.6 g/day, respectively. A survey in Vanuatu in 2016 based on 24-hour urine sample informed that the mean salt intake was 5.9 g/day;18 our estimate for the year 2011was 8.6 g/day. In 2009 in Vietnam a survey with SU samples revealed that the mean salt consumption was 9.9 g/day;19 our prediction for the year 2015 was 7.9. These comparisons suggest that our ML-predicted estimates are plausible and close to the best available evidence.
Although these comparisons do not validate our predictions in the 49 national surveys, they suggest that our salt consumption estimates are within reasonable distance from the best available evidence. Until better data are available (e.g., national survey with spot or 24-hour urine sample), our model could provide preliminary evidence to inform the national mean salt consumption. Careful interpretation is warranted to understand the strengths and limitations of our ML-based predictions.
Strengths and limitations
We followed sound and transparent methods to develop a ML model to predict salt consumption at the individual level. We leveraged on open-access national data collected following standard and consistent protocols (WHO STEPS surveys). Most of the surveys we analysed were conducted after 2010, providing more recent evidence than the latest global effort to quantify salt consumption in all countries.7 Notwithstanding, we must acknowledge some limitations. First, urine data was based on a spot sample, which is not the gold standard (24-hour urine sample) to measure daily salt consumption. Future work should verify and advance our results using on 24-hour urine samples available in nationally representative samples; in the meantime, our work has led the foundations and hopefully sparked interest to use available data and novel analytical techniques to deliver estimates of salt consumption in the general population. Second, even though we analysed 19 national surveys (representing 17 LMICs) to develop our ML model, the sample size could still be limited for a data-driven ML algorithm (i.e., 22,577 observations were included in model development). A larger and global work in which all relevant data sources are pooled is needed; while this endeavour takes place, our work has provided recent estimates of salt consumption at the population level in 49 LMICs. In this line, there are still countries which were not herein included. Researchers in these countries, along with local (e.g., ministries of health) and international health authorities (e.g., WHO), should conduct studies/surveys to collect data on salt consumption. This would inform global targets but also local needs and interventions.
Conclusions
A ML model based on readily available variables was accurate to predict daily salt consumption. This ML model applied to 49 national surveys with no urine samples to compute daily salt consumption, revealed high levels of salt intake particularly in the Western Pacific region. Pending further validation, this ML model could be used to keep track of the overall sodium consumption where resources are not available to conduct national surveys with urine samples.
DATA SHARING STATEMENT
This study used nationally-representative survey data that are in the public domain, which was requested through the online repository (https://extranet.who.int/ncdsmicrodata/index.php/home). We provide the analysis code of data preparation and data analysis as supplementary materials to this paper.
Data Availability
This study used nationally-representative survey data that are in the public domain, which was requested through the online repository (https://extranet.who.int/ncdsmicrodata/index.php/home). We provide the analysis code of data preparation and data analysis as supplementary materials to this paper.
SUPPLEMENTARY MATERIAL
Expanded methods Overview
We worked with a structured dataset which mostly had numeric attributes (variables). Given our study problem, we opted for a supervised learning model because there was a target attribute (i.e., salt consumption at the subject level); specifically, we conducted a supervised regression because the target attribute was a numeric variable. For the machine learning analyses we used Python and the Scikit-Learn library.
First, we developed a pipeline for data management and model development. This way, we followed a consistent and transparent methodology to secure an optimal model for the training set and that would adequately generalize to other (unseen) datasets. The following figure depicts the pipeline we developed: i) we studied the available data and where needed, we did a one-hot encoding; ii) we did feature importance analysis; iii) we chose and tried different scaling and transformation methods, so that all variables would be in the same scale or units; iv) we tried a set of machine learning models, including a customized neural network; v) we forecasted (predicted) the attribute of interest (salt consumption at the subject level) in an unseen dataset (i.e., not used for model training). Notably, we went backwards and forwards (see arrows in the figure) between the four first stages until we reached the best combinations and results for each model. In the following sections, we will describe each of these five stages.
Data analysis
This was an exploratory analysis to understand the dataset and its characteristics. We worked with a complete-case dataset; in other words, we excluded missing observations in the variables considered in the analysis. Consequently, we did not do any data imputation analysis.
We explored the distribution of all numerical variables, which were in different units and scales; this exploratory analysis informed the choices of data processing methods (e.g., Box-Cox) implemented in the third stage.
Feature importance analysis
Even though we followed expert knowledge to select a reduced, though relevant number of predictors to be included in the regression model, we conducted feature importance analyses to understand the role each predictor would play in the model. This process aimed to eliminate variables that would not carry substantial information for the model. We used Random Forest, Recursive Feature Elimination and Extra Trees. Consistently, these three methods suggested that all the chosen predictors would contribute to a better model.
Data processing
As described in the data analysis section (first stage), numeric variables were in different units and scales; therefore, these variables needed to be scaled or transformed. This scaling would also help to find a better prediction model. It is common knowledge that machine learning models would perform differently (and better) depending on data transformation methods. We did: i) Min-Max whereby numeric variables were scaled to a range between 0 and 1; ii) Standardization; iii) Normalization: iv) Polynomial features of degree 2 (quadratic polynomial); v) Principal Component Analysis with 3 components and explained variance of ≥0.95; and vi) Box-Cox.
Data modelling
There are several machine learning algorithms for a supervised regression model. Those that we used, and that are depicted in the figure, yielded much better results and were studied in detail. That is, at the beginning of our work we explored other algorithms, though these did not perform well and were not considered thereafter. The algorithms we considered were: i) Linear Regression (LiR); ii) Hubber Regressor (HuR); iii) Ridge Regressor (RiR); iv) Multilayer Perceptron (MLP); v) Support Vector Regressor (SVR); vi) k-Nearest Neighbors (KNN); vii) Random Forest (RF); viii) Gradient Boost Machine (GBM); and ix) Extreme Gradient Boosting (XBG).
In addition to these nine machine learning algorithms, we also implemented a neural network (see figure below). This neural network was optimized empirically. We used a batch size = 256; epochs = 300; and optimizer = ‘adam’. The neural network was implemented in Python using the Keras library.
For each model and processing method (see Data Processing section) we studied the R2, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). As shown in the table, all algorithms showed a similar performance. Because all the algorithms had an equivalent performance, the chosen one needed to be defined at the forecasting stage; that is, the one that would generalize better to new (unseen) data.
Forecasting modelling
This stage implies studying the predicted results in new (unseen) data (i.e., data not used for model training). For this stage we used the validation and test datasets. We chose the model that yielded predictions closest to the observed results. In this line, we compared the mean difference between the observed and predicted mean salt intake results (i.e., observed – predicted) across all prediction algorithms.
We observed there was no unique algorithm that had the mean difference closest to zero in men and women at the same time (Supplementary table 1). The RF algorithm had the mean difference closest to zero in both sexes combined (mean difference = 0.0004), the MLP algorithm performed the best in men (mean difference = 0.0052), and in women the GBR algorithm showed the best results (mean difference = −0.0005).
To support our decision process, we plotted the mean differences in men and women for each survey (Supplementary figure 2); this figure only included the predictions based on the top five algorithms (RF, MLP, GBR, LiR and RiR). We counted how many times (i.e., number of surveys) each algorithm had the mean difference closest to zero.
Because the RF algorithm had the mean difference closest to zero in both sexes combined and it was among the top 5 algorithms in men and women (Supplementary table 1), we decided to choose the RF algorithm. Additionally, predictions based on the RF algorithm were the closest to zero across surveys (Supplementary figure 2). These analyses were performed in R (version 4.0.3).
Algorithm application
To make the predictions in the new 49 datasets without information about urine samples, we used the RF model (i.e., ML algorithm and predictors) developed following the methods above described (see Forecasting Modelling section). We re-trained the model with the full dataset used for model development and validation (i.e., train, validated and test dataset pooled), and then predicted the outcome (i.e., mean salt intake) in the 49 new datasets.
Footnotes
Conflict of interest: None to declare.
Contributions: All authors conceived the idea. MC-C conducted the analysis with support from WCG-V and RMC-L. WCG-V drafted the manuscript with support from MC-C and RMC-L. All authors approved the submitted version.
REFERENCES
Reference
- 29.