Abstract
We present our considerations for using multiple imputation to account for missing data in propensity score-weighted analysis with bootstrap percentile confidence interval. We outline the assumptions underlying each of the methods and discuss the methodological and practical implications of our choices and briefly point to alternatives. We made a number of choices a priori for example to use logistic regression-based propensity scores to produce “standardized mortality ratio”-weights and Substantive Model Compatible-Full Conditional Specification to multiply impute missing data (given no violation of underlying assumptions). We present a methodology to combine these methods by choosing the propensity score model based on covariate balance, using this model as the substantive model in the multiple imputation, producing and averaging the point estimates from each multiple imputed data set to give the estimate of association and computing the percentile confidence interval by bootstrapping. The described methodology is demanding in both work-load and in computational time, however, we do not consider the prior a draw-back: it makes some of the underlying assumptions explicit and the latter may be a nuisance that will diminish with faster computers and better implementations.
Introduction
In this paper we present the considerations behind estimating the change in prevalence of post-traumatic stress disorder (PTSD) associated with long-distance migration using multiple imputation to handle missing data, propensity score-weighting to adjust for confounding and bootstrap to produce a percentile confidence interval. We will focus on the many statistical methodological problems we encountered and refer the reader to the accompanying paper (1) for the subject matter problem. The relevant data consisted of a 20-items questionnaire and a clinical examination including assessment of possible psychiatric disorders, applied to a sample of Syrian asylum seekers in Denmark and a sample of Syrian refugees in Lebanon. The outcome, PTSD, was assessed using the “Harvard Trauma Questionnaire” part IV (2), giving a score from 1 to 4 with 2.5 being the commonly used cut-off-score for PTSD.
In a propensity score-weighted analysis you first estimate the propensity given a relevant set of predictors, Pr (E=1∣Z), for each individual in the study population, êi. The association between long-distance migration and PTSD was estimated as the prevalence among those who migrated to Denmark minus a weighted average of the prevalence of PTSD among refugees who migrated to Lebanon, using weights equal to êi / (1-êi). This requires a number of decisions including: Which covariates to include in the propensity score model? What complexity of the model to use? How to deal with extreme weights? And how to calculate the standard error of the parameter of interest? As we had missing data in the covariates and PTSD status, we set out to combine the propensity score-weighted analysis with multiple imputation. This raised additional questions such as: What are the required assumptions of the missing data process? What is the substantial model and what variables should be included in the model? How to combine the multiple imputations with the propensity score analysis? How to find a valid confidence interval for the parameter of interest? In the following sub-sections we outline the problems we had to consider and the underlying theory. In Methods we discuss our considerations on how to implement these in our specific study and in Results we provide details on our final implementation. The problems, theory, considerations and our decisions are summarized in Table 1, 2 and 3.
Consideration and decision for combining multiple imputation and propensity score-weighting and obtaining valid confidence interval
The propensity score analysis
Table 1 provides an overview of the considerations and decision for building the propensity score model. The relevant predictors to include in the propensity score model are covariates that (potentially) confound the relationship between the exposure and the outcome. The outcome itself and variables that are only associated with the exposure should not be included in the model (3). The complexity of the regression model should be examined so that balance is obtained for all covariates between exposure groups. Extreme weights may lead to suboptimal covariate balance and unstable estimates and are most often remedied by smoothing or truncation, at the cost of potentially introducing bias (4). The estimate of association When only considering the propensity score-weighted analysis, the confidence intervals can be produced by applying some approximate formula to obtain a standard error or via bootstrapping.
Missing data
The statistical properties of many missing data methods rely on the hypothesized missingness mechanism. The primary interest in applied epidemiology, is whether the missing data mechanism is ignorable, that is, if valid inference can be drawn despite of missing data. In many applied papers using multiple imputations (MI) the authors states that the data is “missing at random” (MAR) and “as a consequence” the inference based on MI is valid. We briefly consider the definition and importance of “missing data” drawing primarily on Seaman et al., (5). Very loosely speaking, data is MAR, if the risk of a data point being missing does not depend on the unobserved values, but only on the observed values. However, this is only a superficial definition. The terminology “missing at random” (MAR) and “missing completely at random” (MCAR, which imply MAR) has been in use at least since Rubin’s 1976 paper (6) and was recently extended to include “realized” and “everywhere” versions of both MAR and MCAR (5). In the latter paper the definition is based on parametric models for both the data, Z, (which include both outcome variable, Y, and covariates, X) and the missingness indicator vector, M, (which for each entry in z, specify if it is observed). Note, we do not observe the entire z, but only the entries, where the corresponding entry in m is 1 and we let o(z, m) denote the observed part of the data, z. Furthermore we let fθ (z) denote the density for the data and Prφ (m|z) the conditional probability of the missing pattern, m, given the data z, with the parameters (φ, θ) ∈Ω. In a specific study we have the realized data and missing indicator vector
with the realized observed data
.
Example 1. Consider a very small data set with four refugees and four variables: “year of residency”, “sex”, “host country” and “PTSD-status”. One realization could be:
With the realised observed data
Data is said to be realized-MAR if for all φ, for all z, where
that is, the probability of the realized missingness pattern
is the same for all data z that has an observed part that is identical to the realized observed data, i.e. the unobserved part is of no interest. In Example 1, the data is realized-MAR, if the conditional probability of data on “sex” for observation number 1 and data on “year of residency” for observation number 2 are missing and all other entries are observed, does not depend on the value of the missing sex and year of residency as long as all the observed entry is as realized. This is a statement only focusing on the realized missingness pattern and the realized observed data; we do not consider other possible missingness patterns or other possible realizations of the data. To emphasize: it is irrelevant whether for instance “sex” on observation number 2 or “country” on observation number 3 could be missing,
The data generating process is said to be everywhere-MAR if for all φ and all m, Prφ (m|z)=Prφ (m|z’) for all z and z’, where o(z, m)=o(z’, m). That is, data is every-where-MAR if it is realized-MAR for all possible realizations and not only for the actu ally observed realization of the missingness pattern and data. Returning to Example 1, when assuming everywhere-MAR the realized data set is irrelevant: We have to check the whole set of possibly missing data conditional probabilities, Prφ (m|z) for all param-eter values, φ.
The elaborations above was necessary to qualify the question of interest: Is the missingness mechanism ignorable? That is, when can we make valid inference about the parameter of interest, θ, only based on the observed data? Seaman et al. (5), illustrated that the answer depends on the type of statistical inference framework and in the “frequentist likelihood framework” you need the missingness mechanism to be everywhere-MAR (and the parameters (φ, θ) be variation independent, i.e Ω=Ωφ ×Ωθ). So in order to ig-nore the missingness mechanism we have to argue that it is reasonable to assume every-where-MAR. This implies that, for all possible missingness patterns and corresponding observed data, it is reasonable to assume that the risk of that specific pattern does not depend on the value of the missing data but only of the observed data. This is of course an impossible task without some insight into why data is missing in the study. One way to start off is to assume that the missing data mechanism is identical and work indepen-dently from person to person, which reduce the problem to a discussion of the mechanism for a single person.
For example, in Example 1, we have no missing in “PTSD” in the realized observed data, however, we can easily imagine this information missing in another realization of the study. If we assume identical and independent missing mechanism, we have to think of why “year of residency”, “sex”, and “PTSD” could be missing for a person and if the risk of this is independent of the unobserved values given what we have observed for that person. For example, if we only observe “country”, we have to argue, that the risk of this is the same for all individuals in each country, i.e. it does not depend on year, sex or whether or not the person has PTSD. We note that the assumption of independent missingness mechanism might easily be invalid, for example missingness could depend on some unobserved event common for several persons in the study. In the accompanying paper (1) we discuss the everywhere-MAR assumption in the specific study.
In the following, we will assume that the purpose of the data analysis is to estimate β, typically a vector of regression coefficients based on the proposed model for the analysis of interest—i.e. the substantive model—of Y given the covariates X: Pr (y∣x; β).
Multiple imputations
Table 2 gives an overview of the considerations and decision for using multiple imputation to deal with missing data. Many statistical methods assume no missing data or missingness mechanism MCAR and will produce biased estimates otherwise (7). A popular way to deal with missing data is to use multiple imputation which gives unbi-ased estimates assuming ignorable missingness mechanism and correctly specified multiple imputation model (8,9). Briefly, multiple imputation consists of producing a number, K, of data sets with imputed values for the missing data and analyze these complete data sets as planned, resulting in K estimates of β which are combined, typically by taking taking the average, into a final estimate for β. When implemented, the imputation is done for each variable with missing data (a) specifying a regression model for the conditional distribution of the variable given the other (relevant) variables (b) using the observed data to estimate the parameters in this model (c) impute the missing values of the variable by simulating from the Bayesian posterior predictive distribution. The last two steps will in general be taken care of by a software program, as long as the imputation regression models are chosen within the most common regression model families. Often one or several of the “predictor” variables in the imputation regression will have missing values too, resulting in a so-called “chained equation”, that is, the imputed values in one variable are needed to impute the values in another variable and vice versa. Luckily, many software packages can solve this problem using iterative methods. Thus, after deciding on what implementation of multiple imputation to use we are left with problem (a): How to specify the imputation regression models, i.e., what should be used as the substantive model in the multiple imputation, what variables to include in the multiple imputation models and how many iterations must be run between sampling? It has been known for a while that you can introduce bias in the estimation of β, if you do not take care in this specification (10). This can happen if the relationship between y and x in the substantive model is more complicated than the relationship between x and y in the implemented imputation regression models. For example, if you do not include y in the imputation regression model for the covariate xi then the imputed data for xi will be unrelated with y and, as a consequence, you will underestimate the regression coefficient βi relating y to xi in the substantive model. Furthermore, if xi and xj interact in the substantive model for y, then y and xj should (at least) interact in the imputation model for xi to avoid bias in the estimate of the magnitude of the interaction. It is difficult, even for relatively simple substantive models, to determine how to specify the imputation models in order to avoid this problem. Luckily there exist a statistical method that can combine a specification of the substantive regression model, y on x, with univariate regression models for each of the variables in x given the rest of the x’s, into an imputation algorithm (11). This Substantive Model Compatible-Full Conditional Specification (SMC-FCS) algorithm has been implemented in R and Stata for a set of standard regression models (12,13). As the SMC-FCS algorithm is an iterative algorithm, it will not generate independent samples. This implies that one cannot use subsequent samples but only use samples with a specific interval between them.
Bootstrapping
Table 3 gives an overview of the considerations and decision combining propensity score-weighting and multiple imputation and obtaining a valid confidence interval. Non-parametric bootstrapping is a method to find an approximate confidence interval for a parameter, when applying a specific estimation algorithm to a data set. In bootstrapping the only input is the data set and the estimation algorithm and no assumption is made concerning the distribution or the estimation algorithm. However, the realized sample is assumed to be independent and representative of the target population (14). In the simple bootstrap, the estimation algorithm is applied to the original data and to a number of bootstrap samples, i.e. artificial data sets with the same number of observations as the original, but with the observations being sampled randomly with replacement from the original data set. This results in the original estimate and a set of bootstrap estimates from which a 95% confidence interval can be produced as (a) the original estimate +/- 1.96 times the standard deviation of the bootstrap estimates or (b) the 2.5th and 97.5th percentile of the bootstrap estimates. The first strategy will typical require a relative small number of bootstrap samples, but rely on approximate normality of the estimates, while the second require a large number of bootstrap samples, but does not require any assumptions about the distribution of the estimates.
Methods
Based on the theoretical considerations above we outline our estimation algorithm. The analysis plan was defined a priori and included a number of decisions:
the exposure (long-distance migration), outcome (PTSD) and potential confounders (age, sex, socioeconomic status, experienced trauma and mental well-being) (see also (1))
addressing of confounding by logistic regression-based propensity score modeling and of missing data by multiple imputation
three propensity score models of increasing complexity were defined and three levels of weight truncation (no truncation, truncating at the 1st and 99th percentile, or truncating at the 5th and 95th percentile) were examined for covariate balance (15,16). Based on a single imputed data set for each of the three complexities of the propensity score model, the least complex model with the least amount of truncation to obtain acceptable balance, defined as the absolute standardized difference of ≤ 0.10 on all covariates (15) was chosen for the analysis. See supplementary materials and Figures in (1) for details on the specific models and the exploratory plots
given ignorable missingness mechanism, the missing data were multiple imputed using the SMC-FCS algorithm with the chosen propensity score model as the substantive model
for each of the multiply imputed data sets: the propensity scores were computed using the chosen propensity score model, converted into weights and the weighted point estimates produced
the mean of the point estimates from 5) was the estimate of interest
the 95 percentile confidence interval was produced by bootstrapping steps 4-6 a large number of times.
It should be noted that the existing implementation of the SMC-FCS algorithm does not cover our substantive model, the propensity score-weighted analysis, consequently, we decided to use the model for the propensity score as our substantive model. For each partially observed covariate we specified a “prediction model” meaning a regression model to predict the missing value of a partly observed covariate (the response in the regression model in question) given the PTSD score and any additional covariates as deemed relevant based on subject matter insight and exploratory plots. When entering as the response variable, all continuous partially observed covariates were modeled using linear regression with relevant transformation and all discrete covariates were modeled using logistic, multinomial or proportional odds regression. When entering as “predictor variables”, all continuous covariates were modeled as restricted cubic splines with knots at the 10th, 50th and 90th percentiles; all discrete covariates and interactions entered unaltered (see supplementary materials in (1)). The sampling interval between the imputations was decided based on plots of the parameter estimates against the sampling interval.
To combine propensity score-weighting and multiple imputation to produce the estimate of association we used the “within” procedure (17,18): a number of data sets were imputed, for each data set the prevalence difference of PTSD according to long-distance migration was estimated and averaged to give the point estimate (“impute, compute, combine”). The 95-percentile confidence interval was found by bootstrapping this procedure. The procedure is illustrated in Figure 1.
*The bootstrap is repeated multiple times, for example 1000, to be able to estimate the 95-percentile confidence interval.
All data management, analysis and plots were done in R (19) with heavy reliance on packages “smcfcs” (12) for SMC-FCS multiple imputation; “WeightIt” (20) and “cobalt” (21) for estimation of propensity score weights and assessment of covariate balance; “boot” (22) for parallelized bootstrapping; “furrr” (23) for further parallelizing procedures; and “tidyverse” packages (24) for data wrangling and plotting. The code was run on two Ubuntu systems (18.04.5 and 20.04.1) and a Windows 10 system; all running R 4.0.3. The analysis plan and all R code for analysis and plots, including the specific settings in each procedure are available from https://github.com/eiset/ARCH.
Results
The simple propensity score model (no interaction terms) with weight truncation at 1st and 99th percentile obtained acceptable balance on all covariates and was chosen as our model. Unfortunately, but not surprisingly, we had to modify our first choice of substantive model (i.e. the propensity score model used in the imputation) due to computational/numerical problems by collapsing two levels of one of the substantive model covariates and two levels of one of the auxiliary covariate (1).
This slightly modified propensity score model was the substantive model in the SMC-FCS multiple imputation and regression models were set up for all partially observed covariates: For example, for imputing the continuous covariate “Age”, the logarithmic transformation of Age, “log Age”, was modeled with covariates from the substantive model entering as predictor variables: “Socioeconomic status”, “PTSD” (as restricted cubic spline) and auxiliary regressors: “Highest education”, “Number of children”, “Systolic blood pressure” (as restricted cubic spline), and “Marital status”. The Age variable was then passively imputed from “log Age” by exponentiating. All partially observed auxiliary variables were also imputed. The “predictor matrix” in Supplemental Table 1 gives details on models for all partly observed variables.
We set the number of imputations to 10 which is well beyond what is often considered sufficient (25). The convergence plots showed that a sampling interval between imputations, i.e. iterations, of 20 was sufficient; to err on the safe side, we chose 40 iterations. Following recommendations of Carpenter and Bithell (14) we produced 999 bootstrap estimates to compute the 95-percentile confidence interval. For practical reasons, three different computers were used to run the final analysis. The time to run 250 bootstrap estimates was from two to 10 hours depending on the system.
The analysis showed an increased prevalence amounting to 8.76 percentage points (95-percentile confidence interval [-1.39; 18.62 percentage points] with little variation in the sensitivity analysis. We refer to the accompanying paper for discussions of the results (1).
Discussion
In this paper we describe the statistical methodological considerations for combining propensity score-weighting and multiple imputation of missing data. We discuss the assumptions behind both propensity score-weighted estimation and multiple imputation.
In our approach, the substantive model of interest and covariates to include in the propensity score model was explicit. It has been suggested that machine learning or “black box” algorithms may provide reasonable propensity score-weights (4,26,27), however, at the cost of control over the substantive model which is paramount in fulfilling one of the assumptions of multiple imputation: a correctly specified substantive model of interest. And as Bartlett et al. notes “We do not consider the requirement to specify a substantive model at the imputation stage to be a shortcoming…” (11). We truncated extreme weights as advocated by several (4,16,28), acknowledging that the decrease in variance comes at the cost of possibly introducing bias. Stabilized weights is another approach to decrease the variance but comes at a similar cost (29); a recent paper (28) found that when estimating the hazard rate by propensity score-weighted Cox regression the choice between ordinary propensity score-weighting (in this case using weights to produce the “average treatment effect”) or its stabilized version made no difference on the confidence interval coverage and that bootstrap gave the least biased variance estimates with best confidence interval coverage.
The SMC-FCS algorithm (11) allows defining the substantive model of interest and imputation models for each partially observed variable and takes care of combining these in the multiple univariate imputations. This may increase the possibility to define a correctly specified multiple imputation model. While model misspecification is considered the overarching source to bias in propensity score modeling (4,30) recent studies suggest that misspecification of the multiple imputation model may not be detrimental in obtaining valid percentile confidence interval when applying a methodology as proposed in this paper (31). We have based our propensity score model on the available evidence and subject matter knowledge, however, recognize the possibility of some remaining bias, for example from residual confounding and from the collapsing of two levels of one of the substantive model variables. Seaman and White (32) showed that the “within” procedure as proposed by Qu and Lipkovich (33) gives unbiased point estimates assuming ignorable missingness mechanism and that including a “missing-value indicator variable” in the data set may reduce bias when the missingness mechanism is not ignorable, however, increase bias when the it is ignorable. We subjected every variable to careful examination and are satisfied that the “everywhere-MAR” assumption is not violated, however, we acknowledge that this is subject to discussion and cannot be guaranteed. We used bootstrap to produce a 95-percentile confidence interval taking into account uncertainty introduced by modeling in both the propensity score and multiple imputation step. Alternatively, the “Rubin’s rule” are used in several studies and are the traditional choice when doing multiple imputation (without propensity score modeling). Qu and Lipkovich (33) noted that “Rubin’s rule” does not account for the uncertainty introduced in the propensity score estimation and, thus, is not valid in theory while others note that it may produce valid estimates in practice (32). There is no clear evidence on what step to bootstrap when combining propensity score-weighting and multiple imputation (34). In our approach, we bootstrapped the entire “within” procedure to produce a confidence interval that accounts for all uncertainty introduced by modeling. This procedure is similar to that applied to a simple simulated data set with ignorable missingness mechanism by Penning de Vries and Groenwold (18). Schomaker and Heumann (34) suggest that bootstrapping after multiply imputing the data sets may produce similar results at lower computational expense, however, a later study (31) found that this may increase bias compared with bootstrapping the entire procedure.
Our proposed methodology takes several hours to run on “standard” laptop computers and we experienced numerical problems with strata with relatively few observations. Going forward, we are eager to examine the sensitivity of our result to different methodologies for example using other g-methods such as g-computation or other multiple imputation methods such as machine learning algorithms. The produced point estimate and confidence interval could also be compared to alternative methods that lowers the computing time such as “Rubin’s rules” or the recently proposed “von Hippel” method for using bootstrap in multiple imputation (though does not include propensity score modeling) (35).
In this article we have striven to make clear the many choices that we had to go through to produce the estimate of interest. It is our hope that others can make use of our experience in planning their research, creating the analysis plan and running their analysis.
Footnotes
mfstat{at}mollerfryd.dk