Evaluating the comparability of osteoporosis treatments using propensity score and negative control outcome methods in UK and Denmark electronic health record databases ======================================================================================================================================================================== * Trishna Rathod-Mistry * Eng Hooi Tan * Victoria Y Strauss * James O’Kelly * Francesco Giorgianni * Richard Baxter * Vanessa C Brunetti * Alma Becic Pedersen * Vera Ehrenstein * Daniel Prieto-Alhambra ## Abstract Evidence on the comparative effectiveness of osteoporosis treatments is heterogeneous. This may be attributed to different populations and clinical practice, but also to differing methodologies ensuring comparability of treatment groups before treatment effect estimation and the amount of residual confounding by indication. This study assessed the comparability of denosumab vs oral bisphosphonate (OBP) groups using propensity score (PS) methods and negative control outcome (NCO) analysis. A total of 280,288 women aged ≥50 years initiating denosumab or OBP in 2011-2018 were included from the UK Clinical Practice Research Datalink (CPRD) and the Danish National Registries (DNR). Balance of observed covariates was assessed using absolute standardised mean difference (ASMD) before and after PS weighting, matching, and stratification, with ASMD >0.1 indicating imbalance. Residual confounding was assessed using NCOs with ≥100 events. Hazard ratio (HR) and 95% confidence interval (CI) between treatment and NCO was estimated using Cox models. Presence of residual confounding was evaluated with two approaches: (1) >5% of NCOs with 95% CI excluding 1, (2) >5% of NCOs with an upper CI <0.75 or lower CI >1.3. The number of imbalanced covariates before adjustment (CPRD 22/87; DNR 18/83) decreased, with 2-11% imbalance remaining after weighting, matching or stratification. Using approach 1, residual confounding was present for all PS methods in both databases (≥8% of NCOs). Using approach 2, residual confounding was present in CPRD with PS matching (5.3%) and stratification (6.4%), but not with weighting (4.3%). Within DNR, no NCOs had HR estimates with upper or lower CI limits beyond the specified bounds indicating residual confounding for any PS method. Achievement of covariate balance and determination of residual bias were dependent upon several factors including the population under study, PS method, prevalence of NCO, and the threshold indicating residual confounding. Key words * Osteoporosis * fracture * confounding * propensity score * negative control outcome ## Introduction Routinely collected data from clinical practice settings have been used to evaluate the real-world effectiveness of osteoporosis treatments in reducing risk of fracture amongst postmenopausal women. Choice of treatment is dependent on a range of factors including patient medical history such as previous fractures, falls and treatment, patient and clinician treatment preference, and effectiveness of treatment for specific fracture sites (1). According to the European guidance and clinical guidelines in the UK, first line treatment typically includes oral bisphosphonates (alendronate, risedronate, ibandronate) and second line includes denosumab if bisphosphonates are not suitable or tolerated (1–4). Real-world evidence on the effectiveness of denosumab vs. oral bisphosphonates on fracture risk is inconsistent (5). Studies have shown denosumab was equally effective as oral bisphosphonates in reducing the risk of non-vertebral (6) and hip (7) fractures using US claims and Danish registry data, respectively. However, an analysis of Spanish pharmacy data showed that denosumab reduced the risk of hip and any type of fracture more than oral bisphosphonates (8). Although these differing results may reflect genuine differences in effectiveness due to study populations, settings, treatment guidelines, and comparator groups, confounding by indication may also explain these results, due to the inherent differences between patients using first and second-line treatments. For instance, patient characteristics are likely to differ between treatment groups, which may affect fracture prognosis. To account for measured confounding, propensity score (PS) methodology can be used to create balanced treatment groups with respect to measured covariates, such as age and history of fracture, by using the PS in matching, stratification, or inverse probability treatment weighting (IPTW) (9, 10). However, information on important confounders such as bone mineral density, may be missing in administrative databases, and thus cannot be adjusted for using PS methods, which can lead to unmeasured confounding. Negative control outcomes (NCOs) can be used to minimize residual confounding by unmeasured covariates. NCOs are outcomes known not to be causally associated with treatment. NCO methods have been developed to detect the presence of unmeasured (residual) confounding, when an association is found between the treatment and NCO (11, 12). However, there are currently no gold standards or guidelines on the threshold for comparability between treatment groups using NCO. Previous studies have defined bias as non-null effect of an NCO using risk difference and confidence interval (CI) (13, 14), as well as more than 5% of a large set of NCOs having 95% CI excluding 1 (15). This study aims to determine whether cohorts who received denosumab or oral bisphosphonates were comparable using PS matching, stratification and IPTW, and by applying different rules for presence of residual confounding via NCO analysis, in two European databases. The objectives were to: 1. Describe osteoporosis treatment groups, denosumab vs oral bisphosphonates, with respect to demographics, clinical history, and prior medication use 2. Assess whether treatment groups are comparable on measured covariates after PS matching, stratification and IPTW. 3. Detect the presence of residual confounding using different stringency rules via NCO analysis within each PS method. 4. Assess whether comparability is achieved in subgroups of older patients, post fracture patients, and patients with potentially three years of follow-up, plus a post-hoc analysis of second line users. ## Methods ### Study design and setting A retrospective, new user and new switcher, active comparator cohort study was implemented (16). We used data from two European countries, including the UK Clinical Practice Research Datalink (CPRD) GOLD (17) and AURUM (18), which are primary care databases of medical records from general practitioners. CPRD was linked to the following databases: Hospital Episode Statistics Admitted Patient Care, Office for National Statistics mortality data, and the Index of Multiple Deprivation (IMD). Practices that appeared in both the GOLD and AURUM databases were retained in AURUM and excluded from GOLD. We also used data from the Danish National Registries (DNR), which contains linked data from the Danish Civil Registration System (19), the Danish National Patient Registry (20), and the Danish National Prescription Registry (21). DNR contains dispensations in outpatient pharmacies, inpatient and outpatient hospital clinics encounters, and complete follow-up until death or emigration, set within the universal Danish healthcare system (22). ### Participants We included women aged ≥50 years at the time of initiating denosumab or oral bisphosphonates between 01 January 2011 to 31 December 2018 for UK patients and 01 January 2011 to 31 December 2017 for Danish patients. For the new user cohort, the date of initiating denosumab or bisphosphonates was defined as the index date (Figure 1). We included patients who had no history of these treatments in the year before the index date prior to treatment initiation and were registered with their practice for at least one year before index. ![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/10/16/2023.10.02.23296212/F1.medium.gif) [Figure 1:](http://medrxiv.org/content/early/2023/10/16/2023.10.02.23296212/F1) Figure 1: New user and new switcher study designs To reflect the treatment guidelines in place in the UK and Denmark, where oral BPs are recommended as first-line treatments, and denosumab is recommended as second-line treatment, we assembled a cohort of recent treatment switchers in post-hoc analyses. This approach allowed us to compare patients at a similar stage of treatment. The treatment groups were re-defined as follows, (1) denosumab users included patients who switched from bisphosphonates to denosumab with no previous use of denosumab in the one year prior, and (2) bisphosphonate users included patients who switched from one oral bisphosphonate to a new oral bisphosphonate and had no previous prescription for the new bisphosphonate in the one year prior (Figure 1). The index date was defined as the date of the initiation of the new treatment. Patients were excluded if they had any of the following diagnoses in the five years prior to the index date, or treatments in the year before the index date: Paget’s disease of bone, cancer (except for non-melanoma skin cancer) or its associated treatments (hormonal, endocrine, or radiation therapies), end stage renal disease, or prescription for both denosumab and oral bisphosphonates on the index date. Patients were followed from the index date until the earliest censoring event occurrence: NCO (defined below), death, moved practice, end of study period (31st December 2019 for CPRD and 31st December 2018 for DNR), drug discontinuation or treatment group switch, diagnosis of cancer (except for non-melanoma skin cancer) or its associated treatments (hormonal, endocrine, or radiation therapies), end stage renal disease, or a maximum three-year follow-up. ### Treatment groups The treatment groups were patients initiating (new user) or switching to (switcher) either (1) oral bisphosphonates (alendronate 10mg/day or 70mg/week, ibandronate 150mg/month, risedronate 5mg/day or 35mg/week) or (2) denosumab (60mg/6 months). We used an as-treated exposure definition, where patients were required to be on the initiated or new treatment throughout follow-up. Successive prescriptions were considered as a continuous treatment episode if there was no more than 90 days between prescriptions. Discontinuation was defined as a gap of more than 90 days between prescriptions; date of discontinuation was therefore defined as the end date of drug duration of the last prescription plus an additional 90 days (23). For bisphosphonates, duration of prescription was calculated as the ratio of the number of tablets and daily dose; if duration was missing, the default of 30 days was used. For denosumab, duration was defined as a default of 180 days in line with the dosing interval indicated for osteoporosis. The end date of prescription was defined as the date of the prescription plus duration of prescription. ### Propensity score analysis and covariates PS expressed the probability of being assigned denosumab rather than an oral bisphosphonate conditional on measured covariates. Logistic regression estimated the PS for each patient using over 80 covariates that are known to be associated with risk of fractures or falls (Table S1) (13). Covariates with more than 10 patients in each treatment group were included in the logistic regression model. The PS was then used in three ways to create comparable treatment groups. **In PS matching**, each denosumab user was matched with up to five oral bisphosphonate users that had the closest PS within a calliper width of 0.2 of the pooled standard deviation of the logit of the PS (24, 25). In addition, exact matching was performed on year of index date as it was deemed to be an empiric confounder. **In PS stratification**, all patients in the study population were divided into ten mutually exclusive strata based on the ranked PS distribution of the denosumab group. Within each stratum, the distribution of covariates between treatment groups should be similar. **In IPTW**, each patient was assigned a weight equal to the inverse of their propensity score (1/PS), to create a pseudo-population with a balanced covariate distribution (26). Stabilised weights were calculated by multiplying the weights by the proportion of patients on each treatment to reduce large weights (27). To further minimise large weights, weights were then truncated at 99% percentile of distribution of the stabilised weights (26). Only the IPTW was used in the new switcher cohort. Balance of each covariate between treatment groups was assessed using the absolute standardised mean difference (ASMD) before and after matching, stratification and IPTW. An ASMD less than 0.1 indicates the treatment groups were balanced for that covariate. ### Negative control outcome analysis NCOs are defined as outcomes not causally associated with the treatment of interest, except through shared confounders (11, 12). If the estimated association between treatment and NCO is non-null, one can determine whether there is evidence of residual confounding suggesting treatment groups are not comparable with respect to unmeasured covariates. A preliminary and non-exhaustive list of NCOs were identified, including fracture and non-fracture NCOs. Based on the FREEDOM trial (28), there was no effect of treatment on risk of fracture in the first three months; therefore, a fracture (hip, vertebral, radius, ulna, wrist, humerus, pelvis, or shoulder) occurring in the first three months of treatment was considered a NCO (29). In addition, NCOs were also sourced from previous studies (13, 30, 31), and from an automated method identifying potential NCOs (32). The maximum follow up for fracture NCOs was three months. The preliminary list of NCOs is listed in Table S1. Analysis for a specific NCO was conducted if there were ≥100 events in total. The association between treatment and NCO was estimated using the Cox Proportional Hazards model with robust standard errors, adjusted for year of index and imbalanced covariates with ASMD ≥0.1 for each PS method (33). The estimated hazard ratio (HR) and 95% CI was used to assess for the presence of residual confounding based on two approaches: **Approach 1:** There were more than 5% of NCOs having a 95% CI excluding 1; the denominator was the number of NCOs with ≥100 events and the numerator was the number NCOs with a CI excluding the null value of one. **Approach 2:** There are more than 5% of NCOs having an upper CI of HR <0.75 or lower CI >1.30; the denominator was the number of NCOs with ≥100 events and the numerator was the number of NCOs having CI above or below the specified cut-off values. Treatment groups were deemed comparable if residual confounding was absent in both Approach 1 and Approach 2. ### Subgroup analysis Comparability was further assessed in three subgroups within the PS method that had achieved comparability. The three subgroups included older patients aged ≥65 years, post-fracture patients, and patients with potentially three years of follow-up. The PS was re-estimated for the patients eligible for analysis. ### Results Analysis was performed separately for CPRD (combining the GOLD and AURUM databases) and DNR. ### CPRD #### New user cohort Of the 200,179 eligible patients, 6,528 were denosumab users and 194,191 were bisphosphonate users (Table 1). Most demographics, comorbidities, and prescriptions were balanced between denosumab and bisphosphonate users; however, there were 22 (25.3%) covariates whose distributions were imbalanced (Table 2). Denosumab users were observed to be older, had a higher number of GP visits, hospital admissions, fractures, longer duration of cumulative oral bisphosphate use, higher prevalence of calcium or vitamin D, proton pump inhibitor, lower prescriptions for non-steroidal anti-inflammatory drugs and corticosteroids than bisphosphonate users. View this table: [Table 1:](http://medrxiv.org/content/early/2023/10/16/2023.10.02.23296212/T1) Table 1: Baseline patient characteristics of new users of denosumab and oral bisphosphonates View this table: [Table 2:](http://medrxiv.org/content/early/2023/10/16/2023.10.02.23296212/T2) Table 2: Number of imbalanced covariates in new user cohorts Eighty-seven covariates were used to estimate the PS. The median (interquartile range (IQR)) PS amongst denosumab and bisphosphonate users were 0.123 (0.048, 0.303) and 0.008 (0.004, 0.023), respectively. Figure S1 illustrates the PS distribution is right skewed in both treatment groups however, there were fewer bisphosphonate users with PS above 0.4. ##### PS matching The PS matching algorithm selected 5,837 denosumab patients and 22,393 bisphosphonate patients thus excluding 691 (11%) denosumab users and 171,798 (88%) bisphosphonate users. After matching, the PS distribution was identical in the two treatment groups (Figure S1). Three (3.4%) covariates remained imbalanced: cumulative oral bisphosphonate use, vitamin D deficiency, and factor Xa inhibitor prescription (Figure S2). In NCO analysis, 19 NCOs had at least 100 events. The effect of treatment on each NCO is shown in Figure S3. For Approach 1, four (21.1%) of the NCOs that were analysed had 95% CI excluding one (bowel incontinence, delirium, early fracture, and ingrown toenail). For approach 2, only one (5.3%) NCO, ingrown toenail (HR 2.18, 95% CI: 1.49, 3.21), had its 95% CI excluding one and its lower CI bound >1.30. ##### PS stratification The PS distribution amongst denosumab and bisphosphonate users were similar in the first seven strata. In the 8th, 9th and 10th strata, the median and IQR was slightly larger for denosumab users than bisphosphonate users. Overall, within stratum there was good PS overlap between the treatment groups (Figure S4). On average, three covariates had ASMD >0.1: cumulative oral bisphosphonate use, recency of fracture, and strontium prescription (Figure S5). There were 47 NCOs with at least 100 events and the effect of treatment on NCO is shown in Figure S6. Approach 1 had eight (17.0%) NCOs (atelectasis, bowel incontinence, delirium, early fracture, ingrown toenail, ankle sprain, strabismus, total hip arthroplasty due to osteoarthritis) with 95% CI excluding one. Approach 2 had three (6.4%) NCOs with 95% CI upper bound <0.75 (ankle sprain) or lower bound >1.30 (ingrown toenail and strabismus). ##### IPTW The median (IQR) weights were 0.48 (0.11, 0.68) and 0.98 (0.97, 0.99) amongst the denosumab and bisphosphonate users, respectively. In the pseudo population, 10 covariates (11.5%) had ASMD >0.1: age, calcium/vitamin D prescription, cumulative oral bisphosphonate use, number of fractures, non-steroidal anti-inflammatory drug prescription, number of different drugs, proton pump inhibitor prescription, recency of fracture, region, and strontium prescription (Figure 2a). ![Figure 2a:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/10/16/2023.10.02.23296212/F2.medium.gif) [Figure 2a:](http://medrxiv.org/content/early/2023/10/16/2023.10.02.23296212/F2) Figure 2a: Covariate balance with propensity score weighting in new users (CPRD) There were 47 NCOs with at least 100 events. In Approach 1, nine (19.1%) NCOs had its 95% CI excluding one (accident, anorectal disorder, delusional disorder, foreign body in ear, hypomagnesemia, ingrown nail, iron deficiency, nasal congestion, and schizophrenia). In Approach 2, two (4.3%) NCOs had its CI upper bound <0.75 (accident (0.03 (<0.01, 0.19)) and foreign body in ear (0.11 (0.02, 0.58)) (Figure 3a). #### New switcher cohort In CPRD, the new switcher study design analysed a smaller number of patients, 2,792 new denosumab switchers, and 14,668 new bisphosphonate switchers. In the pseudo-population, a smaller number of imbalanced covariates (2.4%, 2/82), as compared to the new user population, were observed which included previous fracture and cumulative oral bisphosphonate use (Figure S13). Thirteen NCOs had at least 100 events, of which none met the criteria for residual confounding using Approach 2 (Figure S14). ### DNR #### New user cohort DNR was a smaller database containing 79,569 eligible patients of which there were 4,276 denosumab users and 75,293 bisphosphonate users (Table 1). Most covariates were balanced between the two treatment groups; however, 18 (21.7%) covariates were imbalanced (Table 2), a smaller percentage compared to CPRD. Imbalanced patient characteristics were similar to CPRD, except higher prescription of antiparathyroid in denosumab users in DNR. (Table 1). Eighty-three covariates were used to estimate the PS. The median (IQR) PS amongst denosumab and bisphosphonate users was 0.079 (0.044, 0.211) and 0.033 (0.023, 0.048). Figure S7 shows the PS distribution within treatment groups were right skewed however the distributions overlap each other. There were fewer bisphosphonate users with PS greater than 0.6. ##### PS matching 4,187 denosumab users were matched to 16,546 bisphosphonate users, excluding 89 (2%) denosumab and 58,747 (78%) bisphosphonate users. Four covariates (4.2%) remained imbalanced: number of outpatient visits, cumulative oral bisphosphonate use, number of prescriptions, and proton pump inhibitor use (Figure S8). Twelve NCOs had at least 100 events with the effect of treatment on NCO shown in Figure S9. In Approach 1, only one (8.3%) NCO early fracture had its 95% CI excluding one. No NCO had met the criteria of Approach 2. ##### PS stratification The PS distribution was similar between treatment groups in the first seven stratum. In the 8th, 9th and 10th strata, the median and IQR was slightly larger for denosumab users than bisphosphonate users. Overall, there is good overlap of PS distributions between the treatment groups (Figure S10). Nine covariates (10.8%) were imbalanced (Figure S11): age, renal disease, Charlson comorbidity score, cohabitation status, number of GP visits, cumulative bisphosphonate use, number of prescriptions, region, and proton pump inhibitor use. Twenty-six NCOs had at least 100 events and the effect of treatment on NCO is shown in Figure S12. In Approach 1, one (3.8%) NCO early fracture had its 95% CI excluding one. In Approach 2, no NCO had met the criteria. ##### IPTW The median (IQR) of weights was 0.68 (0.25, 1.23) and 0.98 (0.97, 0.99) in denosumab and bisphosphonate users respectively. In the pseudo-population, two (2.4%) covariates, age, and cumulative oral bisphosphonate use, were imbalanced (Figure 2b). ![Figure 2b:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/10/16/2023.10.02.23296212/F3.medium.gif) [Figure 2b:](http://medrxiv.org/content/early/2023/10/16/2023.10.02.23296212/F3) Figure 2b: Covariate balance with propensity score weighting in new users (DNR) Twenty-six NCOs had at least 100 events and the effect of treatment on NCO is shown in Figure 3b. In Approach 1, five (19.2%) NCOs had its 95% CIs excluding 1: anorectal disorder, eye injury, haematochezia, incomplete emptying of bladder, and total hip arthroplasty due to osteoarthritis. In Approach 2, no NCO had its 95% CI bound exceeding the indicated threshold of residual confounding. ![Figure 3a:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/10/16/2023.10.02.23296212/F4.medium.gif) [Figure 3a:](http://medrxiv.org/content/early/2023/10/16/2023.10.02.23296212/F4) Figure 3a: Negative control outcome estimates after PS weighting in new users (CPRD) ![Figure 3b:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/10/16/2023.10.02.23296212/F5.medium.gif) [Figure 3b:](http://medrxiv.org/content/early/2023/10/16/2023.10.02.23296212/F5) Figure 3b: Negative control outcome estimates after PS weighting in new users (DNR) #### New switcher cohort In the new switcher cohorts, 3,726 new denosumab switchers and 2,525 new bisphosphonate switchers were eligible for the new switcher design. In the pseudo-population, a larger number of imbalanced covariates (5.1%, 4/78), as compared to the new user population, were observed which included previous fracture, number of fractures, number of fracture types, and recency of fracture (Figure S15). Comparability could not be assessed as there were no NCOs with at least 100 events. #### Comparability assessment After adjustment using PS methods, comparability of treatment groups based on meeting criteria for both Approach 1 and 2 was not achieved, with the exception of PS stratification in DNR (Table 3). Approach 1 was found to be more stringent than Approach 2 in both CPRD and DNR databases. Based on Approach 2, comparability was achieved for IPTW in CPRD, and all three PS methods in DNR. IPTW was considered preferable as it passed the criteria set in Approach 2 for both databases. The results for the subgroup analysis were similar (Tables S2 and S3). After IPTW, the treatment groups were comparable in the new switcher cohorts in CPRD. View this table: [Table 3:](http://medrxiv.org/content/early/2023/10/16/2023.10.02.23296212/T3) Table 3: Comparability threshold using negative control outcome estimates in new user cohorts ## Discussion ### Key results This study had aimed to determine whether cohorts of patients selected for pharmacological treatment for osteoporosis with denosumab or oral bisphosphonates are comparable with respect to measured and unmeasured confounding. In the UK and Denmark, denosumab is prescribed as second line treatment if an oral bisphosphonate is not suitable. As expected, baseline differences between the two treatment groups were observed with denosumab users being older, previously on preventative fracture treatment, and generally in worse health than bisphosphonate users. PS methods were used to ensure treatment groups were balanced for a wide range of covariates. Each PS method performed reasonably well in creating balanced groups, although the degree of performance varied by country. IPTW performed the least well in CPRD but performed the best in DNR, whereas stratification performed the best in CPRD but the worst in DNR; PS matching performed reasonably well in both databases, although it reduced sample size. Evidence of residual confounding was based on satisfying both rules: more than 5% of NCOs had its 95% CI excluding the null value of one (Approach 1) or the CI was wholly outside the range of 0.75 to 1.30 (Approach 2). Based on both approaches, in CPRD, all PS methods showed evidence of residual confounding suggesting the new user treatment groups were not comparable. In DNR, PS stratification but not PS matching or IPTW showed treatment groups were comparable. Using Approach 1, large sample sizes may lead to narrow 95% CI and limit the possibility of crossing the null, even when effect sizes are clinically insignificant. Therefore, evidence of residual confounding based on Approach 2 alone may be considered for future analyses of comparative effectiveness. IPTW was chosen to proceed with further as this satisfied the threshold in Approach 2 for both data sets. The new switcher analysis in DNR was not possible due to a small number of patients; in CPRD, the treatment groups were shown to be comparable. ### Comparison with previous studies PS and NCO methods are popular approaches to account for confounding; however, there is no consensus on which threshold to use to determine whether treatment groups are comparable. In our study, the degree to which PS methods ensured treatment groups were comparable varied by method and country, with other studies also observing similar findings. Choice of PS cannot be generalised to all databases as it is highly dependent on the research objective, achieving covariate balance, and the type and size of treatment effect estimate of interest (marginal, conditional, average treatment effect, or average treatment effect of the treated) (34–36). Studies have approached the issue of residual confounding in different ways. McGrath (13) had assessed the comparability of newly initiating denosumab vs. oral bisphosphonates after using IPTW and 12 pre-specified NCOs using US claims data, with residual confounding being evident if a meaningful, non-null effect was observed. That study had found comparability was not achieved as associations were found for two NCOs thus comparative effectiveness estimation could not proceed. In another US study, Kim et al (29) evaluated three early fracture NCOs and four non-fracture NCOs; and assessed residual confounding using relative (risk ratio <0.85 or >1.15) and absolute (risk difference > 0.01) measures. In other populations, Levintow (14) had evaluated the comparability of lipid-lowering drugs using 10 pre-defined NCOs, assuming comparability was achieved if the risk difference was close to the null effect of zero (although a range was not specified) and the 95% CI contained zero for all NCOs. Other studies had instead focused on identifying a large set of NCOs in order to calibrate treatment effect estimates for residual confounding (15, 30, 37, 38) with Hripcsak (15) specifying that 95% of NCOs were expected to have the null effect contained within its 95% CI. Most commonly, studies choose one (or few) NCO(s) that assume to have the same confounding structure between treatment and primary outcome, and if a non-null effect is observed then residual confounding is present (11). In contrast, and a strength of our study, we had used a combination of methods. Firstly, early fracture was selected a priori as an NCO assuming it shared the same set of confounders as treatment and primary outcome of fragility fracture. Secondly, an automated method identified over 100 potential NCOs that were unlikely to be associated with treatment; however, the assumption of sharing the same set of confounders is unlikely to be met; use of a large set of NCOs with differing confounding structures may mitigate that assumption. Thirdly, the optimal number of NCOs may be dependent on whether one wants to simply detect the presence of residual confounding, or to use NCOs to calibrate treatment effect estimates for residual confounding which would require at least 30 NCOs (39). The use of a large number of NCOs in our analysis allows for this possibility. ### Strengths and limitations Our study was performed in population-based databases in the UK and Denmark. There was some similarities and differences in the achievement of comparability between databases. Although new-user analyses are generally preferred for comparative studies, the predominant use of denosumab as second-line therapy in both the UK and Denmark may have contributed to difficulties in achieving comparability and to inclusion of patients not fully representative of clinical practice. We therefore conducted a new-switcher analysis to account for prior use of oral bisphosphonates in the denosumab group. Correspondingly, comparability appeared easier to achieve in the CPRD new switcher analysis than the new-user analysis. We assessed whether residual confounding was evident based on two approaches with differing levels of strictness. Choice of threshold carries risks for decision-making on when to proceed with comparative assessments: a threshold that is too stringent risks excluding the possibility of conducting analyses on sufficiently comparable cohorts; a threshold that is not stringent enough may lead to comparisons on non-comparable cohorts. Although our study had identified over 100 potential NCOs, under half were used in analysis as some NCOs were rare (less than 100 events). The decision to exclude rare outcomes was justified as they were unlikely to have adequate power to detect a non-null effect contributing misleading evidence of no residual confounding. However, imposing a minimum number of events had led to some analyses with a small number of assessable NCOs, resulting in only one NCO comparison needing to show a significant difference to exceed the 5% threshold of residual confounding. Decisions made using IPTW may have affected the way standard errors and in turn CIs were calculated. Firstly, robust standard errors were used to ensure NCO estimates were not biased due to either a potential misclassification of PS estimation or the Cox model (40); this approach is known to over-estimate standard errors (41). Secondly, use of truncated weights would have led to more stable weights thus leading to smaller standard errors. Lastly, adjusting for imbalanced covariates may have led to larger standard errors (34). It is unclear what the overall impact was on standard errors, whether they were over- or under-estimated and the impact this would have had when using CIs to assess whether residual confounding existed. Despite these limitations, a key strength is that all three PS methods were considered thus offering the flexibility that any one method could be used for further comparative effectiveness analysis. ### Conclusions Confounding by indication will always be observed in routinely collected medical record data and PS and NCO methods are important tools to account for such confounding. We have shown that assessment of comparability varies depending on the method of PS adjustment and definitions of residual bias. The extent to which residual confounding can be identified is unknown, and studies should consider more than one PS method to test robustness and identify the largest number of NCOs to give the greatest flexibility in detecting residual confounding. Further research is required to determine the optimal threshold to identify residual confounding. ### Ethical approval Access to CPRD data was approved (Protocol #20_000206) according to CPRD’s research data governance framework. The study was reported to the Danish Data Protection Agency through registration at Aarhus University (record: AU-2016-051-000001, serial number 880). An informed consent or ethical approval is not required for studies based solely on existing registry data. ## Supporting information NCO methods paper supplementary [[supplements/296212_file03.docx]](pending:yes) ## Disclosure DPA’s department at Oxford University has received grant/s from Amgen, Chiesi-Taylor, Lilly, Janssen, Novartis, and UCB Biopharma. His research group has received consultancy fees from Astra Zeneca and UCB Biopharma. Amgen, Astellas, Janssen, Synapse Management Partners and UCB Biopharma have funded or supported training programmes organised by DPA’s department. JOK, FG, RB, VCB are employees and own equity in Amgen. EHT, TRM, VYS, VE, ABP have no conflicts of interest to declare. ## Data availability statement Qualified researchers may request data from the deidentified and aggregated results of this study from the corresponding author. The data are not publicly available due to privacy or ethical restrictions. ## Acknowledgments This study was funded by Amgen Inc. The funding source was involved in the study protocol and manuscript review. This study is based in part on data from the Clinical Practice Research Datalink obtained under license from the UK Medicines and Healthcare products Regulatory Agency. The data is provided by patients and collected by the NHS as part of their care and support. The interpretation and conclusions contained in this study are those of the authors alone. We thank Uffe Heide-Jørgensen from Aarhus University for his assistance in data management and statistical analysis of the Danish National Registries. All authors approved the final version of the manuscript and agree to be accountable for the work. ## Footnotes * * Joint first author * Distribution/Reuse Options changed from attribution CC-BY to to No reuse/adaptation without permission. * Received October 2, 2023. * Revision received October 16, 2023. * Accepted October 16, 2023. * © 2023, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. 1.Gregson CL, Armstrong DJ, Bowden J, Cooper C, Edwards J, Gittoes NJL, et al. UK clinical guideline for the prevention and treatment of osteoporosis. Arch Osteoporos. 2022;17(1):58. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s11657-022-01061-5&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 2. 2.Scotland G, Waugh N, Royle P, McNamee P, Henderson R, Hollick R. Denosumab for the prevention of osteoporotic fractures in post-menopausal women: a NICE single technology appraisal. Pharmacoeconomics. 2011;29(11):951–61. 3. 3.National Institute for Health and Care Excellence. Bisphosphonates for treating osteoporosis 2017 [updated 08 July 2019. Available from: [https://www.nice.org.uk/guidance/ta464](https://www.nice.org.uk/guidance/ta464). 4. 4.Kanis JA, Cooper C, Rizzoli R, Reginster JY. European guidance for the diagnosis and management of osteoporosis in postmenopausal women. Osteoporos Int. 2019;30(1):3–44. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s00198-018-4704-5&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30324412&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 5. 5.Curtis JR, Arora T, Liu Y, Lin T-C, Spangler L, Brunetti VC, et al. Comparative Effectiveness Of Denosumab Versus Alendronate Among Postmenopausal Women With Osteoporosis In The U.S. Medicare Program. World Congress on Osteoporosis, Osteoarthritis and Musculoskeletal Diseases; Barcelona, Spain 2023. p. 64–5. 6. 6.Choi N-K, Solomon DH, Tsacogianis TN, Landon JE, Song HJ, Kim SC. Comparative Safety and Effectiveness of Denosumab Versus Zoledronic Acid in Patients With Osteoporosis: A Cohort Study. Journal of Bone and Mineral Research. 2017;32(3):611–7. 7. 7.Pedersen AB, Heide-Jørgensen U, Sørensen HT, Prieto-Alhambra D, Ehrenstein V. Comparison of Risk of Osteoporotic Fracture in Denosumab vs Alendronate Treatment Within 3 Years of Initiation. JAMA Netw Open. 2019;2(4):e192416. 8. 8.Khalid SA, M.; Judge, A.; Arden, N.; van Staa, T.; Cooper, C.; Javaid, M.; Prieto-Alhambra, D. Reduction in fracture rates with Denosumab compared to Alendronate in treatment naïve patients: a propensity-matched ‘real world’ cohort and instrumental variable analysis. World Congress on Osteoporosis, Osteoarthritis and Musculoskeletal Diseases (WCO-IOF-ESCEO 2017)2017. 9. 9.Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biomet/70.1.41&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1983QH66900005&link_type=ISI) 10. 10.Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behav Res. 2011;46(3):399–424. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/00273171.2011.568786&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21818162&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000291533400002&link_type=ISI) 11. 11.Lipsitch M, Tchetgen Tchetgen E, Cohen T. Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology. 2010;21(3):383–8. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1097/EDE.0b013e3181d61eeb&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=20335814&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000277071500016&link_type=ISI) 12. 12.Arnold BF, Ercumen A. Negative Control Outcomes: A Tool to Detect Bias in Randomized Trials. Jama. 2016;316(24):2597–8. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2016.17700&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28027378&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 13. 13.McGrath LJ, Spangler L, Curtis JR, Ehrenstein V, Sørensen HT, Saul B, et al. Using negative control outcomes to assess the comparability of treatment groups among women with osteoporosis in the United States. Pharmacoepidemiology and Drug Safety. 2020;29(8):854–63. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/pds.5037&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32537883&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 14. 14.Levintow SN, Orroth KK, Breskin A, Park AS, Flores-Arredondo JH, Dluzniewski P, et al. Use of negative control outcomes to assess the comparability of patients initiating lipid-lowering therapies. Pharmacoepidemiol Drug Saf. 2022;31(4):383–92. 15. 15.Hripcsak G, Suchard MA, Shea S, Chen R, You SC, Pratt N, et al. Comparison of Cardiovascular and Safety Outcomes of Chlorthalidone vs Hydrochlorothiazide to Treat Hypertension. JAMA Intern Med. 2020;180(4):542–51. 16. 16.Lund JL, Richardson DB, Stürmer T. The active comparator, new user study design in pharmacoepidemiology: historical foundations and contemporary application. Curr Epidemiol Rep. 2015;2(4):221–8. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s40471-015-0053-5&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26954351&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 17. 17.Herrett E, Gallagher AM, Bhaskaran K, Forbes H, Mathur R, van Staa T, et al. Data Resource Profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol. 2015;44(3):827–36. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ije/dyv098&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26050254&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 18. 18.Wolf A, Dedman D, Campbell J, Booth H, Lunn D, Chapman J, et al. Data resource profile: Clinical Practice Research Datalink (CPRD) Aurum. International Journal of Epidemiology. 2019;48(6):1740-g. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ije/dyz034&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 19. 19.Schmidt M, Pedersen L, Sørensen HT. The Danish Civil Registration System as a tool in epidemiology. Eur J Epidemiol. 2014;29(8):541–9. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s10654-014-9930-3&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24965263&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 20. 20.Schmidt M, Schmidt SA, Sandegaard JL, Ehrenstein V, Pedersen L, Sørensen HT. The Danish National Patient Registry: a review of content, data quality, and research potential. Clin Epidemiol. 2015;7:449–90. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2147/CLEP.S91125&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26604824&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 21. 21.Pottegård A, Schmidt SAJ, Wallach-Kildemoes H, Sørensen HT, Hallas J, Schmidt M. Data Resource Profile: The Danish National Prescription Registry. Int J Epidemiol. 2017;46(3):798-f. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ije/dyw213&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27789670&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 22. 22.Schmidt M, Schmidt SAJ, Adelborg K, Sundbøll J, Laugesen K, Ehrenstein V, et al. The Danish health care system and epidemiological research: from health care contacts to database records. Clin Epidemiol. 2019;11:563–91. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2147/CLEP.S179083&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31372058&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 23. 23.Morley J, Moayyeri A, Ali L, Taylor A, Feudjo-Tepie M, Hamilton L, et al. Persistence and compliance with osteoporosis therapies among postmenopausal women in the UK Clinical Practice Research Datalink. Osteoporos Int. 2020;31(3):533–45. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s00198-019-05228-8&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31758206&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 24. 24.Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat. 2011;10(2):150–61. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/pst.433&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=20925139&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 25. 25.Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med. 2014;33(6):1057–69. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/sim.6004&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24123228&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 26. 26.Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med. 2015;34(28):3661–79. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/sim.6607&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26238958&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 27. 27.Xu S, Ross C, Raebel MA, Shetterly S, Blanchette C, Smith D. Use of Stabilized Inverse Propensity Scores as Weights to Directly Estimate Relative Risk and Its Confidence Intervals. Value in Health. 2010;13(2):273–7. 28. 28.Cummings SR, Martin JS, McClung MR, Siris ES, Eastell R, Reid IR, et al. Denosumab for Prevention of Fractures in Postmenopausal Women with Osteoporosis. New England Journal of Medicine. 2009;361(8):756–65. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMoa0809493&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19671655&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000269081200004&link_type=ISI) 29. 29.Kim M, Lin TC, Arora T, Zhao H, Balasubramanian A, Stad RK, et al. Comparability of Osteoporosis Treatment Groups Among Female Medicare Beneficiaries in the United States. J Bone Miner Res. 2023;38(6):829–40. 30. 30.Suchard MA, Schuemie MJ, Krumholz HM, You SC, Chen R, Pratt N, et al. Comprehensive comparative effectiveness and safety of first-line antihypertensive drug classes: a systematic, multinational, large-scale analysis. Lancet. 2019;394(10211):1816–26. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0140-6736(19)32317-7&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31668726&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 31. 31.Kim Y, Tian Y, Yang J, Huser V, Jin P, Lambert CG, et al. Comparative safety and effectiveness of alendronate versus raloxifene in women with osteoporosis. Scientific Reports. 2020;10(1):11115. 32. 32.Voss EA, Boyce RD, Ryan PB, van der Lei J, Rijnbeek PR, Schuemie MJ. Accuracy of an automated knowledge base for identifying drug adverse reactions. J Biomed Inform. 2017;66:72–81. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jbi.2016.12.005&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 33. 33.Nguyen TL, Collins GS, Spence J, Daurès JP, Devereaux PJ, Landais P, et al. Double-adjustment in propensity score matching analysis: choosing a threshold for considering residual imbalance. BMC Med Res Methodol. 2017;17(1):78. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12874-017-0338-0&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28454568&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) 34. 34.Elze MC, Gregson J, Baber U, Williamson E, Sartori S, Mehran R, et al. Comparison of Propensity Score Methods and Covariate Adjustment: Evaluation in 4 Cardiovascular Studies. Journal of the American College of Cardiology. 2017;69(3):345–57. [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6MzoiUERGIjtzOjExOiJqb3VybmFsQ29kZSI7czo0OiJhY2NqIjtzOjU6InJlc2lkIjtzOjg6IjY5LzMvMzQ1IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMTAvMTYvMjAyMy4xMC4wMi4yMzI5NjIxMi5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 35. 35.Austin PC. The relative ability of different propensity score methods to balance measured covariates between treated and untreated subjects in observational studies. Med Decis Making. 2009;29(6):661–77. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1177/0272989X09341755&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19684288&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000272382400003&link_type=ISI) 36. 36.Amoah J, Stuart EA, Cosgrove SE, Harris AD, Han JH, Lautenbach E, et al. Comparing Propensity Score Methods Versus Traditional Regression Analysis for the Evaluation of Observational Data: A Case Study Evaluating the Treatment of Gram-Negative Bloodstream Infections. Clin Infect Dis. 2020;71(9):e497–e505. 37. 37.Khera R, Schuemie MJ, Lu Y, Ostropolets A, Chen R, Hripcsak G, et al. Large-scale evidence generation and evaluation across a network of databases for type 2 diabetes mellitus (LEGEND-T2DM): a protocol for a series of multinational, real-world comparative cardiovascular effectiveness and safety studies. BMJ Open. 2022;12(6):e057977. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoiYm1qb3BlbiI7czo1OiJyZXNpZCI7czoxMjoiMTIvNi9lMDU3OTc3IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMTAvMTYvMjAyMy4xMC4wMi4yMzI5NjIxMi5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 38. 38.Schuemie MJ, Cepeda MS, Suchard MA, Yang J, Tian Y, Schuler A, et al. How Confident Are We about Observational Findings in Healthcare: A Benchmark Study. Harv Data Sci Rev. 2020;2(1). 39. 39.Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA. Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data. Proc Natl Acad Sci U S A. 2018;115(11):2571–7. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMToiMTE1LzExLzI1NzEiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8xMC8xNi8yMDIzLjEwLjAyLjIzMjk2MjEyLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 40. 40.Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. Doubly robust estimation of causal effects. Am J Epidemiol. 2011;173(7):761–7. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/aje/kwq439&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21385832&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000289301200007&link_type=ISI) 41. 41.Austin PC. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Statistics in Medicine. 2016;35(30):5642–55. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/sim.7084&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27549016&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F10%2F16%2F2023.10.02.23296212.atom)