Distributed Statistical Analyses: A Scoping Review and Examples of Operational Frameworks Adapted to Healthcare
===============================================================================================================

* Félix Camirand Lemyre
* Simon Lévesque
* Marie-Pier Domingue
* Klaus Herrmann
* Jean-François Ethier

## Abstract

Data from multiple organizations are crucial for advancing learning health systems. However, ethical, legal, and social concerns may restrict the use of standard statistical methods that rely on pooling data. Although distributed algorithms offer alternatives, they may not always be suitable for healthcare research frameworks. This paper aims to support researchers and data custodians in three ways: (1) providing a concise overview of the literature on statistical inference methods for horizontally partitioned data; (2) describing the methods applicable to generalized linear models (GLM) and assessing their underlying distributional assumptions; (3) adapting existing methods to make them fully usable in healthcare research. A scoping review methodology was employed for the literature mapping, from which methods presenting a methodological framework for GLM analyses with horizontally partitioned data were identified and assessed from the perspective of applicability in healthcare research. From the review, 41 articles were selected, and six approaches were extracted for conducting standard GLM-based statistical analysis. However, these approaches assumed evenly and identically distributed data across nodes. Consequently, statistical procedures were derived to accommodate uneven node sample sizes and heterogeneous data distributions across nodes. Workflows and detailed algorithms were developed to highlight information-sharing requirements and operational complexity.

## 1 Introduction

### 1.1 Health Research at Scale

Learning health systems (LHS) are coming of age and are being deployed to address important health challenges at different scales. The framework starts by leveraging health data created across various activities. It obviously includes data points from clinics and hospitals, but the perimeter of data required to meaningfully and optimally address important problems is much wider and includes research cohorts, biobanks, quantified self data, environmental exposures and social service delivery.

While some questions might be addressed at the scale of an individual organisation, LHS focus on systems interactions and often require the analysis of processes and outcomes from various organisations. For example, to fully understand a cancer care trajectory, multiple data sources from multiple organisations will need to be examined to cover all relevant aspects (both within the traditional health system, but also in the community). This often implies at least regional organisations or bigger (provinces, states, countries), like in the context of the Health Data Research Network Canada (HDRN) or Health Data Research UK. Similarly, comparing various approaches is often a fruitful way to identify the best approaches and understand what works, why, and how to scale the promising projects. It can also be a way to amass a critical number of observations in the context of rarer diseases for example. Nevertheless, working with data from multiple data sources, from multiple organisations and located in multiple jurisdictions poses significant challenges.

Traditionally, the analytical methods used by researchers in the healthcare domain and others have relied on data pooling (sometimes referred to as data centralisation): all required data is physically copied to a single location where analysis can take place. However, when working with data from multiple jurisdictions (even when part of the same country like the Canadian provinces and territories), data pooling is often very difficult if not impossible for ethical, legal and social acceptability reasons.

There is, therefore, a pressing need to offer analytical methods allowing the analysis of such data without requiring the need to physically copy the data in a central location.

### 1.2 Distributed Analysis

More formally, this paper is concerned with frameworks where the data needed for a statistical analysis consists of the data about *n* individuals (referred to as the *analytical dataset*), which are not all stored in a single source but are partitioned among *K* locations which will be called *nodes* hereafter. The mereological sum of all the data held at each node therefore forms the analytical dataset. Data can be partitioned horizontally or vertically (or in a mixed way).

*   A horizontal partition implies that all data pertaining to a given individual can be found in a single node. If we assume that patients receive care only in one province, Canadian provincial health administrative datasets hosted by organizations like Population Data BC, ICES in Ontario or the Manitoba Centre for Health Policy (MCHP) in Manitoba can be part of a horizontal partition. A clinical trial where each recruiting site captures all data for a given subject is another example.

*   A vertical partition occurs when all data of a certain type is available in a single node for a group of individuals. A classic example is a hospital with its various information systems. All pathology results can be found in the pathology system, all billing information can be extracted from the finance system, all X-rays are accessible in the picture archiving and communication system (PACS), etc. But to get the full picture of the care received by a patient, multiple systems need to be interrogated. Similarly in the research setting, health administrative data may be in a provincial data centre and genomics data could be held in a research institute.

A mixed partition occurs when both principles partly apply: some individuals may have their data spread out across nodes, and different individuals may be present in different nodes.

#### 1.2.1 Assumptions

The difficulties in conducting analyses on a large scale mentioned above are often associated with horizontally partitioned data, and the current work focuses on this type of partition. The methods presented in this article might therefore not be directly applicable to vertically partitioned data.

One group of approaches often labelled as *distributed analysis* involves calculations at each participating node and exchanges of the resulting aggregated statistics with a *coordinating centre* (CC), which can itself also perform additional calculations based on the received aggregated statistics. The CC can be an organisation not responsible for a data node or a data node taking the additional role of CC for a given analysis.

It is important to note that whether in the more traditional way of data pooling or using distributed approaches (where the data is not copied centrally), data sources will be different on multiple levels. They will represent information using data models with significant variability in terms of structure and technology, but also in terms of semantics. This situation also leads to heterogeneous data where the presence of predictors and outcomes is likely to be different in different nodes. Different approaches (e.g. data mediation or extract-transform-load) have been developed to address these issues, and the current work assumes that one of them has been applied so that the data nodes mentioned hereafter are assumed to share the same structure, the same technological syntax and the same semantics as well as no missing data.

#### 1.2.2 Horizontally Partitioned Statistical Analytics

In what follows, the field that pertains to the statistical analysis of horizontally partitioned and semantically homogeneous data that cannot be consolidated into a central location will be called *Horizontally Partitioned Statistical Analytics* (HPSA).

Methodological contributions to this field have arisen from several streams of literature. Meta-analysis and meta-regression methods (see e.g. [45]) can be viewed as part of HPSA, e.g. by considering that each node-specific dataset belongs to a different “study”. However, their scope is narrower compared to HPSA because they typically assume that only established study-level estimates are available as data. Conversely, HPSA allows for the sharing of additional summary statistics between the nodes and the CC, such as gradients and Hessians to ensure the best possible performance at the global level. Since meta-analysis does not leverage any supplementary information that could be obtained from studies with access to patient-level data, it can be susceptible to biased estimation, especially in settings with rare outcomes or in the presence of data nodes with limited sample sizes [14]. As meta-analysis and meta-regression methods have been extensively covered in the literature, approaches specifically designed for the analysis of already-established study-level estimates will not be discussed hereafter.

An important research community that has generated a significant amount of analytical contributions is concerned with the massive data setting. There, a dataset often cannot be processed by a single server and is therefore split across multiple machines, which are then considered as nodes able to perform computations and send aggregated results to a CC tasked to fit a global model from them. The methodological avenues proposed in this setting share similarities with the ones designed for the multi-research facility setting involved in LHS, but also have important differences. For example, in the massive data setting, the experimenter has control over the distribution of individuals across nodes, which is typically not the case in multi-research facility studies. So while these approaches share mechanistic similarities and have been suggested as options to consider in the healthcare domain, some hypotheses may not hold. In regression settings, it is often reasonable to assume that the regression link between the response and covariate predictors is the same across nodes. However, assuming that the sampling distribution of covariates involved is equal across nodes is unrealistic in healthcare, particularly due to the presence of data centres that may systematically involve different types of patients. For example, certain clinics may predominantly serve older individuals. While this may not affect the estimation of parameter values, it can have implications for computing confidence intervals to ensure the validity of inferences.

So far, two reviews discussing methods applicable to horizontally partitioned data have been published in the literature [18][22]. However, their focus is on the massive data setting, which works almost invariably under the assumption of even sampling distribution of covariates and equal sample sizes across nodes, and statistical inference tasks beyond parameter estimation are barely covered. This makes them less helpful for healthcare research purposes since most studies involving data analyses rely on confidence intervals or hypothesis testing in settings where predictors’ distribution and sample sizes vary across nodes.

### 1.3 Contemporary challenges in HPSA

The problem is threefold. First, there is a need to raise awareness regarding the existence of HPSA approaches among researchers aiming at undertaking statistical analyses from horizontally partitioned data, especially in healthcare. The reflex is often to request data pooling because it is perceived as the sole option. This has been the tendency of requests made by researchers to to HDRN Canada. Practitioners are usually concerned with finding the most appropriate statistical model that will take into account as many of the features of their specific context of application as possible. Consequently, a clear and unifying mapping of the state of the HPSA field is needed for them to be informed of the scope of existing methods available for their analyses to see whether alternatives to pooling exist.

Second, as underlined above, methodological contributions came from research fields whose working assumptions can be fundamentally different from the ones researchers would be willing to assume in healthcare research. To ensure proper use of statistical inference techniques, it is necessary that the underlying assumptions of existing methods be adequately identified and understood. If necessary, these methods should be adapted to suit the specific requirements of healthcare applications, thereby ensuring accurate and reliable results.

Third, data custodians have to be properly informed on data-sharing requirements entailed by the use of a specific HPSA method applicable to a given research setting. While HPSA avoids the complexities of pooling data, there are still flows of information that have to be acceptable to data stewards. However, even in basic statistical scenarios, available methods are often presented in a way that makes them challenging to compare in terms of information-sharing requirements and operational complexity. Therefore, there is a need for clearer and more accessible presentations of these methods to facilitate decision-making regarding data sharing and operational implementation.

Although it would be ideal to offer managers a comprehensive operational workflow for each identified method to evaluate the information shared and execution complexity, with their accompanying underlying modelling assumptions, the abundance and diversity of available approaches make it unfeasible to accomplish this in a single paper. In fact, methods often differ in terms of their targeted application beyond their distributed aspect. For example, differences may exist in the studied model (e.g., linear, logistic or Cox regression, additive models), the dimensionality/sparsity of the predictor variable space, use of regularization or shrinkage, the presence of missingness, confounders, imbalances, heterogeneity, etc.

#### 1.3.1 Objectives

The objectives of this article are:

O1 To identify and map, from the literature, methodological approaches that make it possible to perform confidence intervals estimation and hypothesis testings from a horizontally partitioned dataset;

O2 Among the approaches identified, to describe the ones that allow to conduct general linear model analyses, and to identify their distributional assumptions;

O3 Based on the approaches identified for GLM-based inferences, to present methods adapted to the setting of uneven sampling distributions across nodes, and to compare them in terms of information-sharing requirements and operational complexity.

A scoping review methodology was chosen to achieve objective O1 of mapping the state of the field of HPSA that pertains to inference procedures. For our second objective (O2), we identified, from the articles selected from the literature search, the ones that presented a methodological framework for conducting statistical inference procedures from a GLM with horizontally partitioned data. We then used these frameworks to derive and describe GLM estimators that are applicable to horizontally partitioned datasets. For each identified method, we analyzed and reported its communication workflow and the distributional assumptions. For our third objective (O3), we first used statistical theory to adapt the identified procedures to the unequal sample size and uneven covariate distribution setting. Algorithms and mathematical expressions for the quantities involved are reported. For conciseness, we present mathematical formulas for estimation procedures of confidence intervals only. Expressions involved for hypothesis testings are similar and can be deduced following the close connecting between confidence intervals and hypothesis tests in GLMs, see e.g. [1].

The mathematical description of the GLM setting considered for this analysis is described below, along with mathematical notations to be used.

### 1.4 Mathematical framework

In the following, lowercase bold letters will represent vector-valued quantities, while uppercase bold letters will denote matrices. The *j*th element of any vector ***a*** *∈* ℝ*p* will be denoted as [***a***]*j*. Similarly, the entry at position (*j, l*) of any matrix ***A****∈* ℝ*p×p* will be denoted as [***A***]*jl*. If *g* is a real-valued and invertible function, we will use *g**−*1 to represent its inverse. Additionally, if *f****θ*** is a real-valued function that depends on a parameter vector ***θ*** and is twice continuously differentiable, *∇****θ*** *f****θ*** and ![Graphic][1]</img> will respectively indicate the gradient and Hessian matrix of *f****θ*** with respect to ***θ***.

#### 1.4.1 Model mathematical assumptions

A mathematical depiction of the horizontally partitioned data framework studied in this paper is as follows. There are *n* individuals horizontally partitioned across *K* data storage nodes. Each node’s dataset is denoted by ![Graphic][2]</img>, where 1 *≤ k ≤ K*. Here, ![Graphic][3]</img> represents the measurements on the *i*th individual at node *k*, where ![Graphic][4]</img> denotes their response variable and ![Graphic][5]</img> denotes their covariate vector. The total sample size at node *k* is denoted by *n*(*k*). The combined datasets 𝒟 (1), …, 𝒟 (*K*) make up the whole dataset without any duplicated individuals, indicating that ![Graphic][6]</img>.

Throughout the analysis, it is assumed that the ![Graphic][7]</img> ’s are independent across 1 *≤ i ≤ n*(*k*) and 1 *≤ k ≤ K*, and there is no missing data. Additionally, the size of the covariate space (i.e., the dimension of ![Graphic][8]</img>, which is equal to *p* representing the number of features to include as predictors in the GLM) is assumed to be low, eliminating the need for regularization or variable selection. Finally, it is assumed that each node possesses a non-negligible proportion of the whole dataset. Specifically, for each *k ∈* {1, …, *K*}, the quantity *n*(*k*)/*n* is bounded away from 0 and 1 as the sample size *n* tends to infinity, denoted as *n*(*k*)/*n → p*(*k*) *∈* (0, 1).

#### 1.4.2 Mathematical description of the GLM framework

The formulation of the GLM considered in this article encompasses various commonly used regression models such as linear regression, logistic regression, Poisson regression, and probit models. It assumes that the density or probability mass function of each response variable (known as the random components) belongs to the exponential family of distributions. Within this formulation, the (conditional) mean of the response variable is expressed as a function of a linear combination of the corresponding covariate vector. Formally, it assumes that there exist unknown parameters ***β****⋆* *∈* ℝ*p*+1 and *ϕ**⋆* *>* 0, and known model-specific functions *b, c, g, h* such that with ![Graphic][9]</img> and ![Graphic][10]</img>, ![Formula][11]</img>  where, for any ***β*** = [*β*, *β*1, …, *β**p*]*⊤* *∈* ℝ*p*+1 and *ϕ*, ![Formula][12]</img>  

In formula (1), *b* maps real numbers to real numbers and is such that ![Graphic][13]</img>, with ![Graphic][14]</img>. In this framework, *g* is called *link function*, the term ![Graphic][15]</img> is usually referred to as the *natural parameter* and *b* as the *cumulant* function. *ϕ* is often called the *dispersion parameter*, and is either known (e.g. with *ϕ* = 1) or unknown. When *h*(*x*) = *x* (i.e. *h* is the identity function), the link *g* is called *canonical*.

The logistic regression model is obtained upon taking *ϕ* = 1, *h*(*x*) = *x, b*(*x*) = log(1 + *e**x*), *c*(*y, ϕ*) = 0 and *g*(*x*) = log {*x*/(1 *− x*)}. The linear regression model with homoskedastic residual error variance *ϕ* is derived upon setting *h*(*x*) = *x, b*(*x*) = *x*2/2, *c*(*y, ϕ*) = *− y*2/(2*ϕ*) − log(2*πϕ*)/2 and *g*(*x*) = *x*. Hence, both the logistic and the linear regression models rely on a canonical link function in the exponential family distribution.

## 2 Materials and Methods

### 2.1 Methodology related to objective O1

Scoping reviews are well-suited to efficiently map key concepts within a research area [2]. They are widely acknowledged for their ability to clarify working definitions and conceptual boundaries in a specific topic or field [39], facilitating a shared understanding among researchers regarding the status of the research area. These considerations make the scoping review methodology well-designed to achieve objective O1.

Scoping studies utilize systematic searches of relevant databases, employing specific keywords to define the boundaries of the research field. However, identifying these keywords can be challenging, particularly when relevant papers are scattered across different research streams or in independent clusters that do not reference each other. To address the risk of overlooking significant methodological contributions due to a limited number of keywords, a snowballing literature search was initially conducted to generate a comprehensive list of keywords related to HPSA. The scoping review then proceeded with a systematic literature search using the identified keywords. It’s worth noting that, since the planning of the scoping study is independent of the search approach, the guidelines presented in [2] are still appropriate.

#### 2.1.1 Methodology pertaining to the snowballing keywords search

Snowballing is generally used as a literature search method aiming at identifying papers belonging to a given field [54]. It typically consists in three steps:

1.  Initiate searches in prominent journals and/or conference proceedings to gather an initial set of papers.

2.  Conduct a backward review by examining the reference lists of the relevant articles discovered in steps 1 and 2 (continue iterating until no new papers are found).

3.  Perform a forward search by identifying articles that cite the papers identified in the previous steps.

To avoid selection bias, the initial set of papers for the snowballing approach in (1) is sometimes generated through a search in Google Scholar (see e.g. [23]). The latter strategy was used here too.

As mentioned earlier, here, the snowballing search strategy was used in preparation for the application of the scoping review protocol, with the goal of identifying relevant keywords. Specifically, the starting set of papers was assembled by screening titles and abstracts from the first 50 papers generated through a Google Scholar search using the strings *distributed inference* and *federated inference*. The main inclusion criterion was “presents, applies or discusses a statistical inference method to analyse horizontally partitioned data”. Then, the backward and forward snowballing steps approaches were applied.

From the set of keywords found in the selected papers, a list of those relevant to HPSA but not directly associated with any specific method was retained for the scoping review step. It is worth noting that, since the objective of the scoping review is to identify statistical inference methods for horizontally partitioned data, keywords linked to method identifiers have to be excluded from the retained list to avoid pre-selection bias in the scoping review phase of this project.

Selected keywords that were identified from the snowballing literature search are *distributed algorithms, distributed estimation, distributed inference, distributed learning, distributed regression, federated inference, federated estimation, federated learning, privacy-protecting algorithm, privacy-preserving algorithm* and *aggregated inference*.

#### 2.1.2 Methodology pertaining to the scoping review

The scoping review’s methodological framework of Levac et al. [26] (see also [2]) was followed. The steps are briefly described below. A detailed protocol is available in Appendix A.

##### Search strategy

We conducted a comprehensive search across four bibliographic databases, namely (1) MEDLINE, (2) Scopus, (3) MathSciNet, and (4) zbMATH, to encompass the interdisciplinary nature of the topic and identify relevant research articles. Our research strategies were based on two key concepts: distributed data and statistical inference. In addition to the keywords obtained from the snowballing step, we incorporated terms like *confidence interval* to target articles focusing specifically on statistical inference. To ensure the inclusion of recent advancements, our search was limited to papers published from 2000 onwards. This cutoff date was chosen to account for the emergence of distributed data, the prevalence of massive datasets, and advancements in technology. It was set conservatively to capture any early-developed methods and ensure comprehensive coverage of the topic.

##### Selection process

After completing the primary research, a two-stage selection process was employed. Initially, two authors (MPD, FCL) collaborated to screen all articles identified through the research strategy based on their titles and abstracts. Subsequently, the full texts of the selected articles were independently reviewed by both authors to finalize the selection process. This rigorous approach ensured a thorough evaluation of each article’s relevance and eligibility for inclusion.

The primary inclusion criterion for the selection process was as follows: *Presents a solution for conducting inferential statistics on horizontally partitioned data*. This criterion was utilized to ensure that the chosen articles specifically addressed the methods associated with performing statistical inference on horizontally partitioned data.

The following exclusion criteria were derived directly from objective O1:

*   - Does not address inferential statistics, including confidence intervals, hypothesis testing, or asymptotic normality.

*   - Does not provide a methodological contribution.

*   - Presents a solution for encryption or secret-sharing.

To ensure the inclusion of validated approaches, the selection process only considered published papers that had full-text availability in English or French. Discussion papers were excluded as they do not present novel methods or approaches.

Exclusion was considered if any of the exclusion criteria were met or if any of the inclusion criteria were not met. Finally, the references of each included article from the databases were assessed to identify any relevant articles that may not have been captured during the initial screening due to specific keywords. This additional step in the selection process was necessary given the broad range of vocabulary used to describe applicable approaches in our context.

##### Data extraction and analysis plan

Data extraction for the included articles was conducted by one author (MPD) and followed a collectively developed data-charting form. Model type (*parametric regression, semi-parametric regression, non-parametric regression* or *not specific to regression*) and number of communication from CC to nodes (0 or *≥* 1) were among the data extracted. All methods from the included articles were subsequently classified according to their specified characteristics, as outlined in the protocol. Additionally, as part of the analysis, we conducted a screening of the general distributed approaches commonly employed across all specific methods.

### 2.2 Methodology related to objective O2

To achieve objective O2, three steps were taken. First, we identified methodological approaches from articles included in the scoping review that enable parameter and confidence interval estimations from horizontally partitioned data within a standard GLM framework. Methods designed specifically for the particular cases of linear or logistic regression were also reported but were not analyzed in detail. Second, we extracted workflows for each approach to determine the information exchanged between data storage nodes and the CC. Third, we analyzed the mathematical assumptions necessary for parameter estimation and the consistency of confidence interval procedures. We specifically reported the assumptions related to the distribution of node-specific covariates.

#### 2.2.1 Identification of the approaches

To identify approaches that enabled the fitting of any GLM using horizontally partitioned data, two authors (FCL, MPD) independently assessed all articles included in the scoping study. The reviewers specifically looked for articles that discussed approaches applicable to the GLM class described in section 1.4, including likelihood-based methods, M-estimation, and estimating equations. Additionally, we identified and reported articles that specifically focused on regression settings for linear or logistic regression. However, unless the method described was considered easily adaptable to the GLM framework, these articles were not retained for detailed analysis.

A method was selected if it provided an algorithm for fitting GLMs using horizontally partitioned data, aligning with the characteristics outlined in section 1.4. In cases where an article presented asymptotic normality results for the estimators but did not provide an estimator for the asymptotic variance-covariance matrix, the article was still retained, and an estimator for the asymptotic variance was derived using the available calculated quantities.

Since the GLM framework in section 1.4 assumes no missing values, low dimensionality, and a small number of nodes relative to the total sample size, any terms related to these specific conditions mentioned in an article’s methodology were disregarded. Consequently, the calculations for confidence intervals were adjusted accordingly. If an article solely focused on one of these aspects without contributing to the overall methodology, it was not included in the final selection.

Methodological components regarding parameter estimation and confidence interval procedures were extracted from the screened articles. Specifically, the focus was on understanding how parameters should be estimated within a horizontally partitioned framework and how confidence intervals should be computed for these parameters. For each article, the formulas related to quantities shared among the nodes and quantities calculated by the CC were derived and analyzed. These formulas were examined within a workflow that indicated the necessary circulation of information for the procedure to be executed.

##### Reported results

The rationale behind each method that was deemed suitable for fitting GLMs was documented, along with the corresponding reference to the paper included in the scoping study where the method was introduced or discussed.

Articles that discussed approaches specifically applicable to the cases of linear or logistic regression were also mentioned, but not elaborated on in detail.

### 2.3 Methodology related to objective O3

In most statistical settings with horizontally partitioned data, it is commonly assumed that the sample sizes of the data nodes are equal and that the distribution of covariates is the same across all nodes. However, when the number of nodes is fixed and relatively small compared to the sample sizes, it is possible to adapt a particular approach to handle situations where the sample sizes and covariate distributions vary across nodes. This can be achieved by combining the theoretical arguments presented in the original article of the method with the principles of asymptotic statistics theory concerning maximum likelihood estimation.

To adapt a given approach for situations where sample sizes and covariate distributions differ across nodes, the following steps were taken:

1.  The formulas for the relevant quantities were modified to emphasize the changes caused by this scenario. It was ensured that the adapted quantities were equivalent to their counterparts presented in the original article for an equal sample size setting.

2.  Using asymptotic theory, an asymptotic normality result was derived for the estimators of interest, assuming a set of assumptions that accommodated potential variations in sample sizes and covariates sampling distribution across nodes, while still enabling meaningful theoretical arguments.

3.  Formulas for the asymptotic variances were derived. Statistical theory on maximum likelihood estimation was employed to obtain consistent estimators for asymptotic variances. The latter estimators were derived under the constraint that they had to be calculated without requiring any additional communication round between the CC and the nodes. Thus, throughout the adaptation process, the communication workflow remained unchanged compared to the original method.

These steps ensured the mathematical correctness of adapting the approaches to handle different sample sizes and covariate distributions across nodes. Importantly, the adaptation maintained consistency with the original method’s communication workflows.

#### Statistical estimates of interest

A standard GLM typically includes one or two unknown parametric components. The first are the ***β*** parameters, which are commonly assumed to be unknown. The second parameter is the nuisance parameter *ϕ*, which can either be known (e.g., in logistic models) or unknown (e.g., in linear models). In practical applications, when *ϕ* is unknown, its estimated value is often not the main focus, although the latter is necessary to estimate the asymptotic variance of the ***β*** parameter estimates.

In the upcoming analysis, we will assume that the parameter *ϕ* is unknown and estimated using the recommended approach in the selected methods. However, in the case where *ϕ* is known, the process becomes simpler. This involves substituting the known value of *ϕ* and disregarding the estimation step. It is important to highlight that estimating *ϕ* requires additional information to be shared between the nodes and the CC, but it does not necessitate any extra communication round between them.

The estimation process for both the ***β*** parameters and the *ϕ* parameter are discussed. Additionally, we explained how to compute an estimator for the asymptotic variance specifically for the estimator of ***β****⋆*. It is important to note that the results presented below can be modified and extended to develop a similar procedure for estimating *ϕ**⋆*.

Using these results, based on an estimator of ***β****⋆*, say ![Graphic][16]</img> and a formula for the estimator of the asymptotic variance-covariance matrix involved in its associated asymptotic normality result, say ![Graphic][17]</img>, Wald-type (1 *− α*) confidence intervals can be computed for each component of ***β****⋆* using the formula ![Formula][18]</img>  

#### Reported results

For each approach considered, we presented the formulas necessary to compute the final estimates of the ***β*** parameters and their corresponding confidence intervals. The presentation of these formulas was designed to emphasize the communication workflow. Furthermore, a comprehensive algorithm was provided, outlining the step-by-step process.

In addition, the asymptotic normality of the ***β*** parameter estimators was stated, accompanied by the formula for the asymptotic variance and its consistent estimator. Detailed proofs for these results can be found in the Appendix B.

## 3 Results

### 3.1 Results related to objective O1

#### 3.1.1 Search outcomes from the scoping review

As presented in Figure 1, a total of 1407 articles were initially identified across all four databases after removing duplicates. Subsequently, a majority of these articles (n=1274) were excluded based on the evaluation of titles and abstracts, leaving 133 articles for eligibility assessment through full-text review. Following this assessment, 29 articles were included from the databases. Additionally, by reviewing the references of the included articles, 12 more articles were identified and added to the study.

![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F1)

Figure 1. 
Article selection process for the scoping review. Detailed inclusion and exclusion criteria are described in the text and in the protocol.

Among the additional 12 articles obtained through the assessment of references from included articles, it is observed that most of them did not mention statistical inference or related terms in their abstracts (e.g., [14][15][41]). Consequently, these articles were not captured in the initial database search results. Furthermore, some articles directly referred to the specific method used without including any keywords related to horizontally partitioned data in their abstracts or titles (e.g., [4][11]), which greatly reduced the chance of initially identifying them. However, during the process of reviewing the references of included articles, all the relevant papers that were initially identified through the snowballing strategy were eventually retrieved either through the search strategy or the selection process based on the references of included articles.

#### 3.1.2 Results of the scoping review

Each article included in the scoping review put forth one or multiple methodological approaches pertaining to objective O1. Similarities are differences regarding the communication schemes involved and their background of origin are summarized below.

First, all selected articles discuss one or more statistical procedures that operate on horizontally partitioned data using one of the communication schemes depicted in Figures 2 to 5.

![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F2.medium.gif)

[Figure 2.](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F2)

Figure 2. 
Workflow I.

*   - In Workflow I, as shown in Figure 2, each node calculates summary statistics from its own samples, and the results are sent to the CC. The CC combines the information provided by each node to produce the final estimates. This communication approach is commonly referred to as a “one-shot” or “non-iterative” in the literature, although not always consistently.

*   - In Workflow II, as shown in Figure 3, multiple communication rounds are allowed between the CC and the data storage nodes. This allows for iterative interactions between the nodes and the CC to refine the estimates. 
    
    ![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F3.medium.gif)
    
    [Figure 3.](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F3)
    
    Figure 3. 
    Workflow II.
    
    
*   - Some approaches fundamentally differ from the two previous workflows by assigning a different role to one of the nodes, say node 1, compared to the others. These approaches operate using Workflow III as illustrated in Figure 4, where node 1 follows a distinct communication pattern compared to the other nodes. In the papers included in the scoping review that discuss these approaches, node 1 is invariably designated as the CC. However, in the context of the current paper, their roles were distinguished. The additional step performed by the CC, which involves data aggregation, can be particularly well-suited for privacy protection purposes in practice. 
    
    ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F4.medium.gif)
    
    [Figure 4.](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F4)
    
    Figure 4. 
    Workflow III.
    
    
*   - The particular setting shown in Workflow IV in Figure 5 requires two back-and-forth communication exchanges between each node and the CC at each iteration. This communication pattern distinguishes it from the other workflows. 
    
    ![Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F5.medium.gif)
    
    [Figure 5.](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F5)
    
    Figure 5. 
    Workflow IV.
    
    
In light of the preceding discussion, from an operational standpoint, two categories of workflows emerge. On one hand, there are workflows that do not necessitate any communication from the CC to the nodes, which are captured in Workflow I. On the other hand, there are workflows that involve one or more communication exchanges from the CC to the nodes, which are captured in Workflows II, III and IV.

In order to emphasize similarities among the methods presented in these articles and facilitate the identification of methods suitable for specific purposes, a systematic classification is presented in Table 1. The articles are categorized based on the type of models employed and the number of communications from the CC to the individual nodes.

View this table:
[Table 1:](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/T1)

Table 1: Classification of articles included in the scoping review.

The majority of the methods were published within the methodological setting of Big or Massive data/Multi-machine, while some were reported within the context of healthcare research. Within the Big or Massive data/Multi-machine methodological setting, many methods involve an initial step of random data partitioning among multiple machines. However, certain methods assume a scenario where data is already stored on separate machines, as observed in [17] and [24]. Furthermore, it is worth noting that no articles published prior to 2010 were included, aligning with our initial hypothesis regarding the identification of contemporary methodological settings. The majority of the included articles (30 out of 41) were published after the year 2018.

The majority of articles address a setting where a CC exists external to the nodes, as exemplified by articles such as [28], [51], and [58]. In contrast, as mentioned above, some articles designate one of the nodes to assume this central role, as demonstrated in [9].

The methods identified through our search strategy share a common characteristic of utilizing a global model that incorporates population-level parameters. In some cases, these parameters may also include node-specific components to accommodate node-specific statistical heterogeneity in the outcome-predictors relationship, which captures deviations from the population-level conditional probability distribution of the outcome given the predictors.

A few of the reported methods have the capability to yield results identical to those obtained if the individual line data were pooled from all nodes, see e.g., [56] and [44].

### 3.2 Results related to objective O2

Six approaches were selected as applicable to the standard GLM framework discussed in 1.4. They all assumed that nodes had equal sample sizes and identical distributions for the covariates.

#### 3.2.1 Simple averaging

One of the simplest methods for horizontally partitioned data analysis, often referred to as the “simple averaging method” or the “divide-and-conquer” approach, has been extensively studied in the literature, see [60] and [41] which were included in our scoping study. It operates through Workflow I in Figure 2. In this approach, node-level model estimates are gathered and averaged at the CC to generate the final estimates.

In the context of GLM, each node is initially tasked with calculating the maximum likelihood estimator (MLE) of the ***β****⋆* and *ϕ**⋆* parameters using their respective data. Additionally, the Hessian matrix of the log-likelihood function with respect to the ***β*** parameters must be computed for constructing Wald-type confidence intervals. The estimated parameters and the computed Hessian matrix are then transmitted to the CC.

The final parameter estimates of ***β****⋆* are obtained by averaging the node-specific estimates, while the local Hessians and estimates of *ϕ**⋆* are utilized to compute an estimator for the asymptotic variance.

#### 3.2.2 Single distributed Newton-Raphson updating

The single distributed Newton-Raphson updating method is an iterative procedure that includes an additional communication round between the CC and the nodes, compared to the simple averaging method. It was originally proposed as the “distributed one-step” method in [21], but here it is referred to by a different term to avoid any confusion regarding communication complexity. This method operates using Workflow II, as depicted in Figure 3, with *T* = 1 (where *T* represents the number of cycles in the iteration scheme). It enhances the simple averaging estimators by incorporating a single distributed Newton-Raphson updating step.

In the context of GLM, each node first calculates the MLE of ***β****⋆* and *ϕ**⋆*, and transmits them to the CC. The CC aggregates these estimates using averaging and sends the result back to the nodes. The nodes then compute the gradient and the Hessian matrix of the log-likelihood function, evaluated at the received ***β****⋆* and *ϕ**⋆* estimates. Subsequently, the gradient and the Hessian matrix are sent back to the CC, which averages them and computes a Newton-Raphson updating step based on the simple averaging estimates. An estimator for the asymptotic variance can be calculated by utilizing the received Hessian matrices and the updated estimate of *ϕ**⋆*.

#### 3.2.3 Multiple distributed Newton-Raphson updatings

The multiple distributed Newton-Raphson updating method leverages the fact that, for standard GLMs, the algorithm typically used to calculate the MLE of ***β****⋆* and *ϕ**⋆* in a centralized pooled setting can be executed in a distributed manner without any loss of information. This is possible because the algorithm relies on Newton-Raphson updatings (or sometimes Fisher scoring updatings) that are expressed using two sums of node-specific summary statistics, namely local gradients and local Hessian matrices of the log-likelihood function, evaluated at the ***β****⋆* and *ϕ**⋆* estimates from the previous iteration. A version of this method is proposed in [56] under the logistic regression framework. It operates through Workflow II in Figure 3 for a general *T ≥* 1.

#### 3.2.4 Distributed estimating equation

The class of estimating equations estimators is vast and encompasses a broad range of statistical estimation techniques, including likelihood-based approaches that rely on searching for critical points. The fundamental idea behind estimating equations methods is to establish a system of equations that involve both the sample data and the unknown model parameters. These equations are then solved to determine the parameter estimates. MLEs, which are obtained by setting the gradient of the log-likelihood function with respect to the unknown parameters equal to zero, belong to the class of estimating equations estimators.

The distributed estimating equations approach involves gathering summary statistics from nodes at the CC level, enabling the reconstruction of the estimating equations, or more commonly, an approximation of them that would have been obtained in a pooled centralized setting. This method is discussed in [29] and operates using Workflow I, as depicted in Figure 2.

In the context of GLMs, the distributed estimating equations approach involves initially assigning each node the task of computing and sending their local MLEs and the Hessian matrix of their local log-likelihood, evaluated at those MLEs, to the CC. The CC utilizes these received quantities to reconstruct the global estimating equations or an approximation thereof. This reconstruction ultimately leads to an analytical solution for obtaining the resulting estimates. Confidence intervals are computed using a combination of the Hessian matrices and the final estimator of *ϕ**⋆*.

It is important to note that when this approach is applied in the context of linear regression, it enables the acquisition of ***β****⋆* parameter estimates that are identical to those obtained in a pooled centralized setting.

#### 3.2.5 Distributed estimation using a single gradient-enhanced log-likelihood

This method differs fundamentally from the ones discussed thus far, as it involves a distinct role for one particular node in obtaining model parameter estimates. It operates using Workflow III, as depicted in Figure 4, and was proposed in [24] under the name “Surrogate likelihood”. This approach relies on an approximation of the global likelihood by viewing it as an analytic function. It expands the global likelihood into an infinite series around an initial guess ![Graphic][19]</img> and replaces the higher-order derivatives (order *≥* 2) of the global likelihood with those of a Taylor expansion of a node’s (e.g., node *k* = 1) local likelihood around the same value. By following this procedure, the so-called surrogate likelihood can be solved using data from node *k* = 1 and aggregated gradients from nodes *k ∈* {2, …, *K*}.

In the context of GLM, the CC first collects the necessary information to compute initial estimates for the parameters ***β****⋆* and *ϕ**⋆*. These initial estimates can be obtained through various approaches, such as a simple averaging estimator or the MLEs computed using data from node 1. These initial estimates are then transmitted to nodes *k∈* {2, …, *K*}. Each of these nodes calculates the gradient of the log-likelihood function, evaluated at the received estimates, and sends it back to the CC. The CC averages these gradients and sends the result to node Node 1 solves a gradient-enhanced log-likelihood using its own data and the received average gradient. The resulting estimate is sent back to the CC as the final estimate. To compute confidence intervals, each node must send the Hessian matrix of its local log-likelihood function, evaluated at the initial received estimate.

The steps related to estimation can be repeated multiple times.

#### 3.2.6 Distributed estimation using multiple gradient-enhanced log-likelihoods

This method is in the spirit of the *distributed estimation using a single gradient-enhanced log-likelihood* approach described above, except that all nodes have to solve a gradient-enhanced log-likelihood instead of only one of them. Results pertaining to statistical inference are discussed in [17] under a penalized setting. A non-penalized version of this method was introduced in [42] although the latter did not discuss confidence intervals or hypothesis testing, and hence was not included in our scoping review. It operates through Workflow IV depicted in Figure 5.

### 3.3 Results related to objective O3

In what follows, let the log-likelihood of the data stored in node *k* (using 𝒟 (*k*)) be denoted by ![Formula][20]</img>  

Also, let ***D***(*k*)(***β***) *∈* ℝ*p*+1 be such that ![Formula][21]</img>  and define the (*p* + 1) *×* (*p* + 1) matrix ***V*** (*k*)(***β***) as ![Formula][22]</img>  

Since ![Graphic][23]</img>, then, solving the equation ***D***(*k*)(***β***) = 0 yields the node-specific MLE of ***β***, denoted hereafter by ![Graphic][24]</img>. The matrix ***V***(*k*) (***β***) is equal to *−∇****β*** ***D***(*k*) (***β***) and relates to Fisher information matrix through the equation ![Graphic][25]</img>.

Finally, set ![Formula][26]</img>  and ![Formula][27]</img>  

Because *E*(*k*)(*ϕ*, ***β***) = *−ϕ*2(*∂*/*∂ϕ*)*ℓ*(*k*)(***β***, *ϕ*), when *ϕ* is unknown, solving the equation ![Graphic][28]</img> for *ϕ* yields its node-specific MLE of *ϕ**⋆*. We have *F*(*k*)(*ϕ*) = *− ∂*/*∂ϕ E*(*k*)(*ϕ*, ***β***).

#### 3.3.1 Simple averaging

The simple averaging method follows upon execution of Algorithm 1. First, each data node computes their local maximum by solving successively ***D*** (*k*) (***β***) = 0 and ![Graphic][29]</img>. To compute the confidence intervals at the CC level, the entries of the (*p* + 1) *×* (*p* + 1) matrix ![Graphic][30]</img>. have to be computed from Formula (3) with ![Graphic][31]</img> Then, the set ![Formula][32]</img>  is sent to the CC. The parameter estimates are then aggregated by the CC through averaging. Specifically, the CC computes ![Formula][33]</img>  where *w*(1), …, *w*(*K*) are weights (i.e., *w*(*k*) *≥* 0 and ![Graphic][34]</img>) used to combine each node’s contribution. Often, weights can be taken proportional to local sample sizes, leading to the choice *w*(*k*) = *n*(*k*)/*n*.

Wald-type confidence intervals for ***β**** can be constructed based on the fact that the sequence ![Graphic][35]</img> converges in distribution to a centred normal random variable with covariance matrix ![Formula][36]</img>  

See Appendix B.3.3. Since ![Graphic][37]</img> is consistently estimated by ![Graphic][38]</img> and *ϕ**⋆* by ![Graphic][39]</img>, and as *p*(*k*) can be estimated by *n*(*k*)/*n*, it follows that a consistent estimator for **Σ**SA is given by ![Formula][40]</img>  

The simple averaging final estimates are then given by ![Formula][41]</img>  

Algorithm 1:
Simple averaging inference procedure

![Figure6](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F6.medium.gif)

[Figure6](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F6)

#### 3.3.2 Single distributed Newton-Raphson updating

The single distributed Newton-Raphson updating method follows upon execution of Algorithm 2 with *T* = 1. First, the CC gathers summary statistics to compute the simple averaging estimators of ***β****⋆* and *ϕ**⋆* without their accompanying confidence interval. Hence, for *k∈* {1, …, *K*}, and with ![Graphic][42]</img> as above (6), node *k* sends to the CC the quantities ![Formula][43]</img>  which uses them to compute ![Graphic][44]</img> and ![Graphic][45]</img> using the formulas in (7).

For reasons of convenience that will become clear later, the notation ![Graphic][46]</img> and ![Graphic][47]</img> will be utilized instead of ![Graphic][48]</img> and ![Graphic][49]</img>, respectively. In this notation, the set of values ![Formula][50]</img>  is broadcasted to data nodes, which are then tasked to compute and send back the quantities ![Formula][51]</img>  where for any integer *t ≥* 1, one defines ![Formula][52]</img>  

Upon receiving the ![Graphic][53]</img> from each node, the CC calculates the following weighted averages: ![Formula][54]</img>  

This enables the CC to execute Newton-Raphson updates from ![Graphic][55]</img> and ![Graphic][56]</img>, respectively: ![Formula][57]</img>  

It is shown in Appendix B.3.4 that ![Formula][58]</img>  

Since ![Graphic][59]</img> is consistently estimated by ![Graphic][60]</img> and *ϕ** by ![Graphic][61]</img>, and as *p*(*k*) can be estimated by *n*(*k*) /*n*, it follows that a consistent estimator for **Σ**NR is given by ![Formula][62]</img>  

The method’s final estimates are then given by ![Formula][63]</img>  

#### 3.3.3 Multiple distributed Newton-Raphson updatings

The multiple distributed Newton-Raphson updatings method follows upon execution of Algorithm 2 with *T >* 1. The first communication cycle follows the same procedure as described above for the single distributed Newton-Raphson updating method. It involves distributively computing a simple-averaging estimator and then performing a Newton-Raphson iteration starting from this estimator. The Newton descent is calculated as described in Equation (10).

Formally, the algorithm begins with each data node *k* sending the set of quantities ![Graphic][64]</img> as described in Equation (8) to the CC. Next, the CC calculates the simple averaging estimators using Formula (7), and uses them to initialize ![Graphic][65]</img> and ![Graphic][66]</img>.

The following steps are then repeated for a certain number of iterations. At iteration *t*, starting from *t* = 1, the CC broadcasts the values ![Graphic][67]</img> to the data nodes. The data nodes compute the quantities ![Graphic][68]</img> and ![Graphic][69]</img> as defined in Equation (9), and send them back to the CC.

The CC then utilizes these quantities to perform a Newton update. Specifically, it calculates ![Graphic][70]</img> and ![Graphic][71]</img>.

If the iterative cycle is repeated until convergence, the resulting estimates of ***β***⋆ are equivalent to the maximum likelihood estimators derived from pooled data. This is because, in GLMs, for maximum likelihood estimators, if both the pooled and distributed algorithms are initialized with the same values for ![Graphic][72]</img> and ![Graphic][73]</img>, then at each subsequent iteration, the distributed Newton update computed by the CC will be identical to the update obtained in a pooled setting.

For the method to yield consistent estimates, it is not necessary to initialize it with simple averaging estimators. However, using simple averaging estimators as initialization may speed up convergence, particularly in large sample sizes, since these estimators are ![Graphic][74]</img> -consistent.

Let ![Graphic][75]</img> denote the estimator obtained at convergence. Since it is (nearly) equal to the pooled MLE of ***β****⋆*, we can deduce from Appendix B.3.2 that ![Formula][76]</img>  

Following the same reasoning used earlier for the single distributed Newton-Raphson update method, we can consistently estimate the variance-covariance matrix as ![Formula][77]</img> 

Algorithm 2:
Distributed Newton-Raphson updatings algorithm

![Figure7](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F7.medium.gif)

[Figure7](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F7)

#### 3.3.4 Distributed estimating equation

The distributed estimating equation algorithm follows upon execution of Algorithm 3. First, each node is responsible for computing the MLEs ![Graphic][78]</img> and ![Graphic][79]</img> of ***β**** and *ϕ**, respectively, using its own data. These estimators, along with the hessian matrix ![Graphic][80]</img> and ![Graphic][81]</img>, are then sent to the CC. The set ![Formula][82]</img>  is transmitted to the CC. The CC calculates the weighted average of the Hessians and the ![Graphic][83]</img> values as follows: ![Formula][84]</img>  

The parameter estimates can then be calculated as ![Formula][85]</img>  

It is shown in Appendix B.3.6 that ![Graphic][86]</img> converges in distribution to a centred normal random variable with variance-covariance matrix given by ![Formula][87]</img>  which is equal to **Σ**NR. It can be consistently estimated by ![Formula][88]</img> 

Algorithm 3:
Distributed estimating equations inference procedure

![Figure8](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F8.medium.gif)

[Figure8](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F8)

#### 3.3.5 Distributed estimation using a single gradient-enhanced log-likelihood

This method operates through Algorithm 4. First, the necessary information is collected by the CC to compute the initial estimates of ***β*** and *ϕ*, denoted as ![Graphic][89]</img> and ![Graphic][90]</img>. In what follows, we assume these estimates are obtained using the simple averaging estimators calculated through Algorithm 1.

Subsequently, the CC broadcasts ![Graphic][91]</img> to node *k* ∈ {2, …, *K*}. Each node is then requested to compute and transmit back the following quantities: ![Formula][92]</img>  with ![Graphic][93]</img> and ![Graphic][94]</img>.

The CC aggregates the ***D***(*k*)’s and the *E*(*k*)’s using averaging by calculating ![Formula][95]</img>  

The ***V*** (*k*)’s are momentarily stored and will be used later to compute the estimator for the asymptotic variance-covariance matrix of the final estimator of ***β****⋆*. The quantities ![Formula][96]</img>  are then sent to node *k* = 1. Node *k* = 1 computes the global average of the ***D***(*k*)’s by adding its own counterpart, i.e., it first computes ![Formula][97]</img>  and then solves the surrogate likelihood function. Formally, it finds successively the values ![Graphic][98]</img> and ![Graphic][99]</img> that solve ![Formula][100]</img>  

The results are sent back to the CC, along with ![Graphic][101]</img>, yielding ![Formula][102]</img>  

If simple averaging estimators for ![Graphic][103]</img> and ![Graphic][104]</img> are chosen, then ![Graphic][105]</img> converges in distribution to a mean-zero multivariate normal random variable with variance-covariance matrix given by ![Formula][106]</img>  with ***I****p*+1 the *p* + 1 square identity matrix and ![Graphic][107]</img>. See Appendix B.3.7. The latter can be consistently estimated by ![Formula][108]</img>  where ![Formula][109]</img>  with ![Graphic][110]</img>.

Remark 1.
*In the paper [24], where the method described above was originally proposed, the authors discuss a version in which the latter process is repeated multiple times. Their version assumes that the data is uniformly and randomly split across nodes. Under this assumption, the resulting estimator of* ***β****⋆* *is asymptotically equivalent to the pooled estimator, regardless of the number of iterations executed. This equivalence occurs because when the predictors’ distribution is the same across nodes and the node sample sizes are equal, then* ![Graphic][111]</img> *and p*(*k*) ≡ 1/*K, it follows that* ![Graphic][112]</img> *resulting in the following expression for* Σ*SGE,1*: ![Formula][113]</img>  

*The variance-covariance matrix above is also the same as that of the simple averaging estimator in the setting of equal sample sizes and even predictor distributions. Consequently, at each iteration, the probability distribution of the resulting estimator remains unchanged. However, in a more general setting where predictor distributions and sample sizes vary across nodes, these cancellations no longer occur. Therefore, in this case, the probability distribution of the obtained estimator changes after each iteration, and tracking these changes falls beyond the scope of objective 3, see Appendix B.3.7. Hence, the current presentation focused on the case where only one iteration is executed*.

Algorithm 4:
Inference procedure based on the distributed estimation using a single gradient-enhanced log-likelihood method

![Figure9](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F9.medium.gif)

[Figure9](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F9)

#### 3.3.6 Distributed estimation using multiple gradient-enhanced log-likelihood

This method operates through Algorithm 5. First, the CC collects the necessary information to compute the initial estimates, denoted as ![Graphic][114]</img> and ![Graphic][115]</img>. In this case, we assume that these estimates are obtained using the simple averaging estimators calculated through Algorithm 1.

Subsequently, the CC broadcasts ![Graphic][116]</img> to each node, which is then requested to compute and transmit back the following quantities: ![Formula][117]</img>  

Here, ![Graphic][118]</img> and ![Graphic][119]</img>.

The CC aggregates the ***D***(*k*)’s and the *E*(*k*)’s using averaging by calculating ![Formula][120]</img>  

The CC then broadcasts ![Graphic][121]</img> to each node, which are then tasked to solve the surrogate likelihood function. Formally, they find successively the value ![Graphic][122]</img> and ![Graphic][123]</img> that solves ![Formula][124]</img>  

Each node then transmits their set of local surrogate likelihood estimators to the CC: ![Formula][125]</img>  

Using the received sets of quantities ![Graphic][126]</img>, the CC aggregates them through averaging using the following formulas: ![Formula][127]</img>  

It is shown in Appendix B.3.8 that ![Graphic][128]</img> converges in distribution to a multivariate normal random variable with mean 0 and a variance-covariance matrix given by: ![Formula][129]</img>  where ![Graphic][130]</img>. The latter can be consistently estimated with ![Formula][131]</img>  where ![Graphic][132]</img> and ![Graphic][133]</img>.

Algorithm 5:
Inference procedure based on the distributed estimation using multiple gradient-enhanced log-likelihood method

![Figure10](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/12/24/2023.12.21.23300389/F10.medium.gif)

[Figure10](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/F10)

#### 3.3.7 Summary of quantities exchanged for the adapted methods

The following table presents a summary of the quantities exchanged between the nodes and the CC in both directions. Table 2 demonstrates that the quantities involved in exchanges from the nodes to the CC consist of parameter estimates, gradients (***D***(*k*) vectors), Hessians (***V*** (*k*) matrices), as well as real numbers (*E*(*k*) and *F*(*k*)). On the other hand, the quantities shared from the CC to the nodes primarily consist of parameter estimates. Notably, Methods 5 and 6 differentiate themselves by requiring the sharing of aggregated gradient vectors and Hessian matrices as well.

View this table:
[Table 2:](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/T2)

Table 2: Quantities shared in each adapted method’s communication workflow

#### 3.3.8 Comparison of adapted methods

Table 3 compares the main adapted HPSA methods on the quantities shared between the CC and nodes and the operational complexity of the procedures.

View this table:
[Table 3:](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/T3)

Table 3: Comparison of adapted methods

View this table:
[Table 4:](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/T4)

Table 4: Inclusion and Exclusion criteria

View this table:
[Table 5:](http://medrxiv.org/content/early/2023/12/24/2023.12.21.23300389/T5)

Table 5: Data extraction.

Methods 1 and 4 require only one communication from the data nodes to the CC and no communication back from the CC to the nodes. These so-called one-shot methods have the lowest operational complexity. Method 4 requires the additional quantity ![Graphic][134]</img> to be transmitted from each node to the CC.

Methods 2 and 3 perform Newton-Raphson updates using some initial estimator as a basis, usually the simple averaging estimator. While Method 2 requires this initial estimator to be ![Graphic][135]</img> -consistent, if *T* is large enough, any initial value will work for Method 3 (although convergence may be slower). Both methods require ***D***(*k*), ***V*** (*k*), *E*(*k*) and *F*(*k*) to be evaluated and sent to the CC *T* times, with *T* = 1 for Method 2. Compared to Method 1, Method 2 requires the additional quantities ***D***(*k*), *E*(*k*) and *F*(*k*), and Method 3 further requires these quantities to be evaluated and communicated multiple times.

Method 5 relies on an approximation of the log-likelihood function. It requires an initial estimator, usually the simple averaging estimator. This approach treats node 1 differently, making it solve the surrogate log-likelihood using aggregates from the other nodes and its own data. The CC sends the initial estimator to each node, then requires them to evaluate ***D***(*k*), ***V*** (*k*) and *E*(*k*) and send the result back to the CC once. It averages the results and then communicates them to node 1, which solves the surrogate log-likelihood and sends its results back to the CC.

Method 6 applies Method 5 to every node, making each node solve the surrogate log-likelihood function with its own data before averaging the resulting local estimators.

## 4 Discussion

### 4.1 Summary of Findings

The first objective (O1) of this study aimed to identify and map the methodological approaches used and developed in the literature regarding HPSA. To achieve this, we conducted a scoping review, which included 41 articles following our protocol. These articles were categorized based on the types of models and communication schemes involved, as presented in Table 1. The analysis revealed that the majority of methods included in the scoping review focused on methodological settings associated with massive data. The communication schemes of these methods were demonstrated through Workflows I, II, III and IV.

The second objective (O2) of this study aimed to describe the approaches that can be employed for basic GLM regression analyses and identify the distributional assumptions they require. To accomplish this, we identified six approaches that could be classified within Workflows I-IV. However, a limitation of these methods is that they assume identical node sample sizes and node covariate distributions. This assumption reduces their suitability in settings commonly encountered in healthcare research, where data collecting nodes are prone to generating different covariate distributions.

The third objective (O3) of this study was to present methods that relaxed these assumptions by adapting the approaches identified in O2 to the unequal sample sizes and non-identical covariate sample distribution setting. Additionally, we compared these methods in terms of the information shared and operational complexity. This involved adapting the quantities and estimators described in the original articles and deriving new asymptotic results with relaxed assumptions. We proposed a unified framework for inference procedures utilizing these methods. The framework encompasses both estimation and the construction of confidence intervals, providing detailed steps for both the data nodes and the CC.

### 4.2 Challenges and Opportunities

Work pertaining to O1 illustrated why it is so challenging for researchers and data custodians alike to find information regarding HPSA. While the HPSA literature is very recent (all included articles were published in 2010 or later), the literature is non-homogeneous, and it has not come to a consensus on nomenclature. No universal terminology exists, and different terms are used in the different fields developing and applying HPSA methods. Many specific methods introduced in applied contexts are special cases of more general methods which may or may not be cited. These characteristics make finding useful and efficient keywords arduous. This required adapting our research strategy.

This difficulty is compounded by the fact that statistical inference is not the main focus of most of the HPSA literature. The majority of published work is in the prediction, learning and optimization contexts. As a result, method assumptions are rarely discussed. This can be a problem when adapting these methods for inference. Furthermore, the methodological setting is often assumed to be in the massive data context where data is randomly distributed between nodes. This allows the authors to make strong assumptions on node sample sizes and covariate distributions which may be unrealistic in the confidential data complex where different data sources. These methods cannot be used directly for inference using confidential health data. While some work remains to be done when the structure of association between the covariate and the outcome is heterogeneous between nodes, we adapted widely used methods for when the distribution of covariate and sample sizes between nodes are not identical.

Table 1 illustrates how the majority of HPSA methods are focused on parametric models. Some work has also been done for semi-parametric and non-parametric regression, and some methods are introduced outside of the regression framework (although they can also be applied to regression). Many methods do not require communication of quantities from the CC to the data nodes: they only require one transmission from the nodes to the CC. Given the lack of awareness around HPSA, starting by implementing lower operational complexity methods while providing useful results offers a promising path.

The methods can be implemented “manually” (e.g. via email exchanges), but platforms enabling semi-automated distributed fittings of statistical models have been proposed in the literature (e.g. [7]). On the other hand, explicit descriptions of their algorithms and the quantities exchanged are not always easily accessible and this complicates the evaluation of the tools by data custodians and researchers.

This is especially important since it is essential to clarify here that operating an HPSA algorithm does NOT ensure confidentiality in and of itself.

For example, it is known that sharing sample moments can compromise confidentiality. It can be shown that a set of *n* observations is uniquely determined by its first *n* sample moments [40]. This could prove problematic for methods that rely on sharing the first few moments of each node’s sample, especially if number of observations is low, as the sample could be partially reconstructed by the CC.

The results presented here contribute to this objective by clarifying the workflows and quantities exchanged by each method. Nevertheless, further analysis of the confidentiality preserved by HPSA methods is needed to fully understand the risk associated with the sharing of summary statistics, especially as more rounds of communication between the CC and data nodes are completed. The framework of differential privacy (DP) has been used to guarantee the preservation of confidentiality in a few HPSA methods, but a wider application of DP to existing and popular methods has yet to be explored.

## Funding

Health Data Research Network Canada, Natural Sciences and Engineering Research Council of Canada, Fonds de recherche du Québec - Nature et Technologie, the Chaire en informatique de la santé de l’Université de Sherbrooke and the Chaire MEIE Québec - Le numérique au service des systèmes de santé apprenants.

## Informed consent

Not applicable.

## Data availability

Not applicable.

## Conflicts of interest

The authors declare no conflict of interest.

### Abbreviations

CC
:   Coordinating centre
GLM
:   Generalized linear model
HPSA
:   Horizontally Partitioned Statistical Analytics
ICES
:   Institute for Clinical Evaluative Sciences
LHS
:   Learning health system
MCHP
:   Manitoba Centre for Health Policy
MLE
:   Maximum likelihood estimator
PACS
:   Picture archiving and communication system

## A Detailed protocol for the scoping review

### A.1 Research question

1.  What are the existing methods that allow to conduct statistical inference procedures from a horizontally distributed dataset?
    
    *   *Regarding: Methods for different statistical models; Methods for various settings in terms of information shared; Methods for different needs in terms of precision of estimates*.

2.  What are the characteristics of these methods to proceed to a systematic categorisation?
    
    *   *Regarding: Type of algorithm; Settings for nodes and coordinating centre; Capacity to reach exact estimates from data pooling*.

### A.2 Methods

The scoping review will be conducted in accordance with the methodological framework from Levac et al. [26] (based on Arksey and O’Malley [2]).

#### A.2.1 Key-words

The following keywords were identified from the **snowballing literature search**:

*   - *distributed algorithms* [15]

*   - *distributed estimation* [21]

*   - *distributed inference* [24]

*   - *distributed learning* [31]

*   - *distributed regression* [46] (not included in the scoping review final selection since no new estimation methods are discussed).

*   - *federated inference*[57] (not included in the scoping review final selection since the paper was not published when the scoping review search was launched)

*   - *federated estimation* [50] (not included in the scoping review final selection since the paper was not published when the scoping review search was launched)

*   - *federated learning* [27] (not included in the scoping study review final selection the paper focuses solely on estimation, i.e. no confidence interval computation strategies or hypothesis testing framework are discussed).

*   - *privacy-protecting algorithm* [38]

*   - *privacy-preserving algorithm* [14]

*   - *aggregated inference* [22] (not included in the scoping review final selection since no new estimation methods are discussed).

The following keywords will be used to add conciseness to the topic of statistical inference, to avoid screening machine-learning specific articles:

*   - *Statistical inference*

*   - *Confidence interval*

*   - *Statistical estimation*

*   - *Hypothesis tests*

*   - *Significant coefficient, Significance of parameter*

#### A.2.2 Research strategies

In collaboration with a specialist in documentary research at the Université de Sherbrooke, we have selected the following abstract and citation databases: (1) Medline, (2) Scopus, (3) MathSciNet, and (4) zbMATH. The choice of these databases was motivated by the interdisciplinary nature of the research question, which spans the fields of statistics and health.

To develop comprehensive research strategies, we combined the previously mentioned keywords and worked closely with the documentary research specialist.

##### Limits and restrictions

In order to strike a balance between sensitivity and specificity in our research, given the interdisciplinary nature of the topic involving distributed data and statistical inference, we took several considerations into account.

To ensure sensitivity, we opted for interdisciplinary databases that are known to cover a wide range of relevant literature. These include Medline, Scopus, MathSciNet, and zbMATH. By selecting these databases, we aimed to capture a comprehensive set of articles that encompass both statistical and health-related aspects.

On the other hand, to maintain specificity and avoid retrieving a large number of non-relevant articles, we carefully selected keywords that were targeted and specific to our research question. Instead of relying solely on thesauri and synonym search tools, we focused on the vocabulary commonly used in the literature through an extensive overlook (snowballing) approach, particularly for the concept of distributed data. For the concept of statistical inference, we chose synonyms that specifically capture studies centred around this topic.

Furthermore, to keep the scope of our research manageable and relevant to recent developments, we limited our search to articles published since the year 2000. This restriction is justified by the emergence of distributed data in recent years, driven by advancements in technology and the availability of massive datasets. By setting this threshold, we aimed to capture any early-developed methods and approaches related to our research topic.

Overall, our research strategies were designed to strike a balance between sensitivity and specificity, ensuring that we capture a comprehensive range of relevant articles while minimizing the inclusion of non-relevant ones.

##### Medline search querry

 ((AB ((((“Privacy-preserving” OR “Privacy-protecting*” OR “federated” OR “Distributed” OR “aggregated “) N1 (“estimation*” OR “algorithm*” OR “inference” OR “analy*” OR “regression*” OR “model*” OR “ statistic*” OR “learning”))) OR TI ((((“Privacy-preserving” OR “Privacy-protecting*” OR “federated” OR “Distributed” OR “aggregated”) N1 (“estimation*” OR “algorithm*” OR “inference” OR “analy*” OR “ regression*” OR “model*” OR “statistic*” OR “learning”)))) OR SU ((((“Privacy-preserving” OR “ Privacy-protecting*” OR “federated” OR “Distributed” OR “aggregated”) N1 (“estimation*” OR “algorithm *” OR “inference” OR “analy*” OR “regression*” OR “model*” OR “statistic*” OR “learning”))))) AND ( TX ((“statistical inference” OR “confidence interval*” OR “Statistical Estimat*” OR “hypothesis test *” OR “significant coefficient*” OR “significant parameter*”)))) 

##### Scopus search querry

 TITLE-ABS-KEY ((“Privacy-preserving” OR “Privacy-protecting*” OR “federated” OR “Distributed” OR “ aggregated”) W/1 (“estimation*” OR “algorithm*” OR “inference” OR “analy*” OR “regression*” OR “model *” OR “statistic*” OR “learning”) AND (“statistical inference” OR “confidence interval*” OR “ Statistical Estimat*” OR “hypothesis test*” OR “significant coefficient*” OR “significant parameter*”)) 

##### MathSciNet search querry

 “Anywhere=(“Privacy-preserving” OR “Privacy-protecting” OR “federated” OR “Distributed” OR “aggregated”) AND Anywhere=(“estimation*” OR “algorithm*” OR “inference” OR “analy*” OR “regression*” OR “model*” OR “statistic*” OR “learning”) AND Anywhere=(“statistical inference” OR “confidence interval*” OR “ Statistical Estimat*” OR “hypothesis test*” OR “significant coefficient*” OR “significant parameter*”) “ 

##### zbMATH search querry

 ((ti:(“Privacy-preserving” | “Privacy-protecting” | “federated” | “Distributed” | “aggregated”) \& ti:(“ estimation*” | “algorithm*” | “inference” | “analy*” | “regression*” | “model*” | “statistic*” | “ learning”)) | (ut:(“Privacy-preserving” | “Privacy-protecting” | “federated” | “Distributed” | “ aggregated”) \& ut:(“estimation*” | “algorithm*” | “inference” | “analy*” | “regression*” | “model*” | “statistic*” | “learning”))) \& any:(“statistical inference” | “confidence interval*” | “Statistical Estimat*” | “hypothesis test*” | “significant coefficient*” | “significant parameter*”) 

##### Grey literature

As one of the exclusion criteria is to exclude all unpublished studies, no research was conducted among grey literature.

#### A.2.3 Selection process

After removing duplicate references, a manual review of the selected references obtained from the databases was conducted to identify relevant articles that address the research question. In this study, a two-stage selection process was employed to ensure a thorough and systematic approach.

To ensure consistency and minimize bias, all reviewers involved in the selection process met before the commencement of the first stage of selection. This initial meeting aimed to establish a shared understanding of the inclusion criteria and research objectives. By aligning their interpretations and definitions of the inclusion criteria, the reviewers ensured a consistent approach throughout the selection process.

During the selection process, there has been a midpoint meeting among the reviewers after the completion of the first stage of selection. This meeting served as an opportunity to discuss any questions, challenges, or uncertainties that may have arisen during the initial selection. By addressing these issues collectively, the reviewers maintained consistency and addressed discrepancies in their evaluations.

Finally, at the end of the second stage of selection, the reviewers had a final meeting. This meeting allowed for a comprehensive discussion of the selected references and ensured that the final set of included articles met the predefined criteria and effectively addressed the research question.

By conducting regular meetings throughout the selection process and discussing the inclusion criteria, the reviewers aimed to maintain consistency, minimize subjectivity, and enhance the reliability of the article selection.

#### A.2.4 Stages of the selection

##### Selection 1: Titles and Abstracts

All titles and abstracts of the references identified through the research strategy were evaluated by a single author (MPD or FCL). Since this step involved a single reviewer, references that were clearly unrelated to the research question or did not meet the inclusion criteria were automatically excluded from further consideration.

The evaluation process conducted by the single author aimed to swiftly discard references that were obviously irrelevant to the research question. This initial screening helped streamline the subsequent stages of the selection process by removing references that did not align with the study’s objectives or criteria.

##### Selection 2: Full text

The full texts of the references selected in the first stage were reviewed by two authors (MPD and FCL). In instances where there were differing opinions between the two initial reviewers, they engaged in discussions to reach a consensus. To ensure impartiality and a final resolution, a third author (JFE) conducted a third review, overseeing the process and making the ultimate decision in cases where disagreements persisted.

##### Additional strategy

The list of references from all the included articles after the selection process was carefully assessed to identify any additional articles that may not have been captured during the initial screening due to specific keywords. This step aimed to ensure a comprehensive approach by exploring the reference lists of the included articles for relevant references that might have been missed in the initial search.

Through this approach, the review aimed to minimize the possibility of excluding relevant studies and to provide a comprehensive and robust synthesis of the available literature on the subject matter.

##### Inclusion criteria

The following criteria were utilized to guide the selection process. Exclusion was considered for a reference if it met at least one of the exclusion criteria, or if it failed to meet at least one of the inclusion criteria.

#### A.2.5 Data-charting

A data-charting form was collaboratively developed to facilitate the extraction of relevant information from the selected studies. The extraction process was conducted manually, with two authors (MPD and FCL) independently extracting data from the first five studies. Subsequently, the authors convened to verify the adequacy of the process and ensure consistency in data extraction. The remaining studies were then divided between the two authors for data extraction.

During the data extraction phase, specific information pertaining to the research questions was identified and recorded. To account for any uncertainties or variables requiring additional review, a “To be determined” modality was included for each extracted variable. This modality serves as a reminder for a second author to review and validate the extracted data, ensuring accuracy and reliability.

## B Mathematical derivations pertaining to Objective 3

### B.1 Notations used in the Appendix

Recall that in the current setting, there are *n* individuals horizontally partitioned across *K* data storage nodes. Each node’s dataset is ![Graphic][136]</img>, where 1 *≤ k ≤ K* and ![Graphic][137]</img> represents measurements on the *i*th individual at node ![Graphic][138]</img> denotes their response variable and ![Graphic][139]</img> denotes their covariate vector. *n*(*k*) is the total sample size at node *k*. The combined datasets 𝒟 (1), …, 𝒟 (*K*) make up the whole dataset without any duplicated individuals such that ![Graphic][140]</img>.

The current GLM framework assumes that there exists unknown parameters ***β****⋆* *∈* ℝ *p*+1 *∈* ℝ and *ϕ**⋆* *>* 0, and known model-specific functions *b, c, g, h* such that with ![Graphic][141]</img> and ![Graphic][142]</img>, we have ![Graphic][143]</img>, where for any ***β*** = [*β*, *β*1, …, *β**p*]*⊤* *∈* ℝ *p*+1 and *ϕ*, ![Formula][144]</img>  where *b* is such that ![Graphic][145]</img>, with *b**′* (*x*) = *∂b*(*x*)/*∂x*.

We also recall the definition of ***D***(*k*)(***β***) *∈* ℝ *p*+1 at page 13: ![Formula][146]</img>  as well as the one of ***V*** (*k*)(***β***) at page 13: ![Formula][147]</img>  

Finally, let us reiterate the definitions of *E*(*k*) in equation (4) and *F*(*k*) in equation (5), which are expressed as follows: ![Formula][148]</img>  and ![Formula][149]</img>  

### B.2 General estimation in a pooled centralized setting

The likelihood of the full dataset ![Graphic][150]</img>, in a setting where the likelihood contribution of each node would be given by the set of weights ![Graphic][151]</img>, is given by ![Formula][152]</img>  

Pooled maximum likelihood estimates of ***β****⋆* and *ϕ**⋆* are found by calculating a set of values ![Graphic][153]</img> that maximizes (15). This is usually done in two steps. In a first step, equating the gradient with respect to the ***β*** parameters to 0 yields a set of equations that are independent of *ϕ* which, in our framework, are given by ![Formula][154]</img>  

As *g*(*−*1) is often non-linear, iterative methods are necessary to solve the latter equations. When a solution exists and is unique (this is the case under general conditions [53]), the resulting estimator ![Graphic][155]</img> is called the *maximum likelihood estimator*.

In a second step, using ![Graphic][156]</img>, a maximum likelihood estimator of *ϕ**⋆* can be obtained by solving ![Formula][157]</img>  

The above equations can be further reduced when ![Graphic][158]</img>, which happens when *g* is canonical, since in this case, *h*(*x*) *≡ x*.

When *ϕ**⋆* is unknown, it can be estimated by differentiating the log-likelihood at ![Graphic][159]</img> with respect to *ϕ* and equating it to 0. Indeed, since the likelihood equations of ***β*** do not involve *ϕ*, it always holds that ![Formula][160]</img>  

Proceeding in this way yields the following equation for a maximum likelihood estimator of *ϕ* to satisfy: ![Formula][161]</img>  

### B.3 Calculations related to unequal sample sizes and uneven between-nodes covariate distributions

The theoretical validity of each algorithm presented in section 3.3 relies on two main components:

1.  An asymptotic normality result for the estimator of the ***β*** parameters involved;

2.  The consistency, i.e., convergence in probability to the true value, of the estimator for the asymptotic variance-covariance matrix involved in the aforementioned asymptotic normality result. This, in turn, depends on the consistency of the estimator of *ϕ* when the latter is unknown.

Since the current paper is already quite extensive, we will provide theoretical arguments for the asymptotic normality result only, as it is arguably the most interesting from a theoretical perspective. The proof of consistency of the variance-covariance matrix is a lengthy and technical exercise that can be accomplished using our arguments in combination with standard M-estimation theorems, which can be found, for example, in [48], chapter 5.

#### B.3.1 Conditions used to establish asymptotic normality results

The following conditions will be used. For *ℓ ∈* {0, 1, 2, 3}, let ![Formula][162]</img>  

Also, in what follows, for any vector ***a*** ∈ ℝ *p*+1, one defines ∥***a***∥∞ = max1*≤j≤p*+1 |[***a***]*j*| and ![Graphic][163]</img>.

##### Conditions C

(C1) For *k* ∈ {1, …, *K*}, *n*(*k*)/ →*n p*(*k*) *>* 0 as *n* → ∞, and *K* ≥ 2 is finite;

(C2) *b* and *h* are three times continuously differentiable;

(C3) For ![Graphic][164]</img> is a set of i.i.d. random vectors with finite sixth marginal moments, i.e., ![Graphic][165]</img>, and ![Graphic][166]</img>. Further, ![Graphic][167]</img> is positive definite, where ![Formula][168]</img>  

(C4) The ***β***-parameter space Θ *⊂* ℝ*p*+1 considered for the search of ***β****⋆* is compact, and ***β****⋆* lies in the interior of Θ. Further, one has *E*{***D***(*k*)(***β***)} = 0 if and only if ***β*** = ***β****⋆*.

(C5) For ![Graphic][169]</img>, where ϒ*ℓ*(**x**) = sup |*h**ℓ*(***β****⊤***x**)|. Moreover, for ![Graphic][170]</img> and ![Graphic][171]</img>, where ![Graphic][172]</img> and ![Graphic][173]</img>.

Assumption (C1) states that each data node has a non-negligible proportion of the data. Assumption (C2) imposes a smoothness condition on the known quantities involved in the definition of the GLM, enabling the use of standard theoretical arguments to derive the asymptotic normality of the estimated coefficients. It is not restrictive. The assumption (C3) that the within-node predictor distribution is the same across all individuals is made to simplify the arguments and to make them more concise. It could be relaxed in various ways, for example, by assuming equal first and second-order moments of relevant quantities instead of the entire distribution.

The compactness of Θ in Condition (C4) is used to establish that ***D***(*k*)(***β***) and ***V*** (*k*)(***β***) are uniformly consistent across all possible values for ***β****⋆*, which is a commonly used assumption in maximum likelihood estimation. The identification condition ensures that ***β****⋆* is the unique value that maximizes the expectation of the node-specific likelihood.

Assumption (C5) is a technical requirement to establish a uniform consistency result for ***D***(*k*)(***β***) and ***V*** (*k*)(***β***). It is satisfied when the first, second, and third-order derivatives of *h* and *b* are bounded, as long as ![Graphic][174]</img>. More generally, it imposes a condition on the tails of the distribution of the ![Graphic][175]</img>. For example, in Poisson regression, where *h*(*x*) = *x* and *b′* (*x*) = *e**x*, this assumption is satisfied if ![Graphic][176]</img>, where ![Graphic][177]</img> This condition holds, for example, when the ![Graphic][178]</img> s are normally distributed or have compact support.

#### B.3.2 Theory for the pooled centralized setting estimator

Proceeding as in the proof of Lemma 5 one can show that ![Graphic][179]</img>. From there, one has, in view of Lemma 7, that ![Formula][180]</img>  

Since ![Graphic][181]</img> (see Lemma 7), and as ***D***(*k*) is *O*ℙ (*n**−*1/2) (see Lemma 2), one obtains that ![Formula][182]</img>  

Lemma 2 implies ![Graphic][183]</img> converges in distribution to a centred normal random variable with covariance matrix ![Graphic][184]</img> for each 1 *≤ k ≤ K*. Since the 𝒟(*k*)’s are mutually independent, as *K* is finite, and because *n*/*n*(*k*) *→* 1/*p*(*k*) as *n →* ∞, then, in view of the above equation, Slutsky’s theorem ensures that ![Formula][185]</img>  with ![Graphic][186]</img>.

#### B.3.3 Theory for the adapted simple averaging estimator

Since ![Graphic][187]</img>, then, using the definition of ![Graphic][188]</img>, one has ![Formula][189]</img>  

By Lemma 6, under Conditions (C1) to (C5), it holds as *n →* ∞ that, for all ![Graphic][190]</img> converges in distribution to a centred normal random variable with variance-covariance matrix given by ![Graphic][191]</img>. Since the *𝒟* (*k*)’s are mutually independent, as *K* is finite, and because *n*/*n*(*k*) *→* 1/*p*(*k*) as *n →* ∞, then, in view of the above equation, Slutsky’s theorem ensures that ![Formula][192]</img>  

#### B.3.4 Theory for the adapted single distributed Newton-Raphson updating estimator

Let ![Graphic][193]</img> denote the single distributed Newton-Raphson updating estimator ![Graphic][194]</img> of ***β****⋆*. One has from (10) that ![Formula][195]</img>  

Since ![Graphic][196]</img>, ![Formula][197]</img>  

Hence, ![Formula][198]</img>  

One concludes by re-arranging terms in the preceding equation that ![Formula][199]</img>  

Since Lemma 7 ensures the relationship ![Graphic][200]</img>, the right-hand side of the last equation is asymptotically equivalent to the right-hand side of (18). Hence, one concludes that ![Formula][201]</img>  where **Σ**Pooled is as above.

#### B.3.5 Theory for the adapted multiple distributed Newton-Raphson updating estimator

Let ![Graphic][202]</img> denote the multiple distributed Newton-Raphson updatings estimator. When iterations are conducted until convergence, the obtained estimator of ***β****⋆* is equal to ![Graphic][203]</img>. Hence, ![Formula][204]</img>  

#### B.3.6 Theory for the distributed estimating equations estimator

Let ![Graphic][205]</img> denote the obtained distributed estimating equations estimator. Since it has been established above that ![Graphic][206]</img> one obtains from a multivariate Taylor expansion that it holds uniformly in *k ∈* {1, …, *K*} and as *n →* ∞ that ![Formula][207]</img>  

Recalling the definitions of ![Graphic][208]</img> and ![Graphic][209]</img> from (13) and (14) we hence have ![Formula][210]</img>  where, to obtain the last line, we used the fact that Lemma 7 ensures ![Graphic][211]</img>. Rearranging terms and considering ![Graphic][212]</img>, the corresponding right-hand side is then asymptotically equivalent to the right-hand side of (18), and one concludes ![Formula][213]</img>  

#### B.3.7 Theory for the distributed estimation using a single gradient-enhanced log-likelihood

Let ![Graphic][214]</img> to denote the surrogate likelihood estimator computed at node *k* = 1, and recall that ![Graphic][215]</img> satisfies ![Formula][216]</img>  

As ![Graphic][217]</img>, and since Lemma 7 guarantees that ![Graphic][218]</img>, one has from Equation (19) that it holds for each *k ∈* {1, …, *K*} that ![Graphic][219]</img>.

Hence, ![Formula][220]</img>  where one recalls that ![Graphic][221]</img>.

Next, proceeding as in the proof of Lemma 5 one can show that ![Graphic][222]</img>. By Lemma 7, the latter result ensures that ![Graphic][223]</img>. In view of this result, combining the multivariate Taylor’s theorem, the equality *∇****β*** ***D***(*k*)(***β***) = *−****V***(*k*) (***β***) and the fact that ![Graphic][224]</img> yields the relationship ![Graphic][225]</img> and therefore ![Graphic][226]</img>. Moreover, in view of Lemma 6 one has ![Formula][227]</img>  

Denoting by ***I****p*+1 the *p* + 1 square identity matrix, one obtains by combining the derived expression for ![Graphic][228]</img> with (21) and (22) that ![Formula][229]</img>  

Since the ***D***(*k*)’s are *O*P(*n**−*1/2) (see Lemma 2), one deduces that ![Formula][230]</img>  

Therefore, ![Graphic][231]</img> converges in distribution to a mean 0 multivariate normal random variable with variance- covariance matrix given by ![Formula][232]</img>  

If two iterations are executed, one first uses the fact that ![Formula][233]</img>  

From the last equation, an application of the multivariate Taylor expansion combined with Lemma 2 and Lemma 7 ensures ![Graphic][234]</img>. Hence, one obtains that ![Formula][235]</img>  With ![Graphic][236]</img> and ![Graphic][237]</img>, the last equation expresses as ![Formula][238]</img>  

Since from (23) one has ![Formula][239]</img>  it follows that ![Formula][240]</img>  

Hence, in general, the asymptotic distribution of ![Graphic][241]</img> does not match that of ![Graphic][242]</img> in (23).

#### B.3.8 Theory for the distributed estimation using multiple gradient-enhanced log-likelihoods

Let ![Graphic][243]</img> to denote the surrogate likelihood estimator computed at node *k*. Proceeding as we did in the last section to derive (23), one can show that it holds for all *k ∈* {1, …, *K*} that as *n →* ∞, ![Formula][244]</img>  

Therefore, letting ![Graphic][245]</img>, one obtains that ![Formula][246]</img>  

Therefore, ![Graphic][247]</img> converges in distribution to a mean 0 multivariate normal random variable with variance-covariance matrix given by ![Formula][248]</img>  

### B.4 Auxiliary results

The following lemma transfers the conditions on the marginal moments imposed in (C3) into a condition on ![Graphic][249]</img> that is used in the proof of Lemma 2.

Lemma 1.
*Denote by* **x** *a p* + 1 *dimensional random vector such that* ![Graphic][250]</img> *for all* 1 *≤ j ≤ p* + 1. *Then* ![Graphic][251]</img>.

*Proof*. Note first that for a multiindex ***α*** ∈ ℕ*p*+1 such that ∥***α***∥ 1 = 4 we have essentially 5 possibilities for ***α***: There is one non-zero element ![Graphic][252]</img> at position *j*1, there are two non-zero elements ![Graphic][253]</img> and ![Graphic][254]</img>, in ***α*** where we either have ![Graphic][255]</img> and ![Graphic][256]</img> or ![Graphic][257]</img>, there are three non-zero elements ![Graphic][258]</img>, in ***α***, and lastly there are four non-zero elements ![Graphic][259]</img> for *j*1≠*j*2≠*j*3≠*j*4. Concerning the expectation of ∣ **x ∣** ***α*** for a random vector **x** we have by applying the (generalized) Hölder inequality for the five cases that ![Formula][260]</img>  

Given that ![Graphic][261]</img> implies also ![Graphic][262]</img> for 1 *≤ ℓ ≤* 5, we see that *E* {|**x**|***α***} *<* ∞ for every multiindex ***α*** with ∥***α***∥1 = 4. Applying the multinomial theorem to ![Graphic][263]</img> now shows that ![Formula][264]</img>  

Lemma 2.
*Under Conditions (C2)–(C5), it holds that* ![Formula][265]</img>  *for all k ∈* {1, …, *K*}.

*Proof*. Let ![Graphic][266]</img> such that ![Graphic][267]</img>. In this notation we have ![Graphic][268]</img>. For any ***β***1, ***β***2 *∈* Θ one has ![Formula][269]</img>  

Hence, ![Formula][270]</img>  

Since by Condition (C2) *h**′* is differentiable, then, recalling the definition of ϒ*ℓ* in Condition (C5), one deduces from the mean-value theorem and the dual version of the Cauchy–Schwarz inequality |**y**⊤ **x**| ≤ ∥**y**∥∞∥**x**∥1 that ![Formula][271]</img>  

Recalling the definition of ![Graphic][272]</img> in Condition (C5) one similarly has ![Formula][273]</img>  

As one also has ![Graphic][274]</img> and![Graphic][275]</img>, the above equations imply with ![Graphic][276]</img> that ![Formula][277]</img>  where we set ![Formula][278]</img>  

Using first the Hölder and then twice the Minkowski (triangle) inequality we now have ![Formula][279]</img>  

Concerning the individual terms in the parentheses we have by the (generalized) Hölder’s inequality with 1/2 = 1/4 + 1/4 that ![Formula][280]</img>  

Given these estimates we then have ![Formula][281]</img>  

In view of the last equation, Condition (C5) in combination with Condition (C3) and Lemma 1 ensures ![Graphic][282]</img>. Given that the arguments so far are build on the ∥ · ∥ ∞ norm, conditions (C3) and (C4) now show that each component ![Formula][283]</img>  of ![Graphic][284]</img> has the property ![Formula][285]</img>  

This allows to apply Theorem 19.5 in [48] (see example 19.7) to conclude that each component in bounded in probability. Combining this with [49, Lemma 1.4.3] shows that the same is true when considering all components in ![Graphic][286]</img> simultaneously. This finally shows that (26) holds.

Lemma 3.
*Under Conditions (C2)–(C5), it holds as n →* ∞ *that* ![Graphic][287]</img> *converges in distribution to a centred normal random variable with covariance matrix*![Graphic][288]</img>.

*Proof*. To prove the Lemma we use the Cramèr-Wold device. That is, we show that, for any constant **a** ∈ ℝ *p* +1, the random variable ![Graphic][289]</img> converges in distribution to a centred normal random variable, with variance ![Graphic][290]</img>. To do this, first note that as ![Graphic][291]</img>, we have *E*{**a***⊤****D***(*k*)(***β****⋆*)} = 0 and ![Formula][292]</img>  

To obtain the second line, we used the fact that var ![Graphic][293]</img> and the assumption that the![Graphic][294]</img> ‘s are i.i.d. for a given *k*. For the third line, we used the equality ![Graphic][295]</img>.

As the ![Graphic][296]</img> ’s are i.i.d. random variables, it follows ![Graphic][297]</img> is itself a sum of i.i.d. random variables, with mean 0 and finite (constant) variance. Therefore, an application of the Lindeberg-Lévy central limit theorem ensures that ![Graphic][298]</img> converges in law to a centred normal distribution with variance![Graphic][299]</img>.

An application of Cramér-Wold theorem concludes the proof of the Lemma.

The Lemma below can be proven using similar arguments, so their proofs are omitted.

Lemma 4.
*Under Conditions (C1) to (C5), it holds as n →* ∞ *that* ![Formula][300]</img>  *for all k* ∈ {1, …, *K*}.

The next lemma establishes the consistency of ![Graphic][301]</img>.

Lemma 5.
*Under Conditions (C1) to (C5), it holds as n →* ∞ *that* ![Graphic][302]</img>(1) *for all* 1 *≤ k ≤ K*.

*Proof*. To prove the Lemma, the goal is the apply Theorem 5.9 in [48] with *θ ≡* ***β***, Ψ*n* *≡* ***D***(*k*) and Ψ *≡ E****D***(*k*). To do this, it is required to verify that (1) it holds as *n →* ∞ that ![Formula][303]</img>  and (2) that for every *ϵ >* 0,![Graphic][304]</img>.

That (1) holds follows from the fact that under the Lemma’s condition, Lemma 2 applies. That (2) holds follows from the fact that under Condition (C2) the mapping ***β*** *→ E*{***D***(*k*)(***β***)} is continuous, and that under Condition (C4) one has *E*{***D***(k) (***β***)} = 0 if and only if ***β*** = ***β***⋆. Hence, Theorem 5.9 applies, thereby ensuring that ![Graphic][305]</img> (1).

The fact that it holds for all 1*≤ k ≤ K* follows from the fact that under Condition (C1) *K* is finite.

The last three Lemmas ensure the following result.

Lemma 6.
*Under Conditions (C1) to (C5), it holds as n →* ∞ *that, for all* 1 *≤ k ≤ K*, ![Formula][306]</img>  

*Consequently*, ![Graphic][307]</img> *converges in distribution to a centred normal random variable with variance-covariance matrix given by* ![Graphic][308]</img>.

*Proof*. See Theorem 5.21 in [48].

Lemma 7.
*Under Conditions (C1) to (C5), it holds as n →* ∞ *that, for all* 1 *≤ k ≤ K, and any* ![Graphic][309]</img> *such that* ![Graphic][310]</img>, ![Formula][311]</img>  

*Proof*. Let Ψ(***β***) = *E*{***V*** (*k*)(***β***)}. Lemma 4 implies that ![Formula][312]</img>  

Since it is assumed that ![Graphic][313]</img> and as ![Graphic][314]</img>, the result follows from the continuous mapping theorem.

## Acknowledgments

We would like to thank the GRIIS members who enriched this work via multiple conversations over the last few months and kept us going. We would also like to thank Pr. Kim McGrail for her very insightful comments on this work.

*   Received December 21, 2023.
*   Revision received December 21, 2023.
*   Accepted December 24, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), CC BY-NC 4.0, as described at [http://creativecommons.org/licenses/by-nc/4.0/](http://creativecommons.org/licenses/by-nc/4.0/)

## References

1.  [1]. Alan Agresti. Foundations of linear and generalized linear models. John Wiley & Sons, 2015.
    
    
2.  [2]. Hilary Arksey and  Lisa O’Malley. Scoping studies: towards a methodological framework. International journal of social research methodology, 8(1):19–32, 2005. Publisher: Taylor & Francis.
    
    
3.  [3]. E. Atta-Asiamah and  M. Yuan. Distributed inference for degenerate u-statistics. Stat, 8(1), 2019.
    
    
4.  [4]. Moulinath Banerjee,  Cécile Durot, and  Bodhisattva Sen. Divide and conquer in nonstandard problems and the super-efficiency phenomenon. The Annals of Statistics, 47(2), april 2019.
    
    
5.  [5]. Shahab Basiri,  Esa Ollila, and  Visa Koivunen. Robust, scalable, and fast bootstrap method for analyzing large scale data. IEEE Transactions on Signal Processing, 64(4):1007–1017, 2016.
    
    
6.  [6]. Heather Battey,  Jianqing Fan,  Han Liu,  Junwei Lu, and  Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3):1352–1382, 2018.
    
    
7.  [7]. Oya Beyan,  Ananya Choudhury,  Johan van Soest,  Oliver Kohlbacher,  Lukas Zimmermann,  Holger Stenzhorn,  Md. Rezaul Karim,  Michel Dumontier,  Stefan Decker,  Luiz Olavo  Bonino da  Silva Santos, and  Andre Dekker. Distributed Analytics on Sensitive Medical Data: The Personal Health Train. Data Intelligence, 2(1-2):96–107, 01 2020.
    
    
8.  [8]. S. Bruce,  Z. Li,  H.-C. Yang, and  S. Mukhopadhyay. Nonparametric distributed learning architecture for big data: Algorithm and applications. IEEE Transactions on Big Data, 5(2):166–179, 2019.
    
    
9.  [9]. C. Chang,  Z. Bu, and  Q. Long. Cedar: communication efficient distributed analysis for regressions. Biometrics, 2022.
    
    
10. [10]. S.X. Chen and  L. Peng. Distributed statistical inference for massive data. Annals of Statistics, 49(5):2851–2869, 2021.
    
    
11. [11]. Xueying Chen and  Min-ge Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24:1655–1684, 2014.
    
    
12. [12]. Fengrui Di,  Lei Wang, and  Heng Lian. Communication-efficient estimation and inference for high-dimensional quantile regression based on smoothed decorrelated score. Statistics in medicine, 41(25):5084–5101, 2022.
    
    
13. [13]. R. Duan,  Y. Ning, and  Y. Chen. Heterogeneity-aware and communication-efficient distributed statistical inference. Biometrika, 109(1):67–83, 2022.
    
    
14. [14]. Rui Duan,  Mary Regina Boland,  Zixuan Liu,  Yue Liu,  Howard H Chang,  Hua Xu,  Haitao Chu,  Christopher H Schmid,  Christopher B Forrest,  John H Holmes,  Martijn J Schuemie,  Jesse A Berlin,  Jason H Moore, and  Yong Chen. Learning from electronic health records across multiple sites: A communication-efficient and privacypreserving distributed algorithm. Journal of the American Medical Informatics Association, 27(3):376–385, march 2020.
    
    
15. [15]. Rui Duan,  Chongliang Luo,  Martijn J Schuemie,  Jiayi Tong,  C Jason Liang,  Howard H Chang,  Mary Regina Boland,  Jiang Bian,  Hua Xu,  John H Holmes,  Christopher B Forrest,  Sally C Morton,  Jesse A Berlin,  Jason H Moore,  Kevin B Mahoney, and  Yong Chen. Learning from local to global: An efficient distributed algorithm for modeling time-to-event data. Journal of the American Medical Informatics Association, 27(7):1028–1036, july 2020.
    
    
16. [16]. M.J. Edmondson,  C. Luo,  M. Nazmul Islam,  N.E. Sheils,  J. Buresh,  Z. Chen,  J. Bian, and  Y. Chen. Distributed quasi-poisson regression algorithm for modeling multi-site count outcomes in distributed data networks. Journal of Biomedical Informatics, 131, 2022.
    
    
17. [17]. J. Fan,  Y. Guo, and  K. Wang. Communication-efficient accurate statistical estimation. Journal of the American Statistical Association, 2021.
    
    
18. [18]. Yuan Gao,  Weidong Liu,  Hansheng Wang,  Xiaozhou Wang,  Yibo Yan, and  Riquan Zhang. A review of distributed statistical inference. Statistical Theory and Related Fields, 6(2):89–99, may 2022.
    
    
19. [19]. G. Guo,  Y. Sun, and  X. Jiang. A partitioned quasi-likelihood for distributed statistical inference. Computational Statistics, 35(4):1577–1596, 2020.
    
    
20. [20]. E.C. Hector and  P.X.-K. Song. Joint integrative analysis of multiple data sources with correlated vector outcomes. Annals of Applied Statistics, 16(3):1700–1717, 2022.
    
    
21. [21]. C. Huang and  X. Huo. A distributed one-step estimator. Mathematical Programming, 174(1):41–76, 2019.
    
    
22. [22]. Xiaoming Huo and  Shanshan Cao. Aggregated inference. Wiley Interdisciplinary Reviews: Computational Statistics, 11(1):e1451, 2019.
    
    
23. [23]. Samireh Jalali and  Claes Wohlin. Systematic literature studies: database searches vs. backward snowballing. In Proceedings of the ACM-IEEE international symposium on Empirical software engineering and measurement, pages 29–38, 2012.
    
    
24. [24]. M.I. Jordan,  J.D. Lee, and  Y. Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681, 2019.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/01621459.2018.1429274&link_type=DOI) 

25. [25]. R.C.S. Lai,  J. Hannig, and  T.C.M. Lee. Method g: Uncertainty quantification for distributed data problems using generalized fiducial inference. Journal of Computational and Graphical Statistics, 30(4):934–945, 2021.
    
    
26. [26]. Danielle Levac,  Heather Colquhoun, and  Kelly K. O’Brien. Scoping studies: advancing the methodology. Implementation science, 5:1–9, 2010. Publisher: Springer.
    
    
27. [27]. Wentao Li,  Jiayi Tong,  Md Monowar Anjum,  Noman Mohammed,  Yong Chen, and  Xiaoqian Jiang. Federated learning algorithms for generalized mixed-effects model (glmm) on horizontally partitioned data from distributed sources. BMC Medical Informatics and Decision Making, 22(1):269, 2022.
    
    
28. [28]. N. Lin and  R. Xi. Fast surrogates of u-statistics. Computational Statistics & Data Analysis, 54(1):16–24, january 2010.
    
    
29. [29]. Nan Lin and  Ruibin Xi. Aggregated estimating equation estimation. Statistics and Its Interface, 4(1):73–83, 2011.
    
    
30. [30]. M. Liu,  Z. Shang, and  G. Cheng. Nonparametric distributed learning under general designs. Electronic Journal of Statistics, 14(2):3070–3102, 2020.
    
    
31. [31]. Chia-Lun Lu,  Shuang Wang,  Zhanglong Ji,  Yuan Wu,  Li Xiong,  Xiaoqian Jiang, and  Lucila Ohno-Machado. Webdisco: a web service for distributed cox model learning without patient-level data sharing. Journal of the American Medical Informatics Association, 22(6):1212–1219, 2015.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/jamia/ocv083&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26159465&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F24%2F2023.12.21.23300389.atom) 

32. [32]. J. Luo,  Q. Sun, and  W.-X. Zhou. Distributed adaptive huber regression. Computational Statistics and Data Analysis, 169, 2022.
    
    
33. [33]. Lan Luo and  Lexin Li. Online two-way estimation and inference via linear mixed-effects models. Statistics in medicine, 41(25):5113–5133, 2022.
    
    
34. [34]. S. Minsker. Distributed statistical estimation and rates of convergence in normal approximation. Electronic Journal of Statistics, 13(2):5213–5252, 2019.
    
    
35. [35]. E. Mozafari-Majd and  V. Koivunen. Two-stage robust and sparse distributed statistical inference for large-scale data. IEEE Transactions on Signal Processing, 70:5351–5365, 2022.
    
    
36. [36]. Emadaldin Mozafari-Majd and  Visa Koivunen. Robust variable selection and distributed inference using t-based estimators for large-scale data. In 2020 28th European Signal Processing Conference (EUSIPCO), pages 2453–2457, 2021.
    
    
37. [37]. E. Nezakati and  E. Pircalabelu. Unbalanced distributed estimation and inference for the precision matrix in gaussian graphical models. Statistics and Computing, 33(2), 2023.
    
    
38. [38]. J.A. Park,  T.H. Kim,  J. Kim, and  Y.R. Park. Wicox: Weight-based integrated cox model for time-to-event data in distributed databases without data-sharing. IEEE Journal of Biomedical and Health Informatics, 27(1):526–537, 2023.
    
    
39. [39]. Micah DJ Peters,  Christina M Godfrey,  Hanan Khalil,  Patricia McInerney,  Deborah Parker, and Cassia Baldini Soares. Guidance for conducting systematic scoping reviews. JBI Evidence Implementation, 13(3):141–146, 2015.
    
    
40. [40]. Serge B. Provost,  Hossein Zareamoghaddam,  S. Ejaz Ahmed, and  Hyung-Tae Ha. The generalized pearson family of distributions and explicit representation of the associated density functions. Communications in Statistics - Theory and Methods, 51(16):5590–5606, 2022.
    
    
41. [41]. Jonathan Rosenblatt and  Boaz Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference, 5(4):379–404, ecember 2016. arXiv:1407.2724 [math, stat].
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/imaiai/iaw013&link_type=DOI) 

42. [42]. Ohad Shamir,  Nati Srebro, and  Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In International conference on machine learning, pages 1000–1008. PMLR, 2014.
    
    
43. [43]. Chengchun Shi,  Wenbin Lu, and  Rui Song. A massive data framework for m-estimators with cubic-rate. Journal of the American Statistical Association, 113(524):1698–1709, 2018. Publisher: Taylor & Francis.
    
    
44. [44]. D. Shu,  J.G. Young, and  S. Toh. Privacy-protecting estimation of adjusted risk ratios using modified poisson regression in multi-center studies. BMC Medical Research Methodology, 19(1), 2019.
    
    
45. [45]. Bimal K Sinha,  Joachim Hartung, and  Guido Knapp. Statistical meta-analysis with applications. John Wiley & Sons, 2011.
    
    
46. [46]. Sengwee Toh,  Robert Wellman,  R Yates Coley,  Casie Horgan,  Jessica Sturtevant,  Erick Moyneur,  Cheri Janning,  Roy Pardee,  Karen J Coleman,  David Arterburn,  Kathleen McTigue,  Jane Anau, and  Andrea J Cook. Combining distributed regression and propensity scores: a doubly privacy-protecting analytic method for multicenter research. Clinical Epidemiology, Volume 10:1773–1786, november 2018.
    
    
47. [47]. Jiayi Tong,  Rui Duan,  Ruowang Li,  Martijn J. Scheuemie,  Jason H. Moore, and  Yong Chen. Robust-odal: Learning from heterogeneous health systems without sharing patient-level data. In Biocomputing 2020, pages 695–706, Kohala Coast, Hawaii, USA, ecember 2019. WORLD SCIENTIFIC.
    
    
48. [48]. Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
    
    
49. [49]. Aad W Van Der Vaart and  Jon A Wellner. Weak convergence and empirical processes: with applications to statistics. Springer, 1996.
    
    
50. [50]. Thanh Vinh Vo,  Trong Nghia Hoang,  Young Lee, and  Tze-Yun Leong. Federated Estimation of Causal Effects from Observational Data, may 2021. arXiv:2106.00456 [cs, stat].
    
    
51. [51]. S. Volgushev,  S.-K. Chao, and  G. Cheng. Distributed inference for quantile regression processes. Annals of Statistics, 47(3):1634–1662, 2019.
    
    
52. [52]. X. Wang,  Z. Yang,  X. Chen, and  W. Liu. Distributed inference for linear support vector machine. Journal of Machine Learning Research, 20, 2019.
    
    
53. [53]. R. W. M. Wedderburn. On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika, 63(1):27–32, 04 1976.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biomet/63.1.27&link_type=DOI) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1976BL81800003&link_type=ISI) 

54. [54]. Claes Wohlin. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th international conference on evaluation and assessment in software engineering, pages 1–10, 2014.
    
    
55. [55]. S. Wu,  Y. Xu,  Z. Feng,  X. Yang,  X. Wang, and  X. Gao. Multiple-platform data integration method with application to combined analysis of microarray and proteomic data. BMC Bioinformatics, 13(1), 2012.
    
    
56. [56]. Yuan Wu,  Xiaoqian Jiang,  Jihoon Kim, and  Lucila Ohno-Machado. Grid binary logistic regression (glore): building shared models without sharing data. Journal of the American Medical Informatics Association, 19(5):758–764, september 2012.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1136/amiajnl-2012-000862&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22511014&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F12%2F24%2F2023.12.21.23300389.atom) 

57. [57]. Ruoxuan Xiong,  Allison Koenecke,  Michael Powell,  Zhu Shen,  Joshua T. Vogelstein, and  Susan Athey. Federated Causal Inference in Heterogeneous Observational Data, april 2022. arXiv:2107.11732 [cs, econ, q-bio, stat].
    
    
58. [58]. X. Yue,  R.A. Kontar, and  A.M.E. Gómez. Federated data analytics: A study on linear models. IISE Transactions, 2022.
    
    
59. [59]. Likun Zhang,  Enrique del Castillo,  Andrew J. Berglund,  Martin P. Tingley, and  Nirmal Govind. Computing confidence intervals from massive data via penalized quantile smoothing splines. Computational Statistics & Data Analysis, 144:106885,–25, 2020.
    
    
60. [60]. Yuchen Zhang,  John C. Duchi, and  Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6792–6792, Maui, HI, USA, ecember 2012. IEEE.
    
    
61. [61]. Tianqi Zhao,  Guang Cheng, and  Han Liu. A partially linear framework for massive heterogeneous data. The Annals of Statistics, 44(4), august 2016.

 [1]: /embed/inline-graphic-1.gif
 [2]: /embed/inline-graphic-2.gif
 [3]: /embed/inline-graphic-3.gif
 [4]: /embed/inline-graphic-4.gif
 [5]: /embed/inline-graphic-5.gif
 [6]: /embed/inline-graphic-6.gif
 [7]: /embed/inline-graphic-7.gif
 [8]: /embed/inline-graphic-8.gif
 [9]: /embed/inline-graphic-9.gif
 [10]: /embed/inline-graphic-10.gif
 [11]: /embed/graphic-1.gif
 [12]: /embed/graphic-2.gif
 [13]: /embed/inline-graphic-11.gif
 [14]: /embed/inline-graphic-12.gif
 [15]: /embed/inline-graphic-13.gif
 [16]: /embed/inline-graphic-14.gif
 [17]: /embed/inline-graphic-15.gif
 [18]: /embed/graphic-3.gif
 [19]: /embed/inline-graphic-16.gif
 [20]: /embed/graphic-10.gif
 [21]: /embed/graphic-11.gif
 [22]: /embed/graphic-12.gif
 [23]: /embed/inline-graphic-17.gif
 [24]: /embed/inline-graphic-18.gif
 [25]: /embed/inline-graphic-19.gif
 [26]: /embed/graphic-13.gif
 [27]: /embed/graphic-14.gif
 [28]: /embed/inline-graphic-20.gif
 [29]: /embed/inline-graphic-21.gif
 [30]: /embed/inline-graphic-22.gif
 [31]: /embed/inline-graphic-23.gif
 [32]: /embed/graphic-15.gif
 [33]: /embed/graphic-16.gif
 [34]: /embed/inline-graphic-24.gif
 [35]: /embed/inline-graphic-25.gif
 [36]: /embed/graphic-17.gif
 [37]: /embed/inline-graphic-26.gif
 [38]: /embed/inline-graphic-27.gif
 [39]: /embed/inline-graphic-28.gif
 [40]: /embed/graphic-18.gif
 [41]: /embed/graphic-19.gif
 [42]: /embed/inline-graphic-29.gif
 [43]: /embed/graphic-21.gif
 [44]: /embed/inline-graphic-30.gif
 [45]: /embed/inline-graphic-31.gif
 [46]: /embed/inline-graphic-32.gif
 [47]: /embed/inline-graphic-33.gif
 [48]: /embed/inline-graphic-34.gif
 [49]: /embed/inline-graphic-35.gif
 [50]: /embed/graphic-22.gif
 [51]: /embed/graphic-23.gif
 [52]: /embed/graphic-24.gif
 [53]: /embed/inline-graphic-36.gif
 [54]: /embed/graphic-25.gif
 [55]: /embed/inline-graphic-37.gif
 [56]: /embed/inline-graphic-38.gif
 [57]: /embed/graphic-26.gif
 [58]: /embed/graphic-27.gif
 [59]: /embed/inline-graphic-39.gif
 [60]: /embed/inline-graphic-40.gif
 [61]: /embed/inline-graphic-41.gif
 [62]: /embed/graphic-28.gif
 [63]: /embed/graphic-29.gif
 [64]: /embed/inline-graphic-42.gif
 [65]: /embed/inline-graphic-43.gif
 [66]: /embed/inline-graphic-44.gif
 [67]: /embed/inline-graphic-45.gif
 [68]: /embed/inline-graphic-46.gif
 [69]: /embed/inline-graphic-47.gif
 [70]: /embed/inline-graphic-48.gif
 [71]: /embed/inline-graphic-49.gif
 [72]: /embed/inline-graphic-50.gif
 [73]: /embed/inline-graphic-51.gif
 [74]: /embed/inline-graphic-52.gif
 [75]: /embed/inline-graphic-53.gif
 [76]: /embed/graphic-30.gif
 [77]: /embed/graphic-31.gif
 [78]: /embed/inline-graphic-54.gif
 [79]: /embed/inline-graphic-55.gif
 [80]: /embed/inline-graphic-56.gif
 [81]: /embed/inline-graphic-57.gif
 [82]: /embed/graphic-33.gif
 [83]: /embed/inline-graphic-58.gif
 [84]: /embed/graphic-34.gif
 [85]: /embed/graphic-35.gif
 [86]: /embed/inline-graphic-59.gif
 [87]: /embed/graphic-36.gif
 [88]: /embed/graphic-37.gif
 [89]: /embed/inline-graphic-60.gif
 [90]: /embed/inline-graphic-61.gif
 [91]: /embed/inline-graphic-62.gif
 [92]: /embed/graphic-39.gif
 [93]: /embed/inline-graphic-63.gif
 [94]: /embed/inline-graphic-64.gif
 [95]: /embed/graphic-40.gif
 [96]: /embed/graphic-41.gif
 [97]: /embed/graphic-42.gif
 [98]: /embed/inline-graphic-65.gif
 [99]: /embed/inline-graphic-66.gif
 [100]: /embed/graphic-43.gif
 [101]: /embed/inline-graphic-67.gif
 [102]: /embed/graphic-44.gif
 [103]: /embed/inline-graphic-68.gif
 [104]: /embed/inline-graphic-69.gif
 [105]: /embed/inline-graphic-70.gif
 [106]: /embed/graphic-45.gif
 [107]: /embed/inline-graphic-71.gif
 [108]: /embed/graphic-46.gif
 [109]: /embed/graphic-47.gif
 [110]: /embed/inline-graphic-72.gif
 [111]: /embed/inline-graphic-73.gif
 [112]: /embed/inline-graphic-74.gif
 [113]: /embed/graphic-48.gif
 [114]: /embed/inline-graphic-75.gif
 [115]: /embed/inline-graphic-76.gif
 [116]: /embed/inline-graphic-77.gif
 [117]: /embed/graphic-50.gif
 [118]: /embed/inline-graphic-78.gif
 [119]: /embed/inline-graphic-79.gif
 [120]: /embed/graphic-51.gif
 [121]: /embed/inline-graphic-80.gif
 [122]: /embed/inline-graphic-81.gif
 [123]: /embed/inline-graphic-82.gif
 [124]: /embed/graphic-52.gif
 [125]: /embed/graphic-53.gif
 [126]: /embed/inline-graphic-83.gif
 [127]: /embed/graphic-54.gif
 [128]: /embed/inline-graphic-84.gif
 [129]: /embed/graphic-55.gif
 [130]: /embed/inline-graphic-85.gif
 [131]: /embed/graphic-56.gif
 [132]: /embed/inline-graphic-86.gif
 [133]: /embed/inline-graphic-87.gif
 [134]: /embed/inline-graphic-88.gif
 [135]: /embed/inline-graphic-89.gif
 [136]: /embed/inline-graphic-90.gif
 [137]: /embed/inline-graphic-91.gif
 [138]: /embed/inline-graphic-92.gif
 [139]: /embed/inline-graphic-93.gif
 [140]: /embed/inline-graphic-94.gif
 [141]: /embed/inline-graphic-95.gif
 [142]: /embed/inline-graphic-96.gif
 [143]: /embed/inline-graphic-97.gif
 [144]: /embed/graphic-62.gif
 [145]: /embed/inline-graphic-98.gif
 [146]: /embed/graphic-63.gif
 [147]: /embed/graphic-64.gif
 [148]: /embed/graphic-65.gif
 [149]: /embed/graphic-66.gif
 [150]: /embed/inline-graphic-99.gif
 [151]: /embed/inline-graphic-100.gif
 [152]: /embed/graphic-67.gif
 [153]: /embed/inline-graphic-101.gif
 [154]: /embed/graphic-68.gif
 [155]: /embed/inline-graphic-102.gif
 [156]: /embed/inline-graphic-103.gif
 [157]: /embed/graphic-69.gif
 [158]: /embed/inline-graphic-104.gif
 [159]: /embed/inline-graphic-105.gif
 [160]: /embed/graphic-70.gif
 [161]: /embed/graphic-71.gif
 [162]: /embed/graphic-72.gif
 [163]: /embed/inline-graphic-106.gif
 [164]: /embed/inline-graphic-107.gif
 [165]: /embed/inline-graphic-108.gif
 [166]: /embed/inline-graphic-109.gif
 [167]: /embed/inline-graphic-110.gif
 [168]: /embed/graphic-73.gif
 [169]: /embed/inline-graphic-111.gif
 [170]: /embed/inline-graphic-112.gif
 [171]: /embed/inline-graphic-113.gif
 [172]: /embed/inline-graphic-114.gif
 [173]: /embed/inline-graphic-115.gif
 [174]: /embed/inline-graphic-116.gif
 [175]: /embed/inline-graphic-117.gif
 [176]: /embed/inline-graphic-118.gif
 [177]: /embed/inline-graphic-119.gif
 [178]: /embed/inline-graphic-120.gif
 [179]: /embed/inline-graphic-121.gif
 [180]: /embed/graphic-74.gif
 [181]: /embed/inline-graphic-122.gif
 [182]: /embed/graphic-75.gif
 [183]: /embed/inline-graphic-123.gif
 [184]: /embed/inline-graphic-124.gif
 [185]: /embed/graphic-76.gif
 [186]: /embed/inline-graphic-125.gif
 [187]: /embed/inline-graphic-126.gif
 [188]: /embed/inline-graphic-127.gif
 [189]: /embed/graphic-77.gif
 [190]: /embed/inline-graphic-128.gif
 [191]: /embed/inline-graphic-129.gif
 [192]: /embed/graphic-78.gif
 [193]: /embed/inline-graphic-130.gif
 [194]: /embed/inline-graphic-131.gif
 [195]: /embed/graphic-79.gif
 [196]: /embed/inline-graphic-132.gif
 [197]: /embed/graphic-80.gif
 [198]: /embed/graphic-81.gif
 [199]: /embed/graphic-82.gif
 [200]: /embed/inline-graphic-133.gif
 [201]: /embed/graphic-83.gif
 [202]: /embed/inline-graphic-134.gif
 [203]: /embed/inline-graphic-135.gif
 [204]: /embed/graphic-84.gif
 [205]: /embed/inline-graphic-136.gif
 [206]: /embed/inline-graphic-137.gif
 [207]: /embed/graphic-85.gif
 [208]: /embed/inline-graphic-138.gif
 [209]: /embed/inline-graphic-139.gif
 [210]: /embed/graphic-86.gif
 [211]: /embed/inline-graphic-140.gif
 [212]: /embed/inline-graphic-141.gif
 [213]: /embed/graphic-87.gif
 [214]: /embed/inline-graphic-142.gif
 [215]: /embed/inline-graphic-143.gif
 [216]: /embed/graphic-88.gif
 [217]: /embed/inline-graphic-144.gif
 [218]: /embed/inline-graphic-145.gif
 [219]: /embed/inline-graphic-146.gif
 [220]: /embed/graphic-89.gif
 [221]: /embed/inline-graphic-147.gif
 [222]: /embed/inline-graphic-148.gif
 [223]: /embed/inline-graphic-149.gif
 [224]: /embed/inline-graphic-150.gif
 [225]: /embed/inline-graphic-151.gif
 [226]: /embed/inline-graphic-152.gif
 [227]: /embed/graphic-90.gif
 [228]: /embed/inline-graphic-153.gif
 [229]: /embed/graphic-91.gif
 [230]: /embed/graphic-92.gif
 [231]: /embed/inline-graphic-154.gif
 [232]: /embed/graphic-93.gif
 [233]: /embed/graphic-94.gif
 [234]: /embed/inline-graphic-155.gif
 [235]: /embed/graphic-95.gif
 [236]: /embed/inline-graphic-156.gif
 [237]: /embed/inline-graphic-157.gif
 [238]: /embed/graphic-96.gif
 [239]: /embed/graphic-97.gif
 [240]: /embed/graphic-98.gif
 [241]: /embed/inline-graphic-158.gif
 [242]: /embed/inline-graphic-159.gif
 [243]: /embed/inline-graphic-160.gif
 [244]: /embed/graphic-99.gif
 [245]: /embed/inline-graphic-161.gif
 [246]: /embed/graphic-100.gif
 [247]: /embed/inline-graphic-162.gif
 [248]: /embed/graphic-101.gif
 [249]: /embed/inline-graphic-163.gif
 [250]: /embed/inline-graphic-164.gif
 [251]: /embed/inline-graphic-165.gif
 [252]: /embed/inline-graphic-166.gif
 [253]: /embed/inline-graphic-167.gif
 [254]: /embed/inline-graphic-168.gif
 [255]: /embed/inline-graphic-169.gif
 [256]: /embed/inline-graphic-170.gif
 [257]: /embed/inline-graphic-171.gif
 [258]: /embed/inline-graphic-172.gif
 [259]: /embed/inline-graphic-173.gif
 [260]: /embed/graphic-102.gif
 [261]: /embed/inline-graphic-174.gif
 [262]: /embed/inline-graphic-175.gif
 [263]: /embed/inline-graphic-176.gif
 [264]: /embed/graphic-103.gif
 [265]: /embed/graphic-104.gif
 [266]: /embed/inline-graphic-177.gif
 [267]: /embed/inline-graphic-178.gif
 [268]: /embed/inline-graphic-179.gif
 [269]: /embed/graphic-105.gif
 [270]: /embed/graphic-106.gif
 [271]: /embed/graphic-107.gif
 [272]: /embed/inline-graphic-180.gif
 [273]: /embed/graphic-108.gif
 [274]: /embed/inline-graphic-181.gif
 [275]: /embed/inline-graphic-182.gif
 [276]: /embed/inline-graphic-183.gif
 [277]: /embed/graphic-109.gif
 [278]: /embed/graphic-110.gif
 [279]: /embed/graphic-111.gif
 [280]: /embed/graphic-112.gif
 [281]: /embed/graphic-113.gif
 [282]: /embed/inline-graphic-184.gif
 [283]: /embed/graphic-114.gif
 [284]: /embed/inline-graphic-185.gif
 [285]: /embed/graphic-115.gif
 [286]: /embed/inline-graphic-186.gif
 [287]: /embed/inline-graphic-187.gif
 [288]: /embed/inline-graphic-188.gif
 [289]: /embed/inline-graphic-189.gif
 [290]: /embed/inline-graphic-190.gif
 [291]: /embed/inline-graphic-191.gif
 [292]: /embed/graphic-116.gif
 [293]: /embed/inline-graphic-192.gif
 [294]: /embed/inline-graphic-193.gif
 [295]: /embed/inline-graphic-194.gif
 [296]: /embed/inline-graphic-195.gif
 [297]: /embed/inline-graphic-196.gif
 [298]: /embed/inline-graphic-197.gif
 [299]: /embed/inline-graphic-198.gif
 [300]: /embed/graphic-117.gif
 [301]: /embed/inline-graphic-199.gif
 [302]: /embed/inline-graphic-200.gif
 [303]: /embed/graphic-118.gif
 [304]: /embed/inline-graphic-201.gif
 [305]: /embed/inline-graphic-202.gif
 [306]: /embed/graphic-119.gif
 [307]: /embed/inline-graphic-203.gif
 [308]: /embed/inline-graphic-204.gif
 [309]: /embed/inline-graphic-205.gif
 [310]: /embed/inline-graphic-206.gif
 [311]: /embed/graphic-120.gif
 [312]: /embed/graphic-121.gif
 [313]: /embed/inline-graphic-207.gif
 [314]: /embed/inline-graphic-208.gif