ABSTRACT
BACKGROUND Introducing data-driven technologies into health systems can enhance population health and streamline care delivery. The use of diverse and geographically varied data is key for tackling health and societal challenges, despite associated technical, ethical, and governance complexities. This study explored the efficacy of federated analytics using general linear models (GLMs) and machine learning (ML) models, comparing outcomes with non-federated data analysis.
METHODS A Conditional Transformation Generative Adversarial Network was used to create two synthetic datasets (training set: N=10,000; test set: N=1,000) from real-world data on 381 asthma patients. To simulate a federated environment, the resulting data were distributed across nodes in a Microsoft Azure Trusted Research Environment (TRE). GLMs (one-way ANOVA) and ML models (gradient boosted decision trees) were then produced using both federated and non-federated approaches. The consistency of predictions produced by the ML models was then compared between approaches, with predictive accuracy of the models quantified by the area under the receiver operating characteristic curve (AUROC).
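The two comparison metrics named above can be illustrated with a minimal sketch. It assumes binary class labels, a 0.5 decision threshold, and toy scores invented purely for illustration; the function names `agreement_rate` and `auroc` are hypothetical and this is not the study's actual evaluation code.

```python
from typing import List, Sequence


def agreement_rate(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of cases in which two models assign the same class label."""
    if len(preds_a) != len(preds_b) or len(preds_a) == 0:
        raise ValueError("prediction lists must be non-empty and of equal length")
    same = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return same / len(preds_a)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case outscores a
    random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def classify(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn predicted probabilities into class labels (threshold is an assumption)."""
    return [1 if s >= threshold else 0 for s in scores]


# Toy test-set labels and model scores, invented for illustration only.
labels = [1, 0, 1, 0, 1, 0, 0, 1]
central_scores = [0.9, 0.2, 0.6, 0.4, 0.3, 0.1, 0.7, 0.8]
federated_scores = [0.8, 0.3, 0.4, 0.6, 0.35, 0.2, 0.65, 0.9]

consistency = agreement_rate(classify(central_scores), classify(federated_scores))
central_auc = auroc(labels, central_scores)
federated_auc = auroc(labels, federated_scores)
print(f"agreement={consistency:.3f}, central AUROC={central_auc:.3f}, "
      f"federated AUROC={federated_auc:.3f}")
```

The pairwise AUROC loop is quadratic in the number of cases, which is acceptable for a sketch; a rank-based implementation would be preferable for larger test sets.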
FINDINGS GLMs produced from federated data distributed between two TREs were identical to those produced using a non-federated approach. However, ML models produced by federated and non-federated approaches, and using different data distributions between TREs, were non-identical. Despite this, when applied to the test set, the classifications made by the federated models were consistent with the non-federated model in 84.7-90.4% of cases, which was similar to the consistency of repeated non-federated models (90.9-91.5%). Consequently, overall predictive accuracies for federated and non-federated models were similar (AUROC: 0.663-0.669).
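One way to see why a federated one-way ANOVA can reproduce the pooled result is that the F statistic depends only on per-group counts, sums, and sums of squares, which each node can compute locally and share without exposing row-level data. The sketch below assumes that aggregation scheme; the abstract does not state how the study implemented it, and the node contents, group names, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# group label -> (count, sum, sum of squares)
Stats = Dict[str, Tuple[int, float, float]]


def local_stats(rows: Iterable[Tuple[str, float]]) -> Stats:
    """Computed inside each node: only aggregates leave the node."""
    acc = defaultdict(lambda: [0, 0.0, 0.0])
    for group, value in rows:
        entry = acc[group]
        entry[0] += 1
        entry[1] += value
        entry[2] += value * value
    return {g: (e[0], e[1], e[2]) for g, e in acc.items()}


def merge_stats(parts: Iterable[Stats]) -> Stats:
    """Computed by the coordinator: add the per-node aggregates group by group."""
    acc = defaultdict(lambda: [0, 0.0, 0.0])
    for part in parts:
        for group, (n, s, q) in part.items():
            entry = acc[group]
            entry[0] += n
            entry[1] += s
            entry[2] += q
    return {g: (e[0], e[1], e[2]) for g, e in acc.items()}


def anova_f(stats: Stats) -> float:
    """One-way ANOVA F statistic from per-group sufficient statistics."""
    k = len(stats)
    n_total = sum(n for n, _, _ in stats.values())
    grand_sum = sum(s for _, s, _ in stats.values())
    ss_between = sum(s * s / n for n, s, _ in stats.values()) - grand_sum ** 2 / n_total
    ss_within = sum(q - s * s / n for n, s, q in stats.values())
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy data split across two hypothetical nodes (values invented for illustration).
node_a = [("low", 1.0), ("low", 2.0), ("mid", 2.0), ("high", 5.0)]
node_b = [("low", 3.0), ("mid", 3.0), ("mid", 4.0), ("high", 6.0), ("high", 7.0)]

f_federated = anova_f(merge_stats([local_stats(node_a), local_stats(node_b)]))
f_pooled = anova_f(local_stats(node_a + node_b))  # non-federated reference
print(f_federated, f_pooled)
```

Under this scheme the federated and pooled statistics agree up to floating-point rounding regardless of how rows are split between nodes. Gradient boosted trees have no comparable closed-form set of sufficient statistics, which is consistent with the non-identical ML models reported above.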
INTERPRETATION This study confirmed the robustness of GLMs utilising ANOVA within a federated framework, yielding consistent outcomes. Moreover, federated ML models demonstrated a high degree of classification agreement, with comparable accuracy to traditional non-federated models. These results highlight the viability of federated approaches for reliable and accurate data analysis in sensitive domains.
Competing Interest Statement
S Gallier reports funding support from HDRUK, Innovate UK, MRC and NIHR. E Sapey reports funding support from HDRUK, Innovate UK, MRC, Wellcome Trust, NIHR, EPSRC and British Lung Foundation. S Cox reports funding support from HDRUK.
Funding Statement
This study was supported by grant funding from UKRI Innovate UK and was conducted as a DARE UK Sprint Exemplar Project.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This research was conducted with Health Research Authority and Research Ethical Approvals (East Midlands Derby Research Ethics Committee, reference 20/EM/0158). The study used synthetic human data, which are available to others through application to PIONEER via the corresponding author (email: pioneer@uhb.nhs.uk). The synthetic data were linked to meteorological and air quality data from The Centre for Environmental Data Analysis (CEDA), available via https://catalogue.ceda.ac.uk/
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
To support further work in this area, the synthetic data and a data dictionary defining each field will be available to others through application to PIONEER via the corresponding author. Email: pioneer@uhb.nhs.uk