ABSTRACT
Background In the age of big data, linked social and administrative health data combined with machine learning (ML) are increasingly used to improve prediction of cardiovascular diseases (CVD). We aimed to apply ML methods to extensive national-level health and social administrative datasets to predict future diabetes complications by ethnicity.
Methods Five ML models were used to predict CVD events among all people with known diabetes in the population of New Zealand, utilizing national-level administrative data at the individual level.
Results The XGBoost ML model had the best predictive power for CVD events three years into the future among the population with diabetes. The optimization procedure also found limited improvement in AUC by ethnicity. The results indicated no trade-off between model predictive performance and the equity gap of prediction by ethnicity. The list of important variables differed across models and ethnic groups; examples include age, deprivation, having had a hospitalization event, and the number of years living with diabetes.
Discussion and conclusions We provide further evidence that ML with administrative health data can be used for meaningful future prediction of health outcomes. As such, it could be utilized to inform health planning and healthcare resource allocation for diabetes management and the prevention of CVD events. Our results may suggest limited scope for developing prediction models by ethnic group, and that the major way to reduce inequitable health outcomes is probably via improved delivery of prevention and management to those groups with diabetes at highest need.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was funded by the Royal Society Te Aparangi. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The ethics committee of the University of Otago gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
Access to the anonymised data used in this study was provided by Stats NZ under the security and confidentiality provisions of the Statistics Act 1975. Only people authorised by the Statistics Act 1975 are allowed to see data about a particular person, household, business, or organisation, and the results in this paper have been confidentialised to protect these groups from identification and to keep their data safe.
List of abbreviations
- AUC: Area under the Receiver Operating Characteristics curve
- CVD: Cardiovascular diseases
- IDI: Integrated Data Infrastructure
- ML: Machine learning
- NZ: Aotearoa New Zealand
- RF: Random Forest
- SNZ: Stats New Zealand