RT Journal Article SR Electronic T1 De Novo Exposomic Geospatial Assembly of Chronic Disease Regions with Machine Learning & Network Analysis JF medRxiv FD Cold Spring Harbor Laboratory Press SP 2024.07.25.24310832 DO 10.1101/2024.07.25.24310832 A1 Deonarine, Andrew A1 Batwara, Ayushi A1 Wada, Roy A1 Nair, Shoba A1 Sharma, Puneet A1 Loscalzo, Joseph A1 Ojikutu, Bisola A1 Hall, Kathryn YR 2024 UL http://medrxiv.org/content/early/2024/07/26/2024.07.25.24310832.abstract AB Background: Determining spatial relationships between disease and the exposome is limited by available methodologies. aPEER (algorithm for Projection of Exposome and Epidemiological Relationships) uses machine learning (ML) and network analysis to find spatial relationships between diseases and the exposome in the United States. Methods: Using aPEER we examined the relationship between 12 chronic diseases and 186 pollutants. PCA, K-means clustering, and map projection produced clusters of counties derived from pollutants, and the Jaccard correlation of these clusters with counties with high rates of disease was calculated. Pollution correlation matrices were used together with network analysis to identify the strongest disease-pollution relationships. Results were compared to LISA, Moran’s I, univariate, elastic net, and random forest regression. Findings: aPEER produced 68,820 maps with human-interpretable, distinct pollution-derived regions. Diseases with the strongest pollution associations were hypertension (J=0.5316, p=3.89×10^-208), COPD (J=0.4545, p=8.27×10^-131), stroke (J=0.4517, p=1.15×10^-127), stroke mortality (J=0.4445, p=4.28×10^-125), and diabetes mellitus (J=0.4425, p=2.34×10^-127). Methanol, acetaldehyde, and formaldehyde were identified as strongly associated with stroke, COPD, stroke mortality, hypertension, and diabetes mellitus in the southeast United States (which correlated with both the Stroke and Diabetes Belts).
Pollutants were strongly predictive of chronic disease geography and outperformed conventional prediction models based on preventive services and social determinants of health (using elastic net and random forest regression). Interpretation: aPEER used machine learning to identify disease and air pollutant relationships with similar or superior AUCs compared to social determinants of health (SDOH) and healthcare preventive service models. These findings highlight the utility of aPEER in epidemiological and geospatial analysis as well as the emerging role of exposomics in understanding chronic disease pathology. Evidence before this study: Many chronic diseases, such as diabetes and stroke mortality, have well-defined geographical distributions in the United States. While the reasons for these distributions have been actively investigated for decades, few studies have examined the role of pollution. To assess the current scientific literature available, we completed a structured review in Medline, Google Scholar, and PubMed for any publications in English up to June 24, 2024 using the search terms “stroke”, “cerebral infarction”, “isch(a)emic stroke”, “intracerebral h(a)emorrage”, “h(a)emorrhagic stroke”, or “subarachnoid h(a)emorrage”, “diabetes” AND “Stroke Belt”, “Stroke Region”, “Diabetes Belt”, “Diabetes Region”, or “Disease Belt”. Although there were multiple studies examining the role of genetics and poverty in relation to the geographical distribution of diseases, few examined pollution. Added value of this study: In this study, a novel machine learning algorithm was developed that modeled geospatial relationships between chronic disease rates for 3,141 counties and county-level pollution measures in the United States.
aPEER detected significant relationships between pollutants and several cardiometabolic conditions (using the Jaccard correlation coefficient: hypertension (J=0.5316, p=3.89×10^-208), COPD (J=0.4545, p=8.27×10^-131), stroke (J=0.4517, p=1.15×10^-127), stroke mortality (J=0.4445, p=4.28×10^-125), and diabetes (J=0.4425, p=2.34×10^-127)). Using just pollution measures, aPEER consistently identified a region in the southeast United States which correlated closely with both the Stroke and Diabetes Belts, and matched the distribution of multiple cardiometabolic diseases. It was possible to predict the geographical distribution of high chronic disease rates using elastic net and random forest regressions from pollution indicators with similar or superior accuracy (determined by receiver operating characteristic curves) compared to preventive healthcare or social determinants of health models. Implications of all the available evidence: For the first time, it was possible to predict hypertension, COPD, stroke mortality, diabetes, and stroke rates from pollution indicators with comparable or superior accuracy compared to conventional models, and to readily identify a region of increased pollution in the United States that closely matched the Stroke Belt using machine learning methods. These results highlight the utility of machine learning in exploring and analyzing spatial data, and the importance of pollution in predicting the geographical variation of disease, with implications for cardiometabolic disease pathogenesis and management. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: This work was funded by the Boston Public Health Commission, NHLBI (R03 HL157890) and the CDC.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The United States Census Website (https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html), the EPA EJSCREEN Database (https://www.epa.gov/ejscreen), the EPA AirToxScreen database (https://www.epa.gov/AirToxScreen) and CDC PLACES Database (https://www.cdc.gov/places/index.html). I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes.
The database generated for this study consisted of 226 indicators for 3,141 counties (the complete set of indicators from the Centers for Disease Control and Prevention (CDC) PLACES, Environmental Protection Agency (EPA) EJSCREEN, and EPA AirToxScreen databases) and was integrated into a dataframe in Python (version 3.9) using Pandas (version 1.3.4).
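As a minimal sketch of how county-level indicators from several sources might be combined into one Pandas dataframe: the column names, the use of a county FIPS string as the join key, and the toy values below are all assumptions for illustration and are not taken from the study's code.

```python
import pandas as pd

# Toy stand-ins for the three sources; column names and values are made up.
# FIPS codes are kept as strings so leading zeros are preserved.
places = pd.DataFrame({"fips": ["01001", "01003"], "diabetes_rate": [12.1, 10.4]})
ejscreen = pd.DataFrame({"fips": ["01001", "01003"], "pm25": [9.2, 8.7]})
airtox = pd.DataFrame({"fips": ["01001", "01003"], "formaldehyde": [1.3, 1.1]})

# Join on the (assumed) county key; validate= raises if a key is duplicated,
# which guards against silently multiplying rows.
county_df = (
    places
    .merge(ejscreen, on="fips", validate="one_to_one")
    .merge(airtox, on="fips", validate="one_to_one")
)
```

The default inner join keeps only counties present in every source; whether the study handled missing counties this way is not stated.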
Chronic Disease data: Health-related indicators for 3,141 US counties, including rates of chronic disease, participation in preventive services, and risk factors, were extracted from the Behavioral Risk Factor Surveillance System (BRFSS) as made available through the 2023 CDC PLACES database [22] (Supplementary Table 1). From these datasets we identified 11 disease and health-related measures for analysis (based on the leading contributors to disability-adjusted life years (DALYs) in the United States [23]), specifically arthritis, asthma, chronic obstructive pulmonary disease (COPD), cancer, coronary heart disease, depression, diabetes, hypertension, obesity, renal disease, and stroke. Stroke mortality data for ages 35 or older were downloaded from the CDC Stroke Death Rates database (2017-2019) [24]. High disease prevalence or high stroke-mortality counties were defined as those with age-adjusted rates at or above the 70th percentile. Pollution, SDOH, Demographic, and Geographical Data: Data for 9 pollution indicators, along with seven social determinants of health (SDOH) / health equity census-tract-level measures, were extracted from the Environmental Protection Agency (EPA) Environmental Justice (EJSCREEN) 2021 database [25], together with 177 chemical ambient air concentrations from the EPA's 2018 AirToxScreen database [26] reported at the census block group level (in µg/m³). County-level values were calculated by population-weighting the census block group exposures and then summing the weighted values across the block groups in each county. Together, the EJSCREEN and AirToxScreen measures resulted in 186 pollution measures examined in this study. Geographical boundary information for counties, in the form of GeoJSON, was obtained from the US Census TIGER database [27].
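The two data-preparation rules described above (population-weighted aggregation of block-group concentrations to counties, and flagging counties at or above the 70th percentile) can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the column names, the helper names `county_weighted_exposure` and `flag_high`, the toy numbers, and the use of Pandas' default linear-interpolation quantile are all assumptions.

```python
import pandas as pd

def county_weighted_exposure(bg: pd.DataFrame) -> pd.Series:
    """Population-weighted county concentration from block-group rows.

    Assumes hypothetical columns: county_fips, population, concentration.
    Each block group contributes concentration * (its share of county population),
    and the contributions are summed within the county.
    """
    weighted = bg["population"] * bg["concentration"]
    totals = weighted.groupby(bg["county_fips"]).sum()
    pops = bg.groupby("county_fips")["population"].sum()
    return totals / pops

def flag_high(rates: pd.Series, pct: float = 0.70) -> pd.Series:
    """True for counties whose rate is >= the pct-quantile across all counties."""
    return rates >= rates.quantile(pct)

# Toy block-group data (made up).
block_groups = pd.DataFrame({
    "county_fips": ["01001", "01001", "01003", "01003"],
    "population": [1000, 3000, 2000, 2000],
    "concentration": [1.0, 2.0, 0.5, 1.5],
})
county_exposure = county_weighted_exposure(block_groups)

# Toy age-adjusted rates for ten hypothetical counties.
rates = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
high = flag_high(rates)
```

With these toy rates the 70th-percentile cutoff falls between the seventh and eighth values, so the three largest rates are flagged as high.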
The 9 EJSCREEN pollution indicators [28] included particulate matter 2.5 (PM2.5; µg/m³), ozone (parts per billion), traffic proximity (vehicles per day / meters), lead paint exposure (% of housing units built before 1960), superfund proximity (superfund site count / km), RMP facility proximity (facility count / km), hazardous waste proximity (count of hazardous waste facilities within 5 km (or nearest beyond 5 km), each divided by distance in kilometers), underground storage tanks (count of facilities (multiplied by a factor of 7.7) within a 1,500-foot buffered block group), and wastewater discharge (modeled toxic concentrations at stream segments within 500 meters, divided by distance in kilometers) (Supplementary Table 1). The year of pollution exposure was selected to precede the year when chronic disease rates were reported.
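The abstract reports a Jaccard measure between pollution-derived county clusters and the set of high-disease counties. The sketch below shows one plausible way to compute it for already-assigned cluster labels, using the standard Jaccard index |A ∩ B| / |A ∪ B|. The function names, the toy labels, and scoring every cluster separately are assumptions for illustration; the PCA and K-means steps that would produce the labels are not shown, and the study's exact procedure may differ.

```python
from collections import defaultdict

def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_overlap(labels: dict, high_disease: set) -> dict:
    """Jaccard index of each cluster's county set against the high-disease set.

    labels maps a county identifier to its cluster label (hypothetical input).
    """
    clusters = defaultdict(set)
    for county, label in labels.items():
        clusters[label].add(county)
    return {label: jaccard(members, high_disease)
            for label, members in clusters.items()}

# Toy, made-up inputs: five counties in two clusters, three high-disease counties.
labels = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
high_disease = {"A", "B", "D"}
scores = cluster_overlap(labels, high_disease)
```

Here cluster 0 shares two of four counties in the union with the high-disease set, and cluster 1 shares one of four.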