PT - JOURNAL ARTICLE
AU - Deonarine, Andrew
AU - Batwara, Ayushi
AU - Wada, Roy
AU - Nair, Shoba
AU - Sharma, Puneet
AU - Loscalzo, Joseph
AU - Ojikutu, Bisola
AU - Hall, Kathryn
TI - De Novo Exposomic Geospatial Assembly of Chronic Disease Regions with Machine Learning & Network Analysis
AID - 10.1101/2024.07.25.24310832
DP - 2024 Jan 01
TA - medRxiv
PG - 2024.07.25.24310832
4099 - http://medrxiv.org/content/early/2024/07/26/2024.07.25.24310832.short
4100 - http://medrxiv.org/content/early/2024/07/26/2024.07.25.24310832.full
AB - Background: Determining spatial relationships between disease and the exposome is limited by available methodologies. aPEER (algorithm for Projection of Exposome and Epidemiological Relationships) uses machine learning (ML) and network analysis to find spatial relationships between diseases and the exposome in the United States. Methods: Using aPEER we examined the relationship between 12 chronic diseases and 186 pollutants. PCA, K-means clustering, and map projection produced clusters of counties derived from pollutants, and the Jaccard correlation of these clusters with counties with high rates of disease was calculated. Pollution correlation matrices were used together with network analysis to identify the strongest disease-pollution relationships. Results were compared to LISA, Moran's I, univariate, elastic net, and random forest regression. Findings: aPEER produced 68,820 maps with human interpretable, distinct pollution-derived regions. Diseases with the strongest pollution associations were hypertension (J=0.5316, p=3.89x10^-208), COPD (J=0.4545, p=8.27x10^-131), stroke (J=0.4517, p=1.15x10^-127), stroke mortality (J=0.4445, p=4.28x10^-125), and diabetes mellitus (J=0.4425, p=2.34x10^-127). Methanol, acetaldehyde, and formaldehyde were identified as strongly associated with stroke, COPD, stroke mortality, hypertension, and diabetes mellitus in the southeast United States (which correlated with both the Stroke and Diabetes Belt).
Pollutants were strongly predictive of chronic disease geography and outperformed conventional prediction models based on preventive services and social determinants of health (using elastic net and random forest regression). Interpretation: aPEER used machine learning to identify disease and air pollutant relationships with similar or superior AUCs compared to social determinants of health (SDOH) and healthcare preventive service models. These findings highlight the utility of aPEER in epidemiological and geospatial analysis as well as the emerging role of exposomics in understanding chronic disease pathology. Funding: Boston Public Health Commission, NHLBI (R03 HL157890) and the CDC.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This work was funded by the Boston Public Health Commission, NHLBI (R03 HL157890) and the CDC.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes.
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The United States Census Website (https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html), the EPA EJSCREEN Database (https://www.epa.gov/ejscreen), the EPA AirToxScreen database (https://www.epa.gov/AirToxScreen) and CDC PLACES Database (https://www.cdc.gov/places/index.html).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes.
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes.
The database generated for this study consisted of 226 indicators for 3,141 counties (the complete set of indicators from the Centers for Disease Control and Prevention (CDC) PLACES, Environmental Protection Agency (EPA) EJSCREEN, and EPA AirToxScreen databases), integrated into a dataframe in Python (version 3.9) using Pandas (version 1.3.4).
Chronic Disease data: Health-related indicators for 3,141 US counties, including rates of chronic disease, participation in preventive services, and risk factors, were extracted from the Behavioral Risk Factor Surveillance System (BRFSS), as made available through the 2023 CDC PLACES database [22] (Supplementary Table 1). From these datasets we identified 11 disease and health-related measures for analysis (based on the leading contributors to disability-adjusted life years (DALYs) in the United States [23]): arthritis, asthma, chronic obstructive pulmonary disease (COPD), cancer, coronary heart disease, depression, diabetes, hypertension, obesity, renal disease, and stroke. Stroke mortality data for ages 35 or older were downloaded from the CDC Stroke Death Rates database (2017-2019) [24]. Counties with high disease prevalence or high stroke mortality were defined as those with age-adjusted rates at or above the 70th percentile.
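The 70th-percentile definition of a high-rate county, and the Jaccard comparison mentioned in the abstract, can be illustrated with a minimal sketch. It is not the authors' code: the study used Pandas, whereas this sketch uses only the Python standard library, and the county IDs, rates, cluster membership, and function names (`high_rate_counties`, `jaccard`) are hypothetical. It also assumes linear interpolation between observed values when locating the percentile, which the source does not specify.

```python
from statistics import quantiles


def high_rate_counties(rates, percentile=70):
    """Return the county IDs whose rate is at or above the given percentile."""
    # With n=100, quantiles() returns the 1st..99th percentile cut points;
    # method="inclusive" interpolates linearly between observed values.
    cut_points = quantiles(rates.values(), n=100, method="inclusive")
    threshold = cut_points[percentile - 1]
    return {county for county, rate in rates.items() if rate >= threshold}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical age-adjusted rates (%) for 11 made-up county IDs.
rates = {f"C{i:02d}": 10.0 * i for i in range(1, 12)}
high = high_rate_counties(rates)

# Hypothetical pollution-derived cluster of counties to compare against.
cluster = {"C07", "C08", "C09", "C10"}
overlap = jaccard(high, cluster)
```

Here `high` holds the counties at or above the 70th-percentile rate, and `overlap` measures how closely that set matches the hypothetical cluster; the significance testing reported in the abstract is outside the scope of this sketch.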
Pollution, SDOH, Demographic, and Geographical Data: Data for nine pollution indicators and seven social determinants of health (SDOH) / health equity census-tract level measures were extracted from the Environmental Protection Agency (EPA) Environmental Justice (EJSCREEN) 2021 database [25], together with ambient air concentrations of 177 chemicals from the EPA's 2018 AirToxScreen database [26], reported at the census block group level (in ug/m^3). County-level values were calculated by population-weighting the census block group level exposures and then summing across the block groups in each county. Together, the EJSCREEN and AirToxScreen measures yielded the 186 pollution measures examined in this study. Geographical boundary information for counties, in GeoJSON form, was obtained from the US Census TIGER database [27]. The nine EJSCREEN pollution indicators [28] were particulate matter 2.5 (PM2.5; ug/m^3), ozone (parts per billion), traffic proximity (vehicles per day / meters), lead paint exposure (% of housing units built before 1960), Superfund proximity (Superfund site count / km), RMP facility proximity (facility count / km), hazardous waste proximity (count of hazardous waste facilities within 5 km (or nearest beyond 5 km), each divided by distance in kilometers), underground storage tanks (count of facilities (multiplied by a factor of 7.7) within a 1,500-foot buffered block group), and wastewater discharge (modeled toxic concentrations at stream segments within 500 meters, divided by distance in kilometers) (Supplementary Table 1). The year of pollution exposure was selected to precede the year when chronic disease rates were reported.
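The county-level aggregation of block group concentrations described above can be sketched as follows. This is only one possible reading of the text: it assumes each block group is weighted by its share of the county population, so the county value is a population-weighted mean. The record does not state the exact normalization, and the function name, county labels, populations, and concentrations are all hypothetical.

```python
from collections import defaultdict


def county_weighted_concentration(block_groups):
    """Aggregate (county, population, concentration) rows to county level.

    Assumption: weights are each block group's share of county population,
    i.e. sum(pop * conc) / sum(pop) over the block groups in a county.
    """
    weighted_sum = defaultdict(float)
    population = defaultdict(float)
    for county, pop, conc in block_groups:
        weighted_sum[county] += pop * conc
        population[county] += pop
    # Skip counties with zero recorded population to avoid division by zero.
    return {
        county: weighted_sum[county] / population[county]
        for county in weighted_sum
        if population[county] > 0
    }


# Hypothetical block groups: (county label, population, concentration in ug/m^3).
rows = [
    ("A", 1000, 2.0),
    ("A", 3000, 4.0),
    ("B", 500, 1.0),
]
county_values = county_weighted_concentration(rows)
```

In county "A", the larger block group pulls the county value toward its concentration, which is the intended effect of population weighting.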