Abstract
Epidemiological studies on health effects of air pollution usually rely on measurements from monitors, which provide limited spatio-temporal coverage. Data from satellites, reanalysis and chemical transport models offer additional information used to reconstruct pollution concentrations at high spatio-temporal resolution. The aim of this study is to develop a multi-stage satellite-based machine learning model to estimate daily fine particulate matter (PM2.5) levels across Great Britain during 2003-2018. This high-resolution model consists of random forest (RF) algorithms applied in four stages. Stage-1 augmented monitor-PM2.5 series using co-located PM10 measures. Stage-2 imputed missing satellite aerosol optical depth observations using atmospheric reanalysis models. Stage-3 integrates the output from previous stages with spatial and spatio-temporal variables to build a prediction model for PM2.5. Stage-4 applied Stage-3 models to estimate daily PM2.5 concentrations over a 1-km grid. The RF architecture performed well in all stages, with results from Stage-3 showing an average cross-validated R2 of 0.788 and minimal bias. Spatial and temporal scale also performed well with R2 of 0.822 and 0.779, respectively. The high spatio-temporal resolution and relatively high precision allows this dataset (1.37 billion points) to be used in epidemiological analyses to assess health risks associated with both short- and long-term exposures to PM2.5.
1. Introduction
The World Health Organization estimates in 7 million the global deaths associated with air pollution (both outdoor and household) every year, emphasising that exposure to particulate matter (PM) is the greatest cause of concern [WHO,2020]. Fine particles with aerodynamic diameter smaller than 2.5 µm (PM2.5) can penetrate into the human circulatory system through the lungs and provoke multiple adverse health outcomes, including mortality [Liu et al., 2019] hospital admissions [Basagaña et al., 2015], lung dysfunction [Raaschou-Nielsen et al., 2016], cardiovascular diseases [Lavigne et al., 2019b], and allergic reactions [Lavigne et al., 2019a]. Usually, epidemiological studies collect air quality (AQ) data from ground monitors to quantify both short-term and long-term PM2.5 exposures associated with acute and chronic health effects, respectively. The limitation in this health assessment approach is the lack of continuous temporal records of PM2.5 and the limited spatial distribution of the monitors. Great Britain is an example of countries with very limited spatio-temporal coverage of PM2.5, whereby the monitoring network is densely located only in major cities and widespread measurements of PM2.5 only started from 2010.
Remote sensing observations of aerosol Optical Depth (AOD) obtained from satellites, which measures how much direct sunlight has been scattered and absorbed by aerosols particles suspended in the atmosphere, has recently been proposed as an alternative to measure PM variability for epidemiological purposes [NASA Earth Observations, 2020]. However, while offering the advantage of global coverage and relatively high spatio-temporal resolution, the use of AOD for PM2.5 exposure assessments present limitations, for instance the fact that it represents the total atmospheric column concentration of aerosol rather than surface values [van Donkelaar, 2010]. Unsurprisingly, early studies based only on satellite-AOD achieved very low performances in predicting PM2.5 [Koelemeijer et al., 2006]. Recent studies have proposed more sophisticated approaches, combining AOD measures with information from other satellite products, reanalysis data, chemical transport models, and geospatial features to improve the prediction of PM2.5. Such studies used various analytical methods, including multiple linear regression [Gupta and Christopher, 2009], land-use regression [Beckerman et al., 2013; Vienneau et al., 2010], and mixed effect models [de Hoogh et al., 2018; Kloog et al., 2011; Kloog et al., 2015; Lee et al., 2011; Staffogia et al., 2016]. The last development in this research area is represented by the application of machine learning (ML) methods, including various architectures such as random forests [Chen et al., 2018; Di et al., 2019; Staffogia et al., 2019; Yazdi et al., 2020], neural network [Di et al., 2019; Yazdi et al., 2020], and gradient boosting [Chen et al., 2019; Di et al., 2019; Just et al., 2018; Zan et al., 2017]. These have demonstrated higher performances, linked with the ability to model any kind of predictor(s)-response association and to deal better with potentially complex relationship between PM2.5, spatial and spatio-temporal predictors [Di et al., 2019; Polley et al., 2010].
The aim of this study is to develop and apply a multi-stage satellite-based ML model to estimate daily concentrations of PM2.5 over a 1 km grid across Great Britain in the period 2003-2018. The analysis is based on a dataset with synchronised information from various data sources, such as several remote sensing satellite products, multiple climate and atmospheric reanalysis databases, chemical transport models, and spatial and spatio-temporal variables. The model is assessed through measures of predictive performance, error, and bias, obtained through cross-validation.
2. Materials and Methods
2.1. Study area and period
Great Britain is an island with 229,462 km2 surrounded by the Atlantic Ocean, Irish Sea, North Sea, and English Channel. It comprises the countries of England, Scotland, and Wales with a total population in 2018 of almost 65 million [ONS, 2020]. According to the Köppen climate classification, the United Kingdom (Great Britain and Northern Ireland) is defined as having a warm temperate climate, fully humid with mostly warm summer (cold summer for some parts of Scotland and England) [Kottel et al., 2006]. The study area included 234.429 1 km grid cells (containing a unique identification code, cell-ID) from the original 1 km Great Britain National Grid Squares [Digimap, 2020] for a period between 1st January 2003 and 31st December 2018.
2.2. PM2.5 and PM10 observed data
Daily PM10 and PM2.5 (µg/m3) measurements in the study period were obtained from five monitoring network sources through the R package openair [Openair, 2020]: Automatic Urban and Rural Network (AURN), Air Quality England (AQE), King College London (KCL), Scotland Air Quality Network (SAQN), and Wales Air Quality Network (WAQN). When monitors from different sources showed exactly the same temporal distributions (i.e., correlation equal to 1) and were located at approximately the same coordinates, only the AURN monitor was maintained. Monitors with less than 18 hours of PM2.5 records per day as well as less than 30 days by year were removed. The final set includes 518 and 223 monitors measuring PM10 and PM2.5 along the study period, respectively. Figure 1 shows the locations of these monitors across Great Britain, illustrating how the network coverage is densely located in major cities, leaving several small cities and rural areas with only a few or none AQ records. Each monitor was indexed using the cell-ID to the 1 km grid cell that it falls inside.
2.3. Spatially-lagged PM2.5 variables
Two additional variables were generated from the monitor series of PM2.5 to represent spatially-lagged annual average concentrations. These were computed using inverse-distance weighted means of annual average values of nearby monitors. Two different weights were applied, namely the inverse distance and the inverse squared distance (in km). The former assigns relatively more weight to distance monitors, and therefore it represents a regional background, while the latter captures local differences. These variables were used as additional predictors to exploit the spatial autocorrelation of PM concentrations and improve the predictive performance [Di et al., 2019].
2.4. AOD data: satellite and atmospheric reanalysis models
Daily satellite-AOD was obtained from the Collection 6 Level-2 gridded product (MCD19A2). These data are generated at a 1 km grid through the Multi-angle Implementation of Atmospheric Correction (MAIAC) algorithm using data from Moderate Resolution Imaging Spectroradiometer (MODIS) sensor on board both Terra and Aqua earth observation satellites [Lyapustin and Wang, 2018]. Four layers from MCD19A2 product were extracted: (i) AOD Blue band (0.47µm), (ii) AOD Green band (0.55µm), (iii) AOD uncertainty (i.e. level of uncertainty based on blue-band surface brightness [reflectance]), (iv) AOD_QA (quality assurance flags to retrieve only the best quality AOD). These layers are generated for each passing time of Terra and Aqua satellites over the area of study and combined by day. The layers AOD-0.47µm and AOD-0.55µm were used as the outcome variables, after their values were filtered using AOD uncertainty and AOD_QA layers to guarantee high product quality.
This calibration process, together with pixels covered by clouds, removed a large sample of AOD grid cells over Great Britain. To fill the satellite-AOD gaps, modelled-AOD total column was used from Copernicus Atmosphere Monitoring Service (CAMS) reanalysis provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) [Bozzo et al., 2017]. CAMS reanalysis provides every 3-hourly modelled-AOD at five different wavelengths (0.47µm, 0.55µm, 0.67µm, 0.87µm, and 1.24µm) with a spatial resolution of approximately 80 km but the data was downloaded at 10 km, based on an interpolation performed through the ECMWF’ s API request. The satellite-AOD and CAMS modelled-AOD were indexed to the closest 1 km grid cell from their pixel centroid.
2.5. Other spatio-temporal predictors
2.3.1. Modelled PM2.5 from chemical transport models
Atmospheric chemistry transport models (ACTMs) incorporate anthropogenic and natural sources of emission, land use and meteorological conditions to simulate the atmospheric compositions and deposition of various air pollutants (trace gases and particles). Based on the European Modelling and Evaluation Programme (EMEP) ACTM, EMEP4UK,) has been developed to represent the UK hourly atmospheric composition at a spatial resolution of 5 km [EMEP4UK, 2020]. The description of EMEP4UK model framework and setup can be found elsewhere [Vieno et al., 2010]. Daily EMEP4UK simulations of PM2.5 (µg/m3) concentrations at surface-level were included to represent ground-level contributions, in contrast to AOD products that refer to the total column of aerosols concentrations. Each EMEP4UK 5 km pixel was linked to the closest 1 km grid cell centroid.
2.3.2. Meteorological variables from climate reanalysis models
Meteorological variables were retrieved from the ECMWF’ s climate reanalysis models with the highest spatial resolution available during 2003-2018 and at two sub-day times (0:00 and 12:00). Sea pressure and the boundary layer height (BLH) were downloaded from the ERA 5 global reanalysis with a spatial resolution of approximately 30 km [ERA5, 2020]. Air temperature at 2m height and total precipitation were obtained from the ERA 5 Land global reanalysis with a spatial resolution of approximately 9 km [ERA5 Land, 2020]. Relative humidity, wind direction, and wind speed were downloaded from UERRA regional reanalysis at 5.5 km for the MESCAN-SURFEX system [UERRA, 2020]. All meteorological variables were indexed to the closest 1 km grid cell to their centroid.
2.3.3. Normalized difference vegetation index
Monthly Normalized difference vegetation index (NDVI) layer is used to quantify vegetation presence and it ranges from -1 to 1. NDVI 1 km grid was obtained from MOD13A3 Version 6 Level 3, Terra-MODIS product [Didan et al., 2015]. Each NDVI pixel was indexed to the closest 1 km grid cell to its pixel centroid and the NDVI values repeated for the days inside the corresponded month.
2.6. Spatial predictors
2.4.1. Land variables and night-time light data from earth observation satellites
Three land predictors were collected from the Copernicus Land Monitoring Service (CLMS) [CLMS, 2020] database: elevation, land cover, and impervious surfaces. Elevation data was obtained from the 2011 European Digital Elevation Model (EU-DEM) version 1.1 with a spatial resolution of 25 m. The elevation values were obtained from the mean of all 25 m-pixels values located inside each 1 km grid cells. Land cover data was obtained from 2012 CORINE Land Cover (CLC) inventory. It was derived from high-resolution ortho-rectified satellites images that mapped all land elements at a spatial resolution ranging from 5 m to 60 m and aggregated into 100 m. Nine predictors were defined by grouping the original 44 CLC classes and each predictor represents its group proportion inside each 1 km Great Britain grid. Imperviousness degree is a binary raster product at a spatial resolution of 100 m, where the value 0 represents natural land cover or water surface and value 1 represents entirely artificial surfaces (i.e. built-up areas). The amount of impervious surfaces was estimated by the proportion of artificial surfaces inside each 1 km Great Britain grid.
Night-time lights data was provided by the visible Infrared Imaging Radiometer Suite (VIIRS) satellite sensor aboard the Suomi-National Polar-orbiting Partnership (Suomi-NPP). The VIIRS Day/Night band collects cloud-free average radiance values at annual and monthly composites in spatial resolution of 750m [EOG, 2020]. The 2015 night-time lights annual mean was computed based on the weighted average of VIRRS pixels inside each 1 km Great Britain grid.
2.4.2. Population density
The resident population counts data from 2011 were collected from the Office for National Statistics (England and Wales) and the National Records of Scotland. The smallest geographic unit of the UK census is output areas (OA), with a total of 227.769 OAs polygons for Great Britain. The proportion of each OA’ s area inside each 1 km Great Britain grid was extracted to estimate the weighted average population by 1 km grid cell.
2.4.3. Road density and distance
Road density and length predictors were derived from the Ordnance Survey (OS) [2020] Open Roads product, which offers a geospatial representation of Great Britain’ s Road network. Three density predictors were defined by grouping the original eight OS road types, where each predictor was computed as the sum of all roads length inside each 1 km Great Britain grid by group (highway, secondary, and local). Three distance predictors were defined by computing the inverse distance of each 1 km grid centroid from the closest road group (highway, secondary, and local).
2.4.4. Distance from airports and seashore
Information on location and size of airports was derived from the Civil Aviation Authority (CAA) [2020], which collects monthly statistics about air traffic movements for more than 60 UK Airports, including aviation activities for terminal passengers, commercial flights, and cargo tonnage. A total of 19 airports across Great Britain were selected based on a minimum of 1% for the annual percentage of passengers at the airport during 2015-2018. For each cell of the 1 km Great Britain grid the inverse distance from the closest airport was calculated.
Distance from seashore was computed for each 1 km grid cell using the geographical information about the boundaries of England, Wales, and Scotland provided by the UK Data Service [2020].
2.7. Statistical methods
A four-stage model was developed to obtain daily PM2.5 concentrations for all 234.429 grid cells covering Great Britain. Each stage is described in detail below. Briefly, Stage-1 applies a random forest (RF) algorithm to predict PM2.5 concentrations in monitors with only PM10 records. Stage-2 uses RF models to impute missing satellite–AOD from Terra and Aqua satellites, using modelled-AOD from CAMS. Stage-3 combines the output from Stage-1 and Stage-2 with a list of spatial and spatio-temporal synchronised predictors to estimate PM2.5 concentrations at the locations of the monitors. Stage-4 uses the Stage-3 model to predict daily PM2.5 across the whole Great Britain.
2.5.1. Random forest algorithm
RF is a supervised tree-based design ML algorithm which trains an ensemble of independent decision trees (or forest) in parallel. The final model accuracy is estimated by the performance average of all decision trees. There are two main advantages of the RF architecture (known as bagging ensemble method): first, it controls the bias-variance trade-off by feeding each tree model with two-thirds of the training set while one-third is left out for validation (i.e., out-of-bag (OOB)) [Schneider dos Santos, 2020]. When all decision trees receive the same amount and list of predictors, they become highly correlated, not solving the variance problem. Therefore, the second positive aspect from RF, the number of predictors on each tree is less than the full list available and they are randomly selected. This approach changes the predictor placed on the top of the tree, generating different splits and internal nodes. The algorithm is then able to estimate an importance ranking by quantifying the amount of error decreased due to a split of a specific predictor [James et al., 2013].
The performance of the models in each stage was assessed using statistics based on OOB samples and then from a 10-fold cross-validation (CV) procedure based on monitors. In the latter, ten random groups of monitors were defined, and the complete outcome series in each group were predicted using a model fitted in the other nine. This offer a measure of true predictive ability of the RF model in locations where no monitor is available. Measures of performance were generated by regressing the OOB or cross-validated predicted values on the observed series, and computing the R2, the root mean square error (RMSE), and intercept and slope of the prediction. These statistics were computed overall using the whole series, and then separated in spatial and temporal contributions. The former was computed using the averages of predicted and observed values across the series, and it offers a measure of performance in capturing long-term average PM2.5 concentrations. The latter was computed as daily deviations from the averages, and it quantifies the temporal variability explained by the model.
2.5.2. Stage-1: increasing PM2.5, measurements using co-located PM10 monitors
The number of PM2.5 monitors across Great Britain at the beginning of the study period was relatively low, and even if the quantity has increased substantially after 2010, most of the monitors were mostly installed in major cities. Stage-1 aims to increase the number of spatio-temporal ground-level PM2.5 references using observations from co-located monitors measuring PM10, which are available in higher number or locations and better distributed across Great Britain. Specifically, in this stage we fitted a RF model in locations with both PM2.5 and PM10 measurements, using data from all years (2003-2018). The RF model is defined as: where: PM2.5i,t and PM10i,t are the target variable and main predictor measured at monitor i on day t; mon.typei is a categorical variable classifying monitor i (traffic, industrial, urban, suburban, rural, and airport); montht and dowt are categorical variables representing the months and day of the week of day t; and lati and loni define the coordinates of monitor i. The RF model was selected following an optimisation process, and the final parameters were 500 decision trees as the RF ensemble (Ntree=500) and four variables randomly selected to be used on each tree (mtry=4). This model was eventually used to predict PM2.5 measurements in locations/days in which only PM2.5 was measured.
2.5.3. Stage-2: imputing missing satellite-AOD from CAMS modelled-AOD
The percentage of missing satellite-AOD measurements in Great Britain ranged between 87% and 94% during 2003-2018 with the greatest portion during autumn and winter, near to the coast, and at the North of Scotland. Stage-2 imputes satellite-AOD missing for every day and 1 km grid based on an optimised RF model (Ntree=50 and mtry=20) and satellite-AOD wavelength (0.47µm and 0.55µm), separately in each year within the study period. These RF models were built as follows: where: Satellite- is the target variable representing satellite-AOD (wavelength z, year y) estimates at grid cell i on day t; CAMS.AODi,t(w,h) is the main predictor representing CAMS modelled-AOD estimates at grid cell i, on day t, at wavelength w (0.47µm, 0.55µm, 0.67µm, 0.865µm, and 1.24µm), and at seven sub-day times (3h, 6h, 9h, 12h, 15h, 18h, and 21h); ydayt defines the sequence of days in a year from 1 to 365 (366, for lap years); lati and loni represent the coordinates of grid cell centroid i. This model was eventually used to predict missing satellite-AOD measurements.
2.5.4. Stage-3: estimating PM2.5 concentrations using spatial and spatio-temporal variables
Stage-3 aims to build a predictive model for daily PM2.5 concentrations, using as target variable the combined set of PM2.5 directly measured from monitors or predicted from Stage-1, AOD measured from satellite instruments or predicted from Stage-2, together with all the spatially and spatio-temporally synchronized predictors described in the previous section. The RF models were fit separately by year. The optimization process for this stage selected Ntree=500 and mtry=20 as the best set of parameter values. In this stage, PM2.5 concentrations were log-transformed to ensure the prediction of non-negative values. The Stage-3 model is defined as: where: is the target variable representing the log-PM2.5 concentrations in year y at the monitor located in grid cell i on day t, while SPTi,t and SPi represent the spatio-temporal and spatial predictors, respectively.
Specifically, SPTi,t included: surface-level log-PM2.5 values from the EMEP4UK model; Stage-2 AOD (at 0.47µm, 0.55µm); meteorological variables (air temperature, sea pressure, relative humidity, total precipitations, wind speed and direction); BLH (at 12:00 and 24:00 hours); monthly-averaged NDVI; and day of the year, month, and week. SPi included: spatially-lagged regional and local log-PM2.5 concentrations (computed as inverse distance and square distance averages); land variables (elevation, land cover in 9 groups, and % of impervious surface); road density and inverse distance (each for highway, secondary, and local); population density; and inverse distance from closest airport and seashore.
2.5.1. Stage-4: reconstructing PM2.5 time-series at 1 km grid
Using the RF models developed by year in Stage-3, daily PM2.5 concentrations for each 1 km grid cell were reconstructed across Great Britain for the whole study period (2003-2018).
3. Results
3.1.1. Stage-1 results
Not all monitors in each network provided the full series of daily PM concentrations during 2003-2018. However, PM10 was more often available, especially in the years 2003-2009, when there were less than 100 PM2.5 monitors across Great Britain. The imputation process in Stage-1 enabled the expansion to 162 (2003), 172 (2004), 181 (2005), 195 (2006), 216 (2007), 232 (2008), and 238 (2009). Table 1 shows the Stage-1 model performance (i.e., a single RF model for all years) reported by two CV methods: (i) OOB - the original RF CV algorithm, which presented better accuracy (R2 of 0.923), and (ii) 10-Fold CV (overall R2 of 0.865) - a targeted sampling, which leaves-out monitors with the full set of observations. As expected, the 10-Fold CV approach resulted in a slightly lower predictive accuracy but still showing a good performance. There is some indication of bias in the spatial component, with a slope lower than the expected value of 1.
3.1.2. Stage-2 results
The Stage-2 procedure is illustrated in Figure 2, showing missing satellite-AOD, imputed through a combination of multiple modelled-AOD wavelengths and sub-day times. Table 2 shows the performance of Stage-2 models validated using the OOB CV method. The results presented consistently high R2 (ranging from 0.963 to 0.989), low RMSE (varying between 0.006 to 0.011, and almost no bias (intercept zero and slope close to 1 in almost all years and wavelengths).
3.1.3. Stage-3 results
Stage-3, the main step from four stage satellite-based machine learning framework, combines the output of Stage-1 and Stage-2 with a list of spatial and spatio-temporal predictors to estimate PM2.5 at the locations of the monitors. The relative importance of the predictors for the Stage-3 RF models are ranked in Table 3. The list with the top-15 predictors demonstrated of the larger contribution of the more informative spatio-temporal variables (EMEP4UK PM2.5, meteorological variables, BLH, Stage-2 AOD, and NDVI). Among the spatial-only predictors, the spatially–lagged PM2.5 variables ranked as the second and third-strongest variables, with the local version showing a very high importance score. This suggests the presence of spatial correlations in PM2.5 values that are not entirely captured by the other variables.
Table 4 shows the results of the 10-Fold CV Stage-3 RF models by year. The results indicate a good predictive performance of the model throughout the study period. Overall cross-validated R2 ranged from 0.729 (2005) to 0.846 (2017), with an average of 0.788. The average prediction error is 4.046 µg/m3, with negligible bias in intercept and slope. The inspection of the spatial and temporal contributions shows that the model performs well generally across the two components. The cross-validated spatial R2 ranges from 0.710 (2005) to 0.926 (2017), while the cross-validated temporal R2 from (0.697) (2010) to 0.835 (2011). The two components have an average of 0.822 and 0.779, respectively. The high spatial R2 performance across the years demonstrates that the Stage-3 RF models were able to predict the spatial variation of long-term PM2.5 across Great Britain with good accuracy. In Appendix B, Table B1 shows the Stage-3 results by season, demonstrating that the seasonal patterns were well described by RF models, although with lower accuracy for the temporal domain in summer, characterized by a lower 10-Fold R2. This drop in performance during summer was also seen by other studies [Stafoggia et al., 2019; Stafoggia et al., 2016].
3.1.4. Stage-4 results
Stage-4 provides the prediction of daily PM2.5 concentrations for each of the 234.429 1 km grid covering Great Britain. Results indicate an annual PM2.5 average of 15.369 µg/m3 for 2003, 12.881 µg/m3 for 2010, 9.021 µg/m3 for 2018, and 11.844 µg/m3 for 2003-2018 but with a strong spatial and temporal variation. The spatial distribution of annual average PM2.5 concentrations for 2003, 2010, and 2018 are shown in Figure 3, revealing a strong decrease of pollution levels in time across the whole territory. Table C1 in Appendix C provides the same figures for all the years, confirming the decreasing trend. The spatial comparison suggests that PM2.5 concentrations are lower in Scotland and Wales compared to the more populated southern regions of England, with hotspots located in urban areas such as Manchester, Birmingham, and Greater London. At the bottom of Figure 3, the maps display the corresponding annual average of PM2.5 levels in London, demonstrating the precision of the multi-stage ML model in reconstructing PM2.5 concentrations in 1 km grid cells within urban areas. The maps show local hotspots of high pollution, with a spatial distribution that however changes along the study period.
The greatest contribution of this study is the ability of the satellite-based ML models to reconstruct daily levels of PM2.5 at 1 km grid across a wide geographical domain. Figure 4 displays the spatial distribution of PM2.5 estimations across Great Britain (Top) and London (Bottom) for specific days within the study period. It is interesting to note the wide variation in PM2.5 concentrations between days in the same area and between areas in different days. For instance, the maps on the left panels represent a day with almost no variation and generally low concentrations, the maps on the mid panels display a strong north/south split likely linked to weather conditions. The right panels show a more complex pattern with wider range of PM2.5 values, a large area with very high pollution concentrations located at Glasgow (Scotland), and hotspots in highly-populated urban areas (such as Manchester and London (England)).
4. Discussion
This study presents the first application of satellite-based spatio-temporal ML methods to reconstruct levels of pollution across Great Britain, providing estimates of daily PM2.5 concentrations over a 1 km grid during 2003-2018. The multi-stage ML framework provided significant advantages. The method allowed combining information from multiple data sources, such as air quality monitoring networks, remote sensing satellite products, chemical dispersion models, reanalysis databases, administrative census data, among others.
The beginning of the study period (2003-2009) had lower quantity of monitors measuring PM2.5 across Great Britain but this number increased considerably from 2010 after a new measuring network has been established in 2009 [DEFRA, 2012]. Therefore, Stage-1 was extremely important step to extend the number of PM2.5 concentrations. The implementation of Stage-2 was also relevant to fill the gaps in MAIAC AOD retrievals from satellites, thus maximizing the available information. Stage-3 produced a ML prediction model based on a long list of informative predictors, accounting for potentially complex inter-relationships and functional forms. The Stage-3 ML algorithms offered an excellent performance, showing an average cross-validated R2 of 0.788 across the period, with increased predictive ability in the last years. Stage-4 provided a single PM2.5 estimation for each of the 234.429 1 km grid cells in each of 5,843 days, totaling around 1.37 billion data points. The complete spatial coverage, high resolution, and relative small error of the predictions, coupled with and the ability of the model to capture variations in PM2.5 concentrations across both spatial and temporal domains, making the output suitable for application in epidemiological studies on short and long-term health effects of air pollution.
The output from the empirical ML model developed in this study complements existing databases of modelled PM2.5 in the United Kingdom, with some advantages for applications in epidemiological studies. Country-wide maps generated by emission-dispersion models are available at lower spatial or temporal resolution [DEFRA, 2020; EMEP4UK, 2020; Savage et al., 2013, and generally they show lower small-scale accuracy when tested against observed monitoring data [Hood et al., 2018; Lin et al., 2017]. The spatio-temporal ML models presented here demonstrated comparable predictive performance to similar methods applied in other countries, based either on single-learner ML models [de Hoogh et al., 2018; Stafoggia et al., 2019], ensemble ML models [Chen et al., 2018; Di et al., 2019; Yazdi et al., 2020], or generalised addictive models (GAM) [Kloog et al., 2015]. Yazdi et al. (2020) developed an ensemble ML model (composed by RF, Deep Neural Network, GAM, Gradient Boosting, K-nearest Neighbour) to estimate PM2.5 for Greater London, reaching mean 2005-2013 CV spatial-R2 of 0.396. Using the same period, this study reached CV spatial-R2 of 0.788 for Great Britain. Modelling a larger area might have provided more spatial variability, improving substantially the CV spatial-R2. The application of a single learner to model air pollution in both spatial and temporal domains for Great Britain achieved a satisfactory performance. Ensemble model formats can be used in future applications using the same area of study and variables to test how much performance gain is reached compared to RF-learner.
In 2008, a new Air Quality Directive (AQD) came into force bringing considerable changes to the following UK annual air quality assessment in 2010, setting an annual mean target of 25 µg/m3 (the upper limit used in the annual plots) [Brookes et al., 2011]. Several policy controls and emission reductions had been put in place aiming to reduce from 2010 the road traffic PM emission by 83%, off-road mobile machinery by 54%, and energy production by 32% until 2020 [AQEG, 2013]. Independently of the temporal aggregation (daily or annual), all maps shown in this study detected this considerable drop in the PM2.5 concentrations from 2010 across Great Britain.
Some limitations must be acknowledged. First, the multi-stage model relies on the extension of the observed series of PM2.5 by predicting values based on co-located PM10 in Stage-1. This step was necessary for the application of the method in early years characterized by sparse PM2.5 monitoring, which is likely to have contributed to the lower predictive performance in this period. In addition, the cross-validation procedures revealed the presence of small bias in the Stage-1 predictions, particularly in the spatial domain, probably linked to limitations in modelling the relative distribution of the two PM components. Second, while the model displayed a good performance throughout the period, the accuracy is worse in the temporal domain in summer (Table B1), suggesting limitations in capturing the higher temporal variation in this season. Third, the generalization of the prediction model was dependent on the selected locations of the monitors, which may be not representative of the study domain. This can result in an underestimation of the error and potentially biases in the predictions in more remote and less represented areas, if structural spatial differences are not entirely captured by the model covariates.
Future research directions for improving the current definition of the spatio-temporal ML model and its output can be discussed as well. First, availability of predictors that capture spatio-temporal variations is critical for ensuring high predictive ability. Improvements in emission-dispersion models such as EMEP4UK, whose output was consistently selected as the most important predictor, or in remote sensing satellite products, which in contrast show limited present contributions, can be a key factor for increasing the model performance. For instance, new instruments available in more recent earth observation satellites are explicitly developed for applications in pollution measurements and environmental health, and can be integrated in future versions of the model (i.e. Copernicus Sentinel 5 Precursor satellite products) [ESA, 2020]. Second, spatially-lagged PM2.5 variables based on inverse-distance averages of nearby monitors were selected as second-best predictors, suggesting that the performance can be extended further if the underlying spatial correlation structure is appropriately modelled. This can be achieved by extending the current spatial definition to fully spatio-temporal variables, and by applying more sophisticated spatial methods to incorporate such correlation. Third, published articles have presented alternative ML algorithms in single-learner [Just et al., 2018; Zan et al., 2017] or ensemble models [Di et al., 2019], which can be tested and applied in alternative to the RF method proposed here. Fourth, finally, previous studies have developed methods to increase the resolution of the predictions beyond the 1 km grid, for instance using additional stages involving small-area variables [de Hoogh et al., 2018; Di et al., 2019; Stafoggia et al., 2019]. Similar approached can be applied here.
5. Conclusions
This study developed and applied a multi-stage ML model, combining data for multiple sources, including remote sensing satellite products, climate and atmospheric reanalysis models, chemical transport models, and geospatial features, to generate a complete map of daily PM2.5 concentrations in 1 km grid across Great Britain in the period 2003-2020. The model showed good performance overall and in both spatial and temporal domains, with an accuracy that is compatible with the use of such reconstructed values as proxy for PM2.5 exposures in epidemiological studies. In particular, the availability of high-resolution measures that can be linked as such or aggregated at different spatial and/or temporal scales makes the output suitable for investigations on both transient and chronic health risks associated to short and long-term exposures to PM2.5, respectively.
Data Availability
All data used to perform the analysis are in the public domain. References and sources are provided in the text.
Author Contributions
The authors’ contribution are as follows: Conceptualization, R.S.S. and A.G.; methodology, R.S.S., A.G., I.K., M.S. and, K.H.; validation, R.S.S. and A.G.; formal analysis, R.S.S., A.G. and F.S.; data curation, R.S.S.; writing—original draft preparation, R.S.S.; writing—review and editing, all authors; visualization, R.S.S.; project administration, A.G.; funding acquisition, A.G. All authors have read and agreed to the published version of the manuscript.
Funding
This study was supported by the Medical Research Council-UK (Grant ID: MR/M022625/1), the Natural Environment Research Council UK (Grant ID: NE/R009384/1), and the European Union’ s Horizon 2020 Project Exhaustion (Grant ID: 820655). EMEP4UK Model results and contributions by S.R. and M.V. were supported by award number NE/R016429/1 as part of the UK-SCAPE programme delivering National Capability.
Conflicts of Interest
All the authors declare no competing interests.
Acknowledgments
The authors are grateful for the technical support received from the European Centre for Medium-Range Weather Forecasts (ECMWF). The authors also would like to thank the European Space Research Institute (ESRIN) from the European Space Agency (ESA) for their feedback on this study. This research had free access to all data sources: (i) NASA EOSDIS Land Processes Distributed Active Archive Center (LP DAAC), (ii) NOAA National Centers for Environmental Information (NCEI), (iii) Copernicus Land Monitoring Service (CLMS), (iv) Copernicus Atmosphere Data Store, (v) Copernicus Climate Data Store, (vi) EMEP4UK, (vii) UK Department for Environment, Food & Rural Affairs.