Abstract
We present human mobility data for the United Kingdom collected from the “BBC Pandemic”, a public science project linked to the BBC Four television documentary of the same name. Mobile phone GPS trajectories submitted by users and collected over a 24 hour period were aggregated to construct anonymised origin-destination flux matrices at the local administrative district (LAD). We use these data to explore how mobility patterns change with age and employment status - unique stratifications that are not available from other publicly and privately held mobility data sets. We validate the consistency of the aggregated BBC mobility data set against census workflow data and illustrate how the systematic differences in mobility rates with age affect the risk and pattern of transmission between regions with 18-30 year old’s contributing the greatest risk of transmission to adjacent regions, but older 60-100 years playing the most important role in more remote low-density locations.
1 Introduction
The spatial dynamics of human mobility are a key mechanism driving the local persistence of endemic infections [9, 10, 24, 18] and limiting the rate of invasion of novel pathogens [31]. Over the past twenty years there has been an rapid expansion in the type of data and methods available to directly track or infer patterns of human mobility [8] from traditional sources such as census data, tracking bank notes [14], the frequency of posting on social media [36], mobile phone records [23] through to direct tracking using the global positioning system (GPS) [48]. Mobile phone GPS data are perhaps the most accurate and powerful tool to measure human mobility - as demonstrated most recently by the real time analysis of the effectiveness of lockdown procedures during the Covid-19 pandemic [32, 46, 49]. However, mobile phone data sets are privately held by mobile network operators and – with a few notable exceptions [48, 37, 26] – rarely shared publicly which is a barrier to reproducibility and open science [35]. Hence, spatial epidemic models still often rely on traditional sources of mobility data such as census workflows for commuters [7, 28, 15, 5]. Given the importance of protecting individuals identities, mobile phone data shared privately to researchers by network operators also lacks key meta-data on users such as gender, age and employment status which are likely to affect patterns of mobility [13].
To address these limitations, the BBC Pandemic project recruited over 86,000 participants in the United Kingdom between September 2017 and December 2018 as part of a public science project linked to a BBC Four documentary [31]. Here we present and analyse mobility data from 43,291 participants aggregated to the level of the observed flux between local administrative districts and stratified by age and employment status. The only other open source mobility data currently available for the United Kingdom is the 2011 census workflow data [1]. Published by the Office of National Statistics under version 3.0 of the open government licence this data can be freely shared, adapted and republished for both commercial and non-commercial purposes [4]. We compare the BBC mobility data set to this census workflow data which only captures commuting patterns of adults to and from their usual place of work. We estimate - and systematically compare - human mobility models for the United Kingdom and impute national level origin-destination matrices for the total BBC data set and four coarse grained stratifications by age. We use these imputed commuting matrices to demonstrate how the risk of transmission between regions changes for each age-group compared to using aggregated data.
1.1 Human Mobility Models
So-called gravity models, where the rate of migration between spatially segregated populations is modelled as a function of the distance between locations and their relative population size, have proven to be exceptionally popular for epidemiological modelling [24, 45, 19, 42, 22, 12, 17, 25, 30, 33, 6]. Taking its name and form from the Newtonian law of gravitation, the theory actually has its origins in the social sciences and in particular transportation theory [20]. However, beyond the dependence on population size and distance there has been very little consistency in the definition of gravity models by different authors.
The most basic formulation of the gravity model has the convenient - but unrealistic - characteristic that the population flux between two locations only depends on the local characteristics of the two interacting populations. These foundational models take no account of the distribution of population - or connectivity - of the population between two points. The Radiation model [40] was developed to account for these effects explicitly by construction and in some human mobility data sets achieves similar – or better – predictive performance than gravity models despite having no free parameters to estimate. This parsimony comes at the expense of a lack of flexibility with the Radiation model being outperformed by classical gravity formulations when factors other than population density – such as the types and opportunities of travel – are more important. The extended radiation model [47] addressed this limitation by introducing a single scaling parameter α which can be interpreted as defining a characteristic length scale (l) for trips with a region.
There have been several attempts within the transport theory field to address these same issues within the gravity modelling framework - in particular the intervening opportunities model [41] and the competing destinations model [21]. We consider two variants of the intervening opportunities model – the Schneider formulation examined by [34] and Stoufer’s Rank Model as recently revisited in the context of modelling historical measles epidemics in England and Wales [2]. Finally, we consider the Impedance model [39]. Taking inspiration from the Ohmic law, the Impedance model was proposed as parameter free mobility model and compared to gravity and radiation laws in a model of the 2010 Haiti cholera epidemic [39].
Human mobility data sets typically capture a snapshot of individuals movement for which the total flux of individuals moving (or commuting) from a region is a fixed margin. For such data it is convenient to model the probability of moving given a particular home location separately from the choice of destination [40]. We therefore first compare the mobility rates (proportion of users that have different origin-destination locations) for the BBC data to census workflow data from the United Kingdom, then estimate constrained (singly or work constrained) variants of each mobility model [34].
All of these models have previously only been estimated from and tested against either aggregate mobility data or indirectly with respect to infectious disease case reports. The BBC mobility data set offers the first opportunity to compare and assess the predictive ability of these models for different groups of society.
2 Methods and Results
2.1 Data
There were two components to the BBC Pandemic study, one focused on the town of Haslemere [29], and another focused on the wider UK population [31]. Here we present mobility data from the UK national study. Upon starting the BBC Pandemic app users first entered their basic demographic information, including age, household size, gender and occupation. The app then recorded their location at a 1km grid scale across the UK at hourly intervals for a 24 hour period. At the end of this period, users provided key meta data including their gender, age, occupation, health today (self-assessed), number of people in household, maximum distance travelled (self-estimated) and mode of transport. Users were additionally asked to complete a detailed survey on the social contacts they made over the study period which are reported elsewhere [3].
Over 86,000 participants started the survey and filled out their profile. Participants with no encounter or location data were excluded, as were users whose location recordings were all outside the UK leaving 47,741 user with GPS trajectories. A small subset of users repeated the survey and provided multiple observation periods. To protect the anonymity of these users we only consider the location data collected during the first 24 hour period of observation for this study. As a further step to ensure anonymity of users we aggregated individual user trajectories to origin and destination locations at the mid-layer super output area (MSOA). As the mid-layer super output areas nest within the local area districts (LAD), we map the MSOA code to the corresponding LAD to define origin and destination locations at this higher spatial scale and calculate the final origin-destination flux matrices Ωij [8] that record the number of users with origin i and destination j.
We use the modal location to define users origin (home location) and considered two alternative definitions of destination based on the furthest extent and second (next) most frequent location from home. For the purposes of this study we focus on analysis on the ‘next’ (most frequent) origin-destination matrices as these are the most consistent conceptually with the census work flow data. (Full details of these definitions are given in supplemental information).
To provide some context on changing patterns of mobility over the working week we tabulate the start time of each individuals observation period aggregated by day of week (Table 3). The majority of users started the app on a weekday (37,322). The highest single day of sign-up was on the Thursday that the associated BBC Pandemic documentary was originally broadcast (22nd March 2018). 2,981 users signed up during the hour of broadcast alone, with a total of 7,796 users starting the app on that day. The next highest day of sign-up (with 4,967 new users) was also a Thursday (28th September 2017) corresponding to the second day of a social media campaign run by the production company to promote the app.
Given the relative sparseness of the data set at the MSOA level, we focus on analysing the patterns of mobility at the LAD level. Raw data, estimated models and imputed flux matrices for both the next and furthest extent definitions are available in a public repository along with code developed for this manuscript (https://github.com/BBCPandemic/BBCMobility).
Of the 43,291 users in the BBC mobility data set (henceforth the Total BBC data), 25,114 had inferred home and next most frequent locations within the same LAD. We define the complementary set of 18,177 users with origin and destination locations in different LADs as ‘movers’ (Tables 1,2, Figure 1) rather than commuters to acknowledge that the displacements in the BBC mobility set capture a wider range of human mobility than the strictly commuting flows measured by census work flow data.
2.1.1 Modeling the origin-destination flux matrix
Flux patterns can be decoupled into two model components: the probability that someone moves between locations, and the location that someone goes to given that they do move. The origin-destination flux matrix Ωij can be correspondingly decomposed. The diagonal elements represent users who stayed in the same location and the off-diagonal elements correspond to the ‘movers’. Thus, the flux matrix of movers alone is given by the origin-destination flux matrix with the diagonal elements set to zero:
The total eflux from location i is given by the sum over all potential destinations while the total number of users with origin i (i.e. including the non-movers) is Δi = j Ωij. The proportion of movers for location i is given by . We define the conditional movement matrix: as the proportion of movers from i who choose destination j. It can then be seen that σ satisfies the normalisation Σj σij = 1 and by definition σii = 0. The origin-destination flux matrix can thus be decomposed in terms of the conditional movement matrix σij and probability of movement pi: where δij is the alternating Kronecker delta symbol.
2.1.2 Probability of moving
The data are now summarised in two separate components: the proportion of the population that move (pi) and where the movers go (σij). We first we consider the first part of this: the probability of movement, comparing the BBC and census datasets, and also explore how probability of movement varies by age and employment status.
Census outputs are prepared and published separately for Scotland, Northern Ireland and England & Wales at different levels of spatial aggregation. The Northern Ireland Statistics and Research Agency excludes all responses with work locations outside of Northern Ireland. The Office of National Statistics and Scotland’s Census publish aggregate numbers for the total work flows to each member nation, but not the sub-national location. For consistency, and to make comparisons between the member nations of the United Kingdom, we separate the combined England and Wales data set and treat the four resulting public census outputs as separate data sets for the purposes of model estimation and comparison (Table 1). Sub-national movement rates are comparable between the census workflow and BBC mobility data sets (to 1 significant figure).
The meta-data collected by the BBC pandemic app allows us to further stratify the Total BBC data set by age and employment categories. For age we consider four coarse grained categories – under the age of 18 (BBC Under 18), over 18 and under 30 (BBC 18-30), over 30 and under 60 (BBC 30-60) and over the age of 60 (BBC 60+). With respect to employment status we reuse the under 18 category (BBC Under 18), and define three alternative subsets for analysis corresponding to the group of users over the age of 18 and in education (BBC Education), over the age of 18 and in employment (BBC Employment) and over the age of 18 and not in employment, education or training (BBC NEET). The per-capita mobility rate for these strata of the BBC mobility data set vary from a minimum of 0.245 for BBC Under 18 to a maximum of 0.460 for BBC 30-60 (Table 2).
2.1.3 Geographic variation
The proportion of movement pi also varies considerably between regions (Figure 2). There is no clear relationship with the size of the resident population and only a weak association between the area of LADS and pi. However, there is a clear linear relationship between pi from the BBC data and the corresponding probability of movement from the census data, with an R2 = 0.71, suggesting that both instruments are measuring a common source of variability in mobility between regions.
With a median of 81 users per LAD (range 2-948) the raw BBC data is too sparse to estimate the geographic variation in population flux from each LAD for each stratification of interest (i.e. age or employment status). To address this limitation and be able to impute movement rates for each age and employment group we model the per location probability of moving using a generalised linear model (with logit link) and random intercept term for each LAD (i): where group is a categorical variable that adjusts the background rate of movement for each level of the age or employment strata. We therefore assume that the difference in mobility between groups is constant across the UK, proportionally adjusting the regional mobility captured by the random intercept term. Random effects models were estimated using the lme4 package [11] in the R statistical language [38].
Model fit was assessed graphically by plotting the predicted probability of movement against the observed data (Supplemental Figures 9,10). This basic check illustrates the model is performing as intended, fitting closely to the 18-30 and employed categories which have the largest sample size and pulling the imputed movement rate for under-sampled LADS and categories (Under 18, NEET) towards the group and local (LAD) averages.
2.2 Mobility Models
Here we now focus on the model component of where people go, conditional that they do move, in essence modelling the conditional movement matrix σij as defined above.
2.2.1 Gravity Model
The classical gravity formulation assumes that the flux between two locations depends on the product of the size of the donor (Ni) and destination (Nj) populations (with scaling parameters τ2, τ1) divided by a function of the relative distance (rij) between them:
However, this general form is arguably too flexible, and in particular is unbounded with no upper limit on the predicted number of commuters from the donor patch. We can normalise the gravity model by defining a vector of normalisation constants for each donor patch i by summing over the set of recipient patches j (for all (j ≠ i): forming a so called singly constrained gravity model. Recasting the model in terms of the conditional movement matrix σij, the donor term cancels out and is redundant in this formulation. We therefore define our basic gravity law model for i ≠ j as: with σii = 0 by definition and:
We consider gravity-type models with three different distance scaling functions, power-law , exponential f (r) = erij/ρ and offset .
2.2.2 Competing Destinations Model
Following the same reasoning, the competing destinations model [21] can be defined for our purposes for i ≠j as: where:
The parameter δ adjusts for the effect that other locations – the competing destinations – have on the flux. For a negative value of δ the effect of competing destinations is to reduce the flux to a particular location, whereas for a positive δ the flux between the two locations is enhanced by the presence of alternative destinations. When δ = 1 the competing destinations term vanishes and we recover the classical gravity model. As the gravity model is nested within competing destinations we do not fit these simpler gravity model formulations separately. We therefore compare three gravity type models: competing destinations with power law distance function (CDP), competing destinations with exponential distance function (CDE) and competing destinations with offset distance function (CDO).
2.2.3 Extended Radiation Model
The radiation model has no free parameters to estimate and is normalised by construction: where sij is the total population in a circle of radius rij centred at i and excluding Ni and Nj themselves [40]. The extended radiation model (ERad) introduces a single scaling parameter α [47]. where
Yang et al. [47] proposed that α should vary between between regions with different characteristic length scales (l defined as mean length between regions being modelled) such that:
For the UK local authorities, McNeill et al. [36] calculated l = 19km and thus we would expect α = 0.43. Here we estimate α as a free parameter to compare with this predicted scaling law.
2.2.4 Intervening Opportunities Model
Schneider’s intervening opportunities (IO) model [34] depends on the same matrix sij as the radiation model and is defined for i ≠ j as: where
2.2.5 Stoufer’s Rank Model
Stoufer’s (Sto) Rank model [2] also depends on the same matrix sij as the radiation model and can be defined as: where
2.2.6 Impedance Model
The impedance (Imp) model [39] is defined as: where
In common with the radiation model this mobility model has no free parameters to estimate.
2.3 Inferential framework and model comparison
Each row of the mover flux matrix could be considered as a multinomial sample [42] with trials and probability vector σi equal to the corresponding row of the mobility matrix σij:
However, for a multinomial likelihood the variance of observations scales linearly with the number of trials. As the efflux scales with local population size, a multinomial likelihood for these data will be dominated by contributions from large urban centres potentially introducing systematic biases and not allowing for the possibility of over-dispersion in the sampled flux. The expected rate of flux for a given mobility matrix σi is:
From which we can construct a negative binomial likelihood and explicitly allow the variance to scale with (origin) population size:
Note that we use the ecological parameterisation of the negative binomial specified by the mean and shape parameter ϕ) and the variance of the flux leaving patch i is thus ..
We estimate posterior distributions for each model using Hamiltonian MCMC (as implemented by Stan [16] http://mc-stan.org/). To assess model fit and provide a basis for model selection we use approximate leave-one-out cross-validation [43, 44]. For numerical stability we restrict the range of parameters such that 0 < τ, ϕ < 5, 0 < α < 1, 5 < δ < 5 and 0 < γ < 10−4. We restrict ρ > 0 for the offset and exponential competing destinations models (CDO, CDE), with the further restriction that 0 < ρ < 5 for the power law scaling (CDP). We choose Cauchy (0,5) prior distributions for all parameters. All further analyses were carried out in R [38].
2.4 Model checking and cross-validation
For each combination of model and data set, 4 Hamiltonian MCMC chains were run for the default 2,000 iterations, unless a greater number were required to pass diagnostic checks. Chains were well mixed for all combinations of models and data sets and passed standard convergence diagnostics. As a further predictive check of the model – and to provide a basis for model comparison and selection – we carried out approximate leave-one-out cross validation (LOO) [43].
This method uses Pareto smoothed importance sampling (PSIS-LOO) [44] to estimate the expected log pointwise predictive density: which measures the predictive accuracy of the model when a single observation is dropped out. The difference between for alternative models fitted to the same data provides a measure of their relative predictive accuracy. The standard error on the difference gives a measure of uncertainty with standard errors comparable to the magnitude of the difference suggesting the relative predictive accuracy of the two models is indistinguishable.
The estimated (Pareto) shape parameters for the predicted distribution of can be used to judge the reliability of the estimate of for each data point. The estimate of is considered reliable for , performance may still be reliable for values of up to 0.7. Values of suggest that the data points are highly influential to the estimated posterior and potentially introducing bias.
Models estimated from the full BBC mobility and Census data sets have five locations with and two with suggesting these locations are highly influential to the estimated posterior distribution and potentially introducing bias. Most of these issues can be traced to the uniquely large number of commuters to two specific districts within London: namely the City of London and Westminster. The City of London and Westminster (and to a lesser extent other boroughs of London) attract anomalously large numbers of commuters for their size. This can be visualised by plotting the numbers of commuters in (influx) against leaving (efflux), (Supplementary figure 11). For the majority of LADs this relationship is symmetric with both the influx and efflux of commuters approximately proportional to the resident population size. However the City of London and Westminster have orders of magnitude higher commuters in than residents who commute out. This extreme lack of fit of gravity type models results in a systematic bias to parameter estimates as illustrated by comparing the posterior estimates between the full set of 326 English LADs and a data set dropping the City of London and Westminster (324 English Lads, Supplementary figure 12).
The Highland LAD in Scotland has a similar impact on parameter estimates from the BBC mobility data set (due in this case to the small flux of users to this location). Removing these three locations, then all of the models estimated from the BBC mobility data have . However, even after these steps between 1 to 9 LADS have . for the gravity type CDO, CDE and CDP models estimated from census data. Given the greater overall predictive performance of these models on the BBC mobility data set we do not wish to simply set aside these models.
Although the specific problematic LADS vary between models, the value of increases directly with the size of the resident population. Gravity models are notoriously sensitive to contributions from high population density locations. This, and the disparity in sample sizes between the BBC mobility and census data sets, was our motivation for using a negative binomial likelihood where the shape parameter ϕ controls the variance to mean scaling relationship. For a mean flux y, the variance of the negative binomial likelihood is inversely related to the shape parameter . Reducing the value of ϕ reduces the influence of larger populations to the likelihood, hence we can carry out a sensitivity analysis to the influence of the outlier values by fixing the value of ϕ and reducing it until the pareto shape parameter for all of the observations < 0.7. This was achieved for a fixed shape parameter of ϕ = 0.1. The models estimated from census data with a fixed value of ϕ reduced the absolute values of the likelihood but did not change the rank ordering of models (described below). Although the point estimates of the gravity parameters do change systematically with ϕ, they still have overlapping credible intervals in the range between the estimated value and fixed value (0.1) where the census models pass all diagnostic checks (13). For the purposes of comparison to the BBC mobility data set we present the models with estimated value of ϕ, which have a greater overall predictive performance (as measured independently via the common part of commuters index described in the next section).
2.5 Model Selection and Predictive Performance
The rank order of mobility models based on is identical for the UK level stratifications of the BBC mobility data set (Tables 5, 6, 7) – with the competing destinations (CDO) model favoured above the other models for all data sets. The ranking of the second and third ranked models varies between member countries of the UK, but the differences are small compared to the standard deviation implying there is little to choose between the predictive ability of the three gravity formulations for these data. The English census data is the only data set where an alternative model – the Extended Radiation model is preferred.
As a final posterior predictive check of the estimated models we consider the Common Part of Commuters (CPC) index introduced by [47] which can be defined as: where is a posterior predictive flux matrix from a mobility model fitted to the empirical flux matrix and the sum is over all donor (j) and recipient (i) patches within the meta-population. A value of 1 is calculated when there is perfect agreement between the two matrices, with 0 when there is no agreement. Figure 3 presents the posterior predictive distributions for the CPC for each combination of mobility model and data set. This measure is in broad agreement with the ranking achieved through LOO cross-validation - with the competing destinations model favoured in the majority of cases, but once again with relatively little difference in predictive performance between the top three models.
2.6 Impact of age and employment status on mobility patterns
We first compare the gravity parameters for the BBC Total data set estimated from each of the member nations of the UK and compare with estimates from the census workflow data (Figure 4). The pooled UK estimates are most consistent with Census estimates from England (which has the largest population and number of LADs) but there are systematic differences in estimates from England and the smaller member nations. Containing only 10 LADs the Northern Ireland data set is clearly to small to support inference of this model, with huge uncertainty in the results should not be interpreted and are presented only for completeness. The Scottish and Welsh data sets demonstrate an increased importance of population size (larger τ value) and a longer spatial scale (ρ, α parameters) compared to estimates from England and the pooled UK data.
Given this sub-national variation it would be ideal to compare the mobility patterns of by age and employment categories of the BBC mobility data set at this level. However, given the sample size we must restrict comparison to models estimated at the national (UK) level. There is close correspondence between the age and employment groups suggesting that the 18-30, 30-60 and 60+ age groups behave largely like the education, employed and NEET categories.
Once again given that the vast majority of users were are in the 30-60 or employed categories, the parameters estimated from the BBC Total data set are statistically indistinguishable from those from the subset that were employed with overlapping posterior distributions (Figure 5). There are systematic differences in the estimated gravity scaling parameters for Under 18s, those in Education and NEETS. Under 18s have a faster decay in mobility with distance than other groups (larger α) and experience the strongest impact of competing destinations. Population size is a more important predictor for those over the age of 18 and in full time education (higher estimated value of τ) - which may reflect the location of universities, colleges and other higher education institutions within urban centres. By comparison NEET’s range further (smaller α, but population density is less important in modulating their decisions (smaller τ).
To facilitate the use of the BBC mobility data in simulation studies, we use our estimated models for the probability of movement and choice of destination to impute UK population flux matrices for each level of the BBC mobility data set (Figure 6): where is the total population in group g resident in patch i. These imputed matrices are calculated using the point (median) posterior estimates and provided as supplementary information.
2.7 Force of infection for a multi-group commuter model
To explore the extent to which variation in commuting rates and patterns within a population translates to epidemiological risk we derive a commuter approximation for the force of infection for a generic SIR (susceptible S, infected I, recovered R) model within a meta-population with multiple movement groups. We assume that the force of infection is well mixed within each patch. The effective local force of infection within a patch can be conceptually thought of as being constituted by two parts: the local force of infection due to resident infectives and the extrinsic force of infection generated by infected movers resident within the local population and susceptible movers to other spatial locations.
Elaborating an earlier result from [27], [7] demonstrated that the magnitude of these two components can be explicitly derived from the mechanistic movement rates. The per capita movement rate from patch i to patch j for members of group g can be calculated in terms of the inferred probability of movement and conditional movement matrix : where D is the average trip duration.
The total movement rate from patch i to patch j will then be:
As long as the return rate is small with respect to the generation time of the pathogen then the force of infection acting on susceptibles (Sig) in patch i and movement group g can be approximated as: where .
We further assume that individuals differ only with respect to their mobility and have equal susceptibility and infectiousness and mix homogeneously with other individuals within the same patch. Under this assumption the force of infection only depends on the total number of infected individuals in each patch: Ii = Σg Iig. The first term corresponds to the contribution to the force of infection acting on susceptibles within group g resident in patch i:
This has also two terms, the first corresponds to the contribution from local infectives (who are not commuting) from all four mobility groups. The second term corresponds to the contribution of susceptible individuals from patch j meeting infectious individuals (from all mobility groups g) while commuting. Similarly, the contribution of local susceptibles encountering infectious individuals (of all groups) while commuting to another patch is:
The effective population size within each patch is: where the population size Ni = Σg Sig + Iig + Rig and we neglect higher order terms of the movement matrix O(σ2) and higher.
2.8 Geographic risk of transmission
We use our multi-group commuter model to explore how the predicted geographic risk of transmission differs between an aggregate and age-stratified commuter model. Previous theoretical work has suggested that difference in mobility rates (or equivalently average trip duration D) are most important in the early stages of invasion when incidence is low [7]. As an illustration we consider an invasion seeded, as was the BBC pandemic, in Haslemere situated within the Waverley district in southern England. We consider the scenario where a novel pathogen has been introduced into a single location and identify the age-group that contributes the largest value to the net force of infection assuming a single infectious individual within each age group (Figure 7 A). We note first that although modulated by population density (7 C), distance from the seed location is the primary determinant of the net contribution to the force of infection in distant LADs ((7 B). While young people – in this case the 18-30 group – provide the largest net risk of transmission to another LAD – the relative contribution of older age-groups becomes more important for distant, low density areas (7 B) which also tend to have older populations (7 D). In supplemental information we explore a range of different seeder locations in comparable commuter districts across the UK (Falkirk, Belfast, Newport) and see the same basic pattern with one small variation. For low density districts proximal to the seed location the youngest Under 18 year old group, whose movements are closely constrained to their home location, are occasionally the dominant group - perhaps reflecting the size of school catchment areas.
3 Discussion
In this paper we have introduced mobility data from the BBC Pandemic project to perform an analysis of how commuting patterns differ between groups of individuals with respect to employment status. We estimate and compare seven alternative human mobility models and find that the Competing Destinations model (with offset distance kernel, CDO) provides the best fit and predictive ability as assessed by leave-one-out (LOO) cross validation and posterior predictive checks of the common part of commuters (CPC) index. The competing destinations model extends the classical gravity formulation by adding a term that adjusts the flux between two locations according to the network of alternative locations within the meta-population. Although it lacks the elegant theoretical origins and mechanistic interpretation of the Radiation model – the additional flexibility makes it more appropriate for exploring the differences in mobility patterns between different groups.
However we note that, despite requiring two less parameters than the favoured CDO model, the Extended radiation model has similar absolute predictive performance in terms of the common part of commuters (CPC) index. The estimated value of α = 0.49 (0.48 − 0.5) for the full UK BBC Total data set is in line, but slightly higher than, the expected value of α = 0.43 independently estimated by McNeill et al. [36] based on a length scale for average trips of l = 19km.
Users in the BBC mobility data set demonstrate a higher rate of mobility across employment categories. There are important differences in the rates of mobility between urban and rural areas for users in different employment and age groups which could potentially be important for modelling the invasion of novel pandemic pathogens as the increased mobility and range of older individuals and those outside employment could enhance the rate of spread in the early stages of an outbreak.
For this paper we constructed origin-destination flows from users GPS trajectories taking the most frequent location as the inferred home (origin) and considered two alternative measures for the destination. The results we present in this paper use the second most frequent location (“next”) as the users destination. Repeating the analysis using the furthest extent and all recorded user locations (other than the origin) give the same qualitative results in terms of the relative performance of mobility models and differences between employment categories. We chose to focus on the “next” definition given it’s logical consistency with the question asked in the UK census. To validate this assumption by comparing the common part of commuters (CPC) between the predicted flux from the England & Wales census data to the flux predicted by the best fit models to the BBC Total data (Figure 8). Both definitions lead to estimated models with predicted flux that has a high (and indistinguishable) degree of similarity to the flux predicted from models estimated from Census data. For our purposes in this paper, to explore how human mobility between different employment groups in society varies from patterns inferred from census data, it is clearly the most appropriate choice. However, it is not necessarily the most appropriate choice for predicting disease transmission, hence we include both alternative definitions and estimated models within the the on-line data respository.
An unavoidable consequence of the crowd-sourced nature of our data is that users knew when their movements were being recorded and chose when to start tracking on the app which could potentially have changed their behaviour. We expect that this effect will be mitigated to an extent by the vast majority of users that signed up on (or immediately after) the day of broadcast. Although recruitment through the app was open over the course of a year, the largest group of users signed up following the broadcast of the documentary. There was also a smaller wave of recruitment following a social media campaign before filming was carried out in the town of Haslemere. This temporal pattern highlights both the effectiveness of the documentary and public engagement activities on recruitment, but also the potential for bias to be introduced into our sample of user trajectories. The close correspondence between the predicted flux and gravity parameters from the Total BBC data and subnational census workflows provides another layer of reassurance about the representativeness of the BBC data.
Census workflow data (and privately held mobile data sets) have the advantage of being dense enough that the raw data can be used directly in commuter models [5] without the need to estimate the human mobility models necessary to interpret the BBC mobility data. The consistency of the total BBC data at the aggregate level to census data, belies the differences in both the probability of movement (from home) and the choice of destination for movers not in full time employment. The unique meta-data collected along with GPS traces from the BBC Pandemic app has allowed us to quantify these differences for the first time. Under 18s and the over 60s are both less likely to move and have destinations closer to home than those in employment. We used a multi-group commuter model to illustrate how the predicted spatial risk of transmission varies according to seeder cases in different age groups. These differences may be small in aggregate but could be critically important in assessing the risk of spillover between regions in the early stages of a pandemic or in the period immediately following the easing of lockdown restrictions.
Data Availability
Anonymised data and all code required to replicate analyses within the paper are available from a online repository linked below.
A Definition of Origin-Destination Matrices
User locations were first snapped to the nearest MSOA based on the generalised (20m resolution) shape file provided by the Open Geography portal from the Office of National Statistics (ONS). Locations outside of the boundary of any MSOA were either excluded or snapped to the MSOA with the greatest area of overlap within a 1km buffer centered around the user location. A home (origin) location was defined for each user as their modal MSOA. In defining the duration of time spent in a location we needed to account for missing location logs for some users who moved into an area with poor service, or switched off their phone, during the observation period. For such gaps we make the assumption the user remained in the last seen location until a new location was logged and use the duration of time within each location to calculate the modal location. In the event of a tie we chose the location with the least amount of time spent in the 12 hours between 7am and 6pm (inclusive). Users to which we could not assign a home location were removed from the data set for analysis.
For destination locations, two alternative definitions were considered - the furthest extent and the second most frequent recorded location after home (which we will refer to as the ‘next’ location for convenience). For the ‘next’ location we again ranked locations based on the total duration of time spent including the inferred location between gaps as described above. In the event of ties the location with the greater proportion of time between 7am and 6pm was chosen. Once again, users to whom we could not assign a unique destination location were removed from the data set. In total there were 4,450 users we could not assign a unique origin and destination location according these definitions leaving a total of 43,291 users within the final BBC mobility data set.
For users with all location records in the same MSOA, their home and destination locations are both set to this unique value. As the LAD origin and destinations are mapped from these MSOA locations, at the LAD level, users can therefore have the same inferred origin and destination locations (even though they have moved between different MSOAs over the course of the reporting period).
B Supplemental Figures
C Figure supplements to Figure 7
D Posterior estimates
Footnotes
NOTE: This article is a preprint and has not yet been certified by peer review.