Abstract
Violence as a phenomenon has been analysed in silos due to difficulties in accessing data and concerns for the safety of those exposed. While there is some literature on violence and its associations using individual datasets, analyses using combined sources of data are very limited. Ideally, data from the same individuals would enable linkage and a longitudinal understanding of experiences of violence and their (health) impacts and consequences. However, in the absence of directly linked data, look-alike modelling may provide an innovative and cost-effective approach to exploring patterns and associations in violence-related research in a multi-sectorial setting.
We approached the problem of data integration as a missing data problem to create a synthetic combined dataset. We combined data from the Crime Survey for England and Wales with administrative data from Rape Crisis, focussing on victim-survivors of sexual violence in adulthood. Multiple imputation with chained equations was employed to collate/impute data from different sources. To test whether this procedure was effective, we compared regression analyses for the individual and combined synthetic datasets on binary, continuous and categorical variables. Our results show that the effect sizes for the combined dataset reflect those from the dataset used for imputation. The variance is higher, resulting in fewer statistically significant estimates. We extended our testing to an outcome measure and finally applied the technique to a variable fully missing in one data source. Our approach supports the feasibility of combining administrative with survey datasets using look-alike methods to overcome existing barriers to data linkage.
Introduction
It has been established for over 20 years that violence is a complex social problem and a public health issue [1–3], with implications for the health and social care systems, police and justice systems [4], as well as significant productivity losses for those who experience it [5, 6]. Analysing data collected by these systems can aid understanding of the problem of violence and how to respond to it. In social research, analysing administrative records together with survey data has already enabled better measurements of violence experiences, capturing experiences of both victim-survivors and perpetrators across multiple points in time and social and economic domains [7].
Although some violence-related research has been carried out using matched or combined emergency department and police data [8–11], most studies in violence-related research analyse data in silos due to difficulties in accessing data and concerns for the safety of those exposed [12, 13]. In particular, data from third sector voluntary specialist support services for victims or perpetrators of violence have, to our knowledge, not been linked or combined with other datasets, as these services are keen to provide person-centred trauma-informed care and fear that information on their service users may be used against them in courts or by immigration authorities [14, 15].
From an analytical viewpoint, ideally, data from the same individuals would enable linkage and a longitudinal understanding of experiences of violence and their (health and inequalities) impacts. However, given safety concerns, data on people who have experienced violence is often pseudonymised before being made available for researchers, meaning records across sectors pertaining to the same individuals cannot be linked. Look-alike profiling may provide an innovative and cost-effective approach to exploring patterns and associations in violence-related research in a multi-sectorial setting.
Look-alike modelling has been extensively used to identify similar and new customer and consumer target groups in marketing, e-commerce and advertising [16–18]. We apply customer look-alike principles to violence-related research. Our goal is to propose an innovative method for data integration in this particularly sensitive research area, to move beyond silo analyses. Effectively, this method allows for integrating additional information into one dataset based on its distribution and associations in another dataset, creating a new (synthetic) dataset. This methodology could also be used in other fields of social and economic research, where issues regarding pseudonymisation and missing information are also present.
In this paper, we approached the problem of data integration and look-alike profiling as a missing data problem, although we acknowledge that several other approaches are possible. We combined data from the Crime Survey for England and Wales (CSEW) with administrative data from three Rape Crisis Centres (RCC) in England, which are part of Rape Crisis England and Wales (RCEW), focussing on victim-survivors of sexual violence in adulthood. This is in line with the understanding that a benefit of linking administrative and survey data is the improvement in imputation methods to fill in missing values in surveys [19]. Multiple imputation with chained equations was employed to collate and integrate data from these two different sources, producing a synthetic dataset.
Theoretical framework
In theory, look-alike modelling is based on the principle that similar individuals have similar behaviours. While in economics this normally refers to consumption behaviour, for people experiencing violence it refers to their trajectories and help-seeking behaviours. Therefore, to explore similarities between individuals, one needs to look at socio-economic and demographic variables, as well as violence experience. Mathematically, in two different datasets A and B, there are aij and bij individual records. These records can be compared in multiple variables k to ascertain how similar their look-alike profiles are. Each component-wise or variable-wise comparison relies on a vector that effectively produces a comparison function looking at the values of the record component or variable k in the two records aij and bij. In order to approach this data integration problem as a missing data problem, one relies on a sequence of univariate imputation models, with fully conditional specifications of prediction equations. Formally, for imputation variables X1, X2, . . . , Xp and complete independent predictors C, so that:
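In the standard chained-equations formulation, which is assumed here to be the form intended, each imputation variable is drawn in turn from its own conditional model given the current values of the others. The symbol g_j for the univariate imputation model of X_j is introduced here for illustration and is not taken from the text:

```latex
\begin{aligned}
X_1^{(t+1)} &\sim g_1\!\left(X_1 \mid X_2^{(t)}, \ldots, X_p^{(t)}, C, \phi_1\right) \\
X_2^{(t+1)} &\sim g_2\!\left(X_2 \mid X_1^{(t+1)}, X_3^{(t)}, \ldots, X_p^{(t)}, C, \phi_2\right) \\
&\;\;\vdots \\
X_p^{(t+1)} &\sim g_p\!\left(X_p \mid X_1^{(t+1)}, \ldots, X_{p-1}^{(t+1)}, C, \phi_p\right)
\end{aligned}
```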
Where t are iterations that converge at t = T and φj are the corresponding model parameters prior [20]. In our study, we created the vector Ci,j based on the following variables: type of sexual violence experienced (type of SV), relationship to the perpetrator, health impact, employment status, housing tenure, number of dependants, relationship status (usually referred to as marital status in social research), ethnicity, age and gender. These variables were selected as they are considered to influence the journey of victim-survivors of sexual violence and their help-seeking behaviour.
Traditionally, multiple imputation (MI) is used to address missingness of data by generating plausible values derived from distributions and relationships among observed variables [21]. While MI has been widely used in statistical and economic analysis of clinical trials [22] and more recently social research [23], to our knowledge, it has not been used to produce a synthetic dataset. Our multiple imputation approach to data integration recognises that the reason for missing data may be different for each dataset A and B. This is particularly true in our empirical application, since we are using a population-level survey (CSEW) and administrative records from a victim support service (RCEW). Furthermore, while datasets A and B are completely independent in our case, the reasons for missingness may be correlated, as disclosing sexual abuse is still stigmatised in society [24–26]. Finally, our approach recognises that the variables or covariates used for imputation may have non-normal distributions [27, 28].
Procedurally, multiple imputation replaces each missing value with a set of plausible values. Following Bayesian rules, the imputed values are drawn based on the conditional distribution of the missing observations given the observed data, reflecting the uncertainty associated with the missing data itself and parameters estimated in the imputation model [29]. Mathematically, let fij represent the variable to be imputed for the ith individual within the jth cluster. In this case, Ci,j, the comparison function, and D, the cluster-level vector of covariates, are the predictors of missingness in variable f at individual and cluster levels respectively. Then, an MI model can be specified as:
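A linear specification consistent with the definitions given around it, offered as an assumed form rather than the exact original notation, would be the following. Any intercept is taken to be absorbed into the individual-level covariates:

```latex
f_{ij} = \beta^{\top} C_{ij} + \gamma^{\top} D_{j} + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim \mathcal{N}\!\left(0, \sigma^{2}\right)
```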
Where β and γ are the vectors of the regression coefficients corresponding to individual and cluster-level covariates. The model assumes that the error term (ε) is normally distributed with variance σ2. The imputation procedure generates multiple values for each missing observation based on the distributions for β, γ and σ conditioned on observed data. By combining two datasets A and B, based on the vector C and using multiple imputation, we are applying a look-alike modelling approach that may enable imputation of partially and completely missing data into a complete combined synthetic dataset.
Methods
We aimed to test our proposed approach to data integration by combining survey data from the Crime Survey for England and Wales (CSEW) with administrative data from Rape Crisis England & Wales (RCEW), focussing on victim-survivors of sexual violence in adulthood. This research was reviewed and approved by the IMJEE (International Politics, Music, Journalism, Economics, and English) research ethics committee from City, University of London (ETH2122-2023 and ETH2122-0299). Informed verbal consent regarding future use of their data for research was obtained by case workers from Rape Crisis centres while working with service users and recorded in their case management system, in line with their non-intrusive approach to data collection, whereby only what is appropriate is asked and/or what survivors choose to disclose is recorded [30].
Datasets
The CSEW, previously known as the British Crime Survey, is a nationally representative face-to-face victimisation survey of about 35,000 to 46,000 respondents per survey wave, which started biennially from 1982 before becoming an annual survey from 2001 [31]. The CSEW asks people aged 16 and over about their experience with household and personal crimes in the twelve months prior to the interview. Considering our focus on sexual violence, we only included individual-level data from respondents who had reported being a victim-survivor of rape, attempted rape, wounding with sexual motive, or indecent assault. In order to include a sufficient number of incidents of sexual violence for the data integration, we used CSEW data from 2001 to 2020.
The RCEW data comes from three RCCs in a region in eastern England and is based on routinely collected administrative data recorded in a centralised case management data system between April 2016 and March 2020. Information is self-reported by victim-survivors upon initial contact with RCEW, most commonly over the phone but sometimes online or face-to-face, and data are inputted to the RCEW database by frontline support workers. Rape Crisis centres collect individual level data for their service users in pre-determined coding categories based on a person-centred non-intrusive principle, which means frontline workers only ask questions that are appropriate, or rely on information victim-survivors choose to disclose [30]. Information collected typically includes socio-demographic and protected characteristics (gender, age, disability, ethnicity, nationality, sexuality, religion, marital status, accommodation, employment, language, immigration status, socioeconomic status), experiences of sexual violence and abuse (SVA), victim-perpetrator relationship, impacts from experience of SVA, risk level, referral routes, engagement with different (statutory and non-statutory) services and contact with the criminal justice system. Data on experiences of SVA are collected in two main ways; information is gathered on the ‘presenting incident’ (the main experience of violence the victim-survivor is seeking support for at the time of initial contact with RCEW), and elsewhere in the database further details can be entered under ‘incident summary’ about separate ‘incidents’ or experiences of violence, if disclosed [30]. Most information is inputted into their case management system at the point of intake based on the victim-survivor’s report and, where necessary, the assessment of the support worker. However, further information on the abuse can be collected and recorded at any point during the support journey, as appropriate. 
Case management and criminal justice data are collected in separate parts of the system; however, both are recorded under a client identification number, making it possible to merge them.
Considering our focus on sexual violence, we selected respondents (CSEW) or service users (RCEW) who had reported being a victim-survivor of rape (including attempted) or another form of sexual violence and abuse (which included indecent assault and wounding with sexual motive). We selected respondents/service users with no missing values on vector C variables, which led to a sample of 1,232 incidents from 1,111 individuals in the CSEW, and 6,102 referral cases from 5,333 individuals in RCEW. The RCEW data included individuals who accessed the service more than once.
The comparison vector (C) variables
As previously mentioned, we created the vector Ci,j based on the variables that were considered to influence victim-survivors’ journeys and help-seeking behaviour the most. Thus, we needed to harmonise the following variables across CSEW and RCEW data: type of sexual violence experienced, relationship to the perpetrator, health impact, employment status, housing tenure, number of dependants, relationship status, ethnicity, age and gender.
The type of sexual violence experienced in the CSEW was categorised into crime codes by professional coders based on respondents’ responses to survey questions and narrative description of the incident. The categories are aimed to align with Home Office categorisation. We selected the following reported offences: rape, serious wounding with sexual motive, other wounding with sexual motive, attempted rape, and indecent assault. We categorised these into rape (including attempted) or some other form of sexual violence. In the RCEW data, sexual violence was categorised based on the information recorded at intake under ‘presenting incident’ and ‘incident summary’. Once again, we categorised these into broader categories: rape (as an adult, including attempted rape); and some other form of sexual violence (including sexual assault, assault by penetration, voyeurism, sexual bullying, penetration by object, gang related sexual violence, forced sexual activity in public, exposed to sexual images, sexual harassment and sexual exploitation). Victim-survivors accessing RCEW services for other types of violence or abuse were excluded, including rape or sexual abuse during childhood.
For the variable victim-perpetrator relationship, respondents to CSEW were first asked whether they knew the perpetrator, and if so, what their relationship was at the time of the incident. The RCEW data recorded who the primary perpetrator was. This was categorised into domestic (such as [former] intimate partner or family member), acquaintances (including friends, colleagues), and strangers. If multiple perpetrators were mentioned, it was coded as the closer relationship (e.g. prioritising domestic over acquaintances).
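The prioritisation rule for multiple perpetrators can be sketched as follows. This is a minimal illustration only: the raw labels and the mapping are hypothetical, the real CSEW and RCEW codings differ, and grouping "unknown" with strangers is an assumption made for the example.

```python
# Lower number = closer relationship; the closest one wins.
PRIORITY = {"domestic": 0, "acquaintance": 1, "stranger": 2}

# Hypothetical raw labels mapped to the three harmonised categories.
RAW_TO_HARMONISED = {
    "partner": "domestic",
    "ex-partner": "domestic",
    "family member": "domestic",
    "friend": "acquaintance",
    "colleague": "acquaintance",
    "stranger": "stranger",
    "unknown": "stranger",  # assumption made for this example
}


def harmonise_relationship(raw_labels):
    """Return the closest harmonised relationship among the listed perpetrators."""
    categories = [RAW_TO_HARMONISED[label.strip().lower()] for label in raw_labels]
    if not categories:
        raise ValueError("at least one perpetrator label is required")
    # Pick the category with the smallest priority value (closest relationship).
    return min(categories, key=PRIORITY.__getitem__)
```

For instance, an incident listing both a friend and an ex-partner would be coded as domestic under this rule.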
The health impact of the incident was assessed in the CSEW by whether they were bruised, scratched, cut or injured in any way as a result of the incident. The health impact was measured in the RCEW data using information recorded under ‘incident impact’ and ‘impact summary’, for which we included physical health impacts of memory loss, physical injuries and body problems, gynaecological disorder, and sexually transmitted infection. While these do not match directly between the two datasets, we only included a binary in our empirical application for whether there was (yes/no) a health impact on the victim-survivor.
Relationship status was categorised into whether respondents were in a co-residential relationship (either married or cohabiting), single/non-resident partner/widowed, or separated or divorced in the CSEW and RCEW. Ethnicity was coded as White and non-White, as further differentiation led to numbers that were too small in some categories. However, we acknowledge that the ethnicity categorisation of White/non-White may be problematic and any conclusion in this respect limited [32]. Employment status in both datasets was assessed by whether people were employed, unemployed, students, or outside the labour force (e.g. a homemaker, retired, or unable to work due to illness). Gender was asked as whether the respondent was male or female (see footnote 1) in the CSEW. In RCEW data, more detailed responses are given, including transgender female and transgender male, which were recoded into men and women. Finally, age was measured numerically in both datasets and we included in our analyses people over the age of 16. Table S1 in the supporting information summarises how variables were harmonised.
Analytical Strategy
To test whether approaching look-alike modelling as a missing data problem was effective, we compared regression analyses for the two datasets (CSEW and RCEW) and the combined synthetic dataset, which imputed data based on the comparison vector. As a proof of concept, we tested the approach using variables of different types (binary and continuous) that are observed in both datasets. Formally, our approach had three steps. First, we specified the same linear (OLS) or logistic regression (as appropriate) for dataset A (RCEW) and dataset B (CSEW). Second, we treated one variable as missing from the combined integrated synthetic dataset by generating a completely missing variable for dataset A. Third, we imputed this variable, using multiple imputation with chained equations, based on the observed values for the combination vector in both datasets. This effectively imputed the (assumed) missing variable in dataset A based on the distribution and associations with other variables of the combination vector in dataset B.
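The logic of imputing a variable that is fully missing in dataset A from its association with the comparison vector in dataset B can be sketched as below. This is a deliberately simplified stand-in, not the chained-equations procedure used in the study: it uses a single predictor, repeats a stochastic regression imputation m times, and ignores uncertainty in the regression parameters. All function names and the toy data are invented for illustration.

```python
import random
import statistics


def fit_ols(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (len(x) - 2)) ** 0.5


def impute_from_donor(donor_x, donor_y, target_x, m=5, seed=0):
    """Create m imputed copies of a variable that is fully missing in the target."""
    rng = random.Random(seed)
    a, b, sd = fit_ols(donor_x, donor_y)
    # Each imputation adds fresh residual noise to the donor-based prediction.
    return [[a + b * xi + rng.gauss(0.0, sd) for xi in target_x] for _ in range(m)]


# Toy data: x is a binary comparison variable observed in both datasets,
# y is observed only in the donor dataset (the role played by dataset B).
donor_x = [0, 0, 0, 1, 1, 1]
donor_y = [30.0, 34.0, 32.0, 24.0, 26.0, 22.0]
target_x = [0, 1, 1, 0, 1]  # the role played by dataset A, where y is missing

imputations = impute_from_donor(donor_x, donor_y, target_x, m=5, seed=42)
```

Each of the m completed copies would then be analysed with the same regression, and the results pooled across copies.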
We carried out this exercise for four variables, two that are very similarly measured – age (continuous) and gender (binary), one that is differently measured across datasets – health impact (binary), and lastly, we illustrated the potential of this method of combining data in a real-life application to a variable that only appears in one dataset (CSEW) – frequency of abuse (count). We acknowledge that the first two tests, using age and gender, are not particularly interesting from an analytical standpoint. Nonetheless, we wanted to start off with variables that were objectively measured as much as possible.
Results
Profiles comparison of sexual violence victim-survivors in CSEW and RC
Before conducting our look-alike exercises, we compared the profiles of sexual violence victim-survivors in CSEW and RCEW datasets (Table 1). The table shows some meaningful differences between the individuals pertaining to each dataset. Particularly, only 32% of sexual violence victim-survivors in the CSEW had been victims of rape, compared to 71% in the RCEW data. Relationship to the perpetrator was more likely to be domestic in the RCEW data compared to the CSEW (48% vs 25%, respectively) and perpetrators were far more likely to be strangers or to be unknown in the CSEW (42%), compared to only 12% of records in the RCEW dataset.
Furthermore, the CSEW recorded a physical injury in 39% of incidents, while this appeared in only 6% of cases in the RCEW dataset, which might reflect the different measurements of physical health impact between these two data sources. Finally, there are some differences in relationship status and employment status, with more single or widowed people in RCEW data and more separated or divorced people in CSEW, and more unemployed people and students in RCEW when compared to CSEW.
Look-alike empirical application
Our first empirical application exercise pretended the variable age was missing from the combined dataset. Thus, we stipulated our comparison vector (C) as:
Table 2 presents the results of a linear regression (OLS), looking at the associations between age as the dependent variable and the independent variables for dataset A (RCEW), dataset B (CSEW) and the complete combined dataset, imputing age based on our proposed approach. When comparing the associations with age between the original datasets and the imputed synthetic dataset based on the variation observed in B, the results show that the effect sizes and direction for the imputed data reflect the results from the dataset used as the basis for imputation. For example, the type of SV was not associated with age in the original RCEW, but was in the CSEW. The imputed synthetic dataset reflects the CSEW dataset in that those who were victim-survivors of rape were younger on average. Conversely, while the perpetrator being an acquaintance compared to domestic was associated with younger people in RCEW, this was not the case for the CSEW, where no significant association was found, which was also the case in the imputed synthetic dataset. One coefficient was significantly related to age in both datasets, but not in the imputed version (stranger/unknown perpetrator). For all independent variables/controls, the standard errors were similar between the CSEW and the imputed synthetic dataset, which additional testing indicates is due to two opposing mechanisms that (partially) cancel each other out. That is, on the one hand, imputation may result in larger standard errors due to the uncertainty around the imputation; on the other hand, the bigger sample size of the imputation sample leads to smaller standard errors.
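The two opposing mechanisms can be made concrete with Rubin's rules, the conventional way of pooling estimates across imputed datasets. The text does not state which pooling rule was used, so this is an assumption. The total variance adds a between-imputation term (imputation uncertainty) to the average within-imputation variance (which shrinks as sample size grows). The function name is illustrative.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset
    variances: its squared standard error in each imputed dataset
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5
```

A larger imputed sample lowers the within term, while disagreement between imputations raises the between term, so the pooled standard error can end up close to that of the smaller donor dataset.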
We then tested the approach on a binary variable, gender. For this, we stipulated that the comparison vector was specified as:
Table 3 shows the results of logistic regressions looking at the associations between gender as a dependent variable and the independent variables for dataset A (RCEW), dataset B (CSEW) and the complete combined synthetic dataset. Similarly to what we saw in our analyses of age, the imputed dataset mimics the associations from the CSEW dataset. For example, men were less likely to experience rape than women, while stranger perpetrators were more strongly associated with male than female victim-survivors. Two important things stand out: acquaintance (compared to domestic relationship) perpetrator was not associated with gender, nor was ‘other’ housing tenure (compared to homeowners) in the CSEW, but these do become significant in the imputed dataset. The latter is most likely due to the far higher prevalence of ‘other’ housing tenure in the RCEW dataset, making it more likely to reach statistical significance, while the former is likely due to the larger sample size of the imputed dataset compared to the original CSEW dataset.
We considered the consistencies across Tables 2 and 3 as an indication that our proposed approach works for variables that are recorded similarly in the two datasets. We then extended our testing to an outcome measure which is not similarly recorded in CSEW and RCEW: health impact. For this, we specified the comparison vector as:
Table 4 shows the results of logistic regressions looking at the associations between health impact as the dependent variable and the independent variables for dataset A (RCEW), dataset B (CSEW) and the complete combined synthetic dataset. We acknowledge that this is a more meaningful regression specification than the previous two specified in the paper. However, since our approach is novel, we wanted to ensure that the approach worked for similarly recorded variables before testing for differently recorded ones. The results show the same as for the previous models, namely that the imputed dataset reflects the associations from the CSEW dataset in magnitude, direction, and significance. This includes an association that was positive in the original RCEW dataset (dataset A) but negative in CSEW (dataset B), and thus is also negative in the imputed dataset, namely, whether the perpetrator was a stranger or unknown. In these models, the coefficients for single/widowed stand out, given the synthetic dataset presents a very similar association to that found in the RCEW dataset. This is again likely a result of a much higher prevalence of this group in the RCEW (and therefore in the synthetic dataset).
Finally, in order to achieve our goal of combining data in a real-life application and producing a complete integrated dataset, we imputed a variable that only appears in CSEW and is, therefore, completely missing in RCEW: frequency of abuse. In this case, the comparison vector is:
The analyses estimating the number of sexual violence incidents or repetitions based on CSEW data reveal that rape (compared to other sexual violence) and incidents by acquaintances or strangers (compared to domestic perpetrators) are less likely to be repeated, and if repeated they are repeated fewer times. The imputed synthetic dataset reflects these associations. On the other hand, whilst in the CSEW there were significant negative associations between the number of sexual violence incidents and being single/widowed (versus married or cohabiting) and being non-White (compared to White), these associations did not reach statistical significance in the imputed dataset. Lastly, while students did not have a higher number of sexual violence incidents compared to employed people in the CSEW, in the imputed synthetic dataset this was the case. This change in significance is likely due to the larger proportion of students in the RCEW (and therefore in the synthetic dataset).
Discussion
There are several implications from our proposed approach to combining data, based on look-alike principles, using multiple imputation methods. First, the initial distribution may be different between datasets, as was the case in the RCEW and CSEW, and while this does not appear to prevent meaningful analyses in the synthetic dataset, the sample sizes are important both in defining what dataset ultimately provides the basis for the synthetic dataset and also in interpreting some of the meaningful associations found. In general, in our proposed exercise, the associations mimic those of the CSEW (smaller sample size), which was used as the basis for imputation. However, where the prevalence of a certain group was much larger in the RCEW (larger sample size), this group was also larger in the synthetic version, meaning that there was an increased chance of significance. In order to test the robustness of our approach, we swapped datasets A and B, that is, we tested imputing data from the RCEW into the CSEW. This led to a synthetic dataset that was the size of the CSEW (1,232). While we found the same general findings, i.e. that magnitude and direction of effect sizes in the synthetic dataset mimicked those of the RCEW (used for imputation instead), standard errors were in general larger, meaning results were less likely to reach significance. This reinforces the importance of sample sizes (both in the imputing and in the imputed datasets).
A strength of the proposed method is that it enables the combining of data on different individuals based on similar characteristics, meaning that working with pseudonymised data is possible. This is relevant to any area of research where there are concerns around data-sharing, not only violence. Furthermore, our analyses have shown that results are fairly consistent regardless of the type of modelling used (OLS, logistic or negative binomial). Integrated survey and administrative data can strengthen study designs by providing more complete information on similar profiles, lessening response burden on participants, or by serving as a source of triangulated data [33].
The approach outlined involved a trade-off between the standardisation of variables required for imputation and the detail about individuals and experiences that is valued in research on violence. The need to standardise variables used for imputation meant that more nuanced understanding of experiences was lost. In our analyses, this was particularly relevant in terms of health impact. While our final coding only allowed for the inclusion of a binary, there is a wide literature on the impacts of sexual violence on physical health[34–36], and some of the final categories in the variables we used were much more aggregate than we would have liked. This was also the case for ethnicity, precluding analyses using an intersectional approach. Furthermore, we did not consider time (i.e. time of experience of sexual violence) as a variable in the comparison vector due to limited sample sizes, but we acknowledge that the understanding of experiences of violence varies over time, so ideally time should be a comparison-vector variable.
Our proposed data integration approach should be particularly useful for costing or burden of disease type of analyses, including calculating the societal burden of violence, given it enables taking a micro-costing approach, which produces more precise estimates [37]. Nonetheless, further applications, in particular to evaluate interventions, need further testing. Analyses using a longitudinal design are certainly not feasible if time is not used as a comparison-vector variable.
Similarly to all applications of multiple imputation, there are assumptions around the patterns of data missingness. While MI assumes data missing at random (MAR) or missing completely at random (MCAR), when using our approach to impute a variable that only appears in one dataset, there is a normative assumption that the synthetic dataset follows the same distribution (and the same pattern of missingness) as the dataset used for imputation.
Finally, there are numerous practice and policy implications for researchers, voluntary sector partner organisations, and the general population. Compared to traditional research, our proposed approach to data integration offers a cost-effective solution to breaking (data-related) silos in research. Further research should not only test different approaches to data integration, but also applications to evaluations by mutually engaging practitioners, policymakers, and researchers to foster a culture of research [33, 38] facilitating the refinement of techniques as well as producing real-world evidence based on integrated synthetic data.
Conclusion
This study has demonstrated that data integration between a survey (CSEW) and administrative records (RCEW) is possible using look-alike modelling principles and multiple imputation by chained equations. Our results serve as a proof of concept, and the associations in the resulting synthetic dataset tend to mimic the dataset used for imputation in magnitude and direction. The regression results in the synthetic dataset also tend to yield larger standard errors, resulting in wider confidence intervals. This approach should be applicable for costing exercises as it permits micro-costing. Further applications of the approach should be the focus of future research.
Footnotes
1 We acknowledge that female/male are correct categories for sex, not necessarily gender, but we used the categories as asked by the CSEW as proxies for women/men.