Establishing and characterising large COVID-19 cohorts after mapping the Information System for Research in Primary Care in Catalonia to the OMOP Common Data Model
===================================================================================================================================================================

* Edward Burn
* Sergio Fernández-Bertolín
* Erica A Voss
* Clair Blacketer
* Maria Aragón
* Martina Recalde
* Elena Roel
* Andrea Pistillo
* Berta Raventós
* Carlen Reyes
* Sebastiaan van Sandijk
* Lars Halvorsen
* Peter R Rijnbeek
* Talita Duarte-Salles

## Abstract

**Background** Few datasets have been established that capture the full breadth of COVID-19 patient interactions with a health system. Our first objective was to create a COVID-19 dataset that linked primary care data to COVID-19 testing, hospitalisation, and mortality data at a patient level. Our second objective was to provide a descriptive analysis of COVID-19 outcomes among the general population and describe the characteristics of the affected individuals.

**Methods** We mapped patient-level data from Catalonia, Spain, to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). More than 3,000 data quality checks were performed to assess the readiness of the database for research. Subsequently, to summarise the COVID-19 population captured, we established a general population cohort as of the 1st March 2020 and identified outpatient COVID-19 diagnoses or positive test results for SARS-CoV-2, hospitalisations with COVID-19, and COVID-19 deaths during follow-up, which went up until 30th June 2021.

**Findings** Mapping data to the OMOP CDM was performed and high data quality was observed. The mapped database was used to identify a total of 5,870,274 individuals, who were included in the general population cohort as of 1st March 2020.
Over follow-up, 604,472 had either an outpatient COVID-19 diagnosis or positive test result, 58,991 had a hospitalisation with COVID-19, 5,642 had an ICU admission with COVID-19, and 11,233 had a COVID-19 death. People who were hospitalised or died were more commonly older and male, and had more comorbidities. Those admitted to ICU with COVID-19 were generally younger and more often male than those hospitalised in general and those who died.

**Interpretation** We have established a comprehensive dataset that captures COVID-19 diagnoses, test results, hospitalisations, and deaths in Catalonia, Spain. Extensive data checks have shown the data to be fit for use. From this dataset, a general population cohort of 5.9 million individuals was identified and their COVID-19 outcomes over time were described.

**Funding** Generalitat de Catalunya and European Health Data and Evidence Network (EHDEN).

Key Words

* COVID-19
* common data model
* OMOP
* electronic health records
* primary care

## Introduction

Spain has been one of the European countries hit hardest by the ongoing Coronavirus disease 2019 (COVID-19) pandemic. The first COVID-19 cases in Spain were identified in late February 2020, and by the 1st May of that year there had been more than 32,000 COVID-19 deaths in the country.1 Further waves of infections have since followed and, although the advent of effective vaccines has dramatically improved the outlook, COVID-19 cases continue to accrue and the effects of the disease for many of the people previously infected are likely to be long-lasting.

As in many European countries, Spain has a universal coverage healthcare system with general practitioners (GPs) acting as the gatekeepers to care.2 This role has largely been maintained during the COVID-19 pandemic.
In particular, clinical diagnoses by primary care professionals played an important role during the first wave of COVID-19 in Spain, with the use of SARS-CoV-2 testing initially restricted to the most severe cases, such as those hospitalised, and to groups considered to be at particularly high risk, such as care home residents.3 In such a context, primary care records can provide an important foundation for COVID-19 research, particularly when linkage to testing, hospitalisation, and mortality data is available.

The Information System for Research in Primary Care (SIDIAP; [www.sidiap.org](http://www.sidiap.org)) is a primary care records database covering approximately 80% of the population of Catalonia. The provenance of the data has been well documented, and the population captured has been found to be representative in terms of geography, age, and sex.4 Data from SIDIAP has previously been used in a wide range of epidemiological research studies, including COVID-19 related research.5–8 Individual-level linkage of hospital and regional mortality data has previously been established for SIDIAP, and now further linkage to SARS-CoV-2 test results is also possible.

Our first aim was to establish a SIDIAP COVID-19 dataset that could be used to generate reliable evidence relating to the pandemic. To facilitate international collaboration, we set out to conform to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM).9 Our second aim was to summarise the occurrence of COVID-19 outcomes observed and describe the characteristics of those affected.

## Methods

### Overview

Primary care data collected in SIDIAP between 1st January 2006 (when data became available) and 30th June 2021 (the last available date of data collection) was linked, at a patient-level, to COVID-19 testing, hospitalisation, and mortality data. This data was mapped to the OMOP CDM following an extract, transform, and load process which we first describe below.
Using this mapped data, a cohort of the general population was followed up from 1st March 2020, with COVID-19 outcomes (outpatient COVID-19 diagnoses and positive tests, hospitalisations with COVID-19, and COVID-19 deaths) observed over follow-up until the 30th June 2021.

### Mapping to the OMOP CDM: Extract, Transform, and Load

The extract, transform, and load (ETL) process was based on the approach put forward by the Observational Health Data Sciences and Informatics (OHDSI) community, which involves four distinct steps: 1) designing the ETL, 2) creating the code mappings, 3) implementing the ETL, and 4) quality control to assess whether the database was fit for use. Any issues identified during quality control are addressed by updating the ETL where possible.10

#### Designing the ETL

The OHDSI White Rabbit tool was used to scan and characterise the source data.11 Based on this, a design was created using the Rabbit-in-a-Hat tool in which source data tables were mapped to the OMOP CDM person, observation period, visit occurrence, condition occurrence, procedure occurrence, drug exposure, measurement, observation, and death tables (see Supplementary Figure 1). The derived drug and condition era tables were also created.

#### Creating the code mappings

Mapping to the OMOP CDM requires mapping source terminology to standard vocabularies in the OMOP Vocabularies.11 Examples of code mappings are given in Table 1. SNOMED, for example, is a standard vocabulary for conditions, while RxNorm is a standard vocabulary for drug exposures.

[Table 1.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/T1) Example of mappings implemented in SIDIAP to the OMOP CDM

COVID-19 diagnoses and patient comorbidities could first be identified from the *diagnostics* (diagnoses) source table, which contains ICD-10 codes recorded during primary care interactions.
An example of a COVID-19 diagnosis code mapping for this table was from the ICD-10 code B34.2 (“Coronavirus infection, unspecified site”) to the OMOP concept id 439676 (“Coronavirus infection”), while an example of a COVID-19 symptom code is the mapping of the ICD-10 code R06.02 (“Shortness of breath”) to the concept id 312437 (“Dyspnea”).

Additional data specifically collected for people with COVID-19 was captured in the *covid formulari seguiment* (COVID monitoring) table. These data were not coded in the source table; instead, the table contained columns that each represented a concept specific to COVID-19 (with a value of 1 entered if it was present). As an example, a value of 1 in the column with the name *interpretacio COVID-19 confirmada greu* (COVID-19 – confirmed, serious) was mapped to the concept id 37311061 (“Disease caused by severe acute respiratory syndrome coronavirus 2”). Likewise, a value of 1 in the column “anosmia” was mapped to the concept id 4185711 (“Loss of sense of smell”).

Prescriptions of medications were identified from the source table *prescripcio* (prescription), with this information mapped to the drug exposure table of the OMOP CDM. An intermediate table was used to map each national drug code from AEMPS (*Agencia Española de Medicamentos y Productos Sanitarios*) in the source table to the best corresponding standard concept id in the OMOP CDM drug exposure table.

SARS-CoV-2 test results were identified from a new source table, linked to SIDIAP patient data at the individual-level. These results were mapped to the measurement table in the OMOP CDM.
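
The diagnosis and monitoring-table mappings described above can be sketched in a few lines. This is a minimal illustration only: the concept ids are those quoted in the text, but the dictionary-based lookup, the function names, and the use of concept id 0 for unmapped codes are assumptions made for the example and are not the SQL used in the actual ETL.

```python
from typing import Dict, List

# ICD-10 source codes -> standard concept ids (values quoted in the text).
ICD10_TO_CONCEPT: Dict[str, int] = {
    "B34.2": 439676,   # "Coronavirus infection"
    "R06.02": 312437,  # "Dyspnea"
}

# Flag columns of the COVID monitoring table -> concept ids (quoted in the text).
FLAG_COLUMN_TO_CONCEPT: Dict[str, int] = {
    "interpretacio COVID-19 confirmada greu": 37311061,
    "anosmia": 4185711,
}


def map_diagnosis(icd10_code: str) -> int:
    """Return the standard concept id for a source code, or 0 if unmapped.

    Returning 0 for unmapped codes is an assumption of this sketch.
    """
    return ICD10_TO_CONCEPT.get(icd10_code, 0)


def map_monitoring_row(row: Dict[str, int]) -> List[int]:
    """Return one concept id for every flag column whose value is 1."""
    return [
        concept_id
        for column, concept_id in FLAG_COLUMN_TO_CONCEPT.items()
        if row.get(column) == 1
    ]
```

For instance, `map_monitoring_row({"anosmia": 1})` yields a single condition concept, while a row with no flags set yields an empty list and so creates no records.
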
Each polymerase chain reaction (PCR) test record in the SARS-CoV-2 test source table was mapped to a measurement concept id of 586310 (“Measurement of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Genetic material using Molecular method”), while antigen tests were mapped to 37310257 (“Measurement of Severe acute respiratory syndrome coronavirus 2 antigen”) and antibody tests to 37310258 (“Measurement of Severe acute respiratory syndrome coronavirus 2 antibody”). If the result was coded as *Positiu* (positive) in the source table, the value as concept id was then set to the concept id 45884084 (“Positive”).

Lastly, hospitalisation data, from the *conjunt mínim bàsic de dades de l’alta hospitalària* (minimum basic set of hospital discharge data) collated by the Data Analysis Program for Health Research and Innovation (PADRIS) in Catalonia, was also linked at the individual-level. This dataset included both diagnoses and procedures registered during hospital admissions for all public and private hospitals in Catalonia. ICD-10 codes used to register diagnoses at hospitals were mapped to the OMOP CDM, as was done with the *diagnostics* table, with the procedure occurrence table in the CDM also populated.

#### Implementing the ETL

The local SIDIAP tables are in a MariaDB database, with the OMOP CDM v5.3.1 tables stored in a PostgreSQL database. Docker containers were used to host the full environment. SQL code was used for implementing the mapping in the MariaDB environment, with GitHub used for version control. After completing the mapping, CDM tables were migrated to their final PostgreSQL environment. A schematic of the ETL framework used is provided in Supplementary Figure 2.

#### Quality control

A range of database constraints defined by OHDSI were created in the PostgreSQL database to prevent errors such as duplicate rows or unmatched ids across interrelated tables.
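
To illustrate what such constraints rule out, the sketch below checks two small in-memory tables for duplicate primary keys and for foreign keys with no matching parent row. In the study these rules were enforced by the database itself; the Python functions, their names, and the toy rows here are hypothetical and serve only to show the two kinds of error.

```python
from collections import Counter
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def duplicate_keys(rows: List[Row], key: str) -> Set[Any]:
    """Return key values that appear more than once (primary key violation)."""
    counts = Counter(row[key] for row in rows)
    return {value for value, n in counts.items() if n > 1}


def orphan_keys(child: List[Row], fk: str, parent: List[Row], pk: str) -> Set[Any]:
    """Return foreign key values in `child` with no matching row in `parent`."""
    parent_ids = {row[pk] for row in parent}
    return {row[fk] for row in child if row[fk] not in parent_ids}


# Toy rows (hypothetical): person 2 is duplicated, and one measurement
# refers to a person_id that does not exist in the person table.
person = [{"person_id": 1}, {"person_id": 2}, {"person_id": 2}]
measurement = [
    {"measurement_id": 10, "person_id": 1},
    {"measurement_id": 11, "person_id": 3},
]

duplicated = duplicate_keys(person, "person_id")
orphans = orphan_keys(measurement, "person_id", person, "person_id")
```

A database-level primary key or foreign key constraint would reject the offending rows at load time rather than report them afterwards, which is the stricter behaviour described in the text.
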
Data quality was also considered systematically using the Data Quality Dashboard (DQD) tool.11,12 This tool was run on the data after conversion to the OMOP CDM to test how well the resulting CDM instance complied with OHDSI standards. DQD runs a series of data quality checks that measure a database’s conformance to the model specifications, completeness of mapping to Standard Concepts, and plausibility of a select set of values. At a high level, the tool works by applying data quality check types to applicable tables and fields in the CDM. The results of DQD were used to evaluate whether the database was fit for use. In total, 3,486 data quality checks were performed. Any failed checks were reviewed, and the ETL was updated to address them where necessary.

### Summarising the occurrence of COVID-19 outcomes and describing the characteristics of those affected

#### Study population and follow-up

Individuals present in SIDIAP as of 1st March 2020 were identified as the study population. Any individuals who had a clinical diagnosis or positive test result for SARS-CoV-2 between the 1st January and 29th February 2020 were excluded, as were any individuals in hospital on 1st March 2020. These two exclusions were applied to ensure that the cohort identified from SIDIAP was representative of the general population at risk of subsequent incident COVID-19. Follow-up for this cohort began on 1st March 2020 (the index date for all individuals) and ended on 30th June 2021 (the last available date of data collection).

#### COVID-19 outcome cohorts

Four COVID-19 related outcomes were considered: outpatient COVID-19, hospitalised with COVID-19, ICU admission with COVID-19, and died with COVID-19. These outcome cohorts were not mutually exclusive.
An outpatient diagnosis of COVID-19 was identified on the basis of a compatible clinical code (with a broad definition used) or positive SARS-CoV-2 test (antigen or PCR), with no hospital admission with COVID-19 observed prior to or on the same day as this diagnosis. Hospitalisation with COVID-19 was identified as a hospital admission where the individual had a compatible COVID-19 clinical code or positive SARS-CoV-2 test between 21 days before and three days after admission. ICU admission during hospitalisation was identified in a similar manner. A COVID-19 death was defined as a death where an individual had a compatible COVID-19 clinical code or positive SARS-CoV-2 test recorded in the 28 days preceding their death, with deaths identified from regional registers linked at the person-level.

To assess the impact of using alternative definitions for outpatient diagnosis of COVID-19, we explored four further definitions: PCR positive test, PCR or antigen positive test, COVID-19 diagnosis (narrow definition), and COVID-19 diagnosis (broad definition). While the broad definition for diagnoses included codes such as “Coronavirus infection” and “Suspected COVID-19”, the narrow definition only included codes specific to COVID-19, such as “Disease caused by 2019 novel coronavirus”.

#### Variables

The age, as of 1st March 2020 (the index date for all individuals), and sex of study participants were extracted. Using their most recent observation, individuals’ body mass index (BMI) and smoking status (classified as never smoker, former smoker, or current smoker) were also obtained. Individuals’ comorbidities and medication use were also summarised relative to the 1st March 2020. The comorbidities included were autoimmune diseases, asthma, malignant neoplastic disease, diabetes mellitus, heart disease, hypertensive disorder, renal impairment, chronic obstructive lung disease (COPD), and dementia.
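
The time windows used in the hospitalisation and death definitions earlier in this section can be illustrated with a short sketch. The function names, the representation of COVID-19 evidence as a list of dates, and the treatment of both window boundaries as inclusive are assumptions made for illustration; they are not taken from the study's own phenotype definitions.

```python
from datetime import date, timedelta
from typing import Iterable


def hospitalised_with_covid(admission: date, covid_events: Iterable[date]) -> bool:
    """True if any COVID-19 code or positive test falls between 21 days
    before and 3 days after the admission date (boundaries assumed inclusive)."""
    start = admission - timedelta(days=21)
    end = admission + timedelta(days=3)
    return any(start <= event <= end for event in covid_events)


def covid_death(death: date, covid_events: Iterable[date]) -> bool:
    """True if any COVID-19 code or positive test falls in the 28 days
    up to and including the date of death (boundaries assumed inclusive)."""
    start = death - timedelta(days=28)
    return any(start <= event <= death for event in covid_events)
```

Under these assumptions, a positive test on 20th March 2020 would qualify an admission on 10th April 2020 (exactly 21 days later), whereas a test on 19th March would not.
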
These health conditions were identified based on an individual’s entire observed history prior to the index date. In addition, for those individuals in the outpatient diagnosis of COVID-19 cohort, symptoms recorded from two days before to two days after the index date were also identified. The following symptoms were considered: cough; dyspnea; diarrhea; headache; fever; any of anosmia, hyposmia, or dysgeusia; either malaise or fatigue; and pain.

#### Descriptive analysis

The characteristics of the study population as a whole and each of the COVID-19 outcome cohorts were summarised, with counts and percentages for categorical variables and medians and interquartile ranges (IQR) for continuous variables. Cohort entry over time is plotted for the study cohorts. The proportion of persons in the outpatient COVID-19 cohort with a symptom of interest is summarised, stratified by calendar month.

All analysis code (including phenotype definitions) used in the study has been made publicly available at: [https://github.com/SIDIAP/CovidCdmSummary](https://github.com/SIDIAP/CovidCdmSummary)

## Results

### Mapping to the OMOP CDM

The SIDIAP CDM contained information on 8,022,374 unique individuals. These people had 228,003,476 records in the condition occurrence table, 542,711,011 records in the drug exposure table, and 1,317,159,817 records in the measurement table. Of the 3,486 data quality checks run against the database, 3,456 passed (99%). The remaining 30 checks that failed were considered not to be pertinent, and each of these is summarised in Supplementary Table 1.

### COVID-19 outcomes

A total of 5,870,274 individuals were included in the general population cohort of people alive in the database as of 1st March 2020. Over observed follow-up, 604,472 had either an outpatient COVID-19 diagnosis or positive test result, 58,991 were hospitalised with COVID-19, 5,642 had an ICU admission with COVID-19, and 11,233 had a COVID-19 death.
The distribution of age for each study cohort, stratified by sex, is shown in Figure 1. The median age of the general population study cohort was 43 years (IQR: 25 to 59), with 50.7% female. Median age was higher among outcome cohorts, most notably among those with a COVID-19 death, who had a median age of 85 (78 to 90). Patients admitted to ICU, however, were younger than those admitted to hospital in general (63 [53 to 71] compared to 65 [51 to 78]). While the outpatient COVID-19 cohort was majority female (54%), those hospitalised were more typically male (55%), and those with a COVID-19 death were close to equally distributed by sex (49.1% were female). Patients admitted to ICU were, however, majority male (67%). Comorbidities were generally more common among those with a COVID-19 outcome compared to the general population (see Table 2). For example, the prevalences of diabetes and hypertension were 23% and 45% among those hospitalised with COVID-19, compared to 7% and 17% in the general population.

[Table 2.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/T2) Characteristics of the SIDIAP COVID-19 study cohorts

![Figure 1.](https://www.medrxiv.org/content/medrxiv/early/2021/11/24/2021.11.23.21266734/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/F1) Histogram of age, stratified by sex, for the general population and each COVID-19 outcome cohort

Cohort entry over calendar time, stratified by age, is shown in Figure 2. The various waves of COVID-19 can be seen, along with the much greater number of COVID-19 hospitalisations, ICU admissions, and deaths among the older age groups. The highest number of outpatient COVID-19 cases did, however, occur among the younger age group.
![Figure 2.](https://www.medrxiv.org/content/medrxiv/early/2021/11/24/2021.11.23.21266734/F2.medium.gif)

[Figure 2.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/F2) COVID-19 outcome cohort entry over calendar time, stratified by age group

Capture of COVID-19 symptoms over calendar time is shown in Figure 3, stratified by age group in Supplementary Figure 3, and stratified by sex in Supplementary Figure 4. Cough and fever were the most common symptoms identified, but all symptoms had a prevalence of less than 15%, with substantial changes over time.

![Figure 3.](https://www.medrxiv.org/content/medrxiv/early/2021/11/24/2021.11.23.21266734/F3.medium.gif)

[Figure 3.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/F3) Symptoms recorded at time of outpatient COVID-19 diagnosis or positive test

The impact of different definitions for outpatient COVID-19 is shown in Figure 4, where cohort entry over calendar time is depicted. While from September 2020 definitions were generally in accordance, the first wave of COVID-19 in Catalonia was only identified when including the broad COVID-19 diagnosis definition.

![Figure 4.](https://www.medrxiv.org/content/medrxiv/early/2021/11/24/2021.11.23.21266734/F4.medium.gif)

[Figure 4.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/F4) Outpatient COVID-19 cohort entry over calendar time

## Discussion

In this study we have described the development of a wide-ranging dataset for COVID-19. With more than 3,000 data quality checks performed, the resulting database can be considered fit for use to inform appropriate research questions. Moreover, the use of the OMOP CDM will facilitate its use for both single database studies and distributed network research.
Demonstrating the breadth of data captured, a descriptive analysis of various COVID-19 outcomes among the general population has been performed, providing a broad overview of COVID-19 in Catalonia and the characteristics of the individuals affected.

Numerous health care databases, including both claims and electronic health care data, have previously been mapped to the OMOP CDM.13,14 The European Health Data and Evidence Network (EHDEN, [www.ehden.eu](http://www.ehden.eu)) is building a large, federated network of data sources mapped to the OMOP CDM and SIDIAP is an EHDEN data partner. The mapping of SIDIAP data to the OMOP CDM has already facilitated its use in several international network studies on COVID-19. Examples include a characterisation study of patients hospitalised with COVID-19,15 a study to develop a prediction model for hospitalisation and mortality among patients diagnosed with COVID-19,16 and another which assessed the impact of angiotensin converting enzyme inhibitors and angiotensin receptor blockers on the risk of COVID-19.17 In other studies based solely on SIDIAP data mapped to the OMOP CDM, multi-state modelling was used to describe COVID-19 patient trajectories in Catalonia and the impact of cancer and obesity on these trajectories.18–20 The impact of the pandemic on trends in diagnoses of anxiety and depression has also been studied.21

In our descriptive analysis we found that individuals with a COVID-19 outcome were typically older and had more comorbidities than the general population. This was particularly pronounced for the most severe outcomes studied. This is in concordance with research to date, with numerous studies finding older age to be associated with worse outcomes in COVID-19.22–25 While those with an outpatient COVID-19 diagnosis or positive test were more often female in our data, those hospitalised were majority male, as were 67% of those admitted to ICU.
People who died with COVID-19 were almost equally distributed by sex. Previous research studies have reported mixed results for diagnoses and positive tests. For example, two studies from the UK reported a higher risk of testing positive for SARS-CoV-2 among men,26,27 while a study from China found a higher attack rate among women.28 A range of studies have, however, previously found males to be at an increased risk of the severest outcomes.23,24,29

The importance of appropriate phenotyping when using routinely collected data is also demonstrated when comparing alternative definitions of an outpatient COVID-19 case in our data. A definition that relied solely on testing for SARS-CoV-2 or on a narrow set of diagnosis codes would have missed many of the COVID-19 cases from the first wave, a time when testing was not widely available and medical vocabularies had not yet introduced COVID-19 specific codes.

### Strengths and limitations

Much of the COVID-19 literature is based on studies where study populations have been drawn from people hospitalised with COVID-19, tested for infection, or who volunteered to participate in a study. Such studies can be subject to a number of biases, in particular collider bias, which can induce associations that do not exist in the general population or attenuate, inflate, or reverse the sign of true associations.30 This underscores the importance of developing comprehensive datasets to generate the reliable evidence required to inform decision-making related to the pandemic. With more than half a million outpatient cases of COVID-19 captured and a breadth of data capture that allows for comparisons with the general population and for subsequent hospital care to be described, the mapped SIDIAP database described here is one such resource.

While electronic health record data brings numerous opportunities, the data are collected for non-research purposes and so careful curation is required.
Using a well-established common data model meant that existing open-source tools could be used to evaluate data quality and that research studies can be run in a distributed manner. This has already allowed the database to be used in a number of international network research studies, with standardised analytic packages and only aggregated result sets shared.

One limitation of the dataset is the likely underreporting of COVID-19 symptoms. The estimates drawn from this database are much lower than those reported in studies informed by self-reported patient data.31,32 Capture of symptoms would likely be improved if free text data recorded during primary care visits was also mapped to the OMOP CDM. However, underreporting of symptoms also likely reflects the nature of electronic health care records, which are not designed for specific research questions. Other limitations include the lack of hospital prescribing and lab data, while SARS-CoV-2 variants and contact tracing are also not captured. Vaccination records are available, and linkage to these is ongoing.

## Conclusion

We have established a wide-ranging COVID-19 dataset that captures COVID-19 diagnoses, SARS-CoV-2 test results, hospitalisations, and deaths in Catalonia, Spain. In this study we have summarised the creation of this dataset, described the COVID-19 outcomes observed, and summarised the characteristics of the individuals affected.

## Data Availability

In accordance with current European and national law, the data used in this study is only available for the researchers participating in this study. Thus, we are not allowed to distribute or make publicly available the data to other parties. However, researchers from public institutions can request data from SIDIAP if they comply with certain requirements.
Further information is available online ([https://www.sidiap.org/index.php/menu-solicitudesen/application-proccedure](https://www.sidiap.org/index.php/menu-solicitudesen/application-proccedure)) or by contacting SIDIAP ([sidiap@idiapjgol.org](mailto:sidiap@idiapjgol.org)).

## Ethics approval

This study was approved by the Clinical Research Ethics Committee of the IDIAPJGol (project code: 20/070-PCV).

## Competing interests

All authors have completed the ICMJE uniform disclosure form at [www.icmje.org/coi\_disclosure.pdf](http://www.icmje.org/coi_disclosure.pdf). EAV and CB are employees of Janssen Research and Development LLC and shareholders of Johnson & Johnson (J&J) stock.

## Funding and role of the funding source

This project is funded by the Health Department from the Generalitat de Catalunya with a grant for research projects on SARS-CoV-2 and COVID-19 disease organized by the Direcció General de Recerca i Innovació en Salut. This project has received support from the European Health Data and Evidence Network (EHDEN) project. EHDEN received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 806968. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## Acknowledgements

We would like to acknowledge the patients who suffered or died from this devastating disease, and these patients’ families and carers. We would also like to thank the healthcare professionals in the Catalan healthcare system involved in the management of COVID-19 during these challenging times, from primary care to intensive care units.

## Appendix

[Supplementary Table 1.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/T3) Failed data quality checks

![Supplementary Figure 1.](https://www.medrxiv.org/content/medrxiv/early/2021/11/24/2021.11.23.21266734/F5.medium.gif)

[Supplementary Figure 1.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/F5) Source tables to OMOP-CDM tables conversion diagram

![Supplementary Figure 2.](https://www.medrxiv.org/content/medrxiv/early/2021/11/24/2021.11.23.21266734/F6.medium.gif)

[Supplementary Figure 2.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/F6) Extract, transform, and load (ETL) framework

![Supplementary Figure 3.](https://www.medrxiv.org/content/medrxiv/early/2021/11/24/2021.11.23.21266734/F7.medium.gif)

[Supplementary Figure 3.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/F7) Symptoms recorded at time of outpatient COVID-19 diagnosis or positive test, stratified by age group

![Supplementary Figure 4.](https://www.medrxiv.org/content/medrxiv/early/2021/11/24/2021.11.23.21266734/F8.medium.gif)

[Supplementary Figure 4.](http://medrxiv.org/content/early/2021/11/24/2021.11.23.21266734/F8)
Symptoms recorded at time of outpatient COVID-19 diagnosis or positive test, stratified by sex

## Footnotes

* \* joint first-authors
* Received November 23, 2021.
* Revision received November 23, 2021.
* Accepted November 24, 2021.
* © 2021, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), CC BY-NC 4.0, as described at [http://creativecommons.org/licenses/by-nc/4.0/](http://creativecommons.org/licenses/by-nc/4.0/)

## References

1. Roser M, Ortiz-Ospina E. “Spain: What is the daily number of confirmed deaths?” Published online at OurWorldInData.org.
2. Gérvas J, Ferna MP, Starfield BH. Primary Care, Financing and Gatekeeping in Western Europe. Family Practice. 1994;11(3):307–317. doi:10.1093/fampra/11.3.307
3. Borras-Bermejo B, Martínez-Gómez X, Gutierrez-San Miguel M, Esperalba J, Antón A, Martin E, et al. Asymptomatic SARS-CoV-2 infection in nursing homes, Barcelona, Spain, April 2020. Emerg Infect Dis. Published online 2020. doi:10.3201/eid2609.202603
4. García-Gil MDM, Hermosilla E, Prieto-Alhambra D, et al. Construction and validation of a scoring system for the selection of high-quality data in a Spanish population primary care database (SIDIAP). Informatics in primary care. 2011;19(3):135–145.
5. Ramos R, Comas-Cufí M, Martí-Lluch R, et al. Statins for primary prevention of cardiovascular events and mortality in old and very old adults with and without type 2 diabetes: retrospective cohort study. BMJ. 2018;362:k3359. doi:10.1136/bmj.k3359
6. Alexander M, Loomis AK, van der Lei J, et al. Non-alcoholic fatty liver disease and risk of incident acute myocardial infarction and stroke: findings from matched cohort study of 18 million European adults. BMJ. 2019;367. doi:10.1136/bmj.l5367
7. Burn E, Murray DW, Hawker GA, Pinedo-Villanueva R, Prieto-Alhambra D. Lifetime risk of knee and hip replacement following a GP diagnosis of osteoarthritis: a real-world cohort study. Osteoarthritis and Cartilage. Published online June 2019. doi:10.1016/j.joca.2019.06.004
8. Prieto-Alhambra D, Balló E, Coma E, et al. Filling the gaps in the characterization of the clinical management of COVID-19: 30-day hospital admission and fatality rates in a cohort of 118 150 cases diagnosed in outpatient settings in Spain. International Journal of Epidemiology. Published online October 2020. doi:10.1093/ije/dyaa190
9. Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Studies in health technology and informatics. 2015;216:574–578.
10. Observational Health Data Sciences and Informatics. The Book of OHDSI. Independently published; 2019.
11. Observational Health Data Sciences and Informatics. The Book of OHDSI. Independently published; 2019.
12. Blacketer C, Defalco FJ, Ryan PB, Rijnbeek PR. Increasing Trust in Real-World Evidence Through Evaluation of Observational Data Quality. medRxiv. Published online January 1, 2021:2021.03.25.21254341. doi:10.1101/2021.03.25.21254341
13. Voss EA, Makadia R, Matcho A, et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J Am Med Inform Assoc. 2015;22:553–564. doi:10.1093/jamia/ocu023
14. Datta S, Posada J, Olson G, et al. A new paradigm for accelerating clinical data science at Stanford Medicine. arXiv. Published online 2020.
15. Burn E, You SC, Sena AG, et al. Deep phenotyping of 34,128 adult patients hospitalised with COVID-19 in an international network study. Nature Communications. 2020;11(1):5009. doi:10.1038/s41467-020-18849-z
16. Williams RD, Markus AF, Yang C, et al. Seek COVER: Development and validation of a personalized risk calculator for COVID-19 outcomes in an international network. medRxiv. Published online 2020. doi:10.1101/2020.05.26.20112649
17. Morales DR, Conover MM, You SC, et al. Renin-angiotensin system blockers and susceptibility to COVID-19: a multinational open science cohort study. medRxiv. Published online January 2020:2020.06.11.20125849. doi:10.1101/2020.06.11.20125849
18. Burn E, Tebe C, Fernandez-Bertolin S, et al. The natural history of symptomatic COVID-19 in Catalonia, Spain: a multi-state model including 109,367 outpatient diagnoses, 18,019 hospitalisations, and 5,585 COVID-19 deaths among 5,627,520 people. medRxiv.
Published online January 1, 2020:2020.07.13.20152454. doi:10.1101/2020.07.13.20152454 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoibWVkcnhpdiI7czo1OiJyZXNpZCI7czoyMToiMjAyMC4wNy4xMy4yMDE1MjQ1NHYxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMTEvMjQvMjAyMS4xMS4yMy4yMTI2NjczNC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 19. 19.Recalde M, Pistillo A, Fernandez-Bertolin S, et al. Body Mass Index and Risk of COVID-19 Diagnosis, Hospitalization, and Death: A Cohort Study of 2 524 926 Catalans. The Journal of Clinical Endocrinology & Metabolism. Published online July 23, 2021. doi:10.1210/clinem/dgab546 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1210/clinem/dgab546&link_type=DOI) 20. 20.Roel E, Pistillo A, Recalde M, et al. Cancer and the risk of COVID-19 diagnosis, hospitalisation, and death: a population-based multi-state cohort study including 4,618,377 adults in Catalonia, Spain. medRxiv. Published online January 1, 2021:2021.05.18.21257371. doi:10.1101/2021.05.18.21257371 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoibWVkcnhpdiI7czo1OiJyZXNpZCI7czoyMToiMjAyMS4wNS4xOC4yMTI1NzM3MXYxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMTEvMjQvMjAyMS4xMS4yMy4yMTI2NjczNC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 21. 21.Raventós B, Pistillo A, Reyes C, et al. The impact of the COVID-19 pandemic on diagnoses of common mental health disorders in adults in Catalonia, Spain. medRxiv. Published online January 1, 2021:2021.08.06.21261709. 
doi:10.1101/2021.08.06.21261709 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoibWVkcnhpdiI7czo1OiJyZXNpZCI7czoyMToiMjAyMS4wOC4wNi4yMTI2MTcwOXYxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMTEvMjQvMjAyMS4xMS4yMy4yMTI2NjczNC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 22. 22.Zhou F, Yu T, Du R, et al. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. The Lancet. 2020;395(10229):1054–1062. doi:10.1016/S0140-6736(20)30566-3 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0140-6736(20)30566-3&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32171076&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F11%2F24%2F2021.11.23.21266734.atom) 23. 23.Docherty AB, Harrison EM, Green CA, et al. Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study. BMJ. 2020;369. doi:10.1136/bmj.m1985 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNjkvbWF5MjJfMS9tMTk4NSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzExLzI0LzIwMjEuMTEuMjMuMjEyNjY3MzQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 24. 24.Gupta S, Hayek SS, Wang W, et al. Factors Associated With Death in Critically Ill Patients With Coronavirus Disease 2019 in the US. JAMA Internal Medicine. Published online July 15, 2020. 
doi:10.1001/jamainternmed.2020.3596 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jamainternmed.2020.3596&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32667668&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F11%2F24%2F2021.11.23.21266734.atom) 25. 25.Williamson EJ, Walker AJ, Bhaskaran K, et al. OpenSAFELY: factors associated with COVID-19 death in 17 million patients. Nature. Published online 2020. doi:10.1038/s41586-020-2521-4 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-020-2521-4&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32640463&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F11%2F24%2F2021.11.23.21266734.atom) 26. 26.de Lusignan S, Dorward J, Correa A, et al. Risk factors for SARS-CoV-2 among patients in the Oxford Royal College of General Practitioners Research and Surveillance Centre primary care network: a cross-sectional study. The Lancet Infectious Diseases. Published online July 8, 2020. doi:10.1016/S1473-3099(20)30371-6 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S1473-3099(20)30371-6&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F11%2F24%2F2021.11.23.21266734.atom) 27. 27.Ho FK, Celis-Morales CA, Gray SR, et al. Modifiable and non-modifiable risk factors for COVID-19: results from UK Biobank. medRxiv. Published online 2020. doi:10.1101/2020.04.28.20083295 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoibWVkcnhpdiI7czo1OiJyZXNpZCI7czoyMToiMjAyMC4wNC4yOC4yMDA4MzI5NXYxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMTEvMjQvMjAyMS4xMS4yMy4yMTI2NjczNC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 28. 28.Qian J, Zhao L, Ye R-Z, Li X-J, Liu Y-L. 
Age-dependent gender differences of COVID-19 in mainland China: comparative study. Clinical Infectious Diseases. Published online May 30, 2020. doi:10.1093/cid/ciaa683 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/cid/ciaa683&link_type=DOI) 29. 29.Petrilli CM, Jones SA, Yang J, et al. Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study. BMJ. 2020;369. doi:10.1136/bmj.m1966 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE4OiIzNjkvbWF5MjJfMTUvbTE5NjYiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8xMS8yNC8yMDIxLjExLjIzLjIxMjY2NzM0LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 30. 30.Griffith GJ, Morris TT, Tudball MJ, et al. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nature Communications. 2020;11(1):5749. doi:10.1038/s41467-020-19478-2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41467-020-19478-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F11%2F24%2F2021.11.23.21266734.atom) 31. 31.Grant MC, Geoghegan L, Arbyn M, et al. The prevalence of symptoms in 24,410 adults infected by the novel coronavirus (SARS-CoV-2; COVID-19): A systematic review and meta-analysis of 148 studies from 9 countries. PLOS ONE. 2020;15(6):e0234765. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0234765&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F11%2F24%2F2021.11.23.21266734.atom) 32. 32.Menni C, Valdes AM, Freidin MB, et al. Real-time tracking of self-reported symptoms to predict potential COVID-19. Nature Medicine. 
2020;26(7):1037–1040. doi:10.1038/s41591-020-0916-2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41591-020-0916-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32393804&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F11%2F24%2F2021.11.23.21266734.atom)