Abstract
There is justified optimism regarding the use of miRNAs as early detection biomarkers of cancer. They are well characterized and are involved in all the hallmarks of cancer. Less is known about the role of most other non-coding RNA (ncRNAs) classes in normal physiology and tumorigenesis. The JanusRNA dataset consist of circulating RNA profiles of pre-clinical samples from 1631 cancer patients and 673 cancer-free controls. We studied eight cancer types including cancer of the: lung, colon, rectum, prostate, breast, testis, ovaries and gallbladder. JanusRNA has its origin from the large population-based Janus Serum Bank Cohort which consists of 318 628 Norwegians. The dataset combines information from the complete nationwide cancer registry, RNA sequencing profiles from 1631 cancer patients and 673 cancer-free controls, as well as data on lifestyle, anthropometry and biochemical measurements from national health surveys. The Janus Serum Bank is specifically suited for studies of early detection and risk biomarkers of cancer, since samples are collected nationwide over a large time span, pre-clinically and cancer occurs at different points in time after blood draw. We used a nested case-control design, selecting both cases and controls among the Janus cohort members. We restricted our selection to cases with at least one sample collected within 10 years prior to cancer diagnosis. We selected 673 cancer-free Janus participants for comparison of RNA levels with the cancer cases. The controls were frequency matched to the case group on sex, age at blood donation and date of blood donation. The JanusRNA dataset has been used to investigate the natural variation of circulating RNAs in cancer-free individuals. This data resource was also used in a study of variation in RNA expression associated with common traits like age, sex, smoking, BMI and physical activity in cancer-free individuals. RNA dynamics in lung and testicular carcinogenesis throughout a 10-year follow-up has also been studied.
Data resource basics
There is justified optimism regarding the use of miRNAs as early detection biomarkers of cancer. They are well characterized and are involved in all the hallmarks of cancer (1). Less is known about the role of most other non-coding RNA (ncRNAs) classes in normal physiology and tumorigenesis. Several classes of ncRNA act as regulators of key cellular processes, many of which are associated with cancer (2). Circulating ncRNAs may improve cancer management in the future as minimally invasive biomarkers. Large prospective cohorts with harmonized genetic and phenotypic data are required to realize the potential of ncRNAs as early cancer biomarkers, but such datasets are rare (3).
In this resource profile, we present the circulating RNA dataset available within the large prospective Janus Serum Bank Cohort in Norway (Janus RNA). The resource was established to facilitate research on RNA dynamics in samples collected up to 10 years prior to cancer diagnosis, compared to cancer-free controls, and with the long-term objective of identifying early detection biomarkers of cancer. The data resource will be instrumental for a wide range of national and international cancer research projects in the future.
We studied eight cancer types. Five of these cancer types are of major public health concern in many countries: lung cancer (LC), colon cancer (CC), rectum cancer (REC), prostate (PC) and breast cancer (BC). In all of them early detection screening biomarkers could have big public health implications. We included testicular germ cell tumors (TGCT), the most common malignancy in young males in most Western countries. Further, we expanded the dataset with cancer of the ovaries (OC) and gallbladder (GBC) which are diseases with poor prognosis since they are often diagnosed at advanced stages.
Data collected
JanusRNA has its origin in the large population-based Janus Serum Bank Cohort which consists of 318 628 Norwegians. The serum bank is administered by the Cancer Registry of Norway (CRN) (4). The dataset combines information from the complete nationwide cancer registry (5), RNA sequencing profiles from 1631 cancer patients and 673 cancer-free controls, as well as data on lifestyle, anthropometry and biochemical measurements from national health surveys (6) (Figure 1). The Janus Serum Bank is specifically suited for studies of early detection and risk biomarkers, since samples were collected nationwide over a large time span, pre-clinically. Cancer has occurred at different points in time after the blood draw.
Study design and Sampling strategy
We used a nested case-control design, selecting both cases and controls among the Janus cohort members. This design offers logistic efficiency, is typically used for molecular epidemiological studies within prospective cohorts, and is suited for studying biomarkers that can be influenced by analytical batch, long-term storage and freeze-thaw cycles (7).
Case selection
The cases were identified by linking the Janus Cohort to the CRN using the individual’s Norwegian national identity number. We included cohort participants with the selected cancer sites as their primary cancer diagnosis. No prior cancer diagnosis (except non-melanoma skin cancer) were allowed, and at least 500 µl of serum had to be available. We restricted our selection to cases with at least one sample collected within 10 years prior to cancer diagnosis. The dynamics in RNA levels prior to a cancer diagnosis is sparsely described, however studies in Janus showed that most changes in RNA expression occurred close to a lung cancer diagnosis (8), and depend on staging (9). Based on our knowledge from using pre-diagnostic samples we considered 10 years prior to diagnosis as a reasonable timeframe for the study objectives. The average age of recruitment to the Janus cohort is 41 years. Because most cancers occur at older ages, most of the eligible cancer cases have a long lag-time between sampling and diagnosis. Therefore, we selected all available cases in the time window up to five years prior to diagnosis and a selection of eligible cases in the 5-10 years prior to diagnosis. Distribution of case samples in the different pre-diagnostic time-windows are shown in Figure 2.
Control selection
Controls were selected according to a modified version of the incidence density sampling method for nested case-control studies (10). The cases were set up in strata based on age at blood sampling, gender, time period and county of residence at sampling. The date of diagnosis for the latest case in the case strata, was set as limit for all controls in the strata. The controls had to be alive and free from cancer at this date to be included in the control-pool. In addition, they had to be free from cancer up to ten year after blood collection. Due to the many unknown factors that affect the RNA levels prior to a cancer diagnosis, we considered it appropriate to set these additional criteria for the controls to give the best basis of comparisons between cases and controls. We then randomly selected frequency matched controls such that the case/control ratio was the same within each stratum (11).
We profiled circulating RNA by sequencing 2 997 serum samples from 1 631 cancer cases. This includes cancer of the lung (n=404), colon (n=308), rectum (n=182), breast (n=206), prostate (n=332), testis (n=84), gallbladder (n=27) and ovaries (n=88). Gender, age at blood donation and age at diagnosis by cancer type is shown in Table 1. There are 199 cases with a second cancer diagnosis and 22 cases with a third cancer diagnosis. Furthermore, 107 of the controls developed a cancer diagnosis 10 years or more after the blood donation (not shown in table). Multiple samples (up to 7 time points) are available for a subset of the cancer patients, allowing investigation of temporal variations in samples from the same individual (Table 1). RNA profiles were also produced from 673 cancer-free control samples collected at a single time point. A pooled positive control sample, consisting of serum from several individuals, was profiled 28 times for quality and reproducibility assessment. In addition, for each batch of 96 samples, we included two negative control samples, one negative extraction control (NEC) and one water control from the library preparations.
Data set production
JanusRNA contains harmonized data from three sources: clinical cancer records from CRN, lifestyle information from nationwide health surveys, and RNA profiles and sampling information from Janus (Figure 1).
Cancer Registry data
Since1952, CRN has systematically collected mandatory notifications on cancer occurrence for the Norwegian population. The registration is considered to be close to complete from 1953, with 98.8% completeness for the registration period 2001– 2005 (5). Information from clinical notifications, pathology reports and death certificates are the main sources that enables the CRN to code and store data on cancer patients in Norway. Information from the Norwegian Patient Registry (NPR) is an important additional source for identifying cancer cases. Clinical registries, also administered by CRN provide detailed information about diagnostic procedures, pathology-examinations, treatment and follow-up from cases that were diagnosed after 2004 (12).
Health survey data on lifestyle, anthropometry and biochemical measurements
Survey data were collected at the time of blood donation from most study participants (Table 2). The quality, completeness and standardization of these data has been described elsewhere (6). In brief, participants in the health surveys completed a baseline questionnaire about smoking habits and physical activity, and anthropometry and blood pressure were measured. Biochemical parameters such as cholesterol, triglycerides and glucose were also measured. As shown in (Table 2), the proportion of current smokers ranges from 23% in rectal cancer to 70% in lung cancer in males, and from 27% in breast cancer to 70% in lung cancer in females. The percentage of obese individuals (BMI ≥ 30kg/m2) ranges from 2% in testicular cancer cases to 14% in male colon cancer cases, in females BMI ranges from 8% in breast cancer cases to 20% in gallbladder cancer cases. Sedentary physical activity levels (inactive + low) varied from 43% in male gall bladder cases to 69% in male lung cancer cases. In women, 60% of the breast cancer cases and 87% of the ovarian cancer cases reported sedentary physical activity habits.
Sequencing data production
We have developed a platform for small RNA sequencing, tailored to low RNA yield samples such as archived serum samples. This platform enables expression profiling of RNAs from 17-47 nucleotides, annotation of 9 RNA classes and sequence isoform identification. The quality and quantity of RNA was sufficient for analyses, independent of storage time and sample pre-processing (13). Sequencing data production entails three steps:
Step 1 – RNA isolation
RNA was extracted from 2 × 200 µl serum using the miRNeasy Serum/Plasma kit (Cat. no 1071073, Qiagen) on a QIAcube (Qiagen). Internal spike-in control C. elegans miR-39 was added to each sample. Glycogen (Cat. no AM9510, Invitrogen) was used as carrier during the RNA extraction step. The eluate was concentrated using Ampure beads XP (Agencourt). Cases and controls were blinded.
Step 2 - Library preparation
Small RNAseq libraries were created using NEBNext® Small RNA Library Prep Set for Illumina (Cat. No E7300, New England Biolabs Inc.) with a cut size on the Pippin Prep (Cat. No CSD3010, Sage Science) of 17-47 nucleotides.
Step 3 - RNA sequencing
12 samples were sequenced per lane on an Illumina HiSeq 2500 platform to an average depth of 18 million reads per sample. A detailed description of the RNA profiles is available (14).
Bioinformatics pipeline
A custom RNA analysis pipeline for processing raw RNA sequences to count data and expression profiles has been established (Figure 3). It includes adaptor and low-quality data filtering, mapping, and counting of RNA annotations. The RNAseq reads were initially trimmed for adapters using AdapterRemoval (v2.1.7) (15). We then mapped the collapsed reads (generated by FASTX v0.14) to the human genome (hg38) using Bowtie2 (10 alignments per read were allowed). All multi-mapped reads with equivalent mapping score were counted. We compiled a comprehensive annotation set from miRBase (v21) (16) for miRNAs, pirBAse (v1.0) for piRNAs63, GENCODE (v26) (17) for other RNAs and tRNAs. We used SeqBuster (v3.1) (18) to profile isomiR and miRNA profiles. To count the mapped reads, HTSeq (v0.7.2) (19) was used. Candidate tRNA fragments (tRFs) were selected from reads mapped to tRNA annotations. We later updated our workflow and tRFs were profiled using MINTmap tool (https://www.nature.com/articles/srep41184). When reporting the identified RNAs, we excluded RNAs with fewer than 10 reads in more than 20% of the samples (20).
The average sequencing depth and average number of RNAs by cancer type is given in Figure 4A and Figure 4B. The most abundant RNA type across all cancer types and controls is mRNA followed by isomiRs. The heat-map in figure 4 B shows the average number of identified RNAs for the different cancer types. The RNAseq read counts in millions for all RNA classes combined, ranged from 13.5 in ovarian cancer patients to 20.7 in female gallbladder patients, and standard deviations ranged between 2.9 and 6.2 (Supplementary Table 1).
Ethical approval
The JanusRNA study was approved by the regional committee for medical and health research ethics, Oslo, Norway (2016/1290) and (2013/1821), and is based on a broad consent from participants in the Janus cohort. The JanusRNA database contains pseudonymized data.
Funding
The project was funded by the Research Council of Norway under the Program: ‘Human Biobanks and Health data’ project numbers: 229621/H10 and 24879/H10 (production of sequencing data from cases with cancer of the lung, colon, rectum, breast, prostate, ovaries and the main part of controls). Sequencing costs for the testicular cancer cases and a small number extra controls were covered by the Norwegian Cancer Society (grant number 190157-2017) and sequencing costs for the gallbladder cases were supported by the European Union within the initiative “Biobanking and Biomolecular Research Infrastructure—Large Prospective Cohorts” (Collaborative study “Identification of biomarkers for gallbladder cancer risk prediction—Towards personalized prevention of an orphan disease”) under grant agreement no. 313010 (BBMRI-LPC) and the German Federal Ministry of Education and Research (BMBF, grant 01DN15021).
Data resource use
The JanusRNA dataset has been used to investigate the natural variation of circulating RNAs in healthy controls (14). The core serum RNA repertoire in the cancer-free control group includes 258 micro RNAs (miRNA), 441 piwi-interacting RNAs (piRNA), 411 transfer RNAs (tRNA), 24 small nucleolar RNAs (snoRNA), 125 small nuclear RNAs (snRNA) and 123 miscellaneous RNAs (misc-RNA). We investigated biological and technical variation in expression, and the results suggest that many RNA molecules identified in serum shows signs of inter-individual variation (14). The same dataset was used to investigate the association between circulating RNAs and common traits such as age, sex, smoking, BMI and physical activity. The study showed that common traits influence circulating RNA expression, in particular age and sex, and concluded that these traits should be treated as potential confounders for RNA analyses (20). RNA dynamics in lung and testicular carcinogenesis throughout a 10-year follow-up has been studied (9, 21). Ongoing studies include pre-diagnostic RNA dynamics in cancer of the lung, colorectal, prostate, breast, gallbladder, testis, ovary and a pan-cancer profile (https://www.kreftregisteret.no/en/Research/Projects/Small-non-coding-RNA-as-early-detection-cancer-biomarkers/).
Strengths and weaknesses
A strength of JanusRNA is the large number of population-based samples sequenced from eight different cancer types combined with high-quality cancer registry data and access to information on life-style factors at baseline such as BMI, smoking and physical activity, that may influence circulating RNA expression levels significantly. The selection of cases and controls is based on established procedures for linking biobank and registry data (22). We have created a robust and unique dataset to investigate RNA dynamics and biomarker potential across cancer types. It is also a strength that we have sequenced samples from a large cancer-free control group enabling us to study the variation in a healthy population. The control group gives the opportunity to explore disease versus trait-specific patterns, which is important for early detection biomarkers discovery. The control group profiles can be re-used to contrast compatible RNA profiles from a range of clinical studies. Another strength of the JanusRNA dataset is the pre-diagnostic sample collection, which provides the possibility of studying RNA levels prior to diagnosis and compare this with the healthy control group. We also use well-established sequencing platforms, biocomputational capacity and the RNA yield of our biobanked material as documented (13, 14, 20, 23). The sequencing read-depth is high (on average 18 million reads per sample), and targeting RNAs of 17-47 nucleotides enables comprehensive assessment of the major RNA classes.
There are some limitations to JanusRNA. First, we only have serial samples from approximately 30% of the participants, so the statistical power for investigating temporal changes in cancer specific analyses is low. Also, for the control subjects there is only one sample time point available with no opportunity to measure the changes in natural variation over time. Another limitation of the dataset is that detailed clinical information from clinical registries is available only for a subset of the cases with cancer of the breast (hormone receptor status), prostate (PSA and Gleason score), lung and rectum. Pathology reports can be reviewed to complete this information (24), however that is a time-consuming process. Further, we are also missing survey data from approximately 13% of the study participants. Another weakness is the technical noise arising from analysing archived serum samples with low amounts of RNA. However, we have characterised the technical and biological variability and since technical variability is random, it has little impact on association studies.
Reason to be cautious (limitations including generalisability)
All RNA studies, including this one, may have problems with annotation and the lack of functional information that makes the interpretation of findings challenging. Trusted annotations are essential to correctly identify transcripts, yet well-known annotation databases are not optimal (14, 25). For example, piRNA annotations contain fragments corresponding to other RNAs (26) something that might reduce comparability between trait associations. For reuse of our dataset one has to be aware of some observed batch effects correlating with changes in kit lot numbers.
Data resource access and collaborations
The JanusRNA datasets generated for this article are not readily available because of the principles and conditions set out in articles 6 (1) (e) and 9 (2) (j) of the General Data Protection Regulation (GDPR). National legal basis as per the Regulations on population-based health surveys and ethical approval from the Norwegian Regional Committee for Medical and Health Research Ethics (REC) is also required.
There is work in progress to develop data sharing mechanisms in compliance with GDPR (27-29). In the meantime, we welcome ideas and proposals for potential collaborations for using the dataset. To facilitate this process, interested researchers can contact miRJanus{at}kreftregisteret.no
Data Availability
The JanusRNA datasets generated for this article are not readily available because of the principles and conditions set out in articles 6 (1) (e) and 9 (2) (j) of the General Data Protection Regulation (GDPR). National legal basis as per the Regulations on population-based health surveys and ethical approval from the Norwegian Regional Committee for Medical and Health Research Ethics (REC) is also required. Requests to access the datasets should be directed to the corresponding authors.
Acknowledgements
The sequencing service was provided by the Norwegian Sequencing Centre www.sequencing.uio.no, a national technology platform hosted by Oslo University Hospital and the University of Oslo supported by the Research Council of Norway and the South-eastern Regional Health Authority. We acknowledge Dr Tom Grotmol at the CRN Department of Research and Prof Justo Lorenzo Bermejo Statistical Genetics Group, Institute of Medical Biometry & Informatics, University of Heidelberg in Germany for close collaboration and production of the testicular and gallbladder datasets, respectively. We also thank senior lawyer Hilde Olav at the CRN for help with all legal issues related to data sharing. We are thankful to the Norwegian Cancer Society for providing funding for the Janus Serum Bank management through the years of cohort recruitment. The study used data from the Cancer Registry of Norway and the Norwegian Institute of Public Health. The interpretation and reporting of these data are the sole responsibility of the authors, and no endorsement by the Cancer Registry of Norway or the Norwegian Institute of Public Health is intended nor should be inferred.