Abstract
The ability to predict the health trajectories of individuals from personalized risk scores can help formulate a preventive roadmap for a disease or its complications. Currently, most risk prediction algorithms are based on epidemiological data from the Caucasian population, and there is ample evidence that they do not work well for the Indian population owing to ethnic diversity, varied dietary and lifestyle patterns, and altered risk profiles. In this multi-centric pan-India study, we aim to address these challenges and develop clinically relevant personalized risk prediction scores for cardio-metabolic diseases in the Indian population. The program involves the longitudinal collection and bio-banking of samples from ∼10,000 CSIR employees, pensioners, and their spouses, for whom baseline sample collection is now complete. Multi-parametric data collected at baseline include a clinical questionnaire, lifestyle and dietary habits, anthropometric parameters, lung function assessment, liver elastography, ECG, and biochemical data, followed by molecular assays including genomics, plasma proteomics, metabolomics, and fecal microbiome. In addition to mining the data for associations between the different parameters and cardio-metabolic outcomes, we intend to develop models using artificial intelligence (AI) algorithms to predict phenotypic conditions. The study may be a step towards precision medicine for the Indian population, especially the middle-income stratum, and may aid in refining the normative values of health and disease parameters in the Indian population.
Introduction
Non-communicable diseases (NCDs) contribute significantly to human morbidity and mortality globally. Amongst NCDs, cardiovascular diseases (CVD) cause the highest number of deaths, about 17.9 million each year, followed by cancer (9.3 million), respiratory disorders (4.1 million), and diabetes (2 million). NCDs together account for 71% of the 55 million deaths globally (1). About 17 million NCD deaths occur in individuals younger than 70 years, and 86% of these early deaths are reported in low- and middle-income countries (LMICs) (1). India accounts for 5.8 million of these, which is nearly 10 percent of the global mortality due to NCDs. Most importantly, the onset of CVD in this population is nearly a decade earlier (2, 3). In India, for all ages, the disability-adjusted life-year (DALY) rate per 100,000 population was 36,300, of which NCDs contributed 46.6% overall; this share rose to 55.0% in urban areas (4). Nonalcoholic fatty liver disease (NAFLD) and nonalcoholic steatohepatitis (NASH) increase the risk of cardiovascular events and are therefore important to study within this framework. They have a global prevalence of 24% to 40%, which is expected to rise above 55% by 2040 (5–7). Liver and alcohol-related conditions are now among the top 15 NCDs for DALYs, derived largely from Years of Life Lost (YLL) (4). NCDs result from the interplay of genetic, physiological, environmental, and behavioral factors, and the risk factors contributing to them may be an amalgam of all of these. To reduce exposure to risk factors, it is critical to prioritize specific preventive measures that promote overall health (8). There is a need for more extensive population-based studies of NCDs in the Indian population to understand population-specific risk factors at a finer granularity than is offered by the guidelines derived from the Caucasian population, which form the current basis for evidence-based medicine.
India is home to nearly one-sixth of the world’s population, and its wide variation in ethnicity, geography, and age distribution makes it a genetic melting pot (9). Moreover, people of Indian origin form a large part of the diaspora in other countries, which makes it all the more important to understand the risk factors specific to the Indian population. Knowing India-specific risk factors for cardio-metabolic diseases may help formulate relevant preventive, predictive, personalized, and participatory policies and guidelines for precision medicine.
Precision medicine allows the health trajectories of individuals to be predicted so that early intervention can prevent the onset or complications of disease. Such prediction of health outcomes has traditionally been accomplished through risk scores developed from prospective cohort studies. The advent of multi-omics data and artificial intelligence-based big-data analytical tools has opened an unprecedented opportunity for developing novel personalized risk metrics to predict health outcomes (10, 11). A longitudinal population study, consisting of phenotypic measurements at baseline and long-term follow-up, can provide opportunities to understand the complex interrelationships between environmental, genetic, and other molecular factors and the risk of subsequent disease. It has the advantage of measuring parameters well before disease onset, thereby reducing selection bias and issues of reverse causality. Such epidemiological data are crucial for developing risk prediction tools and vital in establishing precision health and medicine. Currently, most of these risk prediction algorithms are based on epidemiological data from the Caucasian population, and there is ample evidence that they are inaccurate for the Indian population due to ethnic diversity, varied dietary and lifestyle patterns, and altered risk profiles (12–14). Moreover, several of the studies currently available for Indian populations are limited in phenotypic information or diversity.
The Indian population is known to have a predisposition to a centrally obese phenotype that increases their risk of cardio-metabolic diseases (15, 16). Hence, a detailed phenotyping and understanding of these risks’ genetic and metabolic underpinnings is essential. A longitudinal approach involving gathering phenotypic data from individuals across multiple time points ranging from physiological function, multi-omics data, assessment of cellular function, and microbiome diversity may aid in understanding the risks better. Integrating different OMIC data, including phenome, metabolome, proteome, genome, etc., is crucial to developing preventive protocols for various NCD risks. Phenomics can provide a strategy for discovering newer risk factors and diagnostic biomarkers and unveiling novel therapeutic targets and precision therapies for realizing precision medicine (17).
With the objective of understanding risk factors for cardio-metabolic diseases, especially diabetes and NAFLD, a multi-centric pan-India study has been undertaken. The current study, “Phenome India-CSIR Health Cohort Knowledgebase” (PI-CHeCK), was initiated by the Council of Scientific and Industrial Research (CSIR) and spans its 37 constituent laboratories and associated centers across 17 states and 2 union territories (UT) of India (18) (Figure 1).
On the one hand, CSIR encompasses various ethnicities, geosocial habitats, and occupational exposures, with a largely urban and semi-urban distribution, thus bringing in variables that could play a role in the objectives being studied; on the other hand, CSIR is fairly homogeneous in socio-economic status, with most of its staff belonging to urban and semi-urban middle- and higher-middle-income groups. This planned cohort (N = ∼10,000) comprises a captive population of literate working and retired staff and their spouses. India has some significant cohort studies, albeit localized or limited by geography, population heterogeneity, or the parameters monitored (18–22). Thus, to our knowledge, such a pan-country longitudinal cohort has not been previously reported. The planned study is a prospective collection (initially over five years at three time points) of extensive health information, including but not limited to physiological assessment of lung function through spirometry and oscillometry; cardiac function and heart rate variability (HRV) analyses by ECG; body composition analysis; liver assessment by transient elastography; dermatological assessment of skin moisture, sebum, and hydration; anthropometry; and molecular and biochemical assays of blood samples, in addition to the creation of a sample and data repository for multiple omics and microbiome studies of an active Indian population.
The goal is to generate a first-of-its-kind Indian dataset on different health-related and biomedical parameters from a country-wide employee cohort followed up over several years, which will be of significant public health importance and may allow the development of diagnostic and prognostic biomarkers. An employee cohort favors a lower dropout rate, as participants are educated and able to appreciate the study’s potential. Discussing clinically significant results with the study participants and encouraging them to seek appropriate medical attention should also contribute to a healthy and productive scientific workforce for CSIR.
Study Objectives
The primary objectives of the PI-CHeCK cohort study are as follows:
a) To develop a prospective longitudinal cohort of consenting CSIR employees and their spouses, including pensioners/retired employees.
b) To develop clinically useful personalized risk prediction scores for cardio-metabolic disorders, namely pre-diabetes, diabetes, and fatty liver disease, based on multi-omics data integration.
The long-term objectives for the listed disorders of the study are:
Develop risk prediction models for diabetes and liver fibrosis:
(i) Glycaemic control: Develop a binary prediction model for progression of the normo-glycaemic population to:
(a) Diabetes (from prior HbA1c <5.7% and fasting blood sugar (FBS) <100 mg/dL to HbA1c ≥6.5% and FBS ≥126 mg/dL, per American Diabetes Association (ADA) criteria)
(b) Pre-diabetes (from prior HbA1c <5.7% and FBS <100 mg/dL to HbA1c ≥5.7% and <6.5% and FBS ≥100 but <126 mg/dL, per ADA criteria)
(ii) NAFLD/Liver Fibrosis: Develop a binary prediction model for F0/F1 (no fibrosis) to F2/F3/F4 (moderate to severe fibrosis/cirrhosis) on liver Fibroscan (Transient Elastography)
Create and maintain a database and linked bio-repository.
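The glycaemic outcome definitions in (i) can be written down as a small labelling routine. The following is a minimal sketch, not the study's implementation: the function names and the "discordant" category (for HbA1c/FBS pairs that do not fall jointly into one class) are assumptions introduced here, and the cut-offs follow the ADA thresholds of HbA1c 5.7% and 6.5% and FBS 100 and 126 mg/dL.

```python
def glycaemic_category(hba1c_pct, fbs_mg_dl):
    """Classify one visit using joint HbA1c (%) and FBS (mg/dL) thresholds."""
    if hba1c_pct >= 6.5 and fbs_mg_dl >= 126:
        return "diabetes"
    if 5.7 <= hba1c_pct < 6.5 and 100 <= fbs_mg_dl < 126:
        return "pre-diabetes"
    if hba1c_pct < 5.7 and fbs_mg_dl < 100:
        return "normo-glycaemic"
    # Assumption: pairs where the two markers disagree are flagged, not forced into a class.
    return "discordant"


def progressed(baseline, follow_up, target):
    """Binary outcome label: normo-glycaemic at baseline and `target` at follow-up.

    `baseline` and `follow_up` are (hba1c_pct, fbs_mg_dl) tuples.
    """
    return (
        glycaemic_category(*baseline) == "normo-glycaemic"
        and glycaemic_category(*follow_up) == target
    )


if __name__ == "__main__":
    print(progressed((5.4, 92), (6.0, 110), "pre-diabetes"))
    print(progressed((5.4, 92), (6.8, 140), "diabetes"))
```

Keeping discordant pairs separate makes it explicit how many participants would need an adjudication rule before model training.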
Methods
Design of the study
The study is primarily a longitudinal cohort design, with two additional follow-up time points beyond the present first phase of baseline collection, which is now complete.
Sampling sites and sample size determination
All 37 CSIR labs and associated centers are categorized into four zones primarily to assist in better logistics management. Zones were defined based on the geographical locations of the labs; the current number of regular employees was taken as a baseline count from each CSIR lab. The baseline count was multiplied by four to accommodate spouses of current employees, an equal number of pensioners, and pensioners’ spouses. Here, the assumptions made were a) the pensioners are of at least the same number as the employees of each lab, and b) employees and pensioners have spouses. Thus, an overall count for each lab was obtained and further weighted proportionately so that each zone had at least 2,500 samples. This was done to keep the follow-up studies’ attrition rate below 20-30%. The information on the four zones and respective labs is given in Supplementary Table S1. Four zonal nodal Institutes (Indian Institute of Chemical Biology (CSIR-IICB), East zone; National Chemical Laboratory (CSIR-NCL), West zone; Institute of Genomics and Integrative Biology (CSIR-IGIB), North zone; Centre for Cellular and Molecular Biology (CSIR-CCMB), South zone) were designated to coordinate this activity. (Figure 1).
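The arithmetic described above (baseline employee count multiplied by four, then weighted proportionately within a zone) can be sketched as follows. This is one plausible reading of the procedure under stated assumptions: the lab names and employee counts are hypothetical, and the rounding-up rule is an assumption chosen so that the zone total does not fall below 2,500.

```python
import math


def eligible_pool(employee_count):
    """Employees + their spouses + an equal number of pensioners + pensioners' spouses."""
    return employee_count * 4


def zone_targets(lab_employees, zone_minimum=2500):
    """Split a zone's minimum sample size across labs in proportion to each lab's pool."""
    pools = {lab: eligible_pool(n) for lab, n in lab_employees.items()}
    total_pool = sum(pools.values())
    # Rounding up (an assumption) guarantees the zone total is at least `zone_minimum`.
    return {
        lab: math.ceil(zone_minimum * pool / total_pool)
        for lab, pool in pools.items()
    }


if __name__ == "__main__":
    # Hypothetical employee counts for three labs in one zone.
    example_zone = {"Lab A": 300, "Lab B": 500, "Lab C": 200}
    targets = zone_targets(example_zone)
    print(targets, sum(targets.values()))
```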
Inclusion/Exclusion criteria
All CSIR permanent staff, i.e., employees, pensioners, and their spouses, were eligible to be enrolled in this study. The exclusion criteria were pregnancy and age under 18 years. Further, a participant was excluded from a specific scanning modality if he/she gave a history of a surgery or medical complication defined for that modality, as listed in Supplementary Table S2. Participants were asked to fast for a minimum of 10-12 hours before arriving at the sampling site for phlebotomy.
Tests and scans planned
Three main categories of assays were planned: blood-based, scanning-based, and stool sample-based. A brief description is given in Table 1. A few modalities, such as spirometry, oscillometry, TE of the liver, and BCA, may be repeated in a selected subset of participants at the same or different time points to understand multiple physiological variable states or outcomes based on molecular assays. Consent was obtained from participants before enrollment.
Contraindications to the imaging/scanning/other tests being done
Specific contraindications for the tests are listed in Supplementary Table 2, along with specific query-based enrollment if the volunteering participants are doubtful of their medical history.
Cohort enrolment and sample collection
The first sample collection phase was planned for six months, from December 2023 to June 2024. In each lab, about a week was spent sampling nearly 50 participants per day. The second and third sampling phases are planned for 2025-26 and 2026-27.
To ensure maximum participation, awareness of the program’s importance and participatory nature was raised at each of the constituent laboratories of CSIR through brochures and web-based and personal interactions. A signed physical consent form was obtained from the participants during the sample collection.
Registration & Survey Questionnaire, Slot Booking
Once a lab was made aware of the project, participants were registered through a secure online portal, the CSIR Cohort portal (available at https://www.csircohort.org/), which was inspired by HIPAA guidelines and incorporates safeguards such as two-factor authentication, backend data encryption, user timeouts, role-based access (administrator types), data integrity controls, and logging privileges. The portal is equipped with various supporting documents that make participants aware of the objectives, the scientific rationale of the protocols, and the interpretation of results. Using two-factor authentication, a participant may access his/her recorded health parameters and reports at any time from the portal. Administrators were assigned to defined roles based on the data modalities.
Participants must fill out a baseline questionnaire, including questions about their family and residence. This information was recorded for authentication purposes only. The backend data structure of the baseline questionnaire would be placed in a different table, which is not connected to any other data pipelines. The entire table will be encrypted and not accessible via the back end. The entries will be indexed with Registration IDs, which become the only identifying factor for a participant.
Once the respective coordinator approves, the participant can complete the questionnaire online. The participant’s personal information was converted to the computer-generated registration ID once the coordinator approved the participant. Henceforth, the participant’s identity is represented by this registration ID, which only the participant knows. The questionnaire is embedded in REDCap, which enables the participant to provide self-reported information in different fields. Details of the questionnaire are provided in Supplementary Table 3 and Table 4. The participant will only be eligible for slot booking when the questionnaire is submitted successfully. The slot booking procedure includes precautionary questions for the testing day. Participants can then book the slot date and time with eligible tests. The reports section is accessible via two-factor authentication, which fetches the reports with date and time.
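The separation between identifying details and study data described above can be illustrated with a short sketch. It is a simplified illustration only, not the portal's code: the ID format, the `REG` prefix, and the in-memory dictionaries standing in for the encrypted identity table and the study tables are all hypothetical.

```python
import secrets


def new_registration_id(existing_ids, prefix="REG"):
    """Generate a random registration ID that is not already in use."""
    while True:
        candidate = f"{prefix}-{secrets.token_hex(4).upper()}"
        if candidate not in existing_ids:
            return candidate


def enroll(identity_table, study_table, personal_info):
    """Store personal details and study data in separate tables linked only by the ID."""
    reg_id = new_registration_id(identity_table)
    identity_table[reg_id] = personal_info  # stand-in for the isolated, encrypted table
    study_table[reg_id] = {}                # questionnaire and test data carry no identifiers
    return reg_id


if __name__ == "__main__":
    identities, study_data = {}, {}
    reg_id = enroll(identities, study_data, {"name": "Example Participant"})
    study_data[reg_id]["questionnaire_submitted"] = True
    print(reg_id, study_data[reg_id])
```

Because the ID is random rather than derived from personal details, nothing about the participant can be inferred from the ID itself.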
Once the questionnaire has been completed, the participant can book a slot for sample collection. During this process, the participant must consent to blood collection, which is mandatory for participation. Registered participants from each laboratory will be informed of the dates for sample collection at their chosen lab and can then schedule the date and time of their visit through the online portal at their convenience. If a participant chooses to provide a stool sample, the kit can be obtained from the collection center 48 hours before the sampling date, and the participant can bring the stool sample on the sampling day. Reminders will be sent to participants by email and mobile through the portal on which they registered.
Data Acquisition (Materials and Methods)
All the stations at all labs and centers had the same equipment except Oscillometry. Stations for different scanning modalities are given in Supplementary Table 5 and briefly described as follows:
Anthropometry
Anthropometry was carried out to measure height, weight, and chest (CC), waist (WC), abdominal (AC), and hip (HC) circumferences.
Height (in cm) - As recommended by ICDS guidelines, a standard stadiometer was utilized at all centers.
Weight (in kg) - Weight was measured using the Accuniq Body Composition Analyzer (Manufacturer-SELVAS Healthcare, Korea; Model BC 380).
Body Circumferences (in cm) - CESCORF measuring tape (Manufacturer – Cescorf, Brazil) was utilized to measure CC, WC, AC, and HC.
The following definitions were adopted to measure circumferences thrice at each site (23, 24).
CC-“Measure the chest circumference at the most significant part of the chest, which is usually across the level of the nipple line in males and just above the breast tissue in females. Measurement should be taken at the end of a normal expiration”.
WC-“A horizontal measure is taken at the midpoint between the lower margin of the last palpable rib and the top of the iliac crest”.
AC-“The tape should be held behind the participant with one edge at the horizontal plane through the center of the umbilicus”.
HC-“The participant is standing erect, and the feet close together. A horizontal measure is taken around the widest portion of the hips and buttocks”.
Body Composition Analysis
Body Composition Analysis (BCA) was done on the Accuniq Body Composition Analyzer (manufacturer-SELVAS Healthcare, Korea Model BC 380) as per the standard protocol specified by the manufacturer.
Transient Elastography of Liver
The FS Mini 430 Plus model (Manufacturer-Echosens, France) with M or XL probes was used to acquire the scan and data. FibroScan is a simple and safe technique to assess liver stiffness and fat content and takes 5-10 minutes on average per examination. The procedure requires the subject to have fasted for 6 hours, which was met through the fasting condition already specified for the study. The subject was asked to lie supine with the right arm in full abduction. The examiner began by placing the probe along the 5th to 7th intercostal space in the mid or anterior axillary line to obtain a view of the right lobe of the liver. Once the area was identified, measurements were obtained using the FibroScan probe.
Liver Stiffness Measurement (LSM) and Continuous Attenuation Parameter (CAP) measurements will be captured by trained personnel blinded to patients’ clinical data. Although the literature points out the use of an XL probe for BMI greater than 30, here we utilized and worked with the recommender system of the device for probe selection. Failure of transient elastography is defined as the inability to obtain valid LSM or CAP value (25–27). Manufacturer guidelines were followed to acquire a minimum number of valid measurements, usually 10.
Spirometry
The EasyOne Air (Manufacturer-NDD, Switzerland), a widely used handheld spirometer based on ultrasound sensing, was used in the study to capture spirometry data. ATS/ERS guidelines were followed for data capture and interpretation (28–31). FEV1 and FVC are generally used to diagnose lung function abnormalities, with an FEV1/FVC ratio of <0.7 indicative of obstructive airway disease (OAD). If the ratio is normal, i.e., ≥0.7, an FEV1 and/or FVC of <80% predicted indicates low lung function; a reduced FVC with a preserved ratio points to a restrictive pattern (32).
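The interpretation rule stated above can be expressed as a short decision function. This is a minimal sketch using only the fixed cut-offs quoted in the text (ratio 0.7 and 80% of predicted); the function name and category labels are assumptions, and it is not a substitute for the full ATS/ERS interpretation workflow.

```python
def classify_spirometry(fev1_l, fvc_l, fev1_pct_pred, fvc_pct_pred):
    """Apply the fixed-ratio rule: check obstruction first, then percent predicted."""
    ratio = fev1_l / fvc_l
    if ratio < 0.7:
        return "obstructive airway disease"
    if fev1_pct_pred < 80 or fvc_pct_pred < 80:
        return "low lung function (preserved ratio)"
    return "normal"


if __name__ == "__main__":
    # Hypothetical readings: FEV1 and FVC in litres, then percent of predicted values.
    print(classify_spirometry(2.0, 3.5, 70, 90))
    print(classify_spirometry(2.4, 3.0, 75, 72))
    print(classify_spirometry(3.2, 4.0, 95, 98))
```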
Oscillometry
Oscillometry was performed before spirometry using the Tremoflo C-100 Airwave Oscillometry System (Thorasys, Canada) or the Resmon Pro V3 System. Oscillometry measures perturbations to sound signals of different frequencies superimposed onto normal tidal breathing, which helps characterize the resistance of small airways and airway distensibility. The signal was recorded for at least three artifact-free breaths with the Tremoflo or for 45 seconds with the Resmon Pro (automated in the device itself), which applies its QC before capturing acceptable maneuvers. The primary parameters monitored are resistance and reactance at 5 Hz (R5 and X5), resistance at 20 Hz or 19 Hz (R20/R19), small-airway resistance as the difference between 5 and 20 Hz (R5-20 or R5-R19), and resonant frequency (Fres).
ECG
ECG was recorded for around 10,000 participants at a 500 Hz sampling rate using the Pisces 1012 12-lead ECG (Manufacturer Allengers, India). The ECG uses a 12-lead setup with a high-pass filter at 0.05 Hz, a low-pass filter at 35 Hz, and a 50 Hz notch filter. The sweep speed is 25 mm/s, and the amplitude is 1.00 cm/mV. This process ensures the accurate and efficient collection of ECG data. The ECG machine is connected to a laptop, and participant details are entered into the software. A two-minute recording was taken while the participant remained motionless. The data are saved automatically, generating a PDF that is uploaded to the participant’s dashboard. A hard copy of the ECG was provided to the participant. The station administrator uploads all data in PDF and Excel formats to the portal by the end of the day.
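To illustrate what a 50 Hz notch does at the 500 Hz sampling rate used here, the sketch below implements a second-order (biquad) notch in plain Python. The device's internal filter design is not described in the protocol, so the biquad form (the widely used audio-EQ cookbook notch) and the quality factor of 30 are assumptions chosen purely for illustration.

```python
import math


def notch_coefficients(f0_hz, fs_hz, q):
    """Normalized biquad notch coefficients (b, a) centred on f0_hz."""
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a


def apply_biquad(samples, b, a):
    """Filter a sequence with a direct-form I biquad, starting from zero state."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


if __name__ == "__main__":
    fs = 500.0  # sampling rate stated in the protocol
    b, a = notch_coefficients(50.0, fs, q=30.0)  # Q is an assumed value
    hum = [math.sin(2 * math.pi * 50.0 * n / fs) for n in range(2000)]
    cleaned = apply_biquad(hum, b, a)
    print(max(abs(v) for v in cleaned[-500:]))  # residual mains hum after settling
```

A narrow notch such as this removes mains interference while leaving the lower-frequency content of the ECG waveform essentially untouched.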
Skin analysis and phenotyping
To develop standards for objective assessment of skin barrier in Indian skin types, we measured skin barrier function in around 10,000 participants utilizing Courage Khazaka instruments (Manufacturer Courage Khazaka Electronic GmbH, Germany) in terms of four biophysical parameters, employing distinct probes for measurement:
Skin sebum content (Sebumeter): The measurement process involves using grease spot photometry with the Sebumeter® SM 815. The matte tape contacts the skin or hair, becoming transparent based on surface sebum. A photocell measures transparency when the tape is inserted into the device aperture, indicating sebum content by light transmission. The Sebumeter measures sebum levels on the forehead and left cheek, with readings taken over 30 seconds at each site.
Transepidermal Water Loss (Tewameter): The Tewameter® probe measures the density gradient of the water evaporation from the skin (Δc), which is proportional to the TEWL. The probe takes three 10-second measurements on the forehead, left cheek, and left volar forearm to assess TEWL.
Stratum Corneum Hydration (Corneometer®): The measurement is based on capacitance measurement of a dielectric medium, here the stratum corneum, the uppermost layer of the skin. With increasing hydration, its dielectric properties change. The measurement relies on the fact that water has a higher dielectric constant than most other substances (mainly < 7) and is done on the forehead, left cheek, and left volar forearm.
Skin Elasticity (Cutometer): The measuring principle of the Cutometer® is based on the suction method, where negative pressure deforms the skin mechanically. The device creates pressure, drawing the skin into the probe’s aperture. After a defined time, it is released again.
Blood collection
Since several parameters were being investigated, we expected to collect about 40-50 samples per day. The samples collected were transported to CSIR-IGIB in a temperature-controlled environment. CSIR-IGIB will assign a sample code to each sample, and aliquots will be transported to the CSIR labs designated to perform the respective molecular analyses. The remaining aliquots will be stored (bio-banking). A few of the routine biochemical analyses were outsourced to a private diagnostic laboratory that will use the same equipment for all samples and meet the predefined quality control requirements of the consortium in all phases of sample collection. Other biochemical analyses will be performed at IGIB (details in the Supplement). The clinically relevant results will be uploaded to the central database by CSIR-IGIB, and the participants will be provided passcode-enabled access to their results. The data will be stored in a data repository at IGIB with a backup at 4PI. The sample flow and processing work plan is shown in Figure 2.
Bio-Banking
Long-term Bio-banking of samples plays a crucial role in developing and maintaining longitudinal prospective cohorts like the CSIR-cohort, which involves the collection of more than 30,000 samples over five years, including baseline assessments and two follow-up evaluations. This will allow us to study common diseases over time, focusing on identifying risk factors and prevention strategies from the data acquired in large, diverse populations, understanding disease mechanisms, and developing targeted treatments. Not only does bio-banking hold immense potential to foster proactive public health strategies, but it is also crucial to address ethical, legal, and social implications to maximize the benefits of bio-banking while safeguarding participant rights and data integrity. Anonymization and de-identification are the key features that are employed in the maintenance of the biorepository to protect participant privacy, which is done through the assignment of a LAB-ID for each participant at the time of sample collection and is later converted to a TEST-ID for various assays based on the study and its objectives. This process ensures that the data remains linked to the correct samples while maintaining confidentiality. The unused samples will be stored in 0.5 ml 2D tubes, each marked with unique QR codes and tube numbers for precise traceability. This meticulous system supports robust and reliable research by accurately correlating samples with participants and health outcomes. Maintaining a standard bio-bank can expand global biobank networks and comprehensive data-sharing frameworks to enhance the scope and impact of research further, allowing for more representative studies across diverse populations.
Assays to be performed and planned
Biochemical and Molecular Analysis
The molecular studies, including genomics, transcriptomics, proteomics, metabolomics, metallomics, immunological, and microbiome studies, will be performed by different CSIR labs. A Global Screening Array (GSA 3.0, Illumina, USA) SNP array is proposed for the genomic studies. Mass spectrometry-based MRM assays will be performed for proteomic and metabolomic studies, and inductively coupled plasma optical emission spectroscopy (ICP-OES) will be used for metallomic studies. This comprehensive profiling will be done in a subset of 50% of the participants, chosen by randomization controlled for gender and geographic (zonal) distribution. The list of biochemical assays to be performed on the samples is provided in Supplementary Table 6.
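One way to draw a 50% subset while controlling for gender and zone is stratified random sampling, sketched below. The protocol does not specify the randomization procedure, so the stratification scheme, the fixed seed, the rounding rule for odd-sized strata, and the record field names (`id`, `gender`, `zone`) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_half(participants, seed=0):
    """Randomly select half of the participants within each (gender, zone) stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for person in participants:
        strata[(person["gender"], person["zone"])].append(person["id"])
    selected = []
    for key in sorted(strata):
        ids = sorted(strata[key])
        k = (len(ids) + 1) // 2  # assumption: odd-sized strata round up
        selected.extend(rng.sample(ids, k))
    return selected


if __name__ == "__main__":
    # Hypothetical roster: 8 participants per gender per zone.
    roster = [
        {"id": f"{zone}-{gender}-{i}", "gender": gender, "zone": zone}
        for zone in ("North", "South", "East", "West")
        for gender in ("F", "M")
        for i in range(8)
    ]
    chosen = stratified_half(roster)
    print(len(chosen), "of", len(roster), "selected")
```

Sampling within strata keeps the gender and zonal proportions of the subset equal to those of the full cohort, which simple random sampling only achieves on average.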
Targeted Quantitative Metabolomics and Proteomics
Targeted metabolomics will be performed using the MxP Quant 500 Kit (Biocrates), and Web IDQ Software will be used to calculate the concentration of plasma metabolites. MRM (Multiple Reaction Monitoring)-based absolute quantification of a panel of important proteins associated with cardio-metabolic diseases will be performed on human plasma (∼10 µL) in a single LC-MS/MS run through a Scheduled-MRM method using heavy isotope-labeled proteotypic peptides. In brief, total proteins from plasma samples will be extracted, estimated, and digested per standard protocols (33–35). The MRM of the targeted proteotypic peptides will be run on a triple quadrupole mass spectrometer using standard and validated methods. The peak areas of endogenous and heavy isotope-labeled peptides will be used to calculate the concentration of endogenous peptides. The concentration of targeted plasma proteins will then be calculated from the concentration of endogenous proteotypic peptides by the ‘Protein-peptide equivalent molarity concept’.
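The two calculation steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: single-point calibration against a heavy peptide spiked at a known concentration, complete digestion so that one mole of protein yields one mole of proteotypic peptide, and hypothetical peak areas and molecular weight in the example.

```python
def endogenous_peptide_conc(area_endogenous, area_heavy, heavy_conc_fmol_per_ul):
    """Stable-isotope dilution: scale the known heavy-peptide concentration by the area ratio."""
    return (area_endogenous / area_heavy) * heavy_conc_fmol_per_ul


def protein_conc_ng_per_ml(peptide_conc_fmol_per_ul, protein_mw_da):
    """Equivalent molarity: protein molar concentration equals that of its proteotypic peptide.

    1 fmol/uL equals 1 nmol/L; multiplying by molecular weight (g/mol) gives ng/L,
    and dividing by 1000 converts to ng/mL.
    """
    return peptide_conc_fmol_per_ul * protein_mw_da / 1000.0


if __name__ == "__main__":
    # Hypothetical values: peak areas in arbitrary units, heavy spike at 100 fmol/uL.
    peptide = endogenous_peptide_conc(5.0e5, 1.0e6, 100.0)
    print(peptide, "fmol/uL")
    print(protein_conc_ng_per_ml(peptide, 30000.0), "ng/mL")
```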
Metallomics in Phenome-India Population
Metals are indispensable for the growth and development of organisms. The metal profile of the Indian population is required to establish the role of metal ions in disease biology in general and in cardiovascular disease and diabetes in particular. Although small cohort studies in India have examined a particular disease or particular metal ions in a defined region, large-scale multi-element studies across India have not yet been reported (36). Few other countries have reported metal profiles in extensive cohort studies using multi-element standards or multiple calibrations to measure metal ions at ppb to ppm levels (37). Metals play a vital role in physiology and biochemistry, and their role in biochemical pathways and the proper functioning of cellular systems cannot be ruled out. Keeping this in view, the Phenome India project aims to analyze more than 30 essential, trace, and toxic metals in participants’ serum and correlate them with the objectives under study. Moreover, the metal profile of the Indian population will be established for the first time according to geographical location and food habits recorded in the questionnaire.
Phenome India will use the matrix-matched serum calibrators (Recipe) to measure more than 30 elements (Al, As, Au, Ag, Ba, Be, Bi, Cs, Cd, Cr, Co, Cu, Fe, Hg, I, Li, Mg, Mn, Mo, Ni, Pd, Pt, Pb, Sr, Se, Sb, Sn, Ti, Ti, V and Zn) through Inductively Coupled Plasma Mass Spectrometry (ICP-MS).
RNA-seq Assay protocol
We will use the Illumina TruSeq Stranded Total RNA Library Prep Gold kit (cat. no. 20020599) to prepare sequencing libraries from 250 ng of total RNA extracted from participants’ blood, as per the manufacturer’s reference guide (1000000040499 v00). In summary, cytoplasmic and mitochondrial rRNA will be depleted using target-specific biotinylated oligos and Ribo-Zero rRNA removal beads, and the purified RNA will be fragmented using divalent cations at high temperature. AMPure XP beads (A63881; Beckman Coulter) at a sample-to-bead ratio of 1:1 will be used to clean the final library. Library quality will be checked using an Agilent 2100 Bioanalyzer. The libraries will be sequenced on the NextSeq 2000 platform with the P4 sequencing reagent kit, with 2 x 151 cycles of sequencing by synthesis (SBS) at a final loading concentration of 650 pM. The initial quality assessment of raw sequencing reads will be conducted using FastQC, and adapter sequences will subsequently be removed using Trimmomatic. Before identifying editing sites, we will filter reads to ensure stringent read quality, removing reads with a Phred score < 30 and trimming poor-quality bases. All reads selected for analysis will therefore have a maximum error probability of 0.001 per base. Trimmed sequencing reads will be used for differential gene expression analysis. The reads will be mapped to the human genome using the Salmon quasi-mapping tool to quantify transcript expression levels. DESeq2 will be employed for the differential gene expression analysis, using the Wald test. Genes with an adjusted p-value ≤ 0.05 and an absolute log2 fold change ≥ 1.5 will be considered differentially expressed.
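Two numerical rules in this protocol can be made concrete: the link between a Phred score of 30 and an error probability of 0.001, and the thresholds for calling a gene differentially expressed. The sketch below is illustrative only; the function names and example values are hypothetical, and it stands apart from the FastQC, Trimmomatic, Salmon, and DESeq2 tools named in the text.

```python
def phred_to_error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def is_differentially_expressed(padj, log2_fold_change, alpha=0.05, lfc_cutoff=1.5):
    """Apply both thresholds: adjusted p-value and absolute log2 fold change."""
    return padj <= alpha and abs(log2_fold_change) >= lfc_cutoff


if __name__ == "__main__":
    print(phred_to_error_probability(30))
    # Hypothetical results for three genes: (adjusted p-value, log2 fold change).
    for padj, lfc in [(0.01, 2.1), (0.01, -1.8), (0.20, 3.0)]:
        print(padj, lfc, is_differentially_expressed(padj, lfc))
```

Taking the absolute value of the log2 fold change treats up- and down-regulated genes symmetrically.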
Genomic studies (GSA)
To discern genetic markers associated with cardio-metabolic diseases, we intend to genotype about 5000 participants in the study using the Global Screening Array (GSA) 3.0 (Illumina, USA) on the iScan system (Illumina, USA). The GSA 3.0 includes a broad set of single nucleotide polymorphisms (SNPs) that are informative across global populations, covering common, low-frequency, and rare variants. In addition, several disease-related and pharmacogenetically significant SNPs on the array will enable us to analyze their prevalence in the context of Indian populations. We expect the analysis to enable us to generate polygenic risk scores with improved accuracy for cardio-metabolic diseases in Indians. The data will also be used to explore previously unexplored associations of these SNPs with the fecal microbiome, serum metal content, skin hydration status, etc.
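As an illustration of what a polygenic risk score computes, the sketch below uses the common additive form, a weighted sum of risk-allele dosages. This is one widely used formulation and not necessarily the one the study will adopt; the SNP identifiers, effect weights, and dosages are hypothetical.

```python
def polygenic_risk_score(dosages: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Additive PRS: sum over SNPs of (effect weight x risk-allele dosage).

    Only SNPs present in both the genotype data and the weight table contribute.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[snp] * dosages[snp] for snp in shared)


# Hypothetical per-SNP effect weights (e.g. log odds ratios from a reference GWAS).
weights = {"snp_1": 0.20, "snp_2": -0.10, "snp_3": 0.05}

# Hypothetical genotype of one participant: risk-allele counts (0, 1, or 2).
dosages = {"snp_1": 2, "snp_2": 1, "snp_4": 0}

print(polygenic_risk_score(dosages, weights))
```

In practice the raw score is usually standardized against the cohort distribution before being interpreted as a relative risk.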
Serum miRNA profiling
Human serum contains an array of miRNAs, which may be promising biomarkers for various conditions, especially cardiovascular disease, cancer, and neurodegenerative conditions, where the lack of objective diagnostic criteria is a long-standing problem (38). However, no baseline data are available for serum miRNA profiles from apparently normal Indians. Here, total RNA will be isolated from 100 µL of serum collected from subjects, using the RNA Isoblood reagent as per the manufacturer’s instructions. A polyA tail will be added using E. coli Poly(A) polymerase (NEB #M0276). Further, using the polyA tail to anchor a universal adaptor (5’ CTCAATCGTACATAGAAACAGGGATCTTTTTTTTTTTTTTTTTTVN 3’), the cDNA corresponding to the RNA will be prepared as per the manufacturer’s protocol (M-MuLV RT; NEB #M0253L). A DNA oligonucleotide corresponding to the miRNA sequence will be used as the forward primer, along with the universal reverse primer (CTCAATCGTACATAGAAACAGGGATC) and SYBR PCR mix (KAPA SYBR FAST qPCR Master Mix (2X) Universal; Roche #KK4618), to amplify selected miRNAs by RT-PCR (Applied Biosystems QuantStudio™ 6-Flex). Lastly, the threshold cycle (CT) value of each miRNA will be compared to that of U6 RNA to correct for differences in the starting amount of total RNA. The expression level of miRNAs will be estimated using ΔCT, where ΔCT = CT (U6) - CT (miRNA) (39).
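The normalization above can be written out as a short sketch. The ΔCT definition is taken from the text; the conversion to a relative abundance as 2^ΔCT is an added assumption (it presumes roughly 100% amplification efficiency for both targets), and the CT values are hypothetical.

```python
def delta_ct(ct_u6: float, ct_mirna: float) -> float:
    """ΔCT as defined in the text: CT(U6) - CT(miRNA)."""
    return ct_u6 - ct_mirna


def relative_abundance(ct_u6: float, ct_mirna: float) -> float:
    """Abundance of the miRNA relative to U6, assuming perfect doubling per cycle."""
    return 2 ** delta_ct(ct_u6, ct_mirna)


# Hypothetical CT values for one serum sample.
ct_u6, ct_mirna = 18.0, 24.0

print(delta_ct(ct_u6, ct_mirna))            # negative: miRNA crosses threshold later than U6
print(relative_abundance(ct_u6, ct_mirna))  # miRNA abundance as a fraction of U6
```

With this sign convention, a larger ΔCT corresponds to a higher expression level of the miRNA relative to U6.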
Telomere
Telomeres, present at the chromosome ends, offer genomic stability. With each round of cell division, telomeres shorten, and accelerated shortening of telomeres has been reported in different physiological and pathological conditions (40, 41). Telomere length is broadly understood as a molecular marker that discerns healthy and diseased states. Evaluating telomere length over a longitudinal cohort of healthy volunteers might provide interesting insights into whether and how telomere length changes in healthy people over the years, and whether short or long telomeres in volunteers correlate with other biomarkers and/or physiological conditions. We use a reported quantitative PCR-based method to analyze telomere length from about 0.5-1 microgram of isolated DNA (42). The method, optimized by us, amplifies telomeric repeats by PCR, and the signal is then normalized to a single-copy reference gene.
Gut Microbiome
Stool samples were collected by the participants at home or at the sample collection center using the feces catcher (Zymo Research) and sterile hand gloves (Kimtech) provided in the stool sample collection kit, and placed in a Sterile Multi-Purpose Clinical Sample Collector with spoon (Himedia). Samples were stored at 0-4℃ immediately after collection until DNA extraction. Participants who reported having taken antibiotics in the two weeks before sample collection were excluded from the study. Metagenomic DNA extraction was performed on the same day from 300 mg of stool sample using the QIAamp PowerFecal Pro DNA kit (Qiagen) according to the manufacturer’s instructions. DNA was eluted in Tris-EDTA buffer instead of the C6 solution provided in the kit. The V3-V4 region of the 16S rRNA gene was sequenced using the ‘16S Metagenomic Sequencing Library Preparation’ protocol by Illumina. The 341F and 785R 16S rRNA gene primers with Illumina indexing overhangs were used. The libraries were sequenced on the MiSeq™ platform using the MiSeq™ Reagent Kit v3 (2 × 300 bp; Illumina) at a depth of 100,000 reads per sample. The taxonomic composition of the metagenomes was profiled and analyzed using the QIIME2 pipeline.
Cytokine Analysis
In our study, we used multiplexing to analyze cytokine concentrations in human plasma with a Bio-Plex Pro human cytokine assay (cat no – 12007283), which enables simultaneous detection and quantification of 48 different cytokines and thus a more efficient and comprehensive analysis of the cytokine profile. In this method, 40 µL of plasma isolated from blood samples was diluted 1:2 with the sample diluent provided in the kit. The samples were then prepared for analysis following the instructions provided by the kit manufacturer and analyzed on the Bioplex 200 platform (Bio-Rad, USA).
Data acquisition and storage
Data Management
Technical Specifications
The backend API and frontend application were integrated and deployed primarily on a virtual machine provided by IGIB. On the back end, data was stored in an architecture based on a relational database management system. The front end was developed using a PHP-based architecture and framework, with modular support from JavaScript, CSS, and web server-related programming widgets. Robust data security and access management protocols were deployed, and non-standard development practices are avoided to prevent potential SQL injection attacks. Data backup, recovery, and mirrored deployment (disaster management protocols) were implemented (Figure 3).
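One standard safeguard against SQL injection is the use of parameterized queries, in which user input is passed to the database separately from the query text. The sketch below illustrates the idea with Python's built-in sqlite3 module purely as a stand-in; the portal itself is described as PHP-based, and the table, column names, and sample IDs here are hypothetical.

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sample_id TEXT PRIMARY KEY, assay TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("S001", "ICP-MS"), ("S002", "RNA-seq")],
)


def find_sample(connection: sqlite3.Connection, sample_id: str) -> list:
    """Look up a sample using a bound parameter instead of string concatenation."""
    cursor = connection.execute(
        "SELECT sample_id, assay FROM samples WHERE sample_id = ?",
        (sample_id,),
    )
    return cursor.fetchall()


print(find_sample(conn, "S001"))
# A classic injection payload is treated as a literal ID and matches nothing.
print(find_sample(conn, "S001' OR '1'='1"))
```

Because the input is bound as data, it can never alter the structure of the SQL statement, regardless of what characters it contains.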
Data acquisition
The data acquisition process for the PI-CHeCK study, which focuses on examining the long-term health outcomes related to cardio-metabolic risks, involves collecting data from 10,000 participants of diverse backgrounds over a five-year period. The study will utilize various tools and methods, including surveys, laboratory tests, clinical tests, and imaging techniques. Medical instruments used include ECGs, spirometers, oscillometers, Fibroscan, blood analyzers, dynamometers, body composition analyzers, stadiometers, and various skin probes. These instruments undergo regular calibration and maintenance to ensure data accuracy. Data is uploaded via secure methods, including direct uploads, manual entry, and data importation, with stringent quality assurance measures such as automated validation checks, manual reviews, and proper personnel training. To protect data security and privacy, participant information is de-identified and Base64 encoded, with access restricted to authorized personnel. The data is stored on the in-house server established at CSIR-IGIB, where the PI-CHeCK portal is hosted. Data backup is managed by a cron job that stores copies on the same server and on a remote FTP (File Transfer Protocol) server at regular intervals. Researchers need access to participants’ data stored in a centralized database for their studies. However, access to this data will be controlled to ensure patient privacy and compliance with regulations like HIPAA (Health Insurance Portability and Accountability Act).
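The de-identification and Base64 encoding step can be sketched as follows. This is a minimal illustration under stated assumptions: the field names and the set of direct identifiers are hypothetical, and the actual portal logic is not described in this protocol. Base64 is a reversible encoding rather than encryption, so in this sketch it serves only as a transport and storage representation alongside the access restrictions described above.

```python
import base64
import json

# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "phone", "email"}


def deidentify(record: dict) -> dict:
    """Drop direct identifiers, keeping the anonymized sample ID and measurements."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def encode_record(record: dict) -> str:
    """Serialize a record to JSON and Base64-encode it as ASCII text."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return base64.b64encode(payload).decode("ascii")


def decode_record(encoded: str) -> dict:
    """Reverse of encode_record, showing that Base64 is fully reversible."""
    return json.loads(base64.b64decode(encoded))


# Hypothetical participant record.
raw = {"sample_id": "S001", "name": "Example Person",
       "phone": "0000000000", "bmi": 24.1}

encoded = encode_record(deidentify(raw))
print(encoded)
print(decode_record(encoded))
```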
Data security and privacy
Utmost care has been taken to maintain data privacy, i.e., data is secured at rest, in transit, and during editing, following HIPAA guidelines. Multi-omics, clinical, and lifestyle data collected from the different CSIR institutes is stored in a cohort database. The database can be queried using a secured Graphical User Interface (GUI) application. Individuals who provided samples are treated as ‘normal users’ of the application and are allowed only to view the submitted data. Power users with ‘Lab’ privileges can generate sample IDs (multiple samples for each ‘normal user’), add/link protocols, and upload data files (raw and processed) against anonymized sample IDs only.
A centralized data platform of about 500 terabytes (TB) and appropriate data security, including data protection and database development, were set up at CSIR-IGIB. CSIR-4PI has been recognized as a data recovery site. To enable collaboration amongst the participating CSIR labs, curated de-identified data will be made available for in-depth analysis.
Each sample was assigned a unique identifier during sample registration. The unique ID was then printed as a small sticker and pasted on the collection tube. Users with ‘Administrator’ privileges can upload ‘Announcements’, ‘News’, etc. through a content management system (CMS) included in the application. The GUI includes static pages for a broad project description with a gallery view (image grid with auto-slider) (https://csircohort.org/).
Return of the results
Results of scanning and phlebotomy were provided to the participants within 72 hours. Participants were able to access the portal only through the OTP-enabled login.
Data Sharing
Data Accessibility
Data accessibility refers to the ability of authorized users or systems to access specific data. It involves establishing protocols and permissions to ensure that only authorized users can access the necessary data.
Researchers need access to participants’ data stored in a centralized location for the purpose of the research. However, access to this data must be controlled to ensure patient privacy and compliance with regulations. Only authorized researchers with approved projects and predefined objectives should be able to access the data. Technical controls, such as role-based access control (RBAC) and encryption, will be implemented to prevent unauthorized access, and audit logs will be maintained to track data access and modifications. Requesters must fill out the data access request form, which the data access committee will review. The committee will assess whether the objective of the requested data access overlaps with existing research objectives. Once approved, the requester will be granted data access for a defined period. The requester must not share the data with anyone who is not mentioned in the request form.
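The combination of role-based permissions, time-limited access, and audit logging described above can be sketched as follows. The ‘normal user’, ‘Lab’, and ‘Administrator’ roles follow the earlier description of the portal; the ‘researcher’ role, the permission names, the user names, and the dates are hypothetical and introduced only for illustration.

```python
from datetime import date
from typing import Optional

# Hypothetical mapping of roles to permissions.
ROLE_PERMISSIONS = {
    "normal_user": {"view_own_data"},
    "lab": {"view_own_data", "generate_sample_id", "upload_data"},
    "administrator": {"view_own_data", "post_announcement"},
    "researcher": {"read_deidentified_data"},
}

# Each entry: (date, user, requested permission, whether access was granted).
audit_log: list[tuple[str, str, str, bool]] = []


def is_allowed(user: str, role: str, permission: str, today: date,
               access_expires: Optional[date] = None) -> bool:
    """Grant access only if the role holds the permission and the approval has not expired."""
    granted = permission in ROLE_PERMISSIONS.get(role, set())
    if granted and access_expires is not None:
        granted = today <= access_expires
    # Every decision, granted or denied, is recorded for later review.
    audit_log.append((today.isoformat(), user, permission, granted))
    return granted


# A researcher approved until 30 June 2025 (hypothetical dates).
print(is_allowed("researcher_1", "researcher", "read_deidentified_data",
                 date(2025, 1, 15), date(2025, 6, 30)))
print(is_allowed("researcher_1", "researcher", "read_deidentified_data",
                 date(2025, 7, 1), date(2025, 6, 30)))
print(audit_log)
```

Logging denied requests as well as granted ones keeps the audit trail useful for detecting attempts to use expired or unauthorized access.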
Secure data sharing using SSH File Transfer Protocol (SFTP) is essential for maintaining the confidentiality, integrity, and availability of sensitive research data and for ensuring compliance with ethical and regulatory standards. SFTP provides a secure alternative to traditional methods such as email and FTP by encrypting data during transmission and implementing robust authentication mechanisms. Our protocol includes configuring the SFTP server, establishing user groups with appropriate permissions, and setting up secure directories, with additional measures like VPN configuration and tools such as FileZilla or command-line interfaces to enhance security and efficiency. Data requests must be submitted via email, detailing the intended use, and upon committee approval, requesters receive login credentials and a user guide. An MD5 checksum file is provided for each data directory or file, and users are required to compare MD5 checksums to verify the completeness and integrity of the downloaded data, adding a crucial layer of verification to the process.
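The checksum comparison can be performed with standard tools; the following minimal sketch uses Python's hashlib. The file name and contents are hypothetical, and the layout of the checksum file supplied with the data is not specified here, so the expected hash is simply passed in as a string.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_md5: str) -> bool:
    """Return True if the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demonstration with a temporary file standing in for a downloaded data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example_data.csv"
    data_file.write_bytes(b"sample_id,value\nS001,1.0\n")
    published = hashlib.md5(b"sample_id,value\nS001,1.0\n").hexdigest()
    ok = verify_download(data_file, published)
    tampered = verify_download(data_file, "0" * 32)

print(ok, tampered)
```

A mismatch indicates an incomplete or corrupted transfer, in which case the file should be downloaded again before use.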
Data backup and mirroring
Data backup will be done regularly on different physical machines. Data mirroring efforts will be made at different locations, preferably other CSIR labs located at a location different from the primary data storage site.
Development of the CMD risk scores
The Framingham Risk Score (FRS) is a widely used tool for estimating the 10-year cardiovascular disease (CVD) risk based on traditional risk factors like treated or non-treated hypertension, smoking, high-density lipoprotein, total cholesterol, and fasting blood sugar. However, its application across diverse populations often necessitates recalibration for demographic, environmental, and genetic variations that may influence CVD risk in specific cohorts. For PI-CHeCK, recalibration of the FRS is essential to enhance the accuracy of risk prediction models and ensure they reflect the unique characteristics of the population under study. This involves a systematic strategy that begins with validating the FRS on the in-house dataset, followed by adjusting the survival coefficients as per WHO’s reported mortality rate for CVD in both genders and calculating the new reference score that captures the specific risk profile of the population using the mean of log values. Such recalibration improves the predictive power of the FRS and supports more personalized and effective preventive strategies tailored to the population’s needs.
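The recalibration steps above can be sketched using the general form of a Cox-type risk equation, risk = 1 - S0^exp(LP - LP_mean), where LP is the weighted sum of log-transformed risk factors, LP_mean is the same sum evaluated at the cohort means of the log values, and S0 is the baseline survival. This is an illustrative sketch under that assumption, not the study's final model: the coefficients, baseline survival, and cohort means below are hypothetical placeholders, not published FRS or WHO values.

```python
import math


def linear_predictor(coefs: dict[str, float], log_values: dict[str, float]) -> float:
    """Weighted sum of log-transformed risk factors."""
    return sum(coefs[name] * log_values[name] for name in coefs)


def recalibrated_risk(coefs: dict[str, float],
                      log_values: dict[str, float],
                      cohort_log_means: dict[str, float],
                      baseline_survival: float) -> float:
    """10-year risk = 1 - S0 ** exp(LP - LP_mean), with S0 and means from the target population."""
    lp = linear_predictor(coefs, log_values)
    lp_mean = linear_predictor(coefs, cohort_log_means)
    return 1.0 - baseline_survival ** math.exp(lp - lp_mean)


# Hypothetical coefficients and population-specific inputs (placeholders only).
coefs = {"ln_age": 3.0, "ln_total_chol": 1.1, "ln_hdl": -0.9}
cohort_log_means = {"ln_age": math.log(48), "ln_total_chol": math.log(190),
                    "ln_hdl": math.log(42)}
baseline_survival = 0.95  # would be adjusted to the reported CVD mortality

participant = {"ln_age": math.log(60), "ln_total_chol": math.log(240),
               "ln_hdl": math.log(35)}

print(recalibrated_risk(coefs, participant, cohort_log_means, baseline_survival))
```

A participant whose risk factors equal the cohort means receives a risk of exactly 1 - S0, which is why both the baseline survival and the reference means need to reflect the population under study.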
In addition to data visualization, we plan to analyze linear and nonlinear correlations between parameters while adjusting for age and gender as covariates, use open source tools to identify patterns within the data, develop predictive models using ML/AI algorithms to predict certain phenotypic conditions, and create an integrative network of various genomics, proteomics, and metabolomics parameters to understand systemic regulation (Figure 4).
The similarity between cohort samples and an index patient can be quantified through clustering and Euclidean/Mahalanobis distance-based methods. These similarity scores can then be used to train personalized models for risk score assessment.
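A minimal sketch of distance-based patient similarity is given below. It standardizes each feature and then uses Euclidean distance, which corresponds to a Mahalanobis distance under the simplifying assumption of uncorrelated features; a full Mahalanobis distance would additionally use the feature covariance matrix. The participant IDs and feature values (age, BMI, HbA1c) are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(cohort: dict[str, list[float]]) -> dict[str, list[float]]:
    """Z-score each feature across the cohort so that no single unit dominates the distance."""
    columns = list(zip(*cohort.values()))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard against constant features
    return {pid: [(v - m) / s for v, m, s in zip(row, means, sds)]
            for pid, row in cohort.items()}


def nearest_participants(cohort: dict[str, list[float]], index_id: str, k: int) -> list[str]:
    """Return the k participants closest to the index patient in standardized feature space."""
    z = standardize(cohort)
    others = [pid for pid in z if pid != index_id]
    others.sort(key=lambda pid: math.dist(z[index_id], z[pid]))
    return others[:k]


# Hypothetical cohort: [age (years), BMI (kg/m^2), HbA1c (%)].
cohort = {
    "P1": [45, 24.0, 5.4],
    "P2": [47, 25.0, 5.6],
    "P3": [62, 31.0, 7.9],
    "P4": [30, 21.0, 5.0],
}

print(nearest_participants(cohort, "P1", k=2))
```

The resulting subset of nearest participants is the kind of neighbourhood on which a personalized model for the index patient could then be trained.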
Using collaborative filtering techniques, individuals with similar phenotypes can be grouped together. Exploiting similarities among cohort participants will help identify a subset proximal to an index patient, which will then be used to train a personalized model in a disease-specific manner. The approach for the personalized model will depend on the type of disease and the ground truth label criteria. Recent studies have used various algorithms, ranging from recurrent neural networks to ensemble techniques, as personalized models for disease state prediction (10, 11). Multi-view datasets will allow us to explore the quantitative association of collected risk factors with comorbidity- and disease-specific outcomes. The developed framework will be based on explainable AI, rendering the weight of each contributing factor to the predicted outcome.
Discussion
The incidence of metabolic and other non-communicable complex diseases is increasing at an alarming rate in our country. India is now often described as the world capital of cardiovascular diseases and diabetes. Thus, from the public health perspective, it is of prime importance to understand the mechanisms underlying the increased risk of these diseases in the Indian population and to develop new strategies for risk stratification, prevention, and management of these major diseases.
Large-scale longitudinal cohort studies are critical for investigating the cause of disease and establishing links between risk factors and health outcomes. For instance, the Framingham heart study, an ongoing study that began in 1948, has provided the key information on most of the currently well-accepted risk factors for heart disease, including the role of lipids, diet, exercise, hypertension, cigarette smoking, etc. It has led to the development of a 10-year cardiovascular risk prediction score popularly called the Framingham risk score (FRS) (43–47). However, the FRS has low value for Indian patients due to varied lifestyles and risk factors, necessitating the development of our own risk matrix and prediction scores based on long-term prospective data. Therefore, an indigenized longitudinal population study, consisting of phenotypic measurements at baseline and long-term follow-up, is essential to understand the complex interrelationships between environmental, genetic, and other molecular factors with disease risk in our country.
Over the last few decades, several studies were initiated in India to identify diagnostic and prognostic markers. Most were based on localized case-control designs in which only a small number of biochemical markers were considered, and they were powered only to look at the association of these markers with the disease. Some of the larger studies were cross-sectional, so critical time trajectories of diseases could not be analyzed (48–52). Given these limitations, drawing conclusions from these studies at a pan-India level is difficult. In India, CSIR, with its more than forty constituent laboratories and centers spread all over the country, represents a wide range of ethnicity subclasses, geo-social habitats, and occupational exposures and is ideally placed to overcome such shortcomings. This was demonstrated during the recent COVID-19 pandemic, when a longitudinal cohort helped determine infection dynamics, antibody response dynamics, progression toward herd immunity, and the likelihood of significant outbreaks in a population. The study helped obtain sentinel data about the spread pattern and different characteristics of SARS-CoV-2 infection across India. The multi-centric Indian cohort showed that a large number of Indians, probably exceeding a hundred million, had contracted asymptomatic SARS-CoV-2 infection by September 2020 (53).
Further, advances in molecular techniques in genomics, transcriptomics, proteomics, metabolomics, and metagenomics in the last decade have allowed us to comprehensively understand several facets of human biology. CSIR, through the Phenome India project, intends to leverage its strength in these advanced techniques, in combination with the prospective collection of biological samples from a diverse population amenable to long-term follow-up, to fill the current gaps in establishing causation and developing diagnostic and prognostic biomarkers for chronic non-communicable diseases in the country. Establishing such a cohort with longitudinal sampling of biological and physiological parameters will align with the framework of UN-SDG and the National Health Mission. It will facilitate the development of national reference standards to aid clinical decision-making and national healthcare policy decisions.
As described above, the extensive phenotyping of individuals makes this cohort extremely valuable and unique. To mention a few, tests like Fibroscan, oscillometry, gut microbiome profiling, and skin barrier assessment are probably being performed for the first time in the country in a population of this size and diversity. Prospective collection and analysis of fecal microbiomes adds another unexplored dimension to cardio-metabolic disease risk. Evaluating the metal content in serum and plasma samples of the participants, and comparing it with the information available regarding these metals in various geographies and environments, is expected to illuminate their complex genetic, dietary, and environmental regulation in addition to their impact on metabolic diseases. This deep phenotyping is also expected to unravel several novel genotypic and phenotypic associations in the Indian population.
To conclude, this multicenter, multi-omic prospective cohort study is an essential step towards precision medicine, especially in the context of the Indian population. This study is expected to generate normative data for several parameters that will impact health care in India in the near future. Such comprehensive phenotyping is being attempted for the first time in the country. Coupled with data analytics, AI/ML approaches will hopefully lead to the development of reliable risk prediction tools for cardio-metabolic disorders.
Ethics
The study was approved by the IHEC at IGIB (reference no.: CSIR-IGIB/IHEC/2023-24/16) and at CCMB (IEC95-R1/2023). The study was registered with CTRI under number CTRI/2024/01/061807.
Funding
The study is funded by CSIR through grant HCP47.
Author Contributions
Conflicts of Interest
All the authors declare no conflict of interest.
Data Availability
Anonymized data for public use may be made available 3 years after completion of the baseline phase of the study, or as per the advisory of the Monitoring Committee of the project, including any revisions thereof.
Supplementary Information
Acknowledgments
We acknowledge the support from CSIR, prospective participants, and volunteers to get this project started.
List of Abbreviations
- AC: Abdominal Circumference
- ADA: American Diabetes Association
- AEC: Absolute Eosinophil Count
- Ag: Silver
- AI: Artificial Intelligence
- Al: Aluminium
- ALC: Absolute Lymphocyte Count
- ALP: Alkaline Phosphatase
- ALT/SGPT: Alanine Aminotransferase
- ANC: Absolute Neutrophil Count
- API: Application Programming Interface
- As: Arsenic
- AST/SGOT: Aspartate Aminotransferase
- ATS: American Thoracic Society
- Au: Gold
- Ba: Barium
- BCA: Body Composition Analysis
- Be: Beryllium
- Bi: Bismuth
- BMI: Body Mass Index
- BNP: Brain Natriuretic Peptide
- BUN: Blood Urea Nitrogen
- CAP: Controlled Attenuation Parameter
- CC: Chest Circumference
- CCMB: Centre for Cellular and Molecular Biology
- Cd: Cadmium
- cDNA: Complementary Deoxyribonucleic Acid
- CMS: Content Management System
- Co: Cobalt
- Cr: Chromium
- Cs: Caesium
- CSIR: Council of Scientific and Industrial Research
- CSS: Cascading Style Sheets
- Cu: Copper
- CVD: Cardiovascular Disease
- DNA: Deoxyribonucleic Acid
- ECG: Electrocardiogram
- ERS: European Respiratory Society
- ESR: Erythrocyte Sedimentation Rate
- FBS: Fasting Blood Sugar
- Fe: Iron
- FRS: Framingham Risk Score
- FTP: File Transfer Protocol
- GFR: Glomerular Filtration Rate
- GGT: Gamma Glutamyl Transferase
- GSA: Global Screening Array
- GUI: Graphical User Interface
- HALE: Healthy Life Expectancy
- HbA1c: Glycosylated Haemoglobin
- HBC: High Blood Cholesterol
- HBP: High Blood Pressure
- HDL: High Density Lipoprotein
- Hg: Mercury
- HIPAA: Health Insurance Portability and Accountability Act
- HP: Hip Circumference
- HRV: Heart Rate Variability
- I: Iodine
- ICDS: Integrated Child Development Scheme
- ICP-MS: Inductively Coupled Plasma Mass Spectrometry
- ICP-OES: Inductively Coupled Plasma Optical Emission Spectroscopy
- IGIB: Institute of Genomics and Integrative Biology
- IICB: Indian Institute of Chemical Biology
- LC/MS: Liquid Chromatography-Mass Spectrometry
- LDL: Low Density Lipoprotein
- Li: Lithium
- LMIC: Low- and Middle-Income Countries
- LSM: Liver Stiffness Measurement
- MCH: Mean Corpuscular Haemoglobin
- MCHC: Mean Corpuscular Haemoglobin Concentration
- MCV: Mean Corpuscular Volume
- MD5: Message-Digest Algorithm 5
- Mg: Magnesium
- miRNA: Micro Ribonucleic Acid
- ML: Machine Learning
- Mn: Manganese
- Mo: Molybdenum
- MPV: Mean Platelet Volume
- MRM: Multiple Reaction Monitoring
- NAFLD: Non-Alcoholic Fatty Liver Disease
- NASH: Non-Alcoholic Steatohepatitis
- NCDs: Non-Communicable Diseases
- NCL: National Chemical Laboratory
- NDD: new diagnostic design
- Ni: Nickel
- OTP: One Time Password
- Pb: Lead
- PCR: Polymerase Chain Reaction
- PCV: Packed Cell Volume
- Pd: Palladium
- PHP: Hypertext Preprocessor
- PI-CHeCK: Phenome India-CSIR Health Cohort Knowledgebase
- PLT: Platelet
- PSA: Prostate-Specific Antigen
- Pt: Platinum
- RBAC: Role-Based Access Control
- RBC: Red Blood Cell
- RDW - CV: Red Cell Distribution Width - Coefficient of Variation
- RDW - SD: Red Cell Distribution Width - Standard Deviation
- RDWI: Red Cell Distribution Width Index
- RNA: Ribonucleic Acid
- rRNA: Ribosomal Ribonucleic Acid
- SARS-CoV-2: Severe Acute Respiratory Syndrome Coronavirus 2
- Sb: Antimony
- SBS: Sequence By Synthesis
- Se: Selenium
- SFTP: Secure Shell File Transfer Protocol
- Sn: Tin
- SNP: Single Nucleotide Polymorphisms
- SQL: Structured Query Language
- Sr: Strontium
- SSH: Secure Shell
- TB: Terabyte
- TE: Transient Elastography
- Ti: Titanium
- TIBC: Total Iron Binding Capacity
- TLC: Total Leucocyte Count
- TSH: Thyroid Stimulating Hormone
- UIBC: Unsaturated Iron-Binding Capacity
- UN-SDG: United Nations Sustainable Development Goals
- UT: Union Territories
- V: Vanadium
- VLDL: Very Low-Density Lipoprotein
- VPN: Virtual Private Network
- VRS: Voluntary Retirement Scheme
- WC: Waist Circumference
- WHO: World Health Organization
- Zn: Zinc