ABSTRACT
Each piece of cell-free DNA (cfDNA) has a length determined by the exact metabolic conditions in the cell it belonged to at the time of cell death. The changes in cellular regulation leading to a variety of patterns, which are based on the different number of fragments with lengths up to several hundred base pairs at each of the almost three billion genomic positions, allow for the detection of disease and also the precise identification of the tissue of their origin.
A Kullback-Leibler (KL) divergence computation identifies different fragment lengths and areas of the human genome, depending on the stage, for which disease samples, starting from pre-clinical disease stages, diverge from healthy donor samples. We provide examples of genes related to colorectal cancer (CRC), which our algorithm detected to belong to divergent genomic bins. The staging of CRC can be viewed as a Markov Chain and that provides a framework for studying disease progression and the types of epigenetic changes occurring longitudinally at each stage, which might aid the correct classification of a new hospital sample.
In a new look to treat such data as grayscale value images, pattern recognition using artificial intelligence (AI) could be one approach to classification. In CRC, Stage I disease does not, for the most part, shed any tumor circulation, making detection difficult for established machine learning (ML) methods. This leads to the deduction that early detection, where we can only rely on changes in the metabolic patterns, can be accomplished when the information is considered in its entirety, for example by applying computer vision methods.
INTRODUCTION
CRC is the second most common cancer-related cause of death worldwide (1). Approximately two-thirds of newly-diagnosed patients present with a curable disease (2), but despite curatively intended treatment, up to 40% of them experience relapsed disease after an initial treatment (3). In addition, about 86% of Stage I disease do not shed tumor in circulation, which leads to a decreased ability for early detection (4). A minimally-invasive analysis based on circulating cfDNA has emerged as a promising nucleic acids biomarker (5), but this method assumes very homogenous characteristics of the healthy population, which makes it hard to gain regulatory approval. Here, we attempt to consider the processes underlying the evolution of CRC as an electronics communication system and propose to visualize the cfDNA samples as images.
COLORECTAL CANCER AS AN IMAGE
In CRC, patients go through a complex diagnostic paradigm. While healthy, the donors could be characterized as having different stages of co-morbidities. Next, pre-cancerous polyps might lead to adenomas of the colon or the rectum, which also have different stages. About 45% of the patients with advanced adenoma, and oftentimes people with low-grade adenoma or no adenoma at all, develop colorectal neoplasm, which has four stages. The extensive length of the colon makes tissue biopsies and surgical interventions uncertain procedures because of the sheer length of the organ. The physiology of the colon is particular and it is conceivable that changes, reported by the changes in the fragmentation patterns, in the local cellular regulation could offer information on the exact location of early disease. In disease, the variety of cell death fragmentation patterns reflects differential nucleosome packaging, chromatin remodeling and accessibility (6). DNA hypomethylation or the loss of a histone in a nucleosome, for instance, lead to a skewed fragment length distribution (Fig.1A and Fig. 1B). In contrast, in healthy donor samples (Fig. 1C) the distribution is symmetric, centered around 168 base pairs (bp). One can observe the consistent number of fragments across the genome for the healthy donor sample (Fig. 1C), which is in stark contrast to the copy-number variation seen as bright horizontal streaks in the two Stage IV samples (Fig. 1A and Fig. 1B). The difference image between a healthy donor sample and a Stage I sample highlights the differences (Fig. 1D) present in cellular regulation and cell death in Stage I CRC (Fig. 1E). This holds true for low- and high-grade adenoma too.
DNA FRAGMENTATION
A way to classify samples is to compute the relative entropy for each fragmentation length by building a probability density function based on the fragments from all genomic positions. Such distributions consist of about 40 million fragments for each whole genome sequencing sample, which can be binned in genomic regions of, for instance, five mega bases (MB). In this scenario, for each fragment length, we could compare histograms of about 45,000 values from each region of the 23 chromosomes (Fig. 2). A KL divergence computation based on an adaptive minimax rate-optimal estimator (7) of the changes in disease from healthy state(s) to precancerous lesion(s) to malignant tumor(s) can be presented as a heterogeneous directed Markov Chain with absorbing states (8). Considering this Markov Chain as a degradeddegraded broadcast channel, considerations regarding the capacity region (9) and its estimation for a cohort of CRC samples (10) could aid the classification of a new patient sample in the clinic. To build the boundary of the capacity region for each stage, we assumed that the two peaks in the DNA fragmentation length KL divergence histogram (similar to Fig. 2, but for all cohorts, including the adenoma cohorts) provide the (X, Y) coordinates for which the boundary intersects with the X-axis (divergence value for the first peak, 0) and the Y-axis (0, divergence value for second peak) (boundaries of the capacity regions not shown). As different boundaries of the capacity regions (11) are defined by the probability density function of the cohort of each disease stage, a new sample will be classified according to the boundary it falls within. A new sample for an already existing patient which has become an outlier for the disease stage it has been assigned, and falls outside the boundary (12), will be classified in the next (or more advanced) disease stage. Oppositely, after a surgical resection or drug treatment, when tumors are removed or recede, and a new sample is collected and classified after an intervention, if it falls within an earlier disease stage boundary of the capacity region, it will be reclassified as a lower burden disease or an earlier disease stage, according to the boundary it falls within.
The KL divergences from different cancer cohorts of healthy donor samples exhibit distributions, which are multi-modal, with different peaks being present for different cancers, defining potentially unique signatures. The two highest peaks on Fig. 3, for 364 base bp and 198 bp, are the result of significant differences in the median number of fragments in the genomic bins for these fragment lengths. While for healthy samples we measured that most genomic bins consist of about 20 fragments with length 364 bp (see the peak in the green histogram on Fig. 3), the CRC samples exhibit a very different distribution in which most genomic bins consist of less than 10 fragments with length 364 bp (see the peak in the pink histogram on Fig. 3).
The distinct per-disease peaks in the divergence from healthy samples (Tab. 1, col. 2) and the divergence of between-cancers signatures (Tab. 1, col. 3) are the result of differential epigenetic regulation of the different cancer types and can be used as diagnostic biomarkers for the detection of disease and identification of the tissue of origin. We measured the divergence in fragment lengths also between the different stages of CRC (Stage I – IV), including between cohorts of samples of pre-cancerous polyps (not shown). Our approach to the identification of discriminative biomarkers in disease does not require any user-defined input metrics and it is data-driven. Further divergence analysis of the genomic bins only (for the most divergent fragment lengths -the peaks in the histogram on Fig. 2) demonstrated the ability of the method to pinpoint the areas of the human genome involved in pathogenesis and drug resistance/susceptibility.
HIERARCHICAL CLUSTERING
Hierarchical clustering of patient and healthy samples based on fragments with length 364 bp resulted in a few false positive (79.3% specificity), but, importantly for the detection of sub-clinical disease, no false negative (Fig. 4) (100% sensitivity). It suggested that the depletion of fragments from di-nucleosome-protected DNA in genomic regions associated with disabled antioxidant program (13) in samples from healthy donors might be indicating pathogenesis and early CRC (Fig. 4). This poses the question whether the three healthy donors from the DELFI (5) study associated in our analysis with CRC have since the time of blood draw been diagnosed with CRC.
One of the CRC-related genes, AXIN2, mutations in which have been associated with mismatch repair errors (14), falls within one of the most divergent genomic bins (Fig. 4). It has also been shown to interact with Glycogen synthase kinase-3β (15), which regulates microtubules in migrating cells (16). Two other genomic bins, which are the most divergent (Fig. 4) from the healthy cohort in CRC are those containing RAP2B (17) and GPX3 (18). These two genes are involved in the detoxification (reduction by GPX3, produced mainly in the kidneys (19)) of soluble reactive oxygen species (20). Interestingly, this new result indicates the presence of a divergence in the metabolic/redox patterns in CRC.
The clustering algorithm separates the samples into healthy donors (upper half of Fig. 4) and patient samples (lower half of Fig. 4). Remarkably, there are no patient samples grouped with the healthy donors, which indicates no false positive selections. There are, however, three false negative selections (labeled with FN in red color on Fig. 4) within the patient cluster (PGDX labels on Fig. 4). This result is promising in the context of the detection of sub-clinical and early disease; it could be further verified whether these three donor have developed CRC since the time of the blood draw.
It is conceivable that predictions regarding disease progression can be achieved using generative methods. Generative methods that produce novel samples from high-dimensional data distributions are finding widespread use, for example in speech synthesis. Generative adversarial network (GAN) models (21) consist of two CNNs, style-based generator and discriminator, which converge upon reaching Nash equilibrium and can be used to generate synthetic samples. Such sampler posterior distributions can be then used to reduce the complexity and improve the numerical convergence of predicting disease progression for any new sample using the Bayes formula, given there is available longitudinal data and transitional states for several patients and healthy donors. This approach may allow attempting to induce changes in reversal (22) to the physiological deterioration occurring in disease, if detected early in a pre-clinical stage.
FERROPTOSIS AND ANGIOGENESIS
A possible mechanism leading to the differential fragmentation is the increase of ferroptotic cell death in CRC. The decrease of DNA fragments with lengths 198 bp, 364 bp, and to a lesser extend 521 bp (resulting in the three divergence peaks on Fig. 2) could be due to a partial switch from apoptosis in normal physiology to ferroptosis in CRC – thus suppressing the three apoptotic peaks of fragments from mono-(147+2×26, histones plus linkers), di-(2×147+3×23, histones plus linkers), and tri-(3×147+3×27, histones plus linkers) nucleosome-protected DNA (Fig. 2). The reason behind our new results is that during apoptosis the cleavage of nuclear DNA results into fragments with length proportional to nucleosome size and resulting in patterned fragmentation. In somatic tissues, the apoptotic cleavage of DNA results in fragments of about 195 bp in lengths and multiples thereof (23). Ferroptosis, however, is characterized by non-patterned DNA fragmentation and not characterized by caspase-dependent cleavage and, thus, our analysis results demonstrate that there is a generation of less fragments with lengths 198 bp, 364 bp, and 521 bp.
Ferroptosis has a dual role in cancer. It plays a role in tumor initiation, tumorigenesis, which depends on inflammation-associated immunosuppression triggered by ferroptotic damage (24) and later, during treatment, in tumor suppression (25). Erastin, for instance, was discovered during a synthetic lethality screen with oncogenic RAS cells (26). It lowers cysteine and, thus, the cells stop the synthesis of antioxidants/glutathione, and this activates voltage-dependent anion channels (VDAC) by reversing tubulin’s inhibition on VDAC2/3 (27). VDAC is a mitochondrial protein, which is a novel target for anti-cancer drugs. Our analysis shows that for the genomic bin where VDAC2 falls has an increase in the number of fragments with length 198 bp in CRC (not shown), which offers a quantitative strategy for drug selection (28).
Within the fragments with length 198 bp, we also measured a divergence in CRC in the genomic bin in which falls TCF7L2 (not shown), which has been implicated in promoting migration and invasive behavior of human CRC cells (29). Interestingly, when we analyze the data for CRC Stage IV only (without Stages I-III), an additional peak appears on the divergence histogram from healthy donor samples at 129 bp (not shown). Within the fragments with length 129 bp in Stage IV, for a few of the patients, there are about three-fold more fragments than the average in healthy samples in the genomic bin which covers the area of the human genome containing VEGFC. This gene has been associated with disseminated epithelial tumor cells to regional lymph nodes (30). Thus, it can serve as an endothelial marker likely correlating with angiogenesis due to metastasis or minimal residue disease after a curative-intent surgery and can also offer a strategy for the selection of combination therapy using zaltrap or eylea (31).
DISCRETE BROADCAST CHANNEL
Visible in the difference image on Fig. 1C is the appearance of periodic vertical streaks around 90 to 150 bp. Visible are also horizontal streaks, likely resulting from errors in gene copy-number amplification in CRC. These examples demonstrate the patterns appearing in disease in comparison to normal physiology. Such grayscale images, in the thousands, can be utilized to train a generative transformer (32), or another large language model. This classification approach will ensure all epigenetic changes occurring in disease have been taken into account.
One possible avenue for classification is to derive the maximum likelihood estimate of the parameters of Markov Chain Monte Carlo sampling for time series prediction and supervised Bayesian learning (33). Further, let the appearance of co-morbidities in healthy donor samples be a stationary Markov Chain and CRC Stages I-IV denote its noisy version as a Hidden Markov Process (HMP), when corrupted by a discrete memoryless channel (DMC) (34), with channel capacity equal to the maximum of the KL divergence. The DMC (35) is completely characterized by the channel transition matrix, also known as the confusion matrix (36). Consider the HMP given by a binary symmetric channel with corrupted symmetric binary Markov source. One can approximate the entropy rate of a HMP via approximations of the stationary distribution of a related Markov process with high precision-complexity trade-off (37). Therefore, KL divergence can serve as a prognostic biomarker (38) based on longitudinal data prior to diagnosis, i.e., samples collected periodically from the same healthy donor earlier in life.
If we consider the transformations occurring in the DNA fragmentation patterns within a cohort of healthy samples so that all become alike to disease samples, we can view pathogenesis and disease progression of a population as a broadcast channel with memory and present it in terms of Gaussian multi-user parallel broadcast channels with identical code words (39). The achievable rate for the capacity of a degraded broadcast channel (in bits) is a function of the logarithm of the signal-to-noise ratio of the transmission signal and depends on the quality of the transmission medium. Next, the achievable rates for the capacity region of a family of parallel broadcast channels is given by the union of the overall capacity in each channel (40). Hence, any divergence in the fragmentation patterning within the healthy cohort (baseline dataset or parallel broadcast channels) would create an elevated noise floor and, thus, an overall unsolvable stochastic heterogeneity.
Each time a new patient sample is presented for classification, it would not be compared to samples collected from the same person longitudinally, in which most of the fragmentation pattern would be very similar to the previously collected healthy samples. Instead, it would be compared to a variety of fragmentation patterns in a whole cohort of (different people) healthy donors. Such comparisons will always generate noisy baseline datasets, even in the case of a very large healthy donor population, because a very few outliers will affect the overall quality of the healthy baseline dataset. For this reason, such population approach, because of the intrinsic variability in human, inherently impedes the ability of all currently utilized methods to detect disease in its early stage.
CONCLUSIONS
In the specific case of CRC, this conclusion means that any non-neoplastic gastrointestinal (GI) metabolic divergence in samples from the healthy donor cohort considered as baseline during classification would impair the ability of the traditional ML classifiers to reliably detect neoplastic transformation (41). Oppositely, an image-based classification, as we propose here, could be in a position and better equipped to successfully detect and delineate both the fragmentation patterns resulting from tumors and those resulting from non-lethal, transient conditions, such as inflammation (42). If this holds true, it would impact, besides CRC diagnosis, our ability to correctly diagnose other cancers of the GI tract as well.
In the long run, the most reliable way to perform early detection will be to personalize the process by aggregating longitudinal baseline datasets for each individual. We have attempted to do that in an effort of detecting lung cancer in urine samples (not shown). Analysis of microRNA profiles extracted from the urine of healthy donors longitudinally indicates a relative consistency in their levels. This suggests that tracking the levels of nucleic acids in body fluids longitudinally may allow for the delineation of organ- and tissue-specific patterns of changes in dedicated panels of disease-associated biomarkers and, thus, anticipate early disease and inform therapy.
MATERIALS AND METHODS
Data Processing
Whole genome sequencing DELFI raw data from (5) was processed to extract and bin all available DNA fragments using: https://github.com/Hogfeldt/ctDNAtool.
Data Analysis
All analysis programs for fragmentomics analysis and graphical/image representation of the results were developed in R and Python. The KL divergence method used is described and validated in (7). The computer code is available for download at: https://github.com/amatov/FragmentomicsSubclinicalDisease.
ACKNOWEDGEMENTS
I thank Claus Andersen for genome data and his feedback on the shortcomings of the current whole genome sequencing analysis methods.
Footnotes
Corrected a minor definition in the manuscript.