Abstract
Single-cell RNA-seq enables the quantitative characterization of cell types based on global transcriptome profiles. We present single-cell consensus clustering (SC3), a user-friendly tool for unsupervised clustering, which achieves high accuracy and robustness by combining multiple clustering solutions through a consensus approach (http://bioconductor.org/packages/SC3). We demonstrate that SC3 is capable of identifying subclones from the transcriptomes of neoplastic cells collected from patients.
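The consensus idea in the abstract can be illustrated with a small sketch. This is not the SC3 implementation; it only shows one common way to combine several clustering solutions, by recording for each pair of cells the fraction of runs in which they share a cluster. The function name `consensus_matrix` and the three toy runs are hypothetical.

```python
def consensus_matrix(solutions):
    """Return the fraction of solutions in which each pair of cells co-clusters.

    `solutions` is a list of label lists, one per clustering run, each with
    one label per cell. Label names need not agree between runs.
    """
    n_cells = len(solutions[0])
    counts = [[0] * n_cells for _ in range(n_cells)]
    for labels in solutions:
        if len(labels) != n_cells:
            raise ValueError("every solution must label every cell")
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_runs = len(solutions)
    return [[count / n_runs for count in row] for row in counts]


# Three hypothetical clustering runs over four cells.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, different label names
    [0, 0, 0, 1],
]
consensus = consensus_matrix(runs)
```

Because only co-membership is counted, the first two runs contribute identically even though their label names differ.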
References
Grün, D. et al. Nature 525, 251–255 (2015).
Jaitin, D.A. et al. Science 343, 776–779 (2014).
Mahata, B. et al. Cell Rep. 7, 1130–1142 (2014).
Gentleman, R.C. et al. Genome Biol. 5, R80 (2004).
McCarthy, D.J., Campbell, K.R., Lun, A.T.L. & Wills, Q.F. Bioinformatics https://doi.org/10.1093/bioinformatics/btw777 (2017).
Biase, F.H., Cao, X. & Zhong, S. Genome Res. 24, 1787–1796 (2014).
Yan, L. et al. Nat. Struct. Mol. Biol. 20, 1131–1139 (2013).
Goolam, M. et al. Cell 165, 61–74 (2016).
Deng, Q., Ramsköld, D., Reinius, B. & Sandberg, R. Science 343, 193–196 (2014).
Pollen, A.A. et al. Nat. Biotechnol. 32, 1053–1058 (2014).
Kolodziejczyk, A.A. et al. Cell Stem Cell 17, 471–485 (2015).
Treutlein, B. et al. Nature 509, 371–375 (2014).
Ting, D.T. et al. Cell Rep. 8, 1905–1918 (2014).
Patel, A.P. et al. Science 344, 1396–1401 (2014).
Usoskin, D. et al. Nat. Neurosci. 18, 145–153 (2015).
Klein, A.M. et al. Cell 161, 1187–1201 (2015).
Zeisel, A. et al. Science 347, 1138–1142 (2015).
van der Maaten, L. & Hinton, G. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Zurauskiene, J. & Yau, C. BMC Bioinformatics http://doi.org/10.1186/s12859-016-0984-y (2016).
Xu, C. & Su, Z. Bioinformatics https://doi.org/10.1093/bioinformatics/btv088 (2015).
Guo, M., Wang, H., Potter, S.S., Whitsett, J.A. & Xu, Y. PLoS Comput. Biol. 11, e1004575 (2015).
Macosko, E.Z. et al. Cell 161, 1202–1214 (2015).
Jiang, L., Chen, H., Pinello, L. & Yuan, G.-C. Genome Biol. 17, 144 (2016).
Patterson, N., Price, A.L. & Reich, D. PLoS Genet. 2, e190 (2006).
Tracy, C.A. & Widom, H. Commun. Math. Phys. 159, 151–174 (1994).
Rousseeuw, P.J. J. Comput. Appl. Math. 20, 53–65 (1987).
Guo, G. et al. Dev. Cell 18, 675–685 (2010).
Boroviak, T. et al. Dev. Cell 35, 366–382 (2015).
Chen, E., Staudt, L.M. & Green, A.R. Immunity 36, 529–541 (2012).
Ortmann, C.A. et al. N. Engl. J. Med. 372, 601–612 (2015).
Nangalia, J. et al. N. Engl. J. Med. 369, 2391–2405 (2013).
Hartigan, J.A. & Wong, M.A. J. R. Stat. Soc. Ser. C Appl. Stat. 28, 100–108 (1979).
Strehl, A. & Ghosh, J. J. Mach. Learn. Res. 3, 583–617 (2003).
Hubert, L. & Arabie, P. J. Classif. 2, 193–218 (1985).
Ben-Hur, A., Horn, D., Siegelmann, H.T. & Vapnik, V. J. Mach. Learn. Res. 2, 125–137 (2001).
Hubert, M. & Debruyne, M. WIREs Comp Stat 2, 36–43 (2010).
Hubert, M., Rousseeuw, P.J. & Branden, K.V. Technometrics 47, 64–79 (2005).
Reimand, J. et al. Nucleic Acids Res. 44, W83–W89 (2016).
Goder, A. & Filkov, V. Consensus clustering algorithms: comparison and refinement. in Proceedings of the Meeting on Algorithm Engineering & Experiments 109–117 (Society for Industrial and Applied Mathematics, 2008).
Petzer, A.L., Zandstra, P.W., Piret, J.M. & Eaves, C.J. J. Exp. Med. 183, 2551–2558 (1996).
Picelli, S. et al. Nat. Protoc. 9, 171–181 (2014).
Andrews, S. FastQC: A quality control tool for high throughput sequence data. Reference Source (2010).
Bolger, A.M., Lohse, M. & Usadel, B. Bioinformatics 30, 2114–2120 (2014).
Trapnell, C., Pachter, L. & Salzberg, S.L. Bioinformatics 25, 1105–1111 (2009).
Love, M.I., Huber, W. & Anders, S. Genome Biol. 15, 550 (2014).
Risso, D., Ngai, J., Speed, T.P. & Dudoit, S. Nat. Biotechnol. 32, 896–902 (2014).
Ritchie, M.E. et al. Nucleic Acids Res. 43, e47 (2015).
Acknowledgements
We thank B. Vangelov, J.-C. Delvenne and R. Lambiotte for fruitful discussions and for their help with computational methods. We also thank D. Flores Santa Cruz, D. Dimitropolou and J. Grinfeld for technical assistance with experiments. We thank I. Vasquez-Garcia, D. Harmin, M. Kosicki, D. Ramsköld and M. Huch for comments on the manuscript. V.Y.K., T.A., A.Y. and M.H. are supported by Wellcome Trust Grants. K.N.N. is supported by the Wellcome Trust Strategic Award 'Single cell genomics of mouse gastrulation'. M.T.S. acknowledges support from FRS-FNRS; the Belgian Network DYSCO (Dynamical Systems, Control and Optimisation), funded by the Interuniversity Attraction Poles Programme initiated by the Belgian State Science Policy Office; and the ARC (Action de Recherche Concerte) on Mining and Optimization of Big Data Models, funded by the Wallonia-Brussels Federation. M.B. acknowledges support from EPSRC (grant EP/N014529/1). T.C. was funded through a core funded fellowship by the Sanger Institute and a Chancellor's fellowship from the University of Edinburgh. K.K. and A.R.G. are supported by Bloodwise (grant ref. 13003), the Wellcome Trust (grant ref. 104710/Z/14/Z), the Medical Research Council, the Kay Kendall Leukaemia Fund, the Cambridge NIHR Biomedical Research Center, the Cambridge Experimental Cancer Medicine Centre, the Leukemia and Lymphoma Society of America (grant ref. 07037) and a core support grant from the Wellcome Trust and MRC to the Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute. W.R. was supported by BBSRC (grant ref. BB/K010867/1), the Wellcome Trust (grant ref. 095645/Z/11/Z), EU BLUEPRINT and EpiGeneSys.
Author information
Contributions
M.H. conceived the study; V.Y.K., M.H., M.T.S., M.B., T.A. and A.Y. contributed to the computational framework; K.K. and T.C. performed the experiments for the patient data; K.N.N. helped with the analysis of embryonic mouse data; M.B., W.R., A.R.G. and M.H. supervised the research; and V.Y.K. and M.H. led the writing of the manuscript with input from the other authors.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Boxplots of 100 realizations of the SC3 clustering on the Biase, Yan and Goolam datasets.
For clarity, lines are drawn through the medians of the boxplots. The x-axis shows the number of eigenvectors d of the transformed distance matrix as a percentage of the total number of cells N in each dataset. The black vertical lines correspond to d = 4% of N and d = 7% of N. Dots represent outliers, that is, values above the highest (or below the lowest) value lying within 1.5 × IQR of the box, where IQR is the interquartile range, the distance between the first and third quartiles.
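The outlier rule used in these boxplots can be sketched as follows. This is an illustration only: quartile conventions differ between plotting tools, and the "inclusive" method of Python's `statistics.quantiles` is assumed here rather than taken from the paper. The function name `boxplot_outliers` and the ARI values are hypothetical.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying more than `whisker` * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # inter-quartile range
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical ARI values from repeated clustering runs.
ari_values = [0.80, 0.82, 0.84, 0.85, 0.86, 0.88, 0.90, 0.30]
outliers = boxplot_outliers(ari_values)
```

With these values the quartiles are 0.815 and 0.865, so only the run at 0.30 falls outside the fences and would be drawn as a dot.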
Supplementary Figure 2 Boxplots of 100 realizations of the SC3 clustering on the Deng, Pollen and Kolodziejczyk datasets.
For clarity, lines are drawn through the medians of the boxplots. The x-axis shows the number of eigenvectors d of the transformed distance matrix as a percentage of the total number of cells N in each dataset. The black vertical lines correspond to d = 4% of N and d = 7% of N. Dots represent outliers, that is, values above the highest (or below the lowest) value lying within 1.5 × IQR of the box, where IQR is the interquartile range, the distance between the first and third quartiles.
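To make the d = 4% to 7% of N interval concrete, the sketch below converts it into a range of eigenvector counts for a given number of cells. The rounding (down at the lower end, up at the upper end) is an assumption made for illustration, and `eigenvector_range` is a hypothetical helper, not part of SC3.

```python
def eigenvector_range(n_cells, low_pct=4, high_pct=7):
    """Return the eigenvector counts d spanning low_pct% to high_pct% of n_cells."""
    lowest = max(1, (low_pct * n_cells) // 100)  # round down, but keep d >= 1
    highest = -(-(high_pct * n_cells) // 100)    # integer ceiling division
    return range(lowest, highest + 1)


# For a dataset of N = 268 cells the interval covers d = 10, ..., 19.
d_values = list(eigenvector_range(268))
```

Integer arithmetic is used so that the bounds do not depend on floating-point rounding.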
Supplementary Figure 3 Exploration of SC3 pipeline parameters
(a) Histogram of the d values where ARI > 0.95 is achieved for the downsampled (by a factor of 10, see Methods) gold standard datasets from Fig. 1b. The black vertical lines indicate the interval d = 4-7% of the total number of cells N, showing high accuracy in the classification; (b) Histogram of the d values where ARI > 0.95 is achieved for the silver standard datasets from Fig. 1b. The black vertical lines indicate the interval d = 4-7% of the total number of cells N, showing high accuracy in the classification; (c) Exploration of the gene filter parameters (see Methods for more details). Dots represent individual clustering runs. Bars correspond to the median of the dots; (d) The effect of dropouts in the distance calculation step on the accuracy of SC3 clustering (see Methods for more details). Dots represent individual clustering runs. Bars correspond to the median of the dots. Red and grey colours correspond to clustering with and without dropouts. The black line corresponds to ARI = 0.8.
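The ARI thresholds quoted in this caption refer to the adjusted Rand index of Hubert & Arabie, which compares two partitions of the same cells and is corrected for chance agreement. A minimal standard-library sketch is given below; the function name and the toy label vectors are hypothetical, and the two degenerate cases are handled by a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention: no pairs to disagree on
    # Pairs of cells that share a cluster in both labelings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # convention: both labelings are trivial
    return (together - expected) / (maximum - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical partitions score 1 regardless of label names, while the crossed example scores below 0, which is possible because the index subtracts the agreement expected by chance.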
Supplementary Figure 4 Scalability, accuracy and rare cell-type detection rate of SC3 and benchmarking of the hybrid SC3
(a) Run times for different clustering methods as a function of the number of cells (N). All methods were run on a MacBook Pro (Mid 2014) running OS X Yosemite 10.10.5, with a 2.8 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 RAM. The two results shown for SC3 correspond to nstart = 1,000 and nstart = 50, where nstart is the number of starting points for k-means clustering; (b) Reducing the number of k-means runs (nstart) from 1,000 to 50 results in only slightly worse performance for SC3, yet with significant computational savings, as shown in (a). The black line indicates ARI = 0.8; (c) Using the hybrid SC3 based on reference labels provided by the authors. Same as Fig. 2c in the main text, but using the reference labels provided by the authors as inputs to the SVM. Dots represent outliers higher (lower) than the highest (lowest) value within 1.5 × IQR, where IQR is the interquartile range. The black line indicates ARI = 0.8; (d) Robustness of SC3 for the detection of rare cell-types. For two of the datasets, we remove different percentages of the cells in the rare cell-types. The figure shows the mean fraction of SC3 runs in which all the rare cells were clustered together as a function of the total number of cells in the rare cell-type; (e) Sensitivity of SC3 for identifying rare cell-types when the hybrid SC3 approach is used with 30% of cells to train the SVM. This figure was derived from (d) by correcting the mean fraction of times that the rare cells were located in the same cluster using the probability of drawing rare cells within the 30% of all cells (Methods).
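Panel (e) relies on the probability that rare cells end up in the 30% training sample. The exact correction is described in the Methods and is not reproduced here; the sketch below shows only one plausible ingredient, assumed for illustration: the hypergeometric probability that a random training subset contains at least one rare cell. The function name and the example numbers are hypothetical.

```python
from math import comb


def prob_rare_cell_sampled(n_cells, n_rare, train_fraction=0.3):
    """Probability that a random training subset holds at least one rare cell.

    Assumes the training cells are drawn uniformly without replacement, which
    is an assumption of this sketch rather than a statement about SC3.
    """
    n_train = round(train_fraction * n_cells)
    # Complement of drawing every training cell from the non-rare cells.
    none_rare = comb(n_cells - n_rare, n_train) / comb(n_cells, n_train)
    return 1.0 - none_rare


# Hypothetical dataset: 10 cells, 2 of them rare, 3 cells used for training.
p_small = prob_rare_cell_sampled(10, 2)
```

The probability falls as the rare population shrinks, which is why a detection rate measured on all cells would need adjusting before it describes the hybrid approach.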
Supplementary Figure 5 Analysis of SC3 clustering of the Macosko dataset
(a) The cells from the Macosko dataset were clustered 100 times using SC3. “Pairwise” indicates the ARIs between the different solutions obtained (a sample of 100 ARIs was taken) and “Reference” indicates the ARI as compared to the labels obtained by Macosko et al.; (b) Sankey diagram comparing the 39 clusters reported by Macosko et al. (left) and the 39 clusters obtained with SC3 (right). The widths of the lines linking both sets of clusters correspond to the number of cells they have in common. Colors and cell types as in Macosko et al.
Supplementary Figure 6 Explanation of biological insights provided by SC3
(a) Illustration of the difference between marker genes and differentially expressed genes. In this small example, 20 cells containing 14 genes with binary expression values (blue for ‘off’, red for ‘on’) are clustered. Only genes 1-4 can be considered as marker genes, whereas all 14 genes are differentially expressed; (b) Density of distributions of AUROC (sample of 1,000 values for each dataset) obtained by merging 100 calculations of marker genes using randomly shuffled assignments of reference labels (provided by the authors, see Methods); (c) Outlier scores for all N = 268 cells of the Deng dataset as generated by SC3 (colors correspond to the 10 reference clusters provided by the authors, same as Stage in Fig. 2d). The nine cells with high outlier score in the red cluster (black arrow) were prepared using a different protocol (see text for details), and are thus attributed to a technical artifact.
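Panel (b) reports AUROC values for marker genes. As a rough illustration of what such a score measures, the sketch below computes the AUROC of a single gene for separating one cluster from all other cells, using the pairwise (Mann-Whitney) formulation. It is a generic calculation under assumed inputs, not SC3's marker-gene procedure; `auroc` and the toy expression values are hypothetical.

```python
def auroc(expression, in_cluster):
    """AUROC of one gene for telling cells inside a cluster from the rest.

    Equals the probability that a randomly chosen cell inside the cluster has
    higher expression than a randomly chosen cell outside it (ties count 0.5).
    """
    inside = [x for x, flag in zip(expression, in_cluster) if flag]
    outside = [x for x, flag in zip(expression, in_cluster) if not flag]
    if not inside or not outside:
        raise ValueError("need at least one cell inside and one outside")
    wins = 0.0
    for x in inside:
        for y in outside:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(inside) * len(outside))


# A gene that is 'on' only in the cluster of interest separates it perfectly.
perfect = auroc([5, 6, 7, 1, 2, 3], [True, True, True, False, False, False])
# A gene with interleaved expression separates it poorly.
poor = auroc([1, 2, 3, 4], [True, False, True, False])
```

A value near 1 means the gene behaves like a marker for that cluster; values near 0.5 indicate no discriminating power, which is the regime the shuffled-label control in panel (b) is meant to characterise.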
Supplementary Figure 7 Cell sorting and genotyping procedures for patients
(a) Contour plots describing the sorting strategy for isolating HSCs in patient 2 (the same was done for patient 1). CD34, CD38, CD90 and CD45RA expression is displayed using a log scale; (b) Lineage-negative, CD34+/CD38-/CD90+/CD45RA- single cells were sorted into individual wells for scRNA-Seq or colony growth in a cytokine cocktail allowing progenitor cell expansion. For genotyping, the JAK2V617F and the TET2 loci were characterised using Sanger sequencing; (c) Clonal composition of patients 1 and 2 obtained by Sanger sequencing of the JAK2V617F and the TET2 loci, as described in (b) (Methods). Colors are the same as the Cluster colors in Fig. 3.
Supplementary Figure 8 Quality control of cells in the patient data
(a) Number of cells with a given number of expressed genes in each patient. Cells to the left of the red line were removed from further analysis because they expressed too few genes; (b) Number of cells with a given (# of ERCC reads)/(# endogenous reads) ratio in each patient. Cells to the right of the red line were removed from further analysis as outliers.
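The two filters in this caption can be sketched as a simple per-cell check. The thresholds marked by the red lines are not stated in the caption, so `min_genes` and `max_ercc_ratio` below are placeholder values, and the cell names and counts are hypothetical.

```python
def passes_qc(gene_counts, ercc_reads, min_genes, max_ercc_ratio):
    """Keep a cell if it expresses enough genes and has a low ERCC ratio."""
    n_expressed = sum(1 for count in gene_counts if count > 0)
    endogenous_reads = sum(gene_counts)
    if endogenous_reads == 0:
        return False  # no endogenous signal at all
    ercc_ratio = ercc_reads / endogenous_reads
    return n_expressed >= min_genes and ercc_ratio <= max_ercc_ratio


# Hypothetical cells: (endogenous read counts per gene, ERCC spike-in reads).
cells = {
    "cell_a": ([5, 0, 3, 9], 2),
    "cell_b": ([0, 0, 1, 0], 1),   # too few expressed genes
    "cell_c": ([4, 2, 6, 1], 20),  # ERCC ratio too high
}
kept = [
    name
    for name, (counts, ercc) in cells.items()
    if passes_qc(counts, ercc, min_genes=2, max_ercc_ratio=0.5)
]
```

A high ERCC-to-endogenous ratio suggests that little cellular RNA was captured, which is why such cells are treated as outliers.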
Supplementary Figure 9 Clustering of scRNA-seq data from patient 1
Consensus matrices corresponding to different values of k. For average silhouette width and stability see Methods.
Supplementary Figure 10 Clustering of scRNA-seq data from patient 2
Consensus matrices corresponding to different values of k. For average silhouette width and stability see Methods.
Supplementary Figure 11 Clustering of scRNA-seq data using combined patient 1 and patient 2 datasets
Consensus matrices corresponding to different values of k. For average silhouette width and stability see Methods.
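The average silhouette width mentioned in these captions can be computed from a consensus matrix once it is turned into a distance. The sketch below follows Rousseeuw's general definition and assumes, for illustration only, that the distance between two cells is 1 minus their consensus value; the paper's Methods define the exact procedure. The function name and toy matrix are hypothetical.

```python
def average_silhouette_width(distance, labels):
    """Mean silhouette width for a labeling, given a full distance matrix."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    widths = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the cell's own cluster.
        a = sum(distance[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(distance[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scale = max(a, b)
        widths.append(0.0 if scale == 0 else (b - a) / scale)
    return sum(widths) / len(widths)


# Toy consensus matrix with two perfectly stable clusters of two cells each.
consensus = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
distance = [[1.0 - value for value in row] for row in consensus]
good = average_silhouette_width(distance, [0, 0, 1, 1])
bad = average_silhouette_width(distance, [0, 1, 0, 1])
```

Comparing the score across different values of k is one way to judge which consensus matrix reflects the most coherent partition.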
Supplementary Figure 12 Additional lines of evidence that SC3 can help to define subclonal composition
(a) Comparison of the coefficient of variation of gene expression in Tet2 and WT subclones of patient 1; (b) Sorting of haematopoietic stem and progenitor cells from patients 1 and 2 using antibodies that target surface markers identified using SC3. Our analysis suggests that CD83 should be specific for WT clones, CD127 and CD244 for the Tet2-only mutant clones, and CD82 for double-mutant clones. Percentages account for CD38+CD34+ cells positive for the indicated surface marker.
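The coefficient of variation compared in panel (a) is the standard deviation of a gene's expression divided by its mean. A minimal sketch follows; the choice of the population standard deviation is an assumption of this example, and the expression values are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population SD assumed)."""
    average = mean(values)
    if average == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return pstdev(values) / average


# Hypothetical expression values of one gene across the cells of a subclone.
cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```

Because it is scaled by the mean, the measure allows variability to be compared between genes or subclones with different expression levels.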
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–12, Supplementary Tables 2 and 4, and Supplementary Results 1–5 (PDF 4591 kb)
Supplementary Table 1
SC3 analysis of Macosko dataset (XLSX 1211 kb)
Supplementary Table 3
Deng marker genes obtained by SC3 (XLSX 168 kb)
Supplementary Table 5
Marker genes analysis of patients 1 & 2 (XLSX 81 kb)
Supplementary Software 1
SC3 v.1.1.2 source files used to generate the analyses in this paper. (ZIP 1620 kb)
Supplementary Software 2
Source Rmd, Python and text files used to generate Supplementary Results 1-4 (ZIP 285 kb)
About this article
Cite this article
Kiselev, V., Kirschner, K., Schaub, M. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14, 483–486 (2017). https://doi.org/10.1038/nmeth.4236
DOI: https://doi.org/10.1038/nmeth.4236