Abstract
Background The routine diagnostic process increasingly entails the processing of high-volume and high-dimensional data. This processing may pose scaling issues that limit the use of such data in research as well as in integrated diagnostics in routine care. Here, we investigate whether existing dimension reduction techniques can provide visualisations and analyses for a complete blood count (CBC) while maintaining representativeness of the original data. We considered over 3 million CBC measurements encompassing over 70 parameters of cell frequency, size and complexity from the UMC Utrecht UPOD database. We evaluated PCA as an example of a linear dimension reduction technique and UMAP, TriMap and PaCMAP as non-linear dimension reduction techniques. We assessed their technical performance using quality metrics for dimension reduction, and their biological representation by evaluating preservation of diurnal, age and sex patterns, cluster preservation and the identification of leukemia patients.
Results We found that PCA performs systematically better than UMAP, TriMap and PaCMAP in representing the underlying data. Biological relevance was retained for periodicity in the data. However, we also observed a decrease in predictive performance of the reduced data for both age and sex, as well as an overestimation of clusters within the reduced data. Finally, we were able to identify the diverging patterns for leukemia patients after applying dimensionality reduction methods.
Conclusions We conclude that for hematology data, the use of unsupervised dimension reduction techniques should be limited to data visualization applications, as implementing them in diagnostic pipelines may lead to decreased quality of integrated diagnostics in routine care.
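To illustrate what a quality metric for dimension reduction can look like, the following minimal Python sketch computes a k-nearest-neighbour preservation score: the average fraction of each point's nearest neighbours in the original space that remain nearest neighbours after reduction. The choice of metric, the function names (`knn_indices`, `neighbourhood_preservation`) and the toy data are assumptions made for illustration only; they are not taken from the study, which does not specify its metrics in the abstract.

```python
import math
import random


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def neighbourhood_preservation(original, embedded, k=5):
    """Mean fraction of k nearest neighbours shared between the two spaces.

    1.0 means every neighbourhood is preserved; values near 0 mean the
    embedding has scrambled the local structure.
    """
    orig_nn = knn_indices(original, k)
    emb_nn = knn_indices(embedded, k)
    overlaps = [len(a & b) / k for a, b in zip(orig_nn, emb_nn)]
    return sum(overlaps) / len(overlaps)


# Hypothetical toy data: two informative axes plus one near-constant axis.
rng = random.Random(0)
original = [(rng.random(), rng.random(), 0.001 * rng.random()) for _ in range(60)]

# Two hand-made "reductions" (stand-ins for a real technique such as PCA):
good_embedding = [(x, y) for x, y, _ in original]   # keeps the informative axes
poor_embedding = [(z,) for _, _, z in original]     # keeps only the noise axis

good_score = neighbourhood_preservation(original, good_embedding)
poor_score = neighbourhood_preservation(original, poor_embedding)
print(f"good embedding: {good_score:.2f}, poor embedding: {poor_score:.2f}")
```

In this sketch the embedding that keeps the informative axes scores higher than the one that keeps only the noise axis, which is the kind of comparison such a metric allows between reduction techniques. The brute-force neighbour search is quadratic in the number of points, so a dataset of millions of measurements would need an approximate or tree-based search instead.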
Competing Interest Statement
Institutional grants by Abbott Hematology, Abbott Global, Siemens Healthineers and Beckman Coulter were received by the authors' department. None of these organizations had a role in conceptualization, design, data collection, analysis, decision to publish, or preparation of the manuscript. The authors declare that they have no further competing interest.
Funding Statement
No funding was received for this research.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Medical Research Ethics Committee NedMec waived the need for informed consent, as only pseudonymized data were used for a large patient sample. The study was conducted in accordance with the Declaration of Helsinki. This study was not subject to the Human Subjects Act (in Dutch: Wet Medisch-Wetenschappelijk onderzoek met mensen, WMO) and we therefore obtained a waiver for study approval from the institutional review board (Medical Research Ethics Committee NedMec).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
Slight change to the abstract: removed some LaTeX for the medRxiv UI, and added "unsupervised" to the final statement of the abstract to delineate the scope of the paper more clearly.
Data Availability
The datasets generated and/or analysed during the current study are not publicly available due to privacy regulations but are available from the corresponding author on reasonable request.