Abstract
Adopting a computational approach for the assessment of urine cytology specimens has the potential to improve the efficiency, accuracy and reliability of bladder cancer screening, which has heretofore relied on semi-subjective manual assessment methods. As rigorous, quantitative criteria and guidelines have been introduced for improving screening practices, e.g., The Paris System for Reporting Urinary Cytology (TPS), algorithms to emulate semi-autonomous diagnostic decision-making have lagged behind, in part due to the complex and nuanced nature of urine cytology reporting. In this study, we report on a deep learning tool, AutoParis-X, which can facilitate rapid semi-autonomous examination of urine cytology specimens. Through a large-scale retrospective validation study, results indicate that AutoParis-X can accurately determine urothelial cell atypia and aggregate a wide variety of cell and cluster-related information across a slide to yield an Atypia Burden Score (ABS) that correlates closely with overall specimen atypia, predictive of TPS diagnostic categories. Importantly, this approach accounts for challenges associated with assessment of overlapping cell cluster borders, which improved the ability to predict specimen atypia and accurately estimate the nuclear-to-cytoplasm (NC) ratio for cells in these clusters. We developed an interactive web application that is publicly available and open-source, which features a simple, easy-to-use display for examining urine cytology whole-slide images (WSI) and determining the atypia level of specific cells, flagging the most abnormal cells for pathologist review. The accuracy of AutoParis-X (and other semi-automated digital pathology systems) indicates that these technologies are approaching clinical readiness and necessitates full evaluation of these algorithms via head-to-head clinical trials.
Introduction
Urothelial carcinoma is highly prevalent (9th most common worldwide) and has the highest recurrence rate among all forms of cancer (74%) 1, 2. The treatment and management of urothelial carcinoma requires follow-up urine cytology (UC), expensive, painful chemotherapy, and/or invasive cystoscopy procedures for long periods of time (typically the remainder of the patient’s life), necessitating the development and implementation of less invasive screening and follow up measures 3.
Detection of and screening for bladder cancer have greatly improved since the earliest known evaluation of hematuria was recorded in the papyrus of Kahun, circa 1900 B.C. In 1550 B.C., it was suggested that hematuria originated from “worms in the belly” 4. A causative agent, S. haematobium, was identified in 1854 by Theodor Bilharz 5, 6. In 1947, Dr. George Papanicolaou, widely considered the father of modern cytopathology, proposed a formal system for evaluation of malignant cells exfoliated from the bladder’s epithelium, which has largely remained intact 7, 8. Over the past half-century, efforts to rigorously define quantitative assessment criteria (e.g., nuclear-to-cytoplasm (NC) ratio, chromatin structure, etc.) and improve specimen preparation methods have sought to resolve remaining ambiguity. Yet, traditional cytological approaches are still hampered by inter-rater variability, specimen quality issues, and the tendency towards ‘hedging’ to the atypical category 9–12.
In recent years, The Paris System for Reporting Urinary Cytology (TPS), formulated in 2013, published in 2016, and updated in 2022, has emerged as a more quantitative and reproducible reporting system for bladder cancer 13–17. TPS criteria are applied to assign one of four main ordered categories (negative, atypical urothelial cells, suspicious for high-grade urothelial carcinoma, positive for high-grade urothelial carcinoma) based on the following criteria for a positive diagnosis: (1) at least five malignant urothelial cells (updated to ten in 2022), (2) an NC ratio at or above 0.7, (3) nuclear hyperchromasia, (4) markedly irregular nuclear membrane, and (5) coarse/clumped chromatin 2. It is often easier to evaluate specimens that have clear-cut diagnoses, either negative or positive, than those that are atypical or suspicious. Atypical specimens are those that are hedged against a negative diagnosis, while suspicious specimens are those that are hedged against a positive diagnosis, but allow fewer than five malignant cells to be detected. Unsurprisingly, the two indeterminate designations suffer from poor inter-rater agreement 12, 17.
There are a number of drawbacks to cytological assessments, despite improvements in screening criteria: cytology slides are far less structured than traditional histological specimens (as they are a random dispersion of cells); there is high inter-rater variability; and the workload involved often leads to cytologist exhaustion– all of these factors increase the likelihood of misclassification. Furthermore, TPS does not introduce rigorous screening criteria for urothelial cell clusters, instead mainly relying on aggregates of individual cellular estimates. Systems to automate the assessment of cytology specimens can provide more quantitative assessments of atypia, while improving reliability and reproducibility.
Advances in cytopathology vis-à-vis increased automation can bring several benefits to all stakeholders in the healthcare space 18–22. The adoption of computer assisted Papanicolaou (‘Pap’) test screening helped laboratories address overwhelming numbers of tests that formerly required manual screening, leading to inevitable workflow backlogs and diagnostic errors resulting from overwork. The end result of this practice was the drafting of the CLIA-88 regulations concerning cytotechnologist workload limits and the development of semi-automated Pap screening devices such as the FocalPoint™ GS and the ThinPrep ® Imaging System (TIS) 23, 24. The commercial success of these automated systems in the gynecologic cytology market provides a window into the possibilities of future computational applications in urine cytology 25–33. The factors which drove the creation of automated gynecologic cytology systems are similarly present in urine cytology: to improve clinical outcomes and integrate smoothly within the daily workflows of cytopathology laboratories. Outside of gynecologic cytology, several computational methods have been developed for cytological applications in screening cancers of varying types of specimens 18, 34–36. For instance, efforts have been made to screen potential malignancies in thyroid fine-needle aspirations (FNA), liquid-based lung cancer specimens, pancreaticobiliary FNA, breast lesions, and urine specimens 37–42.
Systems to automate cytology screening can provide more quantitative assessments of atypia while improving reliability, precision and reproducibility of findings. State-of-the-art approaches leverage deep learning, which relies on the use of artificial neural networks (ANN– inspired by the central nervous system), to construct indicators of atypia that can be formulated into diagnostic tests. For instance, Sanghvi et al. developed a semi-autonomous diagnostic decision aid for bladder cancer using a deep learning algorithm to quantify abnormal cytomorphological features 43. The algorithm detected urothelial cells using QuPath, urothelial clusters using density-based clustering and used convolutional neural networks for scoring cells for atypia (e.g., NC ratio, hyperchromasia, etc.). Although the effectiveness of QuPath, the scoring algorithms, and density-based clustering was not fully discussed, the study showed promising results in estimating overall atypia and could potentially improve bladder cancer screening. However, it should be noted that other studies have highlighted the limitations of QuPath in disaggregation of cells within clusters in favor of detection-based approaches, indicating a need for further refinement of the algorithm 44–47.
We previously developed the AutoParis system to automatically report the presence of malignant cells across cytology specimens through cross-tabulation of the degree of atypia and NC ratio for all urothelial cells in the preparation 48. Cross-tabulation is used to generate an Atypia Burden Score (ABS) to directly classify the specimen. The current AutoParis system operates by: 1) using connected component analysis (morphometry) and watershedding to separate individual cells from cell clusters within the specimen; 2) estimating the NC ratio of the cell using a segmentation neural network to separate the nucleus and cytoplasmic components on a pixel-by-pixel basis; 3) simultaneously assigning the cell as urothelial and recording whether the cell is atypical (atypia score) from a classifier which separates negative urothelial cells, positive urothelial cells, leukocytes, red blood cells (RBCs), debris, squamous, and crystals; and 4) generating digital images in which the cells are arranged in order of atypia, which could be helpful to pathologists. Limitations in current classification systems for urine cytology include 20: 1) confounding by the presence of blood, high cellularity, neobladders (abundant degenerated enterocytes) and scanning artifacts. Other previously unaccounted for cell types may also confound classifiers (e.g., polyomavirus encrusted cells conflated with positive urothelial cells, leukocytes vs. clusters of leukocytes, urothelial cells with no nucleus present, renal tubule cells) 49; 2) morphometry algorithms may not scale to hundreds of thousands of cells at maximal resolution; 3) density-based clustering / watershedding is likely insufficient to separate overlapping cells; 4) using a single classifier does not adequately separate the tasks of determining whether a cell is both urothelial and atypical; 6) orientation and size of cell could confound the classifier; and 7) existing graphical displays for communicating the burden of atypia are static rather than dynamic.
We set out to improve on the AutoParis classification tool by addressing the above limitations and additionally trained the models using a more expansive dataset– we dub the new tool AutoParis-X (AP-X). In AutoParis-X, we addressed challenges associated with cell cluster assessment by developing an artificial intelligence tool that uses detection models to localize urothelial cells, overlapping cell boundaries, dense regions of significant overlap, and identify visual markers of urothelial atypia. By breaking clusters into their constituent architectural components, this preprocessing tool facilitates downstream association studies and predictive algorithms that incorporate quantitative cluster-level features. The cell border identification tool helped develop a more comprehensive understanding of urothelial cell cluster atypia as it pertains to bladder cancer screening. In comparison to the previous AutoParis study, which was validated on a small well-curated test set, we performed a large-scale retrospective validation of AutoParis-X on nearly 1,300 real-world specimens from internal cohorts. In this manuscript, we discuss improvements to the previous approach and its potential for real-time assessment as a mature diagnostic decision aid.
Methods
Specimen Collection and Slide Processing
A total of 1,303 urine specimens were collected across 140 bladder cancer patients (median of 8 specimens per patient; IQR: [8-13]) from 2008 to 2019 at Dartmouth-Hitchcock Medical Center. Forty-seven of these specimens were used to curate data for training the cell and cluster-level machine learning models (cell and cluster-level training and validation cohort). Four specimens were removed due to equivocal findings and/or excessive confluent cellularity. AutoParis-X was further trained and validated on 1,252 specimens after curating slide-level cell/cluster predictors (slide-level training and validation cohorts; see Calculation of Cell and Cluster Slide-Level Scores). The specimens were prepared using ThinPrep® and Papanicolaou staining before being examined microscopically 24. The urine slides were scanned using a Leica Aperio-AT2 scanner at 40× resolution and were stored as 70% quality SVS files representing whole slide images. The slides were manually focused (by a trained technician) on a single plane during scanning, and z-stacking was not used. Patient and slide-level characteristics from the slide-level training and validation cohorts can be found in Table 1. All slides were assessed by a group of five cytopathologists using TPS criteria (negative for high grade urothelial carcinoma, atypical urothelial cells, suspicious for high grade urothelial carcinoma, positive for high grade urothelial carcinoma) 12.
Methods Overview
In this section, we summarize improvements introduced in AutoParis-X, which will be elaborated on in following sections. AutoParis-X was written using the Python programming language and neural networks were implemented using the PyTorch and Detectron2 frameworks 50, 51. Statistical and machine learning models were implemented in Python and R 52–54. A graphical overview is provided in Figure 1:
Slide processing– Connected components analysis to isolate individual cells and cell clusters, sped up through parallel processing 55.
Cell border detection (BorderDet)– Isolates cells within urothelial clusters with overlapping cytoplasmic borders through a neural network detection model 44.
Cell-Level Measures:
Morphometric measures– Additional morphological features to improve cell-type classification and atypia estimation (e.g., size / area).
Urothelial Classifier (UroNet)– Used to filter urothelial cells from potentially conflated cell types through a convolutional neural network, which operates on images of cells and their morphometric measures 56– trained on an expanded dataset with more cell classes.
NC ratio estimation (UroSeg)– Estimates the NC ratio by neural network pixel-wise segmentation of background, nucleus and cytoplasm. Used as objective marker of atypia.
Atypia score (AtyNet)– For predicted urothelial cells at a particular cutoff threshold, a subjective score which incorporates multiple screening criteria (e.g., hyperchromasia, etc.) is determined using another convolutional neural network which operates on images of cells and their morphometric measures and outputs an atypia score 48.
Cell- and Cluster-Slide-level scores– Established through a combination of the above scoring methods, counting the number of cells/clusters in the slide with atypical morphology / cluster architecture as defined by previous works 43, 48. Optimal decision cutoffs for determining cellular/cluster atypia were decided using Bayesian Optimization techniques 57.
Classifier development– Machine learning classifier which integrates cell and cluster level scores and other demographic/specimen characteristics into an Atypia Burden Score (ABS), accounting for repeat measures by patient 58–64.
Model interpretation– A hierarchical logistic regression model was constructed from the machine learning model to identify important indicators of atypia, in addition to analogous univariable models. Helpful graphical displays were generated through an interactive web application 65.
Demo– A demo was deployed to an Amazon Web Services (AWS) server and software released through GitHub and PyPI.
Slide Preprocessing
As detailed in a previous work, individual objects in the image were identified through a connected component analysis 48. In brief, WSIs were converted into grayscale images using OpenCV in Python (version 3.8) 66. The background of each WSI was converted to white through intensity thresholding of the grayscale image to form an object mask. Small objects, defined as a pixelwise area of 50 or below, were filtered using the remove_small_objects morphological operation (scikit-image, Python v3.8) 67. Large objects (e.g., ink markings) were similarly filtered as defined by a minimal area of 500,000 pixels. After small and large object removal, holes within the object mask were filled through the fill_voids function (which is faster than offerings from the scipy package) 68. We leveraged the cupy package (Python v3.8) to reduce compute time through usage of Graphics Processing Units (GPU) where appropriate after extensive timing tests 69. Subimages of slide objects (e.g., candidate urothelial cells and clusters) were returned using the scikit-image regionprops function, which also returned various other morphometric measures and bounding boxes. Inference time and memory usage for the connected component analysis were reduced through distributed computing procedures (e.g., Dask), which use optimized parallelization to operate on larger-than-memory arrays. Using multiprocessing through Dask, operations were also parallelized across subregions within the slide 55.
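The object-identification step described above can be illustrated with a small, dependency-free sketch. This is not the AutoParis-X implementation: it is a pure-Python stand-in for the library routines named in the text, the function name label_objects and the nested-list mask format are hypothetical, and only the two area limits (50 and 500,000 pixels) come from the description above.

```python
from collections import deque


def label_objects(mask, min_area=50, max_area=500_000):
    """Find 4-connected foreground objects and keep those within the area limits.

    mask: list of rows of 0/1 ints (1 = foreground after background thresholding).
    Objects with area <= min_area or area >= max_area are discarded.
    Returns dicts with 'area' and an inclusive 'bbox' (min_row, min_col, max_row, max_col).
    """
    n_rows, n_cols = len(mask), len(mask[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    objects = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            area, r0, c0, r1, c1 = 0, r, c, r, c
            while queue:
                y, x = queue.popleft()
                area += 1
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < n_rows and 0 <= nx < n_cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if min_area < area < max_area:
                objects.append({"area": area, "bbox": (r0, c0, r1, c1)})
    return objects


# Toy mask: two 2x2 objects and one single-pixel speck.
toy_mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
toy_objects = label_objects(toy_mask, min_area=1, max_area=10)
```

On the toy mask, the single-pixel speck is dropped by the small-object filter and the two 2x2 objects are returned with their bounding boxes. A production pipeline would use the vectorized library routines and tile-wise parallelism instead of this loop.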
Cell Border Identification for Cell Cluster Analysis
To improve detection of individual cells within clusters, we previously developed a cell detection neural network, BorderDet, (using the state-of-the-art Detectron2 framework) to identify: 1) location of cells through estimation of bounding boxes (one box per cell) and 2) identify cell boundaries by separating overlapping cytoplasm from adjacent cells. BorderDet was developed using cell clusters identified from the cluster-level training cohort. In brief, two cytopathologists (LJV and XL) annotated 800 cell cytoplasmic boundaries for squamous cells, inflammatory cells, negative/atypical urothelial cells, and dense regions of overlapping/indistinguishable cell borders (dense region). BorderDet is an object detection neural network that can detect multiple objects/instances (i.e., cells) in a cell cluster image 44. It looks for areas in the image that may contain an object and then assigns a score that indicates how likely it is that the region contains an object. The program labels identified objects with the appropriate label (e.g., squamous cell, dense region) and draws a line around the edges of the object (i.e., segmentation mask) to portray the exact boundary, which can overlap with adjacent cells. This allows the program to accurately identify and locate multiple objects in a single cluster. Objects were then filtered using non-max suppression, a technique which ranks overlapping objects, as defined through their intersection over union (IoU), based on their “objectness” score and removes objects with a lower score 70.
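The non-max suppression step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Detectron2 routine itself: boxes are taken to be axis-aligned (x1, y1, x2, y2) tuples, IoU is computed on boxes rather than masks, and the 0.5 IoU threshold is a hypothetical value not given in the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes to keep, visiting them from highest to lowest score.

    A box is dropped when it overlaps an already-kept, higher-scoring box
    by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections and one distant detection.
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
example_scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(example_boxes, example_scores)
```

Here the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the distant third box survives.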
To reduce the number of objects assessed using BorderDet, a size filter was enforced, assessing candidate cell clusters with a pixelwise area of at least 1800 pixels, determined through a sensitivity analysis and visual inspection. Parallel processing through multithreading and multiprocessing was integrated using dask for rapid evaluation 55. Individual cells extracted through the connected component analysis (area between 256 and 1800 pixels) and objects extracted from clusters using their instance segmentation masks were further assessed using single-cell algorithms which report quantitative metrics of atypia (cell-level measures).
In comparison to the density-based clustering approach that validated urothelial clusters using a CNN (Sanghvi et al.), which could lead to many false negative findings (i.e., approach only “screens out” candidate cell clusters), urothelial cell clusters were identified by BorderDet if they contained urothelial cells 43. This approach improves on watershedding (AutoParis v1) and density-clustering (Sanghvi et al.) techniques as these two methods do not precisely identify cells within larger candidate clusters 20, 43, 44, 48. BorderDet also improves upon previous methods by locating dense urothelial cell architectures with overlapping indistinguishable cytoplasmic borders which are challenging to assess for individual cells. Furthermore, while presence of a dense architectural region in a cluster as defined by an area cutoff was used as an atypia predictor, dense architectures themselves were further subclassified as atypical if surrounding urothelial cells were labeled as atypical (as defined by morphology).
Cellular Morphometric Measures
Various morphometric features were estimated from individual candidate cells, including: 1) area; 2) convex area; 3) eccentricity; 4) equivalent diameter; 5) extent; 6) Feret’s diameter; 7) maximum diameter; 8) filled area; 9) major axis length; 10) minor axis length; 11) perimeter; and 12) solidity, extracted using scikit-image (Python v3.8) 56, 71. These morphometric features were primarily used to help demarcate urothelial cells. As an example, urothelial cells are significantly larger than leukocytes, so cell area is an important criterion for separating the two cell types. Morphometric features were standardized using a quantile transformation (implemented in scikit-learn, Python v3.8) fitted within the training set, to limit the influence of any individual cell on which morphometric features were deemed important for the assessment 72. This places greater emphasis on the imaging findings as a means to delineate between different cell types.
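To illustrate why a quantile transformation limits the influence of extreme values, the sketch below maps a feature to its empirical quantile within the training set. It is a simplified mid-rank version written for illustration, not scikit-learn's implementation (which interpolates between stored quantiles), and the function names and toy areas are hypothetical.

```python
from bisect import bisect_left, bisect_right


def fit_quantiles(train_values):
    """Store the sorted training values that define the empirical distribution."""
    return sorted(train_values)


def quantile_transform(sorted_train, x):
    """Map x to a value in [0, 1] using the mid-rank empirical CDF of the training set."""
    n = len(sorted_train)
    below = bisect_left(sorted_train, x)           # training values strictly below x
    at_or_below = bisect_right(sorted_train, x)    # training values at or below x
    return (below + at_or_below) / (2 * n)


# Toy cell areas (pixels) with one extreme object.
train_areas = fit_quantiles([100, 200, 300, 400, 10000])
median_like = quantile_transform(train_areas, 300)
outlier = quantile_transform(train_areas, 10000)
```

The extreme area of 10,000 pixels is mapped to 0.9 rather than to a value 25 times larger than its neighbors, so a single unusual object cannot dominate the scale of the feature.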
Urothelial Cell Classification
Urothelial cell classification was accomplished using UroNet, which was modified significantly from its original incarnation. While AutoParis estimated both the presence and atypia of the urothelial cell simultaneously 48, as differentiated from several other specimen constituents, AutoParis-X is chiefly focused on delineating urothelial cells from potentially conflated cell types and slide objects prior to estimating atypia. When aiming to validate the AutoParis algorithm, we noticed that a nontrivial number of urothelial cells lacked a nucleus, potentially related to being out of focus (no Z-stacking) 73. Such cells were not included in our original training set and thus were often confused with other cell types with a smaller nuclear area (e.g., squamous cells). We also identified rare urothelial cells with changes consistent with a Polyomavirus cytopathic effect 49, 74. These cells are benign, but their appearance can often mimic high-grade urothelial carcinoma (HGUC) and would certainly mislead any attempt to accurately predict the NC ratio; they are thus removed by UroNet.
A total of 108,388 and 27,097 cells were manually labeled by two cytopathologists (LJV and XL) and used to train and validate the cell-level model, respectively, from the cell-level training and validation cohort. A breakdown of cell types present in this training and validation cohort is listed in Table 2. These cell images were combined into the following classes: 1) urothelial cells (benign/atypical), 2) urothelial cells with polyomavirus cytopathic effect, 3) debris, crystals and red blood cells (RBC), 4) leukocytes, 5) clusters of leukocytes, and 6) squamous cells. UroNet was developed using a residual neural network (ResNet18), augmented with an auxiliary layer which combines the morphometric information (e.g., area/size, eccentricity, etc.) with features extracted from ResNet18 by fusing the penultimate layer of the network with this information. The auxiliary neural network first maps the morphometric features to the same dimensionality as the ResNet18 features using a multi-layer perceptron, fϕ. Then the morphometric information (same dimensionality as the ResNet features) is fused with the deep learning features using a gated attention operation, which decides dynamically on a cell-by-cell basis which set of features (deep learning vs. morphometric) to weight more. The weight is dynamically determined using the gating neural network, fθ 75.
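One plausible way to write the gated fusion is sketched below. The notation is assumed for illustration and is not taken from the source: x_morph denotes the morphometric feature vector, h the ResNet18 penultimate-layer features, m the projected morphometric features, σ the logistic function, and α the per-cell attention weight placed on the morphometric branch. Whether α is a scalar or a vector, and exactly which inputs fθ receives, are not specified in the text.

```latex
\mathbf{m} = f_\phi\!\left(\mathbf{x}_{\mathrm{morph}}\right), \qquad
\alpha = \sigma\!\left(f_\theta(\mathbf{h}, \mathbf{m})\right), \qquad
\mathbf{z} = (1 - \alpha)\,\mathbf{h} + \alpha\,\mathbf{m}
```

Under this reading, the fused representation z is passed to the classification layer, and values of α near 1 indicate cells for which the morphometric measures dominate the prediction.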
This operation permits UroNet to filter out cells with significant size differences (e.g., leukocytes are much smaller than urothelial cells). After model training using the PathflowAI package 76, the performance of UroNet was assessed using the cell-level validation set through the area under the receiver operating characteristic curve (AUC), reported for each class. To assess how much weight was placed on the morphometric features for prediction, we investigated the attention weights, α, across the validation set. We used Integrated Gradients 77, 78, a deep learning interpretation method, to assess which specific image/deep learning and morphometric features were important for each cell type.
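Per-class AUC in a multi-class setting is typically computed one-vs-rest. The sketch below shows that calculation using the rank (Mann-Whitney) formulation; it is an illustrative, quadratic-time version with hypothetical function names and toy data, and it assumes every class has at least one positive and one negative example.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, prob_rows, classes):
    """Per-class AUC: each class in turn is treated as positive against all others."""
    return {
        cls: binary_auc(
            [int(y == cls) for y in y_true],
            [row[k] for row in prob_rows],
        )
        for k, cls in enumerate(classes)
    }


# Toy example with two classes; each row holds the predicted class probabilities.
toy_classes = ["urothelial", "leukocyte"]
toy_truth = ["urothelial", "leukocyte", "urothelial", "leukocyte"]
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]
per_class = one_vs_rest_auc(toy_truth, toy_probs, toy_classes)
macro_auc = sum(per_class.values()) / len(per_class)
```

Averaging the per-class values without weighting by class frequency gives the macro-averaged AUC.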
NC Ratio Estimation
For cells classified as urothelial, the NC ratio was calculated for both isolated and cluster cells using a segmentation neural network, UroSeg, which employed a U-Net architecture to assign on a pixelwise basis the presence of nucleus, cytoplasm, or background 48, 79, 80. These areas were annotated/outlined by cytopathologists and UroSeg was trained and validated on 3,690 and 1,231 urothelial cells respectively. Performance was reported using the area under the receiver operating characteristic curve (AUC), reported on a pixelwise basis. For select cell clusters, we compared the impact of running BorderDet, followed by UroNet and UroSeg to calculate the NC ratio as compared to running UroSeg then watershedding, as was originally done by the previous AutoParis algorithm.
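Given a pixelwise segmentation, the NC ratio reduces to a ratio of pixel counts. The sketch below assumes the ratio is defined as nuclear area divided by total cell area (nucleus plus cytoplasm) and uses a hypothetical label encoding (0 = background, 1 = cytoplasm, 2 = nucleus); neither detail is stated in this section.

```python
BACKGROUND, CYTOPLASM, NUCLEUS = 0, 1, 2  # hypothetical label encoding


def nc_ratio(seg_mask):
    """NC ratio from a pixelwise label mask (list of rows of class labels).

    Assumes NC ratio = nucleus pixels / (nucleus + cytoplasm pixels).
    Returns 0.0 when no cell pixels are present.
    """
    nucleus = sum(row.count(NUCLEUS) for row in seg_mask)
    cytoplasm = sum(row.count(CYTOPLASM) for row in seg_mask)
    cell_area = nucleus + cytoplasm
    return nucleus / cell_area if cell_area else 0.0


# Toy cell: a 2x2 nucleus surrounded by a ring of cytoplasm.
toy_cell = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 2, 2, 1, 0],
    [0, 1, 2, 2, 1, 0],
    [0, 1, 1, 1, 1, 0],
]
toy_ratio = nc_ratio(toy_cell)
```

The toy cell has 4 nucleus pixels and 12 cytoplasm pixels, giving a ratio of 0.25. A cell in which no nucleus is segmented returns 0.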
Atypia Score
Several cytopathologists determined whether every urothelial cell extracted from the cell-level training and validation cohort (Table 2) was benign or atypical, based on existing markers of atypia (e.g., presence of nuclear membrane irregularity, abnormal chromatin, hyperchromasia, etc.). From this information, AtyNet, a CNN based on ResNet18 with an architecture similar to that of UroNet, was trained to recapitulate these subjective findings 81. For every urothelial cell, AtyNet calculates a subjective marker of atypia, the atypia score, which is a value from 0 to 1 that reflects the probability that a cell is atypical. We used Integrated Gradients, a deep learning interpretation method, to assess which specific image/deep learning and morphometric features were important for atypia assignment.
Calculation of Cell and Cluster Slide-Level Scores
All extracted individual cell and cluster level statistics are placed into Rich Information Frames (RIF), which are data frame/tabular data structures 48. For any given WSI, there are three RIFs (see Table 3 for description of features):
Isolated-Cell-RIF: Stores morphometric measures; bounding box locations within specimens, cell type assignment probabilities; NC ratios; and atypia scores for each cell not associated with clusters (isolated urothelial cells).
Cluster-Cell-RIF: Stores morphometric measures; bounding box locations within specimens; cell type assignment probabilities; NC ratios; and atypia scores for each cell associated with clusters, in addition to their cluster assignment label (cluster urothelial cells).
Cluster-RIF: Stores bounding box locations within WSI; cluster size; cytoplasmic borders; area of dense regions in cluster; and associated cluster label/identifier. Information on cellular atypia (e.g., number of atypical cells), number of urothelial cells, amongst other cluster-level measures, were added to this RIF from the Cluster-Cell-RIFs.
All RIFs are cross-tabulated to form a Slide Inference Frame (SIF), which represents slide-level statistics, aggregated across all urothelial cells and urothelial cell clusters. This is accomplished by thresholding the cutoff probabilities for the cell and cluster-level scores and counting the number of cells and clusters which meet these criteria. For instance, given an atypia score cutoff of 0.7 (i.e., cell is atypical if AtyNet assigns a 70% probability), a cluster is deemed to exhibit cellular atypia if, for instance, more than 20% of the cells within the cluster are atypical under this definition. Based on the definition of a urothelial cluster (e.g., number of urothelial cells), the number of atypical clusters within the WSI can be estimated. All urothelial cells with an NC ratio of 0 were removed prior to calculating these scores. SIF contains the following statistics:
Isolated cell subscores: Derived from Isolated-Cell-RIF, for cells which were not associated with clusters, including the following statistics: 1) number of urothelial cells; 2) number of atypical urothelial cells as determined using the atypia score; 3) number of atypical urothelial cells as determined using the NC ratio; 4) number of urothelial cells; and 5) center and spread of various morphometric measures.
Cluster cell subscores: Derived from Cluster-Cell-RIF. Similar to isolated cell subscores, only considering cells which were associated with / identified within clusters.
All cell subscores: Combines isolated and cluster cell subscores, considering all cells, irrespective of whether there was a cluster assignment.
Cluster subscores: Represent aggregate Cluster-RIF statistics, including: 1) number of urothelial clusters (defined by a minimum threshold of urothelial cells); 2) number of atypical urothelial clusters (defined by either NC ratio or atypia score); 3) number of dense clusters; and 4) number of urothelial clusters that are both atypical and dense. Unlike the previous three scores, which focus on individual urothelial cells, identified urothelial cell clusters represent the principal unit of analysis.
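The thresholding and counting that turn cell-level records into slide-level statistics can be sketched as below. This is a simplified illustration, not the AutoParis-X code: the record format, the function name summarize_slide, the use of "at or above" for both cutoffs, and the minimum of two cells per cluster are assumptions, while the 0.7 atypia cutoff, the 20% cluster fraction, and the removal of cells with an NC ratio of 0 follow the examples in the text.

```python
def summarize_slide(cells, atypia_cutoff=0.7, nc_cutoff=0.7,
                    cluster_fraction=0.2, min_cluster_cells=2):
    """Aggregate urothelial cell records into slide-level counts.

    cells: list of dicts with 'atypia' (0-1), 'nc' (0-1) and 'cluster'
    (a cluster identifier, or None for isolated cells).
    """
    # Cells with an NC ratio of 0 are removed before scoring.
    cells = [c for c in cells if c["nc"] > 0]

    # Group cluster-associated cells by their cluster identifier.
    clusters = {}
    for c in cells:
        if c["cluster"] is not None:
            clusters.setdefault(c["cluster"], []).append(c)
    urothelial_clusters = {k: v for k, v in clusters.items()
                           if len(v) >= min_cluster_cells}

    # A cluster is atypical when more than cluster_fraction of its cells are atypical.
    n_atypical_clusters = sum(
        1 for members in urothelial_clusters.values()
        if sum(m["atypia"] >= atypia_cutoff for m in members) / len(members)
        > cluster_fraction
    )
    return {
        "n_cells": len(cells),
        "n_atypical_by_score": sum(c["atypia"] >= atypia_cutoff for c in cells),
        "n_atypical_by_nc": sum(c["nc"] >= nc_cutoff for c in cells),
        "n_clusters": len(urothelial_clusters),
        "n_atypical_clusters": n_atypical_clusters,
    }


toy_cells = [
    {"atypia": 0.90, "nc": 0.80, "cluster": None},
    {"atypia": 0.10, "nc": 0.30, "cluster": None},
    {"atypia": 0.95, "nc": 0.00, "cluster": None},   # dropped: NC ratio of 0
    {"atypia": 0.80, "nc": 0.50, "cluster": "A"},
    {"atypia": 0.20, "nc": 0.40, "cluster": "A"},
    {"atypia": 0.30, "nc": 0.20, "cluster": "B"},
    {"atypia": 0.10, "nc": 0.75, "cluster": "B"},
]
toy_summary = summarize_slide(toy_cells)
```

In the toy slide, cluster A is counted as atypical because half of its cells exceed the atypia cutoff, while cluster B is not. In practice the cutoffs would be tuned rather than fixed, as described in the machine learning section.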
Using AutoParis-X, RIF-SIF scores were calculated across the slide-level training and validation cohorts. We added the following patient-level characteristics to the RIF-SIF scores: 1) age; 2) sex; 3) history of hematuria; and 4) specimen source 82, 83. We also noted where slides contained significant blood, high cellularity, acellularity, neobladders (abundant degenerated enterocytes) and scanning artifacts.
Estimating Specimen Atypia with Machine Learning
Specimen atypia was reported through dichotomization of TPS categories into the following classes: 1) negative, atypical and 2) suspicious, positive. The Atypia Burden Score (ABS) reflects the predicted probability of a specimen being atypical as assessed by AutoParis-X. We implemented several machine learning and statistical modeling approaches to predict specimen atypia, including: 1) generalized linear mixed effects modeling (hierarchical logistic regression; GLMM; brms package, R v4.1), accounting for patient- and pathologist-level random intercepts, 2) Random Forest, which does not account for clustering by patient, 3) Gaussian Process Tree Boosting (GPBoost), and 4) Bayesian Additive Regression Trees (BART) 58–61, 64. GPBoost and BART account for clustering by patient by fitting patient- and pathologist-level random intercepts while capturing interactions and nonlinear associations between SIF predictors using ensemble tree models, fθ(x̄).
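A generic form for such a mixed-effects tree ensemble is sketched below. It is the standard random-intercept specification, written with assumed notation, and is not reproduced from the source: i indexes patients, j indexes specimens within a patient, k(i,j) is the pathologist who assessed specimen j of patient i, and x_ij is the vector of SIF predictors.

```latex
\operatorname{logit}\,\Pr\!\left(y_{ij} = 1 \mid \mathbf{x}_{ij}\right)
  = f_\theta\!\left(\mathbf{x}_{ij}\right) + u_i + v_{k(i,j)},
\qquad u_i \sim \mathcal{N}\!\left(0, \sigma_u^2\right),
\quad v_k \sim \mathcal{N}\!\left(0, \sigma_v^2\right)
```

Replacing the tree ensemble with a linear predictor recovers the hierarchical logistic regression (GLMM) listed as the first approach.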
Overall model performance was communicated using fivefold cross-validation, which randomly partitions the data into a training and validation set and reports the overall performance (using the AUC) over the validation folds. Specimens belonging to the same patient were partitioned into the same training/validation fold for each cross-validation split to avoid potential inflation of test statistics. Confidence intervals (CI) were reported using 1000-sample nonparametric bootstrapping of each fold to yield 1000 samples of cross-validation statistics. Cell and cluster-level thresholds (e.g., atypical cell if NC>0.7; atypical cluster if at least 3 urothelial cells are atypical), which are used to generate RIF-SIF scores, were optimally aligned with specimen atypia through a Bayesian Optimization routine 57.
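Two ingredients of this evaluation, patient-grouped fold assignment and a percentile bootstrap interval, can be sketched as follows. The function names are hypothetical, the round-robin assignment of shuffled patients to folds is one simple choice among several, and the bootstrap shown resamples a generic list of values rather than reproducing the exact per-fold procedure used in the study.

```python
import random


def grouped_kfold(patient_ids, n_folds=5, seed=0):
    """Assign each specimen to a fold so that no patient spans two folds."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    return [fold_of_patient[p] for p in patient_ids]


def bootstrap_ci(values, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(values)."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def mean(values):
    return sum(values) / len(values)


# Ten specimens from seven patients.
specimen_patients = ["a", "a", "b", "c", "c", "c", "d", "e", "f", "g"]
folds = grouped_kfold(specimen_patients)

# Interval for the mean of a balanced 0/1 outcome.
ci_low, ci_high = bootstrap_ci([0.0, 1.0] * 50, mean)
```

Keeping all specimens from a patient in the same fold prevents the model from being validated on a patient it has already seen during training.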
Interpretation
We identified significant ABS predictors by extracting salient interactions from the tree ensemble models and reporting odds ratios (OR) from univariable and multivariable Bayesian GLMM models. As many of the ABS predictors were highly multicollinear, variance inflation factors and horseshoe lasso priors were used to select predictors 84, 85. Univariable associations adjusting for age, sex and hematuria were reported to give credence to omitted collinear predictors in the multivariable statistical modeling.
Hierarchical Bayesian cumulative link models (i.e., ordinal regression) in a similar specification were also used to report associations between the predictors and specimen atypia, treating the urine cytology assignment as an ordinal variable 86, 87. Statistical significance was reported using the p-value, as derived from the probability of direction (pd): p ≈ 2 ∗ (1 − pd). A p-value less than 0.05 indicates a significant atypia predictor. Credible intervals, similar to confidence intervals, communicated uncertainty in the effect estimates.
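The conversion from the probability of direction to an approximate p-value can be made concrete with posterior draws. The sketch below assumes pd is computed as the larger of the posterior proportions above and below zero; the function names and toy draws are hypothetical.

```python
def probability_of_direction(samples):
    """Share of posterior draws on the dominant side of zero."""
    n = len(samples)
    positive = sum(s > 0 for s in samples) / n
    negative = sum(s < 0 for s in samples) / n
    return max(positive, negative)


def pd_to_p_value(pd):
    """Two-sided p-value approximation: p = 2 * (1 - pd)."""
    return 2 * (1 - pd)


# Toy posterior for a log odds ratio: 97 of 100 draws are positive.
toy_draws = [0.5] * 97 + [-0.1] * 3
toy_pd = probability_of_direction(toy_draws)
toy_p = pd_to_p_value(toy_pd)
```

With 97% of draws above zero, pd is 0.97 and the approximate p-value is 0.06, which would not meet the 0.05 threshold used here.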
Web Application and Software Availability
We also developed an interactive web application that allows for rapid assessment of cytology slides. In brief, users first select a slide to examine. An ABS is returned for the specimen as assessed using AutoParis-X. The Cell-RIF is converted into a 2D scatter plot of the NC ratio and atypia score, in which each point represents a cell. Using a “lasso tool”, users select cells within this scatterplot. The selected urothelial cells are highlighted on a zoomable WSI viewer (OpenSeadragon) and additionally made available through an image gallery for further examination (Figure 2) 88. The WSI viewer will highlight cells based on their relative degree of atypia as assessed algorithmically, focusing the end-user on a small subset of potentially malignant cells. A demo of this interactive web application can be found at the following URL: http://edit.autoparis.demo.levylab.host.dartmouth.edu/ (user: edit_user, password: qdp_2022; full-screen display is encouraged for optimal viewing experience). The web application also features a tutorial video for operating the application. The AutoParis-X software is also open-source, available to download on GitHub (https://github.com/jlevy44/AutoParisX) and installable using the following PyPI package: autoparis. Users aiming to run AutoParis-X will need to train compatible neural networks, as the networks were trained only on data from a single institution and would need additional finetuning to generalize.
Results
Performance of UroNet
UroNet demonstrated remarkable performance in the task of delineating among 6 different classes of cell types / objects to determine which cells are urothelial (Figure 2; Table 4). Figure 3A demonstrates a nearly perfect ROC curve (AUC=0.997 macro-averaged) for all 6 cell types across the validation set, indicating high classification accuracy. In addition, raw imaging features interpreted using IntegratedGradients were consistent with known histomorphology for specific cell types (e.g., highlighting dense chromatin to depict urothelial cells, surrounding membrane for squamous cells, etc.; Figure 3B). Many morphometric features were found to be important, for instance: 1) eccentricity as a defining feature of urothelial cells versus other cell types, 2) solidity for RBCs, 3) convex area as an important predictor for leukocyte clusters which have highly irregular formations, and 4) both convex area and solidity for squamous cells, which are larger than the other cell types and typically solid shapes without any notable deformations (Supplementary Figure 1). These findings suggest that UroNet can accurately identify urothelial cells, which is important for establishing urothelial cell assessment as the basis for AutoParis-X’s automated assessment.
Performance of UroSeg
UroSeg, a neural network segmentation tool, demonstrated excellent performance on our internal validation set in predicting the pixelwise presence of the nucleus and cytoplasm (AUC=0.971 macro-averaged) in order to calculate nuclear to cytoplasm (NC) ratio (Figures 2-3; Table 4). Figure 3F also shows nearly perfect receiver operating characteristic curves for both the nucleus and cytoplasm, indicating the high accuracy of UroSeg in predicting these structures. Additionally, we found that the NC ratios calculated from the segmentation masks produced by UroSeg correlated nearly perfectly with the ground truth NC ratios (r=0.965; MAE=0.015) annotated by the cytopathologists (Figure 3G). Figure 3E demonstrates the alignment of the true and predicted nuclear and cytoplasmic segmentation masks, further highlighting the accuracy of UroSeg. UroSeg was similarly effective when used in conjunction with BorderDet, our previously established urothelial cluster border separation tool. Cells extracted from urothelial clusters using BorderDet and confirmed to be urothelial via UroNet were assessed using UroSeg. We compared the NC ratios, averaged across each urothelial cluster, in our internal validation set with what was accomplished using watershedding techniques (which divided the clusters after seeding the watershed based on the location of the nuclei). Watershedding was not sensitive to the cell type as it did not leverage BorderDet and UroNet. In addition, for clusters containing urothelial cells and background debris or other confounding cell types, watershed heavily underestimated the NC ratio (Figure 4). This was universal across all of the urothelial clusters in the internal validation set. Through visual examination, it is clear that by precisely demarcating cytoplasmic borders between immediately adjacent and overlapping cells, BorderDet and UroNet allow for precise estimation of the NC ratio. 
Opting for alternative assessment approaches (e.g., watershedding) could reduce predictive capacity for slides containing an abundance of urothelial cell clusters by removing or unnecessarily skewing the reported statistics for these cells as compared to isolated cells.
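To make the NC ratio comparison concrete, the following minimal sketch derives an NC ratio from a labeled segmentation mask and scores agreement with reference values. The label encoding (0 = background, 1 = cytoplasm, 2 = nucleus) and the definition of the NC ratio as nuclear area divided by total cell area are assumptions made for illustration; UroSeg's actual output format and the study's exact definition may differ.

```python
import math


def nc_ratio(mask, nucleus=2, cytoplasm=1):
    """NC ratio from a 2D label mask (list of rows of integer labels).

    Assumed definition: nuclear pixels / (nuclear + cytoplasmic pixels).
    """
    nuc = sum(row.count(nucleus) for row in mask)
    cyt = sum(row.count(cytoplasm) for row in mask)
    if nuc + cyt == 0:
        raise ValueError("mask contains no cell pixels")
    return nuc / (nuc + cyt)


def mean_absolute_error(truth, pred):
    """Average absolute difference between paired values."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Because the ratio depends on the cytoplasmic area assigned to each cell, misplacing the border between adjacent cells in a cluster changes the denominator and therefore the ratio.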
Performance of AtyNet
Performance for AtyNet, the neural network which provides an atypia score estimate for each urothelial cell, was equally promising (Figure 2; Table 4). The algorithm achieved an area under the receiver operating characteristic curve of 0.917 on the internal validation set, indicating a strong ability to distinguish between atypical and normal cells. Model interpretation using integrated gradients revealed that the algorithm placed a high emphasis on irregularities in the nuclear membrane as a key feature in determining cytological atypia (Figure 2B) 56.
ABS Classifier Performance
Individual cell and cluster level features were cross tabulated across the slide and assessed using multiple statistical and machine learning algorithms. Many cellular and cluster level features correlated closely with specimen atypia (Supplementary Figures 2-4). Atypical urothelial cells within clusters, as defined by both the NC ratio and atypia score, were in some cases more predictive of specimen atypia than isolated cells alone (e.g., cells with high NC ratio in clusters were more predictive than isolated cells with high NC ratio), further suggesting the importance of employing BorderDet for separating cells. The number of urothelial cells and cell clusters correlated directly with potential for malignancy. Urothelial cell clusters that were both atypical and contained dense regions were the third most predictive variable when assessed using univariable regression.
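The tabulation of cell-level measurements into slide-level counts can be sketched as below. The default thresholds mirror the examples given in the methods (a cell is atypical if NC > 0.7; a cluster is atypical if it holds at least 3 atypical urothelial cells); the input layout, the function name, and the returned feature names are hypothetical, and the study's actual feature set is considerably larger.

```python
from collections import Counter


def slide_features(cells, nc_thresh=0.7, min_atypical=3):
    """Summarize one slide.

    cells: iterable of (nc_ratio, cluster_id) pairs, where cluster_id
    is None for an isolated cell (assumed layout for this sketch).
    """
    cluster_sizes = Counter()
    atypical_per_cluster = Counter()
    isolated_atypical = 0
    n_cells = 0
    for nc, cluster_id in cells:
        n_cells += 1
        is_atypical = nc > nc_thresh
        if cluster_id is None:
            isolated_atypical += int(is_atypical)
        else:
            cluster_sizes[cluster_id] += 1
            atypical_per_cluster[cluster_id] += int(is_atypical)
    atypical_clusters = sum(
        1 for c in cluster_sizes if atypical_per_cluster[c] >= min_atypical
    )
    return {
        "n_cells": n_cells,
        "n_clusters": len(cluster_sizes),
        "n_isolated_atypical": isolated_atypical,
        "n_atypical_clusters": atypical_clusters,
    }
```

Counts of this kind would then serve as predictors for the specimen-level models, with the thresholds themselves treated as tunable quantities.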
As part of the AutoParis-X framework, each machine learning model outputs the Atypia Burden Score (ABS), the probability of assigning a suspicious or positive UC exam as judged using AutoParis-X. Across all algorithms, ABS correlated closely with specimen atypia. The machine learning models which accounted for patient and pathologist-level variation, GPBoost and BART, outperformed all other approaches with AUCs of 0.89 and 0.88 respectively (Figure 5A; Table 5). The generalized linear mixed effects models also performed well. Across all models, ABS values preserved the ordering of the UC categories (Negative<Atypical<Suspicious<Positive; Figure 5B). We fit an ordinal regression model to these data, which demonstrated a strong positive association with atypia (UC categories; β = 3.61; 95% CI: [3.12, 4.11]; p < 0.0001). This information is corroborated by density heatmaps depicting the NC ratio and atypia score for individual urothelial cells across the entire cohort, after being filtered using UroNet. This yielded more than 6 million cells, which were separated based on their UC class. Figure 5D demonstrates the progression in cellular atypia across the categories: negative cases typically do not contain cells that have both high NC ratio and atypia, while these cells can be increasingly found at higher UC categories. Positive cases contain many cells that are both highly atypical and have a high NC ratio.
Univariable and Multivariable associations with Specimen Atypia
Table 6 demonstrates the importance of the individual slide level predictors through both univariable and multivariable regression modeling. A few predictors remained in the unpenalized statistical model after applying the horseshoe lasso (Figure 5C). These included positive associations with the number of clusters, the number of clusters that were both atypical and dense, and the number of isolated atypical cells, as well as an interaction between age and atypia. The interaction indicates that overall specimen atypia in younger individuals is more strongly influenced by the number of atypical urothelial cells than it is in older individuals.
Web Application Example
As a demonstration of AutoParis-X’s ability to facilitate rapid examination of UC specimens, we examined four specimens with the web application (see Supplementary Figures 5-7 for screenshots). Among thousands of specimens examined using this web tool, select cases (negative, atypical, suspicious, positive) can be further inspected using the demo application (see Web Application and Software Availability). The first case (Supplementary Figure 5) yielded an Atypia Burden Score of 0.14. Urothelial cells with high atypia were selected and plotted on the WSI, revealing their locations. Zooming in on the WSI confirmed the reported cell-level statistics. We also used the table as a means to rapidly examine all atypical cells in order of decreasing atypia, a faster method of examining cells than zooming in using the web application. These examinations confirmed that this was in fact an atypical specimen. The second case produced an Atypia Burden Score of 0.6; a similar examination revealed specimen atypia on par with that of a suspicious assignment. The final case was a positive patient with an Atypia Burden Score of 0.76. Given the high cellularity of the specimen, we focused our examination on only a few cells which demonstrated the highest potential for malignancy. Many of these cells were nested in urothelial cell clusters. This search identified cells which were indeed highly malignant morphologically, allowing for rapid assignment of a positive finding. In Supplementary Figure 8, we used the WSI viewer to zoom in on a few malignant cells identified using the AutoParis-X web application.
Discussion
Advances in urine examination from ancient times to the information age have been accompanied by improvements in both specimen preparation and rigorous quantitative bladder cancer screening criteria 4. Urine cytology (UC) examination for specimen atypia has emerged as the staple of modern-day bladder cancer screening and is often accompanied by more invasive methods for cases demonstrating suspicious or positive classifications. For example, TPS is a widely used grading system in urine cytology screening for bladder cancer, which assigns four main categories based on the presence of high-grade urothelial carcinoma cells and specific cellular features. Yet, despite advances in manual examination methods, there is often poor inter-rater reliability in the interpretation of atypical or suspicious specimens, and TPS does not include rigorous criteria for evaluating urothelial cell clusters 11, 17, 89–94. Automation in cytopathology can improve the reliability of cytological assessments and help clinicians address growing numbers of tests and avoid diagnostic errors, as has been demonstrated in the gynecologic cytology market with the adoption of systems such as the ThinPrep® Imaging System and FocalPoint™ GS Imaging System 24. Existing systems for semi-autonomous UC examination have addressed many challenges, though they have yet to adequately account for many additional complexities which can confound assessment (e.g., clusters, polyomavirus, etc.) 20, 21. In this study, we detailed the development of an artificial intelligence tool, AutoParis-X, which improves upon its previous incarnation, to allow for the rapid and nuanced examination of UC specimens; validation on a large-scale retrospective cohort illustrated the maturity and technical sophistication of this tool. For instance, challenges associated with calculation of NC ratios and overall cellular atypia within dense, overlapping urothelial cell clusters were addressed with remarkably good performance 44.
The importance of many previously understudied predictors was evaluated (e.g., number of atypical and dense urothelial clusters). Finally, the featured interactive web application was designed for ease-of-use for semi-autonomous diagnostic decision making.
All of these innovations suggest AutoParis-X’s potential to greatly facilitate the process of bladder cancer screening, potentially resulting in a significant increase in diagnostic accuracy and a subsequent decrease in potential avenues for error (similar to what occurred with wide adoption of FocalPoint for Pap tests) 31, 95. For instance, results suggest that UroSeg can be used to accurately calculate NC ratios in a high-throughput manner. AutoParis-X can be used to examine hundreds to thousands of cytology specimens overnight, permitting semi-autonomous evaluation from the cytopathologist via the web application the following day (or in real time as results are generated). This is expected to increase the number and throughput of cytology exams that can be performed by any given institution while accounting for the necessary safeguards (i.e., secondary manual review of a random cohort of cases, as is now done with Pap tests). Cases unable to be assessed using this web-based platform could be shunted to the classical manual interpretation pathway. As with any newly introduced technology, rigorous real-world clinical trials will be required to evaluate the potential impact of adopting this system. As there are only limited applications of AI technologies in digital pathology that have been approved by the FDA for clinical usage, several existing practicalities are worth addressing before AutoParis-X can be safely employed in the clinic. Social barriers for adoption can be identified through surveys on attitudes and beliefs about the tool, which will allow for iterative refinement of the output display and additional algorithmic finetuning. AutoParis-X will also need to demonstrate non-inferiority in a clinical trial (i.e., random assignment of individuals to assessment via manual and semi-autonomous examination).
As non-inferiority is evaluated with respect to a ground-truth measurement, it will be difficult to prove the utility of AutoParis-X to assign specimen atypia based on alignment to cytopathologist ratings alone given the high inter-observer variation (i.e., there is no universal, quantitative ground truth in urine cytology) 12, 17, 93. Additional validation will likely require assessment of its capacity to predict more objective outcomes, such as disease recurrence or death 96–99. Its cost-effectiveness over traditional methods will also need to be proven (e.g., CPT codes, RVUs, number of specimens per day, technologist and pathologist time spent), which will communicate revenue to be expected / workforce needed when operating the device 100–103. A clearer understanding of how these tools can impact clinical decision making is needed before implementation (e.g., what conditions/thresholds are necessary to flag the case for manual review under a microscope) 104.
There are several limitations worth noting that will require future improvements and developments. We observed potential scanning artifacts (e.g., pixelation of cells), deficiencies in specimen preparation, high cellular density, and blood in the samples, which complicate the assessment. However, we have not yet developed methods to address these challenges. In addition to surveying attitudes, beliefs and adoption barriers, cytopathologists unfamiliar with digital technologies may favor assessment through analog means (e.g., microscope); this will either require additional training and education on how to operate these nascent technologies or may require further subspecialization / training of cytopathologists to perform a digital assessment 105–109. AutoParis-X does not account for Z-stacking of cytology slides, which can be accounted for in future iterations to model cells in 3D 73, 110. Annotation of individual cells and clusters was performed by a small group of cytopathologists. Some of these annotations (e.g., nucleus, delineation of cytoplasmic borders in clusters, cell type) may differ between cytopathologists. In addition, data were only collected and validated at a single institution, which may limit generalization of these approaches as other institutions may have heterogeneous patient characteristics/demographics and different specimen preparation methods 111. Additional data collection from multiple institutions can ameliorate these potential challenges by improving the diversity of the dataset, allowing additional flexibility. There is also room for improvement in deriving slide level features. While we utilized Bayesian Optimization to decide which cells/clusters were atypical, dense, clusters, etc., consideration of additional thresholds or forms to summarize this information could improve the model accuracy. There exists a plethora of modeling approaches which can be utilized to predict specimen atypia.
For instance, attention and graph-based neural network architectures can take as input the entire WSI broken into constituent cells, each of which has stored attribute/morphological information, and perform what amounts to a weighted average across the cells to derive a final summary statistic 112, 113. The ordinal nature of UC class assignment was not explicitly taken into account for most of the results in this study and can be incorporated into these machine learning models using the appropriate model likelihoods 114. Institutions aiming to adopt these digital technologies will also require significant computing infrastructure. This requires the purchase and utilization of GPU enabled compute nodes (cloud computing services such as AWS and Google Cloud present viable alternatives to in-house purchases), adoption of containerized workflows, which standardize and scale analyses, and hosting of front-facing applications with appropriate databasing, security and credentialling.
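The "weighted average across the cells" idea can be illustrated with a minimal attention-pooling sketch. It assumes each cell already has a scalar score and an unnormalized attention logit; in a real attention-based model both would be learned from cell features, and nothing here reflects an implementation in AutoParis-X.

```python
import math


def attention_pool(cell_scores, attention_logits):
    """Softmax-weighted average of per-cell scores.

    Returns (pooled_score, weights); the weights sum to 1 and show how
    much each cell contributes to the slide-level summary.
    """
    if not cell_scores or len(cell_scores) != len(attention_logits):
        raise ValueError("need one attention logit per cell score")
    top = max(attention_logits)  # subtract the max for numerical stability
    exps = [math.exp(a - top) for a in attention_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = sum(w * s for w, s in zip(weights, cell_scores))
    return pooled, weights
```

With equal logits the result reduces to the plain mean of the cell scores, while a large logit on one cell lets that cell dominate the slide-level summary, which is the behavior one would want when a few highly atypical cells should drive the assessment.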
Conclusion
Bladder cancer screening through urine cytology exams is a tedious and fatiguing process as cytopathologists assess tens to hundreds of thousands of cells per specimen. Algorithmic techniques to emulate these assessments are beginning to address their incredibly nuanced nature. This study featured the design and large-scale validation of a digital diagnostic decision aid, AutoParis-X, which iterates on previous incarnations of urine cytology assessment algorithms to address many remaining complexities associated with challenging examination; further, it features a web application that allows for accurate and rapid examination of specimens. We encourage interested parties to utilize the AutoParis-X workflow and consider validating and finetuning the algorithm for other practice settings to enhance its wider generalizability. The current study demonstrated that quantitative digital urine cytology assessment methods have come of age and are prepared for further rigorous prospective evaluation to investigate their future role in augmenting clinical diagnostic decision making.
Data Availability
Access to manuscript data is limited due to patient privacy concerns. Data produced in the present study are available upon reasonable request to the authors and four specimens are made available for demonstrated assessment at the following URL: http://edit.autoparis.demo.levylab.host.dartmouth.edu/
Appendix
Footnotes
Conflict of Interest: None to disclose.
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.
- 101.
- 102.
- 103.
- 104.
- 105.
- 106.
- 107.
- 108.
- 109.
- 110.
- 111.
- 112.
- 113.
- 114.