Abstract
Histopathological evaluation of prostate biopsies using the Gleason scoring system is critical for prostate cancer diagnosis and treatment selection. However, grading variability among pathologists can lead to inconsistent assessments, risking inappropriate treatment. Similar challenges complicate the assessment of other prognostic features like cribriform cancer morphology and perineural invasion. Many pathology departments are also facing an increasingly unsustainable workload due to rising prostate cancer incidence and a decreasing pathologist workforce coinciding with increasing requirements for more complex assessments and reporting.
Digital pathology and artificial intelligence (AI) algorithms for analysing whole slide images (WSIs) show promise in improving the accuracy and efficiency of histopathological assessments. Studies have demonstrated AI’s capability to diagnose and grade prostate cancer comparably to expert pathologists. However, external validations on diverse data sets have been limited and often show reduced performance. Historically, there have been no well-established guidelines for AI study designs and validation methods. Diagnostic assessments of AI systems often lack pre-registered protocols and rigorous external cohort sampling, essential for reliable evidence of their safety and accuracy.
This study protocol covers the retrospective validation of an AI system for prostate biopsy assessment. The primary objective of the study is to develop a high-performing and robust AI model for diagnosis and Gleason scoring of prostate cancer in core needle biopsies, and at scale evaluate whether it can generalise to fully external data from independent patients, pathology laboratories, and digitalisation platforms. The secondary objectives cover AI performance in estimating cancer extent and in detecting cribriform prostate cancer and perineural invasion. This protocol outlines the steps for data collection, predefined partitioning of data cohorts for AI model training and validation, model development, and predetermined statistical analyses, ensuring systematic development and comprehensive validation of the system. The protocol adheres to TRIPOD+AI, PIECES, CLAIM, and other relevant best practices.
1. Introduction
Histopathological evaluation of prostate core needle biopsies is an important factor in prostate cancer diagnosis and treatment. Pathologists examine biopsies using the Gleason scoring system (Gleason, 1992), assigning primary and secondary grades based on the relative quantities of tissue representing different Gleason patterns (e.g. a Gleason score of 3 + 4 = 7 indicating primary Gleason pattern 3 and secondary Gleason pattern 4) (Epstein et al., 2005). Grading is, however, inherently subjective and associated with high intra- and inter-pathologist variability, placing patients at risk of inappropriate treatment selection (Melia et al., 2006; Egevad et al., 2013; Ozkan et al., 2016). With the aim of standardisation, the International Society of Urological Pathology (ISUP) updated grading guidelines such that Gleason scores (GS) are pooled into five ordinal categories (i.e. 1 to 5) referred to as the ISUP grades (also called grade groups or WHO grade) (Ji, 2005; Epstein et al., 2016; WHO Classification of Tumours Editorial Board and International Agency for Research on Cancer, 2022). Besides Gleason scoring, similar issues also hamper the reliable and repeatable assessment of other histopathological entities relevant to the clinical management of prostate cancer, such as cribriform cancer morphology (Egevad et al., 2023) or perineural invasion (PNI) (Egevad et al., 2021), both of which are associated with a poor prognosis.
Digital pathology (Pantanowitz et al., 2018) and the application of artificial intelligence (AI) algorithms to analyse whole slide images (WSIs) hold promise for reducing variability and improving the accuracy of histopathological assessments. Many previous studies have demonstrated that AI can diagnose and grade prostate cancer on par with expert pathologists (Campanella et al., 2019; Bulten et al., 2020, 2022; Ström et al., 2020). However, external validations demonstrating the generalisation capacity of these models on data spanning across scanning devices, laboratories, and patient populations not involved in the model development have been limited. Moreover, results from the validation studies have often shown deteriorated performance on the external data (Campanella et al., 2019; Swiderska-Chadaj et al., 2020; Ji et al., 2023). These complications are not specific to prostate pathology, as there are several examples of scanner-induced variability and bias posing challenges for AI models across different tasks and tissue types (Howard et al., 2021; Schmitt et al., 2021; Duenweg et al., 2023).
The unresolved issues with generalisation limit the widespread application of AI in clinical practice, including histopathology. The field has historically lacked well-established guidelines on AI study designs and standardised methods for the proper evaluation and reporting of AI validation studies. Generally, diagnostic assessments of AI systems lack pre-registered study protocols with predefined analysis plans and rigorous sampling of external cohorts, which are key factors for generating reliable evidence of the safety and diagnostic accuracy of these systems in view of further prospective evaluations in clinical trials (Nagendran et al., 2020; McGenity, Bossuyt and Treanor, 2022). Here, we present a comprehensive study protocol for retrospective validation of an AI system for diagnostic assessment of prostate biopsies. This protocol outlines study objectives, analysis and experimental pipelines, as well as data cohorts for evaluating the generalisability and robustness of the AI system. The AI system is ultimately intended to be used as part of computer-aided diagnosis (CAD) software to provide decision-making support for pathologists, but the focus of the current study is on the standalone diagnostic performance of the system. Aspects relating to the clinical implementation of the system, user interaction, and analysis of the diagnostic performance of the system in combination with the supervision of a human pathologist are outside of the scope of this protocol.
Several guidelines have recently been proposed or are under development for reporting clinical validation studies of AI-based methods. Examples include SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials) and its companion statement CONSORT-AI (Consolidated Standards of Reporting Trials), which are intended for protocols and reporting of randomised clinical trials involving an AI intervention component (Cruz Rivera et al., 2020; Liu et al., 2020), and the DECIDE-AI (Developmental and Exploratory Clinical Investigations of DEcision support systems driven by Artificial Intelligence) guideline, which applies specifically to early, small-scale evaluation of AI interventions, with a focus on clinical utility, safety and human factors (Vasey et al., 2022).
In terms of guidelines applicable to pre-clinical and offline evaluation of AI prediction models, the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) (Collins et al., 2021) guideline on developing or reporting performance of AI prediction models has recently been released (Collins et al., 2024), while the STARD-AI (Standards for Reporting of Diagnostic Accuracy Studies) (Sounderajah et al., 2021) guideline is still under development. This protocol incorporates guidance from TRIPOD+AI (Collins et al., 2024), applicable parts of the best practice checklists proposed in PIECES (Protocol Items for External Cohort Evaluation of a Deep Learning System in Cancer Diagnostics) (Kleppe et al., 2021) and CLAIM (Checklist for AI in Medical Imaging) (Mongan, Moy and Kahn, 2020; Tejani et al., 2023), and methodological checklists developed for radiology, owing to the absence of equivalent guidelines in the field of pathology (Park and Han, 2018). This AI study protocol covers the steps of data collection, prespecified partitioning of data cohorts, model development, and prespecified statistical analyses, ensuring systematic development and thorough validation of the system.
2. Study objectives
The objective of the study is to develop a high-performing and robust AI model for diagnosis and Gleason scoring of prostate cancer in core needle biopsies, and at scale demonstrate that it can generalise to fully external data from independent patients, pathology laboratories, and digitalisation platforms.
2.1. Primary objective
The primary objective is to assess the concordance between the AI model and pathologists in diagnosing and Gleason scoring prostate cancer in core needle biopsies.
2.2. Secondary objectives
The study has three secondary objectives:
- Assess the concordance between the AI model and pathologists in measuring cancer extent (in millimetres) in prostate core needle biopsies.
- Assess the concordance between the AI model and pathologists in detecting perineural invasion in prostate core needle biopsies.
- Assess the concordance between the AI model and pathologists in detecting cribriform cancer in prostate core needle biopsies.
3. Artificial intelligence system
The AI system to be developed and evaluated in this study is intended for the histopathological assessment of digitised prostate core needle biopsies. The system will be based on deep neural networks and its specific design (e.g. image preprocessing steps, model architecture and training approach) will be optimised during the study (see Section 4 for further description of the design choices and hyperparameters that will be evaluated). The study comprises multiple AI models, each tailored to a specific objective (i.e. grading, perineural invasion, cribriform cancer, and cancer length); together, these models form the AI system.
System input: A WSI stored in a supported vendor-specific format, depicting a formalin-fixed, paraffin-embedded (FFPE) haematoxylin & eosin (HE) stained prostate core needle biopsy specimen with one or several tissue cuts of one or several biopsy cores.
System output:
- Gleason score: the system will output GS, such as 4 + 3 = 7, indicating the primary and secondary patterns observed within the input WSI. The GS ranges from 3 + 3 = 6 to 5 + 5 = 10, with lower scores representing less aggressive cancer and higher scores indicating more aggressive cancer. Benign samples are encoded as 0 + 0.
- ISUP grade: the system will output an ISUP grade which groups GS into ordinal categories, ranging from 1 to 5. The GS are expressed as ISUP grades as follows: ISUP 1 (GS 6), ISUP 2 (GS 3 + 4 = 7), ISUP 3 (GS 4 + 3 = 7), ISUP 4 (GS 8), ISUP 5 (GS 9 - 10). Benign samples are encoded as 0.
- Cancer extent: the system will quantify the extent of cancer within the provided WSI in millimetres. This measurement indicates the linear extent of cancerous tissue within the specimen.
- Cribriform cancer: the system will output the predicted probability of cribriform prostate cancer morphology being present within the input WSI.
- Perineural invasion: the system will output the predicted probability of perineural invasion being present within the input WSI.
- Visualisation: the system will provide a visualisation of its predictions including areas of different Gleason patterns, PNI and cribriform cancer, which can be examined in a WSI viewer software overlaid on the digital slide. The exact format of the visualisation will vary depending on the viewer software.
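The relationship between the Gleason score and ISUP grade outputs listed above can be illustrated with a short sketch. The function name `isup_from_gleason` is hypothetical and not part of the study software; the mapping and the encoding of benign samples as 0 + 0 follow the output description above.

```python
def isup_from_gleason(primary: int, secondary: int) -> int:
    """Map a Gleason score (primary + secondary pattern) to an ISUP grade.

    Benign samples are encoded as 0 + 0 and map to ISUP grade 0.
    """
    if (primary, secondary) == (0, 0):
        return 0
    # Only Gleason patterns 3, 4 and 5 are valid for malignant samples.
    if primary not in (3, 4, 5) or secondary not in (3, 4, 5):
        raise ValueError(f"invalid Gleason score: {primary} + {secondary}")
    total = primary + secondary
    if total == 6:
        return 1
    if total == 7:
        # 3 + 4 and 4 + 3 share the same sum but map to different grades.
        return 2 if primary == 3 else 3
    if total == 8:
        return 4
    return 5  # Gleason score 9 or 10
```

For example, `isup_from_gleason(4, 3)` gives ISUP grade 3, whereas `isup_from_gleason(3, 4)` gives ISUP grade 2.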
4. Study design
In this study, the aim is to develop the AI system described above and validate its diagnostic performance on retrospectively collected cohorts. To carry out the study, historical data, including medical records, pathology reports, and digitised images have been collected for cases where both the AI system and human pathologists make diagnostic assessments. The study design involves two independent phases: AI system development and AI system validation as shown in Fig. 1. The development phase involves an iterative cycle of refining the model design and hyperparameters using predefined development and tuning cohorts for model training and estimation of the effects of design choices on diagnostic performance. Once the overall performance on the development and tuning sets is deemed to have reached a plateau and further changes to the model design no longer yield meaningful improvements, a design freeze will take place and the final AI model will advance to the validation phase. This design achieves complete isolation between the model development and the retrospective validation to avoid any information leakage, which could lead to overly optimistic validation results. All model parameters and hyperparameters, including selection of any classifier thresholds, will be set based on the development and tuning cohorts and no adjustments will be made on the validation cohorts, which will remain entirely untouched during the development phase.
The development cohorts provide a wide representation of tissue morphologies, scanning devices, laboratories, and clinical characteristics of patients, allowing for the training of a robust model. The tuning cohorts enable assessing model generalisation (i.e. performance on data from different laboratories and scanners than the development cohorts) on each development iteration, and direct performance comparison with state-of-the-art models evaluated on these same datasets in earlier studies (Ström et al., 2020; Bulten et al., 2022). Sequential experiments will be conducted one modification at a time to evaluate, for example, different preprocessing approaches for extracting image data from the WSI, deep neural network architectures, and optimiser hyperparameters (see Supplementary Appendix Section 1). Model performance at each step will be measured using cross-validation on the development cohorts and independent evaluation on the tuning cohorts. To accelerate the development process by reducing runtime for early model designs and to simplify troubleshooting, we will initially only use one of the development cohorts for model training and gradually introduce the other development cohorts one by one. This approach to model development allows for:
- Effective troubleshooting: systematic experiments facilitate easier debugging and identification of error root causes.
- Traceability and accountability: transparency and traceability of how the model evolved during development, and accountability in cases of improvements or issues.
- Isolation of changes: the impact of each modification is assessed independently without the confounding effects of simultaneous changes (e.g. changing multiple hyperparameters at once).
- Optimal model tuning: controlled and sequential modifications allow for optimal tuning of the model and achieving the best possible model performance.
The validation phase will employ a blinded approach, wherein neither the pathologists nor the AI model has access to the other’s assessments. The validation cohorts consist of samples representing a range of heterogeneous clinical settings and were collected from patients not included in the development or tuning cohorts. They are categorised as internal (scanner and laboratory included in the development set), partly external (scanner but not laboratory included in the development set) or fully external (neither scanner nor laboratory included in the development set) depending on the slide scanners and clinical laboratories involved. Internal validation can be expected to provide an optimistic estimate of the diagnostic performance of the AI model in the absence of laboratory or scanner variation. The generalisation performance of the model will ultimately be evaluated on the external validation cohorts, which avoids this optimistic bias. The design also allows for additional validation cohorts to be added at any point after the development phase.
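The categorisation of validation cohorts described above can be expressed as a simple rule on whether the laboratory and the scanner of a cohort also appear in the development set. The following is a minimal sketch under that reading; the function name and the string labels are illustrative assumptions, and the case of a known laboratory combined with an unseen scanner is not defined in this protocol, so the sketch raises an error for it.

```python
def validation_category(lab: str, scanner: str,
                        dev_labs: set[str], dev_scanners: set[str]) -> str:
    """Classify a validation cohort relative to the development set."""
    lab_seen = lab in dev_labs
    scanner_seen = scanner in dev_scanners
    if lab_seen and scanner_seen:
        return "internal"
    if scanner_seen:
        # Scanner present in development data, laboratory is new.
        return "partly external"
    if not lab_seen:
        # Neither laboratory nor scanner present in development data.
        return "fully external"
    # Known laboratory with an unseen scanner: not covered by the protocol.
    raise ValueError("category undefined for a known laboratory with a new scanner")
```

Patient-level independence is assumed to hold for all three categories and is therefore not an input to the function.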
Due to inter-observer variability among pathologists, reference standards established by pathologists vary across different validation cohorts. This complicates the assessment of the AI model for generalisation across cohorts, as any differences in observed performance can be partly attributed to differences in reference standards and partly attributed to imperfect AI generalisation to data originating from different clinical sites. In the case of the primary study objective of Gleason scoring, we have addressed this issue by having a representative subset of slides from each cohort be re-assessed by the lead pathologist (L.E.). The lead pathologist is highly experienced in urological pathology and has shown high concordance relative to other experienced uropathologists in several studies (Kweldam et al., 2016; Egevad et al., 2017; Bulten et al., 2022). For the secondary study objectives of cribriform cancer and perineural invasion detection, the assessments were conducted either by the lead pathologist or by other experienced (>25 years of clinical experience after residency) uropathologists (B.D., H.S.) whose concordance with the lead pathologist has been quantified in earlier studies (Egevad et al., 2021, 2023). This provides a consistent reference standard which will allow us to assess the technical generalisation performance of the model (without complete confounding between laboratory, scanner, and pathologist reference standards), in addition to large-scale evaluation relying on the varying reference standards provided by the local pathologists for each cohort.
Clinical and pathological characteristics of the included patients are summarised in Table 1 and detailed information on the slide scanning is provided in Table 2. Details on reference standards for each cohort with respect to grading are given in Table 3, and with respect to cribriform cancer and PNI are given in Table 4. Information on slides representing morphological subtypes is given in Table 5, and the number of slides for which immunohistochemistry (IHC) staining was performed in order to confirm the diagnosis is tabulated in Table 6. See Supplementary Appendix Section 3 for CONSORT diagrams summarising the data cohorts.
5. Inclusion and exclusion criteria
Provided below are the detailed criteria used to assess the eligibility of patients, individual biopsy slides, or WSIs for inclusion in this study.
5.1. Inclusion criterion
- Patients who underwent a prostate core needle biopsy were eligible.
5.2. Exclusion criteria
- Clinical information:
a) Patients with either slides or associated pathology information unavailable.
b) Slides lacking identifiers (IDs) preventing linkage to the pathology data.
c) Slides with identical IDs preventing unambiguous linkage to the pathology data.
d) Slides with mismatching GS and ISUP grade information.
e) Slides with mismatching information concerning malignancy and GS or ISUP grade (e.g. indicated benign but a GS is provided).
f) Slides with partial or erroneous GS reporting (e.g. <6, 4 + 0 or 1 + 1).
- Staining and slide preparation:
a) Samples not containing prostate tissue e.g. bladder biopsies, testicular biopsies.
b) Samples not stained with HE (e.g. IHC stains).
c) Initial cuts of tissue blocks deemed unsuitable by the pathologist for providing a diagnosis and requiring a recut.
d) Empty biopsy slides with no tissue on the glass.
- Slide integrity and annotation:
a) Slides with pen mark annotations that cover a large portion of the tissue, obscuring the underlying morphology.
b) Slides with pen mark annotations conflicting with the pathology diagnosis (e.g. there exists a pen mark annotation on the slide, but the slide is diagnosed as benign or vice versa). This only applies to the STHLM3 cohort (see Section 7.1), where the pen mark annotation process is known to be consistent for all samples.
c) Slides with pen mark annotations that result in the majority of the tissue being out of focus during scanning.
- Slide digitisation:
a) Earlier scans of the same slide on the same scanner instrument, assuming the latest WSI represents a successful rescanning due to e.g. earlier focus issues.
b) Corrupt WSI files which cannot be accessed with OpenSlide (Goode et al., 2013) or OpenPhi (Mulliqi et al., 2021).
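The label-consistency checks among the clinical information criteria above (items d to f) lend themselves to automation. The sketch below shows one possible way to flag such slides; the function names, the argument layout and the returned reason strings are assumptions made for illustration and do not describe the study's actual data pipeline. In particular, treating a malignant slide with a missing GS as a case of partial reporting is an assumption of the sketch.

```python
from typing import Optional

VALID_PATTERNS = {3, 4, 5}  # Gleason patterns accepted for malignant slides


def expected_isup(primary: int, secondary: int) -> int:
    """ISUP grade implied by a valid Gleason score."""
    total = primary + secondary
    if total == 6:
        return 1
    if total == 7:
        return 2 if primary == 3 else 3
    if total == 8:
        return 4
    return 5


def exclusion_reason(malignant: bool, primary: Optional[int],
                     secondary: Optional[int],
                     isup: Optional[int]) -> Optional[str]:
    """Return the violated criterion, or None if the labels are consistent."""
    has_grade = any(v not in (None, 0) for v in (primary, secondary, isup))
    if not malignant:
        # Criterion e: benign diagnosis but a grade is reported.
        return "e) malignancy and grade mismatch" if has_grade else None
    if primary not in VALID_PATTERNS or secondary not in VALID_PATTERNS:
        # Criterion f: e.g. 4 + 0, 1 + 1 or a missing pattern.
        return "f) partial or erroneous GS"
    if isup is not None and isup != expected_isup(primary, secondary):
        # Criterion d: GS and ISUP grade disagree.
        return "d) GS and ISUP grade mismatch"
    return None
```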
6. Data partitions
6.1. Requirements for data partition
We established a number of requirements to guide the inclusion, exclusion and partitioning of data into development, tuning and validation sets to account for several sources of potential bias in the training and validation of the model. We followed available guidelines and criteria for balanced and representative data partitions (Park and Han, 2018; Mongan, Moy and Kahn, 2020; Willemink et al., 2020; Varoquaux and Cheplygina, 2022) and arrived at the following set of requirements:
1) Representative sample selection: Ensure the data are representative of the diversity encountered in clinical practice by including multi-site cohorts with variations in scanning equipment (e.g. vendors, models, image formats), biopsy preparation (e.g. staining, tissue cutting), morphological heterogeneity (e.g. different Gleason scores and rare cancer subtypes) and patient demographics.
2) Representative sample size: Include a sufficiently large sample for development and validation to increase the probability of generalisability in the larger population.
3) Mitigate overfitting due to observer bias: Alleviate the possibility of overfitting or “over tweaking” of the model, which may be caused by excessive refinement of the model design aimed at maximising cross-validation performance in development data, since that can jeopardise generalisation outside the development cohorts. The issue can be mitigated by additional (external) tuning data cohorts serving as a less biased performance indicator during model development. It should be further ensured that the tuning cohorts are independent of model training (for example, criteria for early stopping of model training should be assessed only on the development data).
4) Ensure independence of specimens between data partitions: Each data partition (development, tuning, internal or external validation sets) should be independent of the others with no overlap of biopsies or patients.
5) Ensure independence of sample preparation process between data partitions: Sample external cohorts such that there is no overlap with respect to the clinical laboratories that prepared these cohorts and the development cohorts.
6) Ensure independence of the digitisation process between data partitions: Sample external cohorts such that there is no overlap with respect to the scanning device used for these cohorts and the development cohorts.
6.2. Predefined data partitions
The process of splitting the data cohorts into development, tuning, and internal and external validation sets was conducted adhering to the requirements for data partitions and is described below (see Fig. 1 for an overview). The characteristics of the data cohorts included in this study are summarised in Table 1 and described in detail in Section 7.
The development set was sampled from the following cohorts: Capio S:t Göran Hospital (STG), Radboud University Medical Center (RUMC), Stavanger University Hospital (SUH) and Stockholm3 (STHLM3). From the RUMC, STHLM3 and SUH cohorts, the patients who were not allocated to tuning or validation sets (see below) were assigned to the development set (approximately 80% of patients). Given the limited size and skewed grade distribution of the STG cohort, it was fully allocated into the development set. The development set covers several clinical laboratories and scanner devices as well as a large degree of variation in tissue morphology and the clinical characteristics of patients, in part due to the largest cohort, STHLM3, originating from a population-based screening trial (Requirements 1-2). Each of the development cohorts was further split into 10 cross-validation folds by randomly allocating patients to folds, stratified by the maximum slide level ISUP grade of each patient.
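The patient-level, grade-stratified allocation into cross-validation folds described above can be sketched as follows. This is an illustrative implementation using only the Python standard library, not the study's actual code; the function name, the fixed seed and the round-robin assignment within each grade stratum are assumptions. The input maps each patient to the maximum slide-level ISUP grade among that patient's slides.

```python
import random
from collections import defaultdict


def assign_folds(patient_max_isup: dict[str, int],
                 n_folds: int = 10, seed: int = 0) -> dict[str, int]:
    """Randomly allocate patients to folds, stratified by maximum ISUP grade."""
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for patient, grade in patient_max_isup.items():
        by_grade[grade].append(patient)

    folds = {}
    offset = 0  # carried across strata so that fold sizes stay balanced
    for grade in sorted(by_grade):
        patients = sorted(by_grade[grade])  # fixed order before shuffling
        rng.shuffle(patients)
        for i, patient in enumerate(patients):
            folds[patient] = (offset + i) % n_folds
        offset += len(patients)
    return folds
```

Because folds are assigned per patient rather than per slide, all slides of a patient fall into the same fold, which mirrors the patient-level splitting used for the main data partitions.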
The tuning set was sampled from the following cohorts: Karolinska University Hospital (KUH-1), RUMC and STHLM3. The entire KUH-1 cohort was assigned to tuning and represents a fully external cohort relative to the development set (i.e. different patients, laboratory and scanner). This set also corresponds to the European external validation cohort of the PANDA challenge (Bulten et al., 2022). The subsets of the RUMC and STHLM3 cohorts assigned to the tuning set represent internal data relative to the development set (i.e. different patients but the same laboratories and scanners) and correspond to the PANDA public test sets in Kaggle (i.e. the PANDA tuning sets). The tuning sets allow for evaluating the effects of model design changes on data that is independent of the development set, direct comparison with state-of-the-art models from PANDA, and in the case of KUH-1, assessing the generalisation performance of the model prior to design freeze (i.e. performance on data coming from different patients, laboratories, and scanners compared to the development data) (Requirement 3). A subset of slides belonging to the PANDA Swedish tuning set was allocated to the internal validation set for reasons related to patient stratification and the inclusion of specific subsets of interest in the internal validation (see below).
The internal validation set was sampled from the following cohorts: RUMC, STHLM3 and SUH, consisting of patients who were not part of the development or tuning sets but whose biopsies were obtained from the same clinical laboratories and scanned with the same scanners as the development and tuning set samples. The STHLM3 internal validation set includes the following subsets, supplemented with randomly sampled patients to bring the total fraction of patients assigned to tuning and validation to 20%: ImageBase (Egevad et al., 2017), Swedish private test set in Kaggle (i.e. PANDA Swedish internal validation set) (Bulten et al., 2022), perineural invasion multi-observer validation set (Kartasalo et al., 2022), and rare morphological subtypes set (Olsson et al., 2022). Including these samples as subsets of the internal validation set will facilitate (internal) comparisons with the results reported in these studies. The SUH internal validation set includes the following subsets, supplemented with randomly sampled patients to bring the fraction of patients assigned to validation to 20%: all patients (n=25) with multiple recuts of their biopsy tissue blocks, and patients (n=81) corresponding to a random selection of 119 slides stratified on ISUP grade (to be rescanned repeatedly over time for an AI temporal stability study). The STHLM3 subsets allocated into the internal validation set were selected based on being particularly valuable for the evaluation phase of the study, while the SUH subsets will be used as validation sets in upcoming follow-up studies involving the AI model developed here, and therefore cannot be assigned to the development set. The RUMC internal validation set includes the RUMC private test set in Kaggle (i.e. PANDA RUMC internal validation set) (Bulten et al., 2022), supplemented with randomly sampled patients to bring the total fraction of patients assigned to tuning and validation to 20%.
External validation cohorts are fully external relative to the development set (no overlap with respect to patients, laboratory, or scanner) or partly external (no overlap with respect to patients or laboratory, but digitisation performed using a scanner that is present also in the development set). Fully external validation set cohorts include Aarhus University Hospital (AUH), Karolinska University Hospital morphological subtypes (KUH-2), Mehiläinen Länsi-Pohja (MLP), Medical University of Lodz (MUL), Synlab Switzerland (SCH), Synlab Finland (SFI), Synlab France (SFR), Spear Prostate Biopsy 2020 (SPROB20), University Hospital Cologne (UKK), Hospital Wiener Neustadt (WNS). Partly external validation set cohorts include: Aquesta Uropathology morphological subtypes (AQ), partially scanned on a scanner present in the development set and partially scanned on an external scanner. The external nature of the validation set cohorts fulfils Requirements 4-6.
All data splits were performed on patient level, that is, all slides and resulting WSIs from a given patient were allocated to the same data partition in order to avoid information leakage between development and validation sets. If a patient was biopsied on several occasions, all biopsies were included and allocated together. Any samples lacking patient identifiers were assigned to development data to avoid the risk of slides from any patients ending up in both development and evaluation cohorts.
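Patient-level independence between partitions can be verified mechanically once each partition is represented as a set of patient identifiers. The following minimal sketch reports any identifiers shared between pairs of partitions; the function name and the partition labels in the usage note are illustrative assumptions.

```python
from itertools import combinations


def overlapping_patients(
    partitions: dict[str, set[str]],
) -> dict[tuple[str, str], set[str]]:
    """Return patient IDs shared between any two partitions.

    An empty result means that the partitions are disjoint on patient level.
    """
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(partitions.items(), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps
```

A call such as `overlapping_patients({"development": dev_ids, "tuning": tune_ids, "validation": val_ids})` should return an empty dictionary for a leakage-free split.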
Subsets of the slides included in this study have been scanned multiple times. If the same slide had been rescanned multiple times on the same individual scanner (i.e. the same physical device), we only kept the WSI with the latest scanning date, assuming the rescanning was due to e.g. initially poor focus or other scanning issues. Subsets of the STG, STHLM3 and MUL cohorts were rescanned with multiple different scanners (see Table 2). To avoid biasing the evaluation towards these slides that appear in the dataset multiple times, we will only include one WSI per slide in the validation sets. For STHLM3, we will randomly select one WSI for each slide to be evaluated, and for MUL, we will utilise WSIs from the Grundium Ocus40 scanner, excluding those on the Philips UFS scanner. This ensures that the MUL cohort remains entirely external relative to the development data, considering that the STHLM3 cohort was partly digitised on the same Philips UFS instrument. The repeated scans will, however, be used during AI model development as an augmentation technique (except for the Grundium Ocus40 which is kept as an external scanner for validation), and for a separate cross-scanner reproducibility analysis (see Section 8).
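The two selection steps described above (keeping only the latest WSI per slide and physical scanner, then choosing a single WSI per slide for validation) can be sketched as follows. The `Scan` record, its field names and the function names are hypothetical, and the random choice corresponds to the STHLM3 rule only; a cohort-specific rule such as the scanner preference used for MUL would replace the random choice.

```python
import random
from datetime import date
from typing import NamedTuple


class Scan(NamedTuple):
    slide_id: str
    scanner_id: str   # identifies the physical scanner device
    scan_date: date
    path: str


def latest_per_scanner(scans: list[Scan]) -> list[Scan]:
    """Keep only the most recent WSI for each (slide, scanner) pair."""
    latest: dict[tuple[str, str], Scan] = {}
    for scan in scans:
        key = (scan.slide_id, scan.scanner_id)
        if key not in latest or scan.scan_date > latest[key].scan_date:
            latest[key] = scan
    return list(latest.values())


def one_per_slide(scans: list[Scan], seed: int = 0) -> list[Scan]:
    """Randomly select a single WSI per slide, ordered by slide ID."""
    rng = random.Random(seed)
    by_slide: dict[str, list[Scan]] = {}
    for scan in scans:
        by_slide.setdefault(scan.slide_id, []).append(scan)
    return [rng.choice(by_slide[slide_id]) for slide_id in sorted(by_slide)]
```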
7. Data cohorts
7.1. Development, tuning and internal validation data cohorts
7.1.1. Karolinska University Hospital (KUH-1)
The KUH-1 samples were collected at the Department of Pathology, Karolinska University Hospital in Solna, Sweden in 2018. Among the cases assessed by L.E. during 2018, we included all positive slides of all patients diagnosed with ISUP grade 2-5 cancer, all positive slides from a random selection of patients diagnosed with ISUP grade 1 cancer, and all slides from a random selection of patients with a negative diagnosis. Patients underwent systematic transrectal biopsies in approximately one third of the cases, and magnetic resonance imaging (MRI) targeted or combined biopsies in approximately two thirds of the cases. Slides typically contain one core, sectioned at two levels. This cohort has been used as an external validation set in previous studies (Ström et al., 2020; Bulten et al., 2022).
7.1.1.1. Reference standard protocol
All cases were assessed by the lead pathologist (L.E.) using a microscope to determine the GS and cancer extent per slide, as well as the ISUP grade per slide and per patient. The linear cancer extent was generally measured from end to end in cases with discontinuous cancer and it was reported on a per-cut level.
7.1.2. Radboud University Medical Center (RUMC)
The RUMC samples were collected at the Radboud University Medical Center in Nijmegen, the Netherlands from January 2012 to December 2017 (Bulten et al., 2020). Patients were sampled randomly, stratified by the highest reported GS in the pathology reports, and the slide with the most aggressive part of the tumour was included for each patient. Additionally, a group of patients with only benign biopsies were randomly sampled. Patients generally underwent MRI-targeted transrectal biopsy. The data underwent additional refinement in preparation for the PANDA Kaggle challenge (Bulten et al., 2022): only one core, sectioned at one level, was retained per WSI, the background was masked to hide most of the markings made on the glass, and the images were converted into .tiff format (JPEG compression, quality 70). For the purposes of PANDA, the cohort was partitioned into three sets (development, tuning, and internal validation), stratified by patient and the highest Gleason pattern in the biopsy.
7.1.2.1. Reference standard protocol
The reference standard for all cases in the RUMC development set was determined based on the original pathology reports. Because each slide contained multiple biopsy cores, trained non-experts digitally outlined the individual cores, allowing them to be partitioned into separate WSIs, and assigned core-level GS based on the pathology reports. Cases with inconclusive pathology reports were referred for a second review, and if no match could be made these cases were discarded (Bulten et al., 2020).
Subsets of the cohort underwent additional re-assessments as follows:
The PANDA RUMC tuning set (n=195, corresponds to our RUMC tuning set) and the PANDA RUMC internal validation set (n=333, part of our RUMC internal validation set) were assessed in three rounds. In the first round, three uropathologists individually graded the cases digitally, providing a GS per slide. A majority vote was taken for cases where an agreement was reached on the ISUP grade but there was a discrepancy in the Gleason patterns, and cases where two uropathologists agreed and the third one had a maximum deviation of one ISUP grade. In the second round, all the cases that did not achieve consensus were re-graded by the uropathologist whose grade differed from the others, followed by pooling of all the assessments and discussion in a consensus meeting in the third round. The GS was reported per slide.
A subset of slides (n=66) from the RUMC internal validation cohort was randomly selected, stratified by the ISUP grade, for re-assessment by the lead pathologist (L.E.). This re-assessment was conducted digitally on Cytomine (Marée et al., 2016) using 3DHISTECH WSIs (.mrxs converted to .tiff) to report the GS per slide.
7.1.3. Capio S:t Göran Hospital (STG)
The STG samples were collected at Capio S:t Göran Hospital in Stockholm, Sweden from 2016 to 2017. We included a random selection of slides with an enrichment for high-grade cancer. Patients underwent transrectal biopsy, and slides typically contain one core, sectioned at two levels. This cohort was also part of the development set in a previous study (Ström et al., 2020).
7.1.3.1. Reference standard protocol
All cases were assessed by the lead pathologist (L.E.) using a microscope to provide GS, ISUP grade, and cancer extent on a per-slide level. The linear cancer extent was generally measured from end to end in cases with discontinuous cancer and it was reported on a per-cut level.
7.1.4. Stockholm3 (STHLM3)
The STHLM3 samples were collected in a population-based clinical trial (ISRCTN84445406) (Grönberg et al., 2015) from 2012 to 2015 in Stockholm, Sweden. Histological sample preparation was performed at Histocenter, Gothenburg, Sweden, and the samples were assessed at the Department of Pathology, Karolinska University Hospital in Stockholm. Patients underwent 10-12 core systematic transrectal biopsies and slides usually contain one core, sectioned at two levels. Subsets of the digitised samples have been used as development and internal validation sets in previous studies (Ström et al., 2020; Bulten et al., 2022; Ji et al., 2022; Kartasalo et al., 2022; Olsson et al., 2022). Patient and slide selection, retrieval and digitisation took place on five occasions between 2014 and 2023 (see Table 2), as below:
2014: All cores from the first 500 patients diagnosed with prostate cancer in the STHLM3 trial were scanned on a Hamamatsu NanoZoomer 2.0-HT.
2017-2019: All patients with at least one core graded as GS 4 + 4 or 5 + 5 and 497 randomly selected patients with at least one core graded as 3 + 3 were considered. From each of these patients, we included all positive cores and a randomly selected negative core. Finally, we randomly selected 139 cancer-free patients from whom we included one randomly selected core. Additionally, we added all cores which were indicated to have PNI and had not been scanned earlier. The cores were scanned on an Aperio AT2.
2018-2019: The cores of a random selection of patients were scanned on a Hamamatsu NanoZoomer XR.
2019-2020: The cores of a random selection of patients were scanned on the Philips IntelliSite Ultra Fast Scanner (UFS).
2023: Patients belonging to the PANDA challenge Swedish public and private validation sets were scanned on the Grundium Ocus40.
2023: Initially, cores with < 4 millimetres of cancer were excluded to ensure sufficient cancer tissue for future molecular profiling of the samples. Among the remaining patients, 50% of those with ISUP 1 or ISUP 2 (patient level ISUP) were randomly selected for inclusion, while all patients with ISUP 3-5 were included for scanning on the Grundium Ocus40.
7.1.4.1. Reference standard protocol
All cases were assessed by the lead pathologist (L.E.) using a microscope to obtain the GS, the ISUP grade, cancer extent and PNI on a per-slide level. The linear cancer extent was generally measured from end to end in cases with discontinuous cancer and reported on a per-cut level. However, in cases with 1 or 2 cores infiltrated by low-grade discontinuous cancer with a benign gap exceeding 3 millimetres, the benign tissue was subtracted in the reporting of total cancer extent.
Subsets of the cohort underwent additional re-assessments as follows:
A subset of slides (n=212) from the STHLM3 internal validation cohort underwent a second review to construct a reference standard for the PANDA Swedish internal validation set. Slides initially indicated as benign according to the original reference standard were not re-reviewed, while cases indicated as malignant were divided between two uropathologists (B.D. and H.S.), each reviewing 100 slides blinded to the original review. In the case of agreement between the initial and the second review, the consensus ISUP grade was assigned to the case. In case of disagreement, a third uropathologist (T.T.) reviewed the case. For cases that were indicated as malignant by all pathologists, the final ISUP grade was assigned according to 2/3 consensus. If all three reviews were in disagreement, the case was excluded from the internal validation set. Any cases indicated as benign in the second or third review were excluded from the PANDA Swedish internal validation set. The re-assessment was conducted digitally on Cytomine using Hamamatsu and Aperio WSIs (.ndpi and .svs converted to .tiff) as described in (Bulten et al., 2022).
A subset of slides (n=24) from the STHLM3 internal validation cohort was additionally assessed by the lead pathologist (L.E.) for specific rare morphologies (see Table 5) using a microscope. This set has been used as validation data in a previous study (Olsson et al., 2022).
A subset of slides (n=87) from the STHLM3 internal validation cohort, representing the ImageBase set (Egevad et al., 2017), was additionally assessed by an expert panel of uropathologists (n=23). The assessment was conducted using digital micrographs. This set was previously used as an internal validation set (Ström et al., 2020).
A subset of slides (n=702) from the STHLM3 development and internal validation cohorts was digitally assessed for cribriform cancer as described in (Egevad et al., 2023). To arrive at this selection, we first enriched Gleason pattern 4 tissue by randomly selecting one core per combination of patient and ISUP grade among all cores with ISUP grades 3-5. To maintain some representation of GS 3+4 biopsies, we randomly selected 86 additional cores with one core per patient from the set of all cores with ISUP grade 2. The slides were assessed by the lead pathologist (L.E.) on Cytomine using Hamamatsu (.ndpi) and Aperio (.svs) WSIs to create pixel-wise annotations of areas with cribriform cancer. The pathologist could also indicate uncertain cases with a borderline category.
A subset of slides positive for cribriform cancer (n=152) and a random selection of slides negative for cribriform cancer (n=152) according to the assessment by L.E. were additionally assessed by an expert panel of uropathologists (n=9) as described in (Egevad et al., 2023). The pathologists assessed the presence of cribriform cancer on slide level on Cytomine using Hamamatsu (.ndpi) and Aperio (.svs) WSIs. The pathologists were blinded to the distribution of positive or negative slides and to each other’s assessments.
All slides positive for PNI (n=485) in the STHLM3 development and internal validation cohorts were digitally re-assessed as described in (Kartasalo et al., 2022). The slides were assessed by the lead pathologist (L.E.) in QuPath (Bankhead et al., 2017) using Hamamatsu (.ndpi) and Aperio (.svs) WSIs to create pixel-wise annotations of areas of PNI.
A subset of slides positive for PNI (n=106) and a random selection of slides negative for PNI (n=106) according to the assessment by L.E. were additionally assessed by an expert panel of uropathologists (n=4) as described in (Egevad et al., 2021). The pathologists assessed the presence of PNI on slide level on Cytomine using Hamamatsu (.ndpi) and Aperio (.svs) WSIs. The pathologists were blinded to the distribution of positive or negative slides and to each other’s assessments. The pathologists could also indicate uncertain cases with borderline categories.
7.1.5. Stavanger University Hospital (SUH)
The SUH samples represent consecutive cases collected from routine diagnostics at the Department of Pathology, Stavanger University Hospital in Stavanger, Norway from December 2016 to March 2018. Biopsies were taken at the Department of Urology in Stavanger University Hospital and other private urological clinics at the Stavanger Urological Center. Patients primarily underwent systematic transrectal biopsies, although some received MRI-targeted biopsies, either alone or combined with systematic biopsy. Slides typically contain two cores from the same anatomical location, sectioned at two levels. A subset of the SUH cohort has been used as an external validation set in previous studies (Ji et al., 2022; Olsson et al., 2022).
7.1.5.1. Reference standard protocol
The reference standard was obtained from the original pathology reports from the clinical routine. Seven uropathologists and seven general pathologists assessed the slides microscopically reporting the GS, ISUP grade, Gleason pattern 4 percentage, cancer extent, biopsy length, PNI, fatty tissue infiltration (FTI), and additional stainings (e.g. IHC) on the slide level. The linear cancer extent was generally measured from end to end in cases with discontinuous cancer and it was reported on a per-cut level.
Subsets of the SUH cohort underwent additional re-assessments as follows:
A subset of slides (n=66) from the SUH internal validation cohort was randomly selected and stratified by ISUP grade for re-assessment by the lead pathologist (L.E.). This re-assessment was conducted digitally on Cytomine using Hamamatsu WSIs (.ndpi) to report the GS per slide.
A subset of slides (n=332) with Gleason pattern 4 tissue from the SUH development and internal validation cohorts was initially assessed by a uropathologist (A.B.) for potential cribriform cancer using QuPath. We then randomly selected at most 90 positive, 30 borderline and 30 negative slides from the development cohort and at most 30 positive, 10 borderline and 10 negative slides from the internal validation cohort to be re-assessed by the lead pathologist (L.E.), resulting in 200 slides. This re-assessment was conducted digitally on Cytomine using Hamamatsu (.ndpi) WSIs to report cribriform cancer per slide. The pathologist could also indicate uncertain cases with a borderline category.
All slides from cases reported as positive for PNI in the SUH development and internal validation cohorts were initially assessed by a uropathologist (A.B.) for potential PNI using a microscope. We then randomly selected at most 25 positive and 5 negative slides per ISUP grade from the development cohort, and at most 8 positive and 2 negative slides per ISUP grade from the internal validation cohort to be re-assessed by the lead pathologist (L.E.), resulting in 185 slides. This re-assessment was conducted digitally on Cytomine using Hamamatsu (.ndpi) WSIs to report PNI per slide. The pathologist could also indicate uncertain cases with a borderline category.
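The capped, stratified random selections described above (for example, at most 25 positive and 5 negative slides per ISUP grade) can be illustrated with a minimal Python sketch. The function name, the toy slide identifiers, the seed and the data layout are hypothetical and are not taken from the study code; the sketch only shows the sampling principle.

```python
import random
from collections import defaultdict

def capped_stratified_sample(slides, caps, seed=0):
    """Randomly draw at most caps[stratum] slides from each stratum.

    slides: iterable of (slide_id, stratum) pairs.
    caps:   dict mapping stratum -> maximum number of slides to draw.
    Strata that are absent from caps are not sampled.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for slide_id, stratum in slides:
        by_stratum[stratum].append(slide_id)
    selected = {}
    for stratum, cap in caps.items():
        pool = sorted(by_stratum.get(stratum, []))  # fixed order for reproducibility
        selected[stratum] = rng.sample(pool, min(cap, len(pool)))
    return selected

# Hypothetical toy data: each stratum is an (ISUP grade, PNI status) pair.
slides = [
    (f"s{i}", (1 + i % 5, "positive" if i % 3 else "negative"))
    for i in range(300)
]
# Caps mirroring the development cohort rule: 25 positive and 5 negative per grade.
caps = {
    (grade, status): cap
    for grade in range(1, 6)
    for status, cap in (("positive", 25), ("negative", 5))
}
sample = capped_stratified_sample(slides, caps)
```

Taking the minimum of the cap and the pool size reflects the "at most" wording: strata with fewer available slides than the cap contribute all of their slides, which is why the realised totals can fall below the theoretical maximum.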
7.2. External validation cohorts
7.2.1. Aichi Medical University (AMU)
The AMU samples were collected at the Aichi Medical University in Nagakute, Japan from 2020 to 2023. Samples were selected to include cribriform prostate cancer cases and non-cribriform cases. Cribriform cases were chosen sequentially, while non-cribriform cases were selected among cases containing Gleason pattern 4 and age-adjusted to match the cribriform cases. Patients generally underwent systematic transrectal biopsy, with only a few undergoing MRI-targeted biopsy. Slides typically contain several cores, sectioned at several levels.
7.2.1.1. Reference standard protocol
All cases were assessed by a uropathologist (T.T.) initially using a microscope and then confirmed digitally with the NDP.View software using Hamamatsu WSIs (.ndpi). The presence or absence of cribriform prostate cancer was reported on slide level and GS was reported on patient level.
7.2.2. Aquesta Uropathology morphological subtypes (AQ)
The AQ cases were collected at the Aquesta Specialised Uropathology laboratory in Toowong, Australia from 2009 to 2023. The biopsies were performed in private hospitals and urology clinics in the state of Queensland, Australia. Slides were specifically selected to represent rare morphologies, such as benign mimickers of prostate cancer, which are typically hard to diagnose in routine pathology. Patients generally underwent MRI-targeted transrectal biopsies, and each slide has two cores, sectioned at two levels.
7.2.2.1. Reference standard protocol
A uropathologist (H.S.) assessed the slides microscopically and reported the GS, ISUP grade, additional stainings (e.g. IHC), and the presence or absence of specific morphological subtype categories on slide level (see Table 5). Slides representing benign mimickers were microscopically re-assessed by the lead pathologist (L.E.).
7.2.3. Aarhus University Hospital (AUH)
The AUH samples were part of the PRIMA clinical trial conducted at the Aarhus University Hospital in Aarhus, Denmark from January 2018 to December 2021 (Fredsøe et al., 2023). Histopathology assessment was conducted at the Department of Pathology, Aarhus University Hospital, Aarhus, Denmark. In this trial, men aged 50-59 years with elevated prostate-specific antigen (PSA) (3-10 ng/ml) and/or positive STHLM3 test (defined as STHLM3 score equal to or above 11%) and MRI of PIRADS 3-5 were referred to MRI-targeted transrectal biopsy. Out of 117 patients who underwent the biopsy procedure, the pathologist selected slides based on histopathological features with the aim of a uniform distribution of ISUP grades. Slides typically contain two cores, sectioned at three levels. This cohort was used as an external validation set in a previous study (Ji et al., 2022).
7.2.3.1. Reference standard protocol
All cases were assessed by a uropathologist (B.P.U.) microscopically and the GS, the ISUP grade, cancer extent and biopsy length were reported on the slide level.
Subsets of the AUH cohort underwent additional re-assessments as follows:
A subset of slides (n=41) was randomly selected, stratified by the ISUP grade, for re-assessment by the lead pathologist (L.E.). This re-assessment was conducted digitally on Cytomine using Hamamatsu WSIs (.ndpi) to report the GS per slide.
7.2.4. Karolinska University Hospital morphological subtypes (KUH-2)
The KUH-2 samples were collected at the Department of Pathology, Karolinska University Hospital in Solna, Sweden in 2022. The biopsy procedure and number of tissue sections per slide adhere to the KUH-1 cohort. Similarly to the AQ cohort, these samples were specifically selected to represent cases that are typically challenging to diagnose in clinical practice, such as rare disease morphologies and benign mimickers. This cohort was used as an external validation set in a previous study (Olsson et al., 2022).
7.2.4.1. Reference standard protocol
The reference standard protocol for the KUH-2 cohort adheres to KUH-1, except for additional reporting of the presence or absence of specific morphological subtype categories, assessed by the lead pathologist (L.E.) on slide level (see Table 5).
7.2.5. Mehiläinen Länsi-Pohja (MLP)
The MLP samples represent consecutive cases from routine pathology at the Mehiläinen Länsi-Pohja Hospital in Kemi, Finland from 2016 to 2019. Patients underwent systematic transrectal biopsies, and the biopsies were grouped by anatomical location (left and right), typically with six cores per location. Slides typically contain one core, sectioned at two to three levels.
7.2.5.1. Reference standard protocol
The reference standard was obtained from routine assessments done by several pathologists using a microscope to determine the GS, the ISUP grade, cancer extent and biopsy length per patient or per anatomical location (i.e. a set of biopsy cores assessed together).
Subsets of the MLP cohort underwent additional re-assessments as follows:
A subset of slides (n=66) was randomly selected, stratified by the ISUP grade, for re-assessment by the lead pathologist (L.E.). The patient level ISUP grade was used for stratification, due to missing slide level grading. This re-assessment was conducted digitally on Cytomine using 3DHISTECH WSIs (.mrxs) to report the GS per slide.
7.2.6. Medical University of Lodz (MUL)
The MUL samples represent consecutive cases from routine pathology at the 1st Department of Urology, University Clinical Hospital of the Military Academy of Medicine - Central Veterans Hospital, Medical University of Lodz, Lodz, Poland from January 2018 to March 2019. Histopathological assessment was conducted at the Department of Pathology, Department of Oncology, Medical University of Lodz, Lodz, Poland. Patients underwent systematic transrectal biopsy and slides typically contain one core, sectioned at four to seven levels.
7.2.6.1. Reference standard protocol
The reference standard was determined based on an initial assessment by a single pathologist (M.B.) and a second review by a more experienced pathologist (R.K.). Both pathologists have a specialisation in surgical pathology and are currently specialising in uropathology. The pathologists assessed the cases using a microscope and reported the GS, the ISUP grade, total cancer percentage and Gleason pattern 4 and 5 percentages on the slide level.
Subsets of the MUL cohort underwent additional re-assessments as follows:
A subset of slides (n=66) was randomly selected, stratified by ISUP grade, for re-assessment by the lead pathologist (L.E.). This re-assessment was conducted digitally on Cytomine using Grundium WSIs (.svs) to report the GS per slide.
All slides containing Gleason pattern 4 (n=276) were initially assessed for potential cribriform cancer by a uropathologist (A.B.). The assessment was conducted digitally on Cytomine using Grundium WSIs (.svs) to report cribriform cancer per slide and mark the positive and borderline foci. All foci were then re-assessed on Cytomine by the lead pathologist (L.E.).
The slides (n=276) assessed for cribriform cancer were also initially assessed for potential PNI by a uropathologist (A.B.). The assessment was conducted on Cytomine using Grundium WSIs (.svs) to report PNI per slide and mark the positive and borderline foci. All foci were then re-assessed on Cytomine by the lead pathologist (L.E.).
7.2.7. Synlab Switzerland (SCH)
The SCH samples represent consecutive cases from routine diagnoses at the Argot Laboratory in Lausanne, Switzerland from January 2020 to December 2020. Patients underwent systematic, MRI-targeted or combined transrectal biopsies. Slides typically contain one core, sectioned at two levels. A varying number of cores were typically obtained from a varying number of anatomical locations.
7.2.7.1. Reference standard protocol
The reference standard was determined based on the pathology reports from routine diagnostics. Using a microscope, the pathologists reported the GS, the ISUP grade, cancer extent, biopsy length, Gleason pattern 4 percentage, cribriform cancer, PNI, high-grade prostatic intraepithelial neoplasia (HGPIN) and possible IHC staining per anatomical location (i.e. a set of biopsy cores assessed together) and per patient.
Subsets of the SCH cohort underwent additional re-assessments as follows:
A subset of slides (n=72) was randomly selected, stratified by the ISUP grade and anatomical location for re-assessment by the lead pathologist (L.E.). This re-assessment was conducted digitally on Cytomine using Philips WSIs (.isyntax converted to .tiff) to report the GS per slide.
A subset of slides (n=56) was digitally re-assessed for cribriform cancer by a uropathologist (H.S.). We selected all positive anatomical locations and a random selection of 6 negative anatomical locations with Gleason pattern 4 tissue and included all slides from these locations. This re-assessment was conducted digitally on Cytomine using Philips WSIs (.isyntax converted to .tiff) to report cribriform cancer per slide. The pathologist could also indicate uncertain cases with a borderline category.
A subset of slides (n=94) was digitally re-assessed for PNI by a uropathologist (B.D.). We randomly selected 12 positive and 5 negative anatomical locations per ISUP grade and included all slides from these locations. This re-assessment was conducted digitally on Cytomine using Philips WSIs (.isyntax converted to .tiff) to report PNI per slide. The pathologist could also indicate uncertain cases with a borderline category.
7.2.8. Synlab Finland (SFI)
The SFI samples represent consecutive cases from routine diagnostics at the Synlab Laboratory in Helsinki, Finland from January 2020 to February 2021. Patients underwent systematic, MRI-targeted or combined transrectal biopsies. Slides typically contain two cores, sectioned at five to six levels. A varying number of cores were typically obtained from a varying number of anatomical locations.
7.2.8.1. Reference standard protocol
The reference standard was determined based on the pathology reports from routine diagnostics. Using a microscope, the pathologists reported the GS, the ISUP grade, cancer extent, biopsy length, Gleason pattern 4 percentage, cribriform cancer, PNI, HGPIN and possible IHC staining per anatomical location (i.e. a set of biopsy cores assessed together) and in some cases per patient.
Subsets of the SFI cohort underwent additional re-assessments as follows:
A subset of slides (n=67) was randomly selected, stratified by the ISUP grade and anatomical location for re-assessment by the lead pathologist (L.E.). This re-assessment was conducted digitally on Cytomine using Philips WSIs (.isyntax converted to .tiff) to report the GS per slide.
7.2.9. Synlab France (SFR)
The SFR samples represent consecutive cases from routine diagnostics at the Technipath-Synlab Medical Laboratory in Dommartin, Rhône, France from September 2020 to December 2020. Patients underwent systematic, MRI-targeted or combined transrectal biopsies. Slides usually contain two to three cores from the same anatomical location, sectioned at two levels.
7.2.9.1. Reference standard protocol
The reference standard was determined based on the pathology reports from routine diagnostics. Pathologists using a microscope reported the GS, the ISUP grade, cancer extent, biopsy length, Gleason pattern 4 percentage, cribriform cancer, PNI, HGPIN and possible IHC staining per anatomical location (i.e. slide) and in some cases per patient.
Subsets of the SFR cohort underwent additional re-assessments as follows:
A subset of slides (n=49) was randomly selected, stratified by the ISUP grade and anatomical location for re-assessment by the lead pathologist (L.E.). This re-assessment was conducted digitally on Cytomine using Philips WSIs (.isyntax converted to .tiff) to report the GS per slide.
7.2.10. Spear Prostate Biopsy 2020 (SPROB20)
The SPROB20 samples were collected at Uppsala University Hospital, Uppsala, Sweden from 2015 to 2018. Patients underwent targeted transrectal biopsies. Slides typically contain one core, sectioned at one level. This cohort is publicly available at the AIDA Data Hub (Walhagen et al., 2020).
7.2.10.1. Reference standard protocol
The reference standard was obtained from the clinical routine. The pathologists assessed the slides microscopically and reported the ISUP grade at the patient level in two ways: as the maximum and as the average of the slide level ISUP grades. The underlying slide-level ISUP grades were not provided on the AIDA Data Hub.
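The two patient-level summaries mentioned above (maximum and average of the slide-level ISUP grades) can be illustrated with a minimal Python sketch. The function name and the toy data are hypothetical. Coding benign slides as 0 and including them in the average are assumptions, since the source dataset does not state how benign slides were handled.

```python
from collections import defaultdict
from statistics import mean

def patient_level_isup(slide_grades):
    """Aggregate slide-level ISUP grades to patient level.

    slide_grades: iterable of (patient_id, isup_grade) pairs, with benign
    slides coded as 0 (an assumption made for this illustration).
    Returns {patient_id: {"max": ..., "mean": ...}}.
    """
    per_patient = defaultdict(list)
    for patient_id, grade in slide_grades:
        per_patient[patient_id].append(grade)
    return {
        pid: {"max": max(grades), "mean": mean(grades)}
        for pid, grades in per_patient.items()
    }

# Hypothetical toy data: patient P1 has three slides, patient P2 has two.
toy = [("P1", 0), ("P1", 2), ("P1", 4), ("P2", 1), ("P2", 1)]
summary = patient_level_isup(toy)
```

The maximum reflects the worst finding for the patient, whereas the average is pulled down by low-grade or benign slides, so the two summaries can differ substantially for the same patient.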
Subsets of the SPROB20 cohort underwent additional re-assessments as follows:
A subset of slides (n=50) was randomly selected, stratified by ISUP grade and patient, for re-assessment by the lead pathologist (L.E.). This re-assessment was conducted digitally on Cytomine using Hamamatsu WSIs (.ndpi converted to .tiff) to report the GS per slide.
7.2.11. University Hospital Cologne (UKK)
The UKK samples represent consecutive cases from the Institute of Pathology at the University Hospital Cologne in Cologne, Germany. Patients underwent combined systematic and MRI-targeted transrectal biopsies. Slides typically contain one core, sectioned at three levels. The publicly available subset of samples was randomly selected and stratified by the ISUP grade, including ten samples per ISUP grade. This cohort was obtained from a publicly available dataset which was part of the development and validation sets in an earlier study (Tolkach et al., 2023). The WSIs were converted from JPEG2000 compressed OME-TIFF format via an intermediate raw Zarr format to JPEG compressed (quality 80) generic pyramidal TIFF format for OpenSlide compatibility using the bioformats2raw (v. 0.9.3), raw2ometiff (v. 0.7.1) and libvips (v. 8.9.1) converters.
7.2.11.1. Reference standard protocol
The reference standard was determined digitally by a panel of 10 different pathologists from Austria, Germany, Israel, Japan, the Netherlands, Russia and the United States. All pathologists reported the ISUP grade per slide and the final grade was obtained as the majority vote. A consensus was considered reached in cases where the majority ISUP grade had at least six votes.
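The panel rule described above (majority vote, with consensus requiring at least six votes for the majority grade) can be sketched in Python as follows. The function name is hypothetical, and returning no grade when two grades tie for the most votes is an assumption, since tie handling is not described here.

```python
from collections import Counter

def panel_majority(votes, min_votes=6):
    """Return (majority_grade, consensus_reached) for one slide.

    votes: list of ISUP grades, one per panel member.
    If two or more grades tie for the most votes, no majority grade is
    returned (an assumption; tie handling is not specified in the protocol).
    """
    ranked = Counter(votes).most_common()
    top_grade, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if tied:
        return None, False
    return top_grade, top_count >= min_votes

clear = panel_majority([2, 2, 2, 2, 2, 2, 3, 3, 3, 1])   # six votes for grade 2
weak = panel_majority([2, 2, 2, 2, 2, 3, 3, 3, 3, 1])    # only five votes for grade 2
split = panel_majority([2, 2, 2, 2, 2, 3, 3, 3, 3, 3])   # five votes each
```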
7.2.12. Hospital Wiener Neustadt (WNS)
The WNS samples represent consecutive cases from the Hospital Wiener Neustadt in Wiener Neustadt, Austria. Patients underwent combined systematic and MRI-targeted transrectal biopsies. Slides typically contain one core, sectioned at one level. The publicly available subset of samples was randomly selected and stratified by the ISUP grade, including ten samples per ISUP grade. This cohort was obtained from a publicly available dataset which was part of the development and validation sets in an earlier study (Tolkach et al., 2023). The WSIs were converted from JPEG2000 compressed OME-TIFF format via an intermediate raw Zarr format to JPEG compressed (quality 80) generic pyramidal TIFF format for OpenSlide compatibility using the bioformats2raw (v. 0.9.3), raw2ometiff (v. 0.7.1) and libvips (v. 8.9.1) converters.
7.2.12.1. Reference standard protocol
The reference standard was determined digitally by a panel of 11 different pathologists from Austria, Germany, Israel, Japan, the Netherlands, Russia and the United States. All pathologists reported the ISUP grade per slide and the final grade was obtained as the majority vote. A consensus was considered reached in cases where the majority ISUP grade had at least six votes.
8. Statistical analyses
8.1. Overview of statistical analyses
8.1.1. Primary analysis: Diagnosis and Gleason scoring
Internal and external validation against the original cohort-specific reference standards
Subgroup analyses
Evaluate performance across different age groups
Evaluate performance on systematic vs. targeted biopsies
Evaluate performance on non-treated patients vs. patients treated for benign prostatic hyperplasia prior to biopsy
Evaluate performance on morphological subtypes
Evaluate performance on cases requiring vs. not requiring IHC staining
Evaluate performance compared to the current state-of-the-art AI systems
Sensitivity analyses
Cross-scanner consistency analyses
Compare the AI system vs. individual pathologist panel members
Internal and external validation against uniform reference standard by the lead pathologist
Blinded re-assessment of slides with marked errors
8.1.2. Secondary analysis: Cancer extent prediction
Internal and external validation against the original cohort-specific reference standards
Subgroup analyses
Evaluate performance across different age groups
Evaluate performance on systematic vs. targeted biopsies
Evaluate performance on non-treated patients vs. patients treated for benign prostatic hyperplasia prior to biopsy
Sensitivity analyses
Cross-scanner consistency analyses
8.1.3. Secondary analysis: Cribriform cancer detection
Internal and external validation against the original cohort-specific reference standards
Subgroup analyses
Evaluate performance across different age groups
Evaluate performance on systematic vs. targeted biopsies
Evaluate performance on non-treated patients vs. patients treated for benign prostatic hyperplasia prior to biopsy
Sensitivity analyses
Cross-scanner consistency analyses
Compare the AI system vs. individual pathologist panel members
Re-assessment excluding borderline slides
8.1.4. Secondary analysis: Perineural invasion detection
Internal and external validation against the original cohort-specific reference standards
Subgroup analyses
Evaluate performance across different age groups
Evaluate performance on systematic vs. targeted biopsies
Evaluate performance on non-treated patients vs. patients treated for benign prostatic hyperplasia prior to biopsy
Sensitivity analyses
Cross-scanner consistency analyses
Compare the AI system vs. individual pathologist panel members
Re-assessment excluding borderline slides
8.1.5. Exploratory analyses
Evaluate visualisations of the AI output
Evaluate the impact of tissue segmentation algorithms
Evaluate end-to-end vs. transfer-learning-based models
Evaluate the impact of physical colour calibration
8.2. Details of statistical analyses
Primary analysis: Diagnosis and Gleason scoring
We will quantify the concordance of the AI system’s cancer diagnosis (positive/negative), Gleason score and ISUP grade with the reference standards in the tuning, internal validation and external validation cohorts using the metrics described below. The analysis will be conducted on slide level (AQ, AUH, KUH-1, KUH-2, MUL, RUMC, SFR, STHLM3, SUH, UKK, WNS), anatomical location level (MLP, SFI, SCH) and/or patient level (KUH-1, SCH, SFI, SFR, SPROB20) depending on the granularity of the available reference standards.
Cancer diagnosis
Sensitivity (true positive rate) and specificity (true negative rate) will be used to quantify the agreement of negative/positive diagnosis for prostate cancer with the reference standard. Confidence intervals for sensitivity and specificity will be computed using the non-parametric bootstrap over cases. We will additionally report the Area Under the Receiver Operating Characteristic Curve (AUROC) and confusion matrices.
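As an illustration only, the following minimal Python sketch computes sensitivity and specificity together with a bootstrap confidence interval obtained by resampling cases with replacement. The percentile interval, the number of resamples, the seed, the handling of resamples without positive or negative cases, and all function names are assumptions; the protocol specifies only that a non-parametric bootstrap over cases is used.

```python
import random

def sens_spec(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = cancer, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

def bootstrap_ci(y_true, y_pred, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(y_true, y_pred), resampling cases."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample cases with replacement
        value = stat([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # skip NaN from resamples lacking one of the classes
            values.append(value)
    values.sort()
    m = len(values)
    lower = values[int(alpha / 2 * m)]
    upper = values[min(m - 1, int((1 - alpha / 2) * m))]
    return lower, upper

# Hypothetical toy data: 4 cancer cases and 6 benign cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sens_spec(y_true, y_pred)
sens_lo, sens_hi = bootstrap_ci(y_true, y_pred, lambda t, p: sens_spec(t, p)[0])
```

Resampling whole cases, rather than individual predictions and labels separately, keeps each prediction paired with its reference label, so the interval reflects sampling variability in the cohort.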
Gleason score
Quadratically weighted Cohen’s kappa (QWK) will be used to quantify the agreement of Gleason scoring with the reference standard. We will also report linearly weighted Cohen’s kappa (LWK) and confusion matrices. To allow calculating weighted kappas, Gleason scores (e.g. 3+4) will be encoded into ordinal variables following earlier studies (Jung et al., 2022; Egevad, Micoli, Delahunt, et al., 2024; Egevad, Micoli, Samaratunga, et al., 2024) as follows: benign (0), 3+3 (1), 3+4 (2), 4+3 (3), 3+5 (4), 4+4 (5), 5+3 (6), 4+5 (7), 5+4 (8), 5+5 (9). Confidence intervals will be computed using the non-parametric bootstrap over cases.
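The ordinal encoding and the weighted kappa can be illustrated with a minimal Python sketch. The encoding table reproduces the mapping stated above; the function and variable names are hypothetical. The sketch uses the standard disagreement-weight form of the statistic, in which the weight for categories i and j is (|i - j| / (k - 1)) raised to the power 1 (linear) or 2 (quadratic).

```python
GLEASON_TO_ORDINAL = {
    "benign": 0, "3+3": 1, "3+4": 2, "4+3": 3, "3+5": 4,
    "4+4": 5, "5+3": 6, "4+5": 7, "5+4": 8, "5+5": 9,
}

def weighted_kappa(a, b, n_categories, power=2):
    """Weighted Cohen's kappa for two lists of ordinal codes in 0..n_categories-1.

    power=1 gives linear weights (LWK), power=2 gives quadratic weights (QWK).
    Undefined (division by zero) if all ratings fall into a single category.
    """
    k, n = n_categories, len(a)
    observed = [[0] * k for _ in range(k)]
    for i, j in zip(a, b):
        observed[i][j] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected

# Hypothetical toy data: reference and AI Gleason scores for six slides.
reference = ["benign", "3+3", "3+4", "4+3", "4+4", "5+5"]
predicted = ["benign", "3+3", "4+3", "4+3", "4+5", "5+5"]
ref_codes = [GLEASON_TO_ORDINAL[g] for g in reference]
ai_codes = [GLEASON_TO_ORDINAL[g] for g in predicted]
qwk = weighted_kappa(ref_codes, ai_codes, n_categories=10, power=2)
lwk = weighted_kappa(ref_codes, ai_codes, n_categories=10, power=1)
```

With quadratic weights, a disagreement between distant categories is penalised far more heavily than one between adjacent categories, which is why QWK is usually higher than LWK when most errors are off by a single step.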
ISUP grade
Quadratically weighted Cohen’s kappa (QWK) will be used to quantify the agreement of the ISUP grade with the reference standard. In addition, we will also report linearly weighted Cohen’s kappa (LWK) and confusion matrices. To allow calculating weighted kappas, ISUP grades will be treated as ordinal variables (0-5), with benign encoded as 0. Confidence intervals will be computed using the non-parametric bootstrap over cases.
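The weighted kappa statistics used for Gleason score and ISUP grade can be sketched as follows. The dictionary reproduces the ordinal encoding of Gleason scores stated in this section; the function name `weighted_kappa` and its arguments are illustrative assumptions, and an established library implementation could equally be used.

```python
# Ordinal encoding of Gleason scores as listed in this protocol.
GLEASON_TO_ORDINAL = {
    "benign": 0, "3+3": 1, "3+4": 2, "4+3": 3, "3+5": 4,
    "4+4": 5, "5+3": 6, "4+5": 7, "5+4": 8, "5+5": 9,
}

def weighted_kappa(a, b, n_classes, weights="quadratic"):
    """Cohen's kappa with linear or quadratic disagreement weights.

    a, b: sequences of integer class indices in [0, n_classes).
    """
    n = len(a)
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for i, j in zip(a, b):
        observed[i][j] += 1
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            d = abs(i - j)
            w = d * d if weights == "quadratic" else d
            num += w * observed[i][j]          # weighted observed disagreement
            den += w * row[i] * col[j] / n     # weighted chance disagreement
    return 1.0 - num / den if den else float("nan")

# Usage with hypothetical gradings of four slides.
reference = [GLEASON_TO_ORDINAL[g] for g in ["benign", "3+3", "3+4", "4+4"]]
predicted = [GLEASON_TO_ORDINAL[g] for g in ["benign", "3+3", "4+3", "4+4"]]
qwk = weighted_kappa(reference, predicted, n_classes=10, weights="quadratic")
lwk = weighted_kappa(reference, predicted, n_classes=10, weights="linear")
```

For ISUP grades the same function applies with `n_classes=6` and benign encoded as 0.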
Secondary analysis: Cancer extent prediction
We will quantify the concordance of the AI system’s prediction of linear cancer extent expressed in millimetres with the reference standards in those tuning, internal validation and external validation cohorts where a reference standard is available (AUH, KUH-1, STHLM3, SUH, STG, MLP, SCH, SFI, SFR). The concordance will be quantified using root mean squared error (RMSE). In addition, we will also report Pearson’s linear correlation coefficient, and show scatter plots of predicted millimetre cancer length vs. millimetre cancer length reported by the reference standard. The analysis will be conducted on slide level (AUH, KUH-1, STHLM3, SUH, STG, SFR), anatomical location level (MLP, SFI, SCH) and/or patient level (MLP, SCH, SFI, SFR) depending on the granularity of the available reference standards (see Table 3). Confidence intervals will be computed using the non-parametric bootstrap over cases.
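The two summary statistics for cancer extent can be written out as a short sketch. The function names are illustrative, and the inputs are assumed to be paired lists of predicted and reference cancer lengths in millimetres at the same level of granularity.

```python
import math

def rmse(predicted_mm, reference_mm):
    """Root mean squared error between paired length measurements."""
    squared = [(p - r) ** 2 for p, r in zip(predicted_mm, reference_mm)]
    return math.sqrt(sum(squared) / len(squared))

def pearson_r(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

RMSE is sensitive to systematic over- or underestimation, whereas Pearson's coefficient is not, which is one reason for reporting both.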
Secondary analysis: Cribriform cancer detection
We will quantify the concordance of the AI system’s prediction of the presence of cribriform cancer with the reference standards in those internal and external validation cohorts where a reference standard is available (MUL, SCH, STHLM3, SUH). The tuning set has an insufficient number of cribriform samples for evaluation and will be included in the training. The concordance will be quantified using unweighted Cohen’s kappa. In addition, we will also report AUROC, sensitivity (true positive rate), specificity (true negative rate) and confusion matrices. Slides reported as borderline for cribriform cancer will be considered negative. The analysis will be conducted on slide level. Confidence intervals will be computed using the non-parametric bootstrap over cases.
Secondary analysis: Perineural invasion detection
We will quantify the concordance of the AI system’s prediction of the presence of perineural invasion with the reference standards in those internal and external validation cohorts where a reference standard is available (MUL, SCH, STHLM3, SUH). The tuning set has an insufficient number of PNI samples for evaluation and will be included in the training. The concordance will be quantified using unweighted Cohen’s kappa. In addition, we will also report AUROC, sensitivity (true positive rate), specificity (true negative rate) and confusion matrices. Slides reported as borderline for perineural invasion will be considered negative. The analysis will be conducted on slide level. Confidence intervals will be computed using the non-parametric bootstrap over cases.
Subgroup analyses
Subgroup analysis A
We will measure the performance of the AI system in terms of the primary and secondary objectives across subgroups of patients divided by age. Analysis will be conducted on the cohorts where age information can be retrieved (see Table 1) according to the age groups <50, 50–59, 60–69 and ≥70 years.
Subgroup analysis B
We will measure the performance of the AI system in terms of the primary and secondary objectives across subgroups of patients divided by biopsy sampling technique (systematic vs. targeted vs. combined). The analysis will be conducted on the cohorts where biopsy sampling technique information can be retrieved.
Subgroup analysis C
We will measure the performance of the AI system in terms of the primary and secondary objectives across subgroups of patients who were treatment-naive or had received treatment for benign prostatic hyperplasia (BPH) (using e.g. 5-alpha reductase inhibitors) before the biopsy procedure. The analysis will be conducted on the cohorts where treatment information can be retrieved. Some (very few) individuals included in the patient cohorts may also have undergone prior prostate cancer treatment (e.g. radiation therapy), but the number of cases is insufficient for a subgroup analysis.
Subgroup analysis D
We will measure the performance of the AI system in terms of the primary objective on subgroups of slides representing morphological subtypes of benign and malignant tissue that are usually hard for pathologists to diagnose. We will evaluate the performance of the AI system in the STHLM3 morphological subtypes internal validation cohort, the KUH-2 external validation cohort and the AQ external and partly external validation cohorts. See Table 5 for the distribution of morphological subtypes reported in each cohort. We will evaluate performance in terms of cancer diagnosis and, where applicable to the subtype, Gleason scoring.
Subgroup analysis E
We will measure the performance of the AI system in terms of the primary objective across subgroups of slides which required IHC staining for confirming the diagnosis and slides which the pathologists could assess without IHC. The analysis will be conducted on the cohorts where information on IHC can be retrieved (see Table 6).
Subgroup analysis F
We will measure the performance of the AI system in terms of the primary objective in comparison to the state-of-the-art algorithms developed in the PANDA challenge (Bulten et al., 2022). The analysis will be conducted on the subgroups of the KUH-1, RUMC and STHLM3 cohorts representing the internal and external validation sets of PANDA. For a fair comparison, we will apply the AI system on the WSIs provided to the challenge participants, which differ in terms of preprocessing and file format from the underlying original WSIs of the KUH-1 and STHLM3 cohorts, which are used in our primary analysis.
We will evaluate the performance in the tuning cohort KUH-1 (i.e. PANDA European external validation set) and compare the AI system with the PANDA challenge algorithms.
We will evaluate the performance in the combined PANDA subset of the RUMC and STHLM3 internal validation cohorts (i.e. PANDA internal validation set) and compare the AI system with the PANDA challenge algorithms.
Sensitivity analyses
Sensitivity analysis A
We will evaluate the reproducibility of the AI system’s output in terms of the primary and secondary objectives on WSIs obtained from the same slides on multiple scanners. The analysis will be conducted on the STHLM3 tuning and internal validation cohorts and the MUL external validation cohort, which contain WSIs rescanned on different scanners (see Table 2). In the STHLM3 cohort, a subset of slides (n=287) has been rescanned on five scanners: Aperio AT2 DX, Grundium Ocus40, Hamamatsu NanoZoomer 2.0-HT C9600-12, Hamamatsu NanoZoomer XR C12000-02 and Philips IntelliSite UFS. In the MUL cohort, a subset of slides (n=503) has been rescanned on two scanners: Grundium Ocus40 and Philips IntelliSite UFS. We will quantify the reproducibility of the AI predictions across scanners using QWK, LWK and the percentage of slides with discordant predictions for each objective and each pair of scanners. We will additionally report confusion matrices.
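The percentage of discordant predictions per scanner pair could be computed along the following lines. This is a sketch under assumptions: the nested dictionary layout (scanner name, then slide identifier, then predicted label) and the function name are hypothetical, and only slides scanned on both scanners of a pair are compared.

```python
from itertools import combinations

def pairwise_discordance(predictions):
    """Percentage of slides with differing predictions for each scanner pair.

    predictions: dict mapping scanner name -> dict of slide_id -> label.
    """
    result = {}
    for s1, s2 in combinations(sorted(predictions), 2):
        shared = predictions[s1].keys() & predictions[s2].keys()
        if not shared:
            continue  # the two scanners have no slides in common
        n_diff = sum(predictions[s1][k] != predictions[s2][k] for k in shared)
        result[(s1, s2)] = 100.0 * n_diff / len(shared)
    return result
```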
Sensitivity analysis B
To put the discrepancies between the AI system and the reference standards in the context of inter-observer variation between pathologists, we will quantify all-against-all pairwise agreements in panels consisting of pathologists and the AI system.
For the primary objective, the analysis will be conducted on subsets of the STHLM3 (ImageBase) and RUMC (PANDA Radboud) internal validation cohorts and on the full UKK and WNS external validation cohorts, which were assessed by a panel of pathologists and have per-pathologist grades available in addition to their consensus (see Table 3). For the secondary objectives of cribriform cancer and PNI detection, the analysis will be conducted on subsets of the STHLM3 internal validation cohort, assessed by panels of pathologists (see Table 4).
We will calculate the average pairwise agreement (QWK and LWK for the primary objective, unweighted Cohen’s kappa for the secondary objectives) for all the pathologists in the panels, including the AI system, and compare the average AI-pathologist agreement to the average pathologist-pathologist agreement. Confidence intervals will be computed using bootstrapping, as detailed before (Egevad et al., 2018).
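The comparison of average AI-pathologist agreement with average pathologist-pathologist agreement can be sketched as below. The example uses unweighted Cohen's kappa, as for the secondary objectives; a weighted kappa can be passed through the `metric` argument for the primary objective. The rater names, the data layout and the function names are assumptions, and at least two pathologists plus the AI system are assumed to be present.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")

def mean_pairwise_agreement(ratings, ai_name="AI", metric=cohen_kappa):
    """Return (mean AI-pathologist, mean pathologist-pathologist) agreement.

    ratings: dict mapping rater name -> list of labels in the same slide order.
    """
    ai_pairs, pathologist_pairs = [], []
    for r1, r2 in combinations(ratings, 2):
        score = metric(ratings[r1], ratings[r2])
        if ai_name in (r1, r2):
            ai_pairs.append(score)
        else:
            pathologist_pairs.append(score)
    return (sum(ai_pairs) / len(ai_pairs),
            sum(pathologist_pairs) / len(pathologist_pairs))
```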
Sensitivity analysis C
To assess the sensitivity of the results to different pathologists providing the cohort-specific reference standards and to isolate differences in observed AI performance due to varying reference standards from those due to imperfect generalisation to different labs and scanners, we will repeat the primary analysis using a consistent reference standard. We will measure the agreement between the AI system and the uniform reference standard set by the lead pathologist (L.E.) on subsets of the SUH and RUMC internal validation cohorts and the AUH, MLP, MUL, SCH, SFI, SFR, and SPROB20 external validation cohorts (see Table 3 for a summary of the re-assessed subsets and Section 7 for details on the case selection for each cohort). While the original reference standards were varyingly reported either on the level of slides, anatomical locations, or patients, L.E.’s re-assessments are consistently reported on slide level.
Furthermore, we will measure the agreement in ISUP grades (QWK and LWK) between the original reference standards and the lead pathologist on the re-assessed subsets of each cohort. To facilitate this comparison for cohorts with original reference standards provided on anatomical location or patient level (whereas the grading by L.E. is on slide level), the location or patient level grading by L.E. will be obtained as the maximum ISUP grade over all slides belonging to a location or patient.
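The aggregation of slide-level grades to location or patient level described above amounts to taking a maximum per group. A minimal sketch, with hypothetical identifiers and function name:

```python
def aggregate_max_isup(slide_grades, slide_to_group):
    """Maximum ISUP grade (0 = benign) over all slides in each group.

    slide_grades: dict slide_id -> ISUP grade (0-5).
    slide_to_group: dict slide_id -> anatomical location or patient identifier.
    """
    grouped = {}
    for slide_id, grade in slide_grades.items():
        group = slide_to_group[slide_id]
        grouped[group] = grade if group not in grouped else max(grouped[group], grade)
    return grouped

# Example: two slides from patient p1, one from patient p2.
grades = {"s1": 0, "s2": 3, "s3": 1}
patients = {"s1": "p1", "s2": "p1", "s3": "p2"}
patient_level = aggregate_max_isup(grades, patients)
```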
Sensitivity analysis D
We will perform a sensitivity analysis that involves a re-assessment of slides where the AI system committed clinically significant errors by repeating the primary analysis against the updated reference standard. This analysis aims to evaluate what portion of clinically significant errors can be attributed to data quality issues, such as mistyped information in the reference standard tables, mixed-up slide identifiers, or WSI scanning issues in cases where the original reference standard was set using a microscope. Significant errors are defined as cases where the AI model predicts a slide as benign, but the reference standard indicates ISUP grade ≥ 2, or conversely the AI predicts a slide as ISUP grade ≥ 2, but the reference standard indicates benign. These slides will be re-assessed by the lead pathologist (L.E.) and/or other experienced uropathologists, blinded to the original reference standard and the AI output. If a slide cannot be assessed due to e.g. poor focus, it will be excluded. The evaluation will be conducted on the internal and external validation cohorts, on both the full cohorts after updating the reference standards, and on only the updated subsets. Additionally, during this analysis, pathologists will report whether any of the cases with clinically significant errors represent ductal adenocarcinoma (DAC). Despite being the second most common subtype of prostate cancer after acinar adenocarcinoma, DAC only accounts for 0.17% of prostate cancers (Ranasinha et al., 2021) and may therefore be challenging for AI to detect due to the limited amount of training data.
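The definition of a clinically significant error given above can be expressed directly in code. The sketch assumes ISUP grades encoded as integers with benign as 0, as in the primary analysis; the function names and the tuple layout are illustrative.

```python
def is_significant_error(ai_isup, ref_isup):
    """True if benign was predicted for ISUP >= 2, or ISUP >= 2 for benign."""
    missed_cancer = ai_isup == 0 and ref_isup >= 2
    false_alarm = ai_isup >= 2 and ref_isup == 0
    return missed_cancer or false_alarm

def slides_for_reassessment(rows):
    """Select slide identifiers from (slide_id, ai_isup, ref_isup) tuples."""
    return [sid for sid, ai, ref in rows if is_significant_error(ai, ref)]
```

Note that disagreements involving ISUP grade 1 (for example benign predicted for an ISUP 1 slide) do not meet this definition and would not be selected.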
Sensitivity analysis E
We will perform a sensitivity analysis that involves the exclusion of samples reported by the pathologists as “borderline” for cribriform cancer or PNI, followed by repeating the secondary analyses concerning these objectives. Conducting the analysis only on samples indicated as negative or positive will provide an estimate of the AI system’s performance in detecting cribriform cancer and PNI less affected by the uncertainty and subjectivity in the definition of these entities. We will additionally quantify the prevalence of borderline diagnoses among slides initially classified as false positives vs. true negatives to quantify whether borderline cases are overrepresented among false positives. This would indicate that false positives mainly arise due to uncertainty of the reference standard.
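The comparison of borderline prevalence among false positives and true negatives could be computed as in the following sketch. It assumes three reference labels ("negative", "borderline", "positive") and that, as in the secondary analyses, borderline slides count as negative when false positives and true negatives are first determined; the label strings and function name are assumptions.

```python
def borderline_rates(ai_positive, reference_labels):
    """Return the borderline fraction among (false positives, true negatives).

    ai_positive: list of bool, the AI prediction per slide.
    reference_labels: list of "negative", "borderline" or "positive".
    """
    false_pos = [r for a, r in zip(ai_positive, reference_labels)
                 if a and r != "positive"]
    true_neg = [r for a, r in zip(ai_positive, reference_labels)
                if not a and r != "positive"]

    def rate(group):
        if not group:
            return float("nan")
        return sum(r == "borderline" for r in group) / len(group)

    return rate(false_pos), rate(true_neg)
```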
Exploratory analysis: Evaluate visualisations of the AI output
We will output visualisations of the AI system’s predictions to highlight areas on each slide containing different Gleason patterns, cribriform cancer or PNI. The visualisations will be assessed qualitatively by the lead pathologist (L.E.) and/or other experienced uropathologists for concordance with their assessments. We may additionally quantify the rate of agreement between the AI system and the pathologists by collecting region annotations to serve as a reference standard, and by calculating the pixel-wise sensitivity, specificity, intersection over union or other suitable metrics.
Exploratory analysis: Evaluate the impact of tissue segmentation algorithms
Separating tissue from the background, so that the rest of the analysis is applied only to tissue pixels, is a common preprocessing step in most computational pathology algorithms. While this task of tissue segmentation may seem trivial, many modern AI algorithms reach such low error rates in their main task that errors in tissue detection can account for a considerable share of the remaining errors. In particular, missed tissue poses a risk of false negative diagnoses if it leads to the exclusion of malignant tissue from the analysis. We will evaluate the effect of tissue segmentation on the overall performance of the AI system in terms of the primary and secondary objectives by comparing two different tissue segmentation algorithms. One of the algorithms represents classical image processing and relies on filtering and thresholding the image (Ström et al., 2020). The other algorithm is a trained deep-learning-based segmentation model. We will apply both algorithms to perform the tissue segmentation during model training and validation and compare the results on the internal and external validation cohorts.
Exploratory analysis: Evaluate end-to-end vs. transfer-learning-based models
Recently, so-called foundation models, trained in a self-supervised manner on large and heterogeneous datasets, have been proposed as generally applicable solutions to diverse tasks in computational pathology and as an alternative to tissue-type-specific or task-specific models (Chen et al., 2024). We aim to compare our end-to-end trained prostate-cancer-specific model to transfer-learning-based models relying on state-of-the-art foundation models for histopathology. We will apply a suitable foundation model as a feature extractor and train an additional classifier to adapt the model to the task of diagnosis and Gleason scoring of prostate biopsies. For this transfer learning step, we will use the same development cohorts as for the end-to-end trained model. We will then evaluate the model on the same internal and external validation cohorts as the end-to-end trained model for a direct comparison.
Exploratory analysis: Evaluate the impact of physical colour calibration
Variations in the reproduction of colour across different digital pathology scanners may pose a problem for AI, leading to inconsistent model outputs depending on the scanner used for slide digitisation. A physical calibrant in the form of a spectrophotometrically characterised slide has been proposed as a means for standardising the colour characteristics of WSIs acquired with different scanners (Clarke et al., 2018). We will evaluate the impact of applying physical colour calibration on the performance of the AI model on those internal and external validation cohorts where the calibrant slide could be scanned on the same scanner as the prostate biopsies to allow calibration.
8.3. Confounding factors
Statistical confounding, or spurious correlations, in the training and validation data of predictive models, may lead to “shortcut learning” or so-called “Clever Hans predictors” (Lapuschkin et al., 2019), where overly optimistic performance on validation data is seen as the result of the model taking advantage of unintended correlations between some attributes of the data and the correct labels. Such biases are also common in digital pathology datasets (Howard et al., 2021; Schmitt et al., 2021). We have carefully considered the potential presence of such biases in our cohorts and taken the steps described below to mitigate the issue.
An important confounding factor is the scanner instruments used for digitising various subsets of our data cohorts. Patients in different cohorts and subsets of cohorts have been sampled in varying ways, leading to differences in the compositions of these groups in terms of GS and ISUP grade distribution. These correlations between specific clinical sites or scanner instruments and the target labels can create biases during training since the model could learn to associate the appearance of WSIs obtained from a specific site or with a specific scanner with a higher or lower likelihood of a particular diagnostic or grading outcome. If the same bias is present in validation data, this will lead to overly optimistic results. Conversely, if the bias present in training data is not present in the validation data, a model relying on these spurious correlations will perform poorly. The main approach we have taken to mitigate the risk of overly optimistic validation results is relying on fully external validation data. The external validation cohorts represent patients, clinical sites, laboratories and scanners not present in the training data. This minimises the risk of the same spurious correlations appearing in both training and external validation data. When it comes to discouraging the model from learning any spurious correlations between laboratories or scanners and the target labels, which could result in suboptimal performance in the absence of these correlations, we will apply a sampling scheme which removes the correlations between these variables during model training.
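The exact sampling scheme is not specified in this section. Purely as an illustration of how a sampling scheme can remove the correlation between scanner (or laboratory) and target label, the sketch below assigns each training sample a weight inversely proportional to the size of its (scanner, label) cell, so that every observed cell carries equal total weight. The function name and this particular weighting are assumptions, not the scheme used in the study.

```python
from collections import Counter

def decorrelating_weights(scanners, labels):
    """Sampling weights giving every observed (scanner, label) cell equal mass.

    If all scanner-label combinations occur in the data, scanner and label are
    statistically independent under the resulting weighted distribution.
    """
    cell_counts = Counter(zip(scanners, labels))
    n_cells = len(cell_counts)
    return [1.0 / (n_cells * cell_counts[(s, l)]) for s, l in zip(scanners, labels)]

# Toy example: scanner A holds mostly benign slides, scanner B mostly cancer.
scanners = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
weights = decorrelating_weights(scanners, labels)
```

Such weights could, for instance, be passed to a weighted random sampler when drawing training batches. Combinations that never occur in the data cannot be balanced by reweighting alone.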
Another common confounding factor we have identified is markings on the slides. Pathologists often place pen marks on the glass slides to indicate cancerous regions. These can lead the AI model to directly associate the presence of markings with the presence of cancer, or indirectly to associate image quality artefacts such as poor focus caused by the pen marks with a higher likelihood of cancer being present. We have mitigated these issues by 1) Applying tissue detection and masking of background pixels as an image preprocessing step, ensuring that pen markings adjacent to tissue will not be shown to the model, 2) Washing and rescanning of slides where pen markings are placed on top of tissue or caused focusing issues, or 3) Excluding slides where neither of the first two options was possible. The first approach of background masking is applied to all the WSIs included in the study. The second approach of washing slides was applied to the development cohorts where we had control over the scanning process, namely STHLM3 and SUH. In the RUMC cohort, we excluded slides with pen marks on the tissue based on the findings of the participants in the PANDA challenge.
8.4. Representative sampling
A key issue in the evaluation of diagnostic tests is how disease prevalence influences estimates of statistical measures used to assess the diagnostic performance of the tests. Prevalence is generally defined as the proportion of individuals in a population who have a particular disease at a given time. When evaluating a diagnostic test, however, the relevant prevalence is the proportion of disease-positive cases in the datasets used for the evaluation.
The positive predictive value (PPV; i.e. the probability that individuals with a positive test result truly have the disease), negative predictive value (NPV; i.e. the probability that individuals with a negative test result truly do not have the disease), and Cohen’s kappa statistics are influenced by the disease prevalence in the datasets used for evaluating the performance of diagnostic tests. As prevalence increases, the PPV of a test also increases; conversely, NPV decreases with increasing prevalence. This relationship means that in datasets where a disease (or disease subtype) is more common, a positive test result is more likely to be a true positive, whereas a negative test result is less likely to be a true negative. Similarly, the disease prevalence and case mix will impact estimates of Cohen’s kappa.
In contrast to PPV, NPV and Cohen’s kappa, sensitivity (also known as true positive rate i.e. the ability of a test to correctly identify patients with the disease) and specificity (also known as true negative rate i.e. the ability to correctly identify those without the disease) are not affected by changes in prevalence. These measures are intrinsic properties of the test and do not depend on how common the disease is in a population or dataset.
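The dependence of PPV and NPV on prevalence, for fixed sensitivity and specificity, follows from Bayes' theorem and can be illustrated with a short sketch. The function name and the numerical values are chosen for the example only.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# The same test evaluated at two different prevalences.
ppv_high, npv_high = predictive_values(0.9, 0.9, prevalence=0.5)
ppv_low, npv_low = predictive_values(0.9, 0.9, prevalence=0.1)
```

With sensitivity and specificity both at 0.9, lowering the prevalence from 0.5 to 0.1 reduces the PPV from 0.9 to 0.5 while the NPV rises.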
The sampling scheme or experimental design impacts the estimated prevalence in a study, thereby affecting the diagnostic performance statistics that are sensitive to prevalence. For example, in case-control studies, the prevalence is artificially set by the researcher. In datasets collected for the development of diagnostic AI systems (such as the one described in this protocol), it is common to upsample patients with a disease or disease subtype. If a consecutive case series were used for training an AI system to perform Gleason scoring, a very large set would be required in order to ensure a sufficiently large subsample of e.g. Gleason score 9 and 10 samples for efficient training. Similarly, convenience sampling, where subjects are selected based on their availability rather than at random or according to a defined study design, can lead to a sample with a prevalence rate that does not match the general population. These types of experimental designs and sampling schemes can lead to assessments of PPV, NPV, and Cohen’s kappa that do not reflect estimates that would be obtained in a consecutive case series in the general population.
The impact of prevalence on performance estimates underlines the importance of carefully considering the design of diagnostic studies. When prevalence is expected to differ, adjustments or different interpretations of PPV and NPV may be necessary to avoid misleading conclusions. The data we use for training and evaluation of the AI system is a mixture of convenience samples (AMU, AQ, KUH-2, RUMC, SPROB20, STG) and data representing consecutive clinical cases or another well-defined and controlled sampling scheme (AUH, KUH-1, MLP, MUL, SCH, SFI, SFR, STHLM3, SUH, UKK, WNS). For the datasets with a known sampling scheme and experimental design, we can use prior probability shift corrections to achieve estimates of PPV, NPV, and Cohen’s kappa on a well-defined base population (Schölkopf et al., 2012; Heiser, Allikivi and Kull, 2020).
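One simple form of prior probability shift correction is to rescale each reference-class row of a confusion matrix so that the row totals match the class proportions of the target population, and then to recompute prevalence-dependent statistics from the rescaled matrix. The sketch below illustrates this idea for Cohen's kappa. It assumes that the distribution of predictions within each reference class is unchanged between the study sample and the target population; the function names and numbers are hypothetical, and the cited works describe the corrections in full.

```python
def reweight_confusion(confusion, target_priors):
    """Rescale rows (reference classes) so that row sums equal target_priors."""
    reweighted = []
    for row, prior in zip(confusion, target_priors):
        total = sum(row)
        reweighted.append([prior * count / total for count in row])
    return reweighted

def kappa_from_confusion(confusion):
    """Unweighted Cohen's kappa from a square confusion matrix."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_observed = sum(confusion[i][i] for i in range(k)) / total
    row_sums = [sum(confusion[i]) for i in range(k)]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_chance = sum(row_sums[i] * col_sums[i] for i in range(k)) / (total * total)
    return (p_observed - p_chance) / (1 - p_chance)

# Rows: reference benign / cancer; columns: predicted benign / cancer.
enriched = [[90, 10], [10, 90]]           # sample with 50% cancer prevalence
target = reweight_confusion(enriched, [0.9, 0.1])  # population with 10% prevalence
kappa_enriched = kappa_from_confusion(enriched)
kappa_target = kappa_from_confusion(target)
```

In this toy example, kappa drops from 0.8 in the enriched sample to about 0.59 at 10% prevalence, although sensitivity and specificity are identical in both.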
8.5. Power
We have not performed formal power (or sample size) calculations. The reasons for this are as follows:
The central objective of this study is to calculate point estimates of performance (using statistical measures as described above) and their confidence intervals, rather than emphasising power to detect a specific effect size (which is more relevant when comparing interventions or diagnoses).
This is a retrospective evaluation of AI for prostate pathology. This means that the sample size is fixed based on the datasets at hand.
8.6. Data quality and label noise
Collecting and pseudonymising or anonymising clinical and pathology data and associating these records with the correct WSIs requires a number of steps, each introducing potential sources for error. Our data collection, management and verification process generally followed these steps:
Retrieval and digitisation of clinical/pathology data
Depending on the data cohort, the clinical and pathology data were extracted from existing databases/registries (STHLM3) in tabular form, provided in tabular form by the data providing sites (AMU, AQ, AUH, MLP, MUL, RUMC, SPROB20, SUH, UKK, WNS) or tabulated manually in-house from pathology reports scanned into PDF files (KUH-1, KUH-2, SCH, SFI, SFR, STG). The manual tabulation in-house involved human translation of the reports from Finnish (SFI), French (SCH, SFR) and Swedish (KUH-1, KUH-2, STG) by trained non-experts fluent in the respective languages. Patient identifiers were pseudonymised during the data extraction or tabulation process by each data provider.
Retrieval and digitisation of slides
Slides were retrieved from the respective archives at each site and scanned with the instruments tabulated in Table 2. Each slide had a label with an identifier and depending on the scanning site, the identifiers were stored either in the form of macro/label images as part of the WSI metadata, automatically detected from QR codes and stored as WSI metadata, or manually typed in by the scanner operator when naming the resulting WSI files.
Linking slides to clinical/pathology data
Depending on the manner in which slide identifiers were stored for each WSI, the linking step involved one of the following approaches. For WSIs where the identifier was manually typed into the filename, customised scripts were written in Python for each data cohort to parse the filename strings. This involved comparing the parsed identifiers to those present in the clinical/pathology data, and iterative refinements to rectify issues such as missing or additional zeros, missing or additional whitespace or other delimiters, and discrepancies in the representation of characters not belonging to the Basic Latin (standard ASCII) set, e.g. Ä or Ö. For WSIs where the identifier was stored in the form of WSI metadata, we used an in-house developed optical character recognition (OCR) system to extract identifiers in a semi-automated manner from the QR-code based metadata items and the macro/label images embedded in the WSIs. The system first extracted the QR-code based identifier, if available, or otherwise performed OCR using the pytesseract (version 0.3.2) implementation of the Tesseract OCR engine (Smith, 2007). The system featured a simple user interface, which presented the automatically detected identifier pre-filled into a text box, alongside the macro/label image of the slide. The human operator then had the option of accepting the proposed identifier or correcting it manually based on the label image. All identifiers were assessed by trained non-experts using this semi-automated approach.
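The kinds of discrepancies listed above for manually typed identifiers (extra zeros, whitespace and delimiters, non-ASCII characters) can be reduced by normalising both the parsed filename identifiers and the identifiers in the clinical/pathology tables before comparison. The sketch below is hypothetical and does not reproduce the cohort-specific scripts used in the study; the function name, the set of delimiters and the transliteration via Unicode decomposition are assumptions.

```python
import re
import unicodedata

def normalise_identifier(raw):
    """Reduce an identifier to a canonical form for matching."""
    # Decompose accented characters (e.g. Ä -> A + diaeresis) and drop the marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove leading zeros within each run of digits.
    text = re.sub(r"\d+", lambda m: str(int(m.group())), text)
    # Remove whitespace and common delimiters, then unify letter case.
    text = re.sub(r"[\s_\-./]+", "", text)
    return text.upper()

def link_identifiers(filename_ids, table_ids):
    """Map each filename identifier to a table identifier, or None if unmatched."""
    lookup = {normalise_identifier(t): t for t in table_ids}
    return {f: lookup.get(normalise_identifier(f)) for f in filename_ids}
```

Because the normalisation is lossy, distinct identifiers can collapse to the same canonical form, so unmatched or ambiguous identifiers would still need manual review.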
Relabeling
Slides and patients were initially labelled independently by each data provider using pseudonymised identifiers. This poses a risk that the same identifier (e.g. Patient_01) is used by multiple data providers, which would cause ambiguous matches in the final combined dataset. In order to minimise this risk and to obtain unique identifiers for each WSI, each slide and each patient, we calculated unique MD5 hashes based on the variables below. This step additionally provided another round of pseudonymisation to minimise the risk of any non-pseudonymised identifiers being accidentally used by the data providing sites.
WSI ID: Filename + scanner serial number + scanning time stamp
Slide ID: Cohort name + original slide ID
Patient ID: Cohort name + original patient ID
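The identifier scheme above can be sketched with Python's `hashlib`. The separator used when joining the variables, the text encoding and the example values are assumptions made for illustration; only the choice of MD5 and the listed input variables come from this protocol.

```python
import hashlib

def make_id(*parts):
    """MD5 hex digest of the given parts joined with an (assumed) '|' separator."""
    joined = "|".join(str(p) for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hypothetical inputs for each identifier type.
wsi_id = make_id("slide_0001.tiff", "SN-12345", "2021-03-04T10:15:00")
slide_id = make_id("SUH", "Slide_01")
patient_id_suh = make_id("SUH", "Patient_01")
patient_id_sthlm3 = make_id("STHLM3", "Patient_01")
```

Including the cohort name in the hashed string is what keeps `Patient_01` from two data providers distinct, since the same original identifier then yields different digests.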
Verification
The final dataset covering all the data cohorts is managed internally as a CSV spreadsheet, generated and maintained using scripts written in Python relying on pandas (The pandas development team; McKinney, 2010). Upon generation and any modifications, the dataset undergoes comprehensive unit testing to ensure correctness, implemented in Python using the unittest framework. A version history of the dataset is retained to allow tracing back errors. In summary, the tests used for verification cover the following aspects. The uniqueness and unambiguity of matches based on the identifiers described above are verified. Patient-level variables are tested for consistency across all slides and WSIs from the same patient, and slide-level variables are checked for consistency across multiple WSIs representing the same slide. We verify that all variables have valid values, with specific tests for categorical, quantitative, and Boolean variables, and test for logical mismatches between variables (e.g. a slide negative for cancer cannot be positive for PNI). We ensure there is no overlap between patients in different development vs. validation splits or between cross-validation folds in the development data. Please refer to the Supplementary Appendix Section 2 for an extensive list of all the tests.
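To illustrate the style of these checks, the sketch below implements three of them with the unittest framework on a toy list of records. The column names, the toy data and the use of plain dictionaries instead of a pandas table are assumptions made to keep the example self-contained; the actual tests are listed in the Supplementary Appendix.

```python
import unittest

# Hypothetical toy records standing in for rows of the dataset table.
ROWS = [
    {"patient_id": "p1", "slide_id": "s1", "split": "development",
     "age": 64, "cancer": True, "pni": False},
    {"patient_id": "p1", "slide_id": "s2", "split": "development",
     "age": 64, "cancer": False, "pni": False},
    {"patient_id": "p2", "slide_id": "s3", "split": "validation",
     "age": 71, "cancer": True, "pni": True},
]

def values_per_patient(rows, column):
    """Collect the set of distinct values of a column for each patient."""
    collected = {}
    for row in rows:
        collected.setdefault(row["patient_id"], set()).add(row[column])
    return collected

class DatasetChecks(unittest.TestCase):
    def test_patient_level_variables_are_consistent(self):
        ages = values_per_patient(ROWS, "age")
        self.assertTrue(all(len(v) == 1 for v in ages.values()))

    def test_pni_requires_cancer(self):
        # A slide negative for cancer cannot be positive for PNI.
        self.assertFalse(any(r["pni"] and not r["cancer"] for r in ROWS))

    def test_no_patient_overlap_between_splits(self):
        splits = values_per_patient(ROWS, "split")
        self.assertTrue(all(len(v) == 1 for v in splits.values()))
```

The checks can be run with `python -m unittest` against the module containing them.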
9. Discussion
This study protocol underscores our dedication to transparency and scientific rigour in developing AI systems for medical diagnostics. The protocol outlines data cohorts, development-validation partitions, performance metrics and an experimental pipeline prespecified before any investigations or experiments on the validation datasets have taken place. For each data cohort, we report information on patient characteristics and selection, biopsy acquisition, histopathological sample preparation, digitisation, and previous utilisation of the cohorts in earlier studies on other AI systems. Furthermore, we report reference standard protocols detailing the variables assessed by pathologists, the level of assessment (pixels, slides, anatomical locations or patients), and any additional re-assessments. This comprehensive documentation of data cohorts facilitates transparency and reproducibility of the research, interpretation of data diversity and representativeness, as well as reliability and integrity of developing and validating the AI system. The study results will be submitted for publication regardless of whether they are positive, negative or inconclusive in relation to the study hypothesis.
Despite the rigorous design, the study has a number of limitations, which we aim to address in future revisions of the protocol and in follow-up studies. Firstly, many AI systems, including those developed for diagnostic purposes, often suffer from the under-representation of certain demographic groups in the data used for their development and validation (Garin et al., 2023). In this study as well, we recognise potential biases in patient demographic representation and are committed to addressing them through additional data collection and subsequent validation processes. Importantly, while all data cohorts and partitions are predefined, the protocol is designed to accommodate the addition of new cohorts for development (up until the model design freeze and initiation of the validation phase) or for validation without altering the initial partitions. For example, we are currently collecting validation data from ethnically diverse North American (Vigneswaran et al., 2024) and Middle Eastern cohorts. This protocol will be extended accordingly to support additional retrospective evaluation of the AI system across these and other patient populations on a global scale.
Secondly, reproducible AI performance across different digital pathology scanners would greatly facilitate scalable clinical deployment of AI systems, and we address this question in a prespecified cross-scanner consistency analysis, which currently has some limitations. The majority of the scanners used for rescanning slides from the STHLM3 and MUL validation cohorts for this analysis were also involved in the digitisation of the development data (except the Grundium Ocus40 scanner). This can potentially lead to optimistic results due to the AI model having been exposed to the variation seen between these scanners during training. Nevertheless, it should be noted that all the external validation cohorts described in this protocol have been digitised on scanners not involved in the collection of the AI development data, which will allow us to assess cross-scanner generalisation indirectly. For a direct comparison using the exact same set of slides digitised on multiple external scanners (i.e. corresponding to a paired study design), we are in the process of rescanning slides on additional scanners. This will allow us to repeat the analysis using scanners fully external to the AI system in a follow-up study.
Thirdly, the criteria for distinguishing between uropathologists and general pathologists are often vague and lack standardised definitions across different countries and hospitals. This may introduce differences when comparing agreement rates between general pathologists and uropathologists across different cohorts. Furthermore, there are varying practices in the reporting of prostate pathology, for example in terms of measuring cancer extent and summarising Gleason scoring results on the patient level. This might introduce additional systematic differences when evaluating the performance across cohorts, which we have mitigated by additional re-assessments performed in a consistent manner by the lead pathologist (L.E.). Still, prostate pathology assessment remains a subjective process and inter- and intra-observer variability cannot be fully eliminated from the reference standards.
This protocol covers retrospective validation of an AI system for assessing prostate core needle biopsies with respect to four main objectives, i.e. prostate cancer diagnosis and grading, cancer extent, cribriform cancer, and perineural invasion. These objectives are crucial for predicting disease prognosis and guiding treatment for prostate cancer patients. However, we plan to add further objectives from our work on AI for prostate cancer. For example, the diagnostic AI system described in this protocol can serve as a foundation model for developing models for direct prognostication (based on relevant oncological outcomes, such as time to biochemical recurrence (BCR), metastatic disease or prostate cancer death) and treatment prediction, and with further refinements can be adapted to predict additional objectives based on prostate morphology or to other use cases in prostate pathology, such as reducing the need for IHC staining (Table 6). Moreover, we will perform molecular characterisation (genomic and transcriptomic profiling) of tissue samples from diagnostic biopsies, following the same protocol as we use in the ProBio trial for metastatic prostate cancer (Crippa et al., 2020; De Laere et al., 2022). Linked imaging and genomic data will be used to develop models to predict clinically important genomic alterations and mutations from the morphological data in the WSIs. For example, we will develop AI models for the prediction of alterations in the BRCA genes; patients with alterations in these genes tend to respond well to poly ADP-ribose polymerase (PARP) inhibitors (de Bono et al., 2020; Chi et al., 2023; Fizazi et al., 2023). In a clinical setting, such AI models could help to triage tissue samples for genomic analysis to verify AI predictions, which would reduce costs and improve the chances of detecting clinically actionable genetic information.
We will also use the data presented in this protocol to further develop conformal predictors to detect unreliable AI predictions (Olsson et al., 2022). Additional information regarding these objectives will be added in future revisions of this protocol (and then noted in the revision history of the document).
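To make the idea of flagging unreliable predictions concrete, the sketch below shows a generic inductive (split) conformal predictor for a classifier that outputs class probabilities. It is a minimal illustration under stated assumptions and not the implementation of Olsson et al. (2022): the nonconformity score (one minus the probability of the class), the binary label encoding (0 for benign, 1 for cancer), the significance level, and all function names and numbers are hypothetical.

```python
from typing import List, Sequence


def calibration_scores(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """Nonconformity score 1 - p(true class) for each held-out calibration sample."""
    return [1.0 - p[y] for p, y in zip(probs, labels)]


def prediction_set(prob: Sequence[float], cal_scores: Sequence[float], epsilon: float) -> List[int]:
    """Return every class whose conformal p-value exceeds the significance level epsilon."""
    n = len(cal_scores)
    kept = []
    for label, p_label in enumerate(prob):
        score = 1.0 - p_label
        # p-value: share of calibration samples at least as nonconforming as this one.
        p_value = (sum(s >= score for s in cal_scores) + 1) / (n + 1)
        if p_value > epsilon:
            kept.append(label)
    return kept


def is_unreliable(pred_set: Sequence[int]) -> bool:
    """Flag predictions whose set is empty or contains more than one label."""
    return len(pred_set) != 1


# Hypothetical calibration data: predicted probabilities and true labels.
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05], [0.3, 0.7], [0.85, 0.15],
             [0.4, 0.6], [0.9, 0.1], [0.01, 0.99], [0.75, 0.25]]
cal_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0]
scores = calibration_scores(cal_probs, cal_labels)

confident = prediction_set([0.97, 0.03], scores, epsilon=0.2)  # single label kept
ambiguous = prediction_set([0.55, 0.45], scores, epsilon=0.2)  # no label kept
```

In this sketch a prediction is treated as reliable only when exactly one label survives. An empty set suggests the sample looks unlike the calibration data, and a set with several labels suggests the model cannot separate the classes at the chosen significance level.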
The importance of relating performance to a well-defined population (see Section 8.4) motivates prospective evaluation in a clinical trial, which we are currently planning. (The prospective trial will be described in detail in its own protocol.) Prospective evaluation also enables assessing aspects relevant to the clinical implementation of AI systems that cannot be evaluated on retrospective data, such as user interaction and pathologist-in-the-loop approaches. This planned clinical trial will thus evaluate the AI system performance in a real-world clinical setting against gold-standard diagnostic practices and provide evidence of its efficacy and reliability for guiding clinical decision-making in prostate cancer diagnosis.
Data Availability
A subset of the data used for model training (STHLM3 and RUMC cohorts) is available for non-commercial purposes under the CC BY-NC-SA 4.0 license as part of the PANDA challenge dataset and is freely downloadable after registration at https://www.kaggle.com/c/prostate-cancer-grade-assessment. The validation cohort SPROB20 is available for non-commercial purposes under the AIDA BY license upon accepted access request at the AIDA Data Hub at https://datahub.aida.scilifelab.se/10.23698/aida/sprob20. The validation cohorts UKK and WNS are available for non-commercial purposes under the CC BY-NC-SA 4.0 license upon accepted access request at https://zenodo.org/records/8102833 and https://zenodo.org/records/8102929.
10. Ethical considerations
The study is conducted in accordance with the Declaration of Helsinki. The collection of patient samples was approved by the Stockholm regional ethics committee (permits 2012/572-31/1, 2012/438-31/3, and 2018/845-32), the Swedish Ethical Review Authority (permit 2019-05220), and the Regional Committee for Medical and Health Research Ethics (REC) in Western Norway (permits REC/Vest 80924, REK 2017/71). Informed consent was provided by the participants in the Swedish dataset. For the other datasets, informed consent was waived by the institutional review board due to the use of de-identified prostate specimens in a retrospective setting.
11. Competing interests
N.M., L.E., K.K. and M.E. are shareholders of Clinsight AB, and M.R. is a co-founder and shareholder of Stratipath AB.
12. Acknowledgements
A.B. received a grant from the Health Faculty at the University of Stavanger, Norway. B.G.P. and K.D.S. received funding from Innovation Fund Denmark (Grant no. 8114-00014B) for the Danish branch of the NordCaP project. M.R. received funding from the Swedish Research Council and the Swedish Cancer Society. P.R. received funding from the Research Council of Finland (Grant no. 341967) and the Cancer Foundation Finland. M.E. received funding from the Swedish Research Council, Swedish Cancer Society, Swedish Prostate Cancer Society, Nordic Cancer Union, Karolinska Institutet, and Region Stockholm. K.K. received funding from the David and Astrid Hägelen Foundation, Instrumentarium Science Foundation, KAUTE Foundation, Karolinska Institute Research Foundation, Orion Research Foundation and Oskar Huttunen Foundation.
We want to thank Carin Cavalli-Björkman, Astrid Björklund and Britt-Marie Hune for assistance with scanning and database support. We would also like to thank Simone Weiss for assistance with scanning in Aarhus, and Silja Kavlie Fykse and Desmond Mfua Abono for scanning in Stavanger. We would like to acknowledge the patients who participated in the STHLM3 diagnostic study and the OncoWatch and NordCaP projects and contributed the clinical information that made this study possible.
The computations were made possible by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at C3SE, partially funded by the Swedish Research Council through grant agreements no. 2022-06725 and no. 2018-05973; by the supercomputing resource Berzelius, provided by the National Supercomputer Centre at Linköping University and the Knut and Alice Wallenberg Foundation; and by CSC - IT Center for Science, Finland.