ABSTRACT
MedSegBench is a comprehensive benchmark designed to evaluate deep learning models for medical image segmentation across a wide range of modalities.. This benchmark includes 35 datasets with over 60,000 images, covering modalities such as ultrasound, MRI, and X-ray. It addresses challenges in medical imaging, such as variability in image quality and dataset imbalances, by providing standardized datasets with train/validation/test splits. The benchmark supports binary and multi-class segmentation tasks with up to 19 classes. Evaluations are conducted using the U-Net architecture with various encoder/decoder networks, including ResNets, EfficientNet, and DenseNet, to evaluate model performance. MedSegBench serves as a valuable resource for developing robust and flexible segmentation algorithms. It allows for fair comparisons across different models and promotes the development of universal models for medical tasks. The datasets and source code are publicly available, encouraging further research and development in medical image analysis.
Background & Summary
Deep learning has become essential in medical image analysis and segmentation, offering powerful methods to help doctors and researchers better understand and diagnose diseases1. Deep learning can identify patterns and details in medical images that might be difficult for human eyes to detect using complex networks such as convolutional neural networks2. These techniques are precious for finding tumors in X-rays, classifying different cell types in whole-slide images, or segmenting different brain parts in MRI scans. However, working with biomedical datasets presents unique challenges, including variability of image quality and resolution, the need for well-annotated examples, imbalances of the datasets, and different modalities. Addressing these challenges and ensuring the effectiveness of deep learning methods in real-world medical settings requires large and diverse datasets3. These comprehensive collections of medical images help train the algorithms to handle different modalities and medical tasks. They also allow researchers to compare deep learning methods fairly, determine the most effective approaches for specific medical tasks, and develop universal models for different medical tasks.
There are limited benchmark studies in the literature focused on medical imaging, with most concentrating on medical image classification problems4–8. Gelasca et al.4 present a comprehensive biomedical segmentation benchmark that evaluates bioimage analysis methods. It includes six datasets with associated ground truth and validation methods, covering different scales from subcellular to tissue levels. Rebuffi et al.5 propose the Visual Decathlon Challenge, a benchmark that evaluates models across ten diverse visual classification domains, including datasets such as Aircraft, CIFAR-100, and ImageNet. Medical Segmentation Decathlon6 supports the creation and benchmarking of semantic segmentation algorithms. It includes 2633 3D images from ten different anatomical sites and modalities collected from multiple institutions and annotated by experts. Yang et al.7 introduce the MedMNIST Benchmark, a collection of ten pre-processed medical image datasets standardized to 28×28 pixels. It covers various medical image modalities and support multiple classification tasks. Yang et al.8 extend MedMNIST with MedMNIST v2, a standardized collection of biomedical image datasets. This includes 12 datasets for 2D images and 6 for 3D images, covering various data modalities, scales, and classification tasks,
This study introduces a comprehensive benchmark dataset for medical image segmentation (Figure 1). It includes 35 distinct datasets with over 60,000 images covering various data modalities such as ultrasound, dermoscopy, MRI, X-ray, OCT, and more. It provides a diverse resource for evaluating the performance of deep learning models in medical image segmentation tasks. The dataset includes a wide range of scales, from small collections with just a few dozen images to extensive datasets containing tens of thousands of samples. The segmentation tasks cover both binary and multi-class problems, with some datasets featuring up to 19 different classes. This benchmark offers several powerful advantages as a robust and versatile tool for the research community:
Diversity of modalities: The benchmark includes datasets from various imaging modalities such as Ultrasound, MRI, X-Ray, OCT, Dermoscopy, Endoscopy, and various types of microscopy.
Task complexity: It covers both binary segmentation tasks and multi-class segmentation tasks with up to 19 classes.
Dataset sizes: There’s a wide range in the number of images per dataset, from as few as 28 to as many as 21,165.
Data split: All datasets follow a standard train/validation/test split, which is crucial for the proper evaluation of machine learning models.
Standardization: All datasets are standardized to enhance comparability and ease of use. Samples across all datasets have been resized to three standard resolutions - 128, 256, and 512 pixels - and stored in a uniform format.
Application areas: The datasets cover various medical applications, including cancer detection, COVID-19 diagnosis, cell and nuclei segmentation, and organ segmentation.
We have evaluated each dataset on state-of-the-art segmentation model (U-Net9) with different encoder/decoder network types (ResNet-18, ResNet-50, Efficient-Net, MobileNet-v2, DenseNet-121, Mix Vision Transformer)10. Each experiment are performed 3 times and average results are reported.
This benchmark is carefully designed to assess how well deep learning models can generalize across different medical domains, perform on small and large datasets, and handle varying task complexities. By including such a wide array of medical imaging challenges, this benchmark is a powerful tool for comprehensively evaluating the robustness, flexibility, and overall efficacy of segmentation algorithms in the medical imaging field.
Methods
Data Preparation
The MedSegBench dataset comprises 35 distinct 2D medical image segmentation datasets, some of which are extracted from 3D slices. These datasets cover various data modalities such as Ultrasound, OCT, Chest X-ray, MR, and more. The original datasets differ in scales, segmentation tasks (binary/multi-class), classes, imaging modalities, and annotation styles. Hence, we have selected a standardized format and performed pre-processing to ensure a consistent format across all datasets.
Numerous medical image segmentation datasets are available in the literature, each presenting various challenges in- cluding variations in annotations, image sizes, and file formats. Additionally, many of these datasets lack officially shared train/test/validation splits, making it challenging to fairly compare different methods. To address these issues, we performed pre-processing steps. All image and label pairs are resized to 128×128, 256×256, and 512×512 pixels using the bicubic interpolation method. Although we used 512×512 sized images in our experiments, we have made the 128×128 and 256×256 sized versions publicly available for researchers with limited GPU memory. Also, we have applied a mapping to labels; pixels with values of 0 and 255 are mapped to 0 and 1 for binary segmentation tasks, and for multi-class segmentation tasks, pixels are mapped to integer values between 0 and (#Classes - 1). No additional augmentation or pre-processing steps are applied to the images and labels. We have followed three different scenarios based on MedMNIST v28 to create train/test/validation splits: (1) Utilizing the source train/test/validation splits if published by the authors; (2) Using the source validation set as the test set and splitting the source training set into 90% training and 10% validation (9:1 ratio) if the source training and validation splits are published by the authors; (3) Randomly splitting the dataset into 70% training, 10% validation, and 20% test sets if no public train/test/validation splits are available (7:1:2 ratio). Most of these datasets are publicly published under Creative Commons Licenses, some of which are CC-BY-NC, CC-BY-SA, and CC-BY-NC-SA, permitting the redistribution of datasets. We have published datasets in MedSegBench under Creative Commons Licences, and source codes have been published under Apache License 2.0.
Table 1 presents the summary information for all MedSegBench datasets. In addition, Table 2 shows the data-modality-based overview for MedSegBench datasets. Furthermore, Table 3 provides an overview of various datasets, detailing their sub-categories and the number of samples for training, validation, and testing. In the following sections, we will describe the details of each dataset.
Details
AbdomenUSMSBench
The AbdomenUSMSBench created from AbdomenUS11,12 consists of 926 ultrasound images of the abdominal region, each with a resolution of 449×464 pixels. This dataset is designed for multi-class segmentation tasks and includes eight distinct classes. We have used the official train and test splits, and the train set is split into a training and validation set with a ratio of 9:1. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and (#Classes - 1).
Bbbc010MSBench
The Bbbc010MSBench dataset derived from Bbbc01013,14, contains 100 microscopy images, each with a resolution of 696×520 pixels. These images are created for binary segmentation tasks and are originally captured for a screen in Fred Ausubel’s Massachusetts General Hospital (MGH) lab. The dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1 × 512 × 512 pixels, and the labels are mapped to 0 and 1.
Bkai-Igh-MSBench
The Bkai-Igh-MSBench dataset is derived from the BKAI-IGH NeoPolyp dataset15–17 and consists of 1,200 endoscopy images, each with a resolution of 1280×995 pixels. It is designed for multi-class segmentation tasks, with three distinct classes. We can not use publicly shared test sets because of a lack of ground truth annotations. The dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and (#Classes - 1).
BriFiSegMSBench
The BriFiSegMSBench, which originates from the BriFiSeg dataset18,19, includes 1,360 microscopy images with a resolution of 512 512 pixels. This dataset is intended for binary segmentation tasks and contains two classes. The images are single-channel samples derived from various cell lines, such as A549, HeLa, MCF7, and RPE1. The dataset is divided into training and validation sets with a 9:1 ratio. Additionally, task-specific images and annotations are provided in npz file format (see Table 3). The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
BusiMSBench
The BusiMSBench dataset is derived from the Breast Ultrasound Images Dataset20,21 and contains 647 ultrasound images with an average resolution of 500×500 pixels. This dataset is designed for binary segmentation tasks, categorizing data into two classes: benign and malignant. It is split into three parts: train/val/test, in a 7:1:2 ratio. Additionally, class-based images (benign and malignant) and annotations are provided in .npz file format (see Table 3). The samples are resized to 1 × 512 × 512 pixels, and the labels are mapped to integer values between 0 and 1.
CellNucleiMSBench
The CellNucleiMSBench comes from the 2018 Data Science Bowl22,23 and consists of 670 nuclei images with a resolution of 320 256 pixels. This dataset is specifically designed for binary segmentation tasks. We could not use 65 test images because ground truths are not published officially. Therefore, the source dataset split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
ChaseDB1MSBench
ChaseDB1MSBench is based on the CHASE_DB1 dataset24, released in 2012 by Kingston University, London, and St. George’s, University of London, consists of 28 fundus images with a resolution of 999×960 pixels. This dataset is designed for binary segmentation tasks, including two classes. We split the source dataset into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
ChuacMSBench
The ChuacMSBench, derived from the CHUAC dataset25, includes 28 fundus images with 189 189 pixels. It is designed for binary segmentation tasks. The source dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
Covid19RadioMSBench
The COVID-19 Radiography Database26–28 is the source of the Covid19RadioMSBench dataset, which consists of 21,165 chest X-ray images, each with a resolution of 299×299 pixels. This dataset is designed for binary segmentation tasks. We divide the source dataset into three parts: train/val/test sets with a ratio of 7:1:2. It is developed by a collaborative effort of researchers from Qatar University, the University of Dhaka, and partners from Pakistan and Malaysia, working alongside medical professionals. It includes chest X-ray images for COVID-19 positive cases and Normal and Viral Pneumonia images. The authors have also categorized the images into four groups: COVID, Lung_Opacity, Normal, and Viral Pneumonia. We provide these category-based images and their corresponding annotations in .npz file format (see Table 3). The samples are resized to 1 512 512 pixels, and the labels are mapped to integer values between 0 and 1.
CovidQUExMSBench
The CovidQUExMSBench, based on the COVID-QU-Ex Dataset29,30, consists of 2,913 chest X-ray images, each with a resolution of 256×256 pixels. This dataset is specifically designed for binary segmentation tasks. We use only infection segmentation samples. The source dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
MosMedPlusMSBench
The MosMedPlusMSBench, based on the MosMedDataPlus57,58 dataset, comprises 2,729 Covid-19 CT images, each sized 512×512 pixels. This dataset is designed for binary segmentation tasks. We split source data into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
CystoFluidMSBench
The CystoFluidMSBench is based on Intraretinal Cystoid Fluid dataset31–33, comprises 1,006 OCT (Optical Coherence Tomography) images, most of which are sized at 512×512 pixels. This dataset is designed for binary segmentation tasks. The images are carefully chosen by medical experts at Liaquat University of Medical and Health Sciences (LUMHS) Jamshoro, who are trained to identify Cystoid Macular Edema (CME) and its progression, providing a confirmatory diagnosis of CME. The source dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
Dca1MSBench
The Dca1MSBench is derived from the DCA1 dataset34,35 and contains 134 fundus images, each with a resolution of 300×300 pixels. The images are provided by the Cardiology Department of the Mexican Social Security Institute, UMAE T1-León. This dataset is specifically created for binary segmentation tasks. The dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
DeepbacsMSBench
The DeepbacsMSBench, based on the DeepBacs dataset36,37, consists of 34 samples of fundus images, each with a size of 1024×1024 pixels. It is designed for binary segmentation tasks. We use official train/validation/test splits published officially by authors. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
DriveMSBench
The DriveMSBench dataset, based on the DRIVE dataset38,39, includes 40 fundus images, each with dimensions of 565×584 pixels. The images are obtained from a diabetic retinopathy screening program in the Netherlands. It is designed for binary segmentation and uses official splits for training, validation, and testing. The samples are resized to 3 × 512 × 512 pixels, and the labels are mapped to integer values between 0 and 1.
DynamicNuclearMSBench
The DynamicNuclearMSBench, created from the DynamicNuclearNet Segmentation dataset40, 41, consists of 7084 samples of nuclear cell images, each 128×128 pixels in size. This dataset is utilized for a binary segmentation task. Training, validation, and test splits that are officially published are used. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
FHPsAOPMSBench
The FHPsAOPMSBench dataset is based on a prior dataset42,43 and comprises 4,000 ultrasound images, each with a resolution of 256×256 pixels. This dataset is designed for a multi-class segmentation task, including three distinct classes. The source dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1 × 512 × 512 pixels, and the labels are mapped to integer values between 0 and (#Classes - 1).
IdribMSBench
The IdribMSBench is based on the Indian Diabetic Retinopathy Image Dataset44, 45 and includes 80 high-resolution fundus images (4288×2848 pixels) for a binary segmentation task. We use official train/validation/test splits published officially by authors. The authors have also categorized the labels into four categories: Microaneurysms, hemorrhages, Hard Exudates, and Optic Discs. These category-based labels and annotations are provided in a npz file (see Table 3). The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
Isic2016MSBench
The Isic2016MSBench is derived from the ISIC 2016 Challenge46,47, which consisted of 1,279 dermoscopy samples of varying sizes designed for binary segmentation tasks. We use official training, validation, and test splits published by authors. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
Isic2018MSBench
The Isic2018MSBench is derived from the ISIC 2018 Challenge48–50, which consisted of 3,694 dermoscopy samples of varying sizes designed for binary segmentation tasks. We use official training, validation, and test splits published by authors. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
KvasirMSBench
The KvasirMSBench, derived from the Kvasir-SEG dataset51,52, consists of 1,000 endoscopy images with resolutions ranging from 332×487 to 1920×1072 pixels. The dataset includes images of gastrointestinal polyps and their segmentation masks, which have been annotated and verified by an experienced gastroenterologist. It is structured for a binary classification task. The source dataset is divided into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
M2caiSegMSBench
M2caiSegMSBench is based on a prior dataset53,54 comprising 614 pathology samples and specifically designed for multi-class segmentation tasks, which include 19 distinct classes. The images within this dataset exhibit variable dimensions, and we use official train/validation/test splits. The samples are resized to 3 × 512 ×512 pixels, and the labels are mapped to integer values between 0 and (#Classes - 1).
MonusacMSBench
MonusacMSBench is based on the MoNuSAC challenge55,56. It consists of 310 samples and is designed for multi-class segmentation with 6 classes. The images in this dataset are H&E stained digitized tissue images from several patients acquired at multiple hospitals using a standard 40x scanner magnification. The annotations are provided by expert pathologists. We use the officially published train/validation/test splits from the challenge. The samples are resized to 3 ×512 ×512 pixels, and the labels are mapped to integer values between 0 and (#Classes - 1).
NucleiMSBench
The NucleiMSBench is based on a prior dataset59, which consisting of 141 pathology samples each with an image size of 2000 2000 pixels. This source dataset is designed for binary segmentation tasks. The source dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3 ×512 ×512 pixels, and the labels are mapped to integer values between 0 and 1.
NusetMSBench
The NusetMSBench, derived from the NuSet dataset60,61, contains 3,408 pathology samples designed for binary segmentation problems. We split the source dataset into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1 × 512 × 512 pixels, and the labels are mapped to integer values between 0 and 1.
PandentalMSBench
The PandentalMSBench is created from the Panoramic Dental X-rays dataset62, 63 and contains 116 X-ray samples of varying sizes. It is specifically intended for binary segmentation tasks. The dataset comprises anonymized and deidentified panoramic dental X-rays of 116 patients taken at Noor Medical Imaging Center in Qom, Iran. The source dataset is divided into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
PolypGenMSBench
The PolypGenMSBench is based on a prior endoscopy dataset64,65 consisting of 1,412 endoscopy samples, each with an image size of 1920 1080 pixels. It is designed for binary segmentation tasks. It includes colonoscopy video frames captured from a diverse patient population at six different centers in Egypt, France, Italy, Norway, and the United Kingdom. We provide these images and annotations are captured from these centers in a npz file. The source dataset is divided into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
Promise12MSBench
The Promise12MSBench, derived on the PROMISE12 dataset66,67, contains 1,473 MR samples, each with an image size of 512×512 pixels. It is designed for binary classification. We split the source dataset into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
RoboToolMSBench
The RoboToolMSBench, based on the RoboTool dataset31, consisting of 500 samples, designed for binary segmentation tasks. The source dataset is divided into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
TnbcnucleiMSBench
The TnbcnucleiMSBench is based on a prior dataset68,69, consisting of 50 pathology samples, each with an image size of 512×512 pixels. This dataset is based on the merging of two different datasets: the first dataset, generated at the Curie Institute, consists of annotated H&E stained histology images at 40× magnification, and the second dataset, provided by the Indian Institute of Technology Guwahati, also consists of annotated H&E stained histology images captured at 40× magnification. It is designed for binary segmentation tasks. We split the source dataset into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
JUltrasoundNerveMSBench
The UltrasoundNerveMSBench, derived from prior dataset70, contains 2,323 ultrasound samples, each with an image size of 580×420 pixels and designed for binary segmentation tasks. The primary task in this dataset is to segment a collection of nerves known as the Brachial Plexus (BP) in ultrasound images. Due to the lack of test image annotations, we split the source dataset into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1 × 512 × 512 pixels, and the labels are mapped to integer values between 0 and 1.
USforKidneyMSBench
The USforKidneyMSBench is derived from the CT2USforKidneySeg dataset71, 72, comprised of 4,586 ultrasound samples, each with an image size of 256×256 pixels, and designed for binary segmentation tasks. The source dataset is split into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
UWSkinCancerMSBench
The UWSkinCancerMSBench is based on the Skin Cancer Detection dataset73, consisting of 206 dermoscopy samples, designed for binary classification tasks The dataset includes images extracted from the public databases DermIS and DermQuest, along with manual segmentations of the lesions. We split the source dataset into three parts: train/val/test, in a 7:1:2 ratio. The authors have also categorized the labels into two categories: Melenoma and Not-Melenoma. These category-based labels and annotations are provided in a .npz file (see Table 3). The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and 1.
WbcMSBench
The WbcMSBench, based on prior datasets74,75, is a microscopy imaging dataset consisting of 80 samples, with image sizes of 120×120 and 300×300 pixels. It is designed for multi-class segmentation tasks including 3 classes. The dataset is based on two sources: Dataset 1, obtained from Jiangxi Tecom Science Corporation, China, contains 300 images of white blood cells with a resolution of 120×120 pixels. Dataset 2 consists of 100 color images with a resolution of 300 300 pixels, collected from the CellaVision blog. The authors have grouped the samples into four categories: Lymphocyte, Monocyte, Neutrophil, and Eosinophil, and we provide these category-based images and corresponding labels in npz file format (see Table 3). The source dataset is divided into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 3×512×512 pixels, and the labels are mapped to integer values between 0 and (#Classes - 1).
YeazMSBench
The YeazMSBench, derived from the YeaZ dataset76,77, consists of 707 microscopy images with varying sizes and is designed for binary segmentation tasks. We split the source dataset into three parts: train/val/test, in a 7:1:2 ratio. The samples are resized to 1 × 512 × 512 pixels, and the labels are mapped to integer values between 0 and 1.
Data Records
We have publicly shared each dataset with varying sizes (128, 256, and 512 sized) in MedSegBench at Zenodo (Link). The MedSegBench consists of 35 pre-processed 2D medical image segmentation datasets (some of them extracted 3D slices) from various data modalities and tasks (binary/multi-class). The data storage format published by MedMNISTv28 is followed. We save each dataset in Numpy npz format, named as {dataset}_{size}.npz. Each npz file contains following keys: [“{train,val,test}_images”, “{train,val,test}_label”]. Also, some authors have published class- or category-based images and labels. We have also added this information with the following keys into the npz file and explain them in source code files: [“{train,val,test}_images_ {classno}”, “{train,val,test}_label_{classno}”]. All images and labels are stored in uint8 data type. {train,val,test}_images: Numpy array contains train, validation and test images with N×W×H×C shape for RGB datasets, and N×W×H for gray-scale datasets. Here, N refers to the number of samples, W is the width, H is the height, and C denotes the number of channels. {train,val,test}_label: It contains train, validation and test labels with N×W×H shape. {train,val,test}_images_{classno} and {train,val,test}_label_{classno}: These contain class or category-based train, validation, and test images and labels with shapes N×W×H×C (for RGB images, and N×W×H for gray-scale images), respectively.
Technical Validation
Baseline methods
In this study, we chose the U-Net architecture as the baseline structure for image segmentation tasks. We have selected six encoder/decoder networks to enhance performance and adaptability. These include ResNet18, ResNet50, and DenseNet121, commonly used as benchmarks in segmentation research. Additionally, we have selected EfficientNet and MobileNetv2 because they are lightweight models that offer a more computationally efficient alternative to ResNets and DenseNet. Furthermore, we have added a transformer-based approach using the Mix Vision Transformer, acknowledging the growing interest in transformer models for vision tasks.
The U-Net structure and diverse encoder/decoder networks are implemented using the qubvel-segmentation framework10. We have not used pre-trained ImageNet weights; we train each model from scratch on our datasets. We have trained each model with three randomly selected seed values to ensure the robustness of our results. All images are resized to 512×512 pixels, a standardized dimension for the training, validation, and testing phases. Training is conducted over 200 epochs using the Adam Optimizer with a learning rate 1e-3. For binary segmentation tasks, we used dice loss, while categorical cross-entropy loss is used for multi-class tasks. A batch size of 128 is selected throughout the training process. We have not applied weight decay methods or any data augmentation techniques, focusing on the raw performance of the models. The model weights corresponding to the best validation IOU are recorded for each network configuration. Further details regarding the model implementation, training, and evaluation steps are available in our code repository.
Performance Measures
We have evaluated each model on 35 different datasets using four performance measures: Precision (PREC), Recall (REC), F1-score (F1), and Intersection over Union (IOU). Precision measures the accuracy of positive predictions, highlighting its ability to avoid false positives, while Recall evaluates the model’s capacity to identify all relevant positive instances, minimizing false negatives. The F1-Score, as the harmonic mean of Precision and Recall, provides a balanced view, which is especially useful when there is an unbalanced class distribution. IoU, primarily used in image segmentation and object detection, evaluates the overlap between predicted and actual regions, ensuring accurate localization and identification of objects. We have individually reported PREC, REC, F1, and IOU scores for each dataset and averaged the results.
Results
The average PREC and REC results obtained from three different run are showed in Table 4 and average F1 and IOU scores are reported in Table 5 for each individual datasets. Also, the average results for each baseline methods are shown in Table 5,
Table 4 presents a comprehensive overview of the average precision and recall results for six different encoder networks across various datasets. These networks include ResNet-18 (RN-18), ResNet-50 (RN-50), Efficient-Net (EN), Mobile-Net-v2 (MN-v2), DenseNet-121 (DN-121), and Mix Vision Transformer (MVT). The results are divided into two main categories: precision and recall. In terms of precision, DenseNet-121 consistently demonstrated strong performance across numerous datasets. For example, it achieved the highest precision scores in datasets such as BusiMSB (0.794), ChuahMSB(0.870) and Dca1MSB (0.801). Similarly, Efficient-Net also demonstrated strong precision, particularly in datasets like Isic2016MSB and Isic2018MSB, where it scored 0.912 and 0.857, respectively. Although the Mix Vision Transformer is not evaluated on all datasets because it only accepts at least three channel images as input, it performed competitively where applicable, achieving high precision in datasets like Bkai-Igh-MSB (0.983). Regarding Recall, DenseNet-121 has emerged as a top performer, achieving the highest recall in datasets such as Bbbbc010MSB (0.920) and WbcMSB (0.970). Efficient-Net also performed well in recall metric, particularly in datasets like DynamicNuclearMSB (0.966) and USforKidneyMSB (0.982). The results indicate that DenseNet-121 and Efficient-Net are particularly robust across precision and recall metrics, suggesting their effectiveness in various applications. Overall, the analysis highlights DenseNet-121’s consistently high performance across multiple datasets, making it a reliable choice for tasks requiring high precision and recall. Efficient-Net also stands out, especially in recall, indicating its potential for applications where recall is critical.
Table 5 provides a comprehensive evaluation of six encoder networks across various datasets, using F1-score and Intersection over Union (IOU) as performance metrics. DenseNet-121 consistently performs well, frequently achieving the top F1 and IOU metrics scores across numerous datasets. For example, in the Bbbc010MSB and CellNucleiMSB datasets, DenseNet-121 records the highest F1-scores of 0.920 and 0.907, respectively, and similarly high IOU scores, indicating its robustness in handling diverse data types. Efficient-Net also shows significant performance, particularly in datasets like Isic2016MSB and USforKidneyMSB, where it achieves the highest F1-scores of 0.903 and 0.981, respectively. This indicates that Efficient-Net is particularly effective in scenarios requiring high precision and recall, as showed in its F1-scores. ResNet-50 performs best with an F1-score of 0.931 and an IOU of 0.870 for the DeepbacsMSB. Additionally, it has also achieved the highest F1-score of 0.786 and an IOU of 0.648 in the DriveMSB dataset. For the FHPsAOPMSB dataset, ResNet-18 has achieved the highest F1-score of 0.961 and an IOU of 0.929. While Mix Vision Transformer does not frequently perform as well as DenseNet-121, it shows competitive performance in specific datasets such as UWSkinCancerMSB, achieving the second-highest F1 Score of 0.881. This indicates its potential in specialized applications, particularly in medical imaging contexts. Overall, DenseNet-121 is the most robust and effective network, frequently outperforming other networks in achieving high F1-scores and IOU values.
Table 6 shows the average performance metrics for six different encoder networks. Efficient-Net (EN) and DenseNet-121 (DN-121) demonstrate the highest F1 scores, both achieving a value of 0.772. This indicates that these models have a balanced performance in terms of precision and recall. DenseNet-121 also achieves the highest precision at 0.848, suggesting it is effective at minimizing false positives. On the other hand, Efficient-Net leads in recall with a score of 0.788, indicating its strength in capturing true positives. Additionally, DenseNet-121 achieves the highest IOU with 0.702, closely followed by Efficient-Net with 0.700 This suggests that these two models provide the most accurate predictions. Overall, DenseNet-121 and Efficient-Net achieve similar high-performance metrics, with both models performing well in F1 score, precision, recall, and IOU. However, DenseNet-121’s complex architecture causes higher computational demands, whereas Efficient-Net provides a more efficient design, making it suitable for resource-constrained applications.
Usage Notes
The MedSegBench datasets are freely available at Zenodo We kindly request that users of the MedSegBench dataset cite this paper, along with the relevant source dataset files, in their publications. This dataset is created in order to fairly compare different models over various segmentation models from different data modalities and to create universal models. It is not suitable for clinical or medical use.
Data Availability
All data produced are available online at Zenodo
Code availability
The Python data API, source code files and evaluation scripts for binary and multi-class segmentation tasks can be found at https://github.com/zekikus/MedSegBench.
Author contributions statement
M.A. conducted data collection, cleaning and pre-processing steps. Z.K. performed the evaluation tests for binary and multi-class task for each network and datasets. All authors wrote and reviewed the manuscript.
Competing interests
The authors state that they have no conflicting interests.
Acknowledgements
We would like to express our gratitude for the authors of the MedMNIST8, which served as the baseline for our study, and for the shared source code that we referenced to develop our own code.