Abstract
Brain tumors, which are abnormal growths of cells in the brain, represent a significant health concern, necessitating prompt and accurate detection for effective treatment. If left untreated, brain tumors can lead to severe complications, including cognitive impairment, paralysis, and even death. This study evaluates six machine learning classifiers: Support Vector Classifier (SVC), Logistic Regression Classifier, K-Nearest Neighbors (KNN) Classifier, Naive Bayes Classifier, Decision Tree Classifier, and Random Forest Classifier - on a comprehensive brain tumor dataset. Our results showed that Random Forest achieved the highest accuracy of 98.27%, demonstrating its potential in detecting brain tumors. However, Support Vector Classifier (SVC) emerged as the top performer, achieving an impressive accuracy of 97.74%, showcasing its exceptional ability to detect brain tumors accurately. This significant improvement in SVC’s performance highlights its potential as a reliable tool for medical diagnostics, contributing to the development of efficient and accurate automated systems for early brain tumor diagnosis, ultimately aiming to improve patient outcomes and treatment efficacy.
I. Introduction
Brain tumors represent a significant global health concern, with the Nature Brain Tumor Society (NBTS) estimating that approximately 700,000 individuals in the United States are affected by these malignancies each year [1]. Brain tumors can be classified into various categories such as gliomas, medulloblastomas, and acoustic neuromas, each presenting distinct characteristics and treatment challenges [2]. Effective early detection and accurate diagnosis of brain tumors are crucial for optimizing treatment strategies and improving patient outcomes. Untreated brain tumors can lead to severe neurological deficits, cognitive impairments, and, in many cases, death [3].
Traditionally, brain tumors have been diagnosed using Magnetic Resonance Imaging (MRI), which offers high-resolution images of brain structures [4]. However, analyzing MRI data manually is both time-consuming and prone to variability, and often lacks the precision required for accurate tumor detection and segmentation [5]. This challenge has spurred interest in leveraging machine learning techniques to automate and enhance the diagnostic process.
Recent advancements in machine learning and artificial intelligence have shown promising potential in the realm of medical diagnostics [6]. In particular, machine learning models that rely on numerical values extracted from patient data, such as clinical features, genetic information, and laboratory results, have been increasingly explored as a means of improving brain tumor detection [7]. These models can offer significant advantages over traditional image-based methods by facilitating faster and more consistent diagnostic processes [3].
This study aims to evaluate the effectiveness of several well-known machine learning classifiers for the task of brain tumor detection using numerical data. Specifically, we examine the performance of Support Vector Classifier (SVC), Logistic Regression Classifier, K-Nearest Neighbors (KNN) Classifier, Naive Bayes Classifier, Decision Tree Classifier, and Random Forest Classifier. Each of these algorithms brings unique strengths to the table. For instance, SVM is known for its effectiveness in high-dimensional spaces and its ability to handle non-linearly separable data [8]. Logistic Regression is appreciated for its simplicity, interpretability, and capability to manage both continuous and categorical features [9]. KNN is valued for its robustness to noise and ability to capture complex feature interactions [10]. Naive Bayes offers benefits in handling categorical data and learning from smaller datasets [11]. Decision Trees are favored for their interpretability and ability to model both categorical and numerical features [12]. Random Forest, an ensemble method, is known for reducing overfitting and handling high-dimensional data effectively [13].
The motivation behind using these classifiers lies in their distinct advantages for processing numerical data and their varying approaches to handling complex patterns in the data. This study leverages a dataset consisting of clinical and diagnostic numerical values related to brain tumors, providing a platform for evaluating the performance of these classifiers [14]. The aim is to determine which classifier provides the highest accuracy and reliability for brain tumor detection, contributing to the development of efficient diagnostic tools [15].
Machine learning has demonstrated significant promise in the medical field, with various studies highlighting its effectiveness in improving diagnostic accuracy [16]. For instance, recent research has shown that machine learning models can significantly enhance the accuracy of cancer detection and prognosis prediction [17]. By applying these techniques to brain tumor detection using numerical data, this study seeks to build upon these advancements and offer a novel approach to diagnosing brain tumors [18].
In this study, this study focuses on evaluating the efficacy of various machine learning classifiers in detecting brain tumors from numerical values rather than MRI images. The goal is to identify the most effective algorithm for this task, thereby contributing to the broader effort of improving brain tumor diagnosis and ultimately enhancing patient outcomes.
The paper is further structured as follows: Section II discusses the literature review on the brain tumor detection. Section III highlights motivation. Section IV describes machine learning algorithms. Section V presents methodology. Section VI result. Section VII discussion. Lastly, Section VIII concludes the paper and future work.
II. Literature Review
This paper [19] highlights the “curse of dimensionality” often encountered in brain tumor datasets with many features. They propose a two-pronged approach: first, using Particle Swarm Optimization to select the most informative features, mimicking the efficient foraging behavior of birds or fish. Second, they employ ensemble learning with Majority Voting, combining the predictions of multiple classifiers to improve accuracy and robustness. This approach could be particularly relevant to your work if you’re dealing with a large number of features.
While the paper [20] utilizes an SVM, its core contribution lies in a novel feature extraction method designed to capture the most discriminative information from brain tumor data. This emphasis on feature engineering is highly transferable. You could apply their proposed feature extraction techniques and then experiment with alternative classifiers like Random Forest, which is known for its ability to handle high-dimensional data, or Gradient Boosting, which excels at reducing bias and achieving high accuracy.
This paper [21] delves into the realm of unsupervised learning for brain tumor detection, specifically employing the K-Means clustering algorithm. K-Means groups similar data points together based on their features, aiming to uncover hidden patterns and structures within the data without relying on labeled examples. This approach could be beneficial for your research by potentially revealing distinct clusters or subgroups within your dataset that correspond to different tumor characteristics or stages.
While the [22] title mentions CNNs, which are typically used for image data, this paper emphasizes the critical role of feature extraction for accurate brain tumor classification, regardless of the data type. They highlight how carefully engineered features can significantly improve the performance of machine learning models. You can draw inspiration from their feature engineering techniques and apply them to your tabular data to potentially enhance the accuracy of your chosen classifiers.
[23] Brain tumor detection and segmentation have been extensively explored using machine learning and deep learning techniques. Various studies have proposed CNN-based methods, automated feature extraction and classification approaches, and comparisons of deep learning models. Additionally, hybrid approaches combining different techniques have been investigated. These studies have achieved high accuracy rates, ranging from 91.43% to 98.69%, demonstrating the potential of machine learning and deep learning in brain tumor detection and segmentation.
[24] Brain tumor segmentation has been extensively explored using machine learning techniques. Previous reviews have focused on traditional computer vision methods and deep learning approaches. Recent studies have investigated the use of convolutional neural networks (CNNs) for brain tumor segmentation. Other approaches include using transfer learning, ensemble learning, and hybrid models combining CNNs with traditional machine learning techniques. These studies demonstrate the potential of machine learning for brain tumor segmentation, achieving high accuracy and efficiency.
In study [25] MRI-based brain tumor detection using convolutional deep learning methods and machine learning techniques was explored. A 2D CNN and auto-encoder network were proposed, achieving training accuracies of 96.47% and 95.63%, respectively. Six machine learning techniques were compared, with KNN achieving the highest accuracy (86%) and MLP the lowest (28%). The study demonstrates the effectiveness of deep learning methods in brain tumor detection, with the proposed 2D CNN showing optimal accuracy and performance. This work contributes to the development of automated brain tumor detection systems, improving diagnosis and treatment.
III. Motivation
Brain tumors are a leading cause of cancer-related deaths worldwide, with high mortality rates and a profound impact on the quality of life for patients and their families. Early and accurate diagnosis is crucial for effective treatment, improved patient outcomes, and enhanced survival rates. However, brain tumor diagnosis remains a challenging task, particularly in resource-constrained settings where access to advanced medical facilities, specialized personnel, and cutting-edge technologies is limited. In such settings, the lack of resources hinders the widespread adoption of advanced medical imaging techniques like MRI and CT scans, which are essential for accurate brain tumor diagnosis.
This research is motivated by the need for a cost-effective, objective, and accessible tool for brain tumor diagnosis that can operate within the constraints of resource-constrained settings. We aim to develop a predictive model that can aid in brain tumor diagnosis using readily available patient attributes and clinical features, eliminating the reliance on advanced medical imaging techniques or deep learning features. By leveraging machine learning algorithms and data analytics, our model seeks to provide a valuable tool for healthcare professionals, enabling them to make informed decisions and improve patient outcomes. Ultimately, our research strives to contribute to the development of efficient and accurate automated systems for early brain tumor diagnosis, leading to better patient care and treatment efficacy.
IV. Machine Learning Classifiers
A. Support Vector Classifier
Support Vector Machines are powerful supervised learning models used for classification and regression tasks. In the context of classification, an SVM aims to find an optimal hyperplane that best separates data points belonging to different classes.
The hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the closest data points from each class, known as support vectors. This focus on maximizing the margin contributes to the SVC’s ability to generalize well to unseen data [26].
SVCs can be applied to both linearly separable and non-linearly separable data. For non-linearly separable data, SVCs utilize kernel functions to map the data into a higher-dimensional space where it becomes linearly separable [27].
B. Logistic Regression Classifier
Logistic regression is a statistical model used to predict the probability of a binary outcome (yes/no, 1/0) based on one or more independent variables. Unlike linear regression, which predicts continuous outcomes, logistic regression employs a sigmoid function to map predictions to a probability range between 0 and 1 [28].
This algorithm works by estimating the log odds of the outcome occurring based on the values of the independent variables. These log odds are then transformed into probabilities using the sigmoid function.
While primarily used for binary classification, logistic regression can be extended to handle multinomial outcomes (multiple categories) through variations like multinomial logistic regression. Logistic regression models are widely used in various fields, such as biology and social sciences, where the objective is to predict a categorical outcome [29].
C. K-Nearest Neighbor (KNN) Classifier
The K-Nearest Neighbors (KNN) classifier is a straightforward yet powerful supervised learning algorithm used for both classification and regression tasks [30]. At its core, KNN operates on the principle that similar data points tend to cluster together.
When classifying a new data point, the algorithm identifies the k nearest neighbors to the point in the feature space, based on a chosen distance metric (e.g., Euclidean distance). The class label of the new data point is then determined by a majority vote among its k neighbors. For instance, if k is set to 5, and 3 out of the 5 nearest neighbors belong to class A, the new data point would be classified as belonging to class A.
One of the key advantages of KNN is its simplicity and ease of implementation [31]. It’s a non-parametric method, meaning it makes no assumptions about the underlying data distribution, making it suitable for datasets with complex or unknown structures. However, the choice of k is crucial, as a small k can make the model susceptible to noise, while a large k might lead to over smoothing and misclassification.
D. Logistic Regression Classifier
The Naive Bayes Classifier (NBC) is a widely used machine learning algorithm for classification tasks. It is based on Bayes’ theorem, which describes the probability of a hypothesis given some observed evidence. In the context of classification, the hypothesis is the class label, and the evidence is the feature values of the instance to be classified [32].
The NBC algorithm assumes independence between features, meaning that each feature contributes independently to the probability of the class label. This assumption simplifies the calculation of the posterior probability of the class given the features. The algorithm calculates the likelihood of each feature given the class, as well as the prior probability of each class. Then, it applies Bayes’ theorem to calculate the posterior probability of each class given the features [32].
There are three main types of NBC, each suited to different types of data. Multinomial Naive Bayes (MNB) is used for multi-class problems with discrete features. Bernoulli Naive Bayes (BNB) is used for binary classification with binary features. Gaussian Naive Bayes (GNB) is used for continuous features and assumes a Gaussian distribution [33].
E. K-Nearest Neighbor (KNN) Classifier
Decision Tree Classifiers are a popular supervised learning method used in machine learning for both classification and regression tasks [34]. Their strength lies in their intuitive, tree-like structure that breaks down complex decisions into a series of simpler ones, mirroring human-like reasoning. This makes them easy to understand and interpret, even for non-experts.
The algorithm works by recursively partitioning the dataset into increasingly homogeneous subsets based on the values of input features [35]. Starting at the root node, which represents the entire dataset, the algorithm searches for the best feature to split the data, aiming to create subsets that are as pure as possible in terms of class distribution [36]. This process continues down the tree, with each internal node representing a decision point based on a specific feature. The branches stemming from these nodes represent decision rules, guiding the data towards leaf nodes, which hold the final predictions or class labels.
F. Random Forest Classifier
The Random Forest Classifier is a powerful ensemble learning method used in machine learning for both classification and regression tasks [37]. It operates by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees [38].
The “random” aspect of Random Forest stems from two key concepts: random sampling of the training data and random subspace selection. During the creation of each tree, a technique called bootstrap sampling is employed, where the algorithm randomly selects a subset of the training data with replacement [39]. This means that some data points may be selected multiple times, while others might be left out. This process introduces diversity among the trees, as each tree learns from a slightly different perspective of the data.
V. Methodology
A. Introduction
The goal of this study is to develop and evaluate machine learning models to detect brain tumors using a dataset containing numerical values rather than images. This study employs multiple Machine Learning Classifiers, including Support Vector Classifier (SVC), Logistic Regression Classifier, K-Nearest Neighbors (KNN) Classifier, Naive Bayes Classifier, Decision Tree Classifier, and Random Forest Classifier, to classify individuals as having brain tumors or being healthy. The following sections detail the comprehensive methodology implemented in this study, which is divided into data preprocessing, model building, evaluation, and validation steps.
B. Data Collection
The dataset used in this study consists of MRI images of brain tumors sourced from Kaggle.com. These images were preprocessed to extract numerical features relevant to brain tumor detection, resulting in a dataset with 3761 rows and multiple columns. The columns include ‘Class’ (indicating the presence or absence of a brain tumor), as well as various texture features such as ‘Mean’, ‘Variance’, ‘Standard Deviation’, ‘Entropy’, ‘Skewness’, ‘Kurtosis’, ‘Contrast’, ‘Energy’, ‘ASM’, ‘Homogeneity’, ‘Dissimilarity’, and ‘Correlation’.
Sample images from the dataset are shown in Figure 1, which displays four MRI images (fig 1(a), fig 1(b), fig 1(c), fig 4(d)) of patients without brain tumors, while Figure 2 displays four MRI images (fig 2(a), fig 2(b), fig 2(c), fig 2(d)) of patients with brain tumors.
C. Preprocessing Steps
The extracted features were preprocessed to ensure data quality and consistency. The preprocessing steps included normalization, noise reduction, and feature scaling.
a) Import Libraries
First, essential libraries were imported for data analysis, visualization, and model. As shown in Table 1, the necessary libraries were imported using the following code:
b) Load Dataset
The dataset was loaded into a pandas DataFrame for analysis, as shown in Table 2.
D. Exploratory Data Analysis (EDA)
EDA was performed to understand the distribution and characteristics of the dataset. Techniques such as histogram plotting, box plots, and scatter plots were used to visualize data trends and outliers. These visualizations helped in identifying potential issues such as class imbalance and guided the data preprocessing steps.
a) Heatmap
Heatmaps are a powerful visualization tool in machine learning, used to represent complex data insights intuitively. By leveraging heatmaps, machine learning practitioners can gain a deeper understanding of their data, develop more effective models, and communicate insights more effectively. The code used to generate the heatmap is presented in Table 3:
b) Heatmap of Correlation Matrix
The dataset was loaded into a pandas DataFrame for analysis, as shown in Table 2.
E. Data Splitting
Data splitting is a crucial step in machine learning, where the available data is divided into two subsets: training data and testing data. The training data is used to train the model, while the testing data is used to evaluate its performance. The code used to split the data into training and testing is presented in Table 6.
F. Feature Scaling
Feature scaling, also known as data normalization. It involves transforming numeric features into a common scale, usually between 0 and 1, to prevent differences in scales from affecting model performance. The code used for feature scaling is presented in Table 7.
G. Model Building and Training
After preprocessing the data, various machine learning models were developed and trained to predict the target variable. Both the original and scaled datasets were used to train the models, allowing for a comprehensive evaluation of their performance.
a) Support Vector Classifier
The SVC was trained with both the original and scaled data. Training the model with the original data is presented in Table 8, and training with scaled data is presented in Table 9.
b) Logistic Regression Classifier
The LRC was trained with both the original and scaled data. Training the model with the original data is presented in Table 10, and training with scaled data is presented in Table 11.
c) K-Nearest Neighbor (KNN) Classifier
The KNN was trained with both the original and scaled data. Training the model with the original data is presented in Table 12, and training with scaled data is presented in Table 13.
d) Naive Bayes Classifier
The Naive Bayes Classifier was trained with both the original and scaled data. Training the model with the original data is presented in Table 14, and training with scaled data is presented in Table 15.
f) Random Forest Classifier
The RFC was trained with both the original and scaled data. Training the model with the original data is presented in Table 18, and training with scaled data is presented in Table 19.
H. Result
In the study, several traditional machine learning algorithms were evaluated on a dataset comprising numerical features related to brain tumor detection. The performance of each algorithm was assessed both on the original data and after applying data scaling techniques to standardize the features. The results demonstrated significant improvements in model accuracy post-scaling.
The Support Vector Classifier (SVC) achieved a baseline accuracy of 78.61% on the original data. However, after scaling, the model’s performance surged to 97.74%, highlighting the crucial role of data scaling in boosting accuracy.
Logistic Regression Classifier demonstrated a notable accuracy of 90.96% on the original data. When applied to scaled data, the model’s performance further improved to 95.88%, indicating enhanced class separation and more effective feature utilization.
The K-Nearest Neighbors (KNN) Classifier exhibited an accuracy of 80.61% on the original data. Nonetheless, its performance plummeted to 54.58% on scaled data, suggesting a high sensitivity to the scaling method employed.
Naive Bayes Classifier achieved a commendable accuracy of 95.35% on the original data. Although scaling had a minimal impact, the model’s accuracy edged up to 95.48%, indicating a slight benefit from standardized features.
The Decision Tree Classifier showcased an impressive accuracy of 98.14% on the original data. However, its performance dipped to 90.30% on scaled data, possibly due to altered feature importance or decision boundary adjustments.
Lastly, the Random Forest Classifier achieved an outstanding accuracy of 98.27% on the original data. While scaling had a moderate impact, the model’s accuracy settled at 93.89%, suggesting a reliance on inherent relationships within the original data that were partially disrupted by scaling.
The summarized results of all classifiers in Table 20:
I. Discussion
In this study, we conducted a comprehensive evaluation of six machine learning classifiers: Support Vector Classifier (SVC), Logistic Regression, K-Nearest Neighbors (KNN), Naive Bayes, Decision Tree, and Random Forest on a brain tumor dataset, with a specific focus on the impact of data preprocessing on performance. Our results underscore the critical role of feature scaling in enhancing the performance of Support Vector Classifier (SVC) in medical diagnostics, highlighting its potential as a reliable tool for automated brain tumor detection.
a) Classifiers Performace
While Random Forest achieved the 98.27% of highest accuracy on the original data, SVC emerged as the top performer on scaled data, with an impressive accuracy of 97.74%. This significant improvement in SVC’s performance on scaled data highlights the importance of feature scaling in unlocking its full potential. The disparity in performance between SVC on original and scaled data emphasizes the necessity of preprocessing for optimal results, particularly for classifiers sensitive to data distribution like SVC.
b) Implementations for Clinical Application
The marked improvement in SVC’s performance with feature scaling has important implications for clinical applications. Integrating preprocessing steps can significantly enhance diagnostic accuracy, particularly for classifiers like SVC that rely heavily on data distribution. This finding suggests that SVC, combined with feature scaling, has the potential to become a reliable model for automated brain tumor detection, offering a valuable tool for clinicians in diagnosing and treating brain tumors.
c) Future Work
Future research should explore the integration of deep learning techniques and ensemble learning methods to further enhance detection accuracy and robustness. Additionally, investigating the impact of feature scaling on other classifiers, exploring the application of SVC to other medical diagnostics tasks, and expanding the dataset to include a more diverse range of brain tumor types and stages would provide a more comprehensive evaluation of SVC’s performance and its potential as a clinical tool.
J. Conclusion
This study demonstrates the potential of machine learning classifiers in brain tumor detection, emphasizing the importance of data preprocessing and the careful selection of models based on dataset characteristics. The findings contribute to the ongoing development of accurate and efficient automated diagnostic systems, ultimately aiming to improve patient outcomes through early and precise detection of brain tumors.
In summary, scaling the data substantially enhanced the performance of our brain tumor detection model, with Support Vector Classifier (SVC) achieving optimal results on scaled data. SVC’s accuracy exhibited a notable improvement on scaled data, surpassing the performance of other classifier like Random Forest, which worked best on original data. This highlights the critical importance of data scaling in optimizing SVC’s diagnostic accuracy in brain tumor detection. Future research should focus on refining SVC’s performance with scaled data and exploring how deep learning and combined modeling approaches can enhance detection effectiveness and reliability.
Data Availability
https://www.kaggle.com/datasets/jakeshbohaju/brain-tumor