1. Abstract
Background Expert interrater agreement for epileptiform discharges can be moderate. This reasonably will affect the performance when developing classifiers based on annotations performed by experts. In addition, evaluation of classifier performance will be difficult since the ground truth will have a variability. In this pilot study, these aspects were investigated to evaluate the feasibility of conducting a larger study on the subject.
Methods A multi-channel EEG of 78 minutes duration with abundant periodic discharges was independently annotated for epileptiform discharges by two experts. Based on this, several deep learning classifiers were developed which in turn produced new annotations. The agreements of all annotations were evaluated by pairwise comparisons using Cohen’s kappa and Gwet’s AC1. A cluster analysis was performed on all periodic discharges using a newly developed version of parametric t-SNE to assess the similarity between annotations.
Results The Cohen’s kappa values were 0.53 for the experts, 0.52–0.65 when comparing the experts to the classifiers, and 0.67–0.82 for the classifiers. The Gwet’s AC1 values were 0.92 for the experts, 0.92–0.94 when comparing the experts to the classifiers, and 0.94–0.96 for the classifiers. Although there were differences between all annotations regarding which discharges that had been selected as epileptiform, the selected discharges were mostly similar according to the cluster analysis. Almost all identified epileptiform discharges by the classifiers were also periodic discharges.
Conclusions There was a discrepancy between agreement scores produced by Cohen’s kappa and Gwet’s AC1. This was probably due to the skewed prevalence of epileptiform discharges, which only constitutes a small part of the whole EEG. Gwet’s AC1 is often considered the better option and the results would then indicate an almost perfect agreement. However, this conclusion is questioned when considering the number of differently classified discharges. The difference in annotation between experts affected the learning of the classifiers, but the cluster analysis indicates that all annotations were relatively similar. The difference between experts and classifiers is speculated to be partly due to intrarater variability of the experts, and partly due to underperformance of the classifiers. For a larger study, in addition to using more experts, intrarater agreement should be assessed, the classifiers can be further optimized, and the cluster method hopefully be further improved.
2. Introduction
Automating tasks that are time-consuming and depend on expert knowledge is important to increase availability. This is certainly the case in EEG analysis and deep learning has emerged as a promising way to accomplish this (Craik et al., 2019; Roy et al., 2019). To assess the utility and risks of the technology, it is important to evaluate it, as well as to study different aspects of the conditions for developing deep learning, to optimize the performance. Most deep learning applications for detecting epileptiform discharges (EDs) are developed using supervised learning (Nhu et al., 2022) where deep classifiers are trained to align to human expert assessments. At present time, there is no generally accepted objective way of defining or quantitatively measure EDs and the gold standard is rating by experts. However, experts do not always agree and in this work, interrater agreement for EDs and how it affects training of deep classifiers was investigated.
In cases where there is no way to know which expert is right, one can at least evaluate if how the experts are assessing is reproducible or how much they agree. The most frequently used quantitative measures for the agreement between experts in the setting of rating EDs, as found in the review below, are percent agreement, Cohen’s kappa (Cohen, 1960), Fleiss’ kappa (Fleiss, 1971), and Gwet’s AC1 (Gwet, 2008). The percent agreement is simply the percentage of the data for which the experts agree. Cohen’s kappa, Fleiss’ kappa, and Gwet’s AC1 produce relative scores where the percent agreement is adjusted for the agreement by chance. Cohen’s kappa uses the marginal distribution to estimate the chance agreement (Cohen, 1960) and Fleiss’ kappa is an extension of this to multiple raters. These two kappa scores have been criticized for producing unexpectedly low scores in some cases of high percent agreement and higher scores when the marginal distribution is asymmetrical compared to the symmetrical case (Gwet, 2014). To address the problems with Cohen’s kappa, Gwet (2008) suggests an agreement coefficient, now referred to as Gwet’s AC1, which is less sensitive to variation in the marginal distributions. To interpret the kappa scores, Landis and Koch (1977) suggest: 0–0.20, slight agreement; 0.2–0.40, fair agreement; 0.4–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, almost perfect agreement.
Studies assessing interrater agreement for EEGs can be divided into the rating of individual EDs and rating the presence of EDs in whole EEGs. In the former case, the median agreement score is 0.56 (min: 0.30; max: 0.81) (Bagheri et al., 2017; Beuchet et al., 2017; Dümpelmann and Elger, 1999; Halford et al., 2017; Halford et al., 2018; Jing et al, 2019; Wilson et al., 1996). In the latter case, the median score is 0.67 (0.16, 0.83) (Abend et al., 2011; Abend et al., 2017; Azuma et al., 2003; Black et al., 2000; Gaspard et al. 2014; Hussein et al., 2017; Mani et al., 2012; Nguyen et al., 2010; Piconelli et al., 2005; Reus et al., 2022; Stroink et al., 2006; Zhuo et al., 2019). The median agreement is thus higher for whole EEGs, but the number of studies is relatively low, they use different types of agreement scores, the number of raters vary from 2 to 49, which makes the difference difficult to assess. A few studies compare the two conditions, and the scores are higher when assessing whole EEGs (Black et al., 2000; Jing et. al., 2019). In summary, the agreement for EDs has a large variation and there are methodological differences that make comparisons of different studies difficult, but overall, the agreement is moderate for individual EDs and possibly substantial for the presence of EDs in whole EEGs.
Expert annotations are used as labels in classification tasks in the setting of supervised learning in deep learning. From what has been presented above, there will thus be a label noise due to disagreement between experts for EDs. It is known that label noise has a negative impact on training deep classifiers in medical imaging (Karimi et al., 2020; Abdala and Fine, 2023). The subject has to the best of our knowledge not been thoroughly investigated in clinical neurophysiology, but it is reasonably a general problem. Label noise will also make evaluation harder since it is not known if the labels are correct—an error in the prediction accuracy may be either due to an error made by the expert or the classifier. When reviewing studies where deep learning is applied to ED classification, the most used performance measures are accuracy (ACC) and area under the curve (AUC). Including studies using either or both of these measures, twenty-two relevant studies were found spanning the years 2016–2023. The median ACC is 0.90 (min: 0.71; max: 1.00) and the median AUC is 0.94 (0.76, 1.00) (Abou Jaoude et al., 2020;Antonaides et al., 2017; Cheng et al., 2022; Chung et al., 2023; Faghihpirayesh et al., 2021; Fürbass et al., 2020; Geng et al., 2021; Jeon et al., 2022; Jiang et al., 2023; Johansen et al., 2016; Kural et al., 2022; McDougall et al., 2023; Medvedev et al., 2019; Nejedly et al., 2023; Nhu et al., 2023; Thangavel et al., 2021; Thomas et al., 2020; Thomas et al., 2021; Tjepkema-Cloostermans et al., 2018; Wang et al., 2023; Wei et al., 2021; Zhang et al., 2023). In these studies, there is a large variation in data sizes, in the methods to calculate ACC (e.g., some used balanced ACC) and AUC (e.g., ROC vs. precision and recall), making comparisons difficult.
The commercial system Persyst 13 has been evaluated in several studies, and its performance in detecting EDs is often characterized as ‘noninferior’ compared to experts (Halford et al., 2018; Joshi et al., 2018; Kural et al., 2022; Reus et al., 2020; Reus et al., 2021; Reus et al., 2022), but it is unclear to what extent it is based on deep learning. Searching the literature, no other studies were found where both interrater agreement for EDs between experts and between experts and deep learning classifiers are assessed. Wang et al. (2023) report an AUC of 0.99 for their classifier and a Cohen’s kappa of 0.91 when comparing it to three experts. The classifier developed by Thomas et al. (2020) achieve an AUC of 0.84 and percent agreement of 81 with one expert. In addition, in the study by Abou Jaoude et al. (2020), they report that 36 percent of false positives made by their classifier are retrospectively assessed as EDs missed by the annotating expert.
In this pilot study, two experts annotated EDs, and different classifiers were created based on their annotations. The agreements between the experts’ and classifiers’ annotations were assessed and the results were also visualized using a newly developed clustering technique (Svantesson et al, 2023). The main aim was to evaluate the feasibility of performing a larger study.
3. Materials and methods
3.1. EEG data
One EEG from the Temple University Hospital EEG Corpus database (Obeid and Picone, 2016) was used for this study. The EEG had been extracted previously in the context of another study (Svantesson et al., 2021). The duration was approximately 78 minutes sampled at 250 Hz. Electrodes were placed according to the international 10–20 system. Twenty-one electrodes were used (Fp1, Fp2, Fz, F3, F4, F7, F8, Cz, C3, C4, Pz, P3, P4, T3, T4, T7, T8, O1, O2, A1, and A2).
The EEG represented a focal status epilepticus with continuous periodic discharges (PDs) with varying morphology, of which a subset was epileptiform (Fig. 1). It was selected for the study because the exam was somewhat longer than a standard exam, it contained relatively large amounts of candidate EDs which had a variation in morphology but relatively similar electric fields. The expectation was that this would provide enough data for training and analysis, while still being a reasonable task for the experts to perform.
3.1.1. Preprocessing
For the visualization, the EEG was bandpass filtered between 0.3 and 40 Hz. For deep learning classifiers and cluster encoders, the EEG was bandpass filtered between 1 and 40 Hz, re-referenced to the common average, and normalized by dividing by the 99th percentile of the absolute amplitude. All filters were implemented in Python as 5th order Butterworth filters using scipy.signal (Virtanen et al., 2020) and zero-phase filtering.
3.1.2. EEG annotation
Two senior specialists in clinical neurophysiology (author M.T. and another member of the research group, non-author; both with more than 20 years of experience) independently annotated the EEG for EDs. They were instructed to only annotate potentials with a strict epileptiform morphology, where each potential should be assessed independent of other potentials. No further instructions or definitions were provided of what constitutes EDs. From their annotations, two additional annotations were created by taking the union and the intersection of the two original annotations.
The EEG was also annotated for PDs (author M.S., specialist in clinical neurophysiology). It was assumed that the same biological process generated all the PDs, i.e., discharges of epileptiform morphology were not generated by another process. The intention was thus to let the epileptiform discharges be a subset of the PDs and to guarantee this, the union of all annotations were taken as the final annotation of PDs.
3.1.3. Visualization and manual annotation
A GUI implemented in Python was created for visualizing and annotating the EEG. The visualization was calibrated to display amplitude as 100 μV per cm and time as one second per three cm. Fifteen seconds of EEG was displayed at a time. It was possible to navigate forward or backward in steps of 1 or 15 seconds, and to jump directly to the start or the end. Navigation was possible by using the mouse and clicking on buttons in the GUI, or by using designated buttons on the keyboard. The EEG could be visualized in a common average reference or a longitudinal bipolar montage.
EDs were annotated by using the mouse cursor to place markings. The experts were instructed to position the markers at the peak of the maximum of each discharge. Each marker only represented the time of the discharge, not which EEG-lead the maximum was located in. Each annotation was saved as an array of zeros and ones, where ones signified EDs. The length of the arrays was equal to the length of the EEG (78 minutes sampled at 250 Hz). For each marking the 29 preceding and 30 succeeding positions in the array were set to one. Each EDs would then have a length of 60 sample points or 240 ms.
3.2. Deep learning
All deep learning was developed in Python (version 3.7.13) using the API Keras (version 2.8.0) and the programming module TensorFlow (version 2.8.0). Three different computers were used, equipped with 64–128 GB of RAM, and Nvidia graphics cards (Quadro P5000, Quadro P5000 RTX, or two Titan RTX).
The Adam optimizer was used in all training (Kingma and Ba, 2015) with parameters β1 = 0.5 and β2 = 0.9. A learning rate of 1e-5 was used for the classifiers and 1e-4 for the cluster encoder. Batch sizes of 500 were used in all instances. The binary cross-entropy was used as loss function for the classifiers. To reduce differences due to random variations, the classifiers were initialized using the same values for the weights.
In what follows, channel refers to EEG-channel.
3.2.1. Classifiers
All classifiers used the same architecture (Fig. 2) and hyperparameters. The input size to the classifiers were 21×60 (channels x time). Each classifier started with five convolutional layers; each layer was followed by a scaled-exponential rectifying unit (SELU) (Klambauer et al., 2017), and a max pooling layer which downsampled by a factor of two. Filter kernel sizes were three and the number of filters per layer were 32, 64, 128, 256, and 512. Channels were analyzed separately. Thereafter, four fully connected layers followed, the three first having 512 nodes. SELU activations was used for all layers except the last, which had one node and a sigmoid activation to complete the classifier. The total number of trainable parameters was 6,640,769.
Due to the small data size, a five-fold cross-validation strategy was used in the evaluation. The EEG was partitioned into five segments, training was performed on four segments with evaluation on the remaining segment, and this was repeated for all five combinations so that the whole EEG was evaluated.
Using a validation set in every instance to determine when to end training proved difficult due to large variation in validation values from iteration to iteration. Instead, repeated training was performed using the annotation of expert 1, withholding 15 percent of the intended training data as validation data, and then averaging the validation loss and accuracy. From these tests, it was determined that 3,000 iterations (corresponding to roughly 12 minutes of training for the used computer) was adequate to maximize the accuracy of the validation data without exaggerating the overfitting. Comparisons were also made between using different distributions of the three categories: non-ED/non-periodic, non-ED/periodic, and ED data. The large variation in validation output made comparisons difficult, but there did not seem to be a distribution that would produce a high accuracy without producing many false positives and minimizing false positives or both false positives and negatives yielded low number of true positives. In the end, it was decided towards aiming to detect an equal amount of EDs as the respective annotation since the aim of the study was to compare the annotations. A distribution of 0.15, 0.75, and 0.1 (non-ED/non-periodic, non-ED/periodic, and ED) was therefore used for the final results. It was not tested to use different weights for the categories in the loss function.
To produce as much variation as possible in the training data, batches of examples were generated by random selection from the EEG. Each example was selected by first randomly selecting one of the three categories defined above according to a given distribution, and then generating a random position in the EEG that belonged to the selected category. An example was defined as ED if > 75 percent overlapped with an ED marking, as periodic non-ED if not ED and > 75 percent overlapped with a periodic marker, and as non-periodic/non-ED otherwise. The input size of 60 samples was chosen so that the EDs would fit inside while still allowing some variation where the EDs were located (cf., Fig. 18) and without risk of overlap between EDs (or PDs).
Since EEG is continuous data where EDs can occur at any time, the training and test data was for each classifier scanned in steps of 15 samples, i.e., overlapping. An array of ones and zeros was created where each segment classified as ED was marked as ones (60 sample steps), and the remaining array positions were marked as zeros. The scanning process made some markings longer than 60 sample steps (75 or 90 steps). The evaluation time for each fold was 45–50 minutes.
3.2.2. Cluster encoder
Cluster analysis was performed on the PDs, to better understand the results. A further development of a deep learning-based cluster encoding published previously (Svantesson et al., 2022) was used. The clustering is based on the principle of t-distributed stochastic neighbor embeddings (t-SNE) where high-dimensional data can be reduced into two dimensions to produce visualizations (van det Maaten and Hinton, 2008). The improved version used here, produced more distinctly separate clusters and in addition produced identifiers for each cluster. Details on the cluster encoder developed for this work is presented in Appendix 9.3.
3.3. Statistics
In the evaluation, the data were analyzed as consecutive examples of 240 ms duration. The whole EEG then amounted to 19,408 examples. When comparisons were made between different annotations, overlapping markers corresponding to the same potentials had to be aligned since the exact position would vary between both experts and classifiers. The result of this procedure was visually inspected by plotting the original and aligned markers across the EEG to ensure accurate results.
In previous studies, presented in the introduction, there is a large variation in used metrics to assess the accuracy of classifiers. In this pilot study a battery of metrics was included. In addition to ACC and AUC based on the receiver operating characteristic (R-AUC), balanced accuracy (B-ACC), AUC based on precision-recall (PR-AUC), F0.5, and Matthew’s correlation coefficient (MCC) were used. The library sklearn.metrics (Pedregosa et al., 2011) was used to calculate these metrics. For R-AUC, average was set to “weighted”. For F0.5, average was set to “binary”. Sensitivity, specificity, precision, and negative predictive value (NPV) was also calculated.
Interrater agreement was assessed by pairwise comparisons using both Cohen’s kappa and Gwet’s AC1. Cohen’s kappa was computed using sklearn.metrics.cohens_kappa_score. Gwet’s AC1 was computed using irrCAC (https://pypi.org/project/irrcac/; accessed 2023-10-15).
4. Results
4.1. Annotations
Expert 1 identified 1,709 EDs, expert 2 identified 1,430 EDs, the union of expert 1 and 2 resulted in 2,253 identified EDs, and the intersection of expert 1 and 2 resulted in 886 identified EDs. There were 8,107 identified PDs. Examples of marked potentials are presented in the Appendix, section 9.1. All experts used the average referenced montage, and the annotation time was about two hours.
4.2. Classifiers
For the classifiers to identify roughly the same number of EDs as their respective expert annotation they were trained on, the thresholds for identification based on the sigmoid output were adjusted to 0.48, 0.45, 0.35, and 0.55.
The performances as assessed by the metrics were moderate in most instances (Tab. 1): B-ACC 0.76–0.82, R-AUC 0.52–0.72, PR-AUC 0.51–0.65, F0.5 0.53–0.75, MCC 0.51–0.72, sensitivity 0.54–0.75, and precision 0.55–0.75. Most of the EEG was non-ED, and this was reflected in higher scores for ACC 0.94–0.96, specificity 0.97–0.98, and NPV 0.97-0.98. The best overall result was obtained for classifier U. The metrics were also calculated comparing the experts using expert 1 as ground truth. In comparison to the classifiers most values were somewhat lower for the experts.
Classifier 1 identified 1,703 EDs, classifier 2 identified 1,431 EDs, classifier U identified 2,228 EDs, and classifier I identified 862 EDs. Out of these false positives, almost all were within the PD population (Tab. 2).
The Euclidean distance (‖x − y‖) between periodic non-EDs (x) and EDs (y) for the annotations of expert 1 and 2 and the corresponding classifiers were calculated for each channel. The distribution of distances across the electrodes were mostly similar with largest difference for Fp1, Fp2, F8, Cz, Pz, and O1. There were however some differences. Most emphasis was on Fp1, Fp2, Pz, and O1 for expert 1 and on Cz, Pz, and O1 for expert 2. This was also reflected by the classifiers. The classifiers thus preserved the distribution of the distances between periodic non-EDs and EDs compared to the experts (Fig. 4), but the distances increased and more notably so for expert 2 (Tab. 3).
4.3. Interrater agreement
The Cohen’s kappa value for the experts was 0.53 and the kappa values when comparing the experts to the classifiers were 0.52–0.65 (Tab. 4). There was thus a moderate and comparable agreement between the experts and between experts and classifiers. The kappa values when comparing the classifiers with each other were 0.67-0.82 (Tab. 4), mostly corresponding to a substantial agreement.
Using Gwet’s AC1 produced a different result with almost perfect agreement. For the experts, the value was 0.92 and when comparing the experts to the classifiers the values were 0.92–0.94 (Tab. 4). When comparing the classifiers (trained with different annotations) to each other, the values were 0.94–0.96 (Tab. 4).
The percent agreement comparing the experts was 93 percent, experts compared to the classifiers 93–96 percent, and comparing the classifiers 95–97 percent (Tab. 4).
4.4. Cluster analysis
A cluster analysis was performed for the annotations of expert 1 and 2 and the corresponding classifiers. The cluster encoder produced distinct clusters of the PDs and the automatic cluster identification works (Fig. 5). Unfortunately, there were three examples that produces zero output (Fig. 5; cluster 11).
Most of the discharges identified as EDs aggregated, and the distribution of aggregates were similar for experts and classifiers (Fig. 6 and Fig. 7). In most instances, differences concerned data points that were close. There were also indications that the experts assessed more consistently different relative each other for some discharges, and that this affected the learning of the classifiers (Fig. 8). For example, in cluster 2 and 9, expert 1 and 2 tended to select from opposing sides of the clusters. In cluster 2, expert 1 selected more compared to expert 2, while the opposite was true for cluster 9. There were also differences between the experts and the corresponding classifiers, e.g., in cluster 1, the classifier was more prone to identify EDs compared to expert 1 while the opposite was true for expert 2 and the corresponding classifier. The experts only identified a few scattered EDs in cluster 10, only one of these were common to the experts. Expert 2 also identified a few in cluster 4, where expert 1 did not identify any EDs. The classifiers did not identify any EDs in any of the two clusters.
Averages of the discharges for each cluster are presented in Appendix 9.2. It is difficult to draw any major conclusions from the averages. There was some variation in the electric fields of the clusters. For the O1 electrode, averages of EDs were of higher amplitude and of sharper morphology compared to non-EDs. The overall averages of cluster 4 and 10 had a blunter morphology compared to the other clusters, in line with that non to few were identified as EDs in the annotations.
5. Discussion
5.1. Agreement scores
For the percent agreement, Cohen’s kappa and Gwet’s AC1, comparing classifiers achieved the highest agreement, followed by experts compared to classifiers, while comparing the experts produced the lowest scores. As expected, agreement scores were higher when using Gwet’s AC1, where differences in agreement were small.
The two experts identified 1,709 and 1,430 EDs, respectively, but only agreed on 886. According to the often-used interpretation by Landis and Koch (1977), Gwet’s AC1 indicate an almost perfect agreement for the experts. This is not a reasonable interpretation and support the conclusion by Vach and Gerke (2023) that this ordinal scale should not be used for Gwet’s AC1.
5.2. Disagreement – similar or different?
Although there were differences between the annotations, the cluster analysis showed that annotated examples tended to aggregate, i.e., were similar, and that the different annotations showed similar distributions of aggregates. The cluster analysis also showed that differences in expert annotation to some extent induced corresponding differences for the classifiers (e.g., Fig. 8 A and B; clusters 2, 3, and 9). However, as the clustering method is new, it is uncertain how strong conclusions that can be made, and as discussed below, it must be further and more systematically evaluated.
Jing et al. (2019) suggest that experts rate EDs based on the same internal model and that differences in rating mostly are due to differences in their thresholds for scoring EDs. Assuming that: 1) the experts use the same model for classifying, 2) the only significant difference between them is a threshold, and 3) that the thresholds are simple and stable; then most of the examples classified as EDs by the expert with higher threshold should be a subset of the other expert’s selected examples (Fig. 9 A). This was not the case here since each expert has a subset of significant number of examples that were uniquely scored as EDs by the respective expert (Fig. 9 B).
The distances between EDs and non-EDs (Fig. 1) show that the annotation of expert 1 seemed to emphasize the dipole O1, Fp1, Fp2, while the annotation of expert 2 emphasized the peak Cz, Pz, and O1. This was reflected in how the classifiers performed. The result thus indicates that the experts to some extent rate differently and that this affects the learning of the classifiers.
Interpreting the cluster analysis cautiously, part of the data seems to be assessed more differently by the experts (e.g., Fig. 8 A; clusters 2, 3, and 9). Assuming that scoring is based on a common model, this could imply a multidimensional threshold, and the more different assessments may represent dimensions where the experts had a more consistent difference in threshold. Part of the data that has been scored differently seem to be around more similar data points, which could imply that the thresholds are closer but vary over time, i.e., intrarater variability. In this case, it can thus vary which expert is most sensitive to scoring, and this could contribute to different but similar annotations.
5.3. The classifiers
There are several factors the classifiers cannot learn. For instance, the annotation process consumed several hours, and fatigue may have affected the thresholds. This is a form of external factor, not encoded in the EEG. The experts were instructed to assess each candidate discharge independently, but recently scored material may still affect the threshold, e.g., if there is a period where discharges are blunter, the threshold may be lowered for scoring a discharge as epileptiform. Of course, there may also be long-term biological dependencies in the data that are clinically relevant and influence how the experts detect EDs. The classifiers cannot learn variation caused by external factors, or long-term dependencies since this type of classifier’s assessment is independent of the location in the data. It is therefore further speculated that the classifiers learn some form of average threshold.
In a practical setting, how to balance between false positives and negatives will depend on the application. For the alternatives tested, none produced a perfect classifier, and since the aim of the study was to compare annotations, it was decided to train the classifiers toward producing roughly the same amount of EDs as the annotations they were trained on. There were limitations in creating the classifiers, e.g., they had a very basic architecture, no extensive hyperparameter tuning was performed, and the data size used to train them was relatively small. A problem during this work was the large variation in validation values (accuracy, false positive and negative predictions) from iteration to iteration during training. This made it difficult to predict how the resulting classifiers would perform when tested (it was hard to aim for something). The weights of the classifiers were initialized with the same randomized values, but there were other sources of randomness that occurred during training, e.g., randomization of the training data, that could have contributed to a variation in the final results. In a larger study, this variation could be assessed by training multiple classifiers on the same data. There was some correlation between the performance metrics (Tab. 1) and the number of PDs in each annotation with higher values for larger number of PDs. It may be possible to tune the classifiers to classify more aligned to the experts, but the relatively small dataset will set a limit to what is possible to achieve. Interpreting the (balanced) performance metrics in a traditional way, the classifiers would be assessed as performing relatively poorly. However, when comparing the experts to each other using the same metrics, the classifiers had a better performance. Almost all false positive EDs identified by the classifiers were within the PD population. Part of the false positives would possibly be accepted as EDs, and part of the false negatives would possibly be accepted as rejected EDs by experts in a Turing test. Regardless of room for finetuning the classifiers and using larger data sizes, it is speculated that the classifiers probably cannot learn to reproduce the experts’ annotations exactly, and that intrarater variability at least to some degree could contribute to this.
5.4. The EEG data
The aim of the study was not to settle the question of what constitutes EDs, but to study how rating varies between experts and how this affects the learning by classifiers. Only one EEG was used for this study. There were several reasons for this choice. The EEG was longer than a standard exam, so it would provide more training data for the classifiers while still being reasonable for an expert to annotate in one session. The chosen exam represented a focal status epilepticus and so would contain relatively large amounts of candidate EDs. This would also increase the chance of training good classifiers based on a smaller amount of EEG. The PDs had a variation in morphology, but relatively similar electric fields. Hence, there would be a combination of a large but still limited variation that would provide more extensive testing of the scoring thresholds of the experts. The times to perform the annotations were shorter than expected and it seem possible to use recordings of longer duration.
5.5. The cluster method
The basis for the cluster encoder has been evolving over an extended period of time. The current version of the encoder has several limitations. Similar to k-means clustering, the number of clusters has to be decided beforehand. Furthermore, there is no guarantee that all clusters will be utilized. The architecture may cause zero output for some data, and this is a more serious problem. This is represented by cluster 11 in the results. The number of clusters can affect the degree of zero output. The loss function will promote a uniform distribution of examples across the clusters. This may cause blending of categories in clusters if the training data have skewed prevalence of categories. In addition to the number of clusters, there are several new parameters that must be chosen (e.g., neighbors, sigmoid distribution, weights for loss function).
For different data tested during the development, the best results have been produced using 5–15 clusters and a low number of neighbors (1–10). In many instances, a problem has rather been that clusters tend to be almost point-like, compared to the more confluent character produced by previous versions of cluster encoders. The slope parameter S of the sigmoid distribution may be less important but using a steep slope (larger value) might promote clusters to expand. Similarly, but probably more important, the bias parameter b may also cause clusters to expand by using larger values. The sigmoid distribution can produce very small values. It may be the case that differences in smaller values may not convey information about rank, but rather be due to machine precision. Furthermore, in practice, to avoid division by zero, a lower limit must be set for the distribution. If the sigmoid distribution produce values below this limit, the corresponding data points will be modeled as equidistant. Using the binary cross-entropy may therefore not be a good option, but the impression was that it improved the results.
Possibly, the clustering and separation is artificially good, i.e., the architecture forces the data into well separated clusters. Furthermore, there were a few data points color coded as belonging to one cluster but are closer to another (e.g., cluster 2 and 5; Fig. 5). It seems more likely that this color coding was an effect of the loss function promoting a uniform distribution of examples per cluster rather than coding for similarity. However, for most data, the color coding seemed reasonable relative the similarity.
5.6. Conducting a larger study
In this pilot study, only two experts annotated the data, and the experts were also members of the research group. Studies by Halford et al. (2017) and Bagheri et al. (2017), imply that around ten experts can yield a stable group consensus, and is a reasonable size for a larger study. Ideally, a larger study should include experts from different centers and non-members of the research group. In addition, intrarater agreement could be assessed by having the experts perform the annotation twice on separate occasions. The EEG data and performed annotations in this pre-study can be used to optimize the classifiers to perform more aligned to the experts. New EEG data can then be used to assess the larger expert group. The new data should be similar, but if possible have longer duration.
6. Conclusions
The different measures of interrater agreement disagree. However, the general conclusion, including the cluster analysis, is that experts and classifiers rate similar discharges as EDs. From a practical point of view regarding the applicability of machine learning to EEG assessment in clinical practice, these techniques may well come to contribute to more uniform evaluations, possibly closer to the biological basis of the assessed signals. The small sample size of the study makes the results uncertain and strong claims cannot be made.
Data Availability
EEG data were taken from https://isip.piconepress.com/projects/tuh_eeg/
8. Author Contributions
M.S.: conceptualization, formal analysis, investigation, methodology, project administration, resources, software, visualization, writing—original draft, writing—review and editing. A.E.: supervision, writing—review and editing. M.T.: supervision, writing—review and editing.
7. Acknowledgements
We are very grateful to dr Saad Nagi for supporting this work. Grants were received from Region Östergötland (RÖ-974228, RÖ-962769, RÖ-941377, RÖ-986017, LIO-936176, and RÖ-941359). A.E. was supported by the ITEA3/VINNOVA funded project (AS-SIST).
9. Appendix
9.1. Examples of discharges
All discharges were first sorted according to increasing amplitude in the O1-electrode and twenty-one evenly spaced discharges from this order were then plotted (Fig. 10; Fig. 11; Fig. 12).
9.1.1. Periodic discharges
9.1.2. EDs by Expert 1
9.1.3. EDs by Expert 2
9.2. Average discharges of clusters
Averages for the clusters were calculated for both experts and the corresponding classifier. In the figures below, the same color code as in Fig. 5 was used. For each cluster the following averages were calculated of: 1) all discharges, 2) non-EDs of the expert, 3) non-EDs of the classifier, 4) EDs of the expert, and 5) EDs of the classifier.
9.3. Cluster encoder
Cluster analysis was performed using a further development of parametric t-distributed stochastic neighbor embedding (t-SNE) that allowed for automatic identification of different clusters.
In t-SNE, a low-dimensional representation (LDR) can be generated of a high-dimensional representation (HDR) of data (van det Maaten and Hinton, 2008). This is accomplished by matching probabilities of data points being neighbors in both the HDR and LDR. The HDR is assumed to have a normal distribution, the LDR a t-distribution, and the distributions are match using the Kullback-Leibler divergence. In parametric t-SNE, this is implemented as a fully connected neural network (van der Maaten, 2009). In recent work, this was adapted to EEG to create cluster encoders (Svantesson et al., 2023). The encoders consisted of convolutional layers that produced a new HDR. Instead of following a normal distribution, the corresponding neighbor distribution was assumed to follow a two-value function based on ranked distances. That is, data points below a certain rank threshold were regarded as neighbors and given a high probability, while those above the threshold were given a low probability.
In this work the method was further developed. The encoder (Fig. 15; Tab. 5) produced two HDRs, Z and C, that were both matched to a LDR Y, as well as to each other. The two-value function used previously only model whether points are neighbors or not. To additionally provide information of the rank, i.e., a proxy value of distance, a sigmoid probability distribution was used instead. The Kullback-Leibler divergence emphasize high probabilities, i.e., close data points. To possibly model distant data points better, the binary cross-entropy was used.
The cluster encoder consisted of five main parts (Fig. 15; the corresponding parts indicated by numbers in the figure):
The first part of the cluster encoder was a section of convolutional and max pooling layers, similar to the classifiers, which produced the representation Z. Convolutions were performed channel-wise and the size of Z was 21×64.
The second part started with a section of fully connected layers analyzing channel-wise, which was followed by a section of locally connected layers. The channels were kept separate. The first section thus performed a more general waveform analysis, while the second section performed analysis adapted to each channel. The basic size was kept 21×64.
The third part received input from the second part and consists of fully connected layers which ended in a layer with the number of nodes equal to the number of clusters, and this was followed by a softmax which produced the representation C. This part was then ended by a ReLU with a threshold of 0.5. The idea is that the combination of a softmax function followed by a ReLU with threshold 0.5 can at most generate a value (0.5–1) through one node. The output size from this part was 10 and it was used to produce automatic cluster identifier (numbers or color code).
The fourth part also received input from the second part and consisted of fully connected layers. The output size from this part was 10×32.
The fifth part received input from the third and fourth part. These inputs were first multiplied and then processed by locally connected layers and ended in a fully connected layer with two nodes which produced the LRD Y. In the locally connected layers, the basic size was the number of clusters times the size of the locally connected layers, which for this work was 10×32.
Locally and fully connected sections of parts two to five had the general structure of blocks composed of a series of layers (Fig. 16): connected layer, ReLU, connected layer, ReLU, add residual, and layer normalization. The first connected layer either expanded or contracted the input size to 512, while the second connected layer restored it to the original size. The block is reminiscent of the feedforward network of the transformer (Vaswani et al., 2017).
In part two, channel-wise fully connected means that the same fully connected layer was applied to each channel, while locally connected means that each channel had its own connected layer. In parts three and four, the channels were analyzed together in fully connected layers. In part five, each cluster was analyzed separately using locally connected layers, except in the last layer, which was fully connected and produced the LDR.
To be able to extract the HDRs Z and C, the full encoder was implanted in three sections:1) part one, 2) part two, three and four, and 3) part five.
The intention with the architecture of the full encoder was that part one and two break down the signal into characteristic features for each channel. Based on these, part three classified the signal into one cluster, while part four analyzed and conveyed information of the mapping into lower dimensions. In part five, the outputs from part three and four were multiplied. Since part three only can produce one non-zero value, this selected information from part four and guaranteed the cluster encoding of part three was perpetuated.
There were two main problems arising from this architecture. One was that it was not guaranteed that input data would produce a non-zero output. The other was that the encoder may not use all possible clusters. There were several components in the loss function to counteract this. During training the batch size was 500 and the encoder had 10 clusters. The loss function then consisted of six parts:
To promote the encoder to learn to distribute data uniformly across clusters, the sum across the batch for each cluster for the output from part three was set to match 500/10 using the sum of the absolute error. The main reason for doing this was to promote the usage of all clusters, but also to avoid non-zero outputs.
To promote values above the ReLU threshold and for them to be closer to one, the sum for each example for the output from part three was set to match one using the sum of the absolute error. The main reason for doing this was to avoid non-zero outputs.
To promote values closer to one, a one hot encoding was created from the output of part three, and the representation C was set to match this using sum of the absolute error. This had an important effect on the separation of clusters.
There were three instances of matching representations using the binary cross-entropy:
The sigmoid distribution of Z and t-distribution of C.
The sigmoid distribution of Z and t-distribution of Y
The sigmoid distribution of C and t-distribution of Y.
These subfunctions were weighted 10, 1, 1, 1, 1, and 1, respectively, and then summed to a total loss. Weighting part 1 higher was necessary to promote usage of all clusters.
The sigmoid distribution based on the ranked distances rij between example i and j was computed according to where n is the number of neighbors, the parameter S determine the slope, the term B(rij) is used to provide the possibility to accentuate the difference between neighbors and non-neighbors with B(rij) = b, rih ≤ n and B(rij) = 0, rih > n, where b is a constant (examples of distributions using different parameter values are shown in Fig. 17).
To further promote data points belonging to the same cluster being close, the ranked distances were conditional on cluster. This mean that during training, data points belonging to the same cluster were automatically ranked lower compared to the remaining data points. The distribution was computed before training and stored as an array, and during training examples could be assigned their values based on rank. Most of the processing time is due to computing the ranks, so using this approach does not add much time to training compared to using the two-valued distribution. In this project, parameters were set to n = 5, S = 1, and b = 1,000.
The encoder was trained on the PDs. Each example had a duration of 64 samples (256 ms). During training, the input EEG-interval to the encoder was randomly shifted ±16 samples. In parallel, the distance computed for the HDR was based on independently randomly shifted input EEG-interval. The intention was to dissociate the exact position of the marker from the HDR (Fig. 18).