Abstract
This paper proposes a deep learning (DL) model for automatic sleep stage classification based on single-channel EEG data. The DL model features a convolutional neural network (CNN) and transformers. The model was designed to run on energy- and memory-constrained devices for real-time operation with local processing. The Fpz-Cz EEG signals from the publicly available Sleep-EDF dataset were used to train and test the model. Four convolutional filter layers were used to extract features and reduce the data dimension. Then, transformers were utilized to learn the time-variant features of the data. To improve performance, we also implemented subject-specific training before the inference (i.e., prediction) stage. With subject-specific training, the F1 scores were 0.91, 0.37, 0.84, 0.877, and 0.73 for the wake, N1, N2, N3, and rapid eye movement (REM) stages, respectively. The performance of the model was comparable to that of state-of-the-art works that have significantly greater computational costs. We tested a reduced-size version of the proposed model on a low-cost Arduino Nano 33 BLE board, and it was fully functional and accurate. In the future, a fully integrated wireless EEG sensor with edge DL will be developed for sleep research in pre-clinical and clinical experiments, such as real-time sleep modulation.
I. Introduction
Sleep quality and health are closely related; therefore, it is important to understand one’s sleep quality in order to improve one’s health. One measure of sleep quality is the time spent in each sleep stage. There are five sleep stages: wake, N1, N2, N3, and REM, with N1 to N3 representing progressively deeper sleep. Most of the sleep occurs between stages N1 and N3 [1]. The clinical evaluation of sleep stages is performed by polysomnogram (PSG), a procedure that records one’s electroencephalogram (EEG), electrooculogram (EOG), and other physiological features. Medical professionals then manually classify the subject’s sleep stages over time according to one or more of the features mentioned above.
With the development of electronic technology and machine intelligence, wearable devices, such as smartwatches, can measure user biosignals and potentially classify their sleep stages. However, the cost of these devices is high and the performance of sleep stage classification is limited. In addition, classification often requires the transmission of data to mobile phones or the cloud, raising concerns about cybersecurity [2].
High-quality real-time sleep classification and sleep modulation still need to be performed in sleep laboratories. There is a need for low-cost at-home sleep monitoring devices that can perform sleep stage classification on device, and potentially use the sleep stage to generate auditory stimulation for treating sleep disorders or enhancing sleep quality [3].
In this paper, we develop a lightweight DL model for running on devices with restricted energy and memory, such as microcontrollers [4]. There are two main constraints to the development of the model for hardware. The first constraint is that the size of the model will be limited by the memory resources available on the hardware, including the non-volatile Flash memory for model storage and the random-access memory (RAM) for model computing. The second constraint is that the computational demand of the model will be limited by the clock rate, bit width, and computational capabilities (such as floating point or fixed point) of the device. Key trade-offs are between model performance and complexity. In this work, we developed a DL model that can run on a low-power wireless microcontroller, based on a low-cost Arduino Nano 33 BLE development board. Despite its small size, our model achieved performances comparable to the state-of-the-art during a validation using a publicly available sleep dataset.
Fig. 1 shows the overall block diagram of the envisioned fully integrated sleep stage classification device featuring the developed DL model. The device will be miniature in size and fully self-contained. It can enable a wide range of sleep research in pre-clinical and clinical studies.
II. Methods
A. Dataset
Sleep-EDF Expanded Database (version 1, published in 2013) contains 197 whole-night Polysomnographic (PSG) sleep recordings [5], [6]. It contains two subsets, the Sleep Cassette Study (SC) and Sleep Telemetry Study (ST). Data from the Sleep Telemetry Study was obtained in 1994 to study the effects of temazepam on sleep. Since we are proposing a model to classify the stages of sleep of healthy people, we will not use data from this subset. Our experiments will be performed on the SC subset.
Two 20-hour PSG recordings were taken for 77 subjects between the ages of 25 and 101. The first nights of subjects 36 and 52, and the second night of subject 13, were lost. The PSG recordings contain three channels of EEG signals, one channel of EOG and chin EMG signals, oronasal airflow, rectal body temperature, and an event marker. To reduce our model’s size, we only use Fpz-Cz EEG signals as input to our model. The EEG signals were sampled at 100 Hz. Each 30-second segment of the signals was labeled by well-trained sleep experts. There are eight labels: N1, N2, N3, N4, Wake, REM, MOVEMENT, and UNKNOWN. To make our results consistent and comparable with previous studies [7]–[9], we preprocessed the data with the following methods:
Discarded the segments with UNKNOWN and MOVEMENT labels.
Combined N4 and N3 together as N3 stage.
Ignored wake epochs longer than 30 minutes outside of sleep periods.
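The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the label spellings, the function names `clean_epochs` and `trim_wake`, and the reading of the third step (keep at most 30 minutes, i.e. 60 epochs of 30 seconds, of wake before the first and after the last sleep epoch) are all assumptions.

```python
from typing import Any, List, Tuple

Epoch = Tuple[str, Any]  # (label, 30-second signal segment)

# Label spellings are assumptions; the text only names the categories.
DISCARD = {"UNKNOWN", "MOVEMENT"}
MERGE = {"N4": "N3"}


def clean_epochs(epochs: List[Epoch]) -> List[Epoch]:
    """Drop UNKNOWN/MOVEMENT epochs and relabel N4 as N3."""
    cleaned = []
    for label, signal in epochs:
        if label in DISCARD:
            continue
        cleaned.append((MERGE.get(label, label), signal))
    return cleaned


def trim_wake(epochs: List[Epoch], margin: int = 60) -> List[Epoch]:
    """Keep at most `margin` wake epochs before the first and after the
    last non-wake epoch (60 epochs of 30 s = 30 minutes, an assumed reading)."""
    sleep_idx = [i for i, (label, _) in enumerate(epochs) if label != "Wake"]
    if not sleep_idx:
        return []  # no sleep in the recording
    start = max(0, sleep_idx[0] - margin)
    end = min(len(epochs), sleep_idx[-1] + margin + 1)
    return epochs[start:end]
```

Applying `clean_epochs` before `trim_wake` ensures that discarded epochs are not counted as part of the sleep period.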
B. Performance Metrics
We evaluated our model’s performance using per-class precision (PR), per-class recall (RE), per-class F1-score (F1), and overall accuracy (acc). Overall accuracy is the ratio between the number of correct predictions and the total number of predictions. For the prediction of a given class, there are four outcomes: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The metrics are defined as:
$$\mathrm{PR} = \frac{TP}{TP + FP}, \qquad \mathrm{RE} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{PR} \cdot \mathrm{RE}}{\mathrm{PR} + \mathrm{RE}}, \qquad \mathrm{acc} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
Overall accuracy is commonly used to measure classification performance. However, for an imbalanced dataset, overall accuracy does not provide adequate information on classifiers, because it hardly reveals performance in minority groups [10]. Table I shows the distribution of the dataset. The dataset is highly imbalanced, so we introduced the additional metrics PR, RE, and F1 to correctly measure the performance of our model.
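As an illustration of why per-class metrics matter on imbalanced data, the sketch below computes PR = TP/(TP+FP), RE = TP/(TP+FN), F1 = 2·PR·RE/(PR+RE), and overall accuracy from a confusion matrix. The function name `per_class_metrics` and the toy two-class matrix are hypothetical and are not taken from the paper's tables.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]]) -> Dict[str, object]:
    """cm[i][j] = number of epochs whose true class is i and predicted class is j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    precision, recall, f1 = [], [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        pr = tp / (tp + fp) if tp + fp else 0.0
        re = tp / (tp + fn) if tp + fn else 0.0
        precision.append(pr)
        recall.append(re)
        f1.append(2 * pr * re / (pr + re) if pr + re else 0.0)
    accuracy = correct / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy imbalanced example: 100 majority-class epochs, 10 minority-class epochs.
toy = [[90, 10],
       [5, 5]]
metrics = per_class_metrics(toy)
```

In this toy case the overall accuracy is about 0.86 while the minority-class F1 is only 0.4, which is the gap that accuracy alone would hide.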
C. Proposed Model
Raw EEG data contains time-invariant and time-variant features. Each input is a 30-second data segment sampled at 100 Hz, so the input shape is (3000,1), which is too large to feed directly into a transformer unit. Fig. 2 shows our model’s architecture. A convolutional neural network can extract time-invariant features and reduce the data dimension. We implemented four sequential convolutional layers to output features with shape (19,128). Then we used a transformer unit to learn time-variant information from the features. Its attention mechanism learns the context across all positions of the time series. The two dense layers inside the transformer unit work as an encoder. The output of the encoder is then added to the input data for additional features. Finally, to classify sleep stages, we used a dense layer with a softmax activation function to obtain the most probable category. We also tried other models, such as recurrent neural networks and auto-encoders. Experiments showed that our proposed model yielded the best performance.
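The reduction from 3000 time steps to 19 can be checked with simple shape arithmetic. The sketch below is only one configuration that reproduces the reported (19,128) shape: the strides (5, 4, 4, 2) and the use of "same" padding are assumptions, since the text does not give kernel sizes or strides. With "same" padding, as in Keras, a strided 1-D convolution outputs ceil(length / stride) time steps.

```python
from typing import Sequence, Tuple


def conv1d_same_length(length: int, stride: int) -> int:
    """Output length of a 1-D convolution with 'same' padding: ceil(length / stride)."""
    return -(-length // stride)  # integer ceiling division


def feature_shape(input_length: int, strides: Sequence[int], last_filters: int) -> Tuple[int, int]:
    """Shape (time steps, channels) after a stack of strided convolutions."""
    length = input_length
    for stride in strides:
        length = conv1d_same_length(length, stride)
    return (length, last_filters)


# Hypothetical strides; 30 s at 100 Hz gives 3000 input samples.
ASSUMED_STRIDES = (5, 4, 4, 2)
print(feature_shape(3000, ASSUMED_STRIDES, 128))  # (19, 128)
```

A sequence of 19 tokens keeps the attention matrix at 19 × 19, which is far cheaper than attending over the 3000 raw samples.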
III. Experiments
A. Data Preprocessing
Since our model was designed to be deployed on microcontrollers with limited memory and computation resources, we cannot use a complex data preprocessing method, so we implemented a simple standardization. The performance of deep-learning models is highly dependent on the statistical properties of the input data. If the input values are too small or too large, the models perform poorly. The EEG data in the Fpz-Cz channel is on the order of $10^{-5}$, which is too small for our model. Standardization transforms data to have a zero mean and a standard deviation of one. For each sample X, we standardized it to Z using
$$Z = \frac{X - M}{S}$$
where M indicates the mean of the samples and S indicates the standard deviation of the samples. After standardization, the input data was in the range of $10^{-1}$ to $10^{1}$, and the model showed the best performance.
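A minimal sketch of the standardization Z = (X − M)/S is given below. It assumes that M and S are computed over the samples passed in (for example, one 30-second segment) and that S is the population standard deviation; the text does not state either choice, and the function name `standardize` is hypothetical.

```python
import statistics
from typing import List, Sequence


def standardize(samples: Sequence[float]) -> List[float]:
    """Return (x - mean) / std for each sample; a constant signal maps to zeros."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)  # population standard deviation (assumption)
    if std == 0:
        return [0.0] * len(samples)
    return [(x - mean) / std for x in samples]


# Toy values on the order of 1e-5, as described for the Fpz-Cz channel.
z = standardize([1e-5, -2e-5, 3e-5, 0.0])
```

The operation needs only a running sum and sum of squares, so it remains practical on a microcontroller.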
B. Basic Training
We used 5-fold cross-validation to train and test our model. There are 77 subjects in total, so four folds contain the data of 16 subjects each and the remaining fold contains the data of 13 subjects. In each iteration, we selected four folds as training and validation data, and one fold as testing data. Among the four folds of data, we randomly sampled 10% for validation. The remaining 90% of the data was used for training. We used the Adam algorithm for optimization, which is a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments. It is computationally efficient and requires little memory [11]. We utilized categorical cross-entropy as the loss function and fed our model with a batch size of 64 samples.
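The subject-wise split described above can be sketched as follows. The helper names `subject_folds` and `split_train_val`, the random assignment of subjects to folds, and the fixed seeds are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def subject_folds(subject_ids: Iterable[int], fold_size: int = 16, seed: int = 0) -> List[List[int]]:
    """Partition subjects into folds of `fold_size`; the last fold takes the remainder."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # random assignment is an assumption
    return [ids[i:i + fold_size] for i in range(0, len(ids), fold_size)]


def split_train_val(items: Sequence, val_fraction: float = 0.1, seed: int = 0) -> Tuple[list, list]:
    """Randomly hold out `val_fraction` of the training epochs for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


folds = subject_folds(range(77))
print([len(f) for f in folds])  # [16, 16, 16, 16, 13]
```

Splitting by subject rather than by epoch keeps all recordings of a test subject out of the training data.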
C. Subject-Specific Training
We also performed subject-specific training in the testing stage to further adapt the model to the patterns of each subject. We randomly selected 10% of the test data and used them to fine-tune our trained model. The remaining 90% of the test data were used to test our model after subject-specific training.
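One way to organize this 10%/90% split is sketched below. It assumes the random selection is made separately within each test subject, which the text does not state explicitly; the function name `adaptation_split` and the data layout (a mapping from subject ID to a list of epochs) are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def adaptation_split(
    epochs_by_subject: Dict[str, list],
    adapt_fraction: float = 0.1,
    seed: int = 0,
) -> Dict[str, Tuple[list, list]]:
    """For each subject, return (fine-tuning epochs, evaluation epochs)."""
    rng = random.Random(seed)
    splits = {}
    for subject, epochs in epochs_by_subject.items():
        idx = list(range(len(epochs)))
        rng.shuffle(idx)
        n_adapt = round(len(idx) * adapt_fraction)
        adapt = [epochs[i] for i in sorted(idx[:n_adapt])]     # used to fine-tune
        evaluate = [epochs[i] for i in sorted(idx[n_adapt:])]  # held out for testing
        splits[subject] = (adapt, evaluate)
    return splits
```

The two parts are disjoint, so the reported test metrics never include epochs that were used for fine-tuning.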
IV. Results
Table II and Table III show the confusion table and the performance of our model, respectively. Since the dataset is imbalanced, the per-class performance in N1 and REM was expected to be poorer than in the majority classes. The model performed well in the Wake and N2 classes. State-of-the-art models from the literature show a similar pattern: in Table VI, all models have a per-class F1 score of less than 0.5 in N1 and less than 0.8 in REM. To improve the performance of our model, we performed subject-specific training to further adapt to subject-wise patterns. Table IV and Table V show that performance improved after subject-specific training. First, the overall accuracy increased from 0.775 to 0.795. Second, per-class precision, recall, and F1 score increased for all classes.
Table VI compares the F1 scores of different models from the literature. Since our proposed model is lightweight and small, it was not expected to yield the best performance in every class. However, its performance was still comparable to that of the state-of-the-art. For the wake stage, our model yielded the highest F1-score of 0.91. For the other classes, it was close to the highest. No model outperformed ours in all classes, although some models performed better on certain classes.
V. Discussion
Our lightweight model was designed for memory-constrained microcontroller units. It had around 300,000 parameters, and its size was 2 MB. We used an Arduino Nano 33 BLE board, which integrates a wireless microcontroller (nRF52840, Nordic Semiconductor) with a 64 MHz 32-bit ARM CPU, 1 MB of flash memory, and 256 KB of SRAM [16]. The deployed model is stored in the flash memory, so it had to be smaller than 1 MB. We plan to add an external memory unit for the model in the future. To test the feasibility of our design at the current stage, we implemented a smaller version of the model and deployed it on the Arduino board. The reduced model had an accuracy of 68%, but the microcontroller was fully functional for the target classification. To reduce the size of our model, we also quantized the model in the deployment stage.
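The flash constraint can be illustrated with a back-of-the-envelope estimate of weight storage. This sketch counts raw weights only and ignores file-format overhead and activation memory, so it is a lower bound on the stored model size; the bit widths (32-bit floating point and 8-bit integers) are assumptions, as the text does not specify the quantization scheme.

```python
FLASH_BYTES = 1 * 1024 * 1024  # 1 MB of flash on the nRF52840, as stated in the text
N_PARAMS = 300_000             # approximate parameter count of the full model


def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights alone, rounded up to whole bytes."""
    return -(-n_params * bits_per_param // 8)


def fits_in_flash(n_params: int, bits_per_param: int, flash_bytes: int = FLASH_BYTES) -> bool:
    """True if the raw weights alone fit in the given flash budget."""
    return weight_bytes(n_params, bits_per_param) <= flash_bytes


print(weight_bytes(N_PARAMS, 32), fits_in_flash(N_PARAMS, 32))  # 1200000 False
print(weight_bytes(N_PARAMS, 8), fits_in_flash(N_PARAMS, 8))    # 300000 True
```

Under these assumptions, 32-bit weights alone already exceed the 1 MB flash, which is consistent with the need for a reduced and quantized model on this board.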
We randomly selected 10% of the test dataset to perform subject-specific training. If a dataset is large, randomly selected data should follow the distribution of the dataset. In our experiments, the distribution of the selected data varied because the dataset was not large enough, and the improvement in performance was highly dependent on that distribution. Therefore, in future work we will enforce the distribution of the subject-specific training data. This means that the number of selected samples in each category is calculated and fixed based on the test dataset. This method should maximize the benefits of subject-specific training.
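Fixing the number of selected samples per category amounts to stratified sampling. The sketch below shows one possible form; the function name `stratified_sample`, the rounding rule, and the choice to keep at least one epoch per class are assumptions rather than the authors' planned procedure.

```python
import random
from collections import defaultdict
from typing import List, Sequence


def stratified_sample(labels: Sequence[str], fraction: float = 0.1, seed: int = 0) -> List[int]:
    """Return indices so that each class contributes about `fraction` of its epochs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    chosen = []
    for indices in by_class.values():
        k = max(1, round(len(indices) * fraction))  # at least one epoch per class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


labels = ["Wake"] * 500 + ["N1"] * 50 + ["N2"] * 300 + ["N3"] * 100 + ["REM"] * 50
selected = stratified_sample(labels)
```

With this toy label list, the selection always contains 50 Wake, 5 N1, 30 N2, 10 N3, and 5 REM epochs regardless of the seed, which removes the run-to-run variation in class distribution.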
As mentioned previously, the dataset has an imbalanced class distribution, which significantly affects models’ performance, especially on the minority categories. There are different methods to mitigate the effect of imbalanced data, such as oversampling and undersampling. Oversampling duplicates samples in the minority classes, while undersampling removes samples from the majority classes. These are two common ways of training on imbalanced data. Another method we will try in the future is to add weights to the loss function. The loss function can be weighted differently for different classes, so that errors on minority classes contribute more to the loss.
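A class-weighted loss can be sketched as follows. The weighting rule used here, total / (number of classes × class count), is a common inverse-frequency heuristic and is an assumption; the text does not say which weights would be used. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(probs: Dict[str, float], true_label: str, weights: Dict[str, float]) -> float:
    """Categorical cross-entropy for one epoch, scaled by the weight of its true class."""
    return -weights[true_label] * math.log(probs[true_label])


weights = inverse_frequency_weights(["N2"] * 80 + ["N1"] * 20)
loss = weighted_cross_entropy({"N1": 0.5, "N2": 0.5}, "N1", weights)
```

In this toy case the minority class N1 receives a weight of 2.5 and the majority class N2 a weight of 0.625, so a misclassified N1 epoch costs four times as much as an equally misclassified N2 epoch.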
VI. Conclusion
In this work, we developed a lightweight DL model for real-time sleep stage classification using single-channel EEG data. The DL model features a CNN and transformers. We validated the model using the Sleep-EDF dataset and tested it on a low-power microcontroller. The model achieved performance comparable to that of state-of-the-art works. In the future, we plan to develop a fully integrated wireless EEG sensor using the model. The developed device can enable a wide range of sleep research.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
Footnotes
zongyan.yao{at}mail.utoronto.ca
xilinliu{at}ece.utoronto.ca