An automated heart rate-based algorithm for sleep stage classification: validation using conventional PSG and innovative wearable ECG device
============================================================================================================================================

* Nicolò Pini
* Ju Lynn Ong
* Gizem Yilmaz
* Nicholas I. Y. N. Chee
* Zhao Siting
* Animesh Awasthi
* Siddharth Biju
* Kishan Kishan
* Amiya Patanaik
* William P. Fifer
* Maristella Lucchini

## Abstract

**Study Objectives** Validate an HR-based deep-learning algorithm for sleep staging named Neurobit-HRV (Neurobit Inc., New York, USA).

**Methods** The algorithm can perform classification at 2 levels (Wake; Sleep), 3 levels (Wake; NREM; REM) or 4 levels (Wake; Light; Deep; REM) in 30-second epochs. The algorithm was validated using an open-source dataset of PSG recordings (Physionet CinC dataset, n=994 participants) and a proprietary dataset (Z3Pulse, n=52 participants) composed of HR recordings collected with a chest-worn, wireless sensor. A simultaneous PSG was collected using SOMNOtouch. We evaluated the performance of the models in both datasets using Accuracy (A), Cohen’s kappa (K), Sensitivity (SE), and Specificity (SP).

**Results** CinC - The highest value of accuracy was achieved by the 2-levels model (0.8797), while the 3-levels model obtained the best value of K (0.6025). The 4-levels model obtained the lowest SE (0.3812) and the highest SP (0.9744) for the classification of Deep sleep segments. AHI and biological sex did not affect sleep scoring, while a significant decrease of performance by age was reported across the models. Z3Pulse - The highest value of accuracy was achieved by the 2-levels model (0.8812), whereas the 3-levels model obtained the best value of K (0.611). For classification of the sleep states, the lowest SE (0.6163) and the highest SP (0.9606) were obtained for the classification of Deep sleep segments.

**Conclusions** Results demonstrate the feasibility of accurate HR-based sleep staging. The combination of the illustrated sleep staging algorithm with an inexpensive HR device provides a cost-effective and non-invasive solution easily deployable in the home.

Key words

* Sleep Staging and Scoring
* Automatic Sleep Staging
* Wearables
* Heart Rate Variability
* Artificial Intelligence

## INTRODUCTION

Sleep is a biological necessity that leads humans to spend roughly one third of their life asleep. Emerging research is highlighting the critical role that sleep plays in overall health. Poor sleep health has been associated with several negative health outcomes, including increased risk of cardiovascular disease,1,2 obesity,3 depression,4 and neurodegenerative disorders.5 Despite these well-known health risks, we are in the middle of a sleep crisis, with more than 70 million Americans experiencing sleep-related problems6 and with less than 20% of patients estimated to be properly diagnosed and treated for sleep disorders.7

Polysomnography (PSG) is presently considered the gold standard method to assess sleep and is performed in sleep laboratories and clinical settings. It simultaneously records electroencephalogram (EEG), electromyogram, electroocular activity, heart rate, respiration, leg movements, nasal pressure, oxygen desaturation and body position, and uses these signals to characterize sleep architecture and sleep disorders. Limitations of PSG include the cost of the devices, the laborious setup procedures, the discomfort for the participant, and the need for an expert to perform the time-consuming process of scoring the data. These limitations highlight the need for noninvasive, inexpensive, and reliable sleep monitors that could be deployed in the home to monitor sleep health, as well as for automated solutions to perform reliable sleep stage scoring.

While accelerometer-based actigraphy devices have been used in the field for decades, their estimation is limited to sleep/wake states, which is insufficient for a full evaluation of sleep architecture. Recent technological advancements have led to a proliferation of consumer-oriented tools to monitor sleep.8 Initially these tools relied mainly on activity, similarly to actigraphy, but lately they have started to incorporate additional physiological signals, such as EEG, heart rate (HR), breathing and pulse oximetry.9–11

Increasing research indicates that HR, defined as the number of heart beats in 1 minute, and its variability might be an accurate and accessible physiological proxy for sleep measurement.12–15 A wealth of literature has shown profound differences in HR across sleep stages,16–18 primarily comparing REM versus NREM sleep. In adults, REM sleep has been reported to be associated with an increment in low frequency power of the HR variability signal, and a decrement in high frequency power.

Heart rate can be derived from electrocardiography (ECG), which records the heart’s electrical activity, or photoplethysmography (PPG), which monitors changes in blood volume in a portion of the peripheral microvasculature, estimating pulse waveforms. While PPG is more commonly employed in wearable devices19,20 because it can easily estimate HR from various peripheral body locations, several concerns have been raised regarding its reliability, given its strong dependence on ambient light and on skin condition and color, and its susceptibility to motion artifacts.21,22 Thus, ECG-derived HR is still considered the gold standard, but it is rarely used in wearable technologies. In addition, algorithms that utilize activity and physiological data to characterize sleep are very often proprietary and have not been fully validated against gold standard PSG measures.
In most cases, raw data access is either unavailable or limited,23,24 and algorithm or firmware upgrades are also not always transparent to users, limiting longer-term comparisons either within or between individuals. In summary, the ‘black-box’ nature of these devices and the paired firmware/software packages limits their application and deployment for clinical and research purposes.

To address this gap, Neurobit-HRV (Neurobit Inc., New York, NY, USA), a sleep staging algorithm that utilizes HR parameters derived from a single-channel ECG, was designed. In this manuscript, we report the performance of this algorithm on a public dataset (Physionet CinC25), and on a secondary dataset with data simultaneously acquired with PSG and an ECG patch. Results demonstrate the feasibility of accurate sleep staging based on HR derived from ECG, obtained from PSG or wearable devices. The combination of the proposed sleep staging algorithm with inexpensive and commonly used ECG patches (∼$100) provides a cost-effective and non-invasive solution easily deployable in the home for large-scale sleep characterization in the field.

## MATERIAL AND METHODS

### The sleep staging algorithm

The HR-based automated sleep staging software called Neurobit-HRV was developed by Neurobit Inc., New York, NY, USA. The software is open source and publicly available through the cloud as a software development kit (SDK, [https://bitbucket.org/NeurobitTech/pyndf](https://bitbucket.org/NeurobitTech/pyndf)) as well as an application programming interface (API) along with reference code and documentation.26 The deep-learning algorithm was implemented in Python 3.6 using the Keras ([https://keras.io/](https://keras.io/)) and TensorFlow 2 ([https://www.tensorflow.org/](https://www.tensorflow.org/)) libraries.
The algorithm was trained on private datasets comprising ECG extracted from 12,404 PSG records primarily collected from academic sleep centers in South-East Asia (35%), North America (30%) and Europe (30%). Fifty-nine percent of the assessed participants had a suspected sleep disorder, whereas the remaining 41% were healthy subjects. The age of the aggregated dataset was 42.3 ± 16.8 years (mean ± std). The observations were first randomly split into training (80%) and testing (20%) data. The split was stratified by data source. The model that obtained the smallest test error was selected as the optimal one. Figure 1 displays the trends of model accuracy for the training and testing sets as a function of progressive iterations.

The software can operate on either a single-channel ECG or directly on R-peak locations. To achieve optimal performance, the temporal precision of the R-peak location must be ±4 ms or better. Neurobit-HRV incorporates an extensive wavelet-based ECG signal quality assessment toolbox for a real-time QRS detector, followed by a spurious R-peak detector for signal processing and quality assurance. Then, the processed RR interval tachogram is fed to the above-described automated sleep staging algorithm.

Sleep state classification can be performed with different levels of granularity, namely 2-level (Wake; Sleep), 3-level (Wake; NREM: N1+N2+N3; and REM) or 4-level (Wake; Light: N1+N2; Deep: N3; and REM), in 30-second epochs compliant with the American Academy of Sleep Medicine (AASM) standard. The 2- and 3-level classifications are obtained by appropriately collapsing the 4-level classification. Additional information describing the mode of operation of the algorithm is reported in the Supplement (see Supplement Figure 1).
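
The collapsing of the 4-level output into the 3- and 2-level outputs can be illustrated with a minimal sketch. The stage names, the dictionary-based mapping, and the `collapse` helper below are assumptions made for illustration; the text does not specify how Neurobit-HRV encodes its output or implements this step.

```python
# Illustrative only: label strings and helper names are hypothetical,
# not taken from the Neurobit-HRV SDK.

# 4-level -> 3-level: Light (N1+N2) and Deep (N3) merge into NREM.
FOUR_TO_THREE = {"Wake": "Wake", "Light": "NREM", "Deep": "NREM", "REM": "REM"}

# 4-level -> 2-level: every non-wake stage merges into Sleep.
FOUR_TO_TWO = {"Wake": "Wake", "Light": "Sleep", "Deep": "Sleep", "REM": "Sleep"}


def collapse(stages, mapping):
    """Map each 30-second epoch label to its coarser category."""
    return [mapping[stage] for stage in stages]


# A toy 4-level hypnogram, one label per 30-second epoch.
hypnogram_4 = ["Wake", "Light", "Deep", "Light", "REM", "Wake"]
hypnogram_3 = collapse(hypnogram_4, FOUR_TO_THREE)
hypnogram_2 = collapse(hypnogram_4, FOUR_TO_TWO)

print(hypnogram_3)
print(hypnogram_2)
```

Because Wake maps to itself in both mappings, any epoch scored as Wake at 4 levels remains Wake at 3 and 2 levels, which is why wake-related metrics do not change with the level of granularity.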

![Figure 1](https://www.medrxiv.org/content/medrxiv/early/2022/01/13/2021.12.21.21268117/F1.medium.gif)

[Figure 1](http://medrxiv.org/content/early/2022/01/13/2021.12.21.21268117/F1)

Figure 1 Trend of accuracy in the training (blue solid line) and testing (orange solid line) datasets as a function of training epochs for the 4-level model. Following an initialization period (for epoch value <∼50), the difference in performance between the training and testing sets stabilizes at ∼5%.

### Participants & Study Protocol

#### You Snooze You Win - The PhysioNet Computing in Cardiology (CinC) Challenge 2018 dataset

The CinC dataset is available at [https://physionet.org/content/challenge-2018/1.0.0/](https://physionet.org/content/challenge-2018/1.0.0/) and comprises 994 participants (18-90 years old) who were monitored at the Massachusetts General Hospital (MGH) sleep laboratory for the diagnosis of sleep disorders. Each participant has a complete set of 30-s segments with corresponding sleep stages and respiratory events annotated by clinical staff at the MGH according to the AASM manual for the scoring of sleep.

For this study, the initial preprocessing step consisted of assessing the quality of the ECG recordings extracted from the ensemble of recorded signals (EEG, EOG, EMG, respiration, and SaO2). The ECG channel was fed to the Neurobit-HRV software to extract the RR interval tachogram along with a signal quality index, which is an estimate of the signal-to-noise ratio (SNR). ECG traces with either an SNR <5 dB or with >10% unusable data were rejected and not included in the analytic steps described in the following sections.

#### Z3Pulse dataset

The Z3Pulse dataset consisted of a set of 52 healthy adults in the age range of 23 to 69 years with no known sleep or psychiatric disorders.
All participants provided written informed consent in compliance with a protocol approved by the National University of Singapore’s Institutional Review Board (NUS-IRB) and were paid for their involvement. The study protocol was carried out over three nights, when trained research assistants visited participants’ homes. On each night, participants wore a wearable ECG patch paired with an Android phone while PSG was recorded simultaneously.

PSG was collected using the SOMNOtouch device (SOMNOmedics GmbH, Randersacker, Germany). Electroencephalography was recorded from two channels (C3 and C4 in the international 10–20 system of electrode placement) referenced to the contralateral mastoids. The common ground and reference electrodes were placed at Fpz and Cz, respectively. Electrooculography (EOG; right and left outer canthi) and submental electromyography (EMG) were also recorded. EEG signals were sampled at 256 Hz, and impedance was kept below 5 kΩ for EEG and below 10 kΩ for EOG and EMG channels. Sleep scoring was manually performed based on the AASM manual.27 Self-reported bed and wake-up times were also collected. Each participant had at least one usable recording; 40 (∼77%) had 2 separate overnight recordings and 19 (∼37%) had 3.

The wearable ECG patch utilized in this study was the Z3Pulse device (Neurobit Inc., New York, USA), a chest-worn, wireless device capable of recording HR, body position, activity and temperature. Z3Pulse was developed using the Movesense open wearable tech platform (Movesense, Finland, [www.movesense.com](http://www.movesense.com)). The device is first connected to a reusable belt worn around the chest or to a one-time Ag-Cl patch (see Supplement Figure 2). In the latter case, the patch is directly applied on the skin below the sternum. The Movesense device consists of a single-channel ECG sensor, a nine-axis inertial measurement unit (IMU) and a temperature sensor. The device captures ECG at 128 Hz and IMU at 13 Hz.
The Z3Pulse device transmits the data in real time to a mobile app over Bluetooth Low Energy (BLE). Once recording is complete, the collected data is uploaded to the cloud and analyzed using Neurobit-HRV.

Aligning the data across PSG and Z3Pulse was essential to allow accurate comparison on an epoch-by-epoch basis. The timestamps of both systems were first synchronized using an internet time server as reference. Bedtimes for the PSG were estimated from the sleep diary. For Z3Pulse, position data were used to derive times on and off the bed. The time recorded by whichever device, PSG or Z3Pulse, had the earlier lights-off time was used as the reference data point. The shorter recording was extended to match the longer one by assuming “awake time” for missing epochs. An example of the scenario described is summarized in Figure S3 in the Supplement.

### Data Analysis

We first evaluated the classification performance of the 2-, 3-, and 4-levels models in both datasets, considering each scored epoch as an independent observation. The goodness-of-fit metrics computed were the following: Accuracy (A), Cohen’s kappa (K), Sensitivity (SE), Specificity (SP), Positive Predictive Value (PPV), and Negative Predictive Value (NPV). Then, the estimates obtained at the epoch level were averaged for each participant to derive participant-level values of the same goodness-of-fit metrics.

The wealth of data within the CinC dataset allowed for performance evaluation of the algorithm as a function of factors known to have an impact on HR: age, AHI score, and gender. Participants’ ages and AHI scores were independently stratified using a 3-level categorization: age ≤ 40 years-old, 40 < age ≤ 60 years-old, and age > 60 years-old; and AHI ≤ 5 (None/Minimal), 5 < AHI ≤ 15 (Mild), and AHI > 15 (Severe). Multiple independent linear regression models were used to estimate the association of age, AHI score, and gender with A, K, SE, SP, PPV, and NPV.
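
As a worked illustration of the epoch-level metrics listed above, the following sketch computes A, K, SE, SP, PPV, and NPV from paired reference and predicted labels using their standard definitions. It is not the authors' analysis code; the function names and the toy labels are assumptions, and SE, SP, PPV, and NPV are computed one-vs-rest for a chosen stage.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of epochs where the predicted stage equals the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


def one_vs_rest(y_true, y_pred, positive):
    """SE, SP, PPV, NPV for one stage against all others.

    Assumes each denominator is non-zero (toy data below satisfies this).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


# Toy 2-level example: six 30-second epochs (hypothetical data).
reference = ["Wake", "Sleep", "Sleep", "Sleep", "Wake", "Sleep"]
predicted = ["Wake", "Sleep", "Wake", "Sleep", "Wake", "Sleep"]

A = accuracy(reference, predicted)
K = cohen_kappa(reference, predicted)
wake = one_vs_rest(reference, predicted, positive="Wake")
print(A, K, wake)
```

In this toy case five of six epochs agree, while the chance agreement implied by the label frequencies is 0.5, so K is noticeably lower than A. This is the reason a model can rank first on accuracy without ranking first on kappa.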

The limited sample size of the Z3Pulse dataset did not allow for analyses stratified by age, AHI score, and gender.

## RESULTS

### Populations

For the CinC dataset, a total of 6 participants were excluded from the analyses as the associated ECG traces had either an SNR <5 dB or >10% unusable data; thus the final dataset included 988 subjects. There were 664 (∼67%) males and 324 (∼33%) females; 137 (∼14%), 285 (∼29%) and 566 (∼57%) participants in the None/Minimal, Mild, and Severe AHI groups, respectively; and 157 (∼16%), 456 (∼46%) and 375 (∼38%) participants in the age ≤ 40 years-old, 40 < age ≤ 60 years-old, and age > 60 years-old groups, respectively. For the Z3Pulse dataset, there were 27 (∼52%) males and 25 (∼48%) females; and 38 (∼73%), 7 (∼13%) and 7 (∼13%) participants in the age ≤ 40 years-old, 40 < age ≤ 60 years-old, and age > 60 years-old groups, respectively.

### CinC dataset validation

Results for the 2-, 3-, and 4-levels predictions for the CinC dataset are displayed in Figure 2 via confusion matrices and reported in **Table SA**. The highest value of accuracy was achieved by the 2-levels classification (wake vs sleep), whereas the 3-levels model obtained the best value of K. By design of the algorithm, all the derived metrics (SE, SP, PPV, and NPV) relative to the classification of wake segments are identical across the 2-, 3-, and 4-levels models. On the other hand, the classification of sleep segments is achieved by progressively collapsing different sleep states when passing from the 4-levels to the 2-levels model. As a consequence, values of PPV and NPV were substantially equivalent across models and sleep states. The granular level of classification achieved by the 4-levels model obtained the lowest value of SE (0.3812) and the highest value of SP (0.9744) for the classification of Deep sleep segments.
The comparison of the goodness-of-fit metrics at the segment level (**Table SA**) versus the ones obtained at the participant level (**Table A**) revealed a close concordance between the two approaches. In addition, the narrow confidence intervals (ranging from ±1% to ±5% of the mean) of the averaged metrics in **Table A** indicate the absence of both underfitting and overfitting.

[Table A](http://medrxiv.org/content/early/2022/01/13/2021.12.21.21268117/T1)

Table A Mean and [95% Confidence Intervals] of Accuracy (A), Cohen’s kappa (K), Sensitivity (SE), Specificity (SP), Positive Predictive Value (PPV), and Negative Predictive Value (NPV), obtained by first collapsing metrics within and subsequently across participants in the CinC dataset. Results are reported separately for the 2-, 3- and 4-levels models.

![Figure 2](https://www.medrxiv.org/content/medrxiv/early/2022/01/13/2021.12.21.21268117/F2.medium.gif)

[Figure 2](http://medrxiv.org/content/early/2022/01/13/2021.12.21.21268117/F2)

Figure 2 Confusion matrices for the 2-, 3-, and 4-levels predictions for the CinC dataset

**Table B** summarizes the performance of the 2-levels model as a function of participants’ characteristics. AHI and gender groups did not affect the outcome metrics. In contrast, a significant decrease in A and K was reported for older age groups when compared to the reference age group (age ≤ 40 years-old). Specifically, the average decrease in A for the 40 < age ≤ 60 and age > 60 years-old groups was 3.02% and 7.45%, respectively. Similar results were found for the other outcome metrics such that the mean decrease in performance in the group of participants age>60 years old is approximately double that of the 40