The Harvard-Emory ECG Database ============================== * Zuzana Koscova * Qiao Li * Chad Robichaux * Valdery Moura Junior * Manohar Ghanta * Aditya Gupta * Jonathan Rosand * Aaron Aguirre * Shenda Hong * David E. Albert * Joel Xue * Aarya Parekh * Reza Sameni * Matthew A. Reyna * M. Brandon Westover * Gari D. Cliford ## ABSTRACT The Harvard-Emory ECG Database (HEEDB) is a large collection of 12-lead electrocardiogram (ECG) recordings, developed through a collaboration between Harvard University and Emory University. The database consists of 10,608,417 unique ECG recordings from 1,818,247 patients from Massachusetts General Hospital (MGH) and 1,452,964 recordings from 552,481 patients from Emory University Hospital (EUH) collected in clinical settings as part of routine patient care since the early 1990s. Continuously updated with new data, the recordings consist of 10-second, 12-lead ECGs sampled at 250 and 500 Hz, and stored in WFDB format. Future updates will include demographic information such as age, sex, race, and ethnicity. Additionally, 12SL annotations–encompassing ECG diagnoses, morphology, and rhythms–are available for 10,471,531 recordings from MGH and 1,268,277 recordings from EUH. Shortly, ICD-9/10 codes, CPT codes, and medication history will also be made publicly available. ## Background & Summary The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead (I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6) electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators. HEEDB comprises clinical ECGs obtained during routine care over several decades (from the 1990s to present), alongside associated covariates, including patient demographics, medications, diagnoses, and clinical outcomes which will be added in the future. These data were collected at Massachusetts General Hospital (MGH, Boston, MA, US) and Emory University Hospital (EUH, Atlanta, GA, US) and represent one of the largest ECG datasets available. Large-scale datasets with extended follow-up, such as this one, are uncommon, making this resource particularly valuable for conducting population-wide analyses of cardiac rhythms, their disturbances, and associated outcomes. The HEEDB supports numerous applications, ranging from identifying early markers of cardiovascular disease to validating detection and prediction algorithms for clinical use. A major focus is the identification of ECG-derived biomarkers that may predict arrhythmias, sudden cardiac death, and other critical conditions. The inclusion of temporal data, with multiple ECG recordings for individual subjects over time, provides an opportunity to study the progression of cardiac abnormalities and their clinical implications. The scale and longitudinal nature of this database offer significant potential to advance both clinical research and computational cardiology, enabling new insights into cardiovascular disease. ## Methods The HEEDB contains 12-lead ECG recordings of 10 seconds duration sampled at 250 and 500 Hz. At the time of initial publication, the database includes 10,608,417 unique ECG recordings (10,771,552 ECGs total) from 1,818,247 patients from MGH. New ECGs will be added to the database periodically, including 1,452,964 recordings from 552,481 subjects recorded at EUH. All ECGs were collected in the course of routine clinical care using the MUSE ECG system. The MUSE system facilitates the acquisition, storage, and review of ECG data, providing consistent recordings across the dataset. The raw ECG waveforms, along with associated metadata, were stored in the XML format, which allows for both structured data representation and interoperability with other systems. ECG waveforms were further de-identified following the Safe Harbor method and converted to the WFDB (Waveform Database)1,2 and MATLAB (V4) compatible format. ## Data Records Each ECG recording includes one waveform data file (.mat) and one header file (.hea). The waveform data file can be read by WFDB library functions, applications, and Toolbox, or be loaded to MATLAB directly. Most waveform files are synchronized 12-lead ECG signals recorded at 250 Hz (59% from MGH) and 500 Hz (41% from MGH, 100% from EUH) for a duration of 10 s with analog-to-digital converter (ADC) gain 1,000, stored in mV with byte offset of 24. The header file specifies the names of the associated waveform files and their attributes such as sampling rate and units, including the channel names of the signal. It contains line-oriented and field-oriented ASCII text and can be read by the WFDB library or generic text editors. The data file contains 12-lead information encoded in 16 bits. The ECG database exhibits several common abnormalities that occur in real ECG recordings, such as missing leads, incomplete time lengths, and various types of noise affecting the ECG recordings. ### Metadata The dataset includes general metadata for each subject, including a de-identified unique subject identifier, sex, age, race, ethnicity, and a shifted acquisition date. All the dates are shifted between *±*90 days from the date of acquisition and the shift is different for every subject/recording. Demographic characteristics for both data sources are summarized in Table 1. To ensure confidentiality, individuals aged over 89 years at the time of data acquisition are uniformly recorded as 90 years old. It is important to note that age data is missing for 15,466 recordings from MGH and 37,481 recordings from EUH due to discrepancies in documenting either the acquisition date or birth date. The age distribution for both data sources is displayed in the histogram in Figure 1. View this table: [Table 1.](http://medrxiv.org/content/early/2024/10/01/2024.09.27.24314503/T1) Table 1. Available demographic information for the ECG records in the HEEDB dataset, including age, gender, race, ethnicity, and education level. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/10/01/2024.09.27.24314503/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2024/10/01/2024.09.27.24314503/F1) Figure 1. Histogram depicting the age distribution within the HEEDB dataset for Massachusetts General Hospital (MGH) recordings (top figure) and Emory University Hospital (EUH) recordings (bottom figure). ### ECG Annotations The annotations in the dataset are generated using the Marquette 12SL ECG Analysis Program (GE Healthcare) version 43. These annotations consist of textual reports that describe the morphology, rhythm, and diagnostic information of the ECGs. Each annotation includes statement numbers that correspond to human-readable diagnoses. The frequency of statements with the highest occurrence is summarized in Table 2 for MGH and Table 3 for EUH. View this table: [Table 2.](http://medrxiv.org/content/early/2024/10/01/2024.09.27.24314503/T2) Table 2. This table lists the 50 statements with the highest prevalence in the HEEDB for MGH data, including their corresponding 12SL codes, human-readable descriptions, and counts. The statements encompass various elements, including diagnoses, rhythms, conjunctions, ECG lead names, and brackets and other symbols, all of which contribute to the formation of a human-readable statement generated by the 12SL software. View this table: [Table 3.](http://medrxiv.org/content/early/2024/10/01/2024.09.27.24314503/T3) Table 3. This table lists the 50 statements with the highest prevalence in the HEEDB for Emory data, including their corresponding 12SL codes, human-readable descriptions and counts. The statement numbers include various components such as ECG diagnoses, morphology, rhythms, lead names, text symbols (e.g., brackets, delimiters), measurements (e.g., “QTcB ≥ 480 ms”), conjunctions (e.g., “with,” “or,” “and”), and descriptive words like “present,” “accelerated,” and “blocked.” Together, these elements comprise the 12SL language model to form human-readable diagnosis reports. For example, the following statement consists of the codes: 19, 177, 231, 178, 244, 390, 411, 700, 831, 1100, 1150, and 169, which map to the diagnosis: Sinus rhythm, with, premature ventricular complexes, or, fusion complexes, Indeterminate axis, Pulmonary disease pattern, Septal infarct, , age undetermined, ST & T wave abnormality, consider anterior ischemia, Abnormal ECG. Annotations are available for 10,471,531 ECG recordings, leaving 136,886 unlabeled recordings from MGH. For the EUH dataset, 1,268,277 recordings are annotated, with 184,687 remaining unlabeled. ### Technical Validation The ECG records were obtained in a clinical environment, during routine care guided by specialized personnel using the MUSE ECG system facilitating acquisition, storage, and review of ECG data, enabling consistent recordings across the dataset. The HEEDB data was checked for completeness and includes the non-preprocessed raw ECG waveforms in WFDB format coupled with demographics and 12SL statements. Data was de-identified following the Safe Harbor method. ECG recordings contain various types of standard ECG artifacts such as missing leads, missing parts of signals, or noise. ### Usage Notes The HEEDB dataset can be downloaded from the Brain Data Science Platform (BDSP) by following the instructions provided at Harvard-Emory ECG database website4 ([https://bdsp.io/content/heedb/1.0/](https://bdsp.io/content/heedb/1.0/)). The data are made available under an Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, with the intention not to restrict research or commercial use. We consider training any algorithm on the data to be protected by fair use (e.g. 17 USC 107). Moreover, given that the training (whether it is based on machine learning or not) is a collection of statistical information relating to a work, and assuming the algorithm doesn’t memorize large portions of the database, it does not involve “copying” the work within the meaning of 17 USC 106 (except for a *de minimis* period which is similar to the caching performed by a web browser, and therefore subject to a fair use defense). ## Data Availability All data produced are available online with credentialed access at [https://bdsp.io/content/heedb/1.0/](https://bdsp.io/content/heedb/1.0/) and shortly will be available publicly. [https://bdsp.io/content/heedb/1.0/](https://bdsp.io/content/heedb/1.0/) ## Code availability The downloaded waveform data files can be accessed using WFDB library functions, applications, and toolboxes in Python5 and MATLAB6 or can be directly loaded into MATLAB. No custom code was developed for this study. ## Author contributions statement Z.K. provided technical validation of the database, statistical analysis of the database, processing of the annotations and drafted the manuscript; Q.L. contributed to statistical analysis of the data; C.R. contributed to data collection; J.M.V contributed to data collection, developed infrastructure for data storage; M.G. contributed to data collection; A.G. conducted preliminary analyses of the data; J.R. conducted preliminary analyses of the data; S.H. conducted preliminary analyses of the data; A.A. contributed to data collection; D.E.A. provided insights into the clinical relevance of the data; J.X secured data annotations; A.P.secured data annotations; R.S. contributed to the design and implementation of the data collection processes as well as overseeing the technical validation of the database and contributed to the writing; M.A.R. contributed to the design and implementation of the data collection processes, contributed to the writing and led; M.B.W. secured the funding, designed the database, contributed to the design and implementation of the data collection processes, and contributed to the writing; G.D.C. secured the funding, designed the database, provided intellectual input on processing and labeling, contributed to the manuscript and led the project. All authors read the manuscript. ## Competing interests M.B.W. is a co-founder, scientific advisor, and consultant to Beacon Biosignals and has a personal equity interest in the company. D.E.A. is the founder and Chief Medical Officer of AliveCor Inc. and holds personal equity in the company. J.X. is a research fellow at Alivecor Inc. and also holds personal equity in the company. G.D.C. holds significant stock in AliveCor Inc. The other authors report no competing interests. ## Acknowledgements Publication of HEEDB is supported by a grant (R01HL161253) from the National Heart Lung and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children’s Hospital, and Beth Israel Deaconess Medical Center. Publication of the HEEDB is also supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362, and an unrestricted funds from Emory University. Q.L., M.R. and G.D.C. are supported in part by unrestricted funding from AliveCor Inc. ## Footnotes * † co-senior authors * Received September 27, 2024. * Revision received September 27, 2024. * Accepted October 1, 2024. * © 2024, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/) ## References 1. 1.Moody, G., Pollard, T. & Moody, B. WFDB Software Package (version 10.7.0) (2022). 2. 2.Goldberger, A. et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101, e215–e220 (2000). [Online]. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTQ6ImNpcmN1bGF0aW9uYWhhIjtzOjU6InJlc2lkIjtzOjExOiIxMDEvMjMvZTIxNSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDI0LzEwLzAxLzIwMjQuMDkuMjcuMjQzMTQ1MDMuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 3. 3. GE Healthcare. MarquetteTM 12SLTM ECG Analysis Program: Physician’s Guide (2019). 4. 4.Moura Junior, V. et al. Harvard-Emory ECG Database (version 1.0), doi:10.60508/g072-7n95 (2023). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.60508/g072-7n95&link_type=DOI) 5. 5.Xie, C. et al. Waveform Database Software Package (WFDB) for Python (version 4.1.0), doi:10.13026/9njx-6322 (2023). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.13026/9njx-6322&link_type=DOI) 6. 6.Silva, I. & Moody, G. An Open-source Toolbox for Analysing and Processing PhysioNet Databases in MATLAB and Octave. J. Open Res. Softw. 2, e27, doi:10.5334/jors.bi (2014). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.5334/jors.bi&link_type=DOI)