ABSTRACT
The Harvard-Emory ECG Database (HEEDB) is a large collection of 12-lead electrocardiogram (ECG) recordings, developed through a collaboration between Harvard University and Emory University. The database consists of 10,608,417 unique ECG recordings from 1,818,247 patients from Massachusetts General Hospital (MGH) and 1,452,964 recordings from 552,481 patients from Emory University Hospital (EUH) collected in clinical settings as part of routine patient care since the early 1990s. Continuously updated with new data, the recordings consist of 10-second, 12-lead ECGs sampled at 250 and 500 Hz, and stored in WFDB format. Future updates will include demographic information such as age, sex, race, and ethnicity. Additionally, 12SL annotations–encompassing ECG diagnoses, morphology, and rhythms–are available for 10,471,531 recordings from MGH and 1,268,277 recordings from EUH. Shortly, ICD-9/10 codes, CPT codes, and medication history will also be made publicly available.
Background & Summary
The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead (I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6) electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators. HEEDB comprises clinical ECGs obtained during routine care over several decades (from the 1990s to present), alongside associated covariates, including patient demographics, medications, diagnoses, and clinical outcomes which will be added in the future. These data were collected at Massachusetts General Hospital (MGH, Boston, MA, US) and Emory University Hospital (EUH, Atlanta, GA, US) and represent one of the largest ECG datasets available. Large-scale datasets with extended follow-up, such as this one, are uncommon, making this resource particularly valuable for conducting population-wide analyses of cardiac rhythms, their disturbances, and associated outcomes.
The HEEDB supports numerous applications, ranging from identifying early markers of cardiovascular disease to validating detection and prediction algorithms for clinical use. A major focus is the identification of ECG-derived biomarkers that may predict arrhythmias, sudden cardiac death, and other critical conditions. The inclusion of temporal data, with multiple ECG recordings for individual subjects over time, provides an opportunity to study the progression of cardiac abnormalities and their clinical implications. The scale and longitudinal nature of this database offer significant potential to advance both clinical research and computational cardiology, enabling new insights into cardiovascular disease.
Methods
The HEEDB contains 12-lead ECG recordings of 10 seconds duration sampled at 250 and 500 Hz. At the time of initial publication, the database includes 10,608,417 unique ECG recordings (10,771,552 ECGs total) from 1,818,247 patients from MGH. New ECGs will be added to the database periodically, including 1,452,964 recordings from 552,481 subjects recorded at EUH. All ECGs were collected in the course of routine clinical care using the MUSE ECG system. The MUSE system facilitates the acquisition, storage, and review of ECG data, providing consistent recordings across the dataset. The raw ECG waveforms, along with associated metadata, were stored in the XML format, which allows for both structured data representation and interoperability with other systems. ECG waveforms were further de-identified following the Safe Harbor method and converted to the WFDB (Waveform Database)1,2 and MATLAB (V4) compatible format.
Data Records
Each ECG recording includes one waveform data file (.mat) and one header file (.hea). The waveform data file can be read by WFDB library functions, applications, and Toolbox, or be loaded to MATLAB directly. Most waveform files are synchronized 12-lead ECG signals recorded at 250 Hz (59% from MGH) and 500 Hz (41% from MGH, 100% from EUH) for a duration of 10 s with analog-to-digital converter (ADC) gain 1,000, stored in mV with byte offset of 24. The header file specifies the names of the associated waveform files and their attributes such as sampling rate and units, including the channel names of the signal. It contains line-oriented and field-oriented ASCII text and can be read by the WFDB library or generic text editors. The data file contains 12-lead information encoded in 16 bits. The ECG database exhibits several common abnormalities that occur in real ECG recordings, such as missing leads, incomplete time lengths, and various types of noise affecting the ECG recordings.
Metadata
The dataset includes general metadata for each subject, including a de-identified unique subject identifier, sex, age, race, ethnicity, and a shifted acquisition date. All the dates are shifted between ±90 days from the date of acquisition and the shift is different for every subject/recording. Demographic characteristics for both data sources are summarized in Table 1. To ensure confidentiality, individuals aged over 89 years at the time of data acquisition are uniformly recorded as 90 years old. It is important to note that age data is missing for 15,466 recordings from MGH and 37,481 recordings from EUH due to discrepancies in documenting either the acquisition date or birth date. The age distribution for both data sources is displayed in the histogram in Figure 1.
ECG Annotations
The annotations in the dataset are generated using the Marquette 12SL ECG Analysis Program (GE Healthcare) version 43. These annotations consist of textual reports that describe the morphology, rhythm, and diagnostic information of the ECGs. Each annotation includes statement numbers that correspond to human-readable diagnoses. The frequency of statements with the highest occurrence is summarized in Table 2 for MGH and Table 3 for EUH.
The statement numbers include various components such as ECG diagnoses, morphology, rhythms, lead names, text symbols (e.g., brackets, delimiters), measurements (e.g., “QTcB ≥ 480 ms”), conjunctions (e.g., “with,” “or,” “and”), and descriptive words like “present,” “accelerated,” and “blocked.” Together, these elements comprise the 12SL language model to form human-readable diagnosis reports. For example, the following statement consists of the codes: 19, 177, 231, 178, 244, 390, 411, 700, 831, 1100, 1150, and 169, which map to the diagnosis: Sinus rhythm, with, premature ventricular complexes, or, fusion complexes, Indeterminate axis, Pulmonary disease pattern, Septal infarct, , age undetermined, ST & T wave abnormality, consider anterior ischemia, Abnormal ECG. Annotations are available for 10,471,531 ECG recordings, leaving 136,886 unlabeled recordings from MGH. For the EUH dataset, 1,268,277 recordings are annotated, with 184,687 remaining unlabeled.
Technical Validation
The ECG records were obtained in a clinical environment, during routine care guided by specialized personnel using the MUSE ECG system facilitating acquisition, storage, and review of ECG data, enabling consistent recordings across the dataset. The HEEDB data was checked for completeness and includes the non-preprocessed raw ECG waveforms in WFDB format coupled with demographics and 12SL statements. Data was de-identified following the Safe Harbor method. ECG recordings contain various types of standard ECG artifacts such as missing leads, missing parts of signals, or noise.
Usage Notes
The HEEDB dataset can be downloaded from the Brain Data Science Platform (BDSP) by following the instructions provided at Harvard-Emory ECG database website4 (https://bdsp.io/content/heedb/1.0/). The data are made available under an Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, with the intention not to restrict research or commercial use. We consider training any algorithm on the data to be protected by fair use (e.g. 17 USC 107). Moreover, given that the training (whether it is based on machine learning or not) is a collection of statistical information relating to a work, and assuming the algorithm doesn’t memorize large portions of the database, it does not involve “copying” the work within the meaning of 17 USC 106 (except for a de minimis period which is similar to the caching performed by a web browser, and therefore subject to a fair use defense).
Data Availability
All data produced are available online with credentialed access at https://bdsp.io/content/heedb/1.0/ and shortly will be available publicly.
Code availability
The downloaded waveform data files can be accessed using WFDB library functions, applications, and toolboxes in Python5 and MATLAB6 or can be directly loaded into MATLAB. No custom code was developed for this study.
Author contributions statement
Z.K. provided technical validation of the database, statistical analysis of the database, processing of the annotations and drafted the manuscript; Q.L. contributed to statistical analysis of the data; C.R. contributed to data collection; J.M.V contributed to data collection, developed infrastructure for data storage; M.G. contributed to data collection; A.G. conducted preliminary analyses of the data; J.R. conducted preliminary analyses of the data; S.H. conducted preliminary analyses of the data; A.A. contributed to data collection; D.E.A. provided insights into the clinical relevance of the data; J.X secured data annotations; A.P.secured data annotations; R.S. contributed to the design and implementation of the data collection processes as well as overseeing the technical validation of the database and contributed to the writing; M.A.R. contributed to the design and implementation of the data collection processes, contributed to the writing and led; M.B.W. secured the funding, designed the database, contributed to the design and implementation of the data collection processes, and contributed to the writing; G.D.C. secured the funding, designed the database, provided intellectual input on processing and labeling, contributed to the manuscript and led the project. All authors read the manuscript.
Competing interests
M.B.W. is a co-founder, scientific advisor, and consultant to Beacon Biosignals and has a personal equity interest in the company. D.E.A. is the founder and Chief Medical Officer of AliveCor Inc. and holds personal equity in the company. J.X. is a research fellow at Alivecor Inc. and also holds personal equity in the company. G.D.C. holds significant stock in AliveCor Inc. The other authors report no competing interests.
Acknowledgements
Publication of HEEDB is supported by a grant (R01HL161253) from the National Heart Lung and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children’s Hospital, and Beth Israel Deaconess Medical Center. Publication of the HEEDB is also supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362, and an unrestricted funds from Emory University. Q.L., M.R. and G.D.C. are supported in part by unrestricted funding from AliveCor Inc.
Footnotes
↵† co-senior authors