ABSTRACT
Background Echocardiograms provide essential insights into cardiac health, yet their complex, multidimensional data poses significant challenges for analysis and interpretation. Existing deep learning models for echocardiogram analysis often rely heavily on supervised training, which limits their generalizability and robustness across different datasets and clinical environments.
Objective To develop and evaluate Echo-Vision-FM (Echocardiogram video Vision Foundation Model), a self-supervised video learning framework designed to pre-train a video encoder on large-scale, unlabeled echocardiogram data. Echo-Vision-FM aims to produce robust and transferable video representations, improving downstream performance across diverse echocardiogram datasets and clinical conditions.
Methods The proposed framework employs self-supervised video learning through masked auto-encoding: the video is divided into non-overlapping patches, a subset of patches is masked, and the full video is reconstructed from a compressed encoding of the remaining visible patches. An asymmetric encoder-decoder architecture underpins this approach. To further enhance the learned representations, we introduce STF-Net, a Spatial-Temporal Fusion Net, which integrates spatial and temporal correlations from the video representations. We pre-trained Echo-Vision-FM on the MIMIC-IV-ECHO dataset and fine-tuned it on multiple downstream datasets for specific clinical tasks, including morphological value estimation and the diagnosis of heart function and diseases.
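The patch-masking step named in Methods can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The clip size (16 x 112 x 112), patch size (2 x 16 x 16), mask ratio (0.9), and uniform random sampling are all assumptions chosen for illustration, and the encoder, decoder, and STF-Net are omitted.

```python
import numpy as np

def patchify(video, pt=2, ph=16, pw=16):
    """Split a (T, H, W) clip into non-overlapping (pt, ph, pw) patches.

    Returns an array of shape (num_patches, pt * ph * pw).
    Patch sizes are illustrative assumptions, not values from the paper.
    """
    T, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    # Bring the three patch-grid axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, pt * ph * pw)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True marking patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

rng = np.random.default_rng(0)
video = rng.random((16, 112, 112)).astype(np.float32)  # stand-in for one echo clip

patches = patchify(video)                     # 8 * 7 * 7 = 392 patches of 512 values
mask = random_mask(len(patches), 0.9, rng)    # assumed mask ratio

visible = patches[~mask]   # what an encoder would see
targets = patches[mask]    # what a decoder would be trained to reconstruct
print(patches.shape, visible.shape, targets.shape)
```

Because only the visible patches would reach the encoder, a high mask ratio keeps the encoder's input short, which is the usual motivation for pairing masking with an asymmetric encoder-decoder.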
Results Echo-Vision-FM achieved superior performance in classifying left ventricular ejection fraction (LVEF), with an accuracy of 0.905, an F1 score of 0.941, and an AUC of 0.931. In regression tasks, Echo-Vision-FM outperformed state-of-the-art models, achieving a mean absolute error (MAE) of 3.87% and an R² of 0.825 for LVEF prediction. The model also demonstrated significant improvements in estimating end-systolic and end-diastolic volumes, with R² values of 0.782 and 0.742, respectively. Incorporating STF-Net further enhanced performance across all tasks.
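For readers unfamiliar with the regression metrics reported above, the sketch below computes MAE and the coefficient of determination on a handful of made-up LVEF values. The numbers are purely illustrative and are not data from this study, and the abstract does not state which definition of R² was used, so the standard 1 - SS_res / SS_tot form is assumed here.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical LVEF values in percentage points (not from the paper).
lvef_true = [55.0, 60.0, 35.0, 70.0]
lvef_pred = [57.0, 58.0, 40.0, 67.0]

mae = mean_absolute_error(lvef_true, lvef_pred)
r2 = r2_score(lvef_true, lvef_pred)
print(f"MAE = {mae:.2f} percentage points, R^2 = {r2:.3f}")
```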
Conclusion Our results demonstrate that large-scale self-supervised video learning on echocardiogram data enables the extraction of transferable and clinically relevant features, surpassing existing methods. The Echo-Vision-FM framework, particularly with the inclusion of STF-Net, significantly improves the extraction of spatiotemporal features, resulting in enhanced predictive accuracy for a range of cardiac parameters. Echo-Vision-FM offers a scalable and effective solution for echocardiogram analysis, with promising applications in clinical diagnostics and research.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study is partially supported by the American Heart Association Grant (24GWTGTG1268589).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
MIMIC-IV-ECHO data are available at Medical Information Mart for Intensive Care: https://physionet.org/content/mimic-iv-echo/0.1/. The EchoNet-Dynamic dataset is available at https://aimi.stanford.edu/datasets/echonet-dynamic-cardiac-ultrasound.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes