RT Journal Article SR Electronic T1 A prospective real-time transfer learning approach to estimate Influenza hospitalizations with limited data JF medRxiv FD Cold Spring Harbor Laboratory Press SP 2024.07.17.24310565 DO 10.1101/2024.07.17.24310565 A1 Meyer, Austin G A1 Lu, Fred A1 Clemente, Leonardo A1 Santillana, Mauricio YR 2024 UL http://medrxiv.org/content/early/2024/07/18/2024.07.17.24310565.abstract AB Accurate, real-time forecasts of influenza hospitalizations would facilitate prospective resource allocation and public health preparedness. State-of-the-art machine learning methods are a promising approach to produce such forecasts, but they require extensive historical data to be properly trained. Unfortunately, historically observed data of influenza hospitalizations, for the 50 states in the United States, are only available since the beginning of 2020, as their collection was motivated and enabled by the COVID-19 pandemic. In addition, the data are far from perfect as they were under-reported for several months before health systems began consistently and reliably submitting their data. To address these issues, we propose a transfer learning approach to perform data augmentation. We extend the currently available two-season dataset for state-level influenza hospitalizations in the US by an additional ten seasons. Our method leverages influenza-like illness (ILI) surveillance data to infer historical estimates of influenza hospitalizations. This cross-domain data augmentation enables the implementation of advanced machine learning techniques, multi-horizon training, and an ensemble of models for forecasting using the ILI training data set, improving hospitalization forecasts. We evaluated the performance of our machine learning approaches by prospectively producing forecasts for future weeks and submitting them in real time to the Centers for Disease Control and Prevention FluSight challenges during two seasons: 2022-2023 and 2023-2024. Our methodology demonstrated good accuracy and reliability, achieving a fourth place finish (among 20 participating teams) in the 2022-23 and a second place finish (among 20 participating teams) in the 2023-24 CDC FluSight challenges. Our findings highlight the utility of data augmentation and knowledge transfer in the application of machine learning models to public health surveillance where only limited historical data is available.Author summary Influenza is a major public health concern in the United States, causing thousands of hospitalizations annually. Accurate and timely forecasts of hospitalization rates are essential for effective public health preparedness. However, limited historical data makes forecasting with state-of-the-art models challenging. To address this issue, we developed a cross-domain data augmentation method that allowed us to train advanced machine learning models using symptom-based (syndromic) surveillance data. We then created a set of models, focusing on gradient-boosted machines, and combined them into an ensemble framework. This approach successfully overcame data limitations, outperforming the majority of teams participating in the CDC FluSight project for 2022-23 and 2023-24. Additionally, our forecasts demonstrated superior accuracy to the CDC’s composite model in the 2022-23 season and matched its performance in 2023-24. Our study demonstrates a robust and data-efficient strategy for training machine learning models for use in public health forecasting.Competing Interest StatementThe authors have declared no competing interest.Funding StatementThis study was funded by the Council for State and Territorial Epidemiologists, the United States Centers for Disease Control and Prevention, and the National Institute of General Medical Sciences of the National Institutes of Health. MS has been funded (in part) by contracts 200-2016-91779 and cooperative agreement CDC-RFA-FT-23-0069 with the Centers for Disease Control and Prevention (CDC). The findings, conclusions, and views expressed are those of the author(s) and do not necessarily represent the official position of the CDC. MS was also partially supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM130668.Author DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesThe details of the IRB/oversight body that provided approval or exemption for the research described are given below:The study used (or will use) ONLY openly available human data that were originally located at: https://gis.cdc.gov/grasp/fluview/main.html https://gis.cdc.gov/GRASP/Fluview/FluHospRates.html https://github.com/cdcepi/FluSight-forecast-hubI confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).YesI have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.YesThe data are publicly available at: https://gis.cdc.gov/grasp/fluview/main.html https://gis.cdc.gov/GRASP/Fluview/FluHospRates.html https://github.com/cdcepi/FluSight-forecast-hub