Influenza forecasting for the French regions by using EHR, web and climatic data sources with an ensemble approach ARGONet ========================================================================================================================== * Canelle Poirier * Yulin Hswen * Guillaume Bouzillé * Marc Cuggia * Audrey Lavenu * John S Brownstein * Thomas Brewer * Mauricio Santillana ## Abstract Effective and timely disease surveillance systems have the potential to help public health officials design interventions to mitigate the effects of disease outbreaks. Currently, healthcare-based disease monitoring systems in France offer influenza activity information that lags real-time by 1 to 3 weeks. This temporal data gap introduces uncertainty that prevents public health officials from having a timely perspective on the population-level disease activity. Here, we present a machine-learning modeling approach that produces real-time estimates and short-term forecasts of influenza activity for the 12 continental regions of France by leveraging multiple disparate data sources that include, Google search activity, real-time and local weather information, flu-related Twitter micro-blogs, electronic health records data, and historical disease activity synchronicities across regions. Our results show that all data sources contribute to improving influenza surveillance and that machine-learning ensembles that combine all data sources lead to accurate and timely predictions. **Author summary** The role of public health is to protect the health of populations by providing the right intervention to the right population at the right time. In France and all around the world, Influenza is a major public health problem. Traditional surveillance systems produce estimates of influenza-like illness (ILI) incidence rates, but with one-to three-week delay. Accurate real-time monitoring systems of influenza outbreaks could be useful for public health decisions. By combining different data sources and different statistical models, we propose an accurate and timely forecasting platform to track the flu in France at a spatial resolution that, to our knowledge, has not been explored before. ## Introduction Influenza is a major public health problem causing up to 5 million severe cases and 500,000 deaths per year worldwide [1–3]. In France alone, the epidemic of 2018-2019 caused 9,500 deaths. During epidemic peaks, large increases of visits to general practitioners and to emergency departments are observed and often lead to disruptions to healthcare delivery and thus increase the risk of undesirable outcomes in patients with influenza infections. To reduce the impact of influenza outbreaks in the population and to better design timely public health interventions, surveillance systems that produce accurate real-time and short-term forecasts of disease activity may prove to be instrumental. In France, an important influenza monitoring system was implemented by the Sentinelles network in 1984 [4, 5]. This system centralizes information obtained from a group of volunteer (1314 in 2018) general practitioners and (116 in 2018) pediatricians that each week report the proportion of patients with Influenza-Like-Illness (ILI, any acute respiratory infection with fever ≥ 38 °C, cough and onset within the last 10 days) seeking medical attention. Data collection, processing, aggregation and distribution processes of this information, at the national and regional levels, introduce up to three weeks delays in the availability of flu activity information. This temporal data gap prevents public health officials from having the most up-to-date epidemiological information, and thus leads to the design of interventions that do not take into consideration recent changes in disease activity [2, 6]. For example, if estimates were available in real-time, information campaigns and vaccination prevention could be deployed earlier and and could lead to grater impact. Additionally, healthcare facilities could be better prepared to respond to unexpected increases in the flux of high-risk patient during time periods of increased disease activity. With the motivation to alleviate this time delay, mathematical modeling and machine learning approaches have been proposed to produce disease estimates in real time and ahead of healthcare-based surveillance systems in multiple nations around the world. Most of these studies have been designed and tested in developed nations, such as the United States and France, where information on disease outbreaks has been collected historically for decades [2]. Numerous research studies have been conducted on the use of traditional statistical methods, like temporal series or compartmental methods, as well as the inclusion of disparate data sources such as meteorological or demographic data to track flu activity, as discussed in Nsoesie et al. 2014 and Yang and Shamman 2014 [7, 8]. And in recent years, multiple more studies have emerged exploring the use of Internet-based data sources that capture aspects of human behavior and environmental factors to track the spread of diseases. With over 3.2 billion web users, data flows from the internet are huge and of all types. Some studies have used data from Google [2, 3, 9–11], Twitter [12–14] or Wikipedia [15–18] to monitor flu specifically. One of the first and most prominent studies on the use of internet data for monitoring influenza epidemics is Google Flu Trends (GFT) [19]. This web-based platform, created in 2009 and designed and deployed by Google, used the volume of selected Google search terms to estimate ILI activity in real time. GFT led to multiple prediction errors during the 2009 H1N1 Flu Pandemic (due to changes in people’s search behaviour as a result of the exceptional nature of the pandemic) and later produced large overestimations during the 2012-2013 US flu season (due to the announcement of a pandemic that finally did not appear). These events led to eventual discontinuation of this disease monitoring platform [20]. Since then, multiple research teams have proposed improved methodologies that are capable of extracting information more efficiently from flu-related Google searches and produce improved flu estimates. Among these methods, the work of Shihao Yang et al. [2] explored a penalized regression methodology that combines historical flu activity with Google search activity dynamically, called ARGO, to better predict flu. Additional data sources have been explored to monitor flu activity such as clinicians’ searches, electronic health records (EHR), crowd-sourced flu monitoring apps [21–23]. Among these, electronic health records have been shown to track flu accurately and timely in the US and France [6, 24–26]. Specifically, in United States, Santillana et al. [6] showed that a model leveraging EHR data and a machine learning algorithms was capable to monitor flu activity in multiple spatial resolutions that included the regional level. In France, Poirier et al. [24] similarly showed multiple statistical models that incorporate EHR and Internet-search data, can yield accurate ILI incidence rates in real time at the national level. In early 2019, Fred S. Lu et al. [27] extended the ARGO methodology to accurately track flu activity in multiple states of the United States. In their approach, they included Google search data, EHRs and historical flu trends. They developed also a spatial network approach, called Net, to capture the synchronicity observed historically in flu activity between each states. Finally, by dynamically combining estimates from ARGO and Net, they showed that an ensemble approach, named ARGONet, led to improved results. ### Our contribution In this study, we propose a forecasting platform that combines multiple data sources and statistical models to track flu activity in France at a spatial resolution that, to our knowledge, has not been explored before. Our forecasting platform produces accurate region-specific real-time and short-term flu activity forecasts for the 12 continental French regions, by leveraging national-level flu-related Google searches, electronic health records data, Twitter data, and region-specific climate data. Additionally, historical synchronicities across regions are captured with a Network model. A machine learning ensemble approach is proposed to improve predictions by dynamically combining estimates from these two distinct approaches. Near real-time estimates as well as one- and two-week ahead forecasts are presented. ## Materials and methods ### Data sources #### Sentinelles network data We obtained weekly ILI incidence rates (per 100000 inhabitants) for the French regions (12) from the French Sentinelles network (websenti.u707.jussieu.fr/sentiweb). We retrieved these data in August 2018 from 05 January 2004 to 13 March 2017. We considered these data as the gold standard and as our task for our prediction models. #### Google Data We obtained the frequency per week of the 100 most correlated internet queries (if correlation ≥ 0.60) by French users from Google Correlate ([https://www.google.com/trends/correlate](https://www.google.com/trends/correlate). Because our prediction period spans 05 January 2015 to 20 February 2017, we utilized the ILI signal for each French region, from January 2004 to December 2014 to obtain the most highly correlated search terms using the tool Google Correlate. In this way, we obtained different search terms for each individual region. The signals obtained correspond to queries performed by French users at the national level. We retrieved Google Correlate data in August 2018 for the period going from 05 January 2004 to 13 March 2017. #### Electronic Health Record Data We retrieved EHR data from the clinical data warehouse (CDW) of Rennes University Hospital (France),This CDW, called eHOP, integrates structured (laboratory test results, prescriptions, ICD-10 diagnoses) and unstructured (discharge letter, pathology reports, operative reports) patients’ data. It includes data from 1.2 million inpatients and outpatients and 45 million documents that correspond to 510 million structured elements. eHOP consists of a powerful search engine system that can identify patients with specific criteria by querying unstructured data with keywords, or structured data with querying codes based on terminologies. The first approach to obtain eHOP data connected with ILI was to perform different manual queries to retrieve patients who had at least one document in their EHR that matched the following search criteria: (1) Queries directly connected with flu or ILI with the keywords “flu” or “ILI”; (2) Queries connected with flu symptoms with the keywords “fever”, “pyrexia”, “body aches” or “muscular pain”; (3) Queries connected with flu drugs with the keyword “Tamiflu”; (4) Queries with the ICD-10 terminology; (5) Queries connected with flu tests, positive or negative results. In total, we performed 34 manual queries. For each query, the eHOP search engine returned all documents containing the chosen keywords (often, several documents for one patient and one stay). For query aggregation, we kept the oldest document for one patient and one stay and then calculated, for each week, the number of stays with at least one document mentioning the keyword contained in the query. From the CDW eHOP, we built a database containing the time series constructed from the structured data. In all, we have 1 335 347 time series. As Google Correlate, the Pearson correlation between each signal of each region and the time series from the database was calculated. In this way, for each region, the second approach was to retrieve the 100 most correlated signals to ILI signal. Because our test period is from 05 January 2015 to 20 February 2017, we calculated the correlation between January 2004 and December 2014. As a result, for each region, we obtained 134 variables from the CDW eHOP where there are at least 34 variables common to all regions (manual queries). We retrieved retrospective data in August 2018 for the period going from 03 January 2005 to 13 March 2017. This study was approved by the local Ethics Committee of Rennes Academic Hospital (approval number 16.69). #### Weather Data We obtained region-specific weather data from the French climatological website Info Climat ([https://www.infoclimat.fr](https://www.infoclimat.fr)). It has been shown in several studies that humidity is correlated with the spread of influenza. [28]. In the absence of humidity data on the Climat website, we retrieved precipitation and temperatures data. This choice was made knowing that both variables, [29, 30] and temperature and precipitation can be used as a proxy for humidity since they are directly related by the Clausius–Clapeyron relation. [31] We obtained temperatures and precipitations per day for the largest city of each region, and calculated the weekly mean for both temperature and precipitation. We retrieved climatic data in August 2018 for the time period going from 07 January 2008 to 13 March 2017. #### Twitter Data Geotag tweets were extracted as the national scale for France from Boston Children’s Hospital Geotweet dataset with the following keywords pertaining to influenza (“grippe”, “grippé”, “syndrome grippal”, “fièvre”, “toux”, “congestion”, “malade”, “faiblesse”, “courbatures”, “tamiflu”, “la crève”). From there, we aggregated tweets to get weekly counts. In this way, we obtained 11 variables from Twitter. We retrieved Twitter data in December 2018 for the period going from 30 December 2013 to 13 March 2017. ### Statistical models #### The ARGO model The ARGO model is a regularized regression dynamically calibrated weekly using the LASSO method [32] to combine multiple external data sources with historical flu information. We performed the LASSO regression with the R package caret and the associated function fit with the method glmnet [33, 34]. We optimized the shrinkage parameter lambda via a 10-fold cross-validation. To test the stationarity and whiteness of residuals, we used Dickey Fuller’s and Box-Pierce’s tests available from the R packages tseries and stats [35]. The formulation of our model is : * Real time estimates: ![Formula][1]</img> * One-week ahead forecast: ![Formula][2]</img> * Two-week ahead forecast: ![Formula][3]</img> where *y**it* corresponding to the flu incidence rate at time *t* for the region ![Graphic][4]</img> corresponding to the historical flu incidence rates for the region *i*, ![Graphic][5]</img> corresponding to Google data for the region *i*, ![Graphic][6]</img> corresponding to hospital data for the region *i*, ![Graphic][7]</img> corresponding to Twitter data, ![Graphic][8]</img> corresponding to climatic data for the region *i, ϵ**t* corresponding to residuals. We applied this model for each region. The model was dynamically recalibrated every week by incorporating all data available. In this way, the size of our training dataset increases every week. We obtained estimates from January 2011 to March 2017. #### The Net model The Net model is a LASSO model dynamically calibrated weekly and using the relationship between the regions to know how synchronicity could improve forecasts. Indeed, Figure S1 (Heatmap of pairwise correlations between all regions) shows that the flu incidence rates of the different areas are correlated. For each region, we used historical data of all regions and estimates obtained with ARGO model for all regions expected the region to be predicted. The formulation of our model is : * Real time estimates: ![Formula][9]</img> * One-week ahead forecast: ![Formula][10]</img> * Two-week ahead forecast: ![Formula][11]</img> where *y**it* corresponding to the flu incidence rate at time *t* for the region *i*, ![Graphic][12]</img> corresponding to two weeks of historical flu incidence rates for all regions, ![Graphic][13]</img> corresponding to ARGO predictions for all regions excepted the region *i* to be predicted and *ϵ**t* corresponding to residuals. We applied this model for each region. We used a two years’ training dataset. We obtained estimates from January 2013 to March 2017. #### The ARGONet model The ARGONet model is an ensemble approach combining the predictive power of ARGO and Net models. For this model we tested three methods : * The first is, ARGONet’s estimate is the ARGO estimate if ARGO model gives the lowest mean error in the previous K estimates compared to Net model. Otherwise, ARGONet’s estimate is the Net estimate. The value of K can be 1, 2, 3 or 4. * The second is, ARGONet’s estimate is the mean between ARGO’s estimate and Net’s estimate. * The third is, ARGONet’s estimate is the result of a linear regression between ARGO’s estimate and Net’s estimate. We trained the linear regression model on a period of two years. #### The Baseline Autoregressive model To assess the importance of external data sources, we built an autoregressive model of order 52 (AR(52)). We used the LASSO regression with the previous 52 weeks of ILI incidence rates to predict the current week and the two weeks after. * Real time estimates: ![Formula][14]</img> * One-week ahead forecast: ![Formula][15]</img> * Two-week ahead forecast: ![Formula][16]</img> where *y**it* corresponding to the flu incidence rate at time *t* for the region *i*, ![Graphic][17]</img> corresponding to the previous 52 weeks, *ϵ**t* corresponding to residuals. We applied this model for each region. We used a six years’ training dataset. The model was dynamically recalibrated every week. ### Evaluation Our test period consists on 115 weeks starting from January 2015 to March 2017. #### Metrics To assess the performance of the models, we compared estimates to the official incidence rates from the Sentinelles network by calculating two metrics : the mean squared error (MSE) and the Pearson correlation coefficient (PCC). * ![Graphic][18]</img> * ![Graphic][19]</img> where *ŷ**i* is the predicted value for the week *i*, ![Graphic][20]</img> is the mean of predicted values, *y**i* the real value for the week *i*, ![Graphic][21]</img> is the mean of real values. We also estimated the relative efficiency of ARGONet model compared to the autoregressive model with 95% confidence interval (CI) by using a Bootstrap method. A relative efficiency, calculated by ![Graphic][22]</img> bigger than 1, suggests increased predictive power of ARGONet compared to the autoregressive model. The CI and relative efficiency have been computed based on 100 Bootstrap samples of length 52. The 52 weeks were randomly selected from estimates from January 2015 to February 2017. #### Comparisons First, we assessed the importance of adding external data sources by comparing : * MSE and PCC of the autoregressive model and the ARGO model including historical data plus the 10 most correlated variables from hospital data and Google data. The individual contribution of hospital data and Google data has already been shown in a previous study [24]. But, we added in appendices, two comparisons: A comparison with the 10 most correlated variables from hospital data and a comparison with the 10 most correlated variables from Google data. * MSE and PCC of the autoregressive model and the ARGO model including historical data plus climatic data. * MSE and PCC of the autoregressive model and the ARGO model including historical data plus Twitter data. Second, we compared the autoregressive model, ARGO model (including all the data sources), Net model and ARGONet model. ## Results ### Evaluation of Data Sources as Predictors In order to assess the predictive value of each and all external data source, we compared ARGO models that incrementally included external data sources with a baseline autoregressive model, AR(52), model that only uses historical information as input. As shown in the next sections, we found that all external data sources improve flu estimates, specially in the one- and two-week ahead forecasts. #### EHR Data and Google Data Our first modeling experiment involved comparing ARGO models that use Google search and EHR data simultaneously with the baseline AR(52) in all French regions. A detailed analysis on the individual contribution of Google data and EHR data into predictions, separately, is provided for completeness in the supplementary materials. Our findings suggest that each of these data sources individually improves predictions in all time-horizons. This is consistent with the findings of a previous study conducted at the national-level and the French region of Brittany [24], where both Google and EHR information were found meaningful, but EHR data was shown to possess a stronger predictive power. The join contribution of both EHR and Google data on predictions is presented below. In real time (Table 1), in terms of correlation and error metrics, estimates produced using EHR data and Google data improve the accuracy for all the regions. The combination of both sources lead to correlation improvements of up to 5% and decreases in error of up to 30% for the region Bretagne. View this table: [Table 1.](http://medrxiv.org/content/early/2019/11/25/19009795/T1) Table 1. Real time estimate - MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from hospital and Google data, for the period starting from January 2015 to March 2017 For One-week ahead estimate (Table 2), estimates obtained with EHR and Google data are more accurate or comparable for 11 of the 12 regions. The combination of both sources lead to correlation improvements of up to 15% and decreases in error of up to 45% for the region Bourgogne. View this table: [Table 2.](http://medrxiv.org/content/early/2019/11/25/19009795/T2) Table 2. One-week ahead estimate - MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from hospital and Google data, for the period starting from January 2015 to March 2017 For two-week ahead predictions (Table 3), estimates obtained with EHR and Google data are more accurate for all the regions. The combination of both sources lead to correlation improvements of up to 30% and decreases in error of up to 60% for the region Centre. View this table: [Table 3.](http://medrxiv.org/content/early/2019/11/25/19009795/T3) Table 3. Two-week ahead estimate - MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from hospital and Google data, for the period starting from January 2015 to March 2017 #### Climatic Data When combining climatic data with historical activity via ARGO was shown to consistently improve prediction results across all regions (Table 4). However, this improvement is lower than the one observed with EHR and Google data. Indeed, climatic data lead to correlation improvements of 2% and decreases in error of 7% for the region Pays de la Loire. View this table: [Table 4.](http://medrxiv.org/content/early/2019/11/25/19009795/T4) Table 4. Real time estimate - MSE and PCC for ARGO models including only historical data (AR(52)) and only climatic data, for the period starting from January 2015 to March 2017 For one-week ahead estimate (Table 5), in term of correlation and error, results obtained with Climatic data are better or comparable for 11 of the 12 regions. Climatic data lead to correlation improvements of up to 5% and decreases in error of up to 12% for the region Bourgogne-Franche-Comté. View this table: [Table 5.](http://medrxiv.org/content/early/2019/11/25/19009795/T5) Table 5. One-week ahead estimate - MSE and PCC for ARGO models including only historical data (AR(52)) and only climatic data, for the period starting from January 2015 to March 2017 For two-week ahead estimate (Table 6), results obtained with Climatic data are better for all the regions. Climatic data lead to correlation improvements of up to 20% and decreases in error of up to 30% for the region Bourgogne-Franche-Comté. View this table: [Table 6.](http://medrxiv.org/content/early/2019/11/25/19009795/T6) Table 6. Two-week ahead estimates - MSE and PCC for ARGO models including only historical data (AR(52)) and only climatic data, for the period starting from January 2015 to March 2017 #### Twitter Data Overall, we found that national-level flu-related Twitter data improves prediction results for all regions. In real time (Table 7), we see that Twitter data improves results for 8 out of the 12 regions. Twitter data lead to correlation improvements of 2% and decreases in error of 30% for the regions Occitanie and Pays de la Loire. View this table: [Table 7.](http://medrxiv.org/content/early/2019/11/25/19009795/T7) Table 7. Real time estimate - MSE and PCC for ARGO models including only historical data (AR(52)) and only Twitter data, for the period starting from January 2015 to March 2017 For one-week ahead estimate (Table 8), estimates obtained with Twitter data are more accurate for all the regions. Twitter data lead to correlation improvements of 10% and decreases in error of 20% for the region Pays de la Loire. View this table: [Table 8.](http://medrxiv.org/content/early/2019/11/25/19009795/T8) Table 8. One-week ahead estimate - MSE and PCC for ARGO models including only historical data (AR(52)) and only Twitter data, for the period starting from January 2015 to March 2017 For two-week ahead estimate (Table 9), results obtained with Twitter data are more accurate for all the regions. Twitter data lead to correlation improvements of 15% and decreases in error of 20% for the region Bourgogne-Franche-Comté. View this table: [Table 9.](http://medrxiv.org/content/early/2019/11/25/19009795/T9) Table 9. Two-week ahead estimate - MSE and PCC for ARGO models including only historical data (AR(52)) and only Twitter data, for the period starting from January 2015 to March 2017 ### Evaluation of Statistical Models Here, we compare the predictive performance of four different modeling approaches AR(52), ARGO, Net, and ARGONet for three time horizons: real-time, one-week and two-week ahead estimates. Figure 1 displays the ranking of each model for each time horizon of prediction across regions during the out-of-sample evaluation time period (January 2015 to March 2017). If a model is ranked in the 1st position, it means that it led to the best prediction results in terms of error (MSE) and in most cases this was also the case in terms of correlation. As displayed in Figure 1, ARGONet is the most accurate model, ranking either 1st or 2nd in all regions for real-time estimates, and ranking 1st in both the one- and two-week prediction horizons. Further details about each model’s performance are shown in Figures S2 to S13.  [Fig 1.](http://medrxiv.org/content/early/2019/11/25/19009795/F1) Fig 1. Ranks obtained by each model over the 12 French regions for PCC and MSE #### Real-time estimates Figure 2 and Table 10 summarize results obtained with AR(52), ARGO, Net and ARGONet models for the period starting from January 2015 to March 2017, for the 12 regions. Over this time period, the 90% confidence interval (CI) of the best correlation is [0.915;0.971] with a median value equal to 0.950. The 90% CI of the relative error is [0.057;0.169] with a median value equal to 0.096 which implies a reduction of the error from 5% to 17% thanks to our models.  [Fig 2.](http://medrxiv.org/content/early/2019/11/25/19009795/F2) Fig 2. Real-time estimate obtained with ARGO, Net and ARGONet models from January 2015 to March 2017 View this table: [Table 10.](http://medrxiv.org/content/early/2019/11/25/19009795/T10) Table 10. PCC and MSE for real-time estimate for all french regions for the period starting from January 2015 to March 2017 Figure 3 confirms, region by region, that the best PCC and MSE are mostly obtained with ARGONet for real-time predictions. In this time-horizon, ARGO shows good performance.  [Fig 3.](http://medrxiv.org/content/early/2019/11/25/19009795/F3) Fig 3. PCC and MSE obtained for real-time estimate with ARGO, Net and ARGONet models For 5 regions the best model is ARGO with the highest PCC and the lowest MSE. For the other 7 regions, the best model is ARGONet. For the one- and two-week time horizons, Figures 7, 11, 14 and 15 confirm that ARGONet outperforms all other models.  [Fig 4.](http://medrxiv.org/content/early/2019/11/25/19009795/F4) Fig 4. Nouvelle-Aquitaine Real time estimate  [Fig 5.](http://medrxiv.org/content/early/2019/11/25/19009795/F5) Fig 5. Coefficients Nouvelle-Aquitaine Real-time estimate  [Fig 6.](http://medrxiv.org/content/early/2019/11/25/19009795/F6) Fig 6. One-week ahead estimate obtained with ARGO, Net and ARGONet models from January 2015 to March 2017  [Fig 7.](http://medrxiv.org/content/early/2019/11/25/19009795/F7) Fig 7. PCC and MSE obtained for one-week ahead estimate with ARGO, Net and ARGONet models  [Fig 8.](http://medrxiv.org/content/early/2019/11/25/19009795/F8) Fig 8. Nouvelle-Aquitaine one-week ahead estimate  [Fig 9.](http://medrxiv.org/content/early/2019/11/25/19009795/F9) Fig 9. Coefficients Nouvelle-Aquitaine one-week ahead estimate  [Fig 10.](http://medrxiv.org/content/early/2019/11/25/19009795/F10) Fig 10. Two-week ahead estimate obtained with ARGO, Net and ARGONet models from January 2015 to March 2017  [Fig 11.](http://medrxiv.org/content/early/2019/11/25/19009795/F11) Fig 11. PCC and MSE obtained for two-week ahead estimate with ARGO, Net and ARGONet models  [Fig 12.](http://medrxiv.org/content/early/2019/11/25/19009795/F12) Fig 12. Nouvelle-Aquitaine Two-week ahead estimate  [Fig 13.](http://medrxiv.org/content/early/2019/11/25/19009795/F13) Fig 13. Coefficients Nouvelle-Aquitaine Two-week ahead estimate  [Fig 14.](http://medrxiv.org/content/early/2019/11/25/19009795/F14) Fig 14. Error distribution  [Fig 15.](http://medrxiv.org/content/early/2019/11/25/19009795/F15) Fig 15. Correlation distribution To assess the statistical significance of the improved prediction power of ARGONet, we constructed a 95% confidence interval for the relative efficiency of ARGONet compared to the autoregressive model (the error of ARGONet is in the denominator). Table 11 shows that in real-time, the improvement obtained thanks to the ARGONet model compared to the autoregressive model is statistically significant for all regions. Depending on the region, ARGONet allows to reduce the error by 15% to 60%. View this table: [Table 11.](http://medrxiv.org/content/early/2019/11/25/19009795/T11) Table 11. Real-time estimate - Relative efficiency being bigger than 1 suggests increased predictive power of ARGONet compared to the autoregressive model Figure 4 and Figure 5 show a typical example of plot and heatmap obtained for estimates in real time. The heatmap allows to visualize coefficients used for ARGO model. On these plots, we can see that all models have estimates close to the gold standard. However, for the autoregressive model, there is a time lag more important. On the heatmap, we can see that ARGO model uses mostly 5 variables including 2 variables from Google Data, 2 variables from Hospital Data and 1 variable from Historical Data. #### One-week ahead estimates Figure 6 and Table 12 show results for one-week ahead forecasts for the time period January 2015-March 2017. Over this time period, the 90% CI of the best correlation is [0.852;0.970] with a median value equal to 0.936. The 90% CI of the relative error is [0.060;0.294] with a median value equal to 0.127 which implies a reduction of the error from 6% to 30% thanks to our models. View this table: [Table 12.](http://medrxiv.org/content/early/2019/11/25/19009795/T12) Table 12. PCC and MSE for one-week ahead estimate for all french regions for the period starting from January 2015 to March 2017 View this table: [Table 13.](http://medrxiv.org/content/early/2019/11/25/19009795/T13) Table 13. One-week ahead estimate - Relative efficiency being bigger than 1 suggests increased predictive power of ARGONet compared to the autoregressive model For one-week ahead forecasts the best model is ARGONet. AR(52) is the model giving the worst results, but, in contrast to real-time results, ARGO and Net models are comparable. Indeed, for 4 regions, Net model allows to have better results than ARGO model. We can also observe these results on barplots (Figure 7) and on the distribution of correlation and error (Figures 14 and 15). Table 12 shows that the improvement obtained thanks to the ARGONet model compared to the autoregressive model is statistically significant for all regions for one-week ahead estimate. Depending on the region, ARGONet allows to reduce the error by 55% to 87%. Figure 8 shows one-week ahead estimate obtained for the french region Nouvelle-Aquitaine. On this plot, we can see that AR(52) and ARGO models still have a lag of one or two weeks. It is not the case for Net and ARGONet models. On this plot, estimates obtained with Net and ARGONet models are comparable. Figure 9, the heatmap shows that ARGO model uses mostly 9 variables including 2 variables from Google Data, 2 variables from Hospital Data, one variable from Climatic data and one variable from Historical data. #### Two-week ahead estimate Figure 10 and Table 14 show results for two-week ahead forecasts for the time period January 2015-March 2017. Over this time period, the 90% CI of the best correlation is [0.825;0.935] with a median value equal to 0.885. The 90% CI of the relative error is [0.129;0.347] with a median value equal to 0.229 which implies a reduction of the error from 13% to 35% thanks to our models. Like for real-time and one-week ahead forecasts, AR(52) is the model giving the worst estimates. For all french regions, the best model is ARGONet with the method using the mean between estimates obtained from ARGO and Net models. View this table: [Table 14.](http://medrxiv.org/content/early/2019/11/25/19009795/T14) Table 14. PCC and MSE for two-week ahead estimate for all french regions for the period starting from January 2015 to March 2017 On Figure 11, barplots confirm that ARGONet is the best model for all regions in term of correlation and error. On the same way, for the distribution of PCC and MSE (Figure 14 and 15), ARGONet is the best model. Table 15 shows that the improvement obtained thanks to the ARGONet model compared to the autoregressive model is statistically significant for all regions for two-week ahead estimate. Depending on the region, ARGONet allows to reduce the error by 60% to 82%. View this table: [Table 15.](http://medrxiv.org/content/early/2019/11/25/19009795/T15) Table 15. Two-week ahead estimate - Relative efficiency being bigger than 1 suggests increased predictive power of ARGONet compared to the autoregressive model Figure 12 shows two-week ahead estimates for the region Nouvelle-Aquitaine. As for one-week ahead estimate, we can see that estimates obtained with AR(52) and ARGO models are still delayed. It is not the case for Net and ARGONet models. Nevertheless, unlike ARGONet model, Net model tends to overestimate the peaks. On the heatmap Figure 13, we can see that ARGO model uses mostly 9 variables, including 6 variables from Google Data, one variable from Climatic Data, two variables from Historical data. ## Discussion We have introduced a machine learning ensemble methodology that combines multiple data sources and multiple statistical approaches to accurately track flu activity in the 12 continental regions of France. To the best of our knowledge, this is a spatial resolution for which no forecasting approaches have been explored before in France. Our methodology provides real-time estimates as well as one- and two-week ahead forecasts. The success of our approach comes from the ability to dynamically identify the appropriate method and data sources to produce the best disease activity estimates for a given location and time horizon in a prospective way (out-of-sample). Specifically, we show that the ARGO model alone (one that does not incorporate flu activity from neighboring regions) yields accurate results for real-time estimates but fails to produce optimal predictions for longer-term time-horizons. We find that the Net model (one that leverages information from neighboring regions alone) leads to reasonable flu predictions but tends to overestimate epidemic peaks. The proposed ensemble approach, named ARGONet (that combines information from both ARGO and the Net model), an extension of a model proposed in the USA [27], produces forecasts with the lowest errors and highest correlation as captured by Figure 1. This machine-learning ensemble approach displays both accuracy and robustness to estimate ILI activity up to two-weeks ahead of time at the french regional level. Prediction error reductions are observed when using ARGONet over its autoregressive counterpart (up to 50% across regions) in real-time predictions. Whereas the prediction performance of ARGONet and ARGO are comparable (Table 10) in this same task. As the time-horizon of prediction increases, the improvements of predictions are more evident, leading to up to 80% error reductions when comparing ARGONet with AR, and up to 60% error reduction of ARGONet over ARGO for one-week ahead predictions; and up to 80% (ARGONet vs AR) and 30% (ARGONet vs ARGO) respectively in two-week ahead predictions (Tables 11 and 12). Figures S2 through S13 show these results graphically. As expected, autoregressive approaches show “within-range” prediction values that consistently lag behind the observed disease activity and lead to under-predictions close to peak activity. We find that all external data sources contribute to improving local flu estimates, when compared to the baseline autoregressive model, specially for longer-term forecasts. Indeed, for the two-week ahead estimates, the combination of EHR data and Google data lead to correlation improvements of up to 25% and decreases in error of up to 60%. For Climatic data, this improvement reaches 20% for correlation and 25% for the error. For Twitter data, it reaches 20% for both correlation and error. By analyzing heatmaps (Figures 5, 9 and 13 and in the Supplementary materials) obtained for ARGO models, we can see that the contributions of different predictors (data sources) change over time and time-horizon of prediction, but all data sources appear to posses predictive power. Indeed, the most important data sources are EHR data and Google data in real-time and for longer-term forecasts. Historical data is consistently used in real-time, but less used for longer-term forecasting. Conversely, Climatic data and Twitter data are used more prominently for longer-term forecasts than for real-time estimate. The fact that we could only access EHR data from Rennes University Hospital, and thus from the Brittany region, prevented us from being able to quantify the added valued of region-specific EHR information on flu predictions in their respective region. This should be evaluated in future research efforts. On the other hand, we find interesting the fact that data from a hospital in Rennes can improve flu forecasting in other regions. Indeed, tables S4 to S6 show that forecasts that include Rennes’ EHR information, up to two weeks, are more accurate for all the regions when compared to the baseline autoregressive model. Rennes’ EHR data appears to be more relevant for some regions than others. For example, it appears to be an important predictor in the Brittany region (which contains Rennes) as expected, as well as in Normandy, which shares a border with Brittany. For Occitanie, Rennes’ EHR data improves predictions, which is in alignment with the fact that historical information shows that flu activity tends to occur synchronously (with a correlation of 0.93) as seen in Figure S1. We hypothesize that having access to region-specific EHR data, from all the french regions, will lead to prediction improvements across the board. Twitter data was collected at the National level given the sparsity of relevant flu-related Tweets at the regional level. This was the case as we only had access to the publicly available data shared by Twitter’s API that only allows users to view up to 5% of all Geo-coded Tweets (themselves a small fraction of about 5% of the total corpus of all Tweets). We also suspect that gaining access to higher volumes of Tweets at the regional level could improve our forecasts. For climatic data, we only had a access to weekly local temperature and precipitation. Future studies may explore incorporating other climatic indicators known to be more directly related to the transmission of the virus, such as humidity [28]. To conclude, we have shown that Internet-based data sources can yield accurate influenza estimates in the 12 continental regions in France. Operational implementations of these methods may prove to be useful for public health officials in the face of public health threats. Our regional-level flu estimates may contribute to better management of patients’ flow in general practitioners’ offices and in hospitals, particularly emergency departments. ## Data Availability All data cannot be shared publicly, in particular EHR data, due to the protection of patient data. ## Authors Contribution C.P. and M.S. conceived the research. C.P. wrote the manuscript with support from M.S.. G.B extracted hospital data. Y.H and T.B. extracted Twitter data. All authors discussed the results and contributed to the final manuscript. ## Conflicts of Interest None declared. ## Supplementary Material  [Fig S1.](http://medrxiv.org/content/early/2019/11/25/19009795/F16) Fig S1. Correlation between French regions on the period starting from January 2013 to March 2017  [Fig S2.](http://medrxiv.org/content/early/2019/11/25/19009795/F17) Fig S2. Evolution of Auvergne-Rhône-Alpes estimates over time for AR(52) and ARGONET models  [Fig S3.](http://medrxiv.org/content/early/2019/11/25/19009795/F18) Fig S3. Evolution of Bourgogne-Franche-Comté estimates over time for AR(52) and ARGONET models  [Fig S4.](http://medrxiv.org/content/early/2019/11/25/19009795/F19) Fig S4. Evolution of Bretagne estimates over time for AR(52) and ARGONET models  [Fig S5.](http://medrxiv.org/content/early/2019/11/25/19009795/F20) Fig S5. Evolution of Centre-Val de Loire estimates over time for AR(52) and ARGONET models  [Fig S6.](http://medrxiv.org/content/early/2019/11/25/19009795/F21) Fig S6. Evolution of Grand Est estimates over time for AR(52) and ARGONET models  [Fig S7.](http://medrxiv.org/content/early/2019/11/25/19009795/F22) Fig S7. Evolution of Hauts de France estimates over time for AR(52) and ARGONET models  [Fig S8.](http://medrxiv.org/content/early/2019/11/25/19009795/F23) Fig S8. Evolution of Ile de France estimates over time for AR(52) and ARGONET models  [Fig S9.](http://medrxiv.org/content/early/2019/11/25/19009795/F24) Fig S9. Evolution of Normandie estimates over time for AR(52) and ARGONet models  [Fig S10.](http://medrxiv.org/content/early/2019/11/25/19009795/F25) Fig S10. Evolution of Nouvelle-Aquitaine estimates over time for AR(52) and ARGONet models  [Fig S11.](http://medrxiv.org/content/early/2019/11/25/19009795/F26) Fig S11. Evolution of Occitanie estimates over time for AR(52) and ARGONet models  [Fig S12.](http://medrxiv.org/content/early/2019/11/25/19009795/F27) Fig S12. Evolution of Pays de la Loire estimates over time for AR(52) and ARGONet models  [Fig S13.](http://medrxiv.org/content/early/2019/11/25/19009795/F28) Fig S13. Evolution of Provence-Alpes-Côte d’Azur estimates over time for AR(52) and ARGONet models View this table: [Table S1.](http://medrxiv.org/content/early/2019/11/25/19009795/T16) Table S1. Real time estimate: MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from Google data, for the period starting from January 2015 to March 2017 View this table: [Table S2.](http://medrxiv.org/content/early/2019/11/25/19009795/T17) Table S2. One-week ahead forecast : MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from Google data, for the period starting from January 2015 to March 2017 View this table: [Table S3.](http://medrxiv.org/content/early/2019/11/25/19009795/T18) Table S3. Two-week ahead forecast : MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from Google data, for the period starting from January 2015 to March 2017 View this table: [Table S4.](http://medrxiv.org/content/early/2019/11/25/19009795/T19) Table S4. Real time estimate: MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from Hospital data, for the period starting from January 2015 to March 2017 View this table: [Table S5.](http://medrxiv.org/content/early/2019/11/25/19009795/T20) Table S5. One-week ahead forecast : MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from Hospital data, for the period starting from January 2015 to March 2017 View this table: [Table S6.](http://medrxiv.org/content/early/2019/11/25/19009795/T21) Table S6. Two-week ahead forecast : MSE and PCC for ARGO models including only historical data (AR(52)) and the 10 most correlated variables from Hospital data, for the period starting from January 2015 to March 2017 View this table: [Table S7.](http://medrxiv.org/content/early/2019/11/25/19009795/T22) Table S7. Real time estimate: MSE and PCC for ARGO models including only historical data (AR(52)) and only hospital and Google data (all variables), for the period starting from January 2015 to March 2017 View this table: [Table S8.](http://medrxiv.org/content/early/2019/11/25/19009795/T23) Table S8. One-week ahead forecast : MSE and PCC for ARGO models including only historical data (AR(52)) and only hospital and Google data (all variables), for the period starting from January 2015 to March 2017 View this table: [Table S9.](http://medrxiv.org/content/early/2019/11/25/19009795/T24) Table S9. Two-week ahead forecast : MSE and PCC for ARGO models including only historical data (AR(52)) and only hospital and Google data (all variables), for the period starting from January 2015 to March 2017 View this table: [Table S10.](http://medrxiv.org/content/early/2019/11/25/19009795/T25) Table S10. PCC and MSE for real time estimate for all french regions for the period starting from January 2015 to March 2017 with all the variables from Google and hospital data included in ARGO model View this table: [Table S11.](http://medrxiv.org/content/early/2019/11/25/19009795/T26) Table S11. PCC and MSE for one-week forecast for all french regions for the period starting from January 2015 to March 2017 with all the variables from Google and hospital data included in ARGO model View this table: [Table S12.](http://medrxiv.org/content/early/2019/11/25/19009795/T27) Table S12. PCC and MSE for two-week forecast for all french regions for the period starting from January 2015 to March 2017 with all the variables from Google and hospital data included in ARGO model  [Fig S14.](http://medrxiv.org/content/early/2019/11/25/19009795/F29) Fig S14. Auvergne Real-time estimate  [Fig S15.](http://medrxiv.org/content/early/2019/11/25/19009795/F30) Fig S15. Coefficients Auvergne Real-time estimate  [Fig S16.](http://medrxiv.org/content/early/2019/11/25/19009795/F31) Fig S16. Auvergne One-week estimate  [Fig S17.](http://medrxiv.org/content/early/2019/11/25/19009795/F32) Fig S17. Coefficients Auvergne One-week estimate  [Fig S18.](http://medrxiv.org/content/early/2019/11/25/19009795/F33) Fig S18. Auvergne Two-week estimate  [Fig S19.](http://medrxiv.org/content/early/2019/11/25/19009795/F34) Fig S19. Coefficients Auvergne Two-week estimate  [Fig S20.](http://medrxiv.org/content/early/2019/11/25/19009795/F35) Fig S20. Bourgogne Franche Comté Real-time estimate  [Fig S21.](http://medrxiv.org/content/early/2019/11/25/19009795/F36) Fig S21. Coefficients Bourgogne Franche Comté Real-time estimate  [Fig S22.](http://medrxiv.org/content/early/2019/11/25/19009795/F37) Fig S22. Bourgogne One-week estimate  [Fig S23.](http://medrxiv.org/content/early/2019/11/25/19009795/F38) Fig S23. Coefficients Bourgogne Franche Comté One-week estimate  [Fig S24.](http://medrxiv.org/content/early/2019/11/25/19009795/F39) Fig S24. Bourgogne Two-week estimate  [Fig S25.](http://medrxiv.org/content/early/2019/11/25/19009795/F40) Fig S25. Coefficients Bourgogne Franche Comté Two-week estimate  [Fig S26.](http://medrxiv.org/content/early/2019/11/25/19009795/F41) Fig S26. Bretagne Real-time estimate  [Fig S27.](http://medrxiv.org/content/early/2019/11/25/19009795/F42) Fig S27. Coefficients Bretagne Real-time estimate  [Fig S28.](http://medrxiv.org/content/early/2019/11/25/19009795/F43) Fig S28. Bretagne One-week estimate  [Fig S29.](http://medrxiv.org/content/early/2019/11/25/19009795/F44) Fig S29. Coefficients Bretagne One-week estimate  [Fig S30.](http://medrxiv.org/content/early/2019/11/25/19009795/F45) Fig S30. Bretagne Two-week estimate  [Fig S31.](http://medrxiv.org/content/early/2019/11/25/19009795/F46) Fig S31. Coefficients Bretagne Two-week estimate  [Fig S32.](http://medrxiv.org/content/early/2019/11/25/19009795/F47) Fig S32. Centre Val-de-Loire Real-time estimate  [Fig S33.](http://medrxiv.org/content/early/2019/11/25/19009795/F48) Fig S33. Coefficients Centre Val-de-Loire Real-time estimate  [Fig S34.](http://medrxiv.org/content/early/2019/11/25/19009795/F49) Fig S34. Centre Val-de-Loire One-week estimate  [Fig S35.](http://medrxiv.org/content/early/2019/11/25/19009795/F50) Fig S35. Coefficients Centre Val-de-Loire One-week estimate  [Fig S36.](http://medrxiv.org/content/early/2019/11/25/19009795/F51) Fig S36. Centre Val-de-Loire Two-week estimate  [Fig S37.](http://medrxiv.org/content/early/2019/11/25/19009795/F52) Fig S37. Coefficients Centre Val-de-Loire Two-week estimate  [Fig S38.](http://medrxiv.org/content/early/2019/11/25/19009795/F53) Fig S38. Grand Est Real-time estimate  [Fig S39.](http://medrxiv.org/content/early/2019/11/25/19009795/F54) Fig S39. Coefficients Grand Est Real-time estimate  [Fig S40.](http://medrxiv.org/content/early/2019/11/25/19009795/F55) Fig S40. Grand Est One-week estimate  [Fig S41.](http://medrxiv.org/content/early/2019/11/25/19009795/F56) Fig S41. Coefficients Grand Est One-week estimate  [Fig S42.](http://medrxiv.org/content/early/2019/11/25/19009795/F57) Fig S42. Grand Est Two-week estimate  [Fig S43.](http://medrxiv.org/content/early/2019/11/25/19009795/F58) Fig S43. Coefficients Grand Est Two-week estimate  [Fig S44.](http://medrxiv.org/content/early/2019/11/25/19009795/F59) Fig S44. Hauts de France Real-time estimate  [Fig S45.](http://medrxiv.org/content/early/2019/11/25/19009795/F60) Fig S45. Coefficients Hauts de France Real-time estimate  [Fig S46.](http://medrxiv.org/content/early/2019/11/25/19009795/F61) Fig S46. Hauts de France One-week estimate  [Fig S47.](http://medrxiv.org/content/early/2019/11/25/19009795/F62) Fig S47. Coefficients Hauts de France One-week estimate  [Fig S48.](http://medrxiv.org/content/early/2019/11/25/19009795/F63) Fig S48. Hauts de France Two-week estimate  [Fig S49.](http://medrxiv.org/content/early/2019/11/25/19009795/F64) Fig S49. Coefficients Hauts de France Two-week estimate  [Fig S50.](http://medrxiv.org/content/early/2019/11/25/19009795/F65) Fig S50. Ile de France Real-time estimate  [Fig S51.](http://medrxiv.org/content/early/2019/11/25/19009795/F66) Fig S51. Coefficients Ile de France Real-time estimate  [Fig S52.](http://medrxiv.org/content/early/2019/11/25/19009795/F67) Fig S52. Ile de France One-week estimate  [Fig S53.](http://medrxiv.org/content/early/2019/11/25/19009795/F68) Fig S53. Coefficients Ile de France One-week estimate  [Fig S54.](http://medrxiv.org/content/early/2019/11/25/19009795/F69) Fig S54. Ile de France Two-week estimate  [Fig S55.](http://medrxiv.org/content/early/2019/11/25/19009795/F70) Fig S55. Coefficients Ile de France Two-week estimate  [Fig S56.](http://medrxiv.org/content/early/2019/11/25/19009795/F71) Fig S56. Normandie Real-time estimate  [Fig S57.](http://medrxiv.org/content/early/2019/11/25/19009795/F72) Fig S57. Coefficients Normandie Real-time estimate  [Fig S58.](http://medrxiv.org/content/early/2019/11/25/19009795/F73) Fig S58. Normandie One-week estimate  [Fig S59.](http://medrxiv.org/content/early/2019/11/25/19009795/F74) Fig S59. Coefficients Normandie One-week estimate  [Fig S60.](http://medrxiv.org/content/early/2019/11/25/19009795/F75) Fig S60. Normandie Two-week estimate  [Fig S61.](http://medrxiv.org/content/early/2019/11/25/19009795/F76) Fig S61. Coefficients Normandie Two-week estimate  [Fig S62.](http://medrxiv.org/content/early/2019/11/25/19009795/F77) Fig S62. Occitanie Real-time estimate  [Fig S63.](http://medrxiv.org/content/early/2019/11/25/19009795/F78) Fig S63. Coefficients Occitanie Real-time estimate  [Fig S64.](http://medrxiv.org/content/early/2019/11/25/19009795/F79) Fig S64. Occitanie One-week estimate  [Fig S65.](http://medrxiv.org/content/early/2019/11/25/19009795/F80) Fig S65. Coefficients Occitanie One-week estimate  [Fig S66.](http://medrxiv.org/content/early/2019/11/25/19009795/F81) Fig S66. Occitanie Two-week estimate  [Fig S67.](http://medrxiv.org/content/early/2019/11/25/19009795/F82) Fig S67. Coefficients Occitanie Two-week estimate  [Fig S68.](http://medrxiv.org/content/early/2019/11/25/19009795/F83) Fig S68. Pays de la Loire Real-time estimate  [Fig S69.](http://medrxiv.org/content/early/2019/11/25/19009795/F84) Fig S69. Coefficients Pays de la Loire Real-time estimate  [Fig S70.](http://medrxiv.org/content/early/2019/11/25/19009795/F85) Fig S70. Pays de la Loire One-week estimate  [Fig S71.](http://medrxiv.org/content/early/2019/11/25/19009795/F86) Fig S71. Coefficients Pays de la Loire One-week estimate  [Fig S72.](http://medrxiv.org/content/early/2019/11/25/19009795/F87) Fig S72. Pays de la Loire Two-week estimate  [Fig S73.](http://medrxiv.org/content/early/2019/11/25/19009795/F88) Fig S73. Coefficients Pays de la Loire Two-week estimate  [Fig S74.](http://medrxiv.org/content/early/2019/11/25/19009795/F89) Fig S74. Provence Alpes Côte d’Azur Real-time estimate  [Fig S75.](http://medrxiv.org/content/early/2019/11/25/19009795/F90) Fig S75. Coefficients Provence Alpes Côte d’Azur Real-time estimate  [Fig S76.](http://medrxiv.org/content/early/2019/11/25/19009795/F91) Fig S76. Provence Alpes Côte d’Azur One-week estimate  [Fig S77.](http://medrxiv.org/content/early/2019/11/25/19009795/F92) Fig S77. Coefficients Provence Alpes Côte d’Azur One-week estimate  [Fig S78.](http://medrxiv.org/content/early/2019/11/25/19009795/F93) Fig S78. Provence Alpes Côte d’Azur Two-week estimate  [Fig S79.](http://medrxiv.org/content/early/2019/11/25/19009795/F94) Fig S79. Coefficients Provence Alpes Côte d’Azur Two-week estimate ## Acknowledgments We would like to thank the French National Research Agency for partially funding this work inside the Integrating and Sharing Health Data for Research Project (Grant No. ANR-15-CE19-0024). We also thank the French Sentinelles network and Google and Twitter services for making their data publicly available. MS and CP were partially funded by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM130668. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health * Received November 19, 2019. * Revision received November 19, 2019. * Accepted November 25, 2019. * © 2019, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. 1.Ferguson NM, Cummings DAT, Fraser C, Cajka JC, Cooley PC, Burke DS. Strategies for mitigating an influenza pandemic;442(7101):448–452. doi:10.1038/nature04795. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature04795&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=16642006&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000239278900042&link_type=ISI) 2. 2.Yang S, Santillana M, Kou SC. Accurate estimation of influenza epidemics using Google search data via ARGO;112:14473–14478. doi:10.1038/srep25732. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/srep25732&link_type=DOI) 3. 3.Yang W, Lipsitch M, Shaman J. Inference of seasonal and pandemic influenza transmission dynamics;112(9):2723–2728. doi:10.1073/pnas.1415012112. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMDoiMTEyLzkvMjcyMyI7czo0OiJhdG9tIjtzOjM5OiIvbWVkcnhpdi9lYXJseS8yMDE5LzExLzI1LzE5MDA5Nzk1LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 4. 4.Kalimeri K, Delfino M, Cattuto C, Perrotta D, Colizza V, Guerrisi C, et al. Unsupervised extraction of epidemic syndromes from participatory influenza surveillance self-reported symptoms;15(4):e1006173. doi:10.1371/journal.pcbi.1006173. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1006173&link_type=DOI) 5. 5. D M Fleming WJP J van der Velden. The evolution of influenza surveillance in Europe and prospects for the next 10 years;21:1749–1753. 6. 6.Santillana M, Nguyen AT, Louie T, Zink A, Gray J, Sung I, et al. Cloud-based Electronic Health Records for Real-time, Region-specific Influenza Surveillance;6:25732. doi:10.1038/srep25732. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/srep25732&link_type=DOI) 7. 7.Nsoesie EO, Brownstein JS, Ramakrishnan N, Marathe MV. A systematic review of studies on forecasting the dynamics of influenza outbreaks;8:309–316. 8. 8.Yang W, Karspeck A, Shaman J. Comparison of Filtering Methods for the Modeling and Retrospective Forecasting of Influenza Epidemics;10(4):e1003583. doi:10.1371/journal.pcbi.1003583. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1003583&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24762780&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 9. 9.Chretien JP, George D, Shaman J, Chitale RA, McKenzie FE. Influenza Forecasting in Human Populations: A Scoping Review;9:e94130. doi:10.1371/journal.pone.0094130. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0094130&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24714027&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 10. 10.Olson DR, Konty KJ, Paladini M, Viboud C, Simonsen L. Reassessing Google Flu Trends Data for Detection of Seasonal and Pandemic Influenza: A Comparative Epidemiological Study at Three Geographic Scales;9:e1003256. doi:10.1371/journal.pcbi.1003256. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1003256&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24146603&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 11. 11.Zhang Y, Bambrick H, Mengersen K, Tong S, Hu W. Using Google Trends and ambient temperature to predict seasonal influenza outbreaks;117:284–291. doi:10.1016/j.envint.2018.05.016. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.envint.2018.05.016&link_type=DOI) 12. 12.Paul MJ, Dredze M, Broniatowski D. Twitter Improves Influenza Forecasting; 6. doi:10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117&link_type=DOI) 13. 13.Santillana M, Nguyen AT, Dredze M, Paul MJ, Nsoesie EO, Brownstein JS. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance;11(10). doi:10.1371/journal.pcbi.1004513. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1004513&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26513245&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 14. 14.Mowery J. Twitter Influenza Surveillance: Quantifying Seasonal Misdiagnosis Patterns and their Impact on Surveillance Estimates; 8. doi:10.5210/ojphi.v8i3.7011. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.5210/ojphi.v8i3.7011&link_type=DOI) 15. 15.Sharpe JD, Hopkins RS, Cook RL, Striley CW. Evaluating Google, Twitter, and Wikipedia as Tools for Influenza Surveillance Using Bayesian Change Point Analysis: A Comparative Analysis;2(2). doi:10.2196/publichealth.5901. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2196/publichealth.5901&link_type=DOI) 16. 16.McIver DJ, Brownstein JS. Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time;10(4):e1003581. doi:10.1371/journal.pcbi.1003581. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1003581&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24743682&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 17. 17.Global Disease Monitoring and Forecasting with Wikipedia; 10. doi:10.1371/journal.pcbi.1003892. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1003892&link_type=DOI) 18. 18.Hickmann KS, Fairchild G, Priedhorsky R, Generous N, Hyman JM, Deshpande A, et al. Forecasting the 2013–2014 Influenza Season Using Wikipedia;11(5):e1004239. doi:10.1371/journal.pcbi.1004239. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1004239&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25974758&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 19. 19.Carneiro HA, Mylonakis E. Google Trends: A Web-Based Tool for Real-Time Surveillance of Disease Outbreaks;49(10):1557–1564. doi:10.1086/630200. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1086/630200&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19845471&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 20. 20.Butler D. When Google got flu wrong;494(7436):155–156. doi:10.1038/494155a. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/494155a&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23407515&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000315137700005&link_type=ISI) 21. 21.Santillana M, Nsoesie EO, Mekaru SR, Scales D, Brownstein JS. Using Clinicians’ Search Query Data to Monitor Influenza Epidemics;59(10):1446–1450. doi:10.1093/cid/ciu647. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/cid/ciu647&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25115873&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 22. 22.Smolinski MS, Crawley AW, Baltrusaitis K, Chunara R, Olsen JM, Wójcik O, et al. Flu Near You: Crowdsourced Symptom Reporting Spanning 2 Influenza Seasons;105(10):2124–2130. doi:10.2105/AJPH.2015.302696. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2105/AJPH.2015.302696&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26270299&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 23. 23.Biggerstaff M, Johansson M, Alper D, Brooks LC, Chakraborty P, Farrow DC, et al. Results from the second year of a collaborative effort to forecast influenza seasons in the United States;24:26–33. doi:10.1016/j.epidem.2018.02.003. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.epidem.2018.02.003&link_type=DOI) 24. 24.Poirier C, Lavenu A, Bertaud V, Campillo-Gimenez B, Chazard E, Cuggia M, et al. Real Time Influenza Monitoring Using Hospital Big Data in Combination with Machine Learning Methods: Comparison Study;4(4):e11361. doi:10.2196/11361. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2196/11361&link_type=DOI) 25. 25.Bouzillé G, Poirier C, Campillo-Gimenez B, Aubert ML, Chabot M, Chazard E, et al. Leveraging hospital big data to monitor flu epidemics;154:153–160. doi:10.1016/j.cmpb.2017.11.012. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.cmpb.2017.11.012&link_type=DOI) 26. 26.Viboud C, Charu V, Olson D, Ballesteros S, Gog J, Khan F, et al. Demonstrating the use of high-volume electronic medical claims data to monitor local and regional influenza activity in the US. PloS one. 2014;9(7):e102429. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0102429&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25072598&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 27. 27.Lu FS, Hattab MW, Clemente L, Santillana M. Improved state-level influenza activity nowcasting in the United States leveraging Internet-based data sources and network approaches via ARGONet; p. 344580. doi:10.1101/344580. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoiYmlvcnhpdiI7czo1OiJyZXNpZCI7czo4OiIzNDQ1ODB2MSI7czo0OiJhdG9tIjtzOjM5OiIvbWVkcnhpdi9lYXJseS8yMDE5LzExLzI1LzE5MDA5Nzk1LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 28. 28.Lowen AC, Steel J. Roles of Humidity and Temperature in Shaping Influenza Seasonality;88(14):7692–7695. doi:10.1128/JVI.03544-13. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoianZpIjtzOjU6InJlc2lkIjtzOjEwOiI4OC8xNC83NjkyIjtzOjQ6ImF0b20iO3M6Mzk6Ii9tZWRyeGl2L2Vhcmx5LzIwMTkvMTEvMjUvMTkwMDk3OTUuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 29. 29.Lowen AC, Mubareka S, Steel J, Palese P. Influenza Virus Transmission Is Dependent on Relative Humidity and Temperature;3(10):e151. doi:10.1371/journal.ppat.0030151. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.ppat.0030151&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17953482&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000251114100012&link_type=ISI) 30. 30.Tamerius JD, Shaman J, Alonso WJ, Bloom-Feshbach K, Uejio CK, Comrie A, et al. Environmental Predictors of Seasonal Influenza Epidemics across Temperate and Tropical Climates;9(3):e1003194. doi:10.1371/journal.ppat.1003194. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.ppat.1003194&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23505366&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2019%2F11%2F25%2F19009795.atom) 31. 31.Lawrence. The Relationship between Relative Humidity and the Dewpoint Temperature in Moist Air;doi:10.1175/BAMS-86-2-225. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1175/BAMS-86-2-225&link_type=DOI) 32. 32.Tibshirani R. Regression Shrinkage and Selection via the Lasso;58:267–288. 33. 33.R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing;. Available from: [https://www.R-project.org/](https://www.R-project.org/). 34. 34.from Jed Wing MKC, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, et al. caret: Classification and Regression Training; 2018. Available from: [https://CRAN.R-project.org/package=caret](https://CRAN.R-project.org/package=caret). 35. 35.Trapletti A, Hornik K. tseries: Time Series Analysis and Computational Finance;. Available from: [http://CRAN.R-project.org/package=tseries](http://CRAN.R-project.org/package=tseries). [1]: /embed/graphic-1.gif [2]: /embed/graphic-2.gif [3]: /embed/graphic-3.gif [4]: /embed/inline-graphic-1.gif [5]: /embed/inline-graphic-2.gif [6]: /embed/inline-graphic-3.gif [7]: /embed/inline-graphic-4.gif [8]: /embed/inline-graphic-5.gif [9]: /embed/graphic-4.gif [10]: /embed/graphic-5.gif [11]: /embed/graphic-6.gif [12]: /embed/inline-graphic-6.gif [13]: /embed/inline-graphic-7.gif [14]: /embed/graphic-7.gif [15]: /embed/graphic-8.gif [16]: /embed/graphic-9.gif [17]: /embed/inline-graphic-8.gif [18]: /embed/inline-graphic-9.gif [19]: /embed/inline-graphic-10.gif [20]: /embed/inline-graphic-11.gif [21]: /embed/inline-graphic-12.gif [22]: /embed/inline-graphic-13.gif