
Machine Learning-Based Two-Stage Data Selection Scheme for Long-Term Influenza Forecasting

2021-12-14 06:06:38
Computers, Materials & Continua, 2021, Issue 9

Jaeuk Moon,Seungwon Jung,Sungwoo Park and Eenjun Hwang

School of Electrical Engineering,Korea University,Seoul,02841,Korea

Abstract: One popular strategy to reduce the enormous number of illnesses and deaths from a seasonal influenza pandemic is to obtain the influenza vaccine on time. Usually, vaccine production preparation must be done at least six months in advance, and accurate long-term influenza forecasting is essential for this. Although diverse machine learning models have been proposed for influenza forecasting, they focus on short-term forecasting, and their performance depends heavily on the input variables. For a country’s long-term influenza forecasting, typical surveillance data are known to be more effective than diverse external data on the Internet. We propose a two-stage data selection scheme for worldwide surveillance data to construct a long-term forecasting model for influenza in the target country. In the first stage, using a simple forecasting model based on the country’s surveillance data, we measured the change in performance by adding surveillance data from other countries, shifted by up to 52 weeks. In the second stage, for each set of surveillance data sorted by accuracy, we incrementally added data as input if the data had a positive effect on the performance of the forecasting model from the first stage. Using the selected surveillance data, we trained a new long-term forecasting model for influenza and performed influenza forecasting for the target country. We conducted extensive experiments using six machine learning models for the three target countries to verify the effectiveness of the proposed method. We report some of the results.

Keywords:Influenza;data selection;machine learning;forecasting

1 Introduction

Seasonal influenza is one of the most globally prevalent infectious diseases, annually causing tens of millions of respiratory illnesses and hundreds of thousands of deaths worldwide [1]. Most countries have established national health institutes and conducted various activities, including disinfection and quarantine, to reduce the losses. One of the most effective ways to prevent an influenza pandemic is to prepare an influenza vaccine on time [2]. However, due to the time-consuming nature of vaccine production, an elaborate vaccine strategy should be prepared at least six months in advance [3]. Substantial uncertainty exists in such a long-term strategy, which leads to an imbalance in the supply and demand of influenza vaccines during influenza seasons. An insufficient number of vaccines cannot prevent an increase in the number of influenza patients. Conversely, an oversupply of vaccines leads to economic loss because obsolete vaccines must be discarded. Therefore, forecasting how many influenza patients will occur after a long period is required to ensure a smooth vaccine supply [4].

With the recent development of machine learning (ML) technology,various ML-based forecasting models have been proposed to achieve better forecasting accuracy [5-8].These models take data-driven approaches to identify the influence of various factors represented by the input variables and show superior forecasting performance compared to the compartmental and statistical models [9,10].However,most ML-based forecasting models for influenza perform short-term forecasting.Diverse data sources exist for short-term influenza forecasting,such as Google searches and microblogs.Their effectiveness has already been proven in many studies [11,12].In contrast,long-term forecasting faces increasing uncertainty from various sources,such as the accumulation of errors and lack of information [13].Hence,selecting appropriate input variables is important to guarantee good forecasting performance in ML-based long-term influenza forecasting.Otherwise,the forecasting performance may deteriorate [14].For long-term influenza forecasting,traditional surveillance data can provide much more comprehensive data with a small time lag,which is easier to maintain and more reliable for a long-term decision-making system than Internet data [9].Traditional surveillance data generally refers to the number of reported cases over time for a particular disease.

In this paper,we propose an ML-based two-stage data selection scheme for worldwide surveillance data to construct a long-term forecasting model for influenza in a target country after 26 weeks (about six months).In the first stage,based on a simple forecasting model using the country’s surveillance data,we measured performance by adding foreign surveillance data shifted by up to 52 weeks.In the second stage,for each set of surveillance data sorted by performance,we incrementally added each set of surveillance data if it positively affects forecasting.Using the resulting surveillance data,we trained another long-term forecasting model of influenza.To evaluate the performance of our proposed method,we conducted extensive experiments using the influenza surveillance data for various countries and several ML models.

The contributions of this paper are as follows:

· We propose a data selection method for foreign surveillance data,which we use as input variables to construct an accurate forecasting model of influenza regardless of the target country or forecasting model.

· We achieved outstanding forecasting accuracy using traditional surveillance data as input variables without external data sources.

· We verified the effectiveness of the proposed method through extensive comparisons with popular ML models.

The remainder of this paper is organized as follows.Section 2 discusses the literature review.Next,Section 3 describes the details of the proposed scheme.Section 4 demonstrates various experiments,and finally,Section 5 presents the conclusions.

2 Related Work

This section presents a brief literature review on various models for influenza forecasting. Work on influenza forecasting can generally be classified into three categories. The first category is based on compartmental models, which use differential equations to model infectious disease transmission. For instance, Zhang et al. [15] proposed a spatiotemporal risk assessment model based on the susceptible-infected-recovered model, evaluating four influencing factors: biological, behavioral, and environmental parameters and infectious sources. They displayed the model output in a set of maps to analyze how these factors affect the spread of infectious diseases in Beijing. Although these models require a relatively small number of parameters, they are limited because they often have low forecasting accuracy or cannot find correlations among various data when dealing with big data [16].

The second category is based on statistical or time-series-based models.For instance,Choi et al.[17]employed a Bayesian maximum entropy method of spatiotemporal statistics to analyze the geographical risk patterns of influenza mortality in California during winter.They found that the high risk of influenza initially occurred during December in the west-central region of the state,and the risk distribution was extended in the south and east-central regions of the state.Choi et al.[18]proposed an autoregressive integrated moving average (ARIMA) model for forecasting influenza activity.They collected the number of deaths related to pneumonia and influenza and used their ratio to measure influenza activity.Although these models are flexible in capturing the trending behavior of the affected populations,they sometimes suffer from poor accuracy because the influence of external factors (e.g.,climate and environmental factors) is not captured well [19].

The final category is based on ML-based models. Cheng et al. [9] proposed an ensemble approach for short-term influenza forecasting in Taiwan. They applied four forecasting models, namely random forest (RF), ARIMA, support vector regression, and extreme gradient boosting (XGB), to traditional surveillance data from Taiwan. Then, they integrated the forecasting results using a linear kernel model to produce a more robust model than any individual model. Park et al. [8] used data from other countries to improve the forecasting accuracy of influenza occurrences in a specific country. They measured the similarity between traditional surveillance data from the target country and other countries using the Euclidean distance. Then, they selected countries with high similarity in influenza patterns and exploited their data as input variables using a light gradient boosting machine (LGBM). Venna et al. [19] proposed a long short-term memory (LSTM)-based multi-stage scheme for influenza forecasting. They employed LSTM to capture the temporal dynamics of seasonal influenza in the first stage. During subsequent stages, the situational time lag between the influenza occurrence and weather variables and the spatial proximity of different geographical regions were captured to adjust the error introduced by the original forecasting model and further improve the model performance. These models take a data-driven approach and grasp the influence of various factors used as input variables, exhibiting superior forecasting results compared to the compartmental and statistical models [9,10].

Most studies on influenza forecasting,including the aforementioned studies,have been for short-term forecasting.In contrast,the studies on long-term influenza forecasting are not yet sufficient.In short-term influenza forecasting,diverse data sources exist,such as Google searches and microblogs,whose effectiveness has already been proven [11,12].However,long-term forecasting faces growing uncertainty arising from various problems,such as the accumulation of errors and lack of information [13].Choi et al.[20]presented a long-term influenza forecasting scheme for the United States (US) using influenza activities in other countries.They first collected the data widely used in influenza forecasting,including traditional surveillance data from other countries and search queries from Google Trend (GT).They calculated the cross-correlation between the traditional data from the US and other countries with similar seasonal patterns and influenza outbreaks,and employed highly correlated data as input variables to forecast the next seasonal influenza pattern using ML models.Although they achieved remarkable forecasting accuracy,they manually selected countries with a high correlation to the target country using a statistical method.In this study,we propose a data selection scheme that automatically finds surveillance data from other countries that can contribute to the prediction of the target country using an ML model to improve efficiency and accuracy.We demonstrate that the input configuration based on this selection improves the performance of diverse influenza forecasting models.

3 Method

In this section,we describe the proposed scheme in detail.Fig.1 illustrates the overall structure of the scheme,which is composed of three parts:(1) data collection and preprocessing,(2) data selection for configuring input variables for the forecasting model,and (3) influenza forecasting for the target country.In the following sections,we first describe the data collection and preprocessing part,and then the data selection and the forecasting part together.

3.1 Data Collection and Preprocessing

In this study,we collected influenza surveillance data from the FluNet database of the World Health Organization (WHO) [21].FluNet is a global web-based tool for influenza virological surveillance,which was first launched in 1997.Influenza surveillance data have been uploaded to the FluNet database every week.Among the diverse data provided by FluNet,we used the number of influenza patients in 168 countries from the first week in 2010 to the 52nd week in 2018.We collected data from 2010 because influenza showed a peculiar outbreak pattern in 2009 due to the introduction of a new epidemic strain (INF A H1N1 pdm09) [22].Some countries uploaded surveillance data only when they had high influenza activity.Hence,we replaced the missing data in the dataset with zero.For the collected surveillance data,we performed a min-max normalization for each country,which is necessary to prevent data selection from focusing on a few specific countries with high average occurrences.
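The preprocessing described above (zero-filling missing weeks, then per-country min-max normalization) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fill_and_scale`, the use of `None` to mark a missing week, the mapping of a constant series to zeros, and the toy numbers are all assumptions.

```python
def fill_and_scale(series):
    """Replace missing weekly counts (None) with 0, then min-max scale to [0, 1]."""
    filled = [0.0 if v is None else float(v) for v in series]
    lo, hi = min(filled), max(filled)
    if hi == lo:
        # Assumption: a constant series carries no signal, so map it to zeros.
        return [0.0 for _ in filled]
    return [(v - lo) / (hi - lo) for v in filled]


# Toy data (hypothetical): two countries with very different magnitudes.
raw = {
    "A": [10, None, 30, 50],
    "B": [1000, 3000, None, 5000],
}
scaled = {country: fill_and_scale(values) for country, values in raw.items()}
```

After scaling, both toy countries span the same range, so a country with large absolute counts no longer dominates the selection.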

In addition to the surveillance data, we also collected time information, such as the year and week in which the surveillance data were collected. As one year consists of 52 or 53 weeks, we represented the week in the ML model using this week number. In particular, to reflect the periodic property of the week, we transformed the week number into two-dimensional data using Eqs. (1) and (2) [23]:

$$week_x = \sin\left(\frac{360^\circ}{cycle} \times week\right) \tag{1}$$

$$week_y = \cos\left(\frac{360^\circ}{cycle} \times week\right) \tag{2}$$

Here, cycle represents the period of the week. For instance, when we transform the first week of 2015 into two-dimensional data, week and cycle are 1 and 53, respectively, and week_x and week_y are sin((360/53) * 1) and cos((360/53) * 1), respectively. Further explanations of this transformation can be found in [24].
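The week transformation can be illustrated with a few lines of Python. This is a sketch under the definitions given in the text (angles in degrees, cycle equal to 52 or 53); the function name `encode_week` is hypothetical.

```python
import math


def encode_week(week, cycle):
    """Map a week number onto the unit circle: (week_x, week_y) = (sin, cos)."""
    angle = math.radians(360.0 / cycle * week)
    return math.sin(angle), math.cos(angle)


# The example from the text: the first week of 2015, a 53-week year.
week_x, week_y = encode_week(1, 53)

# The last week of the year lands next to the first week on the circle,
# which a plain week number (53 vs. 1) would not express.
last_x, last_y = encode_week(53, 53)
distance = math.hypot(week_x - last_x, week_y - last_y)
```

The point of the two-dimensional form is that week 53 and week 1 end up as neighbours, so the model sees the year as a cycle.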

Tab. 1 lists the initial input and output variables that we considered in this paper. Initial input variables of the forecasting ML model are necessary to provide basic information such as time and the number of influenza patients in the target country. The initial input variables include the year, week, week_x, week_y, and Occur_{target,t}. The last variable represents the number of patients in the target country at time t. One output variable, Occur_{target,t+26}, represents the number of influenza patients in the target country 26 weeks (6 months) later because we aim for long-term forecasting for the vaccine strategy.
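To make the input/output pairing concrete, the sketch below aligns each week t with the occurrence 26 weeks later. It is an illustration only: `make_supervised` is a hypothetical helper, and dropping the final 26 weeks (which have no target yet) is an assumption about how the tail is handled.

```python
def make_supervised(occur_target, horizon=26):
    """Pair Occur_{target,t} with Occur_{target,t+horizon}.

    Weeks at the end of the series that have no value `horizon` weeks
    ahead are dropped (an assumption of this sketch).
    """
    n = max(len(occur_target) - horizon, 0)
    inputs = list(occur_target[:n])
    outputs = list(occur_target[horizon:horizon + n])
    return inputs, outputs


# Toy series (hypothetical): 100 weeks where the value equals the week index.
weekly = list(range(100))
x, y = make_supervised(weekly)
```

With 100 weeks and a 26-week horizon, 74 input/output pairs remain, and every output lies exactly 26 weeks after its input.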

3.2 Data Selection and Forecasting

Figure 1:Overall structure of the proposed scheme

One simple method to construct an influenza forecasting model is to use all surveillance data collected worldwide for training and validation.However,this method takes a long time and can have poor predictive performance.It would be better to consider data that contribute to influenza forecasting from the collected surveillance data to construct a forecasting model more effectively.To do this,we perform data selection in two stages.In the first stage,we construct a simple ML forecasting model using the surveillance data from the target country,which are initial input variables and output variables in Tab.1.Then,(1) we measure the accuracy of the forecasting model after appending the foreign surveillance data unit to the initial input variables,possibly shifted by up to 52 weeks.In the second stage,(2) we sort all surveillance data units in the order of accuracy.Then,we consider the surveillance data units one by one.(3) If it positively affects forecasting,(4) we select and incrementally add the data to the model as an input variable.Otherwise,we ignore the unit and continue the process.We used this strategy assuming that the data that produced higher forecasting accuracy contain more useful information for forecasting.Finally,the collection of selected surveillance data is used to train the target long-term influenza forecasting model.To evaluate the performance of our proposed method,we conducted extensive experiments using the influenza surveillance data from 168 countries and diverse ML models for data selection and influenza forecasting.

Algorithm 1 describes the overall steps for performing data selection.We considered six ML models for data selection,including the Gradient Boosting Machine (GBM),RF,and linear regression (LR).We also used a validation set to measure the forecasting accuracy.As an evaluation metric,we used the root mean squared error (RMSE).

Table 1:Input and output variables for the data selection model

Algorithm 1: Data selection and input variable configuration
Input: Initial input variables I_initial
Output: Final input variables I_final
countries: List of 168 countries
Occur_{i,t-j}: Occurrence data of country i shifted right j times
SO: Sorted list of occurrence data in order of forecasting accuracy
FM(I): Forecasting model using input variables I

for i in countries do    // The first stage of data selection
    for j = 0 to 52 do
        measure Accuracy of FM(I_initial ∪ Occur_{i,t-j})    // (1)
        insert Occur_{i,t-j} into SO    // (2)
I_final = I_initial    // The second stage of data selection
for i in SO do
    if Accuracy of FM(I_final) < Accuracy of FM(I_final ∪ i) then    // (3)
        I_final = I_final ∪ i    // (4)
return I_final
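The two stages of Algorithm 1 can be sketched in Python as below. This is a hedged illustration rather than the authors' implementation: the names `shift_right`, `two_stage_select`, and `evaluate` are hypothetical, padding the start of a shifted series with zeros is an assumption, and the toy `evaluate` only stands in for "train FM(I) and return its validation RMSE" (lower is better).

```python
import math


def shift_right(series, j, fill=0.0):
    """Occur_{i,t-j}: delay a series by j weeks (start padded with `fill`)."""
    return ([fill] * j + list(series))[:len(series)]


def two_stage_select(initial, foreign, evaluate, max_shift=52):
    """Return the input set chosen by the two-stage scheme.

    initial  : dict name -> series (the initial input variables)
    foreign  : dict country -> weekly occurrence series
    evaluate : callable(dict of inputs) -> validation RMSE
    """
    # Stage 1: score each (country, shift) unit added alone to the initial inputs.
    scored = []
    for country, series in foreign.items():
        for j in range(max_shift + 1):
            name = f"{country}_t-{j}"
            unit = shift_right(series, j)
            scored.append((evaluate({**initial, name: unit}), name, unit))
    scored.sort(key=lambda item: item[0])  # most accurate unit first

    # Stage 2: walk the sorted list and keep a unit only if it lowers the RMSE.
    final = dict(initial)
    best = evaluate(final)
    for _, name, unit in scored:
        rmse = evaluate({**final, name: unit})
        if rmse < best:
            final[name] = unit
            best = rmse
    return final


# Toy setup (hypothetical): the target equals country B delayed by two weeks.
foreign = {"A": [1, 2, 3, 4, 5, 6], "B": [5, 1, 4, 2, 6, 3]}
target = shift_right(foreign["B"], 2)


def evaluate(inputs):
    # Stand-in for a trained model: RMSE of the single input closest to the target.
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(s, target)) / len(target))
        for s in inputs.values()
    )


final = two_stage_select({"own": [0.0] * 6}, foreign, evaluate, max_shift=3)
```

In this toy run, only the unit that actually matches the target survives the second stage; in practice `evaluate` would retrain the chosen ML model on each candidate input set, which is where most of the cost lies.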

4 Experimental Setup

To demonstrate the effectiveness of the proposed scheme,we constructed six different influenza forecasting models based on the ML models used in data selection using the selected surveillance data as input variables.In summary,we have 36 different combinations of data selection and forecasting methods.The six models are composed of two simple models and four ensemble models.The two simple models are the decision tree (DT) and LR,and the four ensemble models are GBM,XGB,LGBM,and RF.

The hyperparameters of each model,such as the number of estimators or maximum depth of the ensemble models,were decided using the grid search with the validation set.When the validation set is changed,the hyperparameters of each model are reset and redetermined.We implemented all models using Python 3.7.3 with the scikit-learn library [25].

We used time-series cross-validation to evaluate the accuracy of the forecasting models. This method is closer to real-world practical applications than the traditional evaluation method, which divides a dataset into training and testing sets to train and test models, respectively [26]. We forecast influenza one year at a time for each year from 2015 to 2018 and used the data from the year before the test year as the validation set. For example, when the test year is 2015, we use data from 2010 to 2013 as the training set and data from 2014 as the validation set. Likewise, when the test year is 2016, we use data from 2010 to 2014 as the training set and data from 2015 as the validation set.
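The year-based splits described above can be enumerated as follows. The helper name `yearly_splits` and the dictionary layout are assumptions made for illustration.

```python
def yearly_splits(first_year=2010, test_years=(2015, 2016, 2017, 2018)):
    """List the train/validation/test years for each forecasting round."""
    splits = []
    for test in test_years:
        valid = test - 1                       # the year just before the test year
        train = list(range(first_year, valid))  # every earlier year since 2010
        splits.append({"train": train, "valid": valid, "test": test})
    return splits


splits = yearly_splits()
for s in splits:
    print(s["train"][0], "-", s["train"][-1], "| valid", s["valid"], "| test", s["test"])
```

The training window grows by one year per round, while the validation and test years slide forward together.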

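We also calculated the RMSE and mean absolute error (MAE) using Eqs. (3) and (4), respectively, to compare accuracy:

$$RMSE = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(A_n - F_n\right)^2} \tag{3}$$

$$MAE = \frac{1}{N}\sum_{n=1}^{N}\left|A_n - F_n\right| \tag{4}$$

Here, N is the number of data samples, and A_n and F_n represent the actual influenza occurrence and the forecasted influenza occurrence, respectively.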
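As a small worked example, the two error measures in their standard form can be computed as below. The function names and the toy numbers are illustrative assumptions, not values from the experiments.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error between actual and forecasted occurrences."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)


def mae(actual, forecast):
    """Mean absolute error between actual and forecasted occurrences."""
    n = len(actual)
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / n


# Toy values (hypothetical): the errors are -2, 2, -3, and 4.
actual = [10, 20, 30, 40]
forecast = [12, 18, 33, 36]
print(rmse(actual, forecast), mae(actual, forecast))
```

Because the RMSE squares each error before averaging, it is never smaller than the MAE and penalizes the single large miss (4) more heavily.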

5 Experiments and Discussion

To verify the effectiveness of our data selection scheme,we performed two experiments for three target countries:the US,China,and the Republic of Korea (Korea),and Tab.2 shows a brief summary of the datasets for the target countries.In the first experiment,we demonstrated why the proposed data selection is valid.In the second experiment,we evaluated the effect of our data selection on the forecasting performance.

Table 2:Dataset summaries of the target countries

5.1 Validity of Data Selection

In this experiment,we present two analysis results to justify our data selection scheme:(1) the RMSE changes according to input variables and (2) the comparison between the target country and the first selected country.

Figure 2:Root mean squared error (RMSE) according to data selection.(a) US.(b) China.(c)Korea.

Fig. 2 depicts the changes in the RMSE of the three target countries according to the input variables when forecasting the influenza occurrence for 2018 using XGB. The x-axis represents the number of selected data, and the y-axis represents the RMSE value. The initial RMSE values in the US, China, and Korea were around 5500, 2200, and 45, respectively, which are the forecasting results of ML models constructed using only the initial input variables. The number of data points selected by the proposed data selection scheme for the US, China, and Korea is 45, 56, and 61, respectively. All three countries’ graphs depict a sharp decline at the beginning and then a very modest decline. This result indicates that using a few well-selected surveillance data points can significantly improve forecasting performance.

Tab. 3 presents detailed information about the selected data, including the country, how many shifts are performed on the initial surveillance data for the three target countries, and the reduction in RMSE by adding the selected data as input. For instance, in the US, the data from Chile 50 weeks ago (Occur_{Chile,t-50}) improved forecasting performance the most. Data from Kazakhstan, Belarus, and Congo follow, but these data did not significantly improve the forecasting performance.

Table 3:List of selected data and their root mean squared error (RMSE) with reduced value

Figure 3:Scaled influenza occurrence of the target country and the first selected country.(a) US(b) China (c) Korea

Fig. 3 illustrates the graphs of the target country after 26 weeks (Occur_{target,t+26}) and the first selected country in Tab. 3. The x-axis represents weeks from 2010 to 2018, and the y-axis represents the influenza occurrence scaled from 0 to 1. The first added data (Occur_{Chile,t-50} for the US, Occur_{SriLanka,t-15} for China, and Occur_{Korea,t-19} for Korea) exhibit patterns similar to the target data except for the first two years (104 weeks). This result indicates that the proposed scheme can select influenza patterns in foreign countries that are very similar to those of the target country.

5.2 Forecasting Performance

In this experiment,we investigated the effect of the proposed data selection scheme on forecasting performance and then compared the performance of diverse combinations of data selection and forecasting models.Tabs.4-6 list the RMSE values for the US,China,and Korea,respectively.In the case of the US,the DT exhibits the best performance for data selection in most cases.However,using LR for both data selection and forecasting exhibits the best overall performance.For China,LR exhibits the best performance in data selection,and in Korea,LR has the best performance in data selection in half of all cases.

Table 4:Root mean squared error (RMSE) of data selection and forecasting in the US

Table 5:Root mean squared error (RMSE) of data selection and forecasting in China

We also compared the accuracy of diverse forecasting models in terms of input variables to verify the effectiveness of the configured input variables. For comparison, we considered four input variable sets: (1) I, the initial input variables; (2) I+GT, which adds the GT data of the current time to the initial input variables; (3) I+T, which adds one year of domestic traditional surveillance data of the target country, from Occur_{target,t-1} to Occur_{target,t-52}, to the initial input variables; and (4) I+T+GT, which adds both the GT data and the domestic traditional surveillance data of the target country. For each forecasting model, the result of the best-performing data selection model (bold font in Tabs. 4-6) was used as the performance of our proposed scheme. Tabs. 7-9 present the forecasting performance (in RMSE and MAE) for the US, China, and Korea, respectively. Input variables that demonstrated the best performance in each forecasting model are marked in bold font.

Table 6:Root mean squared error (RMSE) of data selection and forecasting in Korea

Table 7:Comparison of forecasting result of baseline models and the proposed scheme in the US

Table 8:Comparison of forecasting result of baseline models and the proposed scheme in China

In the comparison,the proposed scheme exhibited the best performance among all input variable sets we constructed.For China and the US,the forecasting performance deteriorated in several cases when using traditional surveillance data or GT data.Although the GT data have been used in short-term influenza forecasting in many papers [27,28],the data are not effective for long-term forecasting.According to Tab.3,the top three surveillance data that improved the forecasting performance in Korea were from the same country.As a result,in Korea (Tab.9),past domestic surveillance data contributed to improving forecasting performance in most cases,which indicates that input variables configured using the proposed scheme are more effective for long-term influenza forecasting than other commonly used input variables.

Table 9:Comparison of forecasting result of baseline models and the proposed scheme in Korea

Figure 4:Forecasting in the US using domestic and selected foreign data vs.actual data

We compared the actual occurrence data with predicted occurrences using initial input variables and selected surveillance data as input variables,respectively,to investigate the effectiveness of the proposed scheme more closely.Figs.4-6 display the comparison results for the US,China,and Korea,respectively.For the US,the forecasting results using the initial input variables showed a slight difference in the first peak time prediction,whereas the proposed scheme gave an accurate prediction for the first peak time.It also showed accurate predictions for the remaining peak times.For China,domestic surveillance data were not sufficient to forecast all peaks accurately.However,the proposed scheme showed accurate forecasting performance.In Korea,forecasting using the proposed scheme matched most of the peak times.

Figure 5:Forecasting in China using domestic and selected foreign data vs.actual data

Figure 6:Forecasting in Korea using domestic and selected foreign data vs.actual data

6 Conclusion

In this paper, we proposed a two-stage data selection scheme for foreign surveillance data to improve the performance of long-term influenza forecasting. In the first stage, we evaluated each foreign surveillance data unit using an ML-based model constructed for the target country based on domestic surveillance data. In the second stage, we evaluated each surveillance data unit in the order of accuracy and incrementally added it to the current model if it improved forecasting performance. We constructed diverse ML-based forecasting models for three countries using the selected data as input variables to evaluate the effect of the proposed data selection scheme. We performed extensive experiments to determine the effects of different combinations of data selection and forecasting models. The experimental results demonstrated that our data selection scheme was effective in constructing an influenza forecasting model for the target country. Furthermore, the input variable set configured by the proposed scheme consistently enhanced the forecasting accuracy compared to the input variable sets using traditional surveillance data of the target country or GT data, which are popularly used in influenza forecasting.

However,the limitation of our data selection scheme is that it evaluates the suitability of input variables through the validation set.If the validation set has many different patterns from the testing set,the input variables configured by the proposed scheme may not improve the forecasting performance.In future work,we will design an influenza forecasting scheme that does not need the validation set for input variable configuration and is more applicable for the real world than this scheme.

Funding Statement:This research was supported by a government-wide R&D fund project for infectious disease research (GFID),Republic of Korea (Grant Number:HG19C0682).

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.

Appendix A.Comparison of Forecasting Results Using PCC Values

We also measured the Pearson correlation coefficient (PCC) in the same way as in the comparison experiment in Section 5.2. Tabs. A1-A3 list the obtained PCC values for the US, China, and Korea, respectively. In the tables, the input variables that demonstrated the best performance in each forecasting model are marked in bold font. The proposed scheme achieved the best PCC in all cases.
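For reference, the PCC between actual and forecasted occurrences can be computed with the standard formula, as sketched below. The function name `pearson` and the toy series are assumptions made for illustration.

```python
import math


def pearson(actual, forecast):
    """Pearson correlation coefficient between two equally long series."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_f = sum(forecast) / n
    cov = sum((a - mean_a) * (f - mean_f) for a, f in zip(actual, forecast))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    if var_a == 0 or var_f == 0:
        raise ValueError("PCC is undefined for a constant series")
    return cov / math.sqrt(var_a * var_f)


# Toy values (hypothetical): a forecast that tracks the shape of the actual curve.
actual = [1, 3, 7, 4, 2]
forecast = [2, 4, 9, 5, 2]
print(pearson(actual, forecast))
```

Unlike the RMSE and MAE, the PCC ignores scale and offset, so it rewards forecasts that follow the shape and timing of the peaks even when the magnitudes differ.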

Table A1:Comparison of PCC value of baseline models and the proposed scheme in the US

Table A2: Comparison of PCC value of baseline models and the proposed scheme in China

Table A3: Comparison of PCC value of baseline models and the proposed scheme in Korea
