Influenza Modeling Based on Massive Feature Engineering and International Flow Deconvolution

12/06/2019 ∙ by Ziming Liu, et al. ∙ Peking University

In this article, we focus on the analysis of the potential factors driving the spread of influenza, and on possible policies to mitigate the adverse effects of the disease. To be precise, we first invoke the discrete Fourier transform (DFT) to establish a yearly periodic regional structure in influenza activity, thus safely restricting ourselves to the analysis of yearly influenza behavior. Then we collect a massive number of possible region-wise indicators contributing to influenza mortality, such as consumption, immunization, sanitation, water quality, and other indicators from external data, with 1170 dimensions in total. We extract significant features from the high-dimensional indicators using a combination of data analysis techniques, including matrix completion, support vector machines (SVM), autoencoders, and principal component analysis (PCA). Furthermore, we model the international flow of migration and trade as a convolution on regional influenza activity, and solve the deconvolution problem as a higher-order perturbation to the linear regression, thus separating the regional and international factors related to influenza mortality. Finally, both the original model and the perturbed model are tested on regional examples, as validation of our models. Pertaining to policy, we make a proposal based on the connectivity data, along with the previously extracted significant features, to alleviate the impact of influenza and to propagate and carry out the policies efficiently. We conclude that environmental features and economic features are significant to influenza mortality. The model can easily be adapted to model other types of infectious diseases.


1 Introduction

Infectious diseases pose an incessant threat to human health and welfare. Influenza, as one of the most prevalent diseases worldwide, is a typical recurrent seasonal epidemic. We seek to model the pattern of its behavior and its causes, based on the provided data and external data available online. Qualitative patterns are proposed, supplemented by a quantitative feature extraction along with its higher-order rectified version, to analyze the possible causes of influenza. Concrete policies are provided, evaluated both for effectiveness and for propagation viability, after a detailed regional validation of our models.

2 General Idea and Analysis

Our model is composed of five main parts. First, we preprocess the data and carefully select the relevant information. Then we analyze the properties of the available data qualitatively and build our model according to these indications and real-life scenarios. Thirdly, the principal features are extracted and their validity is tested against a priori criteria. Fourthly, we use these features to fit our models and derive the weight of each feature; a higher-order perturbed version of the model is also introduced to address international behavior, and we test the effectiveness of both models. Finally, we design the corresponding policies to prevent the spread of influenza, as suggested by our main model. The pipeline of our models is plotted in Figure 1.

Figure 1: Pipeline for our model

Data Augmentation.

Based on our formulation of the problem, we first augment the data related to influenza mortality. Besides the provided data on consumption, health indicators, connectivity, immunization, sanitation, and water quality, we also resort to the World Bank Open Data [11] to collect other possible indicators, such as labor or agriculture. In total, we make a massive collection of indicators with 1170 dimensions. The rationale for resorting to external data is the sparsity and insufficiency of the provided data. Although there are still missing values in the collected indicators, we can obtain a dense dataset via matrix completion [5]. Finally, we use support vector machines (SVM) [7] to certify that our augmented dataset is indeed consistent and satisfies the basic criteria.

Qualitative Analysis of Periodic Behavior.

Since influenza, like all recurrent infectious diseases, follows a somewhat periodic behavior, we have sound reason to conjecture a periodic outbreak pattern. Discrete Fourier transforms [10] of weekly influenza activity, both regional and global, indeed reveal such a pattern, and we find the period to be approximately one year. Thus we can safely use only the provided annual data, as it is representative of the mean influenza activity of each year.

Feature Extraction.

We shall use a combination of autoencoders [6] and principal component analysis (PCA) [1] to extract useful features from the augmented data. Autoencoders are used to introduce nonlinearity, and thus robustness, into our model. PCA is then introduced for an orthogonal design. We use autoencoders to project our features onto a low-dimensional Riemannian manifold, and use PCA to analyze its principal characteristics. Our method is in effect a kernel PCA, which enjoys both orthogonality and nonlinearity.

Regional Model.

To analyze the causes of influenza based on the extracted features, we propose a simple linear regression to fit the weights on the influenza mortality data, as opposed to the prevalent models [9] of PDEs / integro-differential equations. The rationale behind this choice is as follows. On the one hand, we do not possess a sufficiently fine temporal resolution of data points at such a global scale, rendering a continuous model unreliable. On the other hand, nonlinearity is already introduced in our data pre-processing and feature extraction. Therefore a stable linear model appears more rational, and it can easily be modified by international higher-order perturbation terms.

Global Model.

Taking into account the global migration of population and trade of products, we propose a rectified version of the regional model using the provided migration data, together with trade data from Trade Map [8]. Infectious diseases can indeed spread via migration and trade, so we model the international flow of migration and trade as a convolution on regional influenza activity. Furthermore, we use a second-order polynomial to model the region-wise dependence of the convolution kernel. We jointly optimize the coefficients of the deconvolution and the linear regression towards the correct prediction. Regional data shall be used to verify the correctness of our models.

Policy Design.

Based on the analysis of features and their importance, we shall design corresponding policies to mitigate the negative influences of influenza. Cost-effectiveness and viability shall also be taken into consideration when concrete policies are designed for each region.

3 Main Models and Guiding Methods

3.1 Discrete Fourier Transform

Fourier analysis is used to analyze the behavior of a continuous function in the frequency domain [3]. Specifically, the Fourier transform is a linear bijective mapping of the space $L^2$ onto itself. When given discrete data on a finite interval, the discrete Fourier transform (DFT) [4], as a variant of such methods, can be used to analyze the periodic behavior quantitatively.

For any given region, we consider the available data on influenza activity. For the known data points of weekly influenza activity $x_0, x_1, \dots, x_{N-1}$, we use the DFT technique to get the coefficients

$$c_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, 1, \dots, N-1. \qquad (1)$$

We can infer the periodic behavior of influenza activity from the peaks and the periodic decaying property of the coefficients $c_k$.
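
A minimal Python sketch of this step (the toy weekly series and NumPy-based peak detection are illustrative assumptions, not our exact pipeline):

```python
import numpy as np

# Toy weekly influenza activity for one region: three years of data with
# a yearly (52-week) cycle plus noise.
rng = np.random.default_rng(0)
weeks = np.arange(156)
activity = 100 + 50 * np.cos(2 * np.pi * weeks / 52) + rng.normal(size=156)

# DFT coefficients c_k of the mean-removed series, as in (1); the
# dominant frequency reveals the period.
c = np.fft.rfft(activity - activity.mean())
freqs = np.fft.rfftfreq(len(activity), d=1.0)  # cycles per week

k_peak = np.argmax(np.abs(c[1:])) + 1          # skip the k = 0 (mean) term
print(f"dominant period: {1.0 / freqs[k_peak]:.1f} weeks")  # ~52 weeks
```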

3.2 Matrix Completion

Matrix completion [5] is a conventional technique for filling in missing entries of a large matrix, with the aim of maintaining a sparse and low-rank structure. The precise formulation of the problem is as follows:

$$\min_{X} \ \operatorname{rank}(X) \quad \text{s.t.} \quad X_{ij} = M_{ij}, \ (i, j) \in \Omega, \qquad (2)$$

where $M$ is the partially observed matrix and $\Omega$ is the set of observed entries. Although finding the optimal completion is NP-hard, we can introduce regularization and relaxation to address the problem, even with noisy input. We resort to the Python package fancyimpute for a numerical implementation, completing the missing entries of the collected indicators relevant to influenza mortality.
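
A minimal sketch of the completion step, assuming fancyimpute's SoftImpute solver (the toy matrix and parameter choices are illustrative):

```python
import numpy as np
from fancyimpute import SoftImpute  # pip install fancyimpute

# Toy low-rank indicator matrix: rows = countries, columns = indicators.
rng = np.random.default_rng(0)
truth = rng.normal(size=(60, 8)) @ rng.normal(size=(8, 25))
X = truth.copy()
X[rng.random(X.shape) < 0.3] = np.nan      # hide 30% of the entries

# SoftImpute solves a nuclear-norm relaxation of problem (2) by iterative
# soft-thresholded SVDs, sidestepping the NP-hard rank minimization.
X_filled = SoftImpute(max_iters=200, verbose=False).fit_transform(X)
print(np.mean((X_filled - truth) ** 2))    # completion error on toy data
```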

3.3 Support Vector Machine

Given data points $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i$ is the input feature and $y_i \in \{+1, -1\}$ is the output pattern, the support vector machine approach [7] aims at constructing a classifier of the following form:

$$y(x) = \operatorname{sign}\big(w^{\top} x + b\big). \qquad (3)$$

Namely, we seek to separate the data labels by a separating hyperplane. The optimal parameters are achieved by the hyperplane minimizing the hinge loss function

$$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \max\big(0, \, 1 - y_i (w^{\top} x_i + b)\big) + \lambda \|w\|^2, \qquad (4)$$

where $\lambda$ is a tuning regularization parameter. We shall use SVM and a priori knowledge to verify the results after matrix completion.
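
A toy sketch of this verification, here using scikit-learn's LinearSVC as a stand-in hinge-loss solver (the data below is synthetic):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy version of the Section 4.1 check: separate developed (+1) from
# developing (-1) countries using two completed indicators.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], size=(30, 2)),
               rng.normal(loc=[-2, -2], size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

# LinearSVC with loss="hinge" minimizes (4); C is the inverse of the
# regularization strength lambda.
clf = LinearSVC(C=1.0, loss="hinge").fit(X, y)
print(clf.score(X, y))   # near 1.0 when the completed data is consistent
```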

3.4 Autoencoder

Figure 2: Architecture of our autoencoder

Recent years have witnessed the success of variants of autoencoders [6], mainly because autoencoders have the power to express and represent nonlinear manifolds in a high-dimensional space. We use autoencoders to reduce the dimension of the collected indicators, which can be modeled by the ‘bottleneck’ structure. To be specific, the architecture is plotted in Figure 2. The encoder part has 1170 cells as inputs, and multiple fully connected hidden layers with ReLU activation functions to introduce nonlinearity. A ‘bottleneck’ layer with $k$ cells ($k \ll 1170$) is reached in the middle. The decoder part is a mirror of the encoder, with output size 1170. The features in the ‘bottleneck’ layer are recognized as the ‘code’ which most likely represents the most important information hidden in the data, hence the name ‘autoencoder’.

In order to choose the optimal number of layers and the number of cells in the ‘bottleneck’ layer, and thus balance between the reconstruction error and model simplicity, we use the Bayesian information criterion (BIC) [2] to evaluate the overall performance of our model. Here, the BIC factor is defined as

$$\mathrm{BIC} = k \ln n + n \ln \hat{\sigma}^2, \qquad (5)$$

where $n$ is the number of countries and regions in context, $k$ is the number of cells in the ‘bottleneck’ layer, and $\hat{\sigma}^2$ is the reconstruction error per country.
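
A compact PyTorch sketch of the architecture and the BIC score (the hidden width, training budget, and toy data are illustrative assumptions; only the 1170-dimensional input is from our setting):

```python
import math
import torch
import torch.nn as nn

D, K = 1170, 20           # input dimension and a candidate bottleneck size

# Mirror-symmetric encoder/decoder with ReLU, as in Figure 2; the hidden
# width 256 is illustrative, not our exact choice.
model = nn.Sequential(
    nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, K),    # encoder
    nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, D),    # decoder
)

X = torch.randn(180, D)   # rows = countries/regions (toy data)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    opt.step()

# BIC score (5): smaller is better across candidate values of K.
n, sigma2 = X.shape[0], loss.item()
bic = K * math.log(n) + n * math.log(sigma2)
print(bic)
```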

3.5 Principal Component Analysis

Principal Component Analysis (PCA) [1] is a prevalent method for data compression and feature extraction. It uses an orthogonal transformation to convert high-dimensional data into a sequence of uncorrelated features.

The algorithm is defined as follows. For a (column-centered) data matrix $X \in \mathbb{R}^{n \times p}$, we extract the first component by maximizing the Rayleigh quotient

$$w_1 = \operatorname*{arg\,max}_{\|w\| = 1} \frac{w^{\top} X^{\top} X \, w}{w^{\top} w}. \qquad (6)$$

Then we subsequently extract the $k$-th component, given the first $k - 1$ ones, by a subtracted matrix

$$\hat{X}_k = X - \sum_{s=1}^{k-1} X w_s w_s^{\top} \qquad (7)$$

and a similar maximization of the Rayleigh quotient

$$w_k = \operatorname*{arg\,max}_{\|w\| = 1} \frac{w^{\top} \hat{X}_k^{\top} \hat{X}_k \, w}{w^{\top} w}. \qquad (8)$$

The procedure can also be explained as a truncation of the singular value decomposition (SVD) to its largest singular values. We use PCA to achieve an orthogonal design in the extracted feature space, in preparation for the regression.
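
A short NumPy sketch of this SVD view of PCA (the data and feature counts are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(180, 40))       # rows = regions, cols = AE features
Xc = X - X.mean(axis=0)              # center the columns first

# Equations (6)-(8) amount to truncating the SVD: the right singular
# vectors, ordered by singular value, are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10
components = Xc @ Vt[:k].T           # first k orthogonal features
print(s[:k] ** 2 / np.sum(s ** 2))   # variance explained by each
```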

3.6 Linear Regression

Since infectious diseases follow an exponential growth pattern, we take the logarithm of the death rate and find the resulting data to be approximately Gaussian. We shall perform the linear regression

$$z = F \beta + \varepsilon, \qquad (9)$$

where $z$ is the normalized log death rate, $F$ is the matrix of features derived by autoencoders and PCA, augmented by a column of ones as the intercept, and $\beta$, our target, holds the weight of each feature.
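
A minimal sketch of the fit, assuming NumPy's least-squares solver (toy features and target):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(180, 10))           # AE + PCA features
z = features @ rng.normal(size=10) + 0.1 * rng.normal(size=180)

# Augment with a column of ones for the intercept, then solve (9) by
# ordinary least squares.
F = np.hstack([np.ones((180, 1)), features])
beta, *_ = np.linalg.lstsq(F, z, rcond=None)
print(beta[:4])                                  # intercept + first weights
```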

3.7 Higher-Order Rectification

After the derivation of our regional model $z = F\beta$, where the weights $\beta$ are optimized by linear regression, we shall rectify the model by a higher-order perturbation,

$$z = F \beta + A z. \qquad (10)$$

The matrix $A$ represents the international flow of migration and trade, and is assumed small.
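
A sketch of why the rectified model stays tractable: for a fixed small flow matrix $A$, (10) is still linear in $\beta$ (all data below is synthetic):

```python
import numpy as np

# z = F beta + A z  implies  (I - A) z = F beta, so beta again follows
# from ordinary least squares once A is given.
rng = np.random.default_rng(0)
n, p = 180, 11
F = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])
A = 0.01 * rng.random((n, n))     # small international-flow perturbation
np.fill_diagonal(A, 0.0)          # no self-flow within a region
z = rng.normal(size=n)

beta, *_ = np.linalg.lstsq(F, (np.eye(n) - A) @ z, rcond=None)
```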

4 Concrete Analysis

4.1 Analysis of the Data After Completion

Figure 3: The result of SVM for raw data
Figure 4: The result of SVM for completed data
Figure 5: Fourier series for influenza activity in Japan
Figure 6: Fourier series for influenza activity on earth

We use matrix completion on world_bank_merge.csv, a merged version of the world_bank data from the World Bank [11]. The processed data world_bank_impute.csv is then tested by a support vector machine, to check the low-rank and sparse properties of our data.

To be precise, we divide the countries into developed and developing ones, an essential feature indicative of the general condition of a country. We use SVM for classification based on the completed features, and plot the results in Figures 3 and 4. The countries are successfully classified after completion, whereas the classification fails on the raw data, thus confirming the effectiveness of our matrix completion.

Next, we use the discrete Fourier transform to analyze the periodic behavior in the dataset influenza_activity.csv. The yearly periodic behavior is easily spotted in Figures 5 and 6, both regionally and globally, where the dominant peak corresponds to a period of one year.

4.2 Feature Extraction

We use autoencoders to extract features from the originally 1170-dimensional information in world_bank_impute.csv. We use a four-layer autoencoder, with the number of cells per layer following a geometric series to better capture the information, decaying from the 1170-dimensional input down to the ‘bottleneck’ layer and mirrored back up to the output layer.

The advantages of the autoencoder method over PCA lie in its nonlinearity and expressiveness, which can be seen by comparing the reconstruction errors of PCA and autoencoders in Figure 7. The error of autoencoders is far less than that of PCA for the same number of features. Thus autoencoders can indeed capture the nonlinearity of data features, which is beyond the abilities of PCA. However, taking the subsequent regression procedure into consideration, we shall invoke a PCA orthogonalization after the autoencoder.

Figure 7: Comparison of autoencoders with PCA

Finally, we make some observations on our extracted features. The features are derived by back-propagating the PCA outputs, and they are linear combinations of the provided information. The specific data and corresponding information are stored in the zip file Feature_selection, with each .csv file corresponding to one extracted feature, listed in order of importance.

4.3 Linear Regression and Higher-Order Rectification

We take the influenza mortality data in death_rate_ghe2016.csv from the WHO [12], and form the normalized log death rate $z$, stored in z_normal.txt. We use a linear regression to obtain the weights $\beta$ in $z = F\beta$, where $F$ is the feature matrix obtained by autoencoders and PCA.

As a higher-order rectification, we use the model

$$z = F \beta + A z, \qquad (11)$$

where $A$ is assumed small by the intuition of perturbation. Each entry $A_{ij}$ represents the normalized mortality transferred from region $j$ to region $i$, which is composed of the flow amount multiplied by a region-dependent factor. The factor is further approximated by a second-order polynomial in the regional activity, inspired by the local rectangular basis in the finite element method:

$$A_{ij} = \big(a_0 + a_1 z_j + a_2 z_j^2\big)\, m_{ij} + \big(b_0 + b_1 z_j + b_2 z_j^2\big)\, t_{ij}, \qquad (12)$$

where $a_k$, $b_k$ are the coefficients to be optimized, and $m_{ij}$, $t_{ij}$ are the normalized migration and trade amounts from region $j$ to region $i$.

The coefficients in the rectified version of our model can be jointly optimized with the linear regression, which essentially solves the approximated deconvolution problem. We store both of the outputs in out.txt.
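
A sketch of the joint optimization, assuming the polynomial kernel (12) in the regional mortality $z_j$ and SciPy's L-BFGS-B optimizer (the parameterization and data here are illustrative assumptions, not our exact implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 60, 5
F = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])
z = rng.normal(size=n)
m = 0.01 * rng.random((n, n))    # normalized migration flows j -> i
t = 0.01 * rng.random((n, n))    # normalized trade flows j -> i

def loss(theta):
    """Squared residual of z = F beta + A z, with A built from (12)."""
    beta, a, b = theta[:p], theta[p:p + 3], theta[p + 3:]
    pa = a[0] + a[1] * z + a[2] * z ** 2   # second-order polynomial factors
    pb = b[0] + b[1] * z + b[2] * z ** 2
    A = m * pa[None, :] + t * pb[None, :]  # A_ij = pa_j m_ij + pb_j t_ij
    return np.sum((z - F @ beta - A @ z) ** 2)

res = minimize(loss, np.zeros(p + 6), method="L-BFGS-B")
beta_opt = res.x[:p]                       # jointly fitted regression weights
```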

As a final refinement of our model, we compute the $p$-value of each feature and reject the features whose $p$-value exceeds the significance threshold. The refined weights are listed in out_selected.txt and presented in Table 1.
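
A sketch of this refinement step, here using statsmodels for the OLS $p$-values and 0.05 as an assumed cutoff (the original threshold is not reproduced here):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
features = rng.normal(size=(180, 10))
z = 2.0 * features[:, 0] + 0.5 * rng.normal(size=180)  # only feature 0 matters

# Fit OLS, then keep only the features whose p-value clears the cutoff.
fit = sm.OLS(z, sm.add_constant(features)).fit()
keep = np.where(fit.pvalues[1:] < 0.05)[0]   # 0.05 is an assumed threshold
print(keep, fit.params[1:][keep])            # surviving features and weights
```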

Table 1: Refined weights of the significant features (columns: Feature, Weight; values stored in out_selected.txt)

A bootstrap cross-validation test is conducted to examine the behavior of our models. In Figures 8 and 9, we study the behavior of our model with and without the higher-order rectification; a sketch of the resampling procedure is given after Figure 9. We can conclude that the rectified model outperforms the original one by a considerable margin.

Figure 8: Bootstrap test for linear regression model
Figure 9: Bootstrap test for model with rectification
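
A sketch of the bootstrap test (out-of-bag $R^2$ is an illustrative score; the figures may use a different metric):

```python
import numpy as np

def bootstrap_scores(F, z, n_boot=500, seed=0):
    """Refit on bootstrap resamples, score on the out-of-bag rows."""
    rng = np.random.default_rng(seed)
    n, scores = len(z), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # held-out validation rows
        if oob.size == 0:
            continue
        beta, *_ = np.linalg.lstsq(F[idx], z[idx], rcond=None)
        resid = z[oob] - F[oob] @ beta
        scores.append(1.0 - resid.var() / z[oob].var())  # out-of-bag R^2
    return np.array(scores)
```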

We now present some of our significant features and analyze their interpretation in real-life scenarios, as in Table 2.

Feature | 1                              | 2                              | 3
Bad     | Other greenhouse gas emissions | Fertilizer consumption         | Rural poverty gap
Medium  | Electricity production         | Employment to population ratio | Imports of goods and services
Good    | Average precipitation in depth | Community health workers       | Health care
Table 2: Information contributing to influenza mortality

5 Policy Design

5.1 Global Policies

Based on Table 2, we can design the following policies worldwide.

  1. We can control the use of environmentally harmful products, such as fertilizers and electronic devices prone to greenhouse gas emissions.

  2. We should design policies to diminish the poverty gap, so as to provide the rural population with a better environment against influenza.

  3. We can enhance health welfare, directing more government expenditure to health insurance and community health care to combat the influenza virus.

5.2 Regional Policies

In order to design regional policies specifically, we calculate the significant features for some typical countries, namely Nigeria, Japan, and the USA, as in Table 3.

Table 3: Coefficients of the significant features for Nigeria, Japan, and the USA (columns: Feature, Nigeria, Japan, USA)
  1. We analyze the data of Nigeria, which has the highest influenza mortality. We conclude that the emissions-related feature is the most significant in the region, and we thus should impose restrictions on environmentally harmful activities such as the emission of greenhouse gases.

  2. We carry out a similar analysis of Japan, where influenza shows strong periodicity and one of the lowest mortality rates. We conclude that the community-health feature should be enhanced in the region, and we thus should increase the number of community health workers. Japan is an isolated and highly populated country, so better community health care could lower the risk of infectious diseases spreading through intense contact.

  3. Finally, we study the causes of the spread of influenza in the United States, which has one of the lowest influenza mortality rates. We conclude that special notice should be taken of the poverty-gap feature in the region. Namely, we should heed the poverty gap and provide better health care for the poor.

The policies are indeed viable and can be propagated easily on the Internet. For example, we can put up slogans in marches and parades to raise public awareness of environmental issues. The government should allocate more of its funding to community health care by adjusting its budgets. To address the problem of the poverty gap, the government could raise the minimum wage and impose relatively higher taxes on the rich.

6 Robustness and Regional Validation

We introduce an additional column of randomly generated data into world_bank_merge.csv, and find that our feature-extraction and weight-regression results remain similar, up to a small perturbation. Similarly, an additional row (region) of randomly generated data is introduced into world_bank_merge.csv, and the output again changes only by a small perturbation. Therefore, we can deduce that our model is highly robust and stable.
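
A sketch of the column-perturbation test (the relative-change metric is an illustrative choice):

```python
import numpy as np

def noise_column_perturbation(F, z, seed=0):
    """Relative change in the weights after appending one random column."""
    rng = np.random.default_rng(seed)
    beta, *_ = np.linalg.lstsq(F, z, rcond=None)
    F_noisy = np.hstack([F, rng.normal(size=(len(z), 1))])
    beta_noisy, *_ = np.linalg.lstsq(F_noisy, z, rcond=None)
    # Small output => the original weights are stable under the perturbation.
    return np.max(np.abs(beta_noisy[:len(beta)] - beta)) / np.max(np.abs(beta))
```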

Also, we compute the residuals of our models in residual.txt, as regional validation of the models. We find that our model achieves satisfactory accuracy, with moderate model and algorithm errors.

7 Assessment of Our Models

7.1 Strength of the Model

  1. The model consists of a linear regression on important features plus a higher-order perturbation. Thus it generalizes well and combines regional as well as international information.

  2. External datasets are introduced to help reconstruct missing information and features. Matrix completion preserves the sparse and low-rank structure, matching real-life scenarios. A high-dimensional data space ensures the comprehensiveness of our features, and model-reduction methods are deployed to capture the crucial information.

  3. The DFT method is invoked to verify the periodic behavior of influenza.

  4. Causal inference, cross-validation, and tests of robustness are executed for the sake of stability.

  5. The model is transferable to other types of infectious diseases, due to its interpretability and its comprehensive coverage of potential factors contributing to influenza behavior.

7.2 Deficiencies of the Model

  1. We neglect the time evolution of influenza for lack of data.

  2. A simplified version of the migration model is introduced, due to the complexity of modeling the dynamic system of a Markov process.

  3. Policies are designed without taking into account the correlation between important features. Combined policies could be proposed as a future improvement.

8 Conclusion

We propose in this article a model of influenza spread based on a combination of massive feature engineering and international flow deconvolution. We detect a periodic behavior in influenza activity and use yearly normalized mortality data for regression. Features are extracted from the augmented dataset, and weights are computed by a linear regression with a higher-order graph-deconvolution rectification. We conclude that the spread of influenza is affected by the local environment and national economies. Policies are designed both regionally and globally to mitigate the adverse effects of influenza.

Our model is highly interpretable. Bringing the nonlinearity of the feature extraction into the linear regression model, supplemented by a higher-order graph deconvolution, increases the robustness and consistency of the model. Furthermore, our feature-extraction methods and main model can easily be transferred to analyze the spread of other infectious diseases.

Acknowledgement

We thank Citadel and Correlation One for all their dedicated efforts in holding such a wonderful competition. This work would have been impossible without their help and support.

References

  • [1] H. Abdi and L. J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
  • [2] J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.
  • [3] L. Grafakos. Classical Fourier Analysis, volume 2. Springer, 2008.
  • [4] F. J. Harris. On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1):51–83, 1978.
  • [5] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.
  • [6] A. Ng et al. Sparse autoencoder. CS294A Lecture Notes, 72(2011):1–19, 2011.
  • [7] J. A. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
  • [8] Trade Map. Trade statistics for international business development. https://www.trademap.org/, 2018.
  • [9] R. G. Webster, W. J. Bean, O. T. Gorman, T. M. Chambers, and Y. Kawaoka. Evolution and ecology of influenza A viruses. Microbiological Reviews, 56(1):152–179, 1992.
  • [10] S. Weinstein and P. Ebert. Data transmission by frequency-division multiplexing using the discrete Fourier transform. IEEE Transactions on Communication Technology, 19(5):628–634, 1971.
  • [11] World Bank. World Bank Open Data. https://data.worldbank.org/, 2018.
  • [12] World Health Organization. Global Health Estimates 2016: Deaths by cause, age, sex, by country and by region, 2000–2016. https://www.who.int/healthinfo/global_burden_disease/estimates/en/, 2018.