I. Introduction
The recent resurgence of neural networks trained with backpropagation has established state-of-the-art results in a wide range of domains. However, backpropagation-based neural networks (NNs) suffer from several disadvantages, including, but not limited to, the lack of uncertainty estimation, a tendency to overfit small data sets, and the need to tune many hyperparameters. In backpropagation NNs, the lack of uncertainty information arises because the weights are treated as point estimates tuned with gradient-descent methods. By contrast,
Bayesian neural networks (BNNs) [1, 2] can cope with some of these problems by assigning a prior distribution to the parameters [3, 4, 5, 6]. Nonetheless, exact Bayesian inference in BNNs is intractable, and researchers have developed various approximate methods to estimate the uncertainty of the weights.
Blundell et al. [6] proposed a backpropagation-based algorithm that learns the posterior distribution of the weights by training an ensemble of networks. This method builds upon the work of Hinton and Van Camp [7] and Graves [5]. Specifically, the method, termed Bayes by Backprop, learns the uncertainty of the weights by minimising a variational free-energy bound on the marginal likelihood.
In addition to variational inference, expectation propagation has been applied to estimate the uncertainty of NNs. Hernández-Lobato and Adams [4] developed a scalable algorithm that propagates probabilities forward through the network to obtain the marginal likelihood and then obtains the gradients of the marginal probability in the backward step. Similarly, Soudry et al. [8] described an expectation propagation algorithm that approximates the posterior of the weights with a factorized distribution in an online setting. Variational inference has also been shown to be theoretically equivalent to the dropout method widely used as a regularization technique for deep NNs [9]. Furthermore, the authors developed tools to extract model uncertainty from dropout NNs. However, variational estimates typically underestimate the uncertainty of the posterior because they ignore the posterior correlations between the weights.
The Kalman filter (KF) is a common approach for parameter estimation in linear dynamic systems. A number of works have estimated the parameters of NNs with extended Kalman filters (EKFs) [10, 11], which perform parameter estimation for nonlinear dynamical systems. However, EKFs have been criticized for the large errors in the posterior mean and covariance introduced by the first-order linearization of the nonlinear system [12]. Unscented Kalman filters (UKFs) have instead been explored for parameter estimation in nonlinear models and are claimed to capture the posterior mean and covariance to higher-order accuracy [13].
Compared to UKFs, ensemble Kalman filters (EnKFs) scale much better with the dimensionality of the state while still capturing non-Gaussian distributions. By propagating ensembles rather than mean values and covariances, EnKFs avoid the computation and storage costs of dealing with large matrices. EnKFs are therefore capable of handling very large state dimensions, which are common in NNs with many weights.
Surprisingly, very little attention has been paid to applying EnKFs to parameter estimation in NNs. In an attempt to introduce EnKFs to the deep learning community, we evaluate the performance of an EnKF for the parameter estimation of an LSTM and apply it to an outlier detection task. The goal is to model the evolution of the probability distribution of the observed features over time using recurrent neural networks (RNNs). The probability distribution is then used to determine whether an observation is an outlier.
RNNs are networks designed for sequence data. They have been successfully applied in areas including speech recognition, image caption generation, and machine translation [14, 15, 16]. Compared to feedforward networks, RNNs can capture information from all previous time steps and share the same parameters across all steps. The term “recurrent” means that the network can be unfolded so that, at each step, the hidden layer performs the same task on different inputs. A typical RNN architecture is illustrated in Figure 1.
Standard RNNs are limited by the vanishing-gradient problem. To cope with this issue, Long Short-Term Memory (LSTM) [17] networks have been developed to maintain long-term dependencies by specifying a gating mechanism for the passing of information through the hidden layers. Namely, memory blocks replace the traditional hidden units and store information in a cell variable. Each memory block has four components: a memory cell, an input gate, an output gate, and a forget gate.
We propose a Bayesian LSTM where the uncertainty in the weights is estimated using an EnKF. To mitigate the underestimation of the error covariance due to various sources, such as model errors, nonlinearity, and limited ensemble size, in this study we optimize the covariance inflation using maximum likelihood estimation. To assess the proposed algorithm, we apply it to outlier detection in five real-world events retrieved from the Twitter platform.
In the following methodology section, we introduce the LSTM, Bayesian inference using the proposed EnKF, and their application to general outlier detection problems. This is followed by the subevent detection application in Twitter streams, where the problem specifics and numerical results are presented in the experiments section.
II. Methodology
Given an observed sequence of features, $y_{1:k} = \{y_1, \ldots, y_k\}$, the goal is to construct the predictive probability density function (pdf) $p(y_{k+1} \mid y_{1:k})$ using a Bayesian LSTM. This pdf is then used to determine whether the next observation is an outlier ($y_{k+1}$ denotes the actual observation).

II-A. Long Short-Term Memory (LSTM)
In an LSTM, each hidden unit in Figure 1 is replaced by a memory cell. Each memory cell is composed of an input gate, a forget gate, an output gate, and an internal state, which process the input data through the gating mechanism depicted in Figure 2 and detailed in the following formulas.
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$  (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$  (2)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$  (3)
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$  (4)
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$  (5)
$h_t = o_t \odot \tanh(c_t)$  (6)
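To make the gating mechanism concrete, the six equations above can be sketched as a single NumPy step function. This is a minimal illustration rather than the paper's implementation; the dictionary layout of the weights and the toy dimensions are assumptions made for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One memory-block step following Eqs. (1)-(6).
    W, U, b are dicts keyed by gate name ('i', 'f', 'o', 'c')."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell activation
    c = f * c_prev + i * g                                # internal state update
    h = o * np.tanh(c)                                    # cell output
    return h, c

# tiny example: 3-dim input, 4-dim hidden state (sizes are illustrative)
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = {k: rng.normal(size=(d_h, d_in)) * 0.1 for k in "ifoc"}
U = {k: rng.normal(size=(d_h, d_h)) * 0.1 for k in "ifoc"}
b = {k: np.zeros(d_h) for k in "ifoc"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```

Iterating `lstm_step` over a sequence while carrying `h` and `c` forward reproduces the unfolding described in the introduction.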
Here, $\sigma$ is the logistic sigmoid function; $i_t$, $f_t$, and $o_t$ represent the three gates and $c_t$ the internal state; $W_\ast$ and $U_\ast$ are the input and recurrent weight matrices and $b_\ast$ the bias terms; $\tilde{c}_t$ is the cell activation vector; $\odot$ is the element-wise product; $\sigma$ and $\tanh$ are the activation functions; and $x_t$ and $h_t$ represent the input and the output vector, respectively.

II-B. Ensemble Kalman Filter (EnKF)
The ensemble Kalman filter (EnKF) is an approximate inference method for the Bayesian nonlinear filtering problem. It can deal with extremely high-dimensional and nonlinear applications [19]. In the EnKF, the probability distribution of the state variables is described by ensemble members, each updated in a similar way as in the Kalman filter. Consider the following system:

$x_k = f(x_{k-1})$  (7)
$y_k = H x_k + \varepsilon_k$  (8)
where $x_k$ is the state variable and $y_k$ is the measurement perturbed by the noise $\varepsilon_k \sim \mathcal{N}(0, R)$. Here we use $\{x_k^{(j)}\}_{j=1}^{N}$ to denote the ensemble members of $x_k$. By propagating them through Eq. (7), we obtain predictions of $x_k$. Once the measurement $y_k$ is obtained, the ensemble members can be updated as follows:
$y_k^{(j)} = y_k + \varepsilon^{(j)}$  (9)
$K_k = P_k H^{\mathsf{T}} \bigl(H P_k H^{\mathsf{T}} + R\bigr)^{-1}$  (10)
$x_k^{(j)} \leftarrow x_k^{(j)} + K_k \bigl(y_k^{(j)} - H x_k^{(j)}\bigr)$  (11)
where the $y_k^{(j)}$ are perturbed measurements and $\varepsilon^{(j)}$ is a sample of $\mathcal{N}(0, R)$. $P_k$ is the sample covariance matrix of the predicted ensemble. Using these perturbed measurements guarantees the same result as the Kalman filter when the ensemble size is infinite [19].
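A minimal sketch of the analysis step in Eqs. (9)-(11), assuming a linear observation operator and Gaussian measurement noise; the function and variable names are illustrative:

```python
import numpy as np

def enkf_update(X, H, y, R, rng):
    """Stochastic EnKF analysis step.
    X: (n, N) ensemble of state vectors; H: (m, n) observation operator;
    y: (m,) measurement; R: (m, m) measurement-noise covariance."""
    n, N = X.shape
    # perturbed measurements: y_j = y + eps_j, eps_j ~ N(0, R)
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=N).T
    # sample covariance of the forecast ensemble
    A = X - X.mean(axis=1, keepdims=True)
    P = A @ A.T / (N - 1)
    # Kalman gain, then update each member with its own perturbed measurement
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    return X + K @ (Y - H @ X)

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 500))          # 2-dim state, 500 ensemble members
H = np.array([[1.0, 0.0]])             # observe the first component only
Xa = enkf_update(X, H, np.array([3.0]), np.array([[0.1]]), rng)
```

Note that the dense covariance `P` is formed here only for clarity; in large-state settings one would work with the ensemble anomalies `A` directly, which is exactly the scaling advantage discussed in the introduction.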
II-C. Bayesian LSTM using EnKF
An RNN can be represented as

$y = h(x, w) + \varepsilon$  (12)

where $x$ is the input data, $w$ is the parameter vector (weights), $h$ is a nonlinear neural network mapping (e.g. an LSTM), and $\varepsilon$ is the noise which compensates for the difference between the outputs of the neural network and the real target values.
Let $D$ denote the training data, $D = \{(x_k, y_k)\}_{k=1}^{K}$. Given a new input $x^\ast$, we are interested in the predictive distribution

$p(y^\ast \mid x^\ast, D) = \int p(y^\ast \mid x^\ast, w)\, p(w \mid D)\, dw$  (13)
Here $p(w \mid D)$ is the conditional distribution of the weights given the training data $D$, which can be obtained via Bayes’ rule:

$p(w \mid D) = \dfrac{p(D \mid w)\, p(w)}{p(D)}$  (14)

$p(w)$ is the prior distribution of the weights $w$, $p(D)$ is the evidence, and $p(D \mid w)$ is the likelihood, which can be obtained through Eq. (12).
To evaluate the integral in Eq. (13), we need a solution for Eq. (14). Since the neural network is a nonlinear mapping, a common way to solve Eq. (14) is to use Monte Carlo methods. Suppose $N$ samples $\{w_j\}_{j=1}^{N}$ of $p(w \mid D)$ are available, and $\delta(\cdot)$ represents the Dirac delta function. Then $p(w \mid D)$ and the predictive distribution can be approximated as follows:

$p(w \mid D) \approx \frac{1}{N} \sum_{j=1}^{N} \delta(w - w_j)$  (15)
$p(y^\ast \mid x^\ast, D) \approx \frac{1}{N} \sum_{j=1}^{N} p(y^\ast \mid x^\ast, w_j)$  (16)
Let $\{y^\ast_j\}_{j=1}^{N}$ denote samples drawn from the component densities $p(y^\ast \mid x^\ast, w_j)$; the predictive distribution is then approximated by the empirical distribution of these samples.
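As a toy illustration of this Monte Carlo approximation (the one-dimensional "network" and the stand-in posterior samples below are assumptions made for the example, not the paper's model):

```python
import numpy as np

def predictive_samples(f, x_star, weight_samples):
    """Monte Carlo predictive: evaluate the network output for a new input
    under each posterior weight sample, per Eqs. (13) and (16)."""
    return np.array([f(x_star, w) for w in weight_samples])

# toy example: a scalar 'network' f(x, w) = tanh(w * x)
f = lambda x, w: np.tanh(w * x)
rng = np.random.default_rng(2)
ws = rng.normal(loc=1.0, scale=0.1, size=200)   # stand-in posterior samples
ys = predictive_samples(f, 0.5, ws)
mean, var = ys.mean(), ys.var()                 # predictive mean and spread
```

The spread of `ys` reflects the weight uncertainty propagated to the output, which is the quantity used for outlier detection below.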
In this paper, we use the EnKF to estimate the uncertainty of the weights $w$. The corresponding dynamic system is shown in Eq. (17) and Eq. (18):

$w_k = w_{k-1}$  (17)
$y_k = h(x_k, w_k) + \varepsilon_k$  (18)
$w_0$ has a prior distribution $\mathcal{N}(0, \sigma_w^2 I)$ and $\varepsilon_k$ is white noise with distribution $\mathcal{N}(0, \sigma^2 I_m)$. Here $n$ and $m$ represent the dimensionality of the features and the targets, respectively. In order to preserve the relation between consecutive data, the training data are sent to the LSTM in batches. Suppose the batch size is $B$ and the number of weights is $n_w$. Since the weights are the quantity that needs to be estimated, we augment the output of the neural network with $w$ to form an augmented state variable $z_i = [w;\, h_i]$, where $h_i$ collects all the outputs of the $i$-th batch. The matching measurement model of Eq. (8) is given by Eq. (19):

$y_i = [\,0 \;\; I\,]\, z_i + \varepsilon_i$  (19)
Before inference, two hyperparameters, $\sigma_w$ and $\sigma$, need to be determined. A common way is to maximize the evidence with respect to $\sigma_w$ and $\sigma$:

$p(D \mid \sigma_w, \sigma) = \int p(D \mid w, \sigma)\, p(w \mid \sigma_w)\, dw$  (20)
This has been successfully applied to Bayesian linear regression. However, for nonlinear models it is difficult to evaluate the integral above. Here, we fix $\sigma_w$ and estimate $\sigma$ by maximizing $p(D \mid \sigma)$. Under the assumption that the data points are generated independently, we have

$p(D \mid \sigma) = \prod_{k=1}^{K} p(y_k \mid x_k, \sigma)$  (21)
For simplicity, we omit $\sigma$ in $p(y_k \mid x_k, \sigma)$. Each $p(y_k \mid x_k)$ is calculated as follows:

$p(y_k \mid x_k) \approx \frac{1}{N} \sum_{j=1}^{N} \mathcal{N}\!\bigl(y_k;\, h(x_k, w_j),\, \sigma^2 I\bigr)$  (22)

Here $\mathcal{N}(\cdot\,; \mu, \Sigma)$ denotes a Gaussian density with mean $\mu$ and covariance $\Sigma$. Thus the log-evidence is given by

$\log p(D) = \sum_{k=1}^{K} \log \Bigl[\frac{1}{N} \sum_{j=1}^{N} \mathcal{N}\!\bigl(y_k;\, h(x_k, w_j),\, \sigma^2 I\bigr)\Bigr]$
Since $\log$ is a concave function, according to Jensen’s inequality we have

$\log \Bigl[\frac{1}{N} \sum_{j=1}^{N} a_j\Bigr] \ge \frac{1}{N} \sum_{j=1}^{N} \log a_j$  (23)
Thus we obtain the lower bound of $\log p(D)$:

$\log p(D) \ge \sum_{k=1}^{K} \frac{1}{N} \sum_{j=1}^{N} \log \mathcal{N}\!\bigl(y_k;\, h(x_k, w_j),\, \sigma^2 I\bigr)$  (24)
Maximizing the lower bound of the log-evidence with respect to $\sigma^2$, we obtain

$\hat{\sigma}^2 = \frac{1}{K N m} \sum_{k=1}^{K} \sum_{j=1}^{N} \bigl\| y_k - h(x_k, w_j) \bigr\|^2$  (25)
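Under the Gaussian likelihood above, the maximizer of the lower bound reduces to the average squared residual between the targets and the per-member predictions. A small sketch, with illustrative array shapes:

```python
import numpy as np

def sigma2_mle(Y, Yhat):
    """Closed-form maximizer of the Jensen lower bound: the mean squared
    residual over data points, ensemble members, and target dimensions.
    Y: (K, m) targets; Yhat: (N, K, m) per-member predictions."""
    resid = Y[None, :, :] - Yhat
    return np.mean(resid ** 2)

# toy check: every member predicts 0.5 while all targets are 0
Y = np.zeros((10, 3))
Yhat = np.full((50, 10, 3), 0.5)
s2 = sigma2_mle(Y, Yhat)   # every squared residual is 0.25
```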
II-D. Outlier Detection
The inferred distribution of the weights induces a predictive distribution for the next observable $y_{k+1}$. We can use this probability distribution to label the actual observation as an outlier. Since each observation is a multidimensional feature vector of dimension $d$, we can use the chi-squared test on the squared Mahalanobis distance [20]. The main idea is to identify when a data point falls outside the multidimensional uncertainty even when the marginal uncertainties capture the observational data.
The Mahalanobis distance between the actual observation $y_{k+1}$ and the predicted uncertainty, approximated by a Gaussian distribution with mean $\mu$ and covariance $\Sigma$, is given by

$D_M^2 = (y_{k+1} - \mu)^{\mathsf{T}}\, \Sigma^{-1}\, (y_{k+1} - \mu)$  (26)
where the sample mean $\mu$ and covariance matrix $\Sigma$ are obtained from the ensemble members propagated to the observable.
When the square of the Mahalanobis distance passes the following test, the observation is considered not to be an outlier but a plausible outcome of the model. Here the number of degrees of freedom used to obtain the chi-squared quantile $\chi_d^2$ is $d$:

$D_M^2 \le \chi_d^2(1 - \alpha)$  (27)
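A sketch of this test using `scipy.stats.chi2`; the significance level `alpha` and the toy ensemble below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2

def is_outlier(y, ensemble_preds, alpha=0.05):
    """Flag y as an outlier when its squared Mahalanobis distance to the
    ensemble-predicted Gaussian exceeds the chi-squared quantile.
    ensemble_preds: (N, d) propagated ensemble members."""
    mu = ensemble_preds.mean(axis=0)
    S = np.cov(ensemble_preds, rowvar=False)
    d2 = (y - mu) @ np.linalg.solve(S, y - mu)   # squared Mahalanobis distance
    return d2 > chi2.ppf(1 - alpha, df=len(y))

rng = np.random.default_rng(3)
preds = rng.normal(size=(1000, 5))               # d = 5, as in the experiments
inlier_flag = is_outlier(np.zeros(5), preds)     # near the ensemble mean
outlier_flag = is_outlier(np.full(5, 10.0), preds)  # far outside the uncertainty
```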
III. Experiments
III-A. Subevent Detection
An event is confined by space and time. Specifically, it consists of a set of subevents, depicting different facets of an event [21]. As an event evolves, users usually post new statuses to capture new states as subevents of the main issue [22]. Within an event, some unexpected situations or results may occur and surprise users, such as the bombing during the Boston Marathon and the power outage during the 2013 Superbowl. Subevent detection provides a deeper understanding of the threats to better manage the situation within a crisis [23].
By formalizing subevent detection as an outlier detection task, we build dynamic models to detect subevents in the retrieved Twitter data using the proposed window embedding representation described in the following sections.
III-B. Data
Event  Collection Starting Time  Event Time  Collection Ending Time  Key Words/Hashtags 
2013 Boston Marathon  04/12/2013 00:00:00  04/15/2013 14:49:00  04/18/2013 23:59:59  Marathon, #marathon 
2013 Superbowl  01/31/2013 00:00:00  02/03/2013 20:38:00  02/06/2013 23:59:59  Superbowl, giants, ravens, harbaugh 
2013 OSCAR  02/21/2013 00:00:00  02/24/2013 20:30:00  02/27/2013 23:59:59  Oscar, #sethmacfarlane, #academyawards 
2013 NBA AllStar  02/14/2013 00:00:00  02/17/2013 20:30:00  02/20/2013 23:59:59  allstar, allstar 
Zimmerman Trial  07/12/2013 11:30:00  07/13/2013 22:00:00  07/15/2013 11:30:00  trayvon, zimmerman 
We collected the data from Jan. 2, 2013 to Oct. 7, 2014 with the Twitter streaming API and selected five national events for the outlier detection task. The five events are the 2013 Boston Marathon, the 2013 Superbowl, the 2013 OSCAR ceremony, the 2013 NBA All-Star game, and the Zimmerman trial. Each of these events consists of a variety of subevents, such as the bombing for the marathon event, the power outage for the Superbowl event, the nomination moment of the best picture award, the ceremony for the NBA All-Star MVP, and the jury's verdict for the Zimmerman trial event.
For these case studies, we filtered relevant tweets using event-related keywords and hashtags, and preprocessed the data to remove URLs and user mentions. The basic information for each event is provided in Table I.
III-C. Window Embedding
In computational linguistics, distributed representations of words have shown advantages over raw co-occurrence counts, since they can capture the contextual information of words. As categorized by Baroni et al. [24], distributed semantic models can be divided into count models and prediction models. On the one hand, count models, including LSA, HAL, and Hellinger PCA, efficiently use the statistics of co-occurrence information but are limited in capturing complex patterns beyond word similarities. On the other hand, prediction models, such as NNLM and word2vec, can capture complex patterns among words but make limited use of the statistical information of the words. To cope with the limits of each approach, Pennington et al. [25] proposed the following weighted least squares objective:

$J = \sum_{i,j=1}^{V} f(X_{ij}) \bigl( w_i^{\mathsf{T}} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \bigr)^2$  (28)

where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $b_i$ are the word vector and bias of word $i$, $\tilde{w}_j$ and $\tilde{b}_j$ are the context word vector and bias of word $j$, and $f$ is a predefined weighting function:

$f(x) = (x / x_{\max})^{\alpha}$ if $x < x_{\max}$, and $f(x) = 1$ otherwise.
Vector representations can be used as features and have been successfully applied in many natural language processing applications [25]. Through some experiments, we decided to use the 100-dimensional GloVe vector representations trained on 27 billion tokens of Twitter data. We further used probabilistic PCA to reduce the vector dimensionality to the number of latent components that capture at least 99% of the variability of the original information. Here, we define a sentence embedding as the average of its word vectors: given a sentence consisting of $n$ words represented by vectors $w_1, \ldots, w_n$, the sentence embedding is $s = \frac{1}{n} \sum_{i=1}^{n} w_i$. Furthermore, we define a window embedding as the average of its sentence vectors: for a given time window composed of sentence vectors $s_1, \ldots, s_p$, the window embedding is $v = \frac{1}{p} \sum_{j=1}^{p} s_j$. As we use a moving-window approach, we group consecutive window embeddings into a training input and label the following window as the training target. Based upon the grouped data, we can train our proposed multivariate EnKF-LSTM model. After some experiments, we chose 5 as the number of latent components; the time-window length (in minutes) and the grouping size were likewise chosen empirically.
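The embedding pipeline just described (word vectors → sentence averages → window averages → grouped training pairs) might be sketched as follows; the helper names, the toy data, and the grouping size `q` are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def sentence_embedding(word_vectors):
    """Average the word vectors of one sentence."""
    return np.mean(word_vectors, axis=0)

def window_embedding(sentence_vectors):
    """Average the sentence vectors falling inside one time window."""
    return np.mean(sentence_vectors, axis=0)

def make_training_pairs(windows, q):
    """Group every q consecutive window embeddings into a training input,
    with the immediately following window as the target."""
    X = [windows[i:i + q] for i in range(len(windows) - q)]
    y = [windows[i + q] for i in range(len(windows) - q)]
    return np.array(X), np.array(y)

# toy data: 60 sentences of 5-dim word vectors, 2 sentences per time window
rng = np.random.default_rng(4)
sents = [rng.normal(size=(int(rng.integers(3, 8)), 5)) for _ in range(60)]
svecs = np.array([sentence_embedding(s) for s in sents])
wins = np.array([window_embedding(svecs[i:i + 2]) for i in range(0, 60, 2)])
X, y = make_training_pairs(wins, q=10)
```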
III-D. Implementation
The implemented network architecture is shown in Figure 3. The input layer consists of 5 nodes, the hidden layer consists of 32 LSTM cells, and the output layer consists of 5 nodes. In this implementation, we include the forget gate proposed in [26]. The implementation is based upon TensorFlow, and it can easily be extended to deep architectures or variants of LSTMs.
Figure 3 provides an intuitive overview of the architecture and the proposed algorithm. The algorithm proceeds in batch mode. At the very beginning, the prior weights are drawn from a multivariate Gaussian distribution. Subsequently, we forward-propagate each batch through the LSTM cells to obtain the network outputs. We then augment the outputs with the prior weights and update the augmented variable using the EnKF, and the updated weights are carried over to the next batch.
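The batch procedure can be summarized in the following sketch. To keep it self-contained, the LSTM is replaced by a generic `forward` function on a linear toy problem, and the observation operator follows the augmented-state form of Eq. (19); every name and size here is illustrative rather than the paper's implementation:

```python
import numpy as np

def train_enkf_lstm(batches, forward, n_weights, n_ens, sigma, rng):
    """Sketch of the batch procedure: draw a prior weight ensemble,
    forward-propagate each batch, augment the outputs with the weights,
    and update the augmented state with a stochastic EnKF step."""
    W = rng.normal(size=(n_weights, n_ens))              # prior weight ensemble
    for batch_x, batch_y in batches:
        # per-member network outputs for this batch
        Hw = np.stack([forward(W[:, j], batch_x) for j in range(n_ens)], axis=1)
        Z = np.vstack([W, Hw])                           # augmented state [w; h]
        m = Hw.shape[0]
        y = batch_y[:, None] + rng.normal(scale=sigma, size=(m, n_ens))
        A = Z - Z.mean(axis=1, keepdims=True)
        P = A @ A.T / (n_ens - 1)
        H = np.hstack([np.zeros((m, n_weights)), np.eye(m)])  # pick output rows
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + sigma**2 * np.eye(m))
        Z = Z + K @ (y - H @ Z)
        W = Z[:n_weights]                                # carry weights forward
    return W

# toy run: the "network" is a linear map and the true weights are all ones
rng = np.random.default_rng(5)
true_w = np.ones(3)
batches = []
for _ in range(20):
    x = rng.normal(size=3)
    batches.append((x, np.array([true_w @ x])))
forward = lambda w, x: np.array([w @ x])
W = train_enkf_lstm(batches, forward, n_weights=3, n_ens=100, sigma=0.1, rng=rng)
```

In the linear toy problem the ensemble mean of `W` is pulled toward the true weights after a few batches, while the remaining ensemble spread represents the weight uncertainty used for prediction.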
IV. Results
The outlier detection results are provided in Figure 4. Across the five events, we observe 37, 5, 39, 131, and 19 identified subevents, respectively. Of those, 16, 3, 16, 42, and 17 are verified as true subevents. We set the initial sigma value of the noise covariance matrix in the EnKF update step to 1.0 and then optimized it to 2.17, 2.15, 2.16, 2.018, and 0.19, respectively, using maximum likelihood estimation.
To further evaluate our model, we compared it with a Gaussian process (GP) and MC dropout [9]. The comparison results are provided in Table II. The GP model yielded the best recall in three of the five events, indicating that it captured most true subevents. On the other hand, it also misidentified many normal time windows as subevents, yielding many false positives and low precision. Compared to the GP model, our proposed algorithm reliably captured many true subevents and yielded the best precision in all five events, although it missed some true subevents and had worse recall in three of the five events. In terms of the F1 score, our proposed algorithm has the best performance in three of the five cases. The MC dropout model has the worst performance on this outlier detection task: since MC dropout is mathematically equivalent to variational inference, which underestimates the uncertainty, the model mislabels many normal time windows as outliers.
Event  Model  Precision  Recall  F1 Score 
2013 Boston Marathon  GP  17.1  45.8  24.9 
2013 Boston Marathon  EnKF LSTM  43.2  24.6  31.3 
2013 Boston Marathon  MC Dropout  10.9  30.8  16.1 
2013 Superbowl  GP  20.4  30.5  24.4 
2013 Superbowl  EnKF LSTM  60.0  10.7  18.2 
2013 Superbowl  MC Dropout  4.9  14.3  7.3 
2013 OSCAR  GP  18.2  37.8  24.6 
2013 OSCAR  EnKF LSTM  41.0  13.6  20.4 
2013 OSCAR  MC Dropout  8.8  34.1  14.0 
2013 NBA AllStar  GP  18.1  55.6  27.3 
2013 NBA AllStar  EnKF LSTM  32.1  63.6  42.7 
2013 NBA AllStar  MC Dropout  16.3  45.5  24.0 
Zimmerman Trial  GP  25.1  65.9  36.4 
Zimmerman Trial  EnKF LSTM  89.5  70.8  79.1 
Zimmerman Trial  MC Dropout  22.5  37.5  28.1 
For the proposed algorithm, the ensemble size and the initial sigma value of the noise covariance matrix are two important hyperparameters. To further evaluate their effects on performance, we provide a sensitivity analysis of these hyperparameters for the 2013 All-Star event. Based upon Figure 5, the algorithm yielded the best result with an ensemble size of 100 and varied only slightly with different sizes. According to Figure 6, the evaluation metrics peaked at an initial sigma of 0.05 and then decreased slightly for larger values.
IV-A. Discussion
In this work, we proposed a novel algorithm to estimate the posterior weights of LSTMs, and we further developed a framework for outlier detection. Based upon the proposed algorithm and framework, we tackled five real-world outlier detection tasks using Twitter streams. As shown in the section above, the proposed algorithm can capture the uncertainty of the nonlinear multivariate distribution. However, the performance of the model is affected by several hyperparameters, including the number of ensemble members, the batch size, the initial sigma value of the noise covariance matrix, the number of layers, and the number of nodes in each layer. The detection performance is further limited by the choice of window size and word representations. In future work, we will provide a more detailed analysis of the effects of these hyperparameters on model performance and fine-tune them with Bayesian optimization.
References
 [1] R. M. Neal, Bayesian Learning for Neural Networks. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 1996.
 [2] D. J. C. MacKay, “A practical bayesian framework for backpropagation networks,” Neural Comput., vol. 4, no. 3, pp. 448–472, May 1992. [Online]. Available: http://dx.doi.org/10.1162/neco.1992.4.3.448
 [3] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 1019–1027.

 [4] J. M. Hernández-Lobato and R. P. Adams, “Probabilistic backpropagation for scalable learning of Bayesian neural networks,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ser. ICML’15. JMLR.org, 2015, pp. 1861–1869. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045316
 [5] A. Graves, “Practical variational inference for neural networks,” in Proceedings of the 24th International Conference on Neural Information Processing Systems, ser. NIPS’11. USA: Curran Associates Inc., 2011, pp. 2348–2356.
 [6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ser. ICML’15. JMLR.org, 2015, pp. 1613–1622. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045290

 [7] G. E. Hinton and D. van Camp, “Keeping the neural networks simple by minimizing the description length of the weights,” in Proceedings of the Sixth Annual Conference on Computational Learning Theory, ser. COLT ’93. New York, NY, USA: ACM, 1993, pp. 5–13. [Online]. Available: http://doi.acm.org/10.1145/168304.168306
 [8] D. Soudry, I. Hubara, and R. Meir, “Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights,” in NIPS, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 963–971. [Online]. Available: http://dblp.uni-trier.de/db/conf/nips/nips2014.html#SoudryHM14
 [9] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ser. ICML’16. JMLR.org, 2016, pp. 1050–1059. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045390.3045502
 [10] I. Rivals and L. Personnaz, “A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models,” Neurocomputing, vol. 20, no. 1–3, pp. 279–294, 1998.

 [11] S. Singhal and L. Wu, “Training multilayer perceptrons with the extended Kalman algorithm,” in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, pp. 133–140.
 [12] E. A. Wan and R. van der Merwe, “The unscented Kalman filter for nonlinear estimation,” 2000, pp. 153–158.
 [13] S. J. Julier and J. K. Uhlmann, “Unscented filtering and nonlinear estimation,” in Proceedings of the IEEE, 2004, pp. 401–422.
 [14] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, 2014, pp. 338–342.
 [15] A. Karpathy and L. FeiFei, “Deep visualsemantic alignments for generating image descriptions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 664–676, Apr. 2017. [Online]. Available: https://doi.org/10.1109/TPAMI.2016.2598339
 [16] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, pp. 3104–3112. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969033.2969173
 [17] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
 [18] A. Graves, “Generating sequences with recurrent neural networks.” CoRR, 2014. [Online]. Available: https://arxiv.org/pdf/1308.0850.pdf
 [19] G. Evensen, “The ensemble Kalman filter: theoretical formulation and practical implementation,” Ocean Dynamics, vol. 53, pp. 343–367, 2003.
 [20] R. Warren, R. E. Smith, and A. K. Cybenko, “Use of mahalanobis distance for detecting outliers and outlier clusters in markedly nonnormal data: A vehicular traffic example,” Air Force Materiel Command, Tech. Rep., 2011.
 [21] D. Pohl, A. Bouchachia, and H. Hellwagner, “Automatic subevent detection in emergency management using social media,” in Proceedings of the 21st International Conference on World Wide Web, ser. WWW ’12 Companion. New York, NY, USA: ACM, 2012, pp. 683–686. [Online]. Available: http://doi.acm.org/10.1145/2187980.2188180
 [22] P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, and M. Vazirgiannis, “Degeneracybased realtime subevent detection in twitter stream.” in ICWSM, M. Cha, C. Mascolo, and C. Sandvig, Eds. AAAI Press, 2015, pp. 248–257.
 [23] D. Pohl, A. Bouchachia, and H. Hellwagner, “Social media for crisis management: Clustering approaches for subevent detection,” Multimedia Tools and Applications, 2013.
 [24] M. Baroni, G. Dinu, and G. Kruszewski, “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors,” in 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference, vol. 1, pp. 238–247, 2014.
 [25] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation.” in EMNLP, vol. 14, 2014, pp. 1532–1543.
 [26] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with lstm,” Neural Computation, vol. 12, pp. 2451–2471, 1999.