1 Background
1.1 Sentence representation
A sentence usually consists of a sequence of discrete words or tokens $s = (w_1, \dots, w_n)$, where each $w_i$ can be a one-hot vector whose dimension $|V|$ equals the number of unique tokens in the vocabulary. Pretrained distributed word embeddings, such as Word2vec [6] and GloVe [8], have been developed to transform $w_i$ into a lower-dimensional vector representation $x_i$, whose dimension $d$ is much smaller than $|V|$. Thus, a sentence can be encoded in a denser representation $X = (x_1, \dots, x_n)$. The encoding process can be written as $x_i = W_e w_i$, where $W_e \in \mathbb{R}^{d \times |V|}$ is the transformation matrix. In natural language processing, the majority of deep learning methods (e.g., RNNs and CNNs) take $X$ as the input and generate a compact vector representation for a sentence, $h = \mathrm{RNN}(x_1, \dots, x_n)$, where $\mathrm{RNN}$ indicates a recurrent model. These methods consider the semantic dependencies between each $x_i$ and its context, and hence $h$ is believed to summarize the semantic information of the entire sentence.

1.2 Self-attention
The attention mechanism [1, 9] has been proposed to compute an alignment score between elements from two vector representations. Specifically, given the vector representation $q$ of a query and a token sequence $(x_1, \dots, x_n)$, the attention mechanism computes an alignment score $f(q, x_i)$ between $q$ and each $x_i$.
Self-attention [7] is a special case of the attention mechanism, where the query $q$ is replaced with a token embedding $x_j$ from the input sequence itself. Self-attention is a method of encoding sequences of vectors by relating these vectors to each other based on pairwise similarities. It measures the dependency between each pair of tokens, $x_i$ and $x_j$, from the same input sequence: $a_{ij} = f_{\mathrm{att}}(x_i, x_j)$, where $f_{\mathrm{att}}$ indicates a self-attention implementation.
Self-attention is very expressive and flexible for both long-term and local dependencies, which were traditionally modeled by RNNs and CNNs, respectively. Moreover, the self-attention mechanism has fewer parameters and faster convergence than an RNN. Recently, a variety of NLP tasks have benefited from the self-attention mechanism.
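As an illustration, the pairwise-similarity computation described above can be sketched in a few lines of numpy. This is a minimal sketch only: the scaled dot-product score used here is one common choice of similarity function, and the dimensions are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Relate every token to every other token by pairwise similarity.

    X: (n, d) matrix of token embeddings x_1..x_n.
    Returns (n, d) context vectors and the (n, n) attention matrix A,
    where row i of A holds the normalized scores f(x_i, x_j).
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)   # pairwise alignment scores (scaled dot product)
    A = softmax(scores, axis=-1)    # each row is a distribution over tokens
    return A @ X, A

X = np.random.randn(5, 16)          # a toy 5-token "sentence"
ctx, A = self_attention(X)
```

Note how a single matrix product yields all $n^2$ pairwise scores at once, which is why self-attention captures long-range dependencies without the sequential computation of an RNN.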
1.3 Neural variational inference
Latent variable modeling is popular for many NLP tasks [5, 2]. It spreads hidden representations over a region (instead of a single point), making it possible to generate diversified data from the vector space or even to control the generated samples. It is nontrivial to carry out effective and efficient inference for complex and deep models. Training neural networks as powerful function approximators through backpropagation has given rise to promising frameworks for latent variable modeling [3, 4].

The modeling process builds a generative model and an inference model. A generative model constructs the joint distribution and captures the dependencies between variables. For a generative model with a latent variable $z$, $z$ can be seen as stochastic units in a deep neural network. We define the observed parent and child nodes of $z$ as $x$ and $y$ respectively. Hence the joint distribution of the generative model is:

$$p_\theta(x, y) = \int_z p_\theta(y \mid z)\, p_\theta(z \mid x)\, p(x)\, dz \quad (1)$$

where $\theta$ parameterizes the generative distributions $p_\theta(z \mid x)$ and $p_\theta(y \mid z)$. The variational lower bound is:
$$\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p_\theta(y \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\right) \le \log p_\theta(y \mid x) \quad (2)$$
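For completeness, the bound labeled (2) follows from Jensen's inequality; a brief sketch of the standard derivation, in the generic notation of this section:

```latex
\begin{align}
\log p_\theta(y \mid x)
  &= \log \int_z p_\theta(y \mid z)\, p_\theta(z \mid x)\, dz \\
  &= \log \mathbb{E}_{q_\phi(z \mid x, y)}
     \left[\frac{p_\theta(y \mid z)\, p_\theta(z \mid x)}{q_\phi(z \mid x, y)}\right] \\
  &\ge \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(y \mid z)\right]
     - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\right)
     = \mathcal{L}.
\end{align}
```

The inequality is tight exactly when $q_\phi(z \mid x, y)$ equals the true posterior, which motivates the requirement discussed next.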
In order to derive a tight lower bound, the variational distribution $q_\phi(z \mid x, y)$ should approach the true posterior distribution $p(z \mid x, y)$. A parameterized diagonal Gaussian $\mathcal{N}(z \mid \mu(x, y), \mathrm{diag}(\sigma^2(x, y)))$ is employed as $q_\phi(z \mid x, y)$.

The inference model derives the variational distribution that approaches the posterior distribution of the latent variables given the observed variables. The three steps to construct the inference model are:

Construct vector representations of the observed variables: $u = f_x(x)$, $v = f_y(y)$.

Assemble a joint representation: $\pi = g(u, v)$.

Parameterize the variational distribution over the latent variables: $\mu = l_1(\pi)$, $\log \sigma = l_2(\pi)$.
$f_x$ and $f_y$ can be any type of deep neural network suitable for the observed data; $g$ is an MLP that concatenates the vector representations of the conditioning variables; $l_1$ and $l_2$ are linear transformations which output the parameters of the Gaussian distribution. By sampling from the variational distribution $q_\phi(z \mid x, y)$, we are able to carry out stochastic backpropagation to optimize the lower bound.

During the training process, the generative model parameters $\theta$ together with the inference model parameters $\phi$ are updated by stochastic backpropagation based on samples $z$ drawn from $q_\phi(z \mid x, y)$. Let $M$ denote the total number of samples. The gradients with respect to $\theta$ have the form:
$$\nabla_\theta \mathcal{L} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_\theta \log p_\theta\left(y \mid z^{(m)}\right) \quad (3)$$
For the gradients with respect to the parameters $\phi$, we reparameterize $z^{(m)} = \mu + \sigma \odot \epsilon^{(m)}$ and sample $\epsilon^{(m)} \sim \mathcal{N}(0, I)$ to reduce the variance of the stochastic estimation. The update of $\phi$ can be carried out by backpropagating the gradients with respect to $\mu$ and $\sigma$:

$$\nabla_\mu \mathcal{L} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_{z^{(m)}} \log p_\theta\left(y \mid z^{(m)}\right) - \nabla_\mu D_{\mathrm{KL}}\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\right) \quad (4)$$

$$\nabla_\sigma \mathcal{L} \approx \frac{1}{2M} \sum_{m=1}^{M} \epsilon^{(m)} \nabla_{z^{(m)}} \log p_\theta\left(y \mid z^{(m)}\right) - \nabla_\sigma D_{\mathrm{KL}}\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\right) \quad (5)$$
It is worth mentioning that unsupervised learning is a special case of the neural variational framework in which $z$ has no parent node $x$. In that case $z$ is directly drawn from the prior $p(z)$ instead of the conditional distribution $p_\theta(z \mid x)$, and the variational distribution is $q_\phi(z \mid y)$.

Here we only discuss the scenario where the latent variables are continuous and a parameterized diagonal Gaussian is employed as the variational distribution. However, the framework is also suitable for discrete units; the only modification needed is to replace the Gaussian with a multinomial parameterized by the outputs of a softmax function. Though the reparameterization trick for continuous variables is not applicable in this case, a policy gradient approach (Mnih & Gregor, 2014) can help to alleviate the high-variance problem during stochastic estimation.
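The two ingredients above — reparameterized sampling and the closed-form KL between diagonal Gaussians — can be sketched as follows. This is a generic illustration in numpy; the networks that produce the Gaussian parameters are omitted, and the dimensions are arbitrary.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so that gradients
    can flow through mu and sigma (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """D_KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ),
    computed analytically, dimension by dimension."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

rng = np.random.default_rng(0)
mu, sigma = np.zeros(4), np.ones(4)
z = reparameterize(mu, sigma, rng)        # one Monte Carlo sample of z
kl = kl_diag_gaussians(mu, sigma, mu, sigma)
```

Because the KL term is analytic, only the expected log-likelihood in the lower bound needs Monte Carlo estimation, which is what keeps the variance of the gradient estimator manageable.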
2 Variational Self-attention Model
In this paper we propose a Variational Self-attention Model (VSAM) that employs variational inference to learn self-attention. In doing so, the model implements a stochastic self-attention mechanism instead of the conventional deterministic one, and obtains a more salient inner-sentence semantic relationship. The framework of the model is shown in Figure 1.
Suppose we have a sentence $s = (x_1, \dots, x_n)$, where $x_i$ is the pretrained word embedding of the $i$-th word and $n$ is the number of words in the sentence. We concatenate the word embeddings to form a matrix $X \in \mathbb{R}^{n \times d}$, where $d$ is the dimension of the word embeddings. We aim to learn the semantic dependencies between every pair of tokens through self-attention. Instead of using a deterministic self-attention vector, VSAM employs a latent distribution $p_\theta(z \mid X)$ to model the semantic dependencies, parameterized as a diagonal Gaussian $\mathcal{N}(z \mid \mu(X), \mathrm{diag}(\sigma^2(X)))$. The self-attention model then extracts an attention vector $a$ based on the stochastic vector $z$.
The diagonal Gaussian conditional distribution $p_\theta(z \mid X)$ can be calculated as follows:

$$\pi = g(X) \quad (6)$$
$$\mu = l_1(\pi) \quad (7)$$
$$\log \sigma = l_2(\pi) \quad (8)$$

where $g$ is an MLP and $l_1$, $l_2$ are linear transformations.
For each sentence embedding $X$, the neural network generates the corresponding parameters $\mu$ and $\sigma$ that parameterize the latent self-attention distribution over the entire sentence semantics.
The self-attention vector can then be derived as $a = \mathrm{softmax}(z)$. The final sentence vector representation is the sentence embedding matrix weighted by the self-attention vector: $c = X^\top a$, where $c \in \mathbb{R}^d$. For a downstream application with expected output $y$, the conditional probability distribution $p_\theta(y \mid X)$ can be modeled as $p_\theta(y \mid X) = \int_z p_\theta(y \mid z, X)\, p_\theta(z \mid X)\, dz$.

As for the inference network, we follow the neural variational inference framework and construct a deep neural network as the inference network. We use $X$ and $y$ to compute a joint representation $\pi_q = g_q(f_X(X), f_y(y))$. From the joint representation $\pi_q$, we then generate the parameters $\mu_q$ and $\sigma_q$, which parameterize the variational distribution $q_\phi(z \mid X, y)$ over the sentence semantics $z$:

$$\mu_q = l_3(\pi_q) \quad (9)$$
$$\log \sigma_q = l_4(\pi_q) \quad (10)$$
To emphasize: although both $p_\theta(z \mid X)$ and $q_\phi(z \mid X, y)$ are modeled as parameterized diagonal Gaussians, $q_\phi(z \mid X, y)$, being an approximation, only functions during inference by producing samples to compute the stochastic gradients, while $p_\theta(z \mid X)$ is the generative distribution that generates the samples for predicting $y$. To maximize the log-likelihood $\log p_\theta(y \mid X)$ we use the variational lower bound. Based on the samples $z \sim q_\phi(z \mid X, y)$, the variational lower bound can be derived as:
$$\mathcal{L} = \mathbb{E}_{q_\phi(z \mid X, y)}\left[\log p_\theta(y \mid z, X)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid X, y) \,\|\, p_\theta(z \mid X)\right) \quad (11)$$
The generative model parameters $\theta$ and the inference model parameters $\phi$ are updated jointly according to their stochastic gradients. Since both distributions are diagonal Gaussians, the KL divergence term $D_{\mathrm{KL}}(q_\phi(z \mid X, y) \,\|\, p_\theta(z \mid X))$ can be computed analytically during the training process.
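Putting the pieces together, the generative side of the model can be sketched as a numpy forward pass: a network produces $\mu$ and $\log \sigma$ from the sentence matrix, a latent vector is sampled by reparameterization, softmax-normalized into the attention vector, and used to weight the word embeddings. The single-hidden-layer parameterization and the layer sizes here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def vsam_forward(X, params, rng):
    """X: (n, d) word-embedding matrix for one sentence.
    Returns the sentence vector c (d,) and the sampled attention a (n,)."""
    W_h, W_mu, W_sig = params
    h = np.tanh(X @ W_h)                 # per-token hidden states, (n, k)
    mu = (h @ W_mu).squeeze(-1)          # mean of the latent attention, (n,)
    log_sigma = (h @ W_sig).squeeze(-1)  # log std of the latent attention, (n,)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(log_sigma) * eps     # reparameterized sample z ~ N(mu, sigma^2)
    a = softmax(z)                       # stochastic self-attention vector
    c = X.T @ a                          # attention-weighted sentence vector, (d,)
    return c, a

rng = np.random.default_rng(0)
n, d, k = 6, 8, 4                        # toy sizes: 6 tokens, 8-dim embeddings
X = rng.standard_normal((n, d))
params = (rng.standard_normal((d, k)),
          rng.standard_normal((k, 1)),
          rng.standard_normal((k, 1)))
c, a = vsam_forward(X, params, rng)
```

At training time the sample $z$ would come from the inference network's variational distribution instead, with the analytic KL term added to the objective, as in equation (11).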
3 Experiments
Table 1: Statistics of the FNC-1 dataset.

Stance     |        Training       |          Test
           | Number  | Percentage  | Number  | Percentage
agree      |  3,678  |  7.36       |  1,903  |  7.49
disagree   |    840  |  1.68       |    697  |  2.74
discuss    |  8,909  | 17.83       |  4,464  | 17.57
unrelated  | 36,545  | 73.13       | 18,349  | 72.20
Total      | 49,972  |             | 25,413  |
In this section, we describe our experimental setup. The task we address is to detect the stance of a piece of text towards a claim as one of four classes: agree, disagree, discuss, and unrelated [10]. Experiments are conducted on the official FNC-1 dataset (https://github.com/FakeNewsChallenge/fnc1). The dataset is split into training and test subsets; see Table 1 for statistics of the split. We report classification accuracy and the micro F1 metric on the test set for each type of stance.
Baselines for comparison include: (1) Average of Word2vec Embeddings refers to sentence embedding by averaging the Word2vec vector of each word. (2) CNN-based Sentence Embedding refers to sentence embedding obtained by feeding the Word2vec embedding of each word into a convolutional neural network. (3) RNN-based Sentence Embedding refers to sentence embedding obtained by feeding the Word2vec embedding of each word into a recurrent neural network. (4) Self-attention Sentence Embedding refers to sentence embedding calculated with deterministic self-attention, without variational inference.

Table 2: Stance detection performance on the FNC-1 test set.

Model                             |          Micro F1 (%)                  | Accuracy (%)
                                  | agree | disagree | discuss | unrelated |
Average of Word2vec Embeddings    | 12.43 |   1.30   |  43.32  |   74.24   |   45.53
CNN-based Sentence Embedding      | 24.54 |   5.06   |  53.24  |   79.53   |   81.72
RNN-based Sentence Embedding      | 24.42 |   5.42   |  69.05  |   65.34   |   78.70
Self-attention Sentence Embedding | 23.53 |   4.63   |  63.59  |   80.34   |   80.11
Our model                         | 28.53 |  10.43   |  65.43  |   82.43   |   83.54
Table 2 shows a comparison of detection performance. Our model achieves the highest overall performance (83.54%) on the FNC-1 test set. The averaging method can lose emphasis and keyword information in a claim; the CNN-based method can only capture local dependencies within the limits of its filter size; the RNN-based method captures semantic relationships only in a sequential manner. In contrast, the self-attention method is able to combine the embedding information of every pair of words, which enables more accurate semantic matching between the claim and the piece of text. Compared with deterministic self-attention, our method is a stochastic approach that is experimentally shown to better integrate the vector embeddings of the words.
4 Conclusion
We propose a Variational Self-attention Model (VSAM) that treats the self-attention vector as a random variable by imposing a probabilistic distribution on it. Compared with its conventional deterministic counterpart, the stochastic units incorporated by VSAM allow multi-modal attention distributions. Furthermore, by marginalizing over the latent variables, VSAM is more robust against overfitting, which is important for small datasets. Experiments on the stance detection task demonstrate the superiority of our method.
Acknowledgments
This project was funded by the EPSRC Fellowship titled "Task Based Information Retrieval", grant reference number EP/P024289/1.
References
 [1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
 [2] H. Bahuleyan, L. Mou, O. Vechtomova, and P. Poupart. Variational attention for sequence-to-sequence models. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1672–1682, 2018.
 [3] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. CoRR, abs/1703.03130, 2017.
 [4] R. Luo, W. Zhang, X. Xu, and J. Wang. A neural stochastic volatility model. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
 [5] Y. Miao, L. Yu, and P. Blunsom. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1727–1736, 2016.
 [6] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013.
 [7] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2249–2255, 2016.
 [8] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543, 2014.
 [9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010, 2017.
 [10] Q. Zhang, E. Yilmaz, and S. Liang. Rankingbased method for news stance detection. In Companion of the The Web Conference 2018 on The Web Conference 2018, pages 41–42. International World Wide Web Conferences Steering Committee, 2018.