Recurrent Neural Network-Based Semantic Variational Autoencoder for Sequence-to-Sequence Learning

02/09/2018, by Myeongjun Jang et al.

Sequence-to-sequence (Seq2seq) models have played an important role in the recent success of various natural language processing methods, such as machine translation, text summarization, and speech recognition. However, current Seq2seq models have trouble preserving global latent information from a long sequence of words. A variational autoencoder (VAE) alleviates this problem by learning a continuous semantic space of the input sentence, but it does not solve the problem completely. In this paper, we propose a new recurrent neural network (RNN)-based Seq2seq model, the RNN semantic variational autoencoder (RNN–SVAE), to better capture the global latent information of a sequence of words. To consider the words in a sentence equally, regardless of their position within the sentence, we construct a document information vector using the attention information between the final state of the encoder and every prior hidden state. We then combine this document information vector with the final hidden state of the bi-directional RNN encoder to construct the global latent vector, which becomes the output of the encoder part. The mean and standard deviation of the continuous semantic space are then learned to take advantage of the variational method. Experimental results on three natural language tasks (i.e., language modeling, missing word imputation, and paraphrase identification) confirm that the proposed RNN–SVAE yields higher performance than two benchmark models.


1 Introduction

Sequence-to-sequence (Seq2seq) models (Cho et al., 2014b; Sutskever et al., 2014), based on recurrent neural networks (RNN), show excellent capability for processing variable-length sequential data. In recent years, these structures have led to noteworthy developments in language modeling and have played an important role in various tasks of natural language processing (NLP), such as machine translation (Bahdanau et al., 2014; Cho et al., 2014b; Sutskever et al., 2014; Ling et al., 2015; Luong et al., 2015; Zhao & Zhang, 2016; Lee et al., 2016; Ha et al., 2016; Artetxe et al., 2017), machine comprehension (Hermann et al., 2015; Rajpurkar et al., 2016; Yuan et al., 2017), text summarization (Bahdanau et al., 2016; Chan et al., 2016; Nallapati et al., 2016), and speech recognition (Graves & Jaitly, 2014; Huang et al., 2016; Chan et al., 2016; Bahdanau et al., 2016).

Figure 1: Cosine similarity of sentence information vectors produced by the VAE.

The simplest Seq2seq structure is the RNN autoencoder (RNN–AE), which receives a sentence as input and returns the same sentence as output (Dai & Le, 2015). Because this model is an unsupervised method that does not require labeled data, it is very easy to obtain training data, so the RNN–AE can be applied to diverse tasks. It has been used to pre-train the parameters of a text classification model, achieving better performance than random parameter initialization (Dai & Le, 2015), and to generate long sentences (Li et al., 2015). Furthermore, it has been applied not only to text data but also to acoustic and video data, which also carry sequential information, for tasks such as novelty detection (Marchi et al., 2017; D'Avino et al., 2017) and representation learning of acoustic data (Amiriparian et al., 2017).

Although the RNN–AE shows good performance in many studies, it has limitations. First, because the compressed information of the input sentence is learned as a fixed vector (i.e., the encoder's final state), there is a high probability that the output will degrade even for small changes in the vector value. Second, it is not easy to find the global latent feature of the input sentence, owing to the structural property of the RNN, which performs a prediction at each step of the sequence (Bowman et al., 2015). Bowman et al. (2015) proposed the RNN variational autoencoder (RNN–VAE) model to resolve these issues by applying a variational inference technique (Kingma & Welling, 2013; Rezende et al., 2014) to the RNN–AE. The RNN–VAE moderates the high sensitivity of the RNN–AE to small changes in the final state vector values by learning the information of the input sentence as a probabilistic continuous vector instead of a fixed vector. The RNN–VAE has served as a better basic structure than the RNN–AE for various NLP tasks, including machine translation (Zhang et al., 2016) and text classification (Xu et al., 2017).

However, the RNN–VAE does not completely solve the problem of the RNN–AE, because it still fails to capture the global latent feature of the input sentence. In the RNN–VAE, the mean and standard deviation of the continuous space of the input sentence information are calculated from the final state of the encoder. Because this final state is updated at each step of the word sequence, it stores much more information about the last part of the input sentence (or the last and first parts, if a bidirectional RNN is used) than about the sentence as a whole. Therefore, the continuous space derived from the final encoder state is hardly a semantic space that preserves the global latent feature. Figure 1 shows an example of this RNN–VAE problem with three sentences, S1, S2, and S3. Although S1 and S2 are semantically similar, their syntactic structures are quite different. On the other hand, although S1 and S3 are semantically opposite, they share the same words ("enough lumbers") at the end of the sentence. The vector representation of each sentence is the average of vectors sampled from the continuous semantic space of the trained RNN–VAE. We sampled five times for each sentence to reduce the bias of the sampled vector and used cosine similarity as the similarity measure between two vector representations. Although S1 is more semantically similar to S2, the cosine similarity between S1 and S3 is higher than that between S1 and S2.

Figure 2: Structure of Seq2Seq AE model with Bi-directional structure.


In this paper, we present the RNN–SVAE to overcome the RNN–VAE's limitations by generating a document information vector that captures the global latent feature of the input sentence. The document information vector is a linear combination of the word vectors in the embedding space, whose weights are chosen so that the combination best represents the paragraph vector. This document information vector is combined with the final encoder state, and the RNN–SVAE is trained on the combined vector to find an appropriate continuous space for the input sentence. The effectiveness of the RNN–SVAE is verified by comparing its performance with that of the RNN–AE and RNN–VAE on three tasks: language modeling, missing word imputation, and paraphrase identification.

The rest of this paper is organized as follows. In Section 2, we briefly review past research on the autoencoder structure and the methodologies used in this study. In Section 3, we describe the architecture of the RNN–SVAE. In Section 4, the experimental settings of each task are described, followed by results and discussion. Finally, in Section 5, we conclude our work and suggest future research directions.

2 Background

2.1 RNN–AE

The AE, first introduced by Rumelhart et al. (1985), is a neural network-based unsupervised learning algorithm that has been employed for various tasks, including feature representation, anomaly detection, and transfer learning (Baldi, 2012; Bengio et al., 2013; Zhu et al., 2016; Sakurada & Yairi, 2014; Chen et al., 2017; Lyudchik, 2016; Zhuang et al., 2015; Deng et al., 2013). Input and output are the same in the AE structure; thus, the AE's learning objective is to approximate the output to the input as closely as possible. The preceding part, which compresses the information of the input vector into a latent vector, is called the encoder, and the following part, which reconstructs the output from the latent vector, is called the decoder.

Figure 3: Structure of RNN–VAE model with Bi-directional structure

The RNN–AE uses an RNN architecture for both the encoder and the decoder, as shown in Figure 2. The encoder compresses the information of a sequence of inputs (e.g., the words in a sentence), $\mathbf{x} = (x_1, \dots, x_T)$, into a fixed vector $\mathbf{v}$,

(1) $h_t = f(x_t, h_{t-1})$
(2) $\mathbf{v} = q(\{h_1, h_2, \dots, h_T\})$

where $x_t$ is the input word at step $t$, $h_t$ is the hidden state of the sequence at step $t$, and $f$ and $q$ are nonlinear functions. The decoder is trained to maximize the conditional probability of predicting the next word, $y_t$, given the fixed vector $\mathbf{v}$ and the previously predicted words $\{y_1, \dots, y_{t-1}\}$. Thus, the purpose of the decoder is to maximize the probability of generating the target sequence $\mathbf{y} = (y_1, \dots, y_T)$,

(3) $p(\mathbf{y}) = \prod_{t=1}^{T} p\big(y_t \mid \{y_1, \dots, y_{t-1}\}, \mathbf{v}\big)$

Because the objective is to precisely reconstruct the input, $\mathbf{y}$ is identical to $\mathbf{x}$ in the RNN–AE. The conditional probability of the RNN structure at time $t$ is defined as

(4) $p\big(y_t \mid \{y_1, \dots, y_{t-1}\}, \mathbf{v}\big) = g(s_t, y_{t-1}, \mathbf{v})$

where $s_t$ and $g$ denote the hidden state of the decoder at time $t$ and a nonlinear function, respectively.
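As a toy illustration of Eqs. (1)–(4) (not the authors' implementation; a single tanh recurrence stands in for the GRU used later, and the random embeddings and dimensions are hypothetical), the encoder below folds a word sequence into the fixed vector v, which then conditions each decoding step:

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_h, T = 100, 300, 6                 # word-vector size, hidden size, sentence length

W_x = rng.standard_normal((d_h, d_w)) * 0.01
W_h = rng.standard_normal((d_h, d_h)) * 0.01

def encode(word_vectors):
    """Eq. (1): h_t = f(x_t, h_{t-1}); Eq. (2): v = q({h_1..h_T}), taken here as h_T."""
    h = np.zeros(d_h)
    states = []
    for x in word_vectors:
        h = np.tanh(W_x @ x + W_h @ h)    # simple tanh recurrence as a stand-in for f
        states.append(h)
    return np.stack(states), states[-1]   # all hidden states and the fixed vector v

sentence = rng.standard_normal((T, d_w))  # stand-in word embeddings
hidden_states, v = encode(sentence)
print(hidden_states.shape, v.shape)
```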

2.2 RNN–VAE

The RNN–VAE is a generative model that improves the RNN–AE so that it can capture the global feature of the input sentence. The RNN–VAE replaces the deterministic encoding function of the RNN–AE with the posterior recognition model $q(z \mid \mathbf{x})$, which compresses the information of the input sentence $\mathbf{x}$ into a probabilistic distribution. The parameters $\mu$ and $\sigma$ that determine $q(z \mid \mathbf{x})$ are calculated as a linear transformation of the encoder output. Thus, the RNN–VAE learns the compressed information of the input sentence as a region of the latent space, rather than as a single point. The structure of the RNN–VAE model is shown in Figure 3.

If the RNN–VAE were trained only with the RNN–AE's reconstruction objective, it would encode the input sentence as an isolated point, that is, the variance of $q(z \mid \mathbf{x})$ would become very small (Bowman et al., 2015). To deal with this problem, in addition to the reconstruction objective, the RNN–VAE has another objective that pushes the posterior distribution $q(z \mid \mathbf{x})$ toward the prior distribution $p(z)$, which is generally a standard Gaussian distribution ($\mu = \mathbf{0}$, $\sigma = \mathbf{1}$). The Kullback–Leibler divergence (KLD) is used to compute the difference between the two distributions. Thus, the objective of the RNN–VAE is defined as

(5) $\mathcal{L}(\theta; \mathbf{x}) = -\mathrm{KL}\big(q_{\theta}(z \mid \mathbf{x}) \,\|\, p(z)\big) + \mathbb{E}_{q_{\theta}(z \mid \mathbf{x})}\big[\log p_{\theta}(\mathbf{x} \mid z)\big]$

where $\theta$ denotes the model parameters (i.e., $\mu$ and $\sigma$ of the Gaussian distribution) of the RNN–VAE. This objective allows the RNN–VAE to decode plausible output from every point of the continuous space that has high probability under the prior distribution.
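Under the standard Gaussian prior, the KLD term in Eq. (5) has a well-known closed form for a diagonal Gaussian posterior. The snippet below is an illustrative NumPy sketch (not the authors' code) that computes this term from the mean and log-variance vectors:

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - np.square(mu) - np.exp(log_var))

# toy example: a 4-dimensional latent posterior
mu = np.array([0.1, -0.2, 0.0, 0.3])
log_var = np.array([-0.1, 0.05, 0.0, -0.2])
print(gaussian_kl_to_standard_normal(mu, log_var))
```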

2.3 Paragraph Vector

The paragraph vector (Le & Mikolov, 2014) has been widely used to represent a paragraph containing an arbitrary number of words as a fixed low-dimensional continuous vector, overcoming the limitations of the bag-of-words (BoW) method. There are two main ways to learn the paragraph vector: the distributed memory (PV-DM) method and the distributed BoW (PV-DBOW) method. The PV-DM method, which considers the order of the word sequence, has a model structure similar to the continuous BoW (CBOW) of the Word2Vec model. This model takes the paragraph token vector $\mathbf{p}$ and the word vectors $\mathbf{w}_1, \dots, \mathbf{w}_{k-1}$ to predict the next word $w_k$ when the sliding window size is set to $k$. Thus, the paragraph vector is trained to maximize the probability of its appearance together with the words contained in the sliding window of the paragraph. In the PV-DBOW method, the words included in a fixed window are arbitrarily sampled from those constituting the paragraph; this model takes the paragraph vector as input and predicts the sampled words, so it does not consider the order of the paragraph's word sequence. Both methods define the probability that a paragraph token and a word token appear together using the dot product between the vectors of each token. Therefore, the paragraph vector is located close to the word vectors of its own paragraph in the semantic embedding space.
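As a concrete illustration, both variants are available in the gensim library; the snippet below is a minimal sketch (a tiny toy corpus with hypothetical hyperparameters, assuming the gensim 4.x API) of training PV-DM and PV-DBOW paragraph vectors:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: each paragraph is a TaggedDocument with a unique tag
corpus = [
    TaggedDocument(words="they did not have enough lumber".split(), tags=["S1"]),
    TaggedDocument(words="the supply of wood was insufficient".split(), tags=["S2"]),
    TaggedDocument(words="they had enough lumber".split(), tags=["S3"]),
]

# dm=1 -> PV-DM (order-aware); dm=0 -> PV-DBOW (order-agnostic)
pv_dm = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, dm=1, epochs=50)
pv_dbow = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, dm=0, epochs=50)

# the learned paragraph vector lies close to the vectors of its own words
print(pv_dm.dv["S1"][:5])
```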

Figure 4: Structure of the RNN–SVAE model

2.4 Attention Mechanism

The attention mechanism (Luong et al., 2015; Bahdanau et al., 2014), recently recognized for its effectiveness, is widely used for image captioning (Xu et al., 2015), tree parsing (Vinyals et al., 2015), question answering (Hermann et al., 2015), and machine translation. The main problem of the vanilla RNN Seq2seq model is that it hardly preserves the information of the words at the front of the sentence when the input sentence becomes long, because it uses only the last hidden state of the encoder. Although the long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) and gated recurrent unit (GRU; Cho et al., 2014a) cells tend to alleviate this problem, they still have trouble preserving well-balanced semantic information of a sentence regardless of where a word appears in the sequence. An attention mechanism addresses this problem by using, at each decoding step, a weighted combination of all encoder hidden states (i.e., a context vector) together with the last hidden state. The weights of the context vector can be regarded as the importance of each input word at the corresponding step.
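The sketch below (an illustrative NumPy example, not taken from the paper) shows the basic computation of such a context vector: dot-product scores between a decoder query state and all encoder hidden states, a softmax over those scores, and a weighted sum of the hidden states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(encoder_states, query_state):
    """encoder_states: (T, d) hidden states; query_state: (d,) decoder state.
    Returns the context vector (d,) and the attention weights (T,)."""
    scores = encoder_states @ query_state          # dot-product alignment scores
    weights = softmax(scores)                      # importance of each input word
    context = weights @ encoder_states             # weighted sum of hidden states
    return context, weights

T, d = 6, 300                                      # 6 words, 300-dim hidden states
H = np.random.randn(T, d)
s = np.random.randn(d)
context, weights = attention_context(H, s)
print(weights.round(3), context.shape)
```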

3 Model Structure

In this study, we propose the RNN semantic variational autoencoder (RNN–SVAE), which represents the global latent feature of an input sentence better than the RNN–VAE. As shown in Figure 4, the RNN–SVAE integrates the final hidden state and a document information vector, computed from attention-style weights over the bi-directional RNN (bi-RNN) hidden states, before estimating the parameters of the Gaussian distribution. Because every word in the input sentence is considered equally in the document information vector, the RNN–SVAE can preserve the global latent feature better than the RNN–VAE, whose information is highly skewed toward the last words of the sentence. Additionally, because the document information vector is computed by aggregating these weights, no separate training is required to learn it; it is obtained simultaneously with the hidden states during RNN training.

3.1 Document Information Vector

For both the PV-DM and PV-DBOW methods, a paragraph vector is placed near the word vectors constituting it, because a $d$-dimensional paragraph vector is trained to maximize the dot product with the $d$-dimensional word vectors in the paragraph. This implies that a linear combination of $d$ linearly independent word vectors can accurately reconstruct the paragraph vector. Furthermore, because the paragraph vector has a high similarity to the vectors of the words constituting the paragraph, it is possible to approximate the paragraph vector using the embedding vectors of its words, as follows.

(6) $\mathbf{p} \approx \sum_{t=1}^{T} \alpha_t \mathbf{w}_t$

where $\mathbf{w}_t$ and $\alpha_t$ denote the word vector and its linear combination weight, respectively, and $\mathbf{p}$ is the paragraph vector.

Whereas PV-DM and PV-DBOW explicitly learn the paragraph vector during model training, the proposed document information vector is computed implicitly using information obtained during Seq2seq model training. Because the last hidden state of the encoder, $h_T$, is a vector containing the sequential information of the input sentence, we compute the weight $\alpha_t$ using the relationship between $h_T$ and the hidden state $h_t$. Many past studies used the dot product as a similarity measure (Karpathy et al., 2014; Karpathy & Fei-Fei, 2015). We instead use the normalized value of the dot product between $h_T$ and $h_t$ as the weight $\alpha_t$,

(7) $\alpha_t = \dfrac{h_T \cdot h_t}{\sum_{k=1}^{T} h_T \cdot h_k}$

It is possible to use many other alignment models, such as those proposed by Luong et al. (2015) or Bahdanau et al. (2014). However, we used a simple normalized dot product to focus on the effectiveness of the document information vector itself.

Using the standard uni-directional RNN structure tends to give larger weights to the words at the end of the input sentence: the closer $t$ is to $T$, the more similar $h_t$ is to $h_T$. To solve this problem, we use a bi-RNN (Schuster & Paliwal, 1997) and take the average of the forward weight $\alpha_t^{f}$ and the backward weight $\alpha_t^{b}$,

(8) $\alpha_t = \dfrac{1}{2}\left(\alpha_t^{f} + \alpha_t^{b}\right)$

where $\alpha_t^{f}$ and $\alpha_t^{b}$ are computed from the forward hidden states $\overrightarrow{h}_t$ and the backward hidden states $\overleftarrow{h}_t$ at the $t$-th word, respectively. Finally, we compute the document information vector by combining the total weight $\alpha_t$ and the word sequence $\mathbf{w} = (w_1, \dots, w_T)$ of the input sentence,

(9) $\mathbf{d} = \sum_{t=1}^{T} \alpha_t \mathbf{w}_t$

where $\mathbf{d}$ is the document information vector of the input sentence. Contrary to the paragraph vector, which must be trained separately from the RNN model, the proposed document information vector can be computed directly from the learned parameters of the RNN model. Hence, unlike the paragraph vector, it is not necessary to learn a separate sentence vector for each new sentence.
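A compact NumPy sketch of Eqs. (7)–(9) is given below. It is illustrative only: the GRU states and embeddings are replaced by random arrays, and the plain sum-normalization of the dot products is one reading of the normalization described above:

```python
import numpy as np

def norm_dot_weights(states, final_state):
    """Dot product between a direction's final state and every hidden state,
    normalized to sum to one (one reading of Eq. 7)."""
    scores = states @ final_state
    return scores / scores.sum()

T, d_h, d_w = 8, 300, 100
fwd = np.random.randn(T, d_h)            # forward GRU hidden states (word order)
bwd = np.random.randn(T, d_h)            # backward GRU hidden states (word order)
embeddings = np.random.randn(T, d_w)     # pre-trained word vectors of the sentence

# with trained GRU states the scores reflect similarity to the final state;
# random toy states only demonstrate the shapes involved
alpha_f = norm_dot_weights(fwd, fwd[-1])   # forward pass ends at the last word
alpha_b = norm_dot_weights(bwd, bwd[0])    # backward pass ends at the first word
alpha = 0.5 * (alpha_f + alpha_b)          # Eq. (8)

doc_vector = alpha @ embeddings            # Eq. (9): weighted sum of word embeddings
print(alpha.round(3), doc_vector.shape)
```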

News Crawl’13 News Crawl’14 News Crawl’15 News Crawl’16 TED Talk
Train 2,500,000 - - - -
Test 15,000 15,000 15,000 15,000 15,000
Table 1: Number of sentences for each data set

3.2 RNN–SVAE

The structure of the RNN–SVAE model is created by adding the document information vector to the RNN–VAE model. The overall structure of the proposed model is summarized in Figure 4. We construct the final state of the encoder by concatenating the forward final state $\overrightarrow{h}_T$, the backward final state $\overleftarrow{h}_1$, and the document information vector $\mathbf{d}$ as follows,

(10) $h_{enc} = \big[\overrightarrow{h}_T \,;\, \overleftarrow{h}_1 \,;\, \mathbf{d}\big]$

Next, the mean vector $\mu$ and the standard deviation vector $\sigma$ of the continuous semantic space are calculated from the encoder's last state $h_{enc}$ via linear transformations. These vectors have the same dimension as the global latent vector $\mathbf{z}$. Finally, we sample the global latent vector, which serves as the semantic vector of the input sentence and is used as the input vector to the decoder, from the continuous semantic space.

(11) $\mu = W_{\mu}\, h_{enc} + b_{\mu}$
(12) $\sigma = W_{\sigma}\, h_{enc} + b_{\sigma}$

where $W_{\mu}$ and $b_{\mu}$ are the weight and bias for $\mu$, respectively, whereas $W_{\sigma}$ and $b_{\sigma}$ are the weight and bias for $\sigma$, respectively.

Similar to the RNN–VAE, the RNN–SVAE's cost function reflects the two objectives shown in Eq. (5). The first objective is to closely approximate the posterior distribution $q(z \mid \mathbf{x})$, with parameters $\mu$ and $\sigma$, to the prior distribution $p(z)$, which is the standard Gaussian distribution; to do so, the KLD between the posterior and the standard Gaussian distribution should be minimized. The second objective is to maximize the conditional probability $p(\mathbf{y} \mid \mathbf{z})$, as in the general Seq2seq model, where $\mathbf{w}$ is the word sequence of the input sentence and $\mathbf{y}$ is the output sequence. Because the RNN–SVAE model is an autoencoder structure, $\mathbf{w}$ is identical to $\mathbf{y}$.
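The following NumPy sketch is illustrative only (hypothetical random inputs, with dimensions matching the 300-unit/100-dimensional settings reported later). It shows how the concatenated encoder state of Eq. (10) can be mapped to the Gaussian parameters of Eqs. (11)–(12) and how the global latent vector z is then sampled via the usual reparameterization trick:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_w, d_z = 300, 100, 300

# inputs assumed to come from the bi-GRU encoder and Eq. (9)
h_fwd_final = rng.standard_normal(d_h)
h_bwd_final = rng.standard_normal(d_h)
doc_vector = rng.standard_normal(d_w)

# Eq. (10): concatenate forward final state, backward final state, and document vector
h_enc = np.concatenate([h_fwd_final, h_bwd_final, doc_vector])

# Eqs. (11)-(12): linear transformations to the Gaussian parameters
W_mu = rng.standard_normal((d_z, h_enc.size)) * 0.01
b_mu = np.zeros(d_z)
W_sigma = rng.standard_normal((d_z, h_enc.size)) * 0.01
b_sigma = np.zeros(d_z)

mu = W_mu @ h_enc + b_mu
sigma = np.abs(W_sigma @ h_enc + b_sigma)   # kept positive here; a softplus/exp is common in practice

# reparameterization: z = mu + sigma * eps, averaged over 5 samples as in the experiments
eps = rng.standard_normal((5, d_z))
z = (mu + sigma * eps).mean(axis=0)
print(z.shape)
```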

4 Experiments

In our experiments, we verified the RNN–SVAE on three tasks: language modeling, missing word imputation, and paraphrase identification. The RNN–AE and RNN–VAE were used as baseline models. Evaluation and comparison were conducted both quantitatively, with standard evaluation metrics, and qualitatively, by examining output examples of the three models.

4.1 Language Modeling

To evaluate the fundamental ability of RNN-SVAE as an autoencoder, language modeling was tested first.

4.1.1 Data Set and Preprocessing

In this study, we used the News Crawl data of the WMT'17 English monolingual corpus (http://www.statmt.org/wmt17/translation-task.html) and the TED Talk data of WIT3 (https://wit3.fbk.eu/) (Cettolo et al., 2012). The News Crawl'13 dataset was used to train the RNN models, whereas all datasets were used to test the models. For the language modeling task, it is common to exclude very long sentences (i.e., longer than 30 to 50 words) to accelerate training (Bahdanau et al., 2014; Artetxe et al., 2017); therefore, we only used sentences shorter than 40 words for computational efficiency. For the training dataset, we randomly sampled 2,500,000 sentences from the News Crawl'13 dataset. As test datasets, the News Crawl'13, News Crawl'14, News Crawl'15, and News Crawl'16 datasets of the WMT'17 English monolingual corpus and the TED Talk dataset were used; for each, 15,000 sentences were randomly sampled. For the News Crawl'13 test data, sentences included in the training dataset were excluded during sampling. The numbers of training and test sentences in each dataset are summarized in Table 1.

Prior to training the model, we performed tokenization (using RegexpTokenizer from the NLTK package, http://www.nltk.org/api/nltk.html) after removing punctuation marks and converting uppercase letters to lowercase for all sentences. Following tokenization, we pre-trained the word vectors using the skip-gram model (Mikolov et al., 2013). Words that appeared fewer than seven times in the training dataset were replaced with the "UNK" (unknown word) token. We set the dimension of the word vectors to 100, the window size of the skip-gram model to 5, and the negative sampling parameter to 5. Word vector training was run for 10 epochs. Including the "UNK" and "EOS" (end-of-sentence) tokens, 91,897 unique words were trained.
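A minimal sketch of this preprocessing pipeline is shown below, assuming gensim for the skip-gram model and the NLTK tokenizer mentioned above; the parameter names follow the current gensim API and are our mapping of the reported settings:

```python
from nltk.tokenize import RegexpTokenizer
from gensim.models import Word2Vec

tokenizer = RegexpTokenizer(r"\w+")              # drops punctuation
raw_sentences = [
    "The authorities arrested two people.",
    "What is the name of the pension plan?",
]
tokenized = [tokenizer.tokenize(s.lower()) for s in raw_sentences]

# skip-gram (sg=1), 100-dim vectors, window 5, 5 negative samples, 10 epochs;
# min_count=7 would stand in for replacing rare words with "UNK" on the full corpus
w2v = Word2Vec(
    tokenized, vector_size=100, window=5, sg=1,
    negative=5, min_count=1, epochs=10,          # min_count=1 only because this toy corpus is tiny
)
print(w2v.wv["the"].shape)
```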

Model / Data News Crawl'13 News Crawl'14 News Crawl'15 News Crawl'16 TED Talk
RNN–AE 17.55 20.21 19.98 20.55 24.84
RNN–VAE 37.51 38.62 39.41 38.98 45.49
RNN–SVAE 41.68 41.89 43.33 43.07 49.32
Table 2: BLEU score of each model for the language modeling task
(a) News Crawl’13 (b) News Crawl’14 (c) News Crawl’15
(d) News Crawl’16 (e) TED Talk
Figure 5: BLEU scores for each model according to the sentence length.

4.1.2 Model Training and Inference

Because the RNN–SVAE model is rooted in the vanilla RNN, any type of RNN cell (e.g., basic RNN cell, LSTM cell, GRU cell) can be used (Cho et al., 2014a). We used the GRU cell, which alleviates the vanishing gradient problem of the basic RNN cell and has fewer parameters than the LSTM cell. The RNN–SVAE encoder has a bi-directional RNN structure. The forward and backward RNNs of the encoder each consist of 300 hidden units. The global latent vector and the hidden states of the decoder also consist of 300 units.

For a fair comparison, the baseline models were designed with the same structure as the RNN–SVAE model. The RNN–VAE model also used the GRU cell and had a bi-RNN structure with 300 hidden units; its global latent vector and decoder hidden states likewise consisted of 300 units. The RNN–AE model also used the GRU cell with a bi-RNN structure, with encoder and decoder of 300 hidden units each, as in the RNN–VAE and RNN–SVAE models.

The three models were all trained under the same condition. We initialized their parameters using the Xavier initialization (Glorot & Bengio, 2010). We used the Adam optimizer (Kingma & Ba, 2014) for training. We trained the models for 30 epochs. Gradient computation and weight update were done with the mini-batch size of 512. The learning rate was set to 0.001 for the first 10 epochs and to 0.0001 for the remaining 20 epochs.

After model training, beam search was used to obtain the output maximizing conditional probability at the inference phase. We set the beam size to 7 and the maximum length of output to 40. For generative models, such as RNN–VAE and RNN–SVAE, an average of five samples was used as the input vector of the decoder to reduce the bias of sampled global latent vector.

4.1.3 Results

As a performance measure for language modeling, we used the BLEU score, which is commonly used for machine translation (Papineni et al., 2002). Although other "teacher forcing" metrics exist, such as negative log-likelihood and perplexity, they are insufficient for evaluating whether the semantic space or vector (i.e., the output of the encoder) reflects the global latent feature of the input sentence, because the target token at every time step is provided under teacher forcing. We therefore used the BLEU score, for which target tokens are not provided at each decoding step. The results of each model are summarized in Table 2. For all five test datasets, the proposed RNN–SVAE significantly outperformed the benchmark models. The BLEU scores of the RNN–SVAE were almost twice those of the RNN–AE. Compared to the RNN–VAE, the RNN–SVAE improved the BLEU score by at least 3.27 (News Crawl'14) and at most 4.17 (News Crawl'13). The relative BLEU improvements of the RNN–SVAE against the RNN–VAE were between 8.42% and 11.12%.
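For reference, a smoothed sentence-level BLEU of the kind described here can be computed with NLTK as follows. This is a small illustrative sketch; the exact smoothing method used by the authors is not specified, so method1 below is an assumption:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "what is the name of the pension plan".split()
hypothesis = "what is the name and the name expires".split()

smooth = SmoothingFunction().method1   # avoids zero scores when an n-gram order has no match
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(round(score, 4))
```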

Table 3 shows examples of the language modeling task. (We used the smoothing function in the NLTK package because the BLEU score was too optimistic.) Examples from the RNN–AE are not given because its language modeling performance was significantly worse than that of the RNN–VAE and RNN–SVAE. Highlighted parts (in gray) show the words exactly matching the ground truth. In the case of the RNN–VAE, both the beginning and the end of the sentences fit well with the ground truth, but it seemed to have difficulty generating the middle parts of the sentences correctly. In contrast, the RNN–SVAE succeeded in generating the entire sentences.

Whereas the word sequences generated by the RNN–SVAE are not always the same as the ground truth, the chosen words are semantically very similar to those in the ground truth, as shown in Table 4. When the word "nodes" is replaced by "fibres" in the first example sentence, it is still comprehensible and does not undermine the meaning of the original sentence. Similarly, "Norwich" and "Southampton" are both names of cities in England in the second example, and "frost" and "snow" are semantically very similar words in the third example.

Type Sentence
Truth What is the name of the pension plan
RNN–VAE What is the name and the name expires
RNN–SVAE What is the name of the pension plan
Truth The grandmother of three who whishes to remain anonymous said the experience was so traumatic she will never be able to eat popcorn again
RNN–VAE The grandmother of the kitchens of how serving ladies advised were the added experience so you so fully flowers to likely to eatnever desire
RNN–SVAE The grandmother of three who whish to remain anonymous said the experience was so traumatic she will never be able to eat before pets again
Truth The authorities arrested two people but failed to investigate reports that they were part of a large private militia
RNN–VAE The authorities raped some people but failed to investigate reports that they were part of a large private militia
RNN–SVAE The authorities arrested two people but failed to investigate reports that they were part of a large private Kuwaiti
Table 3: Examples of language modeling outputs generated by RNN–VAE and RNN–SVAE
Type Sentence
Truth These nodes range from opening and closing tags to character data and processing instructions
RNN–SVAE These fibres range from opening and closing tags to character data and processing instructions
Truth From there Jennings took the controls and flew to Norwich
RNN–SVAE From there Jennings took the controls and flew to Southampton
Truth Kevin Walker head of science at the BSBI said the trend was down to the mild winter and a lack of frost
RNN–SVAE Kevin Walker head of science at the UNK said the trend was down to the mild winter and a lack of snow
Table 4: Examples of incorrect words that have high similarity with the ground truth

Figure 5 shows the average BLEU score of each model according to the sentence length for each dataset. When the sentence length was relatively short, sentence generation performance was similar across models; even the RNN–AE worked well for very short sentences (i.e., fewer than five words), and there was no significant difference between the RNN–VAE and RNN–SVAE. For moderate sentence lengths, the RNN–AE tended to fail to generate the original sentence; its BLEU scores were much lower than those of the other two methods. Comparing the RNN–VAE and RNN–SVAE, the RNN–SVAE worked better in most cases. For long sentences, all three models had trouble generating the original sentence; this remains an open research topic in the field of machine translation.

4.2 Missing Word Imputation

Missing word imputation is the process of completing a sentence by filling in appropriate words (Mani, 2015). We performed this task to evaluate how well the proposed RNN–SVAE reflects the global latent feature of the input sentences. In this task, an incomplete sentence with some words erased was provided as input to the encoder of the Seq2seq models, and the models were trained to correctly guess the erased word or sequence of words through the decoder. We tested missing word imputation performance under three scenarios, described below.

  • Scenario 1: Imputation for the last word of the sentence.

  • Scenario 2: Imputation for one randomly selected word among the last 20% of the sentence.

  • Scenario 3: Imputation for the sequence of words corresponding to the last 20% of the sentence.

Scenario 1 was the easiest and Scenario 3 the most difficult. Scenarios 1 and 2 can be regarded as multi-class classification tasks, whereas Scenario 3 can be regarded as a sequence generation task.

Model Scenario 1 (Accuracy) Scenario 2 (Accuracy) Scenario 3 (BLEU)
RNN–AE 15.71 5.76 34.08
RNN–VAE 16.94 6.23 34.17
RNN–SVAE 15.05 6.37 34.37
Table 5: Performance of missing word imputation task.
Q: Inventories increased across divisions, but were compensated by advance payments received and a better operational  .
Truth RNN–AE RNN–VAE RNN–SVAE
“performance” “performance” “performance” “service”

Q: Click here for instructions on how to enable javascript in your  .
Truth RNN–AE RNN–VAE RNN–SVAE
Level 1 “browser” “browser” “browser” “system”
Q: David Rhodes and Robert Hendricks (Montreal process technical advisory group tac) described tac’s work on a framework of criteria and indicators that provide a common of   management of temperate and boreal forests.
Truth RNN–AE RNN–VAE RNN–SVAE
“sustainable” “the” “the” “sustainable”

Q: Such distinctive homes can attract interest from far beyond your   market.
Truth RNN–AE RNN–VAE RNN–SVAE
“local” “local” “local” “own”

Q: The list is sorted by country so you shouldn’t have a problem to find a   near you.
Truth RNN–AE RNN–VAE RNN–SVAE
Level 2 “vendor” “destination” “few” “hotel”

Q: If you have text in any page of your site that contain any of the keywords below, you can add your contextual listing there. It’s free and your listing will appear online in  .
Truth RNN–AE RNN–VAE RNN–SVAE
“real time containing hyperlink to your page” “real time http www ‘UNK’ com au account” “real time containing your account to your account” “real time containing hyperlink to your page”

Q: Energy star is a registered trademark of the US environmental  .
Truth RNN–AE RNN–VAE RNN–SVAE
“protection agency” “protection agency” “protection agency” “insurance program”

Q: Encrypt within the veritas net backup policy eliminating a seperate process or an extra dedicated  .
Truth RNN–AE RNN–VAE RNN–SVAE

Level 3
“device to manage” “to the application” “to the enviroment” “device to manage”
Table 6: Examples of missing word imputation.

4.2.1 Data Set

For model training, the training dataset used in the language modeling task (i.e., the 2,500,000 sentences randomly sampled from the News Crawl'13 dataset) was modified. Likewise, we modified the News Crawl'13 test dataset and used it to evaluate performance. For Scenarios 1 and 3, we erased the last word and the last 20% of the word sequence from each sentence, respectively. For Scenario 2, we erased one randomly selected word among the last 20% of the sentence.

Figure 6: Structure of paraphrase identification model

4.2.2 Model Training and Inference

The three imputation models were trained under the same conditions. We used Xavier initialization for the parameters and the Adam optimizer. The models were trained for 15 epochs with a learning rate of 0.001 for the first five epochs and 0.0001 for the remaining 10 epochs. Gradient computation and weight updates were done with a mini-batch size of 512. As in the language modeling task, the outputs of the RNN–VAE and RNN–SVAE were decoded from the mean vector of five sampled values to reduce the bias of the global latent vector.

4.2.3 Results

As a quantitative evaluation metric, simple accuracy (i.e., the proportion of correctly predicted words among all missing words) was used for Scenarios 1 and 2 (predicting a single word), whereas the BLEU score was used for Scenario 3 (predicting a sequence of words).

Table 5 shows the performance of each model for missing word imputation. For Scenario 1, the RNN–VAE yielded the highest accuracy, whereas the RNN–SVAE resulted in the lowest. Because imputation of the last word requires information about the end of the sentence more than global information about the whole sentence, the RNN–VAE and RNN–AE, which preserve more information about the end of sentences, performed well. Although the RNN–SVAE showed the worst accuracy here, we found that its imputation results were semantically quite similar to the target word in many examples, as shown in Table 6.

For the more difficult tasks, Scenarios 2 and 3, the RNN–SVAE outperformed the other methods. As shown in Table 6, not only did the RNN–SVAE achieve a higher accuracy or BLEU score, it also predicted words semantically similar to the correct answers.

4.3 Paraphrase Identification

Paraphrase identification is a task that determines whether two different sentences have the same meaning (Rus et al., 2008; Hu et al., 2014). In this study, we constructed a binary classification model that determines whether two sentences are paraphrases, taking as input the global latent vectors (or the last hidden states, for the RNN–AE) of the two sentences, as shown in Figure 6. As in the previous tasks, the mean of five sampled vectors is used as input to the paraphrase identification model for the RNN–VAE and RNN–SVAE. The model is a feed-forward multi-layer perceptron with two hidden layers; the numbers of hidden units in the first and second layers are set to 100 and 50, respectively.
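A minimal PyTorch sketch of such a classifier is shown below. The way the two 300-dimensional sentence vectors are combined (here, simple concatenation into a 600-dimensional input) and the use of ReLU activations are our assumptions for illustration; the hidden sizes, dropout rate, and optimizer settings follow the values reported in this section and Section 4.3.2:

```python
import torch
import torch.nn as nn

class ParaphraseClassifier(nn.Module):
    """Two-hidden-layer MLP (100 and 50 units) over a pair of sentence vectors."""
    def __init__(self, latent_dim=300, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, 100), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(100, 50), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(50, 2),                       # paraphrase vs. non-paraphrase
        )

    def forward(self, z1, z2):
        # concatenation of the two sentence vectors is an illustrative choice
        return self.net(torch.cat([z1, z2], dim=-1))

model = ParaphraseClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# toy batch: latent vectors of two sentences and binary labels
z1, z2 = torch.randn(8, 300), torch.randn(8, 300)
labels = torch.randint(0, 2, (8,))
loss = criterion(model(z1, z2), labels)
loss.backward()
optimizer.step()
```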

Model Error rate (1 − Accuracy) False alarm rate (1 − Precision) Miss rate (1 − Recall)
RNN–AE 5.10 ± 0.56 5.09 ± 0.80 5.59 ± 0.92
RNN–VAE 6.05 ± 0.43 5.64 ± 0.78 6.02 ± 0.88
RNN–SVAE 4.65 ± 0.33 3.68 ± 0.65 5.86 ± 0.71
Table 7: Result of paraphrase identification
Model RNN-VAE RNN–SVAE
Cosine similarity 0.598 0.631
Table 8: Result of paraphrase sentence similarity

4.3.1 Data Set

We used the MS Paraphrase Corpus dataset (Dolan et al., 2004; Quirk et al., 2004) to perform the paraphrase identification task. This dataset consists of 5,801 pairs of sentences with 4,076 pairs for training and 1,725 pairs for test. The training dataset consists of 2,753 “equivalent” sentence pairs and 1,323 “not equivalent” sentence pairs, as judged by human raters. The test set consists of 1,147 and 578 “equivalent” and “not equivalent” sentence pairs, respectively.

Dolan et al. (2004) noted that, although the collected paraphrase sentences were judged “not equivalent” by the human raters, it was not desirable to use “not equivalent” sentence pairs as negative class data, because they have significant overlaps between them, in terms of information content and wording. Therefore, we used "equivalent" sentences of the MS Paraphrase Corpus as the positive class dataset and modified one side of the sentence pair to use as the non-paraphrase dataset. The non-paraphrase dataset is generated by replacing 20% of randomly selected words in a paired sentence with other words in the pre-trained word vector dictionary used in language modeling and missing word imputation tasks. For the training data, we used 2,753 pairs of sentences as the positive class and generated 2,753 pairs of negative class sentences by using the method described above. Similarly, 1,147 pairs of sentences for the test were used as the positive class, and 1,147 pairs of negative class sentences were generated for the test data. Thus, a total of 5,506 training pairs and 2,294 test pairs were constructed. The ratio of paraphrase pairs to non-paraphrase pairs was the same.
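As an illustration of this corruption procedure (a sketch under our assumptions; the original word-sampling details are not fully specified), one side of a pair can be perturbed as follows:

```python
import random

def make_non_paraphrase(sentence_tokens, vocabulary, ratio=0.2, seed=0):
    """Replace `ratio` of randomly selected tokens with random words from the
    pre-trained word-vector vocabulary to create a negative (non-paraphrase) example."""
    rng = random.Random(seed)
    tokens = list(sentence_tokens)
    n_replace = max(1, int(ratio * len(tokens)))
    for idx in rng.sample(range(len(tokens)), n_replace):
        tokens[idx] = rng.choice(vocabulary)
    return tokens

vocab = ["market", "forest", "agency", "browser", "device", "winter"]
original = "such distinctive homes can attract interest from buyers".split()
print(make_non_paraphrase(original, vocab))
```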

4.3.2 Training Details

The paraphrase identification models for the RNN–AE, RNN–VAE, and RNN–SVAE were trained under the same conditions. The parameters of all models were initialized using Xavier initialization. Gradient computation and weight updates were done with a mini-batch size of 512. The models were trained for 100 epochs using the Adam optimizer with a learning rate of 0.001. To prevent overfitting, dropout (Srivastava et al., 2014) was applied to each layer with a rate of 0.3. We repeated training 30 times for each model to assess the statistical significance of the results.

4.3.3 Results

We used three evaluation metrics: (1) the overall error rate; (2) the false alarm rate (i.e., the proportion of pairs classified as "equivalent" by the model that were not actually equivalent); and (3) the miss rate (i.e., the proportion of actually "equivalent" pairs that were classified as not equivalent). The means and standard deviations of the paraphrase identification results for each model are summarized in Table 7. The RNN–SVAE performed better than the RNN–AE and RNN–VAE in terms of error rate and false alarm rate, and these improvements are supported by statistical hypothesis testing at a significance level of 0.01. Although the RNN–AE showed the best miss rate, there was no statistically significant difference between the RNN–AE and the RNN–VAE or RNN–SVAE at a significance level of 0.01. Compared to the RNN–VAE, the RNN–SVAE reduced the error rate by 23.1% and the false alarm rate by 34.8%, which strongly supports the notion that the RNN–SVAE captures the global latent context better than the RNN–VAE.

In addition to paraphrase identification, which was evaluated by a binary decision, we also compared the similarity between the latent vectors of two sentences judged "equivalent" by human raters. This evaluation was conducted only with the RNN–VAE and RNN–SVAE, to examine the effect of adding the document information vector to variational RNN models. Table 8 shows that the RNN–SVAE model not only achieved higher identification accuracy but also generated more similar latent vectors for two similar sentences than the RNN–VAE.

5 Conclusion

For RNN-based autoencoder models (e.g., RNN–AE and RNN–VAE), the final hidden state of the encoder does not contain sufficient information about the entire sentence. In this paper, we proposed the RNN–SVAE to overcome this limitation. To consider the information of all words in the sentence, we constructed a document information vector as a linear combination of the word vectors of the input sentence, with the weights of individual words computed from the attention information between the final state of the encoder and every prior hidden state. We then combined this document information vector with the final hidden state of the bi-directional RNN encoder to construct the global latent vector as the output of the encoder part. The mean and standard deviation of the continuous semantic space were then learned to take advantage of the variational method.

The proposed RNN–SVAE was verified through three NLP tasks: language modeling, missing word imputation, and paraphrase identification. Despite the simple structure of the RNN–SVAE, which merely adds the document information vector to the RNN–VAE model, experimental results showed that the RNN–SVAE achieved higher performance than the RNN–AE and RNN–VAE for all tasks requiring the global latent meaning of the input sentence. The only exception was missing word imputation for a very short sentence, which does not significantly depend on the global semantic information.

Although the experimental results are very favorable for the RNN–SVAE, the current study has some limitations, which suggest future research directions. First, the prior distribution is assumed to be a specific distribution, such as the standard Gaussian. To improve the performance of the RNN–SVAE, it would be worth attempting to find a more appropriate prior distribution for the data. In addition, there is a risk of learning a model that is far from the actual data distribution; thus, as in the adversarial autoencoder (Makhzani et al., 2015) for image data, further research is needed to map the prior distribution to the data distribution in language modeling. Second, the current method relies on the bi-RNN structure to obtain word weights that are not biased toward one side of the sentence. To apply the RNN–SVAE to uni-directional RNN structures, a method for properly re-adjusting the weights, so that they are not biased toward one side, needs to be studied.

References