Towards Linear Time Neural Machine Translation with Capsule Networks

11/01/2018 · by Mingxuan Wang, et al. · Xiamen University, Tencent

In this study, we first investigate a novel capsule network with dynamic routing for linear-time Neural Machine Translation (NMT), referred to as CapsNMT. CapsNMT uses an aggregation mechanism to map the source sentence into a matrix with a pre-determined size, and then applies a deep LSTM network to decode the target sequence from the source representation. Unlike previous work Sutskever et al. (2014), which stores the source sentence in a passive and bottom-up way, the dynamic routing policy encodes the source sentence with an iterative process to decide the credit attribution between nodes from lower and higher layers. CapsNMT has two core properties: it runs in time that is linear in the length of the sequences, and it provides a more flexible way to select, represent and aggregate the part-whole information of the source sentence. On the WMT14 English-German task and the larger WMT14 English-French task, CapsNMT achieves comparable results with state-of-the-art NMT systems. To the best of our knowledge, this is the first work to empirically investigate capsule networks for sequence-to-sequence problems.


1 Introduction

Neural Machine Translation (NMT) is an end-to-end learning approach to machine translation which has recently shown promising results on multiple language pairs Luong et al. (2015); Shen et al. (2015); Wu et al. (2016); Gehring et al. (2017a); Kalchbrenner et al. (2016); Sennrich et al. (2015); Vaswani et al. (2017). Unlike conventional Statistical Machine Translation (SMT) systems Koehn et al. (2003); Chiang (2005), which consist of multiple separately tuned components, NMT aims at building a single, large neural network that directly maps input text to the associated output text Sutskever et al. (2014).

In general, there are several research lines of NMT architectures, among which the Enc-Dec NMT Sutskever et al. (2014) and the Enc-Dec Att NMT Bahdanau et al. (2014); Wu et al. (2016); Vaswani et al. (2017) are the most representative. The Enc-Dec represents the source inputs with a fixed-dimensional vector, and the target sequence is generated from this vector word by word. The Enc-Dec, however, does not preserve the source sequence resolution, a feature that aggravates learning for long sequences; on the other hand, the computational complexity of its decoding process is only $O(n + m)$, with $n$ denoting the source sentence length and $m$ denoting the target sentence length. The Enc-Dec Att preserves the resolution of the source sentence, which frees the neural model from having to squash all the information into a fixed representation, but at the cost of a quadratic running time: due to the attention mechanism, the computational complexity of its decoding process is $O(n \cdot m)$. This drawback grows more severe as the length of the sequences increases.

Currently, most work has focused on the Enc-Dec Att, while the Enc-Dec paradigm is less emphasized despite its advantage of linear-time decoding Kalchbrenner et al. (2016). The linear-time approach is appealing; however, its performance lags behind the Enc-Dec Att. One potential issue is that the Enc-Dec needs to be able to compress all the necessary information of a source sentence into a constant-size representation. Some simple aggregation methods, such as max (or average) pooling, are often used to compress the sentence meaning, and the resulting context vectors are fixed during decoding. These methods process the information in a bottom-up and passive way and lack child-parent (or part-whole) relationships, burdening the model with a memorization step. Therefore, a natural question arises: will carefully designed aggregation operations help the Enc-Dec achieve the best performance?

In recent promising work on capsule networks, a dynamic routing policy is proposed and proven to be more effective than simple aggregation methods Sabour et al. (2017); Zhao et al. (2018); Gong et al. (2018). As an outcome, capsule networks can encode the intrinsic spatial relationship between a part and a whole, constituting viewpoint-invariant knowledge that automatically generalizes to novel viewpoints. Following a similar spirit, we present a family of Enc-Dec approaches, referred to as CapsNMT, characterized by a capsule encoder that addresses the drawbacks of conventional linear-time approaches. The capsule encoder has the attractive potential to address the aggregation issue. We then introduce an iterative routing policy to decide the credit attribution between nodes from lower (child) and higher (parent) layers. Three strategies are also proposed to stabilize the dynamic routing process. We empirically verify CapsNMT on the WMT14 English-German task and the larger WMT14 English-French task, where CapsNMT achieves comparable results with state-of-the-art Transformer systems. Our contributions can be summarized as follows:

  • We propose a sophisticatedly designed linear-time CapsNMT which achieves comparable results with the Transformer framework. To the best of our knowledge, CapsNMT is the first work to empirically investigate capsule networks for sequence-to-sequence problems.

  • We also propose several techniques, including a position-aware routing strategy, a separable composition & scoring strategy and a non-sharing weight strategy, to stabilize the dynamic routing process. We believe that these techniques should always be employed by capsule networks for the best performance.

2 Linear Time Neural Machine Translation

Figure 1: CapsNMT: linear-time neural machine translation with a capsule encoder

The task of linear-time translation can be understood from the perspective of machine learning as learning the conditional distribution $P(Y \mid X)$ of a target sentence (translation) $Y$ given a source sentence $X$. The network that models $P(Y \mid X)$ is composed of two parts: an encoder that processes the source string into a representation and a decoder that uses the source representation to generate the target string. A crucial feature of linear-time NMT is that the source sentence representation is of pre-determined size. Due to the constant-size representation produced by the encoder, the running time of the network can be linear in the number of source tokens. Figure 1 shows the linear-time NMT with a capsule encoder, which encodes the source sentence into a fixed-size representation.
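For concreteness, the conditional distribution is assumed to factorize in the usual autoregressive way; the display below is a reconstruction of the standard Enc-Dec formulation rather than a formula reproduced verbatim from this paper:

```latex
% Standard autoregressive factorization of the translation probability,
% with X = (x_1, ..., x_n) the source sentence and Y = (y_1, ..., y_m) the target.
P(Y \mid X) = \prod_{t=1}^{m} P\left(y_t \mid y_{<t}, X\right)
```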

Constant Encoder with Aggregation Layers

Given a text sequence with $n$ words $\{x_1, x_2, \ldots, x_n\}$, since the words are symbols that cannot be processed directly by neural architectures, we first map each word into a $d$-dimensional embedding vector. The goal of the constant encoder is to transfer the inputs $X \in \mathbb{R}^{n \times d}$ into a representation of pre-determined size $H \in \mathbb{R}^{k \times d_h}$, where $n$ is the length of the input sentence, $k$ is the pre-determined length of the encoder output, $d$ is the dimension of the word embedding, and $d_h$ is the dimension of the hidden states.

In this work, we first build a bi-directional LSTM (BiLSTM) as the primary-capsule layer to incorporate forward and backward context information of a sequence:

$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h}_{i+1})$   (1)

We obtain the sentence-level encoding of a word $x_i$ by concatenating the forward and backward output vectors, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. Thus, the outputs of the BiLSTM encoder are a sequence of vectors $\{h_1, \ldots, h_n\}$ corresponding to the input sequence.
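The following is a minimal PyTorch sketch of such a primary-capsule BiLSTM layer; the class name and the default layer sizes are illustrative assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn

class PrimaryCapsuleEncoder(nn.Module):
    """Sketch of the BiLSTM primary-capsule layer described above.
    The class name and default sizes (emb_dim, hidden_dim) are illustrative
    assumptions, not the paper's reported configuration."""

    def __init__(self, vocab_size, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Forward and backward outputs are concatenated, so every position
        # yields a 2 * hidden_dim vector h_i = [h_i(forward); h_i(backward)].
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> emb: (batch, seq_len, emb_dim)
        emb = self.embedding(token_ids)
        # outputs: (batch, seq_len, 2 * hidden_dim), one vector per source word
        outputs, _ = self.bilstm(emb)
        return outputs
```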

The encoder consists of several primary-capsule layers to extract basic features, followed by several aggregation layers to map the variable-length sequence into a fixed-size representation. Max or average pooling is the simplest way of aggregating information, as it requires no extra parameters and is computationally efficient. When modeling natural language, max or average pooling is performed along the time dimension:

$h_{\max} = \max_{i=1,\ldots,n} h_i$   (2)
$h_{\mathrm{avg}} = \frac{1}{n}\sum_{i=1}^{n} h_i$   (3)

We propose a more powerful aggregation method as the strong baseline of our Enc-Dec approach. The output of the encoder is static and consists of several parts:

(4)

The last time-step state and the first time-step state provide complementary information and thus improve performance. The compressed representation is fixed once learned; therefore, the compression strategy plays a crucial role in the success of building the Enc-Dec NMT model.
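A hedged sketch of one plausible reading of this aggregation baseline follows; since Eq. (4) is not fully reproduced above, the exact set of concatenated parts (max pooling, average pooling, first and last time-step states) is an assumption of this sketch.

```python
import torch

def simple_aggregation(outputs):
    """Sketch of the simple aggregation baseline: build a constant-size
    encoding from pooled features plus the first and last time-step states.
    The exact set of parts in Eq. (4) is an assumption of this sketch.

    outputs: (batch, seq_len, dim) BiLSTM states.
    returns: (batch, 4, dim) fixed-size source representation.
    """
    max_pool = outputs.max(dim=1).values   # element-wise max over time, cf. Eq. (2)
    avg_pool = outputs.mean(dim=1)         # element-wise average over time, cf. Eq. (3)
    first = outputs[:, 0, :]               # first time-step state
    last = outputs[:, -1, :]               # last time-step state
    return torch.stack([max_pool, avg_pool, first, last], dim=1)
```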

LSTM Decoder

The goal of the LSTM decoder is to estimate the conditional probability $P(y_1, \ldots, y_m \mid x_1, \ldots, x_n)$, where $(x_1, \ldots, x_n)$ is the input sequence and $(y_1, \ldots, y_m)$ is its corresponding output sequence.

A simple strategy for general sequence learning is to map the input sequence to a fixed-size vector, and then to map this vector to the target sequence with a conditional LSTM decoder Sutskever et al. (2014):

$z_t = [e(y_{t-1});\, c], \qquad c = W_p\,[H_1; H_2; \ldots; H_k]$   (5)

where $e(y_{t-1})$ is the target word embedding of $y_{t-1}$, $z_t$ is the input of the LSTM at time step $t$, $c$ is the concatenation of the source sentence representation and $W_p$ is the projection matrix. Since $c$ is calculated in advance, the decoding time is linear in the length of the target sentence. At the inference stage, we only utilize the top-most hidden states $h_t$

to make the final prediction with a softmax layer:

$P(y_t \mid y_{<t}, X) = \mathrm{softmax}(W_o\, h_t)$   (6)

Similar to Vaswani et al. (2017), we also employ a residual connection He et al. (2016) around each of the sub-layers, followed by layer normalization Ba et al. (2016).
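Below is a minimal PyTorch sketch of such a conditional LSTM decoder step; the dimensions, the single LSTM layer, and the placement of the residual connection are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionalLSTMDecoder(nn.Module):
    """Sketch of the conditional LSTM decoder: at every step the input is the
    previous target embedding concatenated with a fixed vector view of the
    source representation, so no per-step attention is required. Dimensions,
    the single LSTM layer and the residual placement are illustrative."""

    def __init__(self, vocab_size, emb_dim=512, hidden_dim=512, src_dim=2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + src_dim, hidden_dim, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, src_repr):
        # prev_tokens: (batch, tgt_len); src_repr: (batch, src_dim), computed once.
        emb = self.embedding(prev_tokens)                        # (batch, tgt_len, emb_dim)
        src = src_repr.unsqueeze(1).expand(-1, emb.size(1), -1)  # broadcast over time steps
        hidden, _ = self.lstm(torch.cat([emb, src], dim=-1))
        # Residual connection around the sub-layer followed by layer normalization
        # (assumes emb_dim == hidden_dim, which holds for the defaults above).
        hidden = self.norm(hidden + emb)
        return self.out_proj(hidden)                             # logits over the target vocabulary
```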

3 Aggregation layers with Capsule Networks

Figure 2: Capsule encoder with dynamic routing by agreement

The traditional linear-time approaches collect information in a bottom-up way, without considering the state of the whole encoding; it is therefore difficult to avoid the problem of information attenuation. As can be seen in Figure 2, we introduce a capsule layer to aggregate the information, which iteratively decides the information flow and provides a more flexible way to select, represent and synthesize the part-whole relationships of the source sentence.

3.1 Child-Parent Relationships

To compress the input information into a representation of pre-determined size, the central issue we must address is how to determine the information flow from the input capsules to the output capsules.

Capsule networks try to address the representational limitations and exponential inefficiencies of simple aggregation pooling methods. They allow the network to automatically learn child-parent (or part-whole) relationships. Formally, $m_{ij}$ denotes the information transferred from the child capsule $u_i$ to the parent capsule $v_j$:

$m_{ij} = c_{ij}\, f(u_i, W_j)$   (7)

where $c_{ij}$ can be viewed as the voting weight on the information flow from child capsule $u_i$ to parent capsule $v_j$; $f(\cdot)$ is the transformation function, and in this paper we use a single-layer feed-forward neural network:

$f(u_i, W_j) = W_j\, u_i + b_j$   (8)

where $W_j$ is the transformation matrix corresponding to the position of the parent capsule.

The parent capsule $v_j$ aggregates all the incoming messages from all the child capsules:

$s_j = \sum_{i} m_{ij}$   (9)

and then squashes $s_j$ to confine its norm. ReLU or similar non-linearity functions work well with single neurons, but the squashing function works best with capsules: it squashes the length of a capsule's output vector toward 0 if the vector is short and limits it toward 1 if the vector is long.

$v_j = \mathrm{squash}(s_j) = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2}\, \dfrac{s_j}{\|s_j\|}$   (10)
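A small PyTorch sketch of this squashing non-linearity, as it is usually implemented; the epsilon term is an implementation-level assumption for numerical stability.

```python
import torch

def squash(s, dim=-1, eps=1e-9):
    """Capsule squashing non-linearity of Eq. (10): short vectors shrink
    toward zero length, long vectors are limited to a length just below 1."""
    squared_norm = (s * s).sum(dim=dim, keepdim=True)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / torch.sqrt(squared_norm + eps)
```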

3.2 Dynamic Routing by Agreement

The dynamic routing process is implemented by an EM-like iterative process of refining the coupling coefficient $c_{ij}$, which defines proportionally how much information is to be transferred from child capsule $u_i$ to parent capsule $v_j$.

At iteration $t$, the coupling coefficient $c_{ij}^{t}$ is computed by

$c_{ij}^{t} = \mathrm{softmax}(b_{ij}^{t})$   (11)

where the soft-max is taken across the child capsules, so that the coefficients are normalized to sum to one and all the information from the child capsules will be transferred to the parent. Following Zhao et al. (2018), we explore Leaky-Softmax in place of the standard soft-max while updating the connection strengths. This approach helps route noisy child capsules to an extra dimension without any additional parameters or computational cost.

$b_{ij}^{t+1} = b_{ij}^{t} + v_j \cdot f(u_i, W_j)$   (12)

The logit $b_{ij}$ is simply a temporary value that is iteratively updated with the value from the previous iteration; after the procedure is over, it determines $c_{ij}$. The agreement is simply the scalar product of the parent output $v_j$ and the message $f(u_i, W_j)$: this dot product measures the similarity between the input to a capsule and the output of that capsule, so a lower-level capsule sends more of its output to the higher-level capsule whose output is most similar. $b_{ij}$ is initialized to 0. The coefficients depend on the location and type of both the child and the parent capsules. With an iterative refinement of $c_{ij}$, the capsule network can increase or decrease the connection strength by dynamic routing, which is more effective than primitive routing strategies such as max-pooling, which essentially detects whether a feature is present at any position in the text but loses spatial information about the feature.

1:procedure Routing($f(u_i, W_j)$, $r$)
2:     Initialize $b_{ij} \leftarrow 0$
3:     for each iteration $t = 1, \ldots, r$ do
4:         Compute the routing coefficients $c_{ij}^{t}$ for all $i, j$ from Eq. (11)
5:         Update all the output capsules $v_j$ for all $j$ from Eq. (7, 8, 9, 10)
6:         Update all the logits $b_{ij}^{t+1}$ for all $i, j$ from Eq. (12)
7:     end for
8:return $v$
9:end procedure
Algorithm 1 Dynamic Routing Algorithm
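The following PyTorch sketch re-implements Algorithm 1 under a few stated assumptions: the messages $f(u_i, W_j)$ of Eq. (7)-(8) are precomputed into a tensor u_hat, a plain softmax stands in for the Leaky-Softmax variant, and the normalization axis follows the "across the child capsules" reading above.

```python
import torch

def dynamic_routing(u_hat, num_iterations=3):
    """Sketch of Algorithm 1. u_hat: (batch, n_child, n_parent, d) holds the
    precomputed messages f(u_i, W_j); returns parent capsules v with shape
    (batch, n_parent, d). Plain softmax replaces Leaky-Softmax here."""
    def squash(s, dim=-1, eps=1e-9):
        sq = (s * s).sum(dim=dim, keepdim=True)
        return sq / (1.0 + sq) * s / torch.sqrt(sq + eps)

    b = torch.zeros(u_hat.shape[:-1], device=u_hat.device)    # routing logits b_ij
    for _ in range(num_iterations):
        c = torch.softmax(b, dim=1)                            # coefficients c_ij, Eq. (11)
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)               # aggregate messages, Eq. (9)
        v = squash(s)                                          # parent capsules, Eq. (10)
        # Agreement update: dot product between each message and its parent output, Eq. (12).
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v
```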

When an output capsule receives the incoming messages $m_{ij}$, its state is updated and the coefficient $c_{ij}$ is re-computed for each input capsule. Thus, we iteratively refine the route of the information flow, towards an instance-dependent and context-aware encoding of a sequence. After the text sequence is encoded into $k$ capsules, we map these capsules into a vector representation by simply concatenating all capsules:

$H = [v_1; v_2; \ldots; v_k]$   (13)

The matrix $H$ with pre-determined size is then fed to the final end-to-end NMT model as the source sentence encoding. In this paper, we also explore three strategies to improve the accuracy of the routing process.

Position-aware Routing strategy

The routing process iteratively decides what and how much information is to be sent to the final encoding, considering the states of both the output capsules and the input capsules. In order for the model to make use of the order of the child and parent capsules, some information can be injected about the relative or absolute position of the capsules in the sequence. Adding positional information is more effective for text than for images, since a sentence carries sequential information that can help the capsule network model the child-parent relationship more efficiently. There are many choices of positional encodings, learned and fixed Gehring et al. (2017b). To this end, we add a "positional encoding" to the child capsules and the parent capsules. The positional encodings have the same dimension as the corresponding hidden state, so that the two can be summed. Following Vaswani et al. (2017), we use sine and cosine functions of different frequencies:

$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d}\right)$   (14)
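A compact PyTorch sketch of these fixed sinusoidal encodings (assuming an even encoding dimension that matches the capsule hidden size):

```python
import math
import torch

def sinusoidal_positions(num_positions, dim):
    """Fixed sine/cosine positional encodings in the style of Vaswani et al.
    (2017), added to child and parent capsules; dim is assumed to be even."""
    pe = torch.zeros(num_positions, dim)
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(position * div_term)   # even indices: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd indices: cosine
    return pe  # (num_positions, dim)
```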

Non-sharing Weight Strategy

In this paper, we explore two different types of transformation matrices to generate the message vector $f(u_i, W_j)$ from a child capsule to a parent capsule in Eq. (7, 8). The first design shares the parameters of $W_j$ across different iterations. In the second design, we replace the shared parameters with non-shared parameters $W_j^{t}$, where $t$ is the iteration step of the dynamic routing process. In our preliminary experiments, we found that the non-shared weight strategy works slightly better than the shared one, which is consistent with Liao and Poggio (2016), who investigated the effect of weight sharing in deep neural networks and showed that deep networks with weight sharing generally perform worse than those using independent parameters.

Separable Composition and Scoring strategy

The most important idea behind capsule networks is to measure the similarity between the input and output of a capsule. It is often modeled as a dot product between the input and output of a capsule, and the routing coefficient is then updated correspondingly. Traditional capsule networks take a straightforward strategy in which the "fusion" decisions (e.g., deciding the voting weight $c_{ij}$) are made based on the values of the feature maps. This is essentially a soft template matching Lawrence et al. (1997), which works for tasks like classification but is undesirable for maintaining the composition functionality of capsules. In this paper, we propose to use a separate functional network to take over the scoring duty, and let $f(u_i, W_j)$ defined in Eq. (7) focus on composition. More specifically, we redefine the iterative scoring function of Eq. (12) as

(15)

where $\mathrm{FFN}(\cdot)$ is a fully connected feed-forward network, applied to each position separately and identically, consisting of two linear transformations with a ReLU activation in between.
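Since Eq. (15) is not reproduced above, the sketch below illustrates one plausible instantiation of the separable-scoring idea: a position-wise two-layer feed-forward network with a ReLU in between produces the agreement score instead of a raw dot product. Combining the message and parent capsule by concatenation is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class SeparableScorer(nn.Module):
    """Sketch of separable composition and scoring: the agreement between a
    message and its parent capsule is produced by a small position-wise
    feed-forward network rather than a raw dot product. Concatenating the two
    vectors is an assumption of this sketch, not the paper's exact Eq. (15)."""

    def __init__(self, dim, hidden=1024):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, hidden),   # first linear transformation
            nn.ReLU(),                    # ReLU activation in between
            nn.Linear(hidden, 1),         # second linear transformation -> scalar agreement
        )

    def forward(self, u_hat, v):
        # u_hat: (batch, n_child, n_parent, d); v: (batch, n_parent, d)
        v = v.unsqueeze(1).expand_as(u_hat)
        return self.ffn(torch.cat([u_hat, v], dim=-1)).squeeze(-1)  # (batch, n_child, n_parent)
```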

4 Experiments

4.1 Datasets

We mainly evaluate CapsNMT on the widely used WMT English-German and English-French translation tasks. The evaluation metric is BLEU Papineni et al. (2002). We tokenized the references and evaluated performance with multi-bleu.pl; the metrics are exactly the same as in previous work.

For English-German, to compare with the results reported by previous work, we used the same subset of the WMT 2014 training corpus that contains 4.5M sentence pairs with 91M English words and 87M German words. The concatenation of news-test 2012 and news-test 2013 is used as the validation set and news-test 2014 as the test set.

To evaluate at scale, we also report the results of English-French. To compare with the results reported by previous work on end-to-end NMT, we used the same subset of the WMT 2014 training corpus that contains 36M sentence pairs. The concatenation of news-test 2012 and news-test 2013 serves as the validation set and news-test 2014 as the test set.

4.2 Training details

Our training procedure and hyper-parameter choices are similar to those used by Vaswani et al. (2017). In more detail, for both English-German and English-French translation, we use sub-word tokens as the vocabulary, based on Byte Pair Encoding Sennrich et al. (2015).

We initialized parameters by sampling each element from a Gaussian distribution. Parameter optimization was performed using stochastic gradient descent; Adam Kingma and Ba (2015) was used to automatically adapt the learning rate of each parameter. To avoid gradient explosion, gradients of the cost function whose norm was larger than a predefined threshold were normalized to that threshold Pascanu et al. (2013). We batched sentence pairs by approximate length and limited the number of input and output tokens per batch per GPU, so that each resulting training batch contained approximately 60,000 source and 60,000 target tokens. We trained our NMT model on sentences up to a maximum length in the training data. During training, we employed label smoothing Pereyra et al. (2017).
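A minimal PyTorch sketch of this optimization setup is shown below; the learning rate, beta values and clipping threshold are placeholders, not the paper's reported settings.

```python
import torch

# Stand-in model; in practice this would be the full CapsNMT network.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98))

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # Re-normalize gradients whose global norm exceeds a predefined threshold.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
```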

Translations were generated by beam search, and log-likelihood scores were normalized by the sentence length. We used a fixed beam width and length penalty in all the experiments. Dropout was applied to each layer to avoid over-fitting Hinton et al. (2012), with separate dropout rates for the English-German and English-French tasks. Except where otherwise mentioned, the NMT systems consisted of a stack of encoder layers followed by a capsule layer, and a stack of decoder layers. We trained for 300,000 steps on 8 M40 GPUs, and averaged the last few checkpoints, which were saved at regular intervals. The hidden-state dimensions were set to a smaller value for our base model and a larger value for the big model, and the capsule number was fixed in advance.

4.3 Results on English-German and English-French Translation


SYSTEM | Architecture | Time | EN-Fr BLEU | EN-DE BLEU
Buck et al. (2014) | Winning WMT14 | - | 35.7 | 20.7
Existing Enc-Dec Att NMT systems
Wu et al. (2016) | GNMT + Ensemble | quadratic | 40.4 | 26.3
Gehring et al. (2017a) | ConvS2S | quadratic | 40.5 | 25.2
Vaswani et al. (2017) | Transformer (base) | quadratic | 38.1 | 27.3
Vaswani et al. (2017) | Transformer (large) | quadratic | 41.0 | 27.9
Existing Enc-Dec NMT systems
Luong et al. (2015) | Reverse Enc-Dec | linear | - | 14.0
Sutskever et al. (2014) | Reverse stack Enc-Dec | linear | 30.6 | -
Zhou et al. (2016) | Deep Enc-Dec | linear | 36.3 | 20.6
Kalchbrenner et al. (2016) | ByteNet | linear (c) | - | 23.7
Our Encoder-Decoder based NMT systems
Base Model | Simple Aggregation | linear (c) | 38.6 | 23.4
Base Model | CapsNMT | linear (c) | 40.4 | 26.9
Big Model | CapsNMT | linear (c) | 41.6 | 27.6
Table 1: Case-sensitive BLEU scores on English-German and English-French translation. The Time column gives the decoding time complexity in the sequence lengths (quadratic for attention-based systems, linear up to a constant c for Enc-Dec systems). CapsNMT achieves comparable results with the Transformer Vaswani et al. (2017). We obtained the highest BLEU score among all the competitors on English-French translation.

The results on English-German and English-French translation are presented in Table 1. We compare CapsNMT with various other systems, including the winning system in WMT'14 Buck et al. (2014), a phrase-based system whose language models were trained on a huge monolingual text, the Common Crawl corpus. Among the Enc-Dec Att systems, to the best of our knowledge, GNMT is the best RNN-based NMT system, and the Transformer Vaswani et al. (2017) is currently the SOTA system, outperforming GNMT on both the English-German and the English-French tasks. For Enc-Dec NMT, ByteNet is the previous state-of-the-art system, which has 150 convolutional encoder layers and 150 convolutional decoder layers.

On the English-to-German task, our big CapsNMT achieves the highest BLEU score among all the Enc-Dec approaches, even outperforming ByteNet, a relatively strong competitor. On the larger English-French task, we achieve the highest BLEU score among all the systems, even outperforming the big Transformer, a relatively strong competitor. To show the power of the capsule encoder, we also compare against the simple-aggregation version of the Enc-Dec model; the capsule encoder again yields a clear gain for the base model on both the English-German and the English-French tasks. This improvement is consistent with our intuition that the dynamic routing policy is more effective than simple aggregation. It is also worth noting that the base-model capsule encoder improves over the base Transformer on the English-French task.

The Time column indicates the time complexity of the network as a function of the length of the sequences. ByteNet and the RNN Enc-Dec are the only previous networks that have linear running time (up to the constant c). The RNN Enc-Dec, however, does not preserve the source sequence resolution, a feature that aggravates learning for long sequences. The Enc-Dec Att does preserve the resolution, but at the cost of a quadratic running time. ByteNet overcomes these problems with convolutional neural networks; however, the architecture must be deep enough to capture the global information of a sentence. The capsule encoder makes use of the dynamic routing policy to automatically learn the part-whole relationships and encode the source sentence into a fixed-size representation. With the capsule encoder, CapsNMT keeps a linear running time, and the constant is the capsule number, which is fixed in our main experiments.

4.4 Ablation Experiments

In this section, we evaluate the importance of our main techniques for training CapsNMT. We believe that these techniques are universally applicable across different NLP tasks, and should always be employed by capsule networks for best performance.

Model | BLEU
Base CapsNMT |
+ Non-weight sharing |
+ Position-aware Routing Policy |
+ Separable Composition and Scoring |
+ Knowledge Distillation |
Table 2: English-German task: ablation experiments for the different techniques.

From Table 2 we draw the following conclusions:

  • Non-weight sharing strategy We observed that the non-weight sharing strategy improves the baseline model, leading to an increase in BLEU.

  • Position-aware Routing strategy Adding the position embeddings to the child and parent capsules yields a further improvement in BLEU score.

  • Separable Composition and Scoring strategy Redefining the dot-product scoring function contributes significantly to the quality of the model, resulting in a further increase in BLEU score.

  • Knowledge Distillation Sequence-level knowledge distillation is applied to alleviate multimodality in the training dataset, using the state-of-the-art Transformer models as the teachers Kim and Rush (2016). In addition, we use the same sizes and hyperparameters for the student and its respective teacher. We decode the entire training set once using the teacher to create a new training dataset for its respective student (a minimal sketch of this step is given after this list).
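The sketch below illustrates only the data-creation step of sequence-level distillation described above; `teacher.translate` is a hypothetical interface, not an API from the paper or any specific library.

```python
def build_distilled_corpus(teacher, source_sentences):
    """Sequence-level knowledge distillation data step (Kim and Rush, 2016):
    the teacher decodes every source sentence once, and the resulting pairs
    become the student's training set. `teacher.translate` is a hypothetical
    interface assumed for illustration."""
    distilled_pairs = []
    for src in source_sentences:
        hyp = teacher.translate(src)          # beam-search decode with the teacher model
        distilled_pairs.append((src, hyp))
    return distilled_pairs
```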

4.5 Model Analysis

In this section, we study several properties of CapsNMT.

Effects of Iterative Routing

We also study how the iteration number affects the performance of aggregation on the English-German task. Figure 3 compares different iteration counts in the dynamic routing process under several capsule-number settings. We found that, across the different capsule-number settings, performance is best when the iteration number is set to 3. The results indicate that dynamic routing contributes to improving performance, and that a larger capsule number often leads to better results.

Figure 3: Effects of Iterative Routing with different capsule numbers

Analysis on Decoding Speed

Model | Latency (ms)
Transformer |
CapsNMT |
Table 3: Time required for decoding. Latency indicates the amount of time in milliseconds required to translate one sentence, averaged over the whole English-German newstest2014 dataset.

We show the decoding speed of both the Transformer and CapsNMT in Table 3, with the capsule number fixed as in the main experiments. The results empirically demonstrate that CapsNMT improves on the decoding speed of the Transformer approach.

Performance on long sentences

A more detailed comparison between CapsNMT and the Transformer can be seen in Figure 4. In particular, we test BLEU scores on sentences grouped by length. We were surprised to discover that the capsule encoder did well on medium-length sentences: there is no degradation on sentences with fewer than 40 words; however, there is still a gap on the longest sentences. A deeper capsule encoder could potentially help to address this degradation problem, and we leave this for future work.

Figure 4: The plot shows the performance of our system as a function of sentence length, where the x-axis corresponds to the test sentences sorted by their length.

Visualization

Orlando Bloom and Miranda Kerr still love each other
Table 4: A visualization to show the perspective of a sentence from 4 different capsules at the third iteration.

We visualize how much information each child capsule sends to the parent capsules. As shown in Table 4, the color density of each word denotes the coefficient $c_{ij}$ at iteration 3 in Eq. (11). At the first iteration, $c_{ij}$ follows a uniform distribution since $b_{ij}$ is initialized to 0, and it is then iteratively refined by the dynamic routing policy. It is appealing to find that after 3 iterations, the distribution of the voting weights converges to a sharp distribution, with values very close to 0 or 1. It is also worth mentioning that the capsules seem able to capture some structural information; for example, the information of the phrase still love each other is sent to the same capsule. We will explore this further in future work.

5 Related Work

Neural Machine Translation

A number of attention-based neural architectures have proven to be very effective for NMT. In RNMT models, both the encoder and decoder are implemented as deep Recurrent Neural Networks (RNNs) interacting via a soft-attention mechanism Bahdanau et al. (2014); Chen et al. (2018); this is the pioneering paradigm that first achieved state-of-the-art performance. Following RNMT, convolutional sequence-to-sequence (ConvS2S) models take advantage of modern fast computing devices and outperform RNMT with faster training speed Kalchbrenner et al. (2016); Gehring et al. (2017a). Most recently, the Transformer model, which is based solely on a self-attention mechanism and feed-forward connections, has further advanced the field of NMT, both in terms of translation quality and speed of convergence Vaswani et al. (2017); Dehghani et al. (2018). The attention mechanism plays a crucial role in the success of all these models, as the memory capacity of a single dense vector in the typical encoder-decoder model does not seem powerful enough to store the necessary information of the source sentence. Despite their generally good performance, the attention-based models have a running time that is super-linear in the length of the source sequences, burdening inference speed as the length of the sequences increases. Different from the attention-based approaches, CapsNMT runs in time that is linear in the length of the sequences.

Linear Time Neural Machine Translation

Several papers have proposed to use neural networks to directly learn the conditional distribution from a parallel corpus Kalchbrenner and Blunsom (2013); Sutskever et al. (2014); Cho et al. (2014); Kalchbrenner et al. (2016). In Sutskever et al. (2014), an RNN with LSTM units was used to encode a source sentence and, starting from the last hidden state, to decode a target sentence. Similarly, Cho et al. (2014) proposed to use an RNN to encode and decode a pair of source and target phrases. Different from the RNN-based approaches, Kalchbrenner et al. (2016) propose ByteNet, which makes use of convolutional networks and successfully builds a linear-time NMT system. Unlike previous work, which stores the source sentence in a bottom-up way, CapsNMT encodes the source sentence with an iterative process to decide the credit attribution between nodes from lower and higher layers. As a result, CapsNMT achieves the best performance among the linear-time NMT systems.

Capsule Networks for NLP

Recently, much attention has been paid to developing sophisticated encoding models that capture the long- and short-term dependency information in a sequence. Gong et al. (2018) propose an aggregation mechanism to obtain a fixed-size encoding with a dynamic routing policy. Zhao et al. (2018) explore capsule networks with dynamic routing for multi-task learning and achieve the best performance on six text classification benchmarks. Wang et al. (2018) propose RNN-Capsule, a capsule model based on a Recurrent Neural Network (RNN) for sentiment analysis, in which one capsule is built for each sentiment category. To the best of our knowledge, CapsNMT is the first work to empirically investigate capsule networks for sequence-to-sequence problems.

6 Conclusion

We have introduced CapsNMT, which uses a dynamic routing policy for linear-time NMT, and proposed three strategies to boost the performance of the dynamic routing process. We have shown that CapsNMT is a state-of-the-art encoder-decoder based NMT system that outperforms ByteNet while maintaining linear running-time complexity. We have also shown that, with a carefully designed encoder, CapsNMT can achieve comparable results with the state-of-the-art Transformer system. To the best of our knowledge, this is the first work to empirically investigate capsule networks for sequence-to-sequence problems.

In the future, we would like to investigate more sophisticated routing policies for better encoding of long sequences. Dynamic routing should also be useful for the decoder of the NMT model.

References