Self-attention based end-to-end Hindi-English Neural Machine Translation

09/21/2019 ∙ by Siddhant Srivastava, et al.

Machine Translation (MT) is a field of study in Natural Language Processing that deals with the automatic translation of human language from one language to another by computer. With a rich research history spanning about three decades, machine translation is one of the most sought-after areas of research in the computational linguistics community. As part of this master's thesis, the main focus is on the deep-learning-based methods that have made significant progress in recent years and are becoming the de facto approach in MT. We point out recent advances in neural translation models and the domains in which NMT has replaced conventional SMT models, and we mention future avenues in the field. We then propose an end-to-end self-attention Transformer network for Neural Machine Translation, trained on a Hindi-English parallel corpus, and compare its performance with other state-of-the-art models, namely the encoder-decoder and attention-based encoder-decoder neural models, on the basis of BLEU. We conclude this paper with a comparative analysis of the three models.




1 Introduction

Machine translation, a field of study within natural language processing, aims to translate natural language automatically using machines. Data-driven machine translation has become the dominant approach owing to the availability of large parallel corpora. Its main objective is to translate unseen source-language text, given that the system learns translation knowledge from sentence-aligned bilingual training data.

Statistical Machine Translation (SMT) is a data-driven approach that uses probabilistic models to capture the translation process. Early SMT models were generative models that took the word as the basic unit [Brown_1903], followed by maximum-entropy-based discriminative models using features learned from sentences [Och_2002] and by simple and hierarchical phrase-based models [Koehn_2003, Chiang_2017]. These methods have been widely used since 2002, even though discriminative models faced the challenge of data sparsity. Discrete word-based representations made SMT prone to learning poor estimates for low-count events. Moreover, designing features for SMT by hand is a difficult task that requires domain knowledge, which is hard to acquire given the variety and complexity of natural languages.

Recent years have seen the great success of deep learning applications in machine translation. Deep learning approaches have surpassed statistical methods in almost all sub-fields of MT and have become the de facto method in both academia and industry. As part of this thesis, we discuss the two areas in which deep learning has been most heavily used in MT. We briefly discuss component-based (domain-based) deep learning methods for machine translation [Devlin_2014], which use deep learning models to improve the effectiveness of individual SMT components, including language models, translation models, and reordering models. Our main focus is on end-to-end deep learning models for machine translation [Sutskever_2014, Bahdanau_2014], which use neural networks to learn the correspondence between a source and a target language directly and holistically, without any hand-crafted features. These models are now known as Neural Machine Translation (NMT).

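Let $x$ denote the source language and $y$ denote the target language. Given a set of model parameters $\theta$, the aim of any machine translation algorithm is to find the translation $\hat{y}$ having maximum probability:

$$\hat{y} = \arg\max_{y} P(y \mid x; \theta)$$

The decision rule is rewritten using Bayes' rule as [Brown_1903]:

$$\hat{y} = \arg\max_{y} \frac{P(y; \theta_{lm}) \, P(x \mid y; \theta_{tm})}{P(x)} = \arg\max_{y} P(y; \theta_{lm}) \, P(x \mid y; \theta_{tm})$$

where $P(y; \theta_{lm})$ is called the language model and $P(x \mid y; \theta_{tm})$ is called the translation model. The translation model, in turn, is defined as a generative model that is decomposed via latent structures:

$$P(x \mid y; \theta_{tm}) = \sum_{z} P(x, z \mid y; \theta_{tm})$$

where $z$ denotes latent structures such as the word alignment between the source language and the target language.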

2 End-to-End Deep Learning for Machine translation

End-to-end machine translation models [Sutskever_2014, Bahdanau_2014], also termed Neural Machine Translation (NMT), aim to find a correspondence between source and target natural languages with the help of deep neural networks. The main difference between NMT and conventional Statistical Machine Translation (SMT) approaches [Brown_1903, Vogel_1996, Koehn_2003, Och_2002] is that neural models are capable of learning complex relationships between natural languages directly from the data, without resorting to manually designed features, which are hard to engineer.

The standard problem in machine translation remains the same: given a sequence of words in a source-language sentence $x = x_1, \ldots, x_I$ and a target-language sentence $y = y_1, \ldots, y_J$, NMT tries to factor the sentence-level translation probability into context-dependent sub-word translation probabilities:

$$P(y \mid x; \theta) = \prod_{j=1}^{J} P(y_j \mid x, y_{<j}; \theta)$$

Here $y_{<j}$ is referred to as the partial translation. The context linking source and target sentences can become sparse when the sentences grow too long. To solve this problem, [Sutskever_2014] proposed an encoder-decoder network that represents a variable-length sentence as a fixed-length vector and uses this distributed vector to translate sentences.
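As a small illustration of this factorization, the sketch below scores a target sentence by summing per-token log-probabilities conditioned on the partial translation. The names `sentence_log_prob` and `toy_model`, and the probability table, are hypothetical stand-ins for a trained NMT model and are not taken from the original work.

```python
import math

def sentence_log_prob(target_tokens, next_token_probs):
    """Sum log P(y_j | x, y_<j) over all target positions (chain rule)."""
    total = 0.0
    for j, token in enumerate(target_tokens):
        prefix = tuple(target_tokens[:j])   # partial translation y_<j
        dist = next_token_probs(prefix)     # distribution over the next token
        total += math.log(dist[token])
    return total

def toy_model(prefix):
    """Hypothetical conditional distributions; the source sentence is implicit."""
    table = {
        (): {"I": 0.8, "you": 0.2},
        ("I",): {"eat": 0.5, "run": 0.5},
        ("I", "eat"): {"</s>": 1.0},
    }
    return table[prefix]

log_p = sentence_log_prob(["I", "eat", "</s>"], toy_model)
print(math.exp(log_p))
```

Working in log space keeps the product of many small probabilities numerically stable, which is also why training objectives are usually stated as log-likelihoods.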

2.1 Encoder Decoder Framework for Machine Translation

Neural machine translation models follow an encoder-decoder architecture. The role of the encoder is to map arbitrary-length sentences to a fixed-length real-valued vector, termed the context vector. This context vector contains all the essential features that can be inferred from the source sentence itself. The decoder network takes this vector as input and outputs the target sentence word by word. The ideal decoder is expected to output a sentence that carries the full context of the source-language sentence. Figure 1 shows the overall architecture of the encoder-decoder neural network for machine translation.

Since source and target sentences are usually of different lengths, [Sutskever_2014] initially proposed recurrent neural networks for both the encoder and decoder networks. To address the vanishing and exploding gradient problems that arise from dependencies among word pairs, Long Short-Term Memory (LSTM) [Hochreiter_1997] and Gated Recurrent Unit (GRU) [Cho_2014] cells were proposed in place of the vanilla RNN cell.

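Training in NMT is done by maximising the log-likelihood as the objective function:

$$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta)$$

where $\mathcal{L}(\theta)$ is defined over a training set of $N$ sentence pairs as:

$$\mathcal{L}(\theta) = \sum_{n=1}^{N} \log P(y^{(n)} \mid x^{(n)}; \theta)$$

After training, the learned parameters $\hat{\theta}$ are used for translation as:

$$\hat{y} = \arg\max_{y} P(y \mid x; \hat{\theta})$$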

Figure 1: Encoder-decoder model for machine translation. Crimson boxes depict the hidden states of the encoder, blue boxes indicate the "End of Sentence" (EOS) token, and green boxes show the hidden states of the decoder. Credits: Neural Machine Translation tutorial, ACL 2016.

2.2 Attention Mechanism in Neural Machine Translation

The encoder network proposed by [Sutskever_2014] represents the source-language sentence as a fixed-length vector that is subsequently used by the decoder network. Empirical testing showed that translation quality depends heavily on the length of the source sentence and degrades significantly as sentence length increases.

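To address this issue, [Bahdanau_2014] proposed integrating an attention mechanism into the encoder-decoder network and showed that it can dynamically select the relevant parts of the source context when producing the target sentence. They used a bi-directional RNN (BRNN) to capture global context:

$$\overrightarrow{h}_i = f(x_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = f(x_i, \overleftarrow{h}_{i+1})$$

The forward hidden state $\overrightarrow{h}_i$ and backward hidden state $\overleftarrow{h}_i$ are concatenated to capture sentence-level context:

$$h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$$

The basic idea behind computing attention is to find the portions of the source text that are of interest when generating each target word. This is done by first computing attention weights:

$$\alpha_{j,i} = \frac{\exp(a(s_{j-1}, h_i))}{\sum_{i'=1}^{I} \exp(a(s_{j-1}, h_{i'}))}$$

where $a$ is the alignment function, which evaluates how well the input at position $i$ is aligned with the output at position $j$, and $I$ is the length of the source sentence. The context vector $c_j$ is computed as a weighted sum of the hidden states of the source:

$$c_j = \sum_{i=1}^{I} \alpha_{j,i} h_i$$

and the target hidden state $s_j$ is computed as follows:

$$s_j = f(y_{j-1}, s_{j-1}, c_j)$$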
In Figure 2, attention is computed over the encoder's hidden states, and the resulting context vector is then used by the decoder for translation. The difference between attention-based NMT [Bahdanau_2014] and the original encoder-decoder architecture [Sutskever_2014] lies in how the source context is computed. In the original encoder-decoder, the source's hidden state is used to initialize the target's initial hidden state, whereas in the attention mechanism a weighted sum of hidden states is used, which ensures that the importance of every source word in the sentence is well preserved in the context. This greatly improves translation performance, and the approach has therefore become the state-of-the-art model in neural machine translation.
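The weighted-sum step can be sketched in a few lines. This is a minimal illustration that assumes a plain dot product as the alignment function, whereas [Bahdanau_2014] use a small feed-forward network; the function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Assumption: the alignment score is a dot product between the
    decoder state and each encoder hidden state.
    """
    scores = [sum(a * b for a, b in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Three hypothetical encoder hidden states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], encoder_states)
print(weights, context)
```

The weights form a probability distribution over source positions, so the context vector always lies inside the convex hull of the encoder states.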

Figure 2: Attention-based encoder-decoder architecture for machine translation. Most of the architecture is similar to the basic encoder-decoder, with the addition of a context vector computed using attention weights for each word token; the attention vector is calculated using the context vector and the hidden state of the encoder. Credits: Attention-based Neural Machine Translation with Keras, blog by Sigrid Keydana.

3 Neural Architectures for NMT

Most encoder-decoder-based NMT models have used the RNN and its variants, the LSTM [Hochreiter_1997] and the GRU [Cho_2014]. Recently, convolutional networks (CNN) [Gehring_2017] and self-attention networks [Vaswani_2017] have been studied and have produced promising results.

The problem with using recurrent networks in NMT is that they work by sequential computation and need to maintain a hidden state at each step of training. This makes training highly inefficient and time-consuming. [Gehring_2017] proposed that convolutional networks can instead learn the fixed-length hidden states using the convolution operation. The main advantage of this approach is that the convolution operation does not depend on previously computed values and can be parallelized for multi-core training. Convolutional layers can also be stacked one after another to learn deeper context, making them a suitable choice for both the encoder and the decoder.

Recurrent networks compute the dependency among words in a sentence of length $n$ in $O(n)$, while a convolutional network can achieve the same in $O(\log_k n)$, where $k$ is the size of the convolution kernel.
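To make the difference concrete, the sketch below counts how many sequential steps separate the first and last token of a sentence under each scheme. It assumes dilated convolutions whose receptive field grows by a factor of k per layer; both helper functions are hypothetical and purely illustrative.

```python
def recurrent_path_length(n):
    """Sequential steps between the first and last of n tokens in an RNN."""
    return max(n - 1, 0)

def conv_layers_needed(n, k):
    """Smallest number of stacked layers whose receptive field covers n tokens.

    Assumption: the receptive field grows as k ** layers (dilated convolutions).
    """
    if k < 2:
        raise ValueError("kernel size must be at least 2")
    layers, field = 0, 1
    while field < n:
        field *= k
        layers += 1
    return layers

for n in (8, 64, 512):
    print(n, recurrent_path_length(n), conv_layers_needed(n, 4))
```

Under these assumptions, a 64-token sentence needs 63 recurrent steps but only three stacked layers with a kernel of size 4.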

[Vaswani_2017] proposed a model that computes the dependency between every word pair in a sentence using only attention layers stacked one after another in both the encoder and the decoder; they named this self-attention. The overall architecture is shown in Figure 3. In their model, the hidden state is computed using self-attention and a feed-forward network. They use positional encoding to introduce features based on the position of each word in the sentence, and their self-attention layer, named multi-head attention, is highly parallelizable. As a result, the model significantly speeds up NMT training while also producing better results than the baseline recurrent-network-based models.
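As an illustration of positional encoding, the sketch below implements the sinusoidal formulation described by [Vaswani_2017], in which even dimensions use a sine and odd dimensions a cosine. Whether the models in this study use exactly this variant is an assumption, and the function name is hypothetical.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position as a list of d_model floats."""
    encoding = []
    for i in range(d_model):
        # Each sine/cosine pair (2i, 2i + 1) shares one frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos, 4)])
```

These vectors are added to the word embeddings, so the otherwise order-agnostic attention layers can distinguish positions.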

Figure 3: Self-attention encoder-decoder Transformer model. The encoder and decoder both consist of positional encoding and stacked layers of multi-head attention and feed-forward networks, with the decoder containing an additional masked multi-head attention. Translation probabilities are calculated using a linear layer followed by softmax. Credits: Vaswani et al. 2017.

As of now, there is no clear verdict on which neural architecture is best, and different architectures give different results depending on the problem at hand. Neural architecture design is still considered one of the most active research areas in Neural Machine Translation.

4 Research gaps and open problems

Deep learning methods have transformed the field of machine translation, with early efforts focusing on improving the key components of Statistical Machine Translation such as word alignment [Yang_2013], the translation model [Koehn_2003, Gao_2014], phrase reordering [Li_2013, Li_2014], and the language model [Vaswani_2003]. Since 2010, most research has shifted towards developing end-to-end neural models that reduce the need for extensive feature engineering [Sutskever_2014, Bahdanau_2014]. Since their inception, neural models have largely replaced statistical models in both academic and industrial applications.

Although deep learning has accelerated research in the machine translation community, current NMT models are not free from flaws and have certain limitations. In this section, we describe some current research problems in NMT. Our aim is to help researchers and practitioners working in this field become familiar with these problems and work on them, for even faster progress in the field.

4.1 Neural models motivated by semantic approaches

End-to-end models have been termed the de facto model in machine translation, yet it is hard to interpret the internal computation of neural networks, which is often simply called a "black box" approach. One possible area of research is to develop linguistically motivated neural models for better interpretability. It is hard to extract knowledge from the hidden states of current neural networks, and it is therefore equally hard to incorporate prior knowledge, which is symbolic in nature, into the continuous representation of these states [Ding_2017].

4.2 Light weight neural models for learning through inadequate data

Another major drawback of NMT is data scarcity. It is well understood that NMT models are data-hungry and require millions of training instances to give their best results. The problem arises when there are not enough parallel corpora for most of the language pairs in the world. Building models that can learn decent representations from relatively small data sets is therefore an actively researched problem today. A related problem is to develop one-to-many and many-to-many language models instead of one-to-one models. Researchers do not yet know how to capture common knowledge with neural networks from a linguistic perspective; such knowledge would help develop multi-lingual translation models in place of the one-to-one models used today.

4.3 Multi-modal neural architectures for present data

One more problem is to develop multi-modal language translation models. Almost all the work done so far has been based on textual data. Research on developing continuous representations that merge text, speech, and visual data to create multi-modal systems is in full swing. Since there are limited or no multi-modal parallel corpora, developing such datasets is also an interesting area to explore and would benefit multi-modal neural architectures.

4.4 Parallel and distributed algorithms for training neural models

Finally, current neural architectures rely heavily on extensive computational power to give competitive results [Gilvile_2017_j, Castilho_2018_j, Karakanta_2018_j]. Although there is no shortage of compute and storage at present, it would be more efficient to come up with lightweight neural models for language translation. Recurrent models [Sutskever_2014, Bahdanau_2014] also cannot be parallelized, which makes it hard to develop distributed systems for model training. Fortunately, recent developments, namely convolutional networks and self-attention networks, can be parallelized and therefore distributed across multiple systems. However, because they contain millions of interdependent parameters, it is hard to distribute them across loosely coupled systems. Developing lightweight neural architectures designed to be distributed could therefore be a likely new frontier of NMT.

5 Methodology

The proposed methodology can be broken down into several atomic objectives. The first step is the acquisition of parallel corpora; the next is to pre-process the acquired data. Various neural models are then implemented and trained on the pre-processed data. The last part of our study is to compare the results obtained by the models in a comparative study.

5.1 Data Acquisition and preparation

For this study, we work with the English-Hindi parallel corpus, curated and made publicly available by the Center for Indian Language Technology (CFILT), Indian Institute of Technology, Bombay [Kunchukuttan_2017]. Table 1 shows the number of parallel sentences in the train and test data. This parallel dataset contains nearly 1.5 million parallel sentences for training and testing. To the best of our knowledge, no literature to date reports a comparative study of neural models on this dataset.

Dataset               Sentence Pairs
IITB-CFILT Hi-En      1,495,847
Train                 1,492,827
Validation-Test       3,020
Table 1: Statistics of the dataset used

After unzipping the data, the next step in our pipeline is to decompose rare words in the corpora using subword byte pair encoding (BPE) [Sennrich_2015]. Byte pair encoding is useful when an extremely large vocabulary hinders model training: rare words are decomposed into common subwords and the vocabulary is built accordingly.

To encode the training corpora using BPE, we first need to generate the BPE operations. The BPE learning command creates a file named bpe32k, which contains 32k BPE operations. It also outputs two dictionaries, one of which is named vocab.en. A similar methodology is applied to the Hindi-English data as well.
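To show what "BPE operations" are, the following minimal sketch learns merge operations from a toy word-frequency dictionary by repeatedly merging the most frequent adjacent symbol pair. It is not the command used in this study; `learn_bpe`, the toy words, and the tie-breaking rule are assumptions made for illustration, and a real pipeline would rely on a dedicated BPE tool.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: count} dictionary."""
    # Each word becomes a tuple of symbols ending in an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```

With 32k operations instead of three, frequent words end up as single symbols while rare words are split into reusable subword units.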

6 Model components

For this study, a sequence-to-sequence LSTM network and an attention-based encoder-decoder using GRU cells have been implemented, along with a self-attention Transformer network. All the models are tested side by side to establish a clear ranking among them. The basic theory of the model components used is given below.

6.1 RNN Cell

The basic neural cell of a feed-forward network works well for many problems but fails when the order of the data matters; as a result, such models fail to generalize to problems involving temporal or sequential data. The reason for this failure is that the basic neural cell does not take previous information into account in its computation, and the basic RNN cell was developed to address exactly this. Recurrent Neural Networks (RNNs) are networks with recurrent cells that are capable of combining past information with current information when computing values. As a result, these models have seen great success on problems with sequential input, such as those in natural language processing and weather forecasting.

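The basic mathematical equations underlying the RNN cell are described below:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \qquad y_t = W_{hy} h_t$$

Here $x_t$ and $y_t$ are the input and output at the $t$-th time step, $h_t$ is the hidden state, and $W_{xh}$, $W_{hh}$, and $W_{hy}$ are the connection weights.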
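A single recurrent step can be sketched as follows, assuming the common formulation h_t = tanh(W_xh x_t + W_hh h_{t-1}) with a linear output y_t = W_hy h_t and no bias terms. The weight names and the toy values are assumptions for illustration only.

```python
import math

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One RNN step: returns the new hidden state and the output."""
    pre = [a + b for a, b in zip(matvec(W_xh, x_t), matvec(W_hh, h_prev))]
    h_t = [math.tanh(p) for p in pre]
    y_t = matvec(W_hy, h_t)
    return h_t, y_t

# Hypothetical 1-dimensional weights; the hidden state carries past inputs forward.
W_xh, W_hh, W_hy = [[1.0]], [[0.5]], [[2.0]]
h = [0.0]
for x in ([0.5], [0.1], [-0.3]):
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy)
    print(h, y)
```

Because the same weights are reused at every step, the loop handles sequences of any length, but each step must wait for the previous one.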

6.2 GRU Cell

Although the RNN cell has outperformed non-sequential neural networks, it fails to generalize to problems with longer sequence lengths. The problem arises from its inability to capture long-term dependencies among the sequential units, a phenomenon termed the vanishing gradient problem. To solve this problem, [Cho_2014] proposed a gated approach that explicitly captures long-term memory; the resulting cell was termed the Gated Recurrent Unit (GRU). The schematic diagram of the GRU cell is given in Figure 4.

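The difference between the GRU cell and the RNN cell lies in the computation of the hidden cell values: the GRU uses two gates, update ($z$) and reset ($r$), to capture long-term dependencies. The mathematical equations behind the computation are given below.

$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$
$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$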
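The gating computation can be sketched as below. This assumes the convention of [Cho_2014], in which the update gate z weights the previous state and (1 - z) weights the candidate state (some references swap the two), and it omits bias terms. The parameter dictionary and its keys are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, x, U, h):
    """Return W x + U h for list-of-rows matrices W and U."""
    return [sum(w * xi for w, xi in zip(w_row, x)) +
            sum(u * hi for u, hi in zip(u_row, h))
            for w_row, u_row in zip(W, U)]

def gru_step(x, h_prev, p):
    """One GRU step with update gate z and reset gate r."""
    z = [sigmoid(a) for a in affine(p["Wz"], x, p["Uz"], h_prev)]
    r = [sigmoid(a) for a in affine(p["Wr"], x, p["Ur"], h_prev)]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]   # reset gate filters the past
    h_cand = [math.tanh(a) for a in affine(p["W"], x, p["U"], gated)]
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

# Hypothetical 1-dimensional parameters, all zero for an easy sanity check.
params = {name: [[0.0]] for name in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
print(gru_step([1.0], [0.8], params))
```

With all weights at zero both gates sit at 0.5 and the candidate state is zero, so the new state is exactly half of the previous one.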


6.3 LSTM Cell

Long Short-Term Memory (LSTM), introduced by [Hochreiter_1997], is another approach to overcoming the vanishing gradient problem in RNNs. Like the GRU, the LSTM uses a gating mechanism, but it uses three gates instead of the GRU's two to capture long-term dependencies. The schematic diagram of the LSTM cell is given in Figure 4.

(a) GRU cell
(b) LSTM
Figure 4: Cell Structure

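The LSTM cell uses the input ($i$), output ($o$), and forget ($f$) gates to compute the hidden states. The equations are similar to those of the GRU cell, and the LSTM, like the GRU, uses the sigmoid activation to add non-linearity to the gates.

$$i_t = \sigma(W_i x_t + U_i h_{t-1}), \quad f_t = \sigma(W_f x_t + U_f h_{t-1}), \quad o_t = \sigma(W_o x_t + U_o h_{t-1})$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$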
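A single LSTM step can be sketched in the same style. The sketch assumes the widely used formulation with a separate cell state c and no bias terms; the parameter names are hypothetical and chosen only for this example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, x, U, h):
    """Return W x + U h for list-of-rows matrices W and U."""
    return [sum(w * xi for w, xi in zip(w_row, x)) +
            sum(u * hi for u, hi in zip(u_row, h))
            for w_row, u_row in zip(W, U)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state and the new cell state."""
    i = [sigmoid(a) for a in affine(p["Wi"], x, p["Ui"], h_prev)]   # input gate
    f = [sigmoid(a) for a in affine(p["Wf"], x, p["Uf"], h_prev)]   # forget gate
    o = [sigmoid(a) for a in affine(p["Wo"], x, p["Uo"], h_prev)]   # output gate
    g = [math.tanh(a) for a in affine(p["Wc"], x, p["Uc"], h_prev)] # candidate
    c = [fi * cp + ii * gi for fi, cp, ii, gi in zip(f, c_prev, i, g)]
    h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return h, c

names = ("Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc")
params = {name: [[0.0]] for name in names}
print(lstm_step([1.0], [0.0], [1.0], params))
```

The additive update of the cell state is what lets gradients flow across many time steps without vanishing as quickly as in a plain RNN.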


6.4 Attention Mechanism

The attention mechanism was first developed by [Bahdanau_2014] in their paper "Neural Machine Translation by Jointly Learning to Align and Translate", as a natural extension of their previous work on the sequence-to-sequence encoder-decoder model. Attention was proposed to mitigate a limitation of the encoder-decoder architecture, which encodes the input sequence into one fixed-length vector from which the output is decoded at each time step. This limitation is more of an issue when decoding long sequences. Attention is proposed as a single method to both align and translate. Alignment is the problem in machine translation of identifying which parts of the input sequence are relevant to each word in the output, whereas translation is the process of using the relevant information to select the appropriate output.

6.5 Transformer Network

We use the Transformer self-attention encoder for our study. The Transformer model [Vaswani_2017] is made up of $B$ consecutive blocks. Each block of the Transformer, denoted by $\mathrm{Transformer}_b$, contains two separate components: multi-head attention and a feed-forward network. The output $u_i^{(b)}$ of each token $i$ of block $b$ is connected to its input $u_i^{(b-1)}$ by a residual connection. The input to the first block is $u_i^{(0)} = x_i$:

$$u_i^{(b)} = u_i^{(b-1)} + \mathrm{Transformer}_b(u_i^{(b-1)})$$

Multi-head attention applies self-attention over the same inputs multiple times using separately normalized parameters (attention heads) and finally concatenates the results of each head. It is considered a better alternative to applying one pass of attention with more parameters, as it can be easily parallelized. Furthermore, computing the attention with multiple heads makes it easier for the model to learn to attend to different types of relevant information with each head. The self-attention updates the input $u_i^{(b-1)}$ by computing a weighted sum over all tokens in the sequence, weighted by their importance for modeling token $i$.

Each input inside the multi-head attention is projected to a query, key, and value ($q$, $k$, $v$), respectively. $q$, $k$, and $v$ are all of dimension $d/H$, where $H$ is the number of heads and $d$ is the dimension of the embedding. The attention weights $a_{ijh}$ for head $h$ between tokens $i$ and $j$ are given by the scaled dot product between $q_{ih}$ and $k_{jh}$:

$$a_{ijh} = \mathrm{softmax}_j\left(\frac{q_{ih}^{\top} k_{jh}}{\sqrt{d/H}}\right), \qquad o_{ih} = \sum_{j} a_{ijh} \, v_{jh}$$

Finally, the outputs of the heads in multi-head attention are concatenated serially.
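The head-wise computation and final concatenation can be sketched as follows. To keep the example short, the learned query, key, and value projections are replaced by identity slices of each token vector, which is an assumption and not how a trained Transformer behaves; all function names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(queries, keys, values):
    """Scaled dot-product attention for one head."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

def multi_head_self_attention(tokens, num_heads):
    """Split each d-dim token into H slices of size d/H, attend, concatenate."""
    d = len(tokens[0])
    if d % num_heads != 0:
        raise ValueError("embedding size must be divisible by the number of heads")
    size = d // num_heads
    heads = []
    for h in range(num_heads):
        part = [t[h * size:(h + 1) * size] for t in tokens]  # identity projection
        heads.append(attention_head(part, part, part))
    # Concatenate the per-head outputs for every token.
    return [sum((heads[h][i] for h in range(num_heads)), [])
            for i in range(len(tokens))]

tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_self_attention(tokens, 2))
```

Each head works on its own slice independently of the others, which is what makes the heads straightforward to compute in parallel.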

7 Experiments and results

7.1 Neural models

For this study, a self-attention-based Transformer network is implemented and compared with sequence-to-sequence and attention-based encoder-decoder neural architectures. All implementation is done using the above-mentioned programming framework. We train all three models end to end on the CFILT Hindi-English parallel corpus, and the results are compared using similar hyper-parameter values across models for ease of comparison.

For the sequence-to-sequence model, we use LSTM cells, since no attention mechanism is involved and it is desirable for the model to capture long-term dependencies. We chose the LSTM because it is generally considered better than the GRU at capturing long-term dependencies. The embedding and hidden layers have 512 dimensions, and both the encoder and the decoder contain 2 hidden layers. For regularization, we use dropout with a rate of 20 percent. A batch size of 128 is used.

For the attention-based RNN search, we use GRU cells, since the attention mechanism is already employed and captures long-term dependencies explicitly through the attention values. The GRU cell is also computationally cheaper than the LSTM cell. As in our sequence-to-sequence model, the embedding and hidden layers have 512 dimensions, and both the encoder and the decoder contain 2 hidden layers. For regularization, we use dropout with a rate of 20 percent. A batch size of 128 is used.

For the self-attention Transformer network, the hidden and embedding layers are of size 512, and for each of the encoder and decoder we fix the number of self-attention layers at 4. In each layer, we assign 8 parallel attention heads, and the hidden size of the feed-forward network is 1024 in each cell. Attention dropout and residual dropout are set to 10 percent. Table 2 shows the number of trainable parameters in each of our three models.

Model                         Trainable parameters
Sequence-to-Sequence          38,678,595
Attention Encoder-Decoder     38,804,547
Self-Attention Transformer    122,699,776
Table 2: Number of trainable parameters in each model

The optimizer used for our study is the Adam optimizer, with a learning_rate decay value of 0.001, a $\beta_1$ value of 0.9, and a $\beta_2$ value of 0.98 for the first-order and second-order gradient moments, respectively. The training loss is the log-loss error between the predicted and target words of the sentence. All the models are trained for 100,000 steps, where in each step a batch of 128 is used to calculate the loss function. The primary objective of our training is to minimize the log-loss error between the predicted and target sentences while maximizing the evaluation metric, which is chosen to be the BLEU score.
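For reference, one Adam update for a single scalar parameter can be sketched as below, using the moment coefficients 0.9 and 0.98 stated above. Treating 0.001 as the step size and using 1e-9 as the stabilising epsilon are assumptions, and `adam_step` is a hypothetical helper rather than the training code used in this study.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.98, eps=1e-9):
    """Apply one Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimise f(w) = w ** 2 from a hypothetical starting point.
w, m, v = 1.0, 0.0, 0.0
for step in range(1, 4):
    w, m, v = adam_step(w, 2.0 * w, m, v, step)
    print(step, w)
```

Because the update divides by the running root-mean-square of the gradient, the first step moves the parameter by almost exactly the step size regardless of the gradient's scale.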


8 Results

Table 3 shows the BLEU scores of all three models on the English-Hindi and Hindi-English directions of CFILT's test dataset. From these results, it is evident that the Transformer model achieves a higher BLEU score (13.96 on Hindi-English and 13.47 on English-Hindi) than both the attention encoder-decoder and the sequence-to-sequence model. The attention encoder-decoder achieves the second-best BLEU score, and the sequence-to-sequence model performs the worst of the three. This supports the point that when dealing with long source and target sentences, an attention mechanism is needed to capture long-term dependencies, and that we can rely solely on the attention mechanism, dropping recurrent cells completely for the machine translation task.

Model                         Hindi-English    English-Hindi
Sequence-to-Sequence          9.40             8.38
Attention Encoder-Decoder     11.59            10.13
Self-Attention Transformer    13.96            13.47
Table 3: Model performance in terms of BLEU score on the Hindi-English and English-Hindi translation tasks
(a) Sample 1
(b) Sample 2
Figure 5: Hindi-English Heatmap
(a) Sample 1
(b) Sample 2
Figure 6: English-Hindi Heatmap sample

Figure 5 shows the word-to-word association heat maps for selected translated and target sentences when the Transformer model is trained on the Hindi-English translation task; similarly, Figure 6 shows the heat maps when the model is trained on the English-Hindi translation task.

9 Conclusion

In this paper, we first discussed machine translation, beginning with a brief overview of the basic machine translation objective and terminology along with early statistical approaches (SMT). We then discussed the role of deep learning models in improving different components of SMT, and moved on to end-to-end neural machine translation (NMT), focusing largely on the basic encoder-decoder NMT and the attention-based model. We then listed the challenges facing neural translation models and mentioned future fields of study and open-ended problems. Finally, we proposed a self-attention Transformer network for Hindi-English translation and compared this model with other neural machine translation models on the basis of BLEU. We concluded our study by delineating the advantages and disadvantages of the three models.