Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation

06/18/2021
by Raj Dabre, et al.

In deep neural network modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in order to obtain high-quality continuous space representations, which in turn improve the quality of the network's predictions. Conventionally, each layer in the stack has its own parameters, which leads to a significant increase in the number of model parameters. In this paper, we propose to share parameters across all layers, thereby leading to a recurrently stacked neural network model. We report on an extensive case study of neural machine translation (NMT), where we apply our proposed method to an encoder-decoder based neural network model, namely the Transformer model, and experiment with three Japanese–English translation datasets. We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer 6 times, despite having significantly fewer parameters, approaches that of a model that stacks 6 layers with distinct parameters per layer. We also explore the limits of recurrent stacking by training extremely deep NMT models. This paper further examines the utility of our recurrently stacked model as a student model in transfer learning, leveraging pre-trained parameters and knowledge distillation, and shows that this compensates for the drop in translation quality that direct training of a recurrently stacked model incurs. We also show how transfer learning enables faster decoding, on top of the parameter savings that recurrent stacking already provides. Finally, we analyze the effects of recurrently stacked layers by visualizing the attention of models with and without recurrent stacking.
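The core idea, reusing one set of layer parameters at every depth step instead of stacking distinct layers, can be illustrated with a minimal PyTorch sketch. The class and hyperparameter names below (RecurrentlyStackedEncoder, d_model, nhead, num_steps) are illustrative assumptions and not the paper's actual implementation.

```python
import torch.nn as nn

class RecurrentlyStackedEncoder(nn.Module):
    """Hypothetical sketch: one shared Transformer encoder layer applied recurrently."""

    def __init__(self, d_model=512, nhead=8, num_steps=6):
        super().__init__()
        # A single layer; its parameters are reused at every depth step.
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_steps = num_steps

    def forward(self, x, src_key_padding_mask=None):
        # Recurrent stacking: the same weights are applied num_steps times.
        for _ in range(self.num_steps):
            x = self.layer(x, src_key_padding_mask=src_key_padding_mask)
        return x

# Parameter comparison against a conventional encoder with 6 distinct layers.
recurrent = RecurrentlyStackedEncoder(num_steps=6)
conventional = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=6
)
print(sum(p.numel() for p in recurrent.parameters()))     # roughly one layer's parameters
print(sum(p.numel() for p in conventional.parameters()))  # roughly six times as many
```

Because nn.TransformerEncoder deep-copies its layer, the conventional 6-layer encoder holds about six times as many parameters, while the recurrently stacked variant keeps only a single layer's worth while still being applied 6 times at inference.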

Related research

07/14/2018 · Recurrent Stacking of Layers for Compact Neural Machine Translation Models
In Neural Machine Translation (NMT), the most common practice is to stac...

08/27/2019 · Multi-Layer Softmaxing during Training Neural Machine Translation for Flexible Decoding with Fewer Layers
This paper proposes a novel procedure for training an encoder-decoder ba...

02/20/2020 · Balancing Cost and Benefit with Tied-Multi Transformers
We propose and evaluate a novel procedure for training multiple Transfor...

10/22/2020 · Not all parameters are born equal: Attention is mostly what you need
Transformers are widely used in state-of-the-art machine translation, bu...

02/15/2019 · SVM-based Deep Stacking Networks
The deep network model, with the majority built on neural networks, has ...

01/23/2017 · Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
The capacity of a neural network to absorb information is limited by its...

03/28/2022 · MixNN: A design for protecting deep learning models
In this paper, we propose a novel design, called MixNN, for protecting d...
