Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

05/10/2023
by Ye Lin, et al.

For years, model performance in machine learning has obeyed a power-law relationship with model size. For parameter efficiency, recent studies have focused on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the Transformer through a parameter-efficient multi-path structure. To better fuse the features extracted from different paths, we add three operations to each sublayer: a normalization at the end of each path, a cheap operation to produce more features, and a learnable weighting mechanism to fuse all features flexibly. Extensive experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can match or even outperform the deeper model. This suggests that the multi-path structure deserves more attention, and that depth and width should be balanced to train better large-scale Transformers.
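The three path-level operations can be pictured with a small sketch. The PyTorch module below is a minimal illustration under stated assumptions, not the paper's released code: the per-path feed-forward sub-module, the choice of a shared linear map as the "cheap operation", and the softmax fusion weights are all choices made for the example.

```python
import torch
import torch.nn as nn


class MultiPathSublayer(nn.Module):
    """Minimal sketch of a multi-path sublayer with the three extra operations
    described in the abstract: (1) a normalization at the end of each path,
    (2) a cheap operation producing additional features, and (3) a learnable
    weighting that fuses all features. Module choices are illustrative only."""

    def __init__(self, d_model: int = 512, n_paths: int = 2, d_hidden: int = 1024):
        super().__init__()
        # One narrow feed-forward sub-module per path (assumed path contents).
        self.paths = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_paths)
        ])
        # (1) Normalization at the end of each path.
        self.path_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_paths)])
        # (2) A cheap operation (here a single shared linear map) that doubles
        #     the number of features per path.
        self.cheap = nn.Linear(d_model, d_model)
        # (3) Learnable fusion weights over all 2 * n_paths feature maps.
        self.fusion_logits = nn.Parameter(torch.zeros(2 * n_paths))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        for path, norm in zip(self.paths, self.path_norms):
            h = norm(path(x))            # path output, normalized at its end
            feats.append(h)
            feats.append(self.cheap(h))  # cheap extra feature from the same path
        stacked = torch.stack(feats)                      # (2*n_paths, B, T, d_model)
        weights = torch.softmax(self.fusion_logits, dim=0)
        fused = torch.einsum("k,k...->...", weights, stacked)
        return x + fused                                  # residual connection


# Usage: one sublayer on a (batch, seq_len, d_model) tensor.
layer = MultiPathSublayer(d_model=512, n_paths=2)
out = layer(torch.randn(8, 32, 512))
print(out.shape)  # torch.Size([8, 32, 512])
```

Under this sketch, the multi-path block adds only the per-path LayerNorms, one shared linear map, and 2·n_paths scalar weights on top of the parallel sub-modules, which is the kind of parameter-efficient widening the abstract argues for.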

Related research

Recurrent multiple shared layers in Depth for Neural Machine Translation (08/23/2021)
Learning deeper models is usually a simple and effective approach to imp...

Multi-Unit Transformers for Neural Machine Translation (10/21/2020)
Transformer models achieve remarkable success in Neural Machine Translat...

Residual Tree Aggregation of Layers for Neural Machine Translation (07/19/2021)
Although attention-based Neural Machine Translation has achieved remarka...

On the Sparsity of Neural Machine Translation Models (10/06/2020)
Modern neural machine translation (NMT) models employ a large number of ...

Neutron: An Implementation of the Transformer Translation Model and its Variants (03/18/2019)
The Transformer translation model is easier to parallelize and provides ...

Power Law Graph Transformer for Machine Translation and Representation Learning (06/27/2021)
We present the Power Law Graph Transformer, a transformer model with wel...
