
NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems

by   Ozan Caglayan, et al.

In this paper, we present nmtpy, a flexible Python toolkit based on Theano for training Neural Machine Translation and other neural sequence-to-sequence architectures. nmtpy decouples the specification of a network from the training and inference utilities to simplify the addition of a new architecture and reduce the amount of boilerplate code to be written. nmtpy has been used for LIUM's top-ranked submissions to WMT Multimodal Machine Translation and News Translation tasks in 2016 and 2017.



1 Overview

nmtpy is a refactored, extended and Python 3-only version of dl4mt-tutorial, a Theano (Theano Development Team, 2016) implementation of attentive Neural Machine Translation (NMT) (Bahdanau et al., 2014).

Development of nmtpy started in March 2016 as an effort to adapt dl4mt-tutorial to multimodal translation models, and the project was open-sourced under the MIT license in March 2017. nmtpy has since become a powerful toolkit where adding a new model is as simple as deriving from an abstract base class, filling in a set of fundamental methods and (optionally) implementing a custom data iterator. The training and inference utilities are as model-agnostic as possible, so they can be used for different sequence generation networks such as multimodal NMT and image captioning. This flexibility and the rich set of provided architectures (Section 3) are what differentiate nmtpy from Nematus (Sennrich et al., 2017), another NMT toolkit derived from dl4mt-tutorial.

2 Workflow

Figure 1 describes the general workflow of a training session. An experiment in nmtpy is described with a configuration file (Appendix A) to ensure reusability and reproducibility. A training experiment is launched by providing this configuration file to nmt-train, which sets up the environment and starts the training. Specifically, nmt-train automatically selects a free GPU, sets the seed for all random number generators and finally creates an instance of the model given by the model_type option. Architecture-specific steps like data loading, weight initialization and graph construction are delegated to the model instance. The corresponding log file and model checkpoints are named to reflect the experiment options determined by the configuration file (Example: model_type-e<embdim>-r<rnndim>-<opt>_<lrate>...).

Figure 1: The components of nmtpy.

Once everything is ready, nmt-train starts consuming mini-batches of data from the model’s iterator to perform forward/backward passes along with the weight updates. Validation on a held-out corpus is performed periodically to evaluate the generalization performance of the model. Specifically, after every valid_freq updates, nmt-train calls the nmt-translate utility, which performs beam-search decoding, computes the requested metrics and returns the results so that nmt-train can track the progress and save the best checkpoints to disk.

Several examples regarding the usage of the utilities are given in Appendix  B.

2.1 Adding New Architectures

New architectures can be defined by creating a new file under nmtpy/models/ using a copy of an existing architecture and modifying the following predefined methods:

  • __init__(): Instantiates a model. Keyword arguments can be used to add options specific to the architecture that will be automatically gathered from the configuration file by nmt-train.

  • init_params(): Initializes the layers and weights.

  • build(): Defines the Theano computation graph that will be used during training.

  • build_sampler(): Defines the Theano computation graph that will be used during beam-search. This is generally very similar to build() but with sequential RNN steps and non-masked tensors.

  • load_valid_data(): Loads the validation data for perplexity computation.

  • load_data(): Loads the training data.
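The method list above can be pictured with a minimal, Theano-free sketch. The names BaseModel and ToyModel are hypothetical and the method bodies are placeholders; this only illustrates the shape of the contract (an abstract base class plus a subclass that fills in the predefined methods), not the toolkit's actual base class.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Hypothetical stand-in for an abstract model base class."""

    def __init__(self, **kwargs):
        # Architecture-specific options arrive as keyword arguments,
        # mirroring options gathered from a configuration file.
        self.options = dict(kwargs)
        self.params = {}

    @abstractmethod
    def init_params(self):
        """Initialize the layers and weights."""

    @abstractmethod
    def build(self):
        """Define the training computation graph."""

    @abstractmethod
    def build_sampler(self):
        """Define the computation graph used during beam-search."""

    @abstractmethod
    def load_data(self):
        """Load the training data."""

    @abstractmethod
    def load_valid_data(self):
        """Load the validation data."""


class ToyModel(BaseModel):
    def __init__(self, rnn_dim=100, embedding_dim=100, **kwargs):
        super().__init__(rnn_dim=rnn_dim, embedding_dim=embedding_dim, **kwargs)

    def init_params(self):
        # A real model would allocate weight matrices; only shapes are recorded here.
        self.params["W_emb"] = (self.options["embedding_dim"],)
        self.params["W_rnn"] = (self.options["rnn_dim"], self.options["rnn_dim"])

    def build(self):
        return "training graph placeholder"

    def build_sampler(self):
        return "sampling graph placeholder"

    def load_data(self):
        self.train_data = []

    def load_valid_data(self):
        self.valid_data = []


model = ToyModel(rnn_dim=8, layer_norm=True)
model.init_params()
```

Because every required method is abstract, forgetting one of them fails at instantiation time rather than in the middle of a training run.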

2.2 Building Blocks

In this section, we introduce the currently available components and features of nmtpy that one can use to design their architecture.



nmtpy provides Theano implementations of stochastic gradient descent (SGD) and its adaptive variants RMSProp (Tieleman & Hinton, 2012), Adadelta (Zeiler, 2012) and Adam (Kingma & Ba, 2014) to optimize the weights of the trained network. Preliminary support for gradient noise (Neelakantan et al., 2015) is available for Adam. Gradient norm clipping (Pascanu et al., 2013) is enabled by default with a threshold of 5 to avoid exploding gradients. Although the provided architectures all use the cross-entropy objective, any differentiable objective function can be used since the training loop is agnostic to the architecture being trained.
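As an illustration of gradient norm clipping, here is a minimal NumPy sketch. It assumes the norm is computed jointly over all gradient arrays (a global norm), which is one common reading of Pascanu et al. (2013); the function name clip_by_global_norm is hypothetical and this is not the toolkit's Theano code.

```python
import numpy as np


def clip_by_global_norm(grads, threshold=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most `threshold`."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


# A gradient with norm 10 is rescaled to norm 5; its direction is preserved.
clipped, norm_before = clip_by_global_norm([np.array([6.0, 8.0])], threshold=5.0)
```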


A dropout (Srivastava et al., 2014) layer, which can be placed after any feed-forward layer in the architecture, is available. This layer works in inverse mode, where the magnitudes are scaled during training instead of testing. Additionally, an L2 regularization loss with a scalar factor defined by the decay_c option in the configuration can be added to the training loss.
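The inverse (often called "inverted") dropout mode and the L2 term can be sketched in NumPy as follows. The function names are hypothetical, and the exact form of the penalty (here decay_c times the sum of squared weights) is an assumption rather than something the text specifies.

```python
import numpy as np


def inverted_dropout(x, p_drop, rng, train=True):
    """Drop units with probability p_drop and rescale survivors during training."""
    if not train or p_drop == 0.0:
        # At test time the activations pass through unchanged.
        return x
    keep = 1.0 - p_drop
    mask = rng.binomial(1, keep, size=x.shape)
    return x * mask / keep


def l2_penalty(params, decay_c):
    """Assumed form of the L2 term: decay_c * sum of squared weights."""
    return decay_c * sum(float(np.sum(p ** 2)) for p in params)


rng = np.random.default_rng(0)
activations = np.ones((4, 5))
dropped = inverted_dropout(activations, p_drop=0.5, rng=rng, train=True)
```

Scaling by 1/keep during training keeps the expected activation unchanged, which is why no rescaling is needed at test time.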


The weight initialization is governed by the weight_init option and supports Xavier (Glorot & Bengio, 2010) and He (He et al., 2015) initialization methods besides orthogonal (Saxe et al., 2013) and random normal.
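For reference, the named initialization schemes can be sketched in NumPy. These follow common textbook forms (uniform Xavier, normal He, SVD-based orthogonal); the variants actually selected by weight_init may differ, and the function names are hypothetical.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    """Glorot & Bengio (2010): uniform in [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def he_normal(fan_in, fan_out, rng):
    """He et al. (2015): zero-mean normal with standard deviation sqrt(2 / fan_in)."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)


def orthogonal(dim, rng):
    """Square orthogonal matrix taken from the SVD of a random normal matrix."""
    u, _, _ = np.linalg.svd(rng.standard_normal((dim, dim)))
    return u


rng = np.random.default_rng(0)
w_xavier = xavier_uniform(30, 20, rng)
w_orth = orthogonal(16, rng)
```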


The following layers are available in the latest version of nmtpy:

  • Feed-forward and highway layer (Srivastava et al., 2015)

  • Gated Recurrent Unit (GRU) (Chung et al., 2014)

  • Conditional GRU (CGRU) (Firat & Cho, 2016)

  • Multimodal CGRU (Caglayan et al., 2016a, b)

Layer normalization (Ba et al., 2016), a method that adaptively learns to scale and shift the incoming activations of a neuron, can be enabled for GRU and CGRU blocks.
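A minimal NumPy sketch of layer normalization, following the formulation of Ba et al. (2016): activations are normalized over the feature axis and then scaled and shifted by learned parameters. The names gain, bias and eps are illustrative, and the sketch ignores how the operation is wired into the GRU gates.

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each row of x over its features, then apply a learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    std = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return gain * (x - mean) / std + bias


x = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 0.0]])
normalized = layer_norm(x, gain=np.ones(4), bias=np.zeros(4))
```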


Monolingual and parallel text iterators with compressed (.gz, .bz2, .xz) file support are available under the names TextIterator and BiTextIterator, respectively. Additionally, the multimodal WMTIterator allows using image features and source/target sentences at the same time for multimodal NMT (Section 3.3). We recommend using shuffle_mode:trglen where it is implemented, as it speeds up training by efficiently batching same-length sequences.
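The idea behind batching same-length sequences can be sketched as follows. This is an assumed reading of what a target-length shuffle mode does (group by target length, shuffle within groups, then shuffle the batch order); the function name is hypothetical and the real iterator may differ.

```python
import random
from collections import defaultdict


def length_bucketed_batches(target_sentences, batch_size, seed=0):
    """Return batches of sample indices whose target sentences share one length."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, sentence in enumerate(target_sentences):
        buckets[len(sentence)].append(idx)

    batches = []
    for indices in buckets.values():
        rng.shuffle(indices)  # randomize order within a length bucket
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    rng.shuffle(batches)  # avoid always presenting short sentences first
    return batches


sentences = [["a"], ["b", "c"], ["d"], ["e", "f"], ["g"]]
batches = length_bucketed_batches(sentences, batch_size=2)
```

Since every sequence in a batch has the same target length, no computation is spent on target-side padding.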


All decoded translations will be post-processed if the filter option is given in the configuration file. This is useful when one would like to compute automatic metrics on surface forms instead of segmented ones. Currently available filters are bpe and compound, for cleaning subword BPE (Sennrich et al., 2016) and German compound-splitting (Sennrich & Haddow, 2015) segmentation respectively.
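A sketch of what a BPE clean-up filter does, assuming the "@@ " continuation marker used by the subword-nmt tools; the marker handled by the toolkit's own bpe filter is not stated in the text, and the function name is hypothetical.

```python
import re

# "@@ " joins a subword to the next token; a trailing "@@" is also removed.
_BPE_MARKER = re.compile(r"@@ |@@ ?$")


def bpe_filter(line):
    """Merge BPE-segmented subwords back into surface forms."""
    return _BPE_MARKER.sub("", line)


restored = bpe_filter("the new@@ est build@@ ing")
```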



nmt-train performs patience-based early stopping using either validation perplexity or one of the following external evaluation metrics:

  • bleu: Wrapper around Moses multi-bleu BLEU (Papineni et al., 2002)

  • bleu_v13a: A Python reimplementation of Moses BLEU

  • meteor: Wrapper around METEOR (Lavie & Agarwal, 2007)

The above metrics are also available for nmt-translate to immediately score the produced hypotheses. Other metrics can be easily added and made available as early-stopping metrics.
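Patience-based early stopping can be sketched as a small bookkeeping class. The class name and the exact stopping rule (stop once patience consecutive validations fail to improve on the best score) are assumptions for illustration; perplexity is handled by treating lower values as better.

```python
class EarlyStopper:
    """Track the best validation score and signal when patience is exhausted."""

    def __init__(self, patience, higher_is_better=True):
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best = None
        self.bad_validations = 0

    def update(self, score):
        """Record a validation score; return True if training should stop."""
        if self.best is None:
            improved = True
        elif self.higher_is_better:
            improved = score > self.best
        else:
            improved = score < self.best

        if improved:
            self.best = score
            self.bad_validations = 0
        else:
            self.bad_validations += 1
        return self.bad_validations >= self.patience


stopper = EarlyStopper(patience=2)  # e.g. for BLEU or METEOR
decisions = [stopper.update(s) for s in [10.0, 11.0, 10.5, 10.9]]
```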

3 Architectures

3.1 NMT

The default NMT architecture (attention) is based on the original dl4mt-tutorial implementation which differs from Bahdanau et al. (2014) in the following major aspects:

  • CGRU decoder which consists of two GRU layers interleaved with attention mechanism.

  • The hidden state of the decoder is initialized with a non-linear transformation applied to the mean bi-directional encoder state, in contrast to the last bi-directional encoder state.

  • The Maxout (Goodfellow et al., 2013) hidden layer before the softmax operation is removed.

In addition, nmtpy offers the following configurable options for this NMT:

  • layer_norm Enables/disables layer normalization for bi-directional GRU encoder.

  • init_cgru Allows initializing CGRU with all-zeros instead of mean encoder state.

  • n_enc_layers Number of additional unidirectional GRU encoders to stack on top of bi-directional encoder.

  • tied_emb Allows sharing feedback embeddings and output embeddings (2way) or all embeddings in the network (3way) (Inan et al., 2016; Press & Wolf, 2016).

  • *_dropout Dropout probabilities for three dropout layers placed after source embeddings (emb_dropout), encoder hidden states (ctx_dropout) and pre-softmax activations (out_dropout).

3.2 Factored NMT

Factored NMT (FNMT) is an extension of NMT which is able to generate two output symbols. The architecture of such a model is presented in Figure 2. In contrast to multi-task architectures, FNMT outputs share the same recurrence and output symbols are generated in a synchronous fashion. (FNMT currently uses a dedicated nmt-translate-factors utility, though it will probably be merged in the near future.)

Figure 2: Global architecture of the Factored NMT system.

Two FNMT variants which differ in how they handle the output layer are currently available:

  • attention_factors: the lemma and factor embeddings are concatenated to form a single feedback embedding.

  • attention_factors_seplogits: the output paths for lemmas and factors are kept separate, with different pre-softmax transformations applied for specialization.

FNMT with lemmas and linguistic factors has been successfully used for the IWSLT’16 English→French (García-Martínez et al., 2016) and WMT’17 English→Latvian and English→Czech evaluation campaigns.

3.3 Multimodal NMT & Captioning

We provide several multimodal architectures (Caglayan et al., 2016a, b) where the probability of a target word is conditioned on source sentence representations and convolutional image features (Figure 3). More specifically, these architectures extend the monomodal CGRU into a multimodal one where the attention mechanism can be shared or separate between input modalities. A late fusion of the attended context vectors is done by either summing or concatenating the modality-specific representations.

Our attentive multimodal system for the Multilingual Image Description Generation track of WMT’16 Multimodal Machine Translation surpassed the baseline architecture (Elliott et al., 2015) by +1.1 METEOR and +3.4 BLEU and ranked first among multimodal submissions (Specia et al., 2016).
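The late fusion step can be illustrated with a small NumPy sketch. Attention scores are taken as given here (how they are computed from the decoder state is omitted), and the function names are hypothetical; only the two fusion modes named above, summing and concatenating, are shown.

```python
import numpy as np


def softmax(scores):
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    return exp / exp.sum()


def attend(annotations, scores):
    """Context vector: softmax-weighted sum of annotation vectors (n_items, dim)."""
    return softmax(scores) @ annotations


def fuse(text_ctx, image_ctx, mode="sum"):
    """Late fusion of modality-specific context vectors."""
    if mode == "sum":
        return text_ctx + image_ctx  # requires equal dimensionality
    if mode == "concat":
        return np.concatenate([text_ctx, image_ctx], axis=-1)
    raise ValueError("unknown fusion mode: %s" % mode)


text_ctx = attend(np.eye(3), scores=np.zeros(3))  # 3 source positions
image_ctx = attend(np.ones((2, 3)), scores=np.array([0.5, -0.5]))  # 2 image regions
fused_sum = fuse(text_ctx, image_ctx, mode="sum")
fused_cat = fuse(text_ctx, image_ctx, mode="concat")
```

Summing keeps the fused vector at the original dimensionality, while concatenation doubles it and leaves the weighting of the two modalities to the following layer.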

Figure 3: The architecture of multimodal attention (Caglayan et al., 2016b).

3.4 Language Modeling

A GRU-based language model architecture (rnnlm) is available in the repository which can be used with nmt-test-lm to obtain language model scores.

3.5 Image Captioning

A GRU-based reimplementation of the Show, Attend and Tell architecture (Xu et al., 2015), which learns to generate a natural language description by applying soft attention over convolutional image features, is available under the name img2txt. This architecture was recently used as a baseline system for the Multilingual Image Description Generation track of the WMT’17 Multimodal Machine Translation shared task.

4 Tools

In this section we present translation and rescoring utilities nmt-translate and nmt-rescore. Other auxiliary utilities are briefly described in Appendix C.

4.1 nmt-translate

nmt-translate is responsible for translation decoding using the beam-search method defined by the NMT architecture. This default beam-search supports single and ensemble decoding for both monomodal and multimodal translation models. If a given architecture reimplements the beam-search method in its class, that one will be used instead.
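For readers unfamiliar with beam-search, the following is a generic, model-agnostic sketch and not the toolkit's implementation. The callable step_log_probs is a hypothetical stand-in for a model's sampler; the sketch omits ensembling, batching and length normalization.

```python
import numpy as np


def beam_search(step_log_probs, beam_size, eos_id, max_len):
    """Return the (tokens, log-prob) hypothesis with the highest cumulative score.

    step_log_probs(prefix) must return a 1-D array of log-probabilities
    over the vocabulary for the next token given the tuple `prefix`.
    """
    beams = [((), 0.0)]  # live hypotheses: (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = step_log_probs(tokens)
            # Only the beam_size best continuations of a hypothesis can survive.
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((tokens + (int(tok),), score + float(log_probs[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)

        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break

    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    # Vocabulary: 0 = end-of-sentence, 1 and 2 = ordinary words.
    probs = [0.1, 0.6, 0.3] if len(prefix) == 0 else [0.8, 0.1, 0.1]
    return np.log(np.array(probs))


best_tokens, best_score = beam_search(toy_model, beam_size=2, eos_id=0, max_len=5)
```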

Since the number of CPUs in a single machine is 2x-4x higher than the number of GPUs and we mainly reserve the GPUs for training, nmt-translate makes use of CPU workers for maximum efficiency. More specifically, each worker receives a model instance (or instances when ensembling) and performs the beam-search on samples that it continuously fetches from a shared queue. This queue is filled by the master process using the iterator provided by the model.

One thing to note for parallel CPU decoding is that if numpy is linked against a BLAS implementation with threading support enabled (as is the case with Anaconda & Intel MKL), each spawned process attempts to use all available threads in the machine, leading to a resource conflict. In order for nmt-translate to benefit correctly from parallelism, the number of threads per process is thus limited to 1 (this is achieved by setting the X_NUM_THREADS=1 environment variable, where X is one of OPENBLAS, OMP or MKL depending on the numpy installation). The impact of this setting and the overall decoding speed in terms of words/sec (wps) are reported in Table 1 for a medium-sized En→Tr NMT with 10M parameters.

# BLAS Threads   Tesla K40   4 CPU     8 CPU     16 CPU
Default          185 wps     26 wps    25 wps    25 wps
Set to 1         185 wps     109 wps   198 wps   332 wps
Table 1: Median beam-search speed over 3 runs with beam size 12: decoding on a single Tesla K40 GPU is roughly equivalent to using 8 CPUs (Intel Xeon E5-2687v3).
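The thread limit can also be applied from within a Python process, as in the sketch below. Setting all three variables is a belt-and-braces assumption for cases where the BLAS backend is unknown; the variables must be set before numpy is first imported in that process for the limit to take effect.

```python
import os

# Limit BLAS/OpenMP threading to one thread per process.
# This has to happen before the first `import numpy` in the process.
for variable in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[variable] = "1"

import numpy as np  # imported only after the environment is prepared

thread_settings = {name: os.environ[name] for name in
                   ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS")}
```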

4.2 nmt-rescore

A 1-best plain text or n-best hypotheses file can be rescored with nmt-rescore using either a single model or an ensemble of models. Since rescoring a given hypothesis simply means computing its negative log-likelihood given the source sentence, nmt-rescore uses a single GPU to efficiently compute the scores in batched mode. See Appendix B for examples.
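The rescoring computation itself is simple and can be sketched in NumPy. Here the per-step output distributions are assumed to be already available as a (steps, vocabulary) array; in practice they would come from a forward pass of the model, and the function names are hypothetical.

```python
import numpy as np


def sentence_nll(step_probs, token_ids):
    """Negative log-likelihood of a hypothesis given per-step output distributions."""
    step_probs = np.asarray(step_probs)
    picked = step_probs[np.arange(len(token_ids)), token_ids]
    return float(-np.log(picked).sum())


def rescore(nbest):
    """Sort (hypothesis, step_probs, token_ids) triples by ascending NLL."""
    scored = [(sentence_nll(probs, ids), hyp) for hyp, probs, ids in nbest]
    return sorted(scored, key=lambda item: item[0])


nbest = [
    ("hypothesis b", [[0.5, 0.5], [0.5, 0.5]], [1, 0]),
    ("hypothesis a", [[0.9, 0.1], [0.2, 0.8]], [0, 1]),
]
ranked = rescore(nbest)
```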

5 Conclusion

We have presented nmtpy, an open-source sequence-to-sequence framework based on dl4mt-tutorial and refined in many ways to ease the task of integrating new architectures. The toolkit has been used internally in our team for tasks ranging from monomodal, multimodal and factored NMT to image captioning and language modeling, helping to achieve top-ranked submissions in campaigns like IWSLT and WMT.


This work was supported by the French National Research Agency (ANR) through the CHIST-ERA M2CR project, under the contract number ANR-15-CHR2-0006-01.


Appendix A Configuration File Example

# Options in this section are consumed by nmt-train
model_type: attention   # Model type without .py
patience: 20            # early-stopping patience
valid_freq: 1000        # Compute metrics each 1000 updates
valid_metric: meteor    # Use meteor during validations
valid_start: 2          # Start validations after 2nd epoch
valid_beam: 3           # Decode with beam size 3
valid_njobs: 16         # Use 16 processes for beam-search
valid_save_hyp: True    # Save validation hypotheses
decay_c: 1e-5           # L2 regularization factor
clip_c: 5               # Gradient clip threshold
seed: 1235              # Seed for numpy and Theano RNG
save_best_n: 2          # Keep 2 best models on-disk
device_id: auto         # Pick 1st available GPU
snapshot_freq: 10000    # Save a resumable snapshot
max_epochs: 100
# Options below are passed to model instance
tied_emb: 2way          # weight-tying mode (False,2way,3way)
layer_norm: True        # layer norm in GRU encoder
shuffle_mode: trglen    # Shuffled/length-ordered batches
filter: bpe             # post-processing filter(s)
n_words_src: 0          # limit src vocab if > 0
n_words_trg: 0          # limit trg vocab if > 0
save_path: ~/models     # Where to store checkpoints
rnn_dim: 100            # Encoder and decoder RNN dim
embedding_dim: 100      # All embedding dim
weight_init: xavier
batch_size: 32
optimizer: adam
lrate: 0.0004
emb_dropout: 0.2        # Set dropout rates
ctx_dropout: 0.4
out_dropout: 0.4
# Dictionary files produced by nmt-build-dict
src: ~/data/
trg: ~/data/
# Training and validation data
train_src     : ~/data/
train_trg     : ~/data/
valid_src     : ~/data/
valid_trg     : ~/data/
valid_trg_orig: ~/data/

Appendix B Usage Examples

# Launch an experiment
$ nmt-train -c wmt-en-de.conf
# Launch an experiment with different architecture
$ nmt-train -c wmt-en-de.conf 'model_type:my_amazing_nmt'
# Change dimensions
$ nmt-train -c wmt-en-de.conf 'rnn_dim:500' 'embedding_dim:300'
# Force specific GPU device
$ nmt-train -c wmt-en-de.conf 'device_id:gpu5'
Listing 1: Example usage patterns for nmt-train.
# Decode on 30 CPUs with beam size 10, compute BLEU/METEOR
# Language for METEOR is set through source file suffix (.en)
$ nmt-translate -j 30 -m best_model.npz -S val.tok.bpe.en \
                -R -o -M bleu meteor -b 10
# Generate n-best list with an ensemble of checkpoints
$ nmt-translate -m model*npz -S \
                -o -b 50 -N 50
# Generate json file with alignment weights (-e)
$ nmt-translate -m best_model.npz -S val.tok.bpe.en \
                -R -o -e
Listing 2: Example usage patterns for nmt-translate.
# Rescore 50-best list with ensemble of models
$ nmt-rescore -m model*npz -s val.tok.bpe.en \
                -t \
Listing 3: Example usage patterns for nmt-rescore.

Appendix C Description of the provided tools


nmt-build-dict generates .pkl vocabulary files from a preprocessed corpus. A single, combined vocabulary for two or more languages can be created with the -s flag.


Extracts arbitrary weights from a model snapshot which can further be used as pre-trained weights of a new experiment or analyzed using visualization techniques (especially for embeddings).


A stand-alone utility which computes multi-reference BLEU, METEOR, CIDEr (Vedantam et al., 2015) and ROUGE-L (Lin, 2004) using MSCOCO evaluation tools (Chen et al., 2015). Multiple systems can be given with the -s flag to produce a table of scores.


Copy of subword utilities (Sennrich et al., 2016) which are used to first learn a BPE segmentation model over a given corpus file and then apply it to new sentences.


nmt-test-lm computes the language model perplexity of a given corpus.