Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation

Recent work in computer vision has yielded impressive results in automatically describing images with natural language. Most of these systems generate captions in a single language, requiring multiple language-specific models to build a multilingual captioning system. We propose a very simple technique to build a single unified model across languages, using artificial tokens to control the language, making the captioning system more compact. We evaluate our approach on generating English and Japanese captions, and show that a typical neural captioning architecture is capable of learning a single model that can switch between two different languages.


1 Introduction

A key problem at the intersection of computer vision and natural language processing is to automatically describe an image in natural language, and recent work has shown exciting progress using deep neural network-based models [18, 9, 7]. Most of this work generates captions in English, but since the image captioning models do not require linguistic knowledge, this is an arbitrary choice based on the easy availability of training data in English. The same captioning models are applicable for non-English languages as long as sufficiently large training datasets are available [2, 12].

Of course, real image captioning applications will require support for multiple languages. It is possible to build a multilingual captioning system by training a separate model for each individual language, but this requires creating as many models as there are supported languages. A simpler approach would be to create a single unified model that can generate captions in multiple languages. This could be particularly advantageous in resource-constrained environments like mobile devices, where storing and evaluating multiple large neural network models may be impractical. But to what extent can a single model capture the (potentially) very different grammars and styles of multiple languages?

In this short paper, we propose training a unified caption generator that can produce meaningful captions in multiple languages (Figure 1). Our approach is quite simple, but surprisingly effective: we inject an artificial token at the beginning of each sentence to control the language of the caption. During training, this special token informs the network of the language of the ground-truth caption, while at test time, it requests that the model produce a sentence in the specified language. We evaluated our approach using image captioning datasets in English and Japanese. We chose Japanese because it reportedly has the greatest linguistic distance from English among major languages, which means it is the most difficult language for a native English speaker to learn [1]. Our experiments suggest that even for these two very distant languages, a single neural model can produce meaningful multilingual image descriptions.
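The token injection described above can be illustrated with a minimal Python sketch. The token strings `<en>` and `<jp>` are the ones the paper names in Section 3; the helper names and the `<eos>` end marker are assumptions made for this illustration and are not taken from the authors' implementation.

```python
# Language-control tokens as named in the paper; END_TOKEN is an assumed marker.
LANGUAGE_TOKENS = {"en": "<en>", "jp": "<jp>"}
END_TOKEN = "<eos>"


def add_language_token(tokens, lang):
    """Build a training sequence whose first symbol states the caption language."""
    if lang not in LANGUAGE_TOKENS:
        raise ValueError(f"unsupported language: {lang}")
    return [LANGUAGE_TOKENS[lang]] + list(tokens) + [END_TOKEN]


def start_sequence(lang):
    """At test time the decoder is seeded with only the requested language token."""
    return [LANGUAGE_TOKENS[lang]]


# Hypothetical, already-segmented captions for the same image.
english = add_language_token("a man riding a wave".split(), "en")
japanese = add_language_token(["男性", "が", "波", "に", "乗っ", "て", "いる"], "jp")

print(english)
print(japanese)
print(start_sequence("jp"))
```

Because the language is carried by an ordinary vocabulary entry, nothing else in the training or decoding loop has to know which language is being processed.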

Figure 1: In this work, we train a single model for multilingual captioning using artificial tokens to switch languages.

2 Related Work

The latest image captioning systems use multimodal neural networks, inspired by sequence-to-sequence modeling in machine translation [15]. Images are fed into a Convolutional Neural Network (CNN) to extract visual features and then converted to word sequences using a Recurrent Neural Network (RNN) that has been trained on image-sentence ground truth pairs [18, 9, 7].

Most of this work on image captioning has considered a single target language (English), although various other uses of multi-language captioning data have been studied. For instance, Elliott et al. [2] and Miyazaki et al. [12] consider the problem of image captioning for one language when a caption in another language is available. Other work has considered the related but distinct problem of improving machine translation by using images that have been captioned in multiple languages [14, 5]. In contrast, our work assumes that we have training images that have been captioned in multiple languages (i.e., each image has captions in multiple languages, and/or some images are captioned in one language and others are captioned in another), and we wish to apply a single, unified model to produce captions in multiple languages on new, unseen test images.

More generally, multilingual machine translation is an active area of research. The performance of machine translation can be improved when training data in more than two languages is available [19, 3], but the models become more complex as the number of languages increases, because they use separate RNNs for each language. The closest related work to ours is Google’s multilingual translation system [8], which uses artificial tokens to control the languages. We apply a similar idea here for image caption generation.

3 Model

Our model uses a CNN to extract image features and then an RNN to generate captions in a manner very similar to previous work [18, 9]. Formally, we minimize the negative log likelihood of the caption given an image,

$$L = -\sum_{(I, S)} \sum_{t} \log p(S_t \mid I, S_0, \ldots, S_{t-1}), \tag{1}$$

where each pair $(I, S)$ corresponds to an image $I$ and its caption $S$, and $S_t$ is the $t$-th word of $S$. We use a combination of CNN and RNN to model $p_t = p(S_t \mid I, S_0, \ldots, S_{t-1})$ in the following manner:

$$x_{-1} = \mathrm{CNN}(I), \qquad x_t = W_e S_t, \qquad p_{t+1} = \mathrm{RNN}(x_t), \qquad t \in \{0, \ldots, N-1\},$$

where $W_e$ is a word embedding matrix, and $S_t$ is represented as a one-hot vector. We use a special token assigned to $S_0$ to denote the start of the sentence and the captioning language. For example, a monolingual captioning model (a baseline) uses <sos> to indicate the start of the sentence, and the multilingual captioning model uses <en> or <jp> to indicate English or Japanese, respectively. When generating a caption, we use beam search to find the sequence that (approximately) minimizes the negative log likelihood in equation (1) for the given image.
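To make the decoding step concrete, the following Python sketch runs a beam search that is seeded with the language token. It is an illustration only: `TOY_MODEL` is a made-up table of next-word probabilities standing in for the CNN+RNN, the `<eos>` marker is an assumption, and no length normalization is applied. The authors' actual decoder may differ in these details.

```python
import math

# Made-up next-word probabilities keyed by the previous token.
# A real model would condition on the image and the full prefix.
TOY_MODEL = {
    "<en>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.6, "cat": 0.4},
    "dog": {"<eos>": 1.0},
    "cat": {"<eos>": 1.0},
    "<jp>": {"犬": 0.7, "猫": 0.3},
    "犬": {"<eos>": 1.0},
    "猫": {"<eos>": 1.0},
}


def next_log_probs(prefix):
    """Return log-probabilities of each possible next word given the prefix."""
    return {word: math.log(p) for word, p in TOY_MODEL[prefix[-1]].items()}


def beam_search(lang_token, beam_size=5, max_len=10):
    """Return the highest-scoring sequence that starts with lang_token."""
    beams = [(0.0, [lang_token])]  # (sum of log-probabilities, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for word, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            # Completed captions leave the beam; the rest are expanded further.
            (finished if seq[-1] == "<eos>" else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses that were cut off by max_len
    return max(finished, key=lambda c: c[0])[1]


print(beam_search("<en>"))
print(beam_search("<jp>"))
```

Maximizing the summed log-probability is the same as minimizing the negative log likelihood of the caption, and the only thing that changes between languages is the first token of the initial hypothesis.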

4 Experiments

In order to investigate if it is possible to train a unified captioning model across multiple languages, we experiment with English and Japanese, which are the most distant language pair among major languages [1] and thus should be particularly challenging. We evaluate the quality of captions in English and Japanese under various models, including baseline models trained only on individual languages, and the unified model trained with both.


We used the YJ Captions 26k Dataset [12], which is based on a subset of MSCOCO [11]. It has 26,500 images with Japanese captions. We also used the English captions from the corresponding images in the original MSCOCO dataset. We divided the dataset into 22,500 training, 2,000 validation, and 2,000 test images.
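The split sizes above can be reproduced with a short Python sketch. The paper does not say how the partition was drawn, so the seeded random shuffle and the function name `split_dataset` are assumptions. Splitting by image ID rather than by caption is likewise an assumed design choice, made here so that the English and Japanese captions of one image always land in the same partition.

```python
import random


def split_dataset(image_ids, n_val=2000, n_test=2000, seed=0):
    """Shuffle image IDs and split them into train, validation, and test lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is repeatable
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test


# Stand-in IDs for the 26,500 images in the dataset.
train, val, test = split_dataset(range(26500))
print(len(train), len(val), len(test))  # 22500 2000 2000
```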

Implementation Details.

We segmented English sentences into tokens based on white space, and segmented Japanese using TinySegmenter. We implemented a neural captioning model using the Chainer framework [16], consulting the publicly available NeuralTalk implementation for reference [9]. For our CNN, we used ResNet50 [4] pre-trained on ImageNet, and an LSTM with 512 hidden units as our RNN. For training, we used stochastic gradient descent with the Adam algorithm with its default hyperparameters and a batch size of 128. We trained for 40 epochs, chose the model at the epoch with the highest CIDEr validation score, and report the performance on the test dataset. We used a beam size of 5 when generating captions. We made the implementation publicly available.


Train    Test    Bleu1    Bleu2    Bleu3    Bleu4    CIDEr
En       En      0.632    0.443    0.298    0.119    0.651
En+Jp    En      0.558    0.410    0.275    0.184    0.593
Jp       Jp      0.698    0.553    0.410    0.303    0.631
En+Jp    Jp      0.693    0.534    0.414    0.310    0.594
Table 1: Evaluation of captioning results.
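The single-versus-unified comparison in Table 1 can be tallied with a few lines of Python. The numbers are copied from the table; the function names are hypothetical, and the relative-drop figure is a derived quantity that the paper itself does not report.

```python
METRICS = ["Bleu1", "Bleu2", "Bleu3", "Bleu4", "CIDEr"]

# (training languages, test language) -> scores, copied from Table 1.
TABLE_1 = {
    ("En", "En"): [0.632, 0.443, 0.298, 0.119, 0.651],
    ("En+Jp", "En"): [0.558, 0.410, 0.275, 0.184, 0.593],
    ("Jp", "Jp"): [0.698, 0.553, 0.410, 0.303, 0.631],
    ("En+Jp", "Jp"): [0.693, 0.534, 0.414, 0.310, 0.594],
}


def unified_wins(test_lang):
    """List the metrics on which the unified model beats the single-language model."""
    single = TABLE_1[(test_lang, test_lang)]
    unified = TABLE_1[("En+Jp", test_lang)]
    return [m for m, s, u in zip(METRICS, single, unified) if u > s]


def cider_relative_drop(test_lang):
    """Fraction of the single-language CIDEr score lost by the unified model."""
    single = TABLE_1[(test_lang, test_lang)][-1]
    unified = TABLE_1[("En+Jp", test_lang)][-1]
    return (single - unified) / single


for lang in ("En", "Jp"):
    print(lang, unified_wins(lang), f"{cider_relative_drop(lang):.1%}")
```

With these values the unified model is ahead on Bleu4 for English and on Bleu3 and Bleu4 for Japanese, and its CIDEr score is lower by roughly 8.9% (English) and 5.9% (Japanese) in relative terms.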


Table 1 reports the results of our various models using Bleu scores [13], which are designed to evaluate machine translation, and the CIDEr score [17], which is designed to evaluate image captioning. Bleu uses the overlap of $n$-grams between the predicted sentence and ground truth sentences; we follow typical practice and report it for four different values of $n$ (Bleu1 through Bleu4). The results show that Bleu scores decrease slightly from the single-language models to the dual-language models, but are very close (and in fact the dual model performs better than the single-language models for 3 of the 8 Bleu variants). Under CIDEr, the multi-language models again perform slightly worse (0.651 versus 0.593 for English and 0.631 versus 0.594 for Japanese). However, the single unified model effectively uses half as much memory and requires half as much time to generate captions. This is a promising result: despite being trained on very distant languages, the scores indicate that the model can still produce meaningful, linguistically correct captions, with a nearly trivial change to the training and caption generation algorithms.
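As a rough illustration of the $n$-gram overlap that Bleu is built on, the sketch below computes the clipped (modified) $n$-gram precision of one candidate caption against a set of references. It covers only this one ingredient: the full Bleu score also combines several precisions and applies a brevity penalty. The example captions are invented, and this is not the evaluation script used for Table 1.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against multiple references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Invented captions, already split into tokens.
candidate = "a man on a surfboard".split()
references = [
    "a man riding a wave on a surfboard".split(),
    "a surfer rides a wave".split(),
]

print(modified_precision(candidate, references, 1))  # 1.0
print(modified_precision(candidate, references, 2))  # 0.75
```

Here every unigram of the candidate appears in a reference, but the bigram "man on" does not, so three of the four candidate bigrams are matched.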

Some randomly selected captions are shown in Figure 2. We qualitatively observe that the single and dual-language models tend to make similar types of mistakes. For example, all captions for the bottom picture incorrectly mention water-related concepts such as surfboards and umbrellas, presumably due to the prominent water in the scene.

5 Conclusion

We introduced a simple but effective approach that uses artificial tokens to control the language for a multilingual image captioning system. By allowing the system to share a single model across multiple languages, our approach is simpler and more compute- and memory-efficient than building a separate model for each language. Our preliminary experiments suggest that a neural captioning architecture can learn a single model even for English and Japanese, two languages whose linguistic distance is especially high, with minimal decrease in accuracy compared to individual models. In future work, we plan to investigate further refinements that would remove this accuracy gap altogether, as well as to evaluate on multiple languages beyond just English and Japanese.


Satoshi Tsutsui is supported by the Yoshida Scholarship Foundation in Japan. This work was supported in part by the National Science Foundation (CAREER IIS-1253549) and the IU Office of the Vice Provost for Research and the School of Informatics and Computing through the Emerging Areas of Research Funding Program, and used the Romeo FutureSystems Deep Learning facility, supported by Indiana University and NSF RaPyDLI grant 1439007.

Figure 2: Randomly selected samples of automatically-generated image captions. En only captions are from the model trained on English, Jp only are from the model trained on Japanese, and Unified are from the single model trained with both.