1 Introduction

A key problem at the intersection of computer vision and natural language processing is automatically describing an image in natural language, and recent work has shown exciting progress using deep neural network-based models [18, 9, 7]. Most of this work generates captions in English, but since the captioning models themselves do not require linguistic knowledge, this is an arbitrary choice driven by the easy availability of English training data. The same captioning models are applicable to non-English languages as long as sufficiently large training datasets are available [2, 12].
Of course, real image captioning applications will require support for multiple languages. It is possible to build a multilingual captioning system by training a separate model for each individual language, but this requires creating as many models as there are supported languages. A simpler approach would be to create a single unified model that can generate captions in multiple languages. This could be particularly advantageous in resource-constrained environments like mobile devices, where storing and evaluating multiple large neural network models may be impractical. But to what extent can a single model capture the (potentially) very different grammars and styles of multiple languages?
In this short paper, we propose training a unified caption generator that can produce meaningful captions in multiple languages (Figure 1). Our approach is simple but surprisingly effective: we inject an artificial token at the beginning of each sentence to control the language of the caption. During training, this special token informs the network of the language of the ground-truth caption, while at test time it instructs the model to produce a sentence in the specified language. We evaluated our approach using image captioning datasets in English and Japanese. We chose Japanese because it reportedly has the greatest linguistic distance from English among popular languages, making it the most difficult language for a native English speaker to learn [1]. Our experiments suggest that even for these two very distant languages, a single neural model can produce meaningful multilingual image descriptions.
2 Related Work
The latest image captioning systems use multimodal neural networks, inspired by sequence-to-sequence modeling in machine translation [15]. Images are fed into a Convolutional Neural Network (CNN) to extract visual features, which are then converted to word sequences by a Recurrent Neural Network (RNN) trained on image–sentence ground-truth pairs [18, 9, 7].
Most of this work on image captioning has considered a single target language (English), although various other uses of multi-language captioning data have been studied. For instance, Elliott et al. [2] and Miyazaki et al. [12] consider the problem of image captioning in one language when a caption in another language is available. Other work has considered the related but distinct problem of improving machine translation by using images that have been captioned in multiple languages [14, 5]. In contrast, our work assumes that we have training images that have been captioned in multiple languages (i.e., each image has captions in multiple languages, and/or some images are captioned in one language and others in another), and we wish to apply a single, unified model to produce captions in multiple languages on new, unseen test images.
More generally, multilingual machine translation is an active area of research. The performance of machine translation can be improved when training data in more than two languages is available [19, 3], but these models become more complex as the number of languages increases, because they use a separate RNN for each language. The closest related work to ours is Google's multilingual translation system [8], which uses artificial tokens to select the target language. We apply a similar idea here to image caption generation.
3 Model

Our model uses a CNN to extract image features and then an RNN to generate captions, in a manner very similar to previous work [18, 9]. Formally, we minimize the negative log likelihood of the caption given an image,

$$\theta^* = \arg\min_\theta \sum_{(I, S)} -\log p(S \mid I; \theta), \qquad (1)$$

where each pair $(I, S)$ corresponds to an image $I$ and its caption $S = (S_0, S_1, \ldots, S_N)$, and $S_t$ is the $t$-th word of $S$. We use a combination of CNN and RNN to model $p(S \mid I)$ in the following manner:

$$\log p(S \mid I) = \sum_{t=1}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}),$$
$$x_{-1} = \mathrm{CNN}(I), \quad x_t = W_e S_t, \quad p(S_{t+1} \mid I, S_0, \ldots, S_t) = \mathrm{RNN}(x_t),$$

where $W_e$ is a word embedding matrix, and $S_t$ is represented as a one-hot vector. We use a special token assigned to $S_0$ to denote the start of the sentence and the captioning language. For example, a monolingual captioning model (a baseline) uses <sos> to indicate the start of the sentence, while the multilingual captioning model uses <en> or <jp> to indicate English or Japanese, respectively. When generating a caption, we find the sequence that (approximately) satisfies equation (1) using a beam search.
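The language-token mechanism amounts to a one-line change in data preparation. The following Python sketch is our own illustration (the helper names are hypothetical; the token strings follow the <en>/<jp> convention above) of how training captions would be wrapped:

```python
# Illustrative sketch of the artificial-token scheme. A monolingual baseline
# uses a generic <sos> start token; the multilingual model replaces it with a
# language-specific token, so the first symbol fed to the RNN tells the
# decoder which language the rest of the sequence is in.

LANG_TOKENS = {"en": "<en>", "jp": "<jp>"}
EOS = "<eos>"

def prepare_caption(tokens, language):
    """Prepend the language token (doubling as start-of-sentence) and append <eos>."""
    return [LANG_TOKENS[language]] + list(tokens) + [EOS]

en = prepare_caption(["a", "cat", "on", "a", "bed"], "en")
jp = prepare_caption(["ベッド", "の", "上", "の", "猫"], "jp")
assert en == ["<en>", "a", "cat", "on", "a", "bed", "<eos>"]
assert jp[0] == "<jp>"
# At test time, decoding starts from the desired language token and the model
# completes the sentence in that language.
```

At training time both wrapped captions go into the same mini-batches, so the shared parameters see both languages; only the first input symbol differs.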
4 Experiments

To investigate whether it is possible to train a unified captioning model across multiple languages, we experiment with English and Japanese, reportedly the most distant pair among major languages [1] and thus a particularly challenging case. We evaluate the quality of captions in English and Japanese under various models, including baseline models trained only on individual languages and the unified model trained on both.
We used the YJ Captions 26k Dataset [12], which is based on a subset of MSCOCO [11] and provides 26,500 images with Japanese captions. We also used the English captions for the corresponding images from the original MSCOCO dataset. We divided the dataset into 22,500 training, 2,000 validation, and 2,000 test images.
We segmented English sentences into tokens based on white space, and segmented Japanese using TinySegmenter (https://github.com/SamuraiT/tinysegmenter). We implemented a neural captioning model using the Chainer framework [16], consulting the publicly-available NeuralTalk implementation (https://github.com/karpathy/neuraltalk2) for reference. For our CNN, we used ResNet50 [4] pre-trained on ImageNet, and an LSTM [6] with 512 hidden units as our RNN. For training, we used stochastic gradient descent with the Adam algorithm [10] with its default hyperparameters and a batch size of 128. We trained for 40 epochs, chose the model at the epoch with the highest CIDEr validation score, and report its performance on the test dataset. We used a beam size of 5 when generating captions. Our implementation is publicly available at http://vision.soic.indiana.edu/multi-caption/.
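Captions are generated with a beam search (beam size 5 above). The following framework-free Python sketch is our own illustration of the procedure; a toy next-word table stands in for the trained RNN, and the language token chosen as the start symbol steers everything that follows:

```python
import math

def beam_search(start_token, step_fn, beam_size=5, max_len=20, eos="<eos>"):
    """Generic beam search over cumulative log-probabilities.

    step_fn(prefix) -> list of (token, prob) continuations. In the captioning
    model this would be the RNN conditioned on the image; keeping it a plain
    callable makes the sketch framework-free.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates, all_done = [], True
        for seq, logp in beams:
            if seq[-1] == eos:
                candidates.append((seq, logp))  # finished beams carry over
            else:
                all_done = False
                for tok, p in step_fn(seq):
                    candidates.append((seq + [tok], logp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all_done:
            break
    return max(beams, key=lambda b: b[1])[0]

# Hypothetical next-word table standing in for the trained model.
TABLE = {
    "<en>": [("a", 0.6), ("the", 0.4)],
    "a": [("cat", 0.7), ("dog", 0.3)],
    "the": [("cat", 0.5), ("dog", 0.5)],
    "cat": [("<eos>", 1.0)],
    "dog": [("<eos>", 1.0)],
}

def toy_step(seq):
    return TABLE[seq[-1]]

assert beam_search("<en>", toy_step, beam_size=2) == ["<en>", "a", "cat", "<eos>"]
```

The key point for the multilingual model is that only `start_token` changes between languages; the search itself is identical.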
Table 1 reports the results of our various models using Bleu scores [13], which are designed to evaluate machine translation, and the CIDEr score [17], which is designed to evaluate image captioning. Bleu measures the overlap of n-grams between the predicted sentence and the ground-truth sentences; we follow typical practice and report scores for four values of n (Bleu-1 through Bleu-4). The results show that Bleu scores decrease slightly from the single-language models to the dual-language model, but remain very close (in fact, the dual model outperforms the single-language models on 3 of the 8 Bleu variants). Under CIDEr, the multi-language model again performs slightly worse (0.651 vs. 0.593 for English and 0.631 vs. 0.594 for Japanese). However, the single unified model effectively uses half as much memory and requires half as much time to generate captions. This is a promising result: despite being trained on very distant languages, the scores indicate that the model can still produce meaningful, linguistically correct captions, with a nearly trivial change to the training and caption generation algorithms.
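The n-gram overlap underlying Bleu can be made concrete with a short sketch. Note this is our own illustration of the modified n-gram precision only; the full Bleu metric additionally combines the four precisions geometrically and applies a brevity penalty:

```python
from collections import Counter

def ngram_precision(candidate, references, n):
    """Modified n-gram precision as used in Bleu: candidate n-gram counts
    are clipped by the maximum count observed in any single reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for gram, count in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

# Hypothetical example sentences (not from the dataset).
cand = "a cat sits on the mat".split()
refs = ["a cat is on the mat".split(), "there is a cat on the mat".split()]
assert ngram_precision(cand, refs, 1) == 5 / 6   # only "sits" is unmatched
assert ngram_precision(cand, refs, 2) == 3 / 5
```

Clipping prevents a caption from scoring well simply by repeating a matching word many times.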
Some randomly selected captions are shown in Figure 2. We qualitatively observe that the single and dual-language models tend to make similar types of mistakes. For example, all captions for the bottom picture incorrectly mention water-related concepts such as surfboards and umbrellas, presumably due to the prominent water in the scene.
5 Conclusion

We introduced a simple but effective approach that uses artificial tokens to control the output language of a multilingual image captioning system. By sharing a single model across multiple languages, our approach is simpler and more compute- and memory-efficient than building a separate model for each language. Our preliminary experiments suggest that a neural captioning architecture can learn a single model even for English and Japanese, two languages whose linguistic distance is especially high, with only a minimal decrease in accuracy compared to individual models. In future work, we plan to investigate refinements that would remove this accuracy gap altogether, as well as to evaluate on languages beyond English and Japanese.
Acknowledgments

Satoshi Tsutsui is supported by the Yoshida Scholarship Foundation in Japan. This work was supported in part by the National Science Foundation (CAREER IIS-1253549) and the IU Office of the Vice Provost for Research and the School of Informatics and Computing through the Emerging Areas of Research Funding Program, and used the Romeo FutureSystems Deep Learning facility, supported by Indiana University and NSF RaPyDLI grant 1439007.
References

-  B. R. Chiswick and P. W. Miller. Linguistic distance: A quantitative measure of the distance between English and other languages. Journal of Multilingual and Multicultural Development, 26(1):1–11, 2005.
-  D. Elliott, S. Frank, and E. Hasler. Multilingual Image Description with Neural Sequence Models. arXiv:1510.04709, 2015.
-  O. Firat, K. Cho, and Y. Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv:1601.01073, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  J. Hitschler, S. Schamoni, and S. Riezler. Multimodal Pivots for Image Caption Translation. In ACL, 2016.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
-  M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. arXiv:1611.04558, 2016.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
-  T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
-  T. Miyazaki and N. Shimizu. Cross-Lingual Image Caption Generation. In ACL, 2016.
-  K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
-  L. Specia, S. Frank, K. Sima’an, and D. Elliott. A Shared Task on Multimodal Machine Translation and Crosslingual Image Description. In ACL, 2016.
-  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
-  S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In NIPS Workshop, 2015.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based Image Description Evaluation. In CVPR, 2015.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
-  B. Zoph and K. Knight. Multi-source neural translation. In HLT-NAACL, 2016.