Learning to Scale Multilingual Representations for Vision-Language Tasks

by   Andrea Burns, et al.
Boston University

Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that represents many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for few. We use a novel masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.




1 Introduction

Learning a good language representation is a fundamental component of addressing a vision-language task, such as phrase grounding [20, 32] or visual question answering [3, 15]. Many recent methods have demonstrated that learning text representations aligned to images can boost performance across many vision-language tasks over traditional text-only trained representations [7, 17, 27, 35, 36]. This is often accomplished by using auxiliary vision-language tasks when learning the language representation (such as image-sentence retrieval, as shown in Figure 1(a)). However, these methods often only support a single language. Although some work has addressed a multilingual scenario (e.g., [14, 21, 39]), these methods do not scale well to support many languages in terms of memory or performance (see Figure 1(b)). As the number of languages grows, methods like LIWE [39] that use character-based recognition systems can save memory but suffer from performance degradation. In contrast, methods that learn to align word embeddings across languages can maintain (or even improve) performance as languages are added (e.g., [14, 21]), but require additional parameters for the word embeddings that represent each new language’s vocabulary. This becomes a challenge when scaling to support many languages, as an increasing majority of trainable parameters are required for representing each language (e.g. 93% of parameters of [21] with ten languages). While pretrained word embeddings could be used without fine-tuning, e.g. Multilingual BERT [11] or MUSE [9], this comes at a significant cost in downstream task performance [7, 21].

(a) Multilingual image-sentence retrieval
(b) MSCOCO multilingual retrieval
Figure 1: (a) presents the task of multilingual image-sentence retrieval. For the sentence to image direction we have annotations in ten languages as input to our language model, which is embedded with our full SMALR training paradigm and then used to compute similarity scores against all images. (b) shows the effect of the number of training languages on performance for two prior works: MULE [21] and LIWE [39]. LIWE refers to the original model, hereafter referred to as S-LIWE. The plot contains two standalone points; one for LIWE trained with a larger embedding dimension (120-D instead of 24-D) for fairer comparison, coined L-LIWE, in orange, as well as one for our model SMALR, in yellow. The points are scaled to the number of parameters, P; specifically, their area is

To address this trade-off between multilingual capacity and performance, we propose a Scalable Multilingual Aligned Language Representation (SMALR) model, which we demonstrate achieves strong task performance while also being highly compact compared to state-of-the-art word embedding methods [11, 22, 24]. As seen in Figure 1, LIWE drops over 10% in performance going from supporting one to ten languages. MULE slightly increases performance with more languages, but requires 6x more parameters compared to its single language model. Our approach, SMALR, outperforms both with only 1/5th the parameters of MULE. We learn to efficiently represent each language by separating our language embedding into language-specific and language-agnostic token representations. As language follows a long-tailed distribution, only a few words occur often, with large portions of tokens occurring very rarely. For example, in the MSCOCO dataset [26] there are 25,126 unique tokens, but 61% of them occur fewer than 4 times. This suggests that having unique representations for every token in the vocabulary is unnecessary, as only a subset would affect downstream task performance significantly. Thus, we use a Hybrid Embedding Model (HEM) that contains language-specific embeddings for the common tokens, thereby providing a good representation for each language, and a compact language-agnostic representation for rare and uncommon words. This results in a model that needs far fewer unique embeddings than prior work without sacrificing performance.

We learn how to assign tokens to the language-agnostic representation in a pretraining step, which uses monolingual FastText embeddings [6] to map similar words to the same token, e.g. mapping “double-decker” in English and “impériale” in French to the same shared token. Once we obtain our language embeddings, our goal is to align them so that semantically similar words, even those from other languages, are embedded nearby. To accomplish this, we use a multilingual masked language model, where we randomly mask words and then predict them based on context. Unlike similar masking approaches used to train models such as BERT [11], we mask words of sentences in two languages, say German and Chinese, which are semantically similar sentences referring to the same image, and use the context from each to predict both masked tokens. To further encourage cross-language alignment, we also use an adversarial language classifier and neighborhood constraints that have been used in prior work [21]. These universal language embeddings are provided as input to a multimodal model that learns to relate them to images. Finally, we use a cross-lingual consistency module that uses machine translations to reason about the image-sentence similarity across multiple languages, which we show significantly boosts performance. Figure 2 contains an overview of our model.

Figure 2: The contributions of SMALR are in blue: a Hybrid Embedding Model (HEM), a Masked Cross-Language Modeling component (MCLM), and a Cross-Lingual Consistency stage (CLC). HEM embeds input sentences as a mixture of language-specific and language-agnostic representations using a hard attention mechanism. The MCLM component provides an additional loss to enforce language alignment, while also augmenting the original dataset with masked sentences

We use bidirectional image-sentence retrieval as the primary evaluation of our multilingual language representation. In this task, the goal is to retrieve a relevant sentence from a database given an image or to retrieve a relevant image from a database given a sentence. We augment current multilingual datasets Multi30K [5, 12, 13, 41] and MSCOCO [26, 25, 29] using machine translations so that every image has at least five sentences across ten diverse languages: English (En), German (De), French (Fr), Czech (Cs), Chinese (Cn), Japanese (Ja), Arabic (Ar), Afrikaans (Af), Korean (Ko), and Russian (Ru). See the supplementary for a breakdown of our data augmentation procedure. This constitutes the highest number of languages used in multilingual learning for vision-language tasks to date, supporting more than double the number of visually-semantically aligned languages compared to prior work [4, 9, 14, 21, 34, 39].

We list the contributions of our work below:

  • SMALR, a scalable multilingual model for training visually-semantically aligned word embeddings that outperforms the state-of-the-art on multilingual image-sentence retrieval while also requiring few model parameters.

  • A comparison to four types of vocabulary reduction methods that serve as baselines to complement our evaluation against prior work.

  • A Masked Cross-Language Modeling (MCLM) procedure that further aligns the multilingual embedding, stabilizing variance in performance over all languages, and serves as an additional data augmentation technique.

  • A Cross-Lingual Consistency (CLC) module, the first of its kind, that learns how to aggregate an ensemble of predictions across languages made with machine translations, which, combined with our SMALR architecture, results in a total improvement over the state-of-the-art by 3-4%.

2 Related Work

Transformer [37] based representation learning models have become prominent in the natural language processing literature since the release of BERT [11]. BERT transfers surprisingly well to other languages, despite having no multilingual training data or explicit multilingual loss [40]. However, [31] demonstrates that there is an unequal transfer between different language pairs, notably those with typological differences to English. Both BERT and M-BERT, its multilingual extension, have been shown to be dependent on the depth and number of parameters in the model, which reaches 110M parameters for the smaller base model [19]. Thus, as also shown in [1], a large number of additional parameters are needed to counter the performance degradation introduced when training with many languages. Using the better performing large BERT model is impractical for many vision-language tasks since it contains 340M parameters, leaving little room in the memory of many GPUs for anything else.

Along with language-only BERT variants, a burst of multimodal BERT-like models have been designed specifically for vision-language tasks [24, 27, 35, 36]. More traditional word embedding models have also been designed for multimodal tasks with the use of either visual-word co-occurrence frequencies [17], multi-task training [30], or both [7], and require significantly less training data to reach similar performance. While these efforts evaluate on many multimodal tasks such as Visual Question Answering [3], Visual Commonsense Reasoning [42], Phrase Grounding [32], and more, they only train and evaluate on a single language.

Recently there have been several multilingual methods that have shown better performance on vision-language tasks than complicated transformer-based methods. LIWE [39] is a light-weight character embedding model that can represent many languages with few model parameters. LIWE uses a bidirectional gated recurrent unit (GRU) [8] to aggregate 24-D character embeddings for a text query that is encouraged to embed near semantically similar images and sentences in other languages. Although LIWE represents a single language well, it suffers from significant performance loss when co-training on multiple languages, as shown in Figure 1(b). Gella et al. [14] learns how to relate an image to language-specific representations, which also constrains semantically similar sentences across languages to embed near each other. MULE [21] learns a universal language embedding so that it can use a single language branch in the multimodal model, significantly reducing the number of parameters required to represent each language compared to Gella et al. In addition, MULE combined the same cross-lingual constraints used in Gella et al. and LIWE with an adversarial language classifier to further encourage alignment across languages. This results in a model that slightly improves performance as more languages are added, as shown in Figure 1(b). However, because MULE learns a word-level embedding that still requires significantly more parameters than LIWE (approximately eight times more with ten languages), capacity concerns remain when scaling to many languages.

3 Scalable Multilingual Aligned Language Representation

In this section we describe how we train our Scalable Multilingual Aligned Language Representation (SMALR) to bridge the gap between scalability and downstream vision-language task performance. To accomplish this, we assume we are provided with an image and sentences that describe it in multiple languages. The intuition behind our model is to first learn a universal language embedding which represents all languages, and then learn to relate the universal embedding to corresponding images using a multimodal model. In our experiments our multimodal model uses a modified version [21] of the Embedding Network architecture [38], although our approach can be easily adapted to use other multimodal models. After obtaining image and sentence features, the Embedding Network uses two branches, one for each modality, and projects them into a joint semantic space where distances are meaningful. The image branch consists of two fully connected layers, while the language branch encodes each word using a GRU, and then passes the final hidden representation through a fully connected layer to obtain a sentence representation.

Our approach is architecturally similar to MULE [21], but with some notable distinctions. First, MULE learned a unique word embedding for every word in every language (i.e., no shared tokens), whereas we learn an efficient universal embedding with our Hybrid Embedding Model (HEM) that consists of a mix of language-agnostic and language-specific word representations (Section 3.1). Then, we learn to align our language representations both for the input of the multimodal model (i.e., the universal language embedding) as well as the final language representation of the multimodal model using a novel Masked Cross-Language Model (MCLM) (Section 3.2). This acts to supplement the neighborhood constraints, adversarial language classifier, and image-sentence matching losses used by MULE that we briefly review in Section 3.3. Finally, we also propose a Cross-Lingual Consistency (CLC) module that boosts model performance in downstream vision-language tasks using machine translation (Section 3.4). See Fig. 2 for an overview of our approach.

3.1 Efficient Multilingual Learning with a Hybrid Embedding Model

A significant challenge in multilingual representation learning is scaling to multiple languages, especially when there is a wide disparity in the available training data of different languages. This is more apparent for vision-language tasks where annotations are very expensive to collect, making it even more difficult to learn a good visually-semantically aligned language representation like those from monolingual settings [7, 24]. Inspired by work in low-resource neural machine translation [16], we propose a Hybrid Embedding Model (HEM) which projects low-frequency words across languages into a shared latent vocabulary, while allowing the top-$K$ most frequent words in each language to maintain their own unique (language-specific) representation. The output of the HEM module is the universal language embedding that is used as input to the multimodal model in Fig. 2 and is also used in the language alignment losses described in Section 3.2 and Section 3.3. The exact value of $K$ can be determined experimentally for any targeted downstream vision-language task.
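The frequency-based split behind the HEM can be illustrated with a short sketch. This is only a toy illustration of the idea described above, not the authors' implementation: the function name `split_vocabulary`, the whitespace tokenization, and the tiny corpus are all assumptions made for the example.

```python
from collections import Counter


def split_vocabulary(sentences_by_lang, k):
    """Split each language's vocabulary into language-specific and shared tokens.

    The k most frequent tokens of a language keep their own embedding;
    all remaining tokens are routed to the shared latent vocabulary.
    (Hypothetical helper; tokenization by whitespace is a simplification.)
    """
    specific, shared = {}, {}
    for lang, sentences in sentences_by_lang.items():
        counts = Counter(tok for sent in sentences for tok in sent.split())
        top = {tok for tok, _ in counts.most_common(k)}
        specific[lang] = top
        shared[lang] = set(counts) - top
    return specific, shared


# Toy corpus (made up for illustration).
corpus = {
    "en": ["a dog runs", "a dog sleeps", "a double-decker bus"],
    "fr": ["un chien court", "un chien dort", "un bus impériale"],
}
specific, shared = split_vocabulary(corpus, k=2)
print(specific["en"], shared["en"])
```

In this toy corpus the frequent tokens ("a", "dog") stay language-specific, while rare tokens such as "double-decker" and "impériale" fall into the pool that would be mapped to shared latent tokens.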

The language-specific word embeddings used for common words roughly follow the implementation used in prior work [16, 21]. We begin by using a monolingual pretrained FastText embedding [9] that has been PCA-reduced from 300-D to 50-D. These reduced features are used as input to a fully connected layer that projects them into a 512-D universal embedding space that we align across languages; the alignment is applied with the language-agnostic representations as well (see Section 3.2 and Section 3.3 for details on our language alignment procedures).

While our language-agnostic representation is similar to Gu et al. [16], it does have some key differences. Specifically, Gu et al. projects all words into the universal embedding space with learned language-specific mappings. A soft-attention module is used over the universal embedding features (as it assumes an aligned cross-lingual input) to obtain mixing weights; these weights are then used to combine the language-agnostic features. While this does enable feature sharing across languages, it does not reduce the total number of trainable parameters in the network, as a language-specific representation is still necessary for all words in the vocabulary. Additionally, aggregating all the features in the latent vocabulary using soft-attention weights per-word is costly, especially for large latent vocabularies. Instead, we perform a pretraining step where we learn both the initial representation of the latent vocabulary as well as how to assign the infrequent words to entries in it. We use a hard attention mechanism that is directly predicted from FastText features. This allows us to avoid both computing a language-specific representation for the uncommon words and aggregating the latent vocabulary features on a per-word basis.

To learn our latent shared vocabulary in the pretraining step, we train our model to embed semantically similar sentences in multiple languages near each other using a triplet loss. More formally, given a triplet of items $(x, y^+, y^-)$ that can be decomposed into a positive pair $(x, y^+)$ and a negative pair $(x, y^-)$, a triplet loss is computed as:

$$L_{triplet}(x, y^+, y^-) = \max(0, m + d(x, y^+) - d(x, y^-)) \qquad (1)$$

where $d(x, y)$ is a distance function, and $m$ is a scalar parameter. We use cosine distance for all triplet losses with a fixed margin $m$. Following the methodology of [21, 38], we construct minibatches by providing semantically similar sentence pairs as input and consider any non-paired sentence as a negative example. Then, we enumerate all triplets in the minibatch and compute the loss over the top-$N$ most violated constraints, with $N$ fixed in our experiments. Note that these sentences may not come from the same language, so semantically similar sentences in different languages are also used as positive pairs. We obtain representations for each sentence by feeding FastText embeddings into a fully connected layer, which is used to predict which latent embedding we map the source word to. Finally, we average the latent embeddings of each word, which has been shown to be an efficient and high-performing representation for vision-language tasks [7].
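As a rough illustration of a margin-based triplet loss with cosine distance, restricted to the most violated constraints, consider the sketch below. The function names and the margin and top-n values used in the example call are illustrative assumptions, not the settings used in the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_loss(anchor, positive, negative, margin):
    """Hinge on: margin + d(anchor, positive) - d(anchor, negative)."""
    return max(0.0, margin
               + cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative))


def top_n_triplet_loss(anchor, positive, negatives, margin, n):
    """Sum the loss over the n most violated constraints for one positive pair."""
    losses = sorted(
        (triplet_loss(anchor, positive, neg, margin) for neg in negatives),
        reverse=True,
    )
    return sum(losses[:n])


# Illustrative values only: margin=0.1 and n=2 are assumptions for the demo.
anchor, positive = [1.0, 0.0], [1.0, 0.0]
negatives = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
print(top_n_triplet_loss(anchor, positive, negatives, margin=0.1, n=2))
```

Only the negative that coincides with the anchor violates the margin here, so it alone contributes to the sum; well-separated negatives contribute zero.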

Instead of deterministically mapping to the latent token which achieves the best score, we randomly choose from the top-$M$ scoring tokens with probability $p$, which we refer to as exploration parameters. This helps ensure that spurious mappings are not learned, typically resulting in a 2% performance improvement on downstream tasks (see supplementary for a detailed comparison). While we freeze the latent token assignments when training the full model, we allow the features themselves to be fine-tuned. Our experiments use a latent vocabulary size of 40K tokens, together with fixed values of the exploration parameters $p$ and $M$. In practice, not all latent tokens are used by the end of pretraining; the unused tokens are dropped when training the full model.
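One plausible reading of the exploration step is sketched below: with some probability the assignment is drawn uniformly from the best-scoring latent tokens, and otherwise the single best token is used. This interpretation, the function name `assign_latent_token`, and the parameter names `top_m` and `explore_prob` are assumptions; the paper may implement the sampling differently.

```python
import random


def assign_latent_token(scores, top_m, explore_prob, rng):
    """Pick a latent token index for one word.

    scores[i] is the (hypothetical) predicted score of latent token i.
    With probability explore_prob, sample uniformly among the top_m
    highest-scoring tokens; otherwise return the best-scoring token.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if rng.random() < explore_prob:
        return rng.choice(ranked[:top_m])
    return ranked[0]


rng = random.Random(0)
scores = [0.1, 0.7, 0.3, 0.9, 0.2]
greedy = assign_latent_token(scores, top_m=3, explore_prob=0.0, rng=rng)
explored = assign_latent_token(scores, top_m=3, explore_prob=1.0, rng=rng)
print(greedy, explored)
```

Setting `explore_prob` to zero recovers the deterministic mapping, which is how frozen assignments could be read off once pretraining ends.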

3.2 Masked Cross-Language Modeling (MCLM)

Masked Language Modeling has proven to be useful in training language representations by masking some tokens of an input sentence and then trying to predict the missing tokens [11]. We present a generalization of this approach to a multilingual scenario to encourage stronger cross-language alignment. In MCLM, we assume we have paired sentences across different languages. These sentences need not be direct translations of each other, but, as our experiments will show, they simply need to be semantically related to each other. This is important as vision-language datasets do not always have paired text queries that are direct translations of each other in other languages, but are often independently generated instead (e.g., [13, 25, 29]).

Traditional Masked Language Modeling makes predictions about a single masked token using its surrounding words as context. The words immediately surrounding a token referring to the same entity between sentences in different languages may vary significantly due to differences in grammar. Thus, even if you had a dictionary between languages to identify word correspondences, it may not provide useful context. Instead, our approach is based on the intuition that semantically similar sentences should contain comparable information across languages, so a sentence in one language could be used as context to predict missing information from a sentence in another language. More formally, for a pair of languages $(i, j)$ we obtain their sentence representations $(s_i, s_j)$, where both sentences describe the same image (i.e., they are semantically similar to each other). Then, we randomly replace some portion of their words with a special MASK token to obtain masked representations $(s_i^m, s_j^m)$, which are concatenated together and fed into a fully connected layer that is shared across language pairs to predict the missing information in both sentences $(\hat{s}_i, \hat{s}_j)$. Our MCLM loss then compares this to the unmasked sentences, i.e.,

$$L_{mask} = \| \ell_2(s_i) - \ell_2(\hat{s}_i) \| + \| \ell_2(s_j) - \ell_2(\hat{s}_j) \| \qquad (2)$$

where $\ell_2$ identifies vectors we forced to have unit norm. We compute the masking loss described by Eq. (2) for all unique pairs of languages in our experiments, and found that masking 20% of the words in the sentences worked best.
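The two ingredients of MCLM, random masking and a comparison between unit-normalized predicted and unmasked sentence vectors, can be sketched as follows. This is a simplified illustration under stated assumptions: the `[MASK]` string, the helper names, and the use of Euclidean distance between unit vectors are choices made for the example, and the shared fully connected prediction layer is omitted entirely.

```python
import math
import random

MASK = "[MASK]"  # placeholder token name, assumed for this sketch


def mask_tokens(tokens, ratio, rng):
    """Replace roughly `ratio` of the tokens (at least one) with MASK."""
    n_mask = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def unit(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mclm_loss(pred_i, target_i, pred_j, target_j):
    """Sum of distances between unit-normalized predictions and targets."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(unit(a), unit(b))))
    return dist(pred_i, target_i) + dist(pred_j, target_j)


rng = random.Random(0)
german = "ein Hund läuft heute früh schnell über die grüne Wiese".split()
masked = mask_tokens(german, ratio=0.2, rng=rng)
print(masked)
print(mclm_loss([2.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.0, 1.0]))
```

Because both vectors are normalized before comparison, predictions that differ from the target only in scale incur no loss, as the final call shows.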

3.3 Multilingual Visual-Semantic Alignment

In this section we shall briefly review the visual-semantic alignment constraints used by MULE [21] that we also employ. First, we use neighborhood constraints [38] that we shall refer to as $L_{nc}$ to encourage similar sentences to embed near each other using a triplet loss (i.e., Eq. (1)). Just as with the MCLM module described in Section 3.2, these neighborhood constraints are applied to both the universal language embedding (i.e., the output of the HEM module) as well as the final language representation in the multimodal model as shown in Fig. 2. The second component of the MULE alignment constraint consists of an adversarial language classifier. We shall refer to this classifier loss as $L_{adv}$, using the approach of [21], whose goal is to ensure that the representations of the different languages in the universal embedding have similar feature distributions. The last component of the MULE constraint is used to train the multimodal model to embed the images and sentences near each other using a triplet loss. This uses a bidirectional triplet loss function, i.e., for image $I$ and paired sentences $(Q^+, Q^-)$ representing a positive and negative sentence pair, respectively, and sentence $Q$ and its paired images $(I^+, I^-)$, this multimodal loss would be,

$$L_{mm} = L_{triplet}(I, Q^+, Q^-) + \lambda_1 L_{triplet}(Q, I^+, I^-) \qquad (3)$$

where $\lambda_1$ is a scalar parameter, which we set to 1.5 in our experiments. In addition to using the unmasked sentence representations for the multimodal loss, we also observe that most sentences tend to retain most of their overall semantic meaning if you remove just a few words at random. Using this intuition, we also compute Eq. (3) using the masked sentence representations used in the MCLM module in addition to the unmasked sentences, which we found provides a small, but consistent improvement to performance. As a reminder, all triplet losses use the implementation details (e.g., hyperparameter settings and hard-negative mining) as described in the first part of Section 3. Our total loss function to train SMALR is then,

$$L_{SMALR} = L_{mm} + \lambda_2 L_{mask} + \lambda_3 L_{adv} + \lambda_4 L_{nc} \qquad (4)$$

where $L_{mask}$ is the MCLM loss of Eq. (2) and $\lambda_2$, $\lambda_3$, $\lambda_4$ are scalar parameters that we set to (1e-4, 1e-6, 5e-2), respectively.

3.4 Cross-Lingual Consistency

Prior work on multilingual vision-language tasks has primarily focused on how to change training procedures or architectures in order to support multiple languages, and does not fully take advantage of this multilingual support at test time. In particular, we argue that there are cases in which the same sentence in different languages may capture complementary information, and that considering the predictions made in other languages may help improve performance. We validate our intuition by obtaining machine translations of a query in the other languages supported by our model. More formally, suppose we have a set of languages $\mathcal{L}$. Given a query $q$ in language $l_i \in \mathcal{L}$, we translate the query to all other supported languages in $\mathcal{L}$ and use this as input into our Cross-Lingual Consistency (CLC) module.

We propose two variants of CLC: CLC-A and CLC-C. CLC-A simply averages matching scores over all languages, and does not require any additional parameters. CLC-C, on the other hand, uses a small Multilayer Perceptron (MLP) to aggregate the scores of each language, which enables us to consider the relative information present in each language’s predictions. This MLP has two layers with input size $|\mathcal{L}|$, the number of supported languages, and 32 hidden layer units (i.e., it has 352 learnable parameters), and all parameters are initialized with uniform weight. We train the CLC-C module separately from SMALR using the validation set for 30 iterations. No minibatches are employed (i.e., it is trained with all image-sentence pairs at once) and it is trained using the multimodal triplet loss described in Eq. (3).
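A minimal sketch of CLC-A style score averaging, plus a check of the stated parameter count, is given below. The function names and the toy scores are assumptions. The count of 352 is consistent with ten input scores, 32 hidden units, a single output, and no bias terms (10 x 32 + 32 x 1); the bias-free reading is an inference from the numbers, not something the text states.

```python
def clc_average(scores_by_lang):
    """Average per-candidate matching scores over all languages (CLC-A style).

    scores_by_lang maps a language code to a list of similarity scores,
    one per candidate image, all lists having the same length.
    """
    langs = list(scores_by_lang)
    n_candidates = len(scores_by_lang[langs[0]])
    return [
        sum(scores_by_lang[lang][c] for lang in langs) / len(langs)
        for c in range(n_candidates)
    ]


def mlp_param_count(n_inputs, n_hidden, n_outputs=1, bias=False):
    """Count weights (and optionally biases) of a two-layer MLP."""
    weights = n_inputs * n_hidden + n_hidden * n_outputs
    biases = (n_hidden + n_outputs) if bias else 0
    return weights + biases


# Toy scores for a query and its translation (made up for illustration).
averaged = clc_average({"en": [0.9, 0.2], "de": [0.7, 0.4]})
print(averaged)
print(mlp_param_count(n_inputs=10, n_hidden=32))
```

Averaging needs no learned parameters, whereas a learned aggregator of this size adds only a few hundred weights on top of the base model.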

4 Experimental Setup

Datasets. SMALR is evaluated on bidirectional retrieval with image-sentence pairs from Multi30K [5, 12, 13] and MSCOCO [25, 26, 29]. The Multi30K dataset is built off of Flickr30K [41], which originally contained 31,783 images and five English descriptions per image. [5, 12, 13] obtained annotations in German, French, and Czech, resulting in a four-language multilingual dataset. Multi30K contains five descriptions per image in English and German, but only one description per image in French and Czech; the latter two were collected as human-generated translations of the English annotations. We use the 29K/1K/1K train, test, val splits as given with the original dataset [41].

MSCOCO is approximately four times the size of Multi30K, with 123,287 total images. There are five human-generated captions per image in English, but significantly fewer in Chinese and Japanese. YJ Captions [29] introduced new Japanese annotations for MSCOCO, but only provide five captions per image for a subset of approximately 26K images. [25] further extended MSCOCO with a total of 22,218 Chinese captions for 20,341 images. We use train/test/validation splits as defined in [21].

As mentioned in the Introduction, we augment both datasets using machine translations so every image contains at least five sentences for ten languages: Afrikaans, Arabic, English, German, Czech, French, Russian, Chinese, Japanese, and Korean. All models we compare to are trained using this augmented training set. For languages that have no human-generated sentences, we use machine translated sentences at test time as well. While using translations at test time results in a noisy evaluation, we found it did not affect the relative performance of different methods in our experiments. See the supplementary for details.

Visual Features. We use ResNet-152 [18] features trained on ImageNet [10] as input to the Embedding Network (EmbN) [38], our image-sentence retrieval model. As done in [21], we average visual features over ten 448x448 crops of an image. This results in an image embedding of size 2048, which is then passed through a pair of fully connected layers, ultimately resulting in a 512-D image embedding that can be used in the shared image-sentence embedding space. The learning rate was set to for the HEM and LA models, with the remaining hyperparameters being consistent with those used by MULE [21].

Note that all LIWE [39] experiments use bottom-up Faster R-CNN [33] visual features, which are trained on Visual Genome [23]. This represents a very significant increase in the annotation cost compared with our approach, which doesn’t use these annotations. In addition, Visual Genome contains MSCOCO [26] images, which means that there is train/test contamination as LIWE’s features are extracted using the pretrained, publicly available model from [2]. Thus, some test images were used to train the image representation used by LIWE.

Metrics. We evaluate on image-sentence retrieval, and report Recall@$K$, with $K \in \{1, 5, 10\}$ for both the image-sentence and sentence-image directions of the task. For our results, we report the mean Recall (mR) across these six values per language. All Recall@$K$ values can be found in the supplementary material. We also provide an additional average, “A,” in Tables 1 and 2, which averages the mR across all languages to serve as a global performance metric. The human average, “HA,” refers to the average mR over the languages which have human-generated annotations (i.e. English, Chinese, and Japanese for MSCOCO, and English, German, French, and Czech for Multi30K).
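The mean Recall metric can be made concrete with a small sketch. It assumes the conventional cutoffs K = 1, 5, 10, which give the six values mentioned above across the two retrieval directions; the function names and the example ranks are made up for illustration.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose first correct item is ranked within the top k.

    ranks holds the 1-based rank of the first correct result for each query.
    """
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def mean_recall(image_to_sentence_ranks, sentence_to_image_ranks, ks=(1, 5, 10)):
    """Average Recall@K over both retrieval directions and all cutoffs."""
    values = [
        recall_at_k(ranks, k)
        for ranks in (image_to_sentence_ranks, sentence_to_image_ranks)
        for k in ks
    ]
    return sum(values) / len(values)


# Toy ranks for four queries in each direction.
i2s = [1, 3, 12, 2]
s2i = [1, 1, 7, 20]
print(mean_recall(i2s, s2i))
```

The per-language mR values in the result tables would then be averaged again, over human-annotated languages for “HA” and over all languages for “A.”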

Comparative Evaluation. We compare the following methods:

  • Frequency Thresholding: We drop words that occur fewer than $t$ times in the training set. Results are reported in Figure 3.

  • PCA Reduction: We use Principal Component Analysis (PCA) [28] to reduce the size of the initial 300-D FastText word embeddings. Results are reported in Figure 3.

  • Dictionary Mapping: We map words that occur fewer than $t$ times in non-English languages to English tokens using dictionaries [9]. By mapping rare words in other languages to English, some information may be lost, but the token will still indirectly exist in the vocabulary. However, we expect this method to be insufficient for a larger multilingual setting, where languages have greater linguistic differences, like Arabic and Chinese, as mapping to English may not retain enough language-specific information. Results are reported in Figure 3.

  • Language-Agnostic (LA): We compare to only using a latent vocabulary as described in Section 3.1 with 40K tokens, i.e. not using any language specific features, in Tables 1 and 2.

  • HEM: We then evaluate our full hybrid embedding model (Section 3.1), which uses a mixture of language-agnostic and language-specific representations. This baseline does not include MCLM nor CLC, and can be found in Tables 1 and 2.

  • SMALR: Our base SMALR is composed of the HEM (Section 3.1) and MCLM (Section 3.2) components of our model. We compare to our complete SMALR which makes use of two CLC variants (CLC-A and CLC-C, described in Section 3.4) in Tables 1 and 2.

We also note that the first line of Tables 1 and 2, Trans To En, refers to using machine translation on non-English sentences to convert them to English, and then using an English-only trained Embedding Network [38], providing a strong baseline method to compare to.

| Model | En | De¹ | Fr¹ | Cs¹ | Cn | Ja | Ar¹ | Af¹ | Ko¹ | Ru¹ | HA | A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Trans. to En [21] | 75.6 | - | - | - | 72.2 | 66.1 | - | - | - | - | 71.3 | - |
| EmbN [38] | 76.8 | - | - | - | 73.5 | 73.2 | - | - | - | - | 74.5 | - |
| PAR. EmbN [14] | 78.3 | - | - | - | 73.5 | 76.0 | - | - | - | - | 75.9 | - |
| MULE [21] | 79.5 | - | - | - | 74.8 | 76.3 | - | - | - | - | 76.9 | - |
| (b) (1) S-LIWE [39]² | 80.9 | - | - | - | - | 73.6 | - | - | - | - | - | - |
| (2) S-LIWE² | 77.4 | - | - | - | - | 66.6 | - | - | - | - | - | - |
| (10) S-LIWE² | 77.3 | 67.4 | 68.5 | 66.9 | 64.5 | 65.8 | 63.8 | 66.2 | 63.1 | 63.6 | 69.2 | 66.7 |
| (10) L-LIWE² | 79.1 | 71.2 | 70.3 | 70.1 | 70.0 | 69.6 | 67.5 | 68.9 | 66.2 | 69.6 | 72.9 | 70.3 |
| MULE [21] | 79.0 | 77.2 | 76.8 | 77.8 | 75.6 | 75.9 | 77.2 | 77.8 | 74.3 | 77.3 | 76.8 | 76.9 |
| (c) Language-Agnostic | 75.0 | 74.3 | 74.1 | 73.4 | 72.3 | 72.1 | 74.4 | 74.7 | 71.6 | 72.7 | 73.1 | 73.5 |
| HEM | 78.7 | 77.3 | 76.4 | 77.9 | 76.7 | 76.3 | 77.0 | 76.7 | 75.5 | 77.0 | 77.3 | 76.9 |
| SMALR | 79.3 | 78.4 | 77.8 | 78.6 | 76.7 | 77.2 | 77.9 | 78.2 | 75.1 | 78.0 | 77.7 | 77.7 |
| SMALR-CLC-A | 81.2 | - | - | - | 79.6 | 75.0 | - | - | - | - | 78.6 | - |
| SMALR-CLC-C | 81.5 | - | - | - | 80.1 | 77.5 | - | - | - | - | 79.7 | - |

¹ uses translations from English for testing
² visual features trained using outside dataset that includes some test images

Table 1: MSCOCO results on multilingual bidirectional image-sentence retrieval. (a) contains results from prior work, (b) contains reproductions of two state-of-the-art methods evaluated for our scenario using their code, and (c) contains variants of our model.

| Model | En | De | Fr | Cs | Cn¹ | Ja¹ | Ar¹ | Af¹ | Ko¹ | Ru¹ | HA | A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | | | | | | | | | | | | |
| Trans. to En [21] | 71.1 | 48.5 | 46.7 | 46.9 | - | - | - | - | - | - | 53.3 | - |
| EmbN [38] | 72.0 | 60.3 | 54.8 | 46.3 | - | - | - | - | - | - | 58.4 | - |
| PAR. EmbN [14] | 69.0 | 62.6 | 60.6 | 54.1 | - | - | - | - | - | - | 61.6 | - |
| MULE [21] | 70.3 | 64.1 | 62.3 | 57.7 | - | - | - | - | - | - | 63.6 | - |
| (b) | | | | | | | | | | | | |
| (1) S-LIWE [39]² | 76.3 | 72.1 | - | - | - | - | - | - | - | - | - | - |
| (2) S-LIWE² | 75.6 | 66.1 | - | - | - | - | - | - | - | - | - | - |
| (10) S-LIWE² | 75.2 | 65.1 | 50.6 | 53.9 | 53.9 | 56.0 | 61.3 | 62.3 | 55.1 | 64.2 | 61.2 | 59.8 |
| (10) L-LIWE² | 75.1 | 65.0 | 51.1 | 54.7 | 55.8 | 55.3 | 64.2 | 62.7 | 63.8 | 54.4 | 61.5 | 60.2 |
| MULE [21] | 70.7 | 63.6 | 63.4 | 59.4 | 64.2 | 67.3 | 65.8 | 67.3 | 63.6 | 65.4 | 64.3 | 65.1 |
| (c) | | | | | | | | | | | | |
| Language-Agnostic | 65.5 | 61.3 | 59.9 | 54.0 | 59.4 | 64.7 | 63.9 | 66.5 | 60.3 | 60.3 | 60.2 | 61.6 |
| HEM | 69.2 | 62.8 | 63.3 | 60.0 | 62.4 | 66.3 | 64.5 | 66.8 | 62.3 | 62.6 | 63.8 | 64.0 |
| SMALR | 69.6 | 64.7 | 64.5 | 61.1 | 64.0 | 66.7 | 66.0 | 67.4 | 64.2 | 65.7 | 65.0 | 65.4 |
| SMALR-CLC-A | 74.1 | 68.9 | 65.2 | 64.5 | - | - | - | - | - | - | 68.2 | - |
| SMALR-CLC-C | 74.5 | 69.8 | 65.9 | 64.8 | - | - | - | - | - | - | 68.7 | - |

¹ uses translations from English for testing
² visual features trained using outside dataset

Table 2: Multi30K results on multilingual bidirectional image-sentence retrieval. (a) contains results from prior work, (b) contains reproductions of two state-of-the-art methods evaluated for our scenario using their code, and (c) contains variants of our model.

5 Multilingual Image-Sentence Retrieval Results

Figure 3: We compare three types of vocabulary reduction: frequency thresholding, PCA dimensionality reduction, and mapping rare words to English with the use of dictionaries. The left-hand side evaluates on MSCOCO, the right on Multi30K. We have additional standalone points for the small LIWE (S-LIWE), large LIWE (L-LIWE), MULE, latent vocabulary (LA), and our model, SMALR

We provide results for MSCOCO and Multi30K in Table 1 and Table 2, respectively, which contain comparisons to prior work on fewer languages (a), adaptations of prior work to our setting (b), and our model variants (c). When evaluating on ten languages, SMALR obtains consistent performance gains over the state-of-the-art (S-LIWE, line 3(b)) while also being more efficient than high-performing models like MULE (line 5(b)). SMALR outperforms S-LIWE by 11 points on MSCOCO and 5.6 points on Multi30K (line 3(c) versus 3(b)). A parameter comparison is shown later in Figure 3. SMALR’s initial Language-Agnostic (LA) baseline alone boosts performance over the previous scalable method LIWE by 2-7 points. The HEM, which combines language-agnostic and language-specific embeddings as described in Section 3.1, consistently improves upon the fully language-agnostic vocabulary, even though they share the same vocabulary size of 40K tokens. This points to the utility of our hybrid embedding space, which improves performance upon LA by 3.4 average mR on MSCOCO and 2.4 average mR on Multi30K while adding only a few parameters.
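The averages quoted above can be re-derived from the per-language mR values in Table 1. The short check below assumes that the A column averages over all ten languages and that the HA column averages only over languages with human-generated test sentences (En, Cn, and Ja on MSCOCO). This reading is consistent with the table values but is an interpretation made for this sketch; the helper name `average` is hypothetical.

```python
# Per-language mean recall (mR) on MSCOCO, copied from Table 1.
LANGS = ["En", "De", "Fr", "Cs", "Cn", "Ja", "Ar", "Af", "Ko", "Ru"]
HUMAN = ["En", "Cn", "Ja"]  # languages with human-generated MSCOCO captions

S_LIWE_10 = [77.3, 67.4, 68.5, 66.9, 64.5, 65.8, 63.8, 66.2, 63.1, 63.6]
SMALR = [79.3, 78.4, 77.8, 78.6, 76.7, 77.2, 77.9, 78.2, 75.1, 78.0]


def average(scores, langs=LANGS, subset=None):
    """Average mR over `subset` of languages (all languages if None)."""
    chosen = [
        s for lang, s in zip(langs, scores) if subset is None or lang in subset
    ]
    return sum(chosen) / len(chosen)


smalr_a = round(average(SMALR), 1)                  # A column for SMALR
liwe_a = round(average(S_LIWE_10), 1)               # A column for (10) S-LIWE
smalr_ha = round(average(SMALR, subset=HUMAN), 1)   # HA column for SMALR
gap = round(smalr_a - liwe_a, 1)                    # gap quoted in the text
```

Under these assumptions the rounded values match the A and HA entries for both rows, and `gap` reproduces the 11-point difference reported for MSCOCO.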

When MCLM losses are added, referred to as SMALR in Tables 1 and 2 (line 3(c)), mR improves for nearly all languages. This is significant because we find that more compact models like LIWE degrade with additional languages when using the same number of parameters (S-LIWE). The LA baseline still outperforms L-LIWE on MSCOCO and Multi30K, even though L-LIWE learns an embedding fivefold larger to try to compensate for the increased number and diversity of languages (120-D instead of 24-D). The gain from MCLM suggests that the masking process may help regain some semantic information that is lost when tokens are mapped to a language-agnostic space.

We next evaluate two CLC variants that use machine translations at test time (described in Section 3.4) on top of SMALR: an average ensemble over all languages (CLC-A), and a weighted ensemble which makes use of a simple classifier (CLC-C). CLC-A uses no additional test-time parameters, and increases the human average performance by 1-3 points, with a larger gain on Multi30K. This may be because more languages can be leveraged on Multi30K (four versus three, compared to MSCOCO). Surprisingly, English performance improves the most amongst CLC-A metrics on Multi30K, demonstrating that certain image-sentence pairs can be better retrieved from the queries in other languages, which may better capture the visual semantics of the same image. CLC-C further improves the human average over CLC-A by 0.9 points on MSCOCO and 0.5 points on Multi30K, using negligible additional parameters.
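The ensembling idea behind the two CLC variants can be sketched as follows. This is a minimal illustration, not the paper's implementation: the similarity scores are made up, and the fixed `weights` passed to `clc_weighted` only stand in for the output of the learned classifier used by CLC-C.

```python
def clc_average(scores_per_language):
    """Unweighted ensemble: average each image's score over all languages."""
    n_langs = len(scores_per_language)
    n_images = len(scores_per_language[0])
    return [
        sum(s[i] for s in scores_per_language) / n_langs
        for i in range(n_images)
    ]


def clc_weighted(scores_per_language, weights):
    """Weighted ensemble; the weights here are fixed, hypothetical values."""
    total = sum(weights)
    n_images = len(scores_per_language[0])
    return [
        sum(w * s[i] for w, s in zip(weights, scores_per_language)) / total
        for i in range(n_images)
    ]


def top1(scores):
    """Index of the highest-scoring image."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy similarity scores of one query and its translations against three images.
# The ground-truth image is index 1; the English query alone prefers index 0.
scores = [
    [0.62, 0.60, 0.10],  # original English query
    [0.40, 0.70, 0.20],  # German translation
    [0.45, 0.65, 0.15],  # French translation
]
english_only = top1(scores[0])
ensemble = top1(clc_average(scores))
weighted = top1(clc_weighted(scores, weights=[0.2, 0.5, 0.3]))
```

In this toy case the English query alone retrieves the wrong image, while both ensembles recover the ground truth because the translated queries agree on it.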

Parameter reduction method comparison. We present a side-by-side comparison of the baseline vocabulary reduction techniques described in Section 4 against prior works LIWE and MULE and against SMALR (consisting of only the HEM and MCLM components in Figure 3). The frequency thresholding and dictionary mapping labels give the threshold below which we drop infrequent words or map them to English (e.g. the blue 50 data point represents dropping words that occur fewer than 50 times). PCA point labels give the dimensionality to which we reduce the 300-D input vectors (e.g. 50D, 100D, or 200D).
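To make the parameter axis of this comparison concrete, the sketch below counts only the parameters of a word embedding table, which scale with vocabulary size times embedding dimension. The vocabulary size of 100,000 is hypothetical, and the 300-D dimensionality used for the 40K latent vocabulary is an assumption for illustration. The rest of the model is ignored.

```python
def embedding_parameters(vocab_size, dim):
    """Parameters in a word embedding table of shape (vocab_size, dim)."""
    return vocab_size * dim


# Hypothetical vocabulary size. The 300-D input and the 50/100/200-D PCA
# targets come from the text.
vocab_size = 100_000
full = embedding_parameters(vocab_size, 300)
reduced = {dim: embedding_parameters(vocab_size, dim) for dim in (50, 100, 200)}

# A 40K-token latent vocabulary, assumed to be 300-D for this illustration.
latent = embedding_parameters(40_000, 300)
```

Both routes shrink the same product: PCA reduces the dimension term, while frequency thresholding, dictionary mapping, and a latent vocabulary reduce the vocabulary term.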

In our comparison of vocabulary reduction methods, we find that frequency thresholding with and vanilla language-agnostic vocabularies (LA) are able to obtain better performance than both LIWE variants on Multi30K, without adding significantly more parameters, as shown on the right side of Figure 3. While more model parameters are needed for MSCOCO, due to the increased vocabulary size, all simple baselines and prior work MULE significantly outperform LIWE. This demonstrates that more-complex character-based models do not necessarily obtain competitive performance with few parameters when addressing a larger multilingual scenario.

SMALR outperforms all baselines for MSCOCO, as seen on the left of Figure 3, surpassing S-LIWE by over 10 points while using fewer parameters than L-LIWE. We also find that average mean recall on MSCOCO is more robust to vocabulary reduction, with a maximum range of about 1.5 average mR between the most and least extreme reductions. We believe this may be due to the size discrepancy between the two datasets, as MSCOCO is approximately four times the size of Multi30K. PCA reduction appears to have a more linear effect as parameters increase on both datasets. Since Multi30K performance is more sensitive to the number of parameters, it is significant that our SMALR model, shown in green (which does not yet make use of our cross-lingual consistency module in Figure 3), outperforms all other models while having fewer than 20M parameters, one fifth the parameter count of the high-performing MULE.

In addition to SMALR outperforming MULE on both datasets while using significantly fewer trainable parameters, we find that MULE even fails to outperform simple baselines such as dictionary mapping on MSCOCO. This shows that the large number of parameters used for the word-level embeddings in MULE is unnecessary for performance gains. While SMALR uses more parameters during training than S-LIWE, it has far fewer test-time parameters. We reduce the number of computations needed for evaluation by using precomputed language representations from training. This essentially reduces the entire SMALR model to the image-sentence matching model with our CLC add-on, totaling only 7.1M parameters, now fewer than S-LIWE.

6 Conclusion

In this paper, we have presented a Scalable Multilingual Aligned Language Representation (SMALR), which addresses the trade-off between multilingual model size and downstream vision-language task performance. Our approach is modular, and thus can be used as a drop-in language representation for any vision-language method or task. SMALR outperforms all prior work on the task of multilingual image-sentence retrieval on average across ten diverse languages, with the use of a hybrid embedding model, a masked cross-language modeling loss, and a cross-lingual consistency module. Our hybrid embedding model significantly reduces the input to a language model by mapping most tokens to a fixed-size, shared vocabulary. The novel masking procedure aligns our diverse set of languages and makes use of the multimodal model to provide additional alignment by visually grounding our language representations. We find that both cross-lingual consistency modules better aggregate retrieved results, boosting performance with minimal additional parameters. This is all accomplished with fewer than 20M trainable parameters, one fifth the parameter count of larger prior work such as MULE, while improving performance over the state-of-the-art by 3-4%.


  • [1] R. Aharoni, M. Johnson, and O. Firat (2019-06) Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §2.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [4] M. Artetxe, G. Labaka, and E. Agirre (2016) Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Empirical Methods in Natural Language Processing (EMNLP), pp. 2289–2294. Cited by: §1.
  • [5] L. Barrault, F. Bougares, L. Specia, C. Lala, D. Elliott, and S. Frank (2018) Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pp. 304–323. Cited by: §1, §4, §7.1, Table 4.
  • [6] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL) 5, pp. 135–146. External Links: ISSN 2307-387X Cited by: §1.
  • [7] A. Burns, R. Tan, K. Saenko, S. Sclaroff, and B. A. Plummer (2019) Language features matter: effective language representations for vision-language tasks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §3.1, §3.1.
  • [8] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • [9] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2018) Word translation without parallel data. In International Conference on Learning Representations (ICLR), Cited by: §1, §1, §3.1, 3rd item.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In arXiv:1810.04805v1, Cited by: §1, §1, §1, §2, §3.2.
  • [12] D. Elliott, S. Frank, L. Barrault, F. Bougares, and L. Specia (2017) Findings of the second shared task on multimodal machine translation and multilingual image description. arXiv:1710.07177. Cited by: §1, §4, §7.1, Table 4.
  • [13] D. Elliott, S. Frank, K. Sima’an, and L. Specia (2016) Multi30k: multilingual english-german image descriptions. arXiv:1605.00459. Cited by: §1, §3.2, §4, §7.1, Table 4.
  • [14] S. Gella, R. Sennrich, F. Keller, and M. Lapata (2017) Image pivoting for learning multilingual multimodal representations. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §1, §2, Table 1, Table 2.
  • [15] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [16] J. Gu, H. Hassan, J. Devlin, and V. O.K. Li (2018) Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), Cited by: §3.1, §3.1, §3.1.
  • [17] T. Gupta, A. Schwing, and D. Hoiem (2019) ViCo: word embeddings from visual co-occurrences. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv:1512.03385. Cited by: §4.
  • [19] K. K, Z. Wang, S. Mayhew, and D. Roth (2019) Cross-lingual ability of multilingual bert: an empirical study. arXiv:1912.07840. Cited by: §2.
  • [20] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg (2014) ReferItGame: referring to objects in photographs of natural scenes. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1.
  • [21] D. Kim, K. Saito, K. Saenko, S. Sclaroff, and B. A. Plummer (2020) MULE: multimodal universal language embedding. In AAAI Conference on Artificial Intelligence, Cited by: Figure 1, §1, §1, §1, §2, §3.1, §3.1, §3.3, §3, §3, Table 1, Table 2, §4, §4, §7.5, Table 17.
  • [22] B. Klein, G. Lev, G. Sadeh, and L. Wolf (2015) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [23] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV). Cited by: §4.
  • [24] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) VisualBERT: a simple and performant baseline for vision and language. arXiv:1908.03557. Cited by: §1, §2, §3.1.
  • [25] X. Li, C. Xu, X. Wang, W. Lan, Z. Jia, G. Yang, and J. Xu (2019) COCO-cn for cross-lingual image tagging, captioning and retrieval. IEEE Transactions on Multimedia. Cited by: §1, §3.2, §4, §4, §7.1, Table 3.
  • [26] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In The European Conference on Computer Vision (ECCV), Cited by: §1, §1, §4, §4, §7.1, Table 3.
  • [27] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265. Cited by: §1, §2.
  • [28] A. Maćkiewicz and W. Ratajczak (1993) Principal components analysis (pca). Computers and Geosciences 19 (3), pp. 303 – 342. External Links: ISSN 0098-3004 Cited by: 2nd item.
  • [29] T. Miyazaki and N. Shimizu (2016) Cross-lingual image caption generation. In Conference of the Association for Computational Linguistics (ACL), Cited by: §1, §3.2, §4, §4, §7.1, Table 3.
  • [30] D. Nguyen and T. Okatani (2019) Multi-task learning of hierarchical vision-language representation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [31] T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual bert? arXiv:1906.01502. Cited by: §2.
  • [32] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §4.
  • [34] S. L. Smith, D. H. P. Turban, S. Hamblin, and N. Y. Hammerla (2017) Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv:1702.03859. Cited by: §1.
  • [35] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) VL-bert: pre-training of generic visual-linguistic representations. arXiv:1908.08530. Cited by: §1, §2.
  • [36] H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §2.
  • [37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §2.
  • [38] L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (2), pp. 394–407. Cited by: §3.1, §3.3, §3, Table 1, Table 2, §4, §4.
  • [39] J. Wehrmann, D. M. Souza, M. A. Lopes, and R. C. Barros (2019) Language-agnostic visual-semantic embeddings. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Figure 1, §1, §1, §2, Table 1, Table 2, §4, Table 17.
  • [40] S. Wu and M. Dredze (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of bert. arXiv:1904.09077. Cited by: §2.
  • [41] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL) 2, pp. 67–78. Cited by: §1, §4, Table 4.
  • [42] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi (2019) From recognition to cognition: visual commonsense reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.

7 Supplementary Material

7.1 Data Augmentation

We augment the multilingual datasets MSCOCO [25, 26, 29] and Multi30K [5, 12, 13] by using Google Translate to translate from languages with human-generated annotations to other languages. Tables 3 and 4 show which translations were performed for MSCOCO and Multi30K, respectively. The column X refers to all other languages, which consist entirely of translations and complete the total set of ten languages: German, French, Czech, Arabic, Afrikaans, Korean, and Russian for MSCOCO, and Chinese, Japanese, Arabic, Afrikaans, Korean, and Russian for Multi30K. We compare the effect of using human-generated vs. machine-translated sentences at test time in Section 7.6.

| Annotation Type | En | Cn | Ja | X |
|---|---|---|---|---|
| Human Generated | MSCOCO [26] | COCO-CN [25] | YJ Captions [29] | - |
| Translations | Cn → En, Ja → En | En → Cn | En → Ja | En → X |

Table 3: Dataset Augmentation for MSCOCO. Arrows signify the use of machine translation, and X refers to all other languages in the total set of ten.

| Annotation Type | En | De | Fr | Cs | X |
|---|---|---|---|---|---|
| Human Generated | Flickr30K [41] | Multi30K [13] | Multi30K [12] | Multi30K [5] | - |
| Translations | De → En, Fr → En, Cs → En | En → De | En → Fr | En → Cs | En → X |

Table 4: Dataset Augmentation for Multi30K. Arrows signify the use of machine translation, and X refers to all other languages in the total set of ten.

7.2 Exploration Parameters

One component of SMALR is the Hybrid Embedding Model (HEM), which makes use of both language-specific and language-agnostic representations. The Language-Agnostic (LA) baseline refers to only using the shared latent vocabulary, which consists of 40K tokens. We found experimentally that using exploration parameters and improves downstream performance when using the latent vocabulary. These exploration parameters are used to force the model to randomly select from a set of similar tokens during training rather than always choosing the best matched token in the language-agnostic vocabulary (described in Section 3.1 of the main paper). Tables 5 and 6 demonstrate the difference in mean Recall for image-sentence retrieval with and without our exploration parameters.

Since we find that using the exploration parameters when learning the mapping to the latent vocabulary improves performance, we use them for both the language-agnostic and HEM results (and they are thus included in the final SMALR training paradigm).
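The exploration behavior described above, randomly selecting from a set of similar tokens instead of always taking the best match, might look roughly like the following. This is only a sketch under stated assumptions: `explore_prob` and `top_k` are hypothetical names and values, and the uniform sampling scheme is a guess at one reasonable realization rather than the procedure in Section 3.1.

```python
import random


def select_latent_token(similarities, explore_prob, top_k, rng):
    """Pick a latent-vocabulary index for one word.

    With probability `explore_prob`, sample uniformly from the `top_k` most
    similar latent tokens; otherwise return the single best match.
    """
    ranked = sorted(
        range(len(similarities)), key=lambda i: similarities[i], reverse=True
    )
    if rng.random() < explore_prob:
        return rng.choice(ranked[:top_k])
    return ranked[0]


rng = random.Random(0)
sims = [0.1, 0.9, 0.8, 0.3, 0.7]  # toy similarities to five latent tokens

# Without exploration the best-matched token is always chosen.
greedy = select_latent_token(sims, explore_prob=0.0, top_k=3, rng=rng)

# With exploration forced on, choices spread over the three closest tokens.
explored = [
    select_latent_token(sims, explore_prob=1.0, top_k=3, rng=rng)
    for _ in range(20)
]
```

At test time one would presumably disable exploration, which corresponds to `explore_prob=0.0` in this sketch.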

| Model | En | De¹ | Fr¹ | Cs¹ | Cn | Ja | Ar¹ | Af¹ | Ko¹ | Ru¹ | HA | A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LA | 73.9 | 73.0 | 71.7 | 72.9 | 72.0 | 70.8 | 72.8 | 72.0 | 69.7 | 72.0 | 72.2 | 72.1 |
| LA + Explore | 75.0 | 74.3 | 74.1 | 73.4 | 72.3 | 72.1 | 74.4 | 74.7 | 71.6 | 72.7 | 73.1 | 73.5 |

¹ uses translations from English for testing

Table 5: MSCOCO Language-Agnostic (LA) Ablation

| Model | En | De | Fr | Cs | Cn¹ | Ja¹ | Ar¹ | Af¹ | Ko¹ | Ru¹ | HA | A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LA | 64.2 | 58.8 | 58.3 | 52.1 | 59.0 | 63.2 | 61.9 | 65.3 | 58.6 | 58.5 | 58.3 | 60.0 |
| LA + Explore | 65.5 | 61.3 | 59.9 | 54.0 | 59.4 | 64.7 | 63.9 | 66.5 | 60.3 | 60.3 | 60.2 | 61.6 |

¹ uses translations from English for testing

Table 6: Multi30K Language-Agnostic (LA) Ablation

7.3 Qualitative Results

We provide two examples for both MSCOCO and Multi30K which show the effect of the Cross-Lingual Consistency (CLC) module used with SMALR. We report results for the CLC-C variant, which makes use of a simple MLP classifier to aggregate scores across languages. For a given text query, if it is human generated, we translate it to all other languages and use the predictions from these translations as input to our CLC-C module.

On the left-hand side of Figure 4, the original text query is in English and its matching image is incorrectly retrieved, as shown by the red bounding box. However, when CLC-C is used, SMALR is able to correctly retrieve the matching image, as a subset of the translated sentences do correctly retrieve the ground truth image (e.g. the German translation). On the right-hand side of Figure 4, we see the same benefit for an original text query in German which is aided by English translations. These two examples demonstrate the benefit of CLC-C for R@1, as CLC-C now correctly retrieves the ground truth image. Additionally, these samples show that not every language has to make the correct prediction; the CLC-C module can learn to combine predictions to improve performance. As we can see in Figure 4, the images incorrectly retrieved for the original English and German queries “People are walking through a vegetable stall filled market” and “Der mann trägt eine orange wollmütze” contain very similar objects and colors to their respective ground truth images, but these errors are remedied when considering all languages.

In Figure 5, there are two examples for MSCOCO, with original text queries in English and Chinese. Both examples have many translated queries which are able to correctly retrieve the ground truth image, such as French and Russian for English, and English, German, and French (among others) for Chinese. We see again that the original incorrectly retrieved image contains very similar visual semantics (e.g. teddy bear for English, baseball field for Chinese) to the ground truth, and the translated sentences help disambiguate subtle details.

Figure 4: Example of the benefits of using CLC module on Multi30K
Figure 5: Example of the benefits of using CLC module on MSCOCO

7.4 Masked Cross-Language Modeling Example

SMALR’s Masked Cross-Language Model (MCLM) uses two language representations to compute its total loss: an average representation and a sentence-level representation. The average representation removes the masked word and averages the word embeddings of the resulting one-word-shorter sentence before predicting the masked token. The sentence-level representation keeps the same number of words as the original sentence by replacing the masked word with a special [MASK] token. This retains the total word count for a given query and preserves grammatical structure, since the masked sentence is passed through an LSTM and a fully connected layer before being used to predict the masked token. Figure 6 provides an example of this process. See Section 3.2 of the main paper for a description of how these representations are used.
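The two masked inputs described above can be sketched on a toy sentence. This example builds only the inputs: the LSTM, the fully connected layer, and the prediction loss are omitted, and the function names and 2-D embeddings are hypothetical.

```python
def average_masked_representation(embeddings, mask_index):
    """Drop the masked word, then average the remaining word embeddings."""
    kept = [vec for i, vec in enumerate(embeddings) if i != mask_index]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def sentence_level_input(tokens, mask_index, mask_token="[MASK]"):
    """Replace the masked word with [MASK], keeping the sentence length."""
    return [
        mask_token if i == mask_index else tok for i, tok in enumerate(tokens)
    ]


tokens = ["a", "dog", "runs"]
embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]  # toy 2-D word embeddings

# Average variant: "dog" is removed, so two embeddings are averaged.
avg_repr = average_masked_representation(embeddings, mask_index=1)

# Sentence-level variant: the sequence keeps all three positions.
masked_tokens = sentence_level_input(tokens, mask_index=1)
```

The first variant discards word order entirely, while the second keeps the position of the masked word available to a sequence model.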

Figure 6: Variants of masking used in the MCLM module

7.5 Extended Image-Sentence Retrieval Results

We provide all recall values (Recall@K for K ∈ {1, 5, 10}) for all ten languages on image-sentence retrieval with MSCOCO and Multi30K. I-to-S signifies the image-to-sentence retrieval direction, and S-to-I the sentence-to-image direction. We shorten “Language-Agnostic” to “LA” and CLC-A, CLC-C to A and C, respectively, due to space constraints. Lastly, “Prior” refers to prior work, “Adapted” refers to prior work that has been adapted to our testing scenario using the authors’ publicly available code, and “Ours” refers to our SMALR model variants. The number preceding a model refers to the number of languages it was trained on, e.g. (3-4) MULE signifies MULE [21] trained on three languages (English, Chinese, Japanese) on MSCOCO, and four on Multi30K (English, German, French, Czech).
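For reference, the sketch below computes Recall@K from ground-truth ranks and then the mean recall (mR). It assumes that mR is the mean of the six recall values (R@1, R@5, and R@10 in both directions), an interpretation that reproduces the mR entries in Table 7. The function names and toy ranks are hypothetical.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item is in the top k.

    `ranks` holds the 1-indexed rank of the ground-truth item per query.
    """
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def mean_recall(i2s, s2i):
    """Mean of the six recall values: R@1/5/10 in both directions."""
    values = list(i2s) + list(s2i)
    return sum(values) / len(values)


# Toy ranks of the ground-truth item for four queries.
ranks = [1, 3, 7, 12]
r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))

# "Trans. to En" on MSCOCO, copied from Table 7.
mr = round(mean_recall([58.6, 86.5, 94.1], [45.5, 79.6, 89.5]), 1)
```

The last line recovers the mR reported for that row, which supports reading mR as a plain average over both retrieval directions.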

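| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Prior | | | | | | | | | | | | | | |
| Trans. to En | 58.6 | 86.5 | 94.1 | 45.5 | 79.6 | 89.5 | 75.6 | 58.3 | 82.9 | 90.4 | 41.7 | 72.0 | 81.2 | 71.1 |
| EmbN | 61.8 | 87.6 | 94.1 | 47.5 | 79.8 | 89.8 | 76.8 | 57.9 | 84.5 | 90.9 | 44.3 | 72.7 | 84.7 | 72.0 |
| PAR. EmbN | 63.1 | 89.1 | 94.1 | 49.2 | 82.5 | 91.5 | 78.3 | 52.4 | 80.1 | 87.7 | 41.6 | 71.5 | 80.7 | 69.0 |
| (3-4) MULE | 63.9 | 90.2 | 95.8 | 50.9 | 83.5 | 92.4 | 79.5 | 54.2 | 82.0 | 89.9 | 41.9 | 72.5 | 81.1 | 70.3 |
| (b) Adapted | | | | | | | | | | | | | | |
| (1) S-LIWE | 66.8 | 91.2 | 96.6 | 52.4 | 85.1 | 93.5 | 80.9 | 65.5 | 88.9 | 95.1 | 46.9 | 77.2 | 84.5 | 76.3 |
| (2) S-LIWE | 62.3 | 87.3 | 94.6 | 48.3 | 80.7 | 91.0 | 77.4 | 64.5 | 88.1 | 94.3 | 46.4 | 75.8 | 84.5 | 75.6 |
| (10) S-LIWE | 61.8 | 88.2 | 94.8 | 47.9 | 80.3 | 90.5 | 77.3 | 64.9 | 87.3 | 92.7 | 45.8 | 76.3 | 84.2 | 75.2 |
| (10) L-LIWE | 63.8 | 90.2 | 95.6 | 50.1 | 82.9 | 92.2 | 79.1 | 63.6 | 87.6 | 93.5 | 45.6 | 76.0 | 84.1 | 75.1 |
| (10) MULE | 63.8 | 88.9 | 95.5 | 50.5 | 83.2 | 92.0 | 79.0 | 55.2 | 82.1 | 90.7 | 42.2 | 72.2 | 81.8 | 70.7 |
| (c) Ours | | | | | | | | | | | | | | |
| LA | 56.4 | 84.9 | 92.3 | 46.0 | 80.5 | 90.2 | 75.0 | 48.1 | 77.2 | 86.9 | 36.5 | 67.1 | 77.2 | 65.5 |
| HEM | 61.6 | 89.0 | 95.4 | 50.5 | 83.3 | 92.4 | 78.7 | 51.3 | 79.9 | 88.4 | 41.8 | 72.1 | 81.5 | 69.2 |
| SMALR | 62.9 | 89.2 | 95.8 | 51.1 | 84.0 | 92.5 | 79.3 | 52.0 | 81.1 | 88.4 | 41.8 | 72.4 | 82.1 | 69.6 |
| SMALR-A | 66.6 | 91.1 | 97.3 | 52.8 | 85.7 | 93.4 | 81.2 | 59.4 | 83.7 | 90.2 | 47.5 | 77.5 | 86.1 | 74.1 |
| SMALR-C | 66.5 | 91.3 | 97.5 | 53.6 | 86.2 | 94.0 | 81.5 | 60.2 | 83.8 | 91.0 | 47.9 | 77.9 | 86.3 | 74.5 |

Table 7: English bidirectional image-sentence retrieval results using human-generated sentences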
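
| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Prior | | | | | | | | | | | | | | |
| Trans. to En | - | - | - | - | - | - | - | 34.1 | 60.4 | 71.1 | 19.6 | 47.4 | 58.5 | 48.5 |
| EmbN | - | - | - | - | - | - | - | 46.6 | 73.9 | 82.2 | 31.3 | 59.1 | 69.0 | 60.3 |
| PAR. EmbN | - | - | - | - | - | - | - | 46.1 | 76.3 | 83.2 | 34.4 | 62.5 | 73.0 | 62.6 |
| MULE | - | - | - | - | - | - | - | 49.7 | 77.7 | 85.7 | 34.6 | 63.4 | 73.5 | 64.1 |
| (b) Adapted | | | | | | | | | | | | | | |
| (1) S-LIWE | - | - | - | - | - | - | - | 61.1 | 86.6 | 92.7 | 42.0 | 69.9 | 80.0 | 72.1 |
| (2) S-LIWE | - | - | - | - | - | - | - | 51.2 | 80.2 | 88.4 | 35.7 | 65.7 | 75.2 | 66.1 |
| (10) S-LIWE | 49.8 | 79.1 | 87.3 | 36.6 | 69.4 | 82.4 | 67.4 | 49.6 | 79.5 | 87.4 | 34.5 | 64.5 | 74.8 | 65.1 |
| (10) L-LIWE | 52.1 | 84.9 | 92.6 | 39.3 | 73.4 | 85.0 | 71.2 | 50.2 | 78.8 | 87.1 | 35.6 | 64.0 | 74.3 | 65.0 |
| (10) MULE | 59.1 | 88.7 | 94.9 | 48.5 | 81.3 | 90.6 | 77.2 | 45.8 | 75.8 | 85.2 | 35.1 | 64.6 | 75.3 | 63.6 |
| (c) Ours | | | | | | | | | | | | | | |
| LA | 54.4 | 86.2 | 93.1 | 44.5 | 78.6 | 88.7 | 74.3 | 44.0 | 75.4 | 85.1 | 32.2 | 59.7 | 71.0 | 61.3 |
| HEM | 59.2 | 87.2 | 95.1 | 49.1 | 81.8 | 91.4 | 77.3 | 49.2 | 75.4 | 83.2 | 34.5 | 62.0 | 72.4 | 62.8 |
| SMALR | 61.2 | 89.2 | 96.2 | 49.6 | 82.3 | 91.8 | 78.4 | 49.9 | 75.8 | 85.0 | 36.9 | 65.4 | 75.4 | 64.7 |
| SMALR-A | - | - | - | - | - | - | - | 53.0 | 77.6 | 85.8 | 41.9 | 72.9 | 82.3 | 68.9 |
| SMALR-C | - | - | - | - | - | - | - | 52.9 | 78.8 | 87.0 | 42.6 | 74.2 | 83.1 | 69.8 |

Table 8: German bidirectional image-sentence retrieval results using sentences translated from English into German for testing on MSCOCO and human-generated sentences on Multi30K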
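
| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Prior | | | | | | | | | | | | | | |
| Trans. to En | - | - | - | - | - | - | - | 22.5 | 52.5 | 63.0 | 25.1 | 53.1 | 63.9 | 46.7 |
| EmbN | - | - | - | - | - | - | - | 31.0 | 60.4 | 71.0 | 35.2 | 60.3 | 70.8 | 54.8 |
| PAR. EmbN | - | - | - | - | - | - | - | 37.6 | 66.0 | 77.4 | 37.8 | 66.4 | 78.2 | 60.6 |
| MULE | - | - | - | - | - | - | - | 38.0 | 68.4 | 80.0 | 38.2 | 68.9 | 80.3 | 62.3 |
| (b) Adapted | | | | | | | | | | | | | | |
| (10) S-LIWE | 50.8 | 79.3 | 90.4 | 36.5 | 70.7 | 83.2 | 68.5 | 25.8 | 66.8 | 78.9 | 25.6 | 49.0 | 57.7 | 50.6 |
| (10) L-LIWE | 51.8 | 81.3 | 92.2 | 39.0 | 73.1 | 84.7 | 70.3 | 24.5 | 68.4 | 80.6 | 26.0 | 49.2 | 57.9 | 51.1 |
| (10) MULE | 60.3 | 86.9 | 94.3 | 47.8 | 81.3 | 90.4 | 76.8 | 39.2 | 70.9 | 80.7 | 38.8 | 70.5 | 80.2 | 63.4 |
| (c) Ours | | | | | | | | | | | | | | |
| LA | 54.8 | 83.6 | 92.6 | 44.8 | 79.4 | 89.7 | 74.1 | 35.1 | 65.8 | 76.0 | 39.5 | 65.6 | 77.2 | 59.9 |
| HEM | 57.6 | 87.0 | 94.0 | 48.0 | 80.7 | 91.1 | 76.4 | 38.1 | 70.5 | 80.6 | 40.2 | 69.5 | 80.6 | 63.3 |
| SMALR | 59.6 | 89.7 | 95.9 | 48.7 | 81.9 | 91.0 | 77.8 | 40.6 | 70.7 | 81.8 | 41.1 | 71.8 | 80.7 | 64.5 |
| SMALR-A | - | - | - | - | - | - | - | 40.3 | 73.4 | 80.9 | 42.2 | 72.8 | 81.8 | 65.2 |
| SMALR-C | - | - | - | - | - | - | - | 41.1 | 73.4 | 82.5 | 42.6 | 73.0 | 82.9 | 65.9 |

Table 9: French bidirectional image-sentence retrieval results using sentences translated from English into French for testing on MSCOCO and human-generated sentences on Multi30K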
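
| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Prior | | | | | | | | | | | | | | |
| Trans. to En | - | - | - | - | - | - | - | 23.0 | 50.9 | 64.7 | 25.1 | 53.4 | 64.2 | 46.9 |
| EmbN | - | - | - | - | - | - | - | 26.2 | 51.3 | 62.5 | 26.8 | 50.3 | 60.8 | 46.3 |
| PAR. EmbN | - | - | - | - | - | - | - | 31.4 | 58.2 | 70.1 | 33.1 | 60.4 | 71.6 | 54.1 |
| MULE | - | - | - | - | - | - | - | 34.3 | 63.2 | 74.2 | 35.3 | 63.6 | 75.5 | 57.7 |
| (b) Adapted | | | | | | | | | | | | | | |
| (10) S-LIWE | 46.8 | 79.8 | 90.3 | 34.6 | 68.2 | 82.0 | 66.9 | 30.3 | 71.8 | 82.5 | 28.0 | 51.2 | 59.9 | 53.9 |
| (10) L-LIWE | 50.7 | 82.3 | 92.1 | 37.6 | 72.8 | 84.8 | 70.1 | 29.7 | 72.9 | 83.4 | 29.4 | 52.5 | 60.2 | 54.7 |
| (10) MULE | 61.6 | 88.7 | 94.8 | 48.8 | 81.5 | 91.1 | 77.8 | 37.0 | 66.3 | 76.4 | 37.5 | 64.6 | 74.8 | 59.4 |
| (c) Ours | | | | | | | | | | | | | | |
| LA | 55.3 | 84.6 | 92.4 | 43.5 | 76.9 | 87.8 | 73.4 | 31.0 | 59.6 | 71.1 | 32.5 | 58.5 | 71.5 | 54.0 |
| HEM | 59.9 | 88.4 | 95.4 | 49.2 | 82.5 | 91.7 | 77.9 | 35.0 | 66.9 | 77.4 | 36.1 | 67.4 | 77.2 | 60.0 |
| SMALR | 63.2 | 89.6 | 95.7 | 49.2 | 82.4 | 91.6 | 78.6 | 36.5 | 69.0 | 78.0 | 36.7 | 68.0 | 78.2 | 61.1 |
| SMALR-A | - | - | - | - | - | - | - | 41.1 | 70.7 | 80.4 | 39.9 | 71.8 | 83.0 | 64.5 |
| SMALR-C | - | - | - | - | - | - | - | 41.9 | 70.7 | 81.1 | 40.5 | 71.7 | 82.8 | 64.8 |

Table 10: Czech bidirectional image-sentence retrieval results using sentences translated from English into Czech for testing on MSCOCO and human-generated sentences on Multi30K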
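
| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Prior | | | | | | | | | | | | | | |
| Trans. to En | 45.9 | 79.8 | 89.2 | 47.8 | 81.1 | 89.4 | 72.2 | - | - | - | - | - | - | - |
| EmbN | 49.6 | 81.6 | 90.0 | 47.8 | 82.1 | 90.0 | 73.5 | - | - | - | - | - | - | - |
| PAR. EmbN | 47.9 | 81.4 | 91.1 | 47.5 | 81.6 | 91.2 | 73.5 | - | - | - | - | - | - | - |
| MULE | 51.1 | 82.6 | 91.6 | 49.1 | 82.4 | 91.9 | 74.8 | - | - | - | - | - | - | - |
| (b) Adapted | | | | | | | | | | | | | | |
| (1) S-LIWE | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| (2) S-LIWE | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| (10) S-LIWE | 45.1 | 76.4 | 88.1 | 32.7 | 66.0 | 79.6 | 64.5 | 38.5 | 69.7 | 79.2 | 24.2 | 50.3 | 61.7 | 53.9 |
| (10) L-LIWE | 51.4 | 82.6 | 91.3 | 38.1 | 72.2 | 84.6 | 70.0 | 41.2 | 71.6 | 82.0 | 24.3 | 51.8 | 63.9 | 55.8 |
| (10) MULE | 50.8 | 84.0 | 92.5 | 50.3 | 83.6 | 92.4 | 75.6 | 47.4 | 77.0 | 85.8 | 35.4 | 64.9 | 74.4 | 64.2 |
| (c) Ours | | | | | | | | | | | | | | |
| LA | 46.0 | 79.6 | 90.7 | 45.9 | 80.6 | 91.1 | 72.3 | 42.2 | 72.0 | 81.6 | 30.6 | 59.8 | 70.0 | 59.4 |
| HEM | 53.2 | 85.0 | 93.2 | 51.3 | 84.6 | 93.0 | 76.7 | 44.1 | 74.7 | 84.4 | 33.8 | 63.3 | 74.4 | 62.4 |
| SMALR | 51.2 | 86.5 | 93.8 | 50.6 | 84.7 | 93.3 | 76.7 | 45.8 | 77.0 | 85.0 | 35.8 | 65.1 | 75.5 | 64.0 |
| SMALR-A | 57.5 | 87.3 | 94.9 | 54.8 | 87.7 | 95.2 | 79.6 | - | - | - | - | - | - | - |
| SMALR-C | 58.0 | 87.8 | 95.4 | 55.3 | 88.2 | 95.7 | 80.1 | - | - | - | - | - | - | - |

Table 11: Chinese bidirectional image-sentence retrieval results using sentences translated from English into Chinese for testing on Multi30K and human-generated sentences on MSCOCO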
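| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Prior | | | | | | | | | | | | | | |
| Trans. to En | 44.8 | 74.3 | 85.4 | 36.9 | 71.0 | 84.7 | 66.1 | | | | | | | |
| EmbN | 56.0 | 83.7 | 90.7 | 45.5 | 77.2 | 87.3 | 73.2 | | | | | | | |
| PAR. EmbN | 60.1 | 86.0 | 92.8 | 47.7 | 79.6 | 89.7 | 76.0 | | | | | | | |
| MULE | 59.6 | 86.5 | 92.8 | 47.8 | 80.8 | 90.1 | 76.3 | | | | | | | |
| (b) Adapted | | | | | | | | | | | | | | |
| (1) S-LIWE | 57.2 | 85.0 | 93.2 | 42.2 | 76.4 | 87.6 | 73.6 | | | | | | | |
| (2) S-LIWE | 45.3 | 78.2 | 89.5 | 36.4 | 68.9 | 81.2 | 66.6 | | | | | | | |
| (10) S-LIWE | 45.9 | 77.9 | 88.2 | 34.1 | 67.5 | 81.2 | 65.8 | 40.9 | 70.6 | 81.7 | 26.1 | 53.0 | 63.8 | 56.0 |
| (10) L-LIWE | 51.5 | 81.4 | 90.2 | 39.1 | 71.4 | 84.3 | 69.6 | 39.9 | 70.6 | 80.6 | 25.7 | 51.8 | 63.4 | 55.3 |
| (10) MULE | 59.4 | 85.2 | 93.0 | 47.4 | 80.1 | 90.2 | 75.9 | 49.9 | 80.2 | 87.7 | 38.1 | 69.3 | 78.6 | 67.3 |
| (c) Ours | | | | | | | | | | | | | | |
| LA | 51.4 | 83.3 | 90.3 | 42.4 | 76.8 | 88.1 | 72.1 | 48.4 | 77.2 | 85.7 | 35.5 | 65.1 | 76.5 | 64.7 |
| HEM | 56.8 | 86.3 | 93.8 | 47.7 | 81.7 | 91.7 | 76.3 | 48.9 | 78.4 | 86.0 | 38.4 | 68.0 | 78.3 | 66.3 |
| SMALR | 60.4 | 86.4 | 94.3 | 48.5 | 82.2 | 91.2 | 77.2 | 46.8 | 79.1 | 87.6 | 38.8 | 69.1 | 78.8 | 66.7 |
| SMALR-A | 60.0 | 84.5 | 92.9 | 45.9 | 78.3 | 88.6 | 75.0 | | | | | | | |
| SMALR-C | 61.9 | 86.4 | 94.0 | 49.3 | 81.9 | 91.3 | 77.5 | | | | | | | |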
Table 12: Japanese bidirectional image-sentence retrieval results using sentences translated from English into Japanese for testing on Multi30K and human-generated sentences on MSCOCO
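| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Adapted | | | | | | | | | | | | | | |
| (10) S-LIWE | 43.4 | 75.6 | 86.3 | 33.1 | 65.7 | 78.6 | 63.8 | 43.5 | 75.3 | 85.0 | 32.0 | 60.4 | 71.4 | 61.3 |
| (10) L-LIWE | 49.1 | 81.4 | 90.3 | 34.4 | 68.4 | 81.0 | 67.5 | 49.0 | 78.1 | 86.9 | 34.0 | 63.3 | 73.6 | 64.2 |
| (10) MULE | 60.3 | 88.3 | 94.6 | 47.9 | 81.2 | 90.7 | 77.2 | 48.6 | 78.2 | 87.4 | 36.7 | 66.8 | 76.9 | 65.8 |
| (b) Ours | | | | | | | | | | | | | | |
| LA | 56.1 | 85.5 | 93.6 | 44.0 | 78.3 | 88.7 | 74.4 | 44.7 | 78.1 | 85.6 | 34.5 | 65.2 | 75.3 | 63.9 |
| HEM | 58.4 | 87.9 | 94.9 | 47.6 | 81.5 | 91.4 | 77.0 | 45.9 | 76.8 | 85.6 | 36.3 | 66.2 | 76.2 | 64.5 |
| SMALR | 60.1 | 89.0 | 95.7 | 48.6 | 81.9 | 91.9 | 77.9 | 46.2 | 78.6 | 87.4 | 38.3 | 67.7 | 77.9 | 66.0 |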
Table 13: Arabic bidirectional image-sentence retrieval results using sentences translated from English into Arabic for testing
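| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Adapted | | | | | | | | | | | | | | |
| (10) S-LIWE | 46.7 | 79.1 | 88.8 | 35.0 | 67.4 | 80.2 | 66.2 | 47.3 | 77.0 | 85.2 | 32.8 | 60.8 | 70.7 | 62.3 |
| (10) L-LIWE | 49.9 | 82.2 | 91.8 | 36.9 | 70.6 | 82.4 | 68.9 | 46.6 | 76.8 | 86.7 | 33.6 | 61.1 | 71.3 | 62.7 |
| (10) MULE | 62.4 | 88.1 | 94.8 | 48.7 | 81.5 | 91.0 | 77.8 | 51.3 | 80.2 | 87.7 | 39.0 | 67.7 | 77.7 | 67.3 |
| (b) Ours | | | | | | | | | | | | | | |
| LA | 55.2 | 85.1 | 92.7 | 45.7 | 79.9 | 89.5 | 74.7 | 51.5 | 78.9 | 86.5 | 37.8 | 67.2 | 77.2 | 66.5 |
| HEM | 59.8 | 86.4 | 93.9 | 47.5 | 81.2 | 91.2 | 76.7 | 47.6 | 79.3 | 87.4 | 38.7 | 69.2 | 78.9 | 66.8 |
| SMALR | 62.5 | 88.7 | 95.9 | 48.8 | 82.2 | 91.4 | 78.2 | 48.7 | 79.7 | 87.5 | 40.5 | 68.8 | 79.1 | 67.4 |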
Table 14: Afrikaans bidirectional image-sentence retrieval results using sentences translated from English into Afrikaans for testing
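| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Adapted | | | | | | | | | | | | | | |
| (10) S-LIWE | 42.4 | 76.4 | 86.9 | 31.9 | 63.6 | 77.4 | 63.1 | 39.9 | 70.1 | 81.2 | 25.1 | 51.2 | 62.8 | 55.1 |
| (10) L-LIWE | 48.1 | 79.8 | 89.5 | 33.4 | 66.3 | 79.9 | 66.2 | 37.5 | 70.3 | 81.2 | 24.3 | 50.8 | 62.4 | 54.4 |
| (10) MULE | 56.5 | 85.6 | 93.5 | 43.9 | 78.0 | 88.5 | 74.3 | 47.1 | 76.1 | 85.3 | 35.0 | 63.7 | 74.6 | 63.6 |
| (b) Ours | | | | | | | | | | | | | | |
| LA | 51.9 | 85.0 | 92.2 | 40.2 | 73.8 | 86.4 | 71.6 | 43.3 | 73.4 | 83.0 | 31.3 | 59.4 | 71.1 | 60.3 |
| HEM | 57.0 | 85.8 | 94.6 | 46.2 | 79.2 | 90.0 | 75.5 | 44.4 | 76.6 | 85.4 | 32.9 | 62.0 | 72.2 | 62.3 |
| SMALR | 55.7 | 86.9 | 94.8 | 45.2 | 78.8 | 89.4 | 75.1 | 45.7 | 78.2 | 85.5 | 35.2 | 64.8 | 75.5 | 64.2 |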
Table 15: Korean bidirectional image-sentence retrieval results using sentences translated from English into Korean for testing
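| Model | MSCOCO I-to-S r@1 | MSCOCO I-to-S r@5 | MSCOCO I-to-S r@10 | MSCOCO S-to-I r@1 | MSCOCO S-to-I r@5 | MSCOCO S-to-I r@10 | MSCOCO mR | Multi30K I-to-S r@1 | Multi30K I-to-S r@5 | Multi30K I-to-S r@10 | Multi30K S-to-I r@1 | Multi30K S-to-I r@5 | Multi30K S-to-I r@10 | Multi30K mR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Adapted | | | | | | | | | | | | | | |
| (10) S-LIWE | 44.7 | 76.1 | 86.5 | 31.4 | 64.5 | 78.2 | 63.6 | 48.0 | 78.7 | 87.5 | 34.3 | 63.3 | 73.7 | 64.2 |
| (10) L-LIWE | 50.7 | 83.6 | 91.5 | 36.7 | 71.7 | 83.0 | 69.6 | 45.7 | 78.2 | 86.1 | 34.4 | 63.9 | 74.5 | 63.8 |
| (10) MULE | 60.8 | 89.0 | 94.9 | 48.0 | 80.4 | 90.4 | 77.3 | 48.3 | 78.6 | 86.2 | 37.1 | 65.8 | 76.2 | 65.4 |
| (b) Ours | | | | | | | | | | | | | | |
| LA | 53.7 | 85.0 | 92.1 | 42.4 | 75.8 | 87.4 | 72.7 | 42.2 | 72.7 | 82.8 | 31.6 | 60.7 | 71.7 | 60.3 |
| HEM | 58.3 | 87.5 | 94.4 | 48.5 | 81.8 | 91.7 | 77.0 | 45.6 | 75.0 | 83.6 | 35.3 | 63.2 | 73.1 | 62.6 |
| SMALR | 62.7 | 88.8 | 95.0 | 48.2 | 81.7 | 91.5 | 78.0 | 48.4 | 77.3 | 86.0 | 38.0 | 67.2 | 77.5 | 65.7 |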
Table 16: Russian bidirectional image-sentence retrieval results using sentences translated from English into Russian for testing
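
The mR column in the tables above is consistent with the arithmetic mean of the six recall values (r@1, r@5, r@10 for image-to-sentence and for sentence-to-image retrieval). The following is a minimal Python sketch of that relationship, not the authors' evaluation code; the helper name `mean_recall` is hypothetical, and the inputs are copied from the SMALR row of Table 16.

```python
def mean_recall(i2s, s2i):
    """Average the six Recall@K values (K = 1, 5, 10) over both directions.

    i2s: [r@1, r@5, r@10] for image-to-sentence retrieval
    s2i: [r@1, r@5, r@10] for sentence-to-image retrieval
    """
    values = list(i2s) + list(s2i)
    if len(values) != 6:
        raise ValueError("expected three recall values per direction")
    return sum(values) / len(values)


# SMALR row of Table 16 (Russian), MSCOCO and Multi30K halves.
mscoco = mean_recall([62.7, 88.8, 95.0], [48.2, 81.7, 91.5])
multi30k = mean_recall([48.4, 77.3, 86.0], [38.0, 67.2, 77.5])

print(f"{mscoco:.1f} {multi30k:.1f}")  # prints "78.0 65.7", matching the mR columns
```

Because the tables report recall values already rounded to one decimal place, a recomputed mR can occasionally differ from the printed one in the last digit.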

7.6 Testing with Machine Translations

In this section, we investigate how testing with machine translations, rather than human-generated sentences, affects the comparison between methods. For all methods, we use models trained on all 10 languages and test on both human-generated and translated sentences for Chinese and Japanese on MSCOCO and for German on Multi30K.

As Table 17 shows, the two kinds of test sentences yield only minor differences in performance for each language we tested: mR changes by at most 3.2 points (HEM on Chinese). Notably, the method rankings on MSCOCO are identical whether the methods are evaluated on human-generated test sentences or on test sentences translated from English. On Multi30K the rankings differ only in that SMALR and L-LIWE swap second and third place.

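| Model | MSCOCO mR (Cn) | MSCOCO mR (Ja) | MSCOCO mR (Avg) | MSCOCO Rank | Multi30K mR (De) | Multi30K Rank |
| --- | --- | --- | --- | --- | --- | --- |
| (a) Human-generated test sentences | | | | | | |
| (10) S-LIWE [39] | 64.5 | 65.8 | 65.2 | 6 | 65.1 | 1 |
| (10) L-LIWE | 70.0 | 69.6 | 69.8 | 5 | 65.0 | 2 |
| MULE [21] | 75.6 | 75.9 | 75.8 | 3 | 63.6 | 4 |
| Language-Agnostic | 72.3 | 72.1 | 72.2 | 4 | 61.3 | 6 |
| HEM | 76.7 | 76.3 | 76.5 | 2 | 62.8 | 5 |
| SMALR | 76.7 | 77.2 | 76.9 | 1 | 64.7 | 3 |
| (b) Test sentences translated from En | | | | | | |
| (10) S-LIWE [39] | 64.7 | 65.8 | 65.2 | 6 | 65.5 | 1 |
| (10) L-LIWE | 70.0 | 69.6 | 69.8 | 5 | 64.6 | 3 |
| MULE [21] | 73.2 | 75.0 | 74.1 | 3 | 64.4 | 4 |
| Language-Agnostic | 69.7 | 71.4 | 70.6 | 4 | 61.7 | 6 |
| HEM | 73.5 | 75.3 | 74.4 | 2 | 63.8 | 5 |
| SMALR | 74.4 | 75.9 | 75.2 | 1 | 65.1 | 2 |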
Table 17: Evaluation on human-generated test sentences vs. test sentences translated from English
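
The rank columns of Table 17 can be rederived from its mR values. The sketch below is illustrative only: the helper names `rank` and `rank_changes` are hypothetical, "LA" abbreviates the Language-Agnostic model, and the numbers are the MSCOCO average mR and the Multi30K German mR copied from the table. It reports which methods change rank when human-generated test sentences are replaced by translations.

```python
def rank(scores):
    """Map each method to its rank, where 1 is the highest mR (assumes no ties)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_changes(human, translated):
    """Return {method: (rank on human, rank on translated)} for methods whose rank differs."""
    r_h, r_t = rank(human), rank(translated)
    return {m: (r_h[m], r_t[m]) for m in human if r_h[m] != r_t[m]}


# mR values copied from Table 17.
mscoco_human = {"S-LIWE": 65.2, "L-LIWE": 69.8, "MULE": 75.8,
                "LA": 72.2, "HEM": 76.5, "SMALR": 76.9}
mscoco_trans = {"S-LIWE": 65.2, "L-LIWE": 69.8, "MULE": 74.1,
                "LA": 70.6, "HEM": 74.4, "SMALR": 75.2}
multi30k_human = {"S-LIWE": 65.1, "L-LIWE": 65.0, "MULE": 63.6,
                  "LA": 61.3, "HEM": 62.8, "SMALR": 64.7}
multi30k_trans = {"S-LIWE": 65.5, "L-LIWE": 64.6, "MULE": 64.4,
                  "LA": 61.7, "HEM": 63.8, "SMALR": 65.1}

print(rank_changes(mscoco_human, mscoco_trans))      # {}
print(rank_changes(multi30k_human, multi30k_trans))  # {'L-LIWE': (2, 3), 'SMALR': (3, 2)}
```

An empty result means the two orderings agree exactly, which is the case for MSCOCO; on Multi30K only the second and third places are exchanged.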