Learning a good language representation is a fundamental component of addressing a vision-language task, such as phrase grounding [20, 32] or visual question answering [3, 15]. Many recent methods have demonstrated that learning text representations aligned to images can boost performance across many vision-language tasks over traditional text-only trained representations [7, 17, 27, 35, 36]. This is often accomplished by using auxiliary vision-language tasks when learning the language representation (such as image-sentence retrieval, as shown in Figure 1(a)). However, these methods often only support a single language. Although some work has addressed a multilingual scenario (e.g., [14, 21, 39]), these methods do not scale well to support many languages in terms of memory or performance (see Figure 1(b)). As the number of languages grows, methods like LIWE  that use character-based recognition systems can save memory but suffer from performance degradation. In contrast, methods that learn to align word embeddings across languages can maintain (or even improve) performance as languages are added (e.g., [14, 21]), but require additional parameters for the word embeddings that represent each new language’s vocabulary. This becomes a challenge when scaling to support many languages, as an increasing majority of trainable parameters are required for representing each language (e.g. 93% of parameters of  with ten languages). While pretrained word embeddings could be used without fine-tuning, e.g. Multilingual BERT  or MUSE , this comes at a significant cost in downstream task performance [7, 21].
To address this trade-off between multilingual capacity and performance, we propose a Scalable Multilingual Aligned Language Representation (SMALR) model, which we demonstrate achieves strong task performance while also being highly compact compared to state-of-the-art word embedding methods [11, 22, 24]. As seen in Figure 1, LIWE drops over 10% in performance going from supporting one to ten languages. MULE slightly increases performance with more languages, but requires 6x more parameters compared to its single language model. Our approach, SMALR, outperforms both with only 1/5th the parameters of MULE. We learn to efficiently represent each language by separating our language embedding into language-specific and language-agnostic token representations. As language follows a long-tailed distribution, only a few words occur often, with large portions of tokens occurring very rarely. For example, in the MSCOCO dataset  there are 25,126 unique tokens, but 61% of them occur less than 4 times. This suggests that having unique representations for every token in the vocabulary is unnecessary, as only a subset would affect downstream task performance significantly. Thus, we use a Hybrid Embedding Model (HEM) that contains language-specific embeddings for the common tokens, thereby providing a good representation for each language, and a compact language-agnostic representation for rare and uncommon words. This results in a model that needs far fewer unique embeddings than prior work without sacrificing performance.
We learn how to assign tokens to the language-agnostic representation in a pretraining step, which uses monolingual FastText embeddings  to map similar words to the same token, e.g. mapping “double-decker” in English and “impériale” in French to the same shared token. Once we obtain our language embeddings, our goal is to align them so that semantically similar words, even those from other languages, are embedded nearby. To accomplish this, we use a multilingual masked language model, where we randomly mask words and then predict them based on context. Unlike similar masking approaches used to train models such as BERT 
, we mask words of sentences in two languages, say German and Chinese, which are semantically similar sentences referring to the same image, and use the context from each to predict both masked tokens. To further encourage cross-language alignment, we also use an adversarial language classifier and neighborhood constraints that have been used in prior work. These universal language embeddings are provided as input to a multimodal model that learns to relate them to images. Finally, we use a cross-lingual consistency module that uses machine translations to reason about image-sentence similarity across multiple languages, which we show significantly boosts performance. Figure 2 contains an overview of our model.
We use bidirectional image-sentence retrieval as the primary evaluation of our multilingual language representation. In this task, the goal is to retrieve a relevant sentence from a database given an image or to retrieve a relevant image from a database given a sentence. We augment current multilingual datasets Multi30K [5, 12, 13, 41] and MSCOCO [26, 25, 29] using machine translations so that every image has at least five sentences across ten diverse languages: English (En), German (De), French (Fr), Czech (Cs), Chinese (Cn), Japanese (Ja), Arabic (Ar), Afrikaans (Af), Korean (Ko), and Russian (Ru). See the supplementary for a breakdown of our data augmentation procedure. This constitutes the highest number of languages used in multilingual learning for vision-language tasks to date, supporting more than double the number of visually-semantically aligned languages compared to prior work [4, 9, 14, 21, 34, 39].
We list the contributions of our work below:
SMALR, a scalable multilingual model for training visually-semantically aligned word embeddings that outperforms the state-of-the-art on multilingual image-sentence retrieval while also requiring few model parameters.
A comparison to four types of vocabulary reduction methods that serve as baselines to complement our evaluation against prior work.
A Masked Cross-Language Modeling (MCLM) procedure that further aligns the multilingual embedding, stabilizing variance in performance over all languages, and serves as an additional data augmentation technique.
A Cross-Lingual Consistency (CLC) module, the first of its kind, that learns how to aggregate an ensemble of predictions across languages made with machine translations, which, combined with our SMALR architecture, results in a total improvement over the state-of-the-art by 3-4%.
2 Related Work
Transformer-based representation learning models have become prominent in the natural language processing literature since the release of BERT. BERT transfers surprisingly well to other languages, despite having no multilingual training data or explicit multilingual loss . However,  demonstrates that there is unequal transfer between different language pairs, notably those with typological differences from English. Both BERT and M-BERT, its multilingual extension, have been shown to be dependent on the depth and number of parameters in the model, which reaches 110M parameters for the smaller base model . Thus, as also shown in , a large number of additional parameters are needed to counter the performance degradation introduced when training with many languages. Using the better-performing large BERT model is impractical for many vision-language tasks since it contains 340M parameters, leaving little room in GPU memory for anything else.
Along with language-only BERT variants, a burst of multimodal BERT-like models have been designed specifically for vision-language tasks [24, 27, 35, 36]. More traditional word embedding models have also been designed for multimodal tasks with the use of either visual-word co-occurrence frequencies , multi-task training , or both , and require significantly less training data to reach similar performance. While these efforts evaluate on many multimodal tasks such as Visual Question Answering , Visual Commonsense Reasoning , Phrase Grounding , and more, they only train and evaluate on a single language.
Recently there have been several multilingual methods that have shown better performance on vision-language tasks than complicated transformer-based methods. LIWE 
is a light-weight character embedding model that can represent many languages with few model parameters. LIWE uses a bidirectional gated recurrent unit (GRU) to aggregate 24-D character embeddings for a text query that is encouraged to embed near semantically similar images and sentences in other languages. Although LIWE represents a single language well, it suffers from significant performance loss when co-training on multiple languages, as shown in Figure 1(b). Gella et al.  learn how to relate an image to language-specific representations, which also constrains semantically similar sentences across languages to embed nearby each other. MULE  learns a universal language embedding so that it can use a single language branch in the multimodal model, significantly reducing the number of parameters required to represent each language compared to Gella et al. In addition, MULE combines the same cross-lingual constraints used in Gella et al. and LIWE with an adversarial language classifier to further encourage alignment across languages. This results in a model that slightly improves performance as more languages are added, as shown in Figure 1(b). However, because MULE learns a word-level embedding, it still requires significantly more parameters than LIWE (approximately eight times more with ten languages), so capacity concerns remain when scaling to many languages.
3 Scalable Multilingual Aligned Language Representation
In this section we describe how we train our Scalable Multilingual Aligned Language Representation (SMALR) to bridge the gap between scalability and downstream vision-language task performance. To accomplish this, we assume we are provided with an image and sentences that describe it in multiple languages. The intuition behind our model is to first learn a universal language embedding which represents all languages, and then learn to relate the universal embedding to corresponding images using a multimodal model. In our experiments our multimodal model uses a modified version  of the Embedding Network architecture 
, although our approach can be easily adapted to use other multimodal models. After obtaining image and sentence features, the Embedding Network uses two branches, one for each modality, and projects them into a joint semantic space where distances are meaningful. The image branch consists of two fully connected layers, while the language branch encodes each word using a GRU, and then passes the final hidden representation through a fully connected layer to obtain a sentence representation.
Our approach is architecturally similar to MULE , but with some notable distinctions. First, MULE learned a unique word embedding for every word in every language (i.e., no shared tokens), whereas we learn an efficient universal embedding with our Hybrid Embedding Model (HEM) that consists of a mix of language-agnostic and language-specific word representations (Section 3.1). Then, we learn to align our language representations both at the input of the multimodal model (i.e., the universal language embedding) and at the final language representation of the multimodal model using a novel Masked Cross-Language Model (MCLM) (Section 3.2). This acts to supplement the neighborhood constraints, adversarial language classifier, and image-sentence matching losses used by MULE, which we briefly review in Section 3.3. Finally, we also propose a Cross-Lingual Consistency (CLC) module that boosts model performance in downstream vision-language tasks using machine translation (Section 3.4). See Fig. 2 for an overview of our approach.
3.1 Efficient Multilingual Learning with a Hybrid Embedding Model
A significant challenge in multilingual representation learning is scaling to multiple languages, especially when there is a wide disparity in the available training data of different languages. This is more apparent for vision-language tasks where annotations are very expensive to collect, making it even more difficult to learn a good visually-semantically aligned language representation like those from monolingual settings [7, 24]
. Inspired by work in low-resource neural machine translation, we propose a Hybrid Embedding Model (HEM) which projects low-frequency words across languages into a shared latent vocabulary, while allowing the top-k most frequent words in each language to maintain their own unique (language-specific) representation. The output of the HEM module is the universal language embedding that is used as input to the multimodal model in Fig. 2 and is also used in the language alignment losses described in Section 3.2 and Section 3.3. The exact value of k can be determined experimentally for any targeted downstream vision-language task.
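The hybrid lookup can be sketched as follows. All names, sizes, and the token-to-slot maps here are illustrative assumptions; in the actual model the assignments are learned in pretraining and the tables are trained end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3       # top-K frequent tokens per language keep their own embedding
DIM = 4     # embedding dimensionality (512-D in the paper)
LATENT = 5  # shared latent vocabulary size (40K tokens in the paper)

lang_specific = {"en": rng.normal(size=(K, DIM))}  # per-language table
latent_shared = rng.normal(size=(LATENT, DIM))     # shared across languages

# (language, token) -> (table, row) maps; hypothetical, fixed here
vocab = {
    ("en", "dog"):           ("specific", 0),
    ("en", "double-decker"): ("latent", 2),  # rare word -> shared latent slot
    ("fr", "impériale"):     ("latent", 2),  # same slot as its English analog
}

def embed(lang, token):
    # frequent tokens use the language-specific table; rare ones the latent one
    kind, idx = vocab[(lang, token)]
    if kind == "specific":
        return lang_specific[lang][idx]
    return latent_shared[idx]
```

Because rare words from different languages can share a latent row, `embed("fr", "impériale")` and `embed("en", "double-decker")` return the same vector while only `LATENT` extra rows are stored.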
The language-specific word embeddings used for common words roughly follows the implementation used in prior work [16, 21]. We begin by using a monolingual pretrained FastText embedding  that has been PCA-reduced from 300-D to 50-D. These reduced features are used as input to a fully connected layer that projects them into a 512-D universal embedding space that we align across languages; the alignment is applied with the language-agnostic representations as well (see Section 3.2 and Section 3.3 for details on our language alignment procedures).
While our language-agnostic representation is similar to Gu et al. , it does have some key differences. Specifically, Gu et al. projects all words into the universal embedding space with learned language-specific mappings. A soft-attention module is used over the universal embedding features (as it assumes an aligned cross-lingual input) to obtain mixing weights; these weights are then used to combine the language-agnostic features. While this does enable feature sharing across languages, it does not reduce the total number of trainable parameters in the network, as a language-specific representation is still necessary for all words in the vocabulary. Additionally, aggregating all the features in the latent vocabulary using soft-attention weights per-word is costly, especially for large latent vocabularies. Instead, we perform a pretraining step where we learn both the initial representation of the latent vocabulary as well as how to assign the infrequent words to entries in it. We use a hard attention mechanism that is directly predicted from FastText features. This allows us to avoid both computing a language-specific representation for the uncommon words and aggregating the latent vocabulary features on a per-word basis.
To learn our latent shared vocabulary in the pretraining step, we train our model to embed semantically similar sentences in multiple languages near each other using a triplet loss. More formally, given a triplet of items $(x, y^{+}, y^{-})$ that can be decomposed into a positive pair $(x, y^{+})$ and a negative pair $(x, y^{-})$, a triplet loss is computed as:

$$L(x, y^{+}, y^{-}) = \max\left(0,\; m + D(x, y^{+}) - D(x, y^{-})\right) \quad (1)$$

where $D$ is a distance function and $m$ is a scalar margin parameter. We use cosine distance for all triplet losses. Following the methodology of [21, 38], we construct minibatches by providing semantically similar sentence pairs as input and consider any non-paired sentence as a negative example. Then, we enumerate all triplets in the minibatch and compute the loss over the top-K most violated constraints. Note that these sentences may not come from the same language, so semantically similar sentences in different languages are also used as positive pairs. We obtain representations for each sentence by feeding FastText embeddings into a fully connected layer, which is used to predict which latent embedding we map each source word to. Finally, we average the latent embeddings of each word, which has been shown to be an efficient and high-performing representation for vision-language tasks .
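A sketch of this triplet objective with cosine distance and top-K hard-negative mining. The margin and K defaults below are placeholders, not the paper's settings:

```python
import numpy as np

def cosine_dist(a, b):
    # cosine distance: 1 - cosine similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_loss(anchor, pos, neg, margin=0.1):
    # hinge over (margin + positive distance - negative distance), as in Eq. (1)
    return max(0.0, margin + cosine_dist(anchor, pos) - cosine_dist(anchor, neg))

def topk_violations(anchor, pos, negatives, margin=0.1, k=2):
    # enumerate all triplets sharing this (anchor, positive) pair and keep
    # only the k most violated constraints
    losses = sorted((triplet_loss(anchor, pos, n, margin) for n in negatives),
                    reverse=True)
    return sum(losses[:k])
```

In the full model the minibatch supplies the negatives: every non-paired sentence in the batch is treated as a negative for the current pair.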
Instead of deterministically mapping to the latent token which achieves the best score, we randomly choose from the top-n scoring tokens with probability p; we refer to n and p as exploration parameters. This helps ensure that spurious mappings are not learned, and typically results in a 2% performance improvement on downstream tasks (see supplementary for a detailed comparison). While we freeze the latent token assignments when training the full model, we allow the features themselves to be fine-tuned. Our experiments use a latent vocabulary size of 40K tokens. In practice not all latent tokens are used at the end of pretraining; unused tokens are dropped when training the full model.
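The exploration step might look like the following sketch, where `assign_latent_token` is a hypothetical name and n and p are the (unspecified) exploration parameters:

```python
import random

def assign_latent_token(scores, n=2, p=0.5, rng=None):
    # scores: {latent token index -> predicted assignment score} for one rare word
    rng = rng or random.Random(0)
    ranked = sorted(scores, key=scores.get, reverse=True)
    if rng.random() < p:
        # explore: pick uniformly among the top-n scoring latent tokens
        return rng.choice(ranked[:n])
    # exploit: deterministic argmax assignment
    return ranked[0]
```

With p = 0 this degenerates to the deterministic argmax mapping; a nonzero p occasionally routes a word to one of its runner-up latent tokens during pretraining.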
3.2 Masked Cross-Language Modeling (MCLM)
Masked Language Modeling has proven to be useful in training language representations by masking some tokens of an input sentence and then trying to predict the missing tokens . We present a generalization of this approach to a multilingual scenario to encourage stronger cross-language alignment. In MCLM, we assume we have paired sentences across different languages. These sentences need not be direct translations of each other; as our experiments will show, they simply need to be semantically related. This is important as vision-language datasets do not always have paired text queries that are direct translations of each other in other languages, but are often independently generated instead (e.g. [13, 25, 29]).
Traditional Masked Language Modeling makes predictions about a single masked token using its surrounding words as context. The words immediately surrounding a token referring to the same entity may vary significantly between sentences in different languages due to differences in grammar. Thus, even with a dictionary between languages to identify word correspondences, the immediate context may not be useful. Instead, our approach is based on the intuition that semantically similar sentences should contain comparable information across languages, so a sentence in one language can be used as context to predict missing information from a sentence in another language. More formally, for a pair of languages $(i, j)$ we obtain their sentence representations $(s_i, s_j)$, where both sentences describe the same image (i.e., they are semantically similar to each other). Then, we randomly replace some portion of their words with a special MASK token to obtain masked representations, which are concatenated together and fed into a fully connected layer, shared across language pairs, that predicts the missing information in both sentences, denoted $(t_i, t_j)$. Our MCLM loss then compares these predictions to the unmasked sentences, i.e.,

$$\mathcal{L}_{MCLM} = \left\lVert \bar{s}_i - \bar{t}_i \right\rVert_2 + \left\lVert \bar{s}_j - \bar{t}_j \right\rVert_2 \quad (2)$$

where the bar identifies vectors we force to have unit norm. We compute the masking loss described by Eq. (2) for all unique pairs of languages in our experiments, and found that masking 20% of the words in each sentence worked best.
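A minimal sketch of the masking step, assuming whitespace tokenization and a literal MASK string (in the model these are learned token vectors, not strings):

```python
import random

MASK = "<MASK>"

def mask_tokens(tokens, frac=0.2, rng=None):
    # replace roughly `frac` of the tokens with a MASK symbol (20% in the paper)
    rng = rng or random.Random(0)
    n = max(1, int(len(tokens) * frac))
    idx = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in idx else t for i, t in enumerate(tokens)]

# Paired, semantically similar sentences in two languages are each masked and
# then concatenated; a shared layer predicts the missing content in both.
de = "ein roter Doppeldeckerbus".split()      # German: "a red double-decker bus"
zh = ["一辆", "红色", "双层", "巴士"]            # Chinese: "a red double-decker bus"
pair_input = mask_tokens(de) + mask_tokens(zh)
```

Note the sentences need only be semantically related, mirroring datasets whose captions in different languages were written independently rather than translated.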
3.3 Multilingual Visual-Semantic Alignment
In this section we briefly review the visual-semantic alignment constraints used by MULE  that we also employ. First, we use neighborhood constraints , which we refer to as $\mathcal{L}_{nc}$, to encourage similar sentences to embed nearby each other using a triplet loss (i.e., Eq. (1)). Just as with the MCLM module described in Section 3.2, these neighborhood constraints are applied both to the universal language embedding (i.e., the output of the HEM module) and to the final language representation in the multimodal model, as shown in Fig. 2. The second component of the MULE alignment constraint is an adversarial language classifier, whose loss we refer to as $\mathcal{L}_{adv}$; its goal is to ensure that the representations of the different languages in the universal embedding have similar feature distributions. The last component of the MULE constraint trains the multimodal model to embed images and sentences near each other using a bidirectional triplet loss: for an image $x$ with sentences $(y^{+}, y^{-})$ representing a positive and negative sentence pair, respectively, and a sentence $y$ with its paired images $(x^{+}, x^{-})$, this multimodal loss is

$$\mathcal{L}_{mm} = L(x, y^{+}, y^{-}) + L(y, x^{+}, x^{-}) \quad (3)$$

where $L$ is the triplet loss of Eq. (1) and its margin $m$ is a scalar parameter, which we set to 1.5 in our experiments. In addition to using the unmasked sentence representations for the multimodal loss, we also observe that most sentences tend to retain most of their overall semantic meaning if you remove just a few words at random. Using this intuition, we also compute Eq. (3) using the masked sentence representations from the MCLM module in addition to the unmasked sentences, which we found provides a small but consistent improvement to performance. As a reminder, all triplet losses use the implementation details (e.g. hyperparameter settings and hard-negative mining) described in the first part of Section 3. Our total loss function to train SMALR is then

$$\mathcal{L}_{SMALR} = \mathcal{L}_{mm} + \lambda_{1}\mathcal{L}_{nc} + \lambda_{2}\mathcal{L}_{adv} + \lambda_{3}\mathcal{L}_{MCLM} \quad (4)$$

where $\lambda_{1}, \lambda_{2}, \lambda_{3}$ are scalar parameters that we set to (1e-4, 1e-6, 5e-2), respectively.
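As a sketch, the total objective is a simple weighted sum of the four losses; the pairing of each weight with a particular loss term below is an assumption of this sketch:

```python
def smalr_total_loss(l_mm, l_nc, l_adv, l_mclm,
                     lambdas=(1e-4, 1e-6, 5e-2)):
    # Weighted sum of the SMALR objectives: the multimodal matching loss plus
    # weighted neighborhood, adversarial, and MCLM alignment losses.
    # Which lambda scales which loss is assumed, not stated in the text.
    l1, l2, l3 = lambdas
    return l_mm + l1 * l_nc + l2 * l_adv + l3 * l_mclm
```

The small weights keep the alignment terms from dominating the image-sentence matching loss that drives the downstream retrieval task.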
3.4 Cross-Lingual Consistency
Prior work on multilingual vision-language tasks has primarily focused on how to change training procedures or architectures in order to support multiple languages, and do not fully take advantage of this multilingual support at test time. In particular, we argue that there are cases in which the same sentence in different languages may capture complementary information, and that considering the predictions made in other languages may help improve performance. We validate our intuition by obtaining machine translations of a query in the other languages supported by our model. More formally, suppose we have a set of languages . Given a query in language , we translate the query to all other supported languages in and use this as input into our Cross-Lingual Consistency (CLC) module.
We propose two variants of CLC: CLC-A and CLC-C. CLC-A simply averages matching scores over all languages and does not require any additional parameters. CLC-C, on the other hand, uses a small Multilayer Perceptron (MLP) to aggregate the scores of each language, which enables us to consider the relative information present in each language’s predictions. This MLP has two layers with an input size of ten (one matching score per language) and 32 hidden units (i.e., it has 352 learnable parameters), and all parameters are initialized with uniform weights. We train the CLC-C module separately from SMALR using the validation set for 30 iterations. No minibatches are employed (i.e., it is trained with all image-sentence pairs at once) and it is trained using the multimodal triplet loss described in Eq. (3).
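A minimal sketch of both CLC variants. The bias-free linear layers (10×32 + 32×1 = 352 parameters) and the ReLU nonlinearity are assumptions made to match the stated parameter count, not details confirmed by the text:

```python
import numpy as np

N_LANG = 10  # one matching score per supported language

# Two bias-free linear layers (10 -> 32 -> 1): 10*32 + 32*1 = 352 parameters.
# Uniform initialization, as described in the text.
w1 = np.full((N_LANG, 32), 1.0 / N_LANG)
w2 = np.full(32, 1.0 / 32)

def clc_c(scores):
    # CLC-C: learned aggregation of per-language image-sentence matching scores
    hidden = np.maximum(scores @ w1, 0.0)  # ReLU assumed
    return float(hidden @ w2)

def clc_a(scores):
    # CLC-A: parameter-free average over languages
    return float(np.mean(scores))
```

At test time a query is machine-translated into the other nine languages, each translation is scored against the image, and the ten scores are aggregated by one of these two functions.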
4 Experimental Setup
Datasets. SMALR is evaluated on bidirectional retrieval with image-sentence pairs from Multi30K [5, 12, 13] and MSCOCO [25, 26, 29]. The Multi30K dataset is built off of Flickr30K , which originally contained 31,783 images and five English descriptions per image. [5, 12, 13] obtained annotations in German, French, and Czech, resulting in a four-language multilingual dataset. Multi30K contains five descriptions per image in English and German, but only one description per image in French and Czech; the latter two were collected as human-generated translations of the English annotations. We use the 29K/1K/1K train, test, val splits as given with the original dataset .
MSCOCO is approximately four times the size of Multi30K, with 123,287 total images. There are five human-generated captions per image in English, but significantly fewer in Chinese and Japanese. YJ Captions  introduced new Japanese annotations for MSCOCO, but only provides five captions per image for a subset of approximately 26K images. MSCOCO was further extended with a total of 22,218 Chinese captions for 20,341 images . We use the train/test/validation splits as defined in .
As mentioned in the Introduction, we augment both datasets using machine translations so every image contains at least five sentences for ten languages: Afrikaans, Arabic, English, German, Czech, French, Russian, Chinese, Japanese, and Korean. All models we compare to are trained using this augmented training set. For languages that have no human-generated sentences, we use machine translated sentences at test time as well. While using translations at test time results in a noisy evaluation, we found it did not affect the relative performance of different methods in our experiments. See the supplementary for details.
Visual Features. We use ResNet-152 
features trained on ImageNet as input to the Embedding Network (EmbN) , our image-sentence retrieval model. As done in , we average visual features over ten 448x448 crops of an image. This results in an image embedding of size 2048, which is then passed through a pair of fully connected layers, ultimately resulting in a 512-D image embedding that can be used in the shared image-sentence embedding space. The learning rate was tuned separately for the HEM and LA models, with the remaining hyperparameters consistent with those used by MULE .
Note that all LIWE  experiments use bottom-up Faster R-CNN  visual features, which are trained on Visual Genome . This represents a very significant increase in the annotation cost compared with our approach, which doesn’t use these annotations. In addition, Visual Genome contains MSCOCO  images, which means that there is train/test contamination as LIWE’s features are extracted using the pretrained, publicly available model from . Thus, some test images were used to train the image representation used by LIWE.
Metrics. We evaluate on image-sentence retrieval, and report Recall@K with K ∈ {1, 5, 10} for both the image-to-sentence and sentence-to-image directions of the task. For our results, we report the mean Recall (mR) across these six values per language. All Recall@K values can be found in the supplementary material. We also provide an additional average, “A,” in Tables 1 and 2, which averages the mR across all languages to serve as a global performance metric. The human average, “HA,” refers to the average mR over the languages which have human-generated annotations (i.e. English, Chinese, and Japanese for MSCOCO, and English, German, French, and Czech for Multi30K).
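Concretely, mean Recall can be computed from the rank of the first correct match for each query; `mean_recall` is a hypothetical helper, not code from the paper:

```python
def mean_recall(ranks_i2s, ranks_s2i, ks=(1, 5, 10)):
    # ranks_*: for each query, the 1-indexed rank of the first correct match,
    # for image-to-sentence and sentence-to-image retrieval respectively
    def recall_at(ranks, k):
        return 100.0 * sum(r <= k for r in ranks) / len(ranks)
    vals = [recall_at(ranks_i2s, k) for k in ks] + \
           [recall_at(ranks_s2i, k) for k in ks]
    return sum(vals) / len(vals)  # mR: mean over the six Recall@K values
```

Averaging the per-language mR values then gives the “A” column, and averaging over only the human-annotated languages gives “HA.”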
Comparative Evaluation. We compare the following methods:
Frequency Thresholding: We drop words that occur fewer than t times in the training set. Results are reported in Figure 3.
Dictionary Mapping: We map words that occur fewer than t times in non-English languages to English tokens using dictionaries . By mapping rare words in other languages to English, some information may be lost, but the token will still indirectly exist in the vocabulary. However, we expect this method to be insufficient for a larger multilingual setting, where languages have greater linguistic differences from English, like Arabic and Chinese, as mapping to English may not retain enough language-specific information. Results are reported in Figure 3.
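Both baselines can be sketched in a few lines; `reduce_vocab` and its arguments are hypothetical names:

```python
from collections import Counter

def reduce_vocab(corpus, t=5, dictionary=None):
    # Baseline vocabulary reduction: tokens occurring fewer than t times in
    # the training corpus are dropped (frequency thresholding), or remapped
    # to English via `dictionary` when one is supplied (dictionary mapping).
    counts = Counter(tok for sent in corpus for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= t}

    def map_token(tok):
        if tok in vocab:
            return tok
        if dictionary and tok in dictionary:
            return dictionary[tok]  # dictionary-mapping baseline
        return None                 # frequency-thresholding baseline: drop

    return [[m for tok in sent if (m := map_token(tok)) is not None]
            for sent in corpus]
```

Either way, the reduced vocabulary shrinks the word-embedding table, which is the dominant source of trainable parameters in word-level multilingual models.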
We also note that the first line of Tables 1 and 2, Trans To En, refers to using machine translation on non-English sentences to convert them to English, and then using an English-only trained Embedding Network , providing a strong baseline method to compare to.
Table 1: Image-sentence retrieval results (mean Recall, mR) on MSCOCO.

| Model | En | De¹ | Fr¹ | Cs¹ | Cn | Ja | Ar¹ | Af¹ | Ko¹ | Ru¹ | HA | A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Trans. to En | 75.6 | – | – | – | 72.2 | 66.1 | – | – | – | – | 71.3 | – |
| PAR. EmbN | 78.3 | – | – | – | 73.5 | 76.0 | – | – | – | – | 75.9 | – |
| (1) S-LIWE² | 80.9 | – | – | – | – | 73.6 | – | – | – | – | – | – |
| (2) S-LIWE² | 77.4 | – | – | – | – | 66.6 | – | – | – | – | – | – |
| (10) S-LIWE² | 77.3 | 67.4 | 68.5 | 66.9 | 64.5 | 65.8 | 63.8 | 66.2 | 63.1 | 63.6 | 69.2 | 66.7 |
| (10) L-LIWE² | 79.1 | 71.2 | 70.3 | 70.1 | 70.0 | 69.6 | 67.5 | 68.9 | 66.2 | 69.6 | 72.9 | 70.3 |

¹ uses translations from English for testing.
² visual features trained using an outside dataset that includes some test images.
Table 2: Image-sentence retrieval results (mean Recall, mR) on Multi30K.

| Model | En | De | Fr | Cs | Cn¹ | Ja¹ | Ar¹ | Af¹ | Ko¹ | Ru¹ | HA | A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Trans. to En | 71.1 | 48.5 | 46.7 | 46.9 | – | – | – | – | – | – | 53.3 | – |
| PAR. EmbN | 69.0 | 62.6 | 60.6 | 54.1 | – | – | – | – | – | – | 61.6 | – |
| (1) S-LIWE² | 76.3 | 72.1 | – | – | – | – | – | – | – | – | – | – |
| (2) S-LIWE² | 75.6 | 66.1 | – | – | – | – | – | – | – | – | – | – |
| (10) S-LIWE² | 75.2 | 65.1 | 50.6 | 53.9 | 53.9 | 56.0 | 61.3 | 62.3 | 55.1 | 64.2 | 61.2 | 59.8 |
| (10) L-LIWE² | 75.1 | 65.0 | 51.1 | 54.7 | 55.8 | 55.3 | 64.2 | 62.7 | 63.8 | 54.4 | 61.5 | 60.2 |

¹ uses translations from English for testing.
² visual features trained using an outside dataset.
5 Multilingual Image-Sentence Retrieval Results
We provide results for MSCOCO and Multi30K in Table 1 and Table 2, respectively, which contain comparisons to prior work on fewer languages (a), adaptations of prior work to our setting (b), and our model variants (c). SMALR obtains consistent performance gains when evaluating on ten languages over the state-of-the-art (S-LIWE, line 3(b)) while also being more efficient than high-performing models like MULE (line 4(b)). SMALR outperforms S-LIWE by 11 points on MSCOCO and 5.6 points on Multi30K (line 3(c) versus 3(b)). A parameter comparison is later shown in Figure 3. SMALR’s initial Language-Agnostic (LA) baseline alone is able to boost performance over previous scalable method LIWE by 2-7 points. The HEM, which combines language-agnostic and language-specific embeddings as described in Section 3.1, consistently improves upon the fully language-agnostic vocabulary, even though they share the same vocabulary size of 40K tokens. This points to the utility of our hybrid embedding space, which improves performance upon LA by 3.4 average mR on MSCOCO and 2.4 average mR on Multi30K while adding only a few parameters.
When MCLM losses are added, referred to as SMALR in Tables 1 and 2 (line 3(c)), mR improves for nearly all languages. This is significant, because we find more compact models like LIWE degrade with additional languages when using the same number of parameters (S-LIWE). The LA baseline is still able to outperform L-LIWE on MSCOCO and Multi30K, for which LIWE learns a five-fold larger embedding (120-D instead of 24-D) to try to compensate for the increased number and diversity of languages. This suggests that the masking process may help regain some semantic information that is lost when tokens are mapped to a language-agnostic space.
We next evaluate two CLC variants that use machine translations at test time (described in Section 3.4) on top of SMALR: an average ensemble over all languages (CLC-A), and a weighted ensemble which makes use of a simple classifier (CLC-C). CLC-A uses no additional test-time parameters, and increases the human average performance by 1-3 points, with a larger gain on Multi30K. This may be because more languages can be leveraged on Multi30K (four versus three, compared to MSCOCO). Surprisingly, English performance improves the most amongst CLC-A metrics on Multi30K, demonstrating that certain image-sentence pairs can be better retrieved from the queries in other languages, which may better capture the visual semantics of the same image. CLC-C further improves the human average over CLC-A by 0.9 points on MSCOCO and 0.5 points on Multi30K, using negligible additional parameters.
Parameter reduction method comparison. We present a side-by-side comparison of the baseline vocabulary reduction techniques described in Section 4 against prior works LIWE and MULE, and against SMALR (consisting of only the HEM and MCLM components in Figure 3). The frequency thresholding and dictionary mapping labels represent the threshold at which we drop infrequent words or map them to English (e.g. the blue 50 data point represents dropping words that occur fewer than 50 times). PCA point labels represent the dimensionality we reduce our 300-D input vectors to (e.g. 50-D, 100-D, or 200-D).
In our comparison of vocabulary reduction methods, we find that both frequency thresholding and vanilla language-agnostic vocabularies (LA) obtain better performance than both LIWE variants on Multi30K without adding significantly more parameters, as shown on the right side of Figure 3. While more model parameters are needed for MSCOCO, due to the increased vocabulary size, all simple baselines, as well as prior work MULE, significantly outperform LIWE. This demonstrates that more complex character-based models do not necessarily obtain competitive performance with few parameters when addressing a larger multilingual scenario.
SMALR outperforms all baselines for MSCOCO, as seen on the left of Figure 3, beating S-LIWE by over 10 points while using fewer parameters than L-LIWE. We also find that average mean recall on MSCOCO is more robust to vocabulary reduction, with a range of only about 1.5 average mR between the most and least extreme reductions. We believe this may be due to the size discrepancy between the two datasets, as MSCOCO is approximately four times the size of Multi30K. PCA reduction has a more linear effect as parameters increase on both datasets. Since Multi30K performance is more sensitive to the number of parameters, it is notable that our SMALR model, shown in green (which does not yet make use of our cross-lingual consistency module in Figure 3), outperforms all other models while using fewer than 20M parameters, 1/5th the parameter count of the high-performing MULE.
In addition to SMALR outperforming MULE on both datasets while using significantly fewer trainable parameters, we find that MULE even fails to outperform simple baselines such as dictionary mapping on MSCOCO. This shows that the large number of parameters used for MULE's word-level embeddings is unnecessary for its performance gains. While SMALR uses more parameters during training than S-LIWE, it requires far fewer at test time: we reduce the computation needed for evaluation by precomputing the language representations learned during training. This effectively reduces the full SMALR model to the image-sentence matching model plus our CLC add-on, totaling only 7.1M parameters, fewer than S-LIWE.
In this paper, we have presented a Scalable Multilingual Aligned Language Representation (SMALR) which addresses the trade-off between multilingual model size and downstream vision-language task performance. Our approach is modular, and thus can be used as a drop-in language representation for any vision-language method or task. SMALR outperforms all prior work on multilingual image-sentence retrieval on average across ten diverse languages through the use of a hybrid embedding model, a masked cross-language modeling loss, and a cross-lingual consistency module. Our hybrid embedding model significantly reduces the input to a language model by mapping most tokens to a fixed-size, shared vocabulary. The novel masking procedure aligns our diverse set of languages and leverages the multimodal model to provide additional alignment by visually grounding our language representations. We find that both cross-lingual consistency variants better aggregate retrieved results, boosting performance with minimal additional parameters. This is all accomplished with fewer than 20M trainable parameters, 1/5th the size of oversized prior work, while improving over the state of the art by 3-4%.
-  (2019) Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §2.
-  (2018) Bottom-up and top-down attention for image captioning and visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
-  (2015) VQA: Visual Question Answering. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
-  (2016) Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Empirical Methods in Natural Language Processing (EMNLP), pp. 2289–2294. Cited by: §1.
-  (2018) Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pp. 304–323. Cited by: §1, §4, §7.1, Table 4.
-  (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL) 5, pp. 135–146. External Links: Cited by: §1.
-  (2019) Language features matter: effective language representations for vision-language tasks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §3.1, §3.1.
-  (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
-  (2018) Word translation without parallel data. In International Conference on Learning Representations (ICLR), Cited by: §1, §1, §3.1, 3rd item.
-  (2009) ImageNet: A Large-Scale Hierarchical Image Database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In arXiv:1810.04805v1, Cited by: §1, §1, §1, §2, §3.2.
-  (2017) Findings of the second shared task on multimodal machine translation and multilingual image description. arXiv:1710.07177. Cited by: §1, §4, §7.1, Table 4.
-  (2016) Multi30k: multilingual english-german image descriptions. arXiv:1605.00459. Cited by: §1, §3.2, §4, §7.1, Table 4.
-  (2017) Image pivoting for learning multilingual multimodal representations. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §1, §2, Table 1, Table 2.
-  (2017) Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2018) Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), Cited by: §3.1, §3.1, §3.1.
-  (2019) ViCo: word embeddings from visual co-occurrences. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
-  (2015) Deep residual learning for image recognition. arXiv:1512.03385. Cited by: §4.
-  (2019) Cross-lingual ability of multilingual bert: an empirical study. arXiv:1912.07840. Cited by: §2.
-  (2014) ReferItGame: referring to objects in photographs of natural scenes. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1.
-  (2020) MULE: multimodal universal language embedding. In AAAI Conference on Artificial Intelligence, Cited by: Figure 1, §1, §1, §1, §2, §3.1, §3.1, §3.3, §3, §3, Table 1, Table 2, §4, §4, §7.5, Table 17.
-  (2015) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV). Cited by: §4.
-  (2019) VisualBERT: a simple and performant baseline for vision and language. arXiv:1908.03557. Cited by: §1, §2, §3.1.
-  (2019) COCO-cn for cross-lingual image tagging, captioning and retrieval. IEEE Transactions on Multimedia. Cited by: §1, §3.2, §4, §4, §7.1, Table 3.
-  (2014) Microsoft coco: common objects in context. In The European Conference on Computer Vision (ECCV), Cited by: §1, §1, §4, §4, §7.1, Table 3.
-  (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265. Cited by: §1, §2.
-  (1993) Principal components analysis (pca). Computers and Geosciences 19 (3), pp. 303 – 342. External Links: Cited by: 2nd item.
-  (2016) Cross-lingual image caption generation. In Conference of the Association for Computational Linguistics (ACL), Cited by: §1, §3.2, §4, §4, §7.1, Table 3.
-  (2019) Multi-task learning of hierarchical vision-language representation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019) How multilingual is multilingual bert?. arXiv:1906.01502. Cited by: §2.
-  (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §4.
-  (2017) Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv:1702.03859. Cited by: §1.
-  (2019) VL-bert: pre-training of generic visual-linguistic representations. arXiv:1908.08530. Cited by: §1, §2.
-  (2019) LXMERT: learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §2.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §2.
-  (2019) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (2), pp. 394–407. Cited by: §3.1, §3.3, §3, Table 1, Table 2, §4, §4.
-  (2019) Language-agnostic visual-semantic embeddings. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Figure 1, §1, §1, §2, Table 1, Table 2, §4, Table 17.
-  (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of bert. arXiv:1904.09077. Cited by: §2.
-  (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL) 2, pp. 67–78. Cited by: §1, §4, Table 4.
-  (2019) From recognition to cognition: visual commonsense reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
7 Supplementary Material
7.1 Data Augmentation
We augment the multilingual datasets MSCOCO [25, 26, 29] and Multi30K [5, 12, 13] by translating the languages with human-generated annotations into other languages using Google Translate. Tables 3 and 4 show which translations were performed for MSCOCO and Multi30K, respectively. The column X refers to all remaining languages, which consist entirely of translations, used to create the total set of ten languages; i.e., for MSCOCO: German, French, Czech, Arabic, Afrikaans, Korean, and Russian; for Multi30K: Chinese, Japanese, Arabic, Afrikaans, Korean, and Russian. We compare the effect of using human-generated vs. machine-translated sentences at test time in Section 7.6.
|Human Generated||MSCOCO||COCO-CN||YJ Captions||–|
|Translations||Cn→En||En→Cn||En→Ja||En→X|
|Human Generated||Flickr30K||Multi30K||Multi30K||Multi30K||–|
|Translations||De→En||En→De||En→Fr||En→Cs||En→X|
7.2 Exploration Parameters
One component of SMALR is the Hybrid Embedding Model (HEM), which makes use of both language-specific and language-agnostic representations. The Language-Agnostic (LA) baseline refers to using only the shared latent vocabulary, which consists of 40K tokens. We found experimentally that using exploration parameters improves downstream performance when using the latent vocabulary. These exploration parameters force the model to randomly select from a set of similar tokens during training rather than always choosing the best-matched token in the language-agnostic vocabulary (described in Section 3.1 of the main paper). Tables 5 and 6 demonstrate the difference in mean recall for image-sentence retrieval with and without our exploration parameters.
Since we find that using the exploration parameters when learning the mapping to the latent vocabulary improves performance, we use them for both the language-agnostic and HEM results (and thus they are included in the final SMALR training paradigm).
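The exploration mechanism might look like the following sketch, in which a word embedding is mapped to the shared latent vocabulary by cosine similarity; `eps` and `k` are our stand-ins for the exploration parameters, whose exact form is not given in this excerpt:

```python
import numpy as np

def map_to_latent(word_vec, latent_vocab, eps=0.1, k=5, rng=None):
    """Map a word embedding to a token in the shared latent vocabulary.
    With probability 1-eps pick the nearest latent token (by cosine
    similarity); with probability eps sample uniformly among the k
    nearest. `eps` and `k` are hypothetical exploration parameters."""
    rng = rng or np.random.default_rng()
    sims = latent_vocab @ word_vec / (
        np.linalg.norm(latent_vocab, axis=1) * np.linalg.norm(word_vec))
    top_k = np.argsort(-sims)[:k]
    if rng.random() < eps:
        return int(rng.choice(top_k))   # explore among similar tokens
    return int(top_k[0])                # exploit the best match
```

Sampling among near-neighbors during training prevents the mapping from collapsing onto a few latent tokens, which is consistent with the performance gains reported in Tables 5 and 6.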
|Model||En||De*||Fr*||Cs*||Cn||Ja||Ar*||Af*||Ko*||Ru*||HA||A|
|LA + Explore||65.5||61.3||59.9||54.0||59.4||64.7||63.9||66.5||60.3||60.3||60.2||61.6|
* uses translations from English for testing
|Model||En||De||Fr||Cs||Cn*||Ja*||Ar*||Af*||Ko*||Ru*||HA||A|
|LA + Explore||75.0||74.3||74.1||73.4||72.3||72.1||74.4||74.7||71.6||72.7||73.1||73.5|
* uses translations from English for testing
7.3 Qualitative Results
We provide two examples each for MSCOCO and Multi30K which show the effect of the Cross-Lingual Consistency (CLC) module used with SMALR. We report results for the CLC-C variant, which makes use of a simple MLP classifier to aggregate scores across languages. For a given human-generated text query, we translate it into all other languages and use the predictions from these translations as input to our CLC-C module.
On the left-hand side of Figure 4, the original text query is in English and its matching image is incorrectly retrieved, as shown by the red bounding box. However, when CLC-C is used, SMALR is able to correctly retrieve the matching image, as a subset of the translated sentences do retrieve the ground-truth image (e.g., the German translation). On the right-hand side of Figure 4, we see the same benefit for an original text query in German, which is aided by English translations. These two examples demonstrate the benefit of CLC-C for R@1, as CLC-C now correctly retrieves the ground-truth image. Additionally, these samples show that not every language has to make the correct prediction; the CLC-C module can learn to combine predictions to improve performance. As we can see in Figure 4, the images incorrectly retrieved for the original English and German queries "People are walking through a vegetable stall filled market" and "Der Mann trägt eine orange Wollmütze" ("The man is wearing an orange wool cap") contain very similar objects and colors to their respective ground-truth images, but these errors are remedied when considering all languages.
In Figure 5, there are two examples for MSCOCO, with original text queries in English and Chinese. Both examples have many translated queries which are able to correctly retrieve the ground truth image, such as French and Russian for English, and English, German, and French (among others) for Chinese. We see again that the original incorrectly retrieved image contains very similar visual semantics (e.g. teddy bear for English, baseball field for Chinese) to the ground truth, and the translated sentences help disambiguate subtle details.
7.4 Masked Cross-Language Modeling Example
SMALR’s Masked Cross-Language Model (MCLM) uses two language representations to compute its total loss: an average representation and a sentence-level representation. The average masked representation simply removes the masked word and averages the word embeddings of the resulting one-word-shorter sentence before predicting the masked token. The masked sentence-level representation retains the original sentence length by replacing the masked word with a special [MASK] token, preserving both the total word count and the grammatical structure of the query; this sequence is passed through an LSTM and a fully connected layer before being used to predict the masked token. Figure 6 provides an example of this process. See Section 3.2 of the main paper for a description of how these representations are used.
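The two representations can be sketched as follows; the subsequent LSTM and fully connected layer are omitted, and the array shapes and function names are ours:

```python
import numpy as np

def average_masked_repr(word_embs, mask_idx):
    """Average representation: drop the masked word and mean-pool the
    remaining embeddings of the one-word-shorter sentence.
    `word_embs` has shape (sentence_len, embed_dim)."""
    kept = np.delete(word_embs, mask_idx, axis=0)
    return kept.mean(axis=0)

def sentence_masked_input(word_embs, mask_idx, mask_emb):
    """Sentence-level input: replace the masked word with a [MASK]
    embedding, preserving sentence length and word order. The paper
    then feeds this sequence through an LSTM + FC layer (omitted)."""
    out = word_embs.copy()
    out[mask_idx] = mask_emb
    return out
```

The average path is order-agnostic and cheap, while the sentence-level path keeps positional and grammatical information for the recurrent model; MCLM combines losses from both predictions.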
7.5 Extended Image-Sentence Retrieval Results
We provide all recall values (Recall@K for K ∈ {1, 5, 10}) for all ten languages on image-sentence retrieval with MSCOCO and Multi30K. I-to-S signifies the image-to-sentence retrieval direction, and S-to-I the sentence-to-image direction. We shorten "Language-Agnostic" to "LA" and CLC-A, CLC-C to A and C, respectively, due to space constraints. Lastly, "Prior" refers to prior work, "Adapted" refers to prior work that has been adapted to our testing scenario using the authors' publicly available code, and "Ours" refers to our SMALR model variants. The number preceding a model refers to the number of languages it was trained on, e.g., (3-4) MULE signifies MULE trained on three languages (English, Chinese, Japanese) on MSCOCO, and four on Multi30K (English, German, French, Czech).
[Extended retrieval tables: only the "Trans. to En" baseline rows (Recall@1/5/10 for I-to-S and S-to-I, plus mean recall, per language) survived extraction; the table headers and remaining rows are omitted.]
7.6 Testing with Machine Translations
In this section we investigate the effect that testing with machine translations rather than human-generated sentences has on method comparisons. For all methods we use models trained on all ten languages, and test on human-generated and translated sentences for Chinese and Japanese on MSCOCO and German on Multi30K.
As seen below, there are only minor differences in the performance of each language we tested. Notably, the performance rankings within each dataset are consistent regardless of whether the method is evaluated on human-generated test sentences or test sentences translated from English.
|(a)||Human generated test sentences|
|(10) S-LIWE||64.5||65.8||65.2||6||65.1||1|
|(b)||Test sentences translated from En|
|(10) S-LIWE*||64.7||65.8||65.2||6||65.5||1|
|(10) L-LIWE*||70.0||69.6||69.8||5||64.6||3|