
ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition

by   Afra Alishahi, et al.

We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.



1 Introduction

This document introduces the visually-grounded spoken language modeling track of the ZeroSpeech 2021 challenge. In this track, participants are asked to use audiovisual materials during training. Evaluation is identical to the speech-only track at ZeroSpeech 2021.

Learning to comprehend and produce spoken languages is one of the hallmarks of human cognition, and the importance of speech communication also makes speech-based capabilities central to AI development. Modern automatic speech recognition (ASR) systems largely rely on supervised training, where input speech is paired with corresponding phonetic annotations or text transcripts. While this approach has produced great results in high-resource languages such as English, deployment of similar systems for low-resource environments such as small language communities or even unwritten languages is difficult. In addition, a mismatch between conversational speech and large-scale text data still exists even in high-resource languages.

In contrast to ASR, human children achieve their language skills without direct supervision or detailed feedback, simply interacting with their physical and linguistic environments. Moreover, these experiences are essentially multimodal: children not only hear speech of their caregivers, but concurrently observe the world through a number of senses. Instead of perceiving speech and the corresponding text, they learn from speech in the context of different everyday multimodal communicative scenarios. Compared with audio data only, audiovisual data contain more statistical regularities at multiple levels: there are regularities within audio and visual data respectively; and there are also cross-modal regularities at both the acoustic and semantic levels. For example, the presence of a dog in a visual scene is likely to co-occur with both barking sounds at the acoustic level and the spoken word “dog” (and other words semantically related to dog) at the semantic level. The presence of a dog can be used as a supervisory signal to train language learning systems to parse and segment words that consistently occur in the same context, and semantically group words that occur in similar visual contexts. In order to develop AI systems or models of human learning with similar multimodal language learning skills, a number of models and learning algorithms have been proposed throughout the years (e.g., [roy2002learning, Yu_Ballard_2004, tenbosch_2009, Driesen_2011, Rasanen_Rasilo_2015, Mangin_2015]). These systems have used various types of speech data with simulated or robot vision-based visual input.

However, only the recent advances in deep learning have scaled up the capabilities of audiovisual systems to a level where they can start to capture the relationships between realistic visual data (e.g., photographs or videos) and language related to the visual scene. These models were first developed for captions describing images [socher2014grounded, karpathy2015deep] and more recently for spoken descriptions of images (e.g., [harwath2015deep, harwath2016unsupervised, synnaeve2014learning, chrupala2017representations, kamper2019semantic]). In this context, it is of great interest how the representations emerging from the training of such multimodal models relate to the known linguistic structure of the input language (e.g., [chrupala2017representations, alishahi2017encoding, harwath2019learning, havard2019models, havard2019word]), and how such methods can support (or replace) purely audio-based representation learning approaches (e.g., [Chung2019a, oord2018cpc]). In other words, it would be highly useful if unsupervised learning from multimodal data could be used to acquire language representations such as phone(me)s and words without access to transcribed training data in the given language, since such units can then serve as a basis for many other language processing tasks. However, research in this direction is still young and largely driven by a few research groups. In addition, there are no standardized evaluation metrics or common benchmarks to compare different methodological approaches and thereby drive research in this area forward.

The goal of the ZR-2021VG track is to tackle the issue of multimodal language learning. In contrast to the earlier ZeroSpeech challenges [versteegh2015zero, dunbar2017zero, dunbar2019zero, nguyen2020zero] that have focused purely on audio-based learning of linguistic representations (including the speech-based track of the current challenge, on which we build), ZR-2021VG takes a step towards multimodal language learning by asking participants to train on audiovisual data. The aim of the challenge is to learn phonemic, lexical, syntactic, and semantic representations of speech with the help of supporting visual information, as evaluated by standardized evaluation protocols. As a result, the challenge aims to bring together researchers from speech technology, natural language processing, computer vision, and machine learning to work on multimodal language learning, thereby advancing the state of the art in audiovisual learning algorithms and providing new knowledge on how visual information may support unsupervised learning of linguistic patterns from speech.

2 Registration

Participation is open and free. Registration is done via the ZR2021 website. The challenge will be submitted as such to several conferences, and participants are encouraged to submit a paper to them. The first conference is NeurIPS; more information on participation is available from

For any issues or questions, we recommend to registrants that they join our mailing list:

  1. Make sure you’re logged into google (you may need to get your email account registered with them here:

  2. Ask to join.

A moderator will approve your request within a week.

3 Task

3.1 Task definition

The challenge is to learn spoken language representations from raw audio-visual input, without any annotation or written transcriptions. Systems are allowed to use the raw audio and image/video (or features extracted from them, such as MFCCs for the audio) of the training set(s) as input. The goal is to use the raw multimodal input data to discover discrete linguistic units at phonetic and word levels, which will be evaluated via a set of black-box, zero-shot metrics probing for the quality of the learned models at different linguistic levels, including phonetics, lexicon, syntax and semantics. See Section 3.3 for more information.

Any approach that exploits the correspondence between the two modalities without relying on supervision based on linguistic annotations of the speech or visual signal is allowed (see Section 6 for details). A range of objectives could be used for this purpose, including (but not limited to):

  • generating the visual representations from the audio or vice versa;

  • predicting the next audio frame with the visual input as an additional signal;

  • training language models on pseudo-text that has been generated with the help of the visual input;

  • inducing linguistic and visual representations that are geometrically similar in a shared semantic space.
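As an illustration of the last objective in the list above, the following is a minimal sketch of a margin-based ranking loss that pulls matched speech and image embeddings together in a shared space. It is not the loss prescribed by the challenge: the function names, the margin value of 0.2, and the use of cosine similarity are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))

def margin_ranking_loss(speech_embs, image_embs, margin=0.2):
    """Hinge loss ranking matched (speech, image) pairs above mismatched ones.

    speech_embs[i] and image_embs[i] are assumed to describe the same scene;
    every other pairing in the batch is treated as a mismatch.
    """
    n = len(speech_embs)
    loss = 0.0
    for i in range(n):
        pos = cosine(speech_embs[i], image_embs[i])
        for j in range(n):
            if i == j:
                continue
            # Mismatched image for utterance i, then mismatched utterance for image i.
            loss += max(0.0, margin + cosine(speech_embs[i], image_embs[j]) - pos)
            loss += max(0.0, margin + cosine(speech_embs[j], image_embs[i]) - pos)
    return loss

# Toy batch: two utterances and two images with 2-dimensional embeddings.
aligned = margin_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = margin_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is zero when each utterance is closest to its own image and positive when the pairing is swapped, which is the behaviour such an objective is meant to reward.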

In order to take into account the computing resources of participants, we distinguish categories of submissions in two tracks based on the type of model and resources employed (see Section 3.2). Additionally, we may highlight submissions that explore innovative architectures, successfully use particularly small amounts of training data, or have other unusual features which contribute to fostering new ideas and gaining collective insight into this interesting problem.

3.2 Model conditions

There are no restrictions or conditions on the models: if they are unsupervised, they are eligible.

In order to take into account the computing resources of participants, we ask participants to report their models’ resource budget. As in ZR2021, the budget is calculated as number of hours × number of GPUs. Additionally, if relevant, report Akaike’s Information Criterion and Bayesian Information Criterion; and/or number of parameters and number of pre-trained parameters. (If you don’t know what that means, it probably is not relevant.)
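The budget arithmetic can be sketched as follows. The helper names are hypothetical, and the 100 GPU-hour cut-off is taken from the approximate limit given for low-budget submissions in Section 4; the per-stage breakdown in the example is invented and only the 72-hour total corresponds to the low-budget baseline.

```python
def gpu_hours(training_hours, num_gpus):
    """Budget of one training run: wall-clock hours multiplied by the number of GPUs."""
    return training_hours * num_gpus

def total_budget(runs):
    """Sum the budget over every stage of a pipeline, given (hours, gpus) pairs."""
    return sum(gpu_hours(hours, gpus) for hours, gpus in runs)

def budget_track(total_gpu_hours, low_budget_cap=100.0):
    """Map a budget to a track name; the cap is the approximate limit from Section 4."""
    return "low" if total_gpu_hours <= low_budget_cap else "high"

# Hypothetical pipeline: 12 h on 4 GPUs for one stage, then 24 h on 1 GPU for another.
example_total = total_budget([(12, 4), (24, 1)])
example_track = budget_track(example_total)
```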

We also ask participants to report how much data (in number of speech hours, number of images, hours of video) they used for training their submitted models.

One overall restriction on models comes from the evaluation, which we want to align with the audio-only track of the Zerospeech-2021 challenge. The evaluation procedure focuses on linguistic representations (ignoring the visual representation), and will do so using unimodal audio data. This means that the models submitted to the challenge should be able to process audio data independently of visual data.

3.3 Evaluation conditions

We will now briefly describe the metrics and refer the reader to [nguyen2020zero] for more details.

Each metric evaluates models at a different linguistic level:

  • Phonetic (ABX metric). The ABX metric assesses the degree of discriminability between two categories $A$ and $B$ given a representation, where in our case $A$ and $B$ correspond to minimal pairs of triphones such as /beg/ and /bag/ (differing only in their middle phoneme). Given a triple of stimuli $(a, b, x)$ where $a$ and $x$ belong to the same category $A$ and $b$ to a different category $B$, and given a distance $d$ between pairs of stimuli, discrimination is considered successful if $d(a, x) < d(b, x)$. The ABX score is obtained by aggregating scores across all minimal triphone pairs.

  • Lexical (spot-the-word metric). This metric evaluates the possibility of discriminating between real words and nonwords by associating a probability to each of them (with the expectation that a higher probability is associated with the real word). The score is computed as the average discrimination accuracy across all pairs of words and nonwords.

  • Syntactic (acceptability metric). Similarly to the spot-the-word metric, the acceptability metric computes a discrimination accuracy between grammatical and ungrammatical categories, based on a probability associated to each of them. The pairs of sentences we use are representative of 68 different syntactic paradigms belonging to twelve broad categories.

  • Semantic (similarity metric). This metric compares the similarity between the representations of pairs of words to human similarity judgements. The metric is defined as the Spearman’s rank correlation coefficient between the two similarity scores.
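The logic of the four metrics can be sketched in a few lines. This is a simplified illustration, not the official evaluation code: it assumes each stimulus has already been reduced to a single fixed-size vector or a single score, it does not reproduce the per-pair aggregation used for the ABX score, and all function names are our own.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def abx_error_rate(triples, distance=cosine_distance):
    """Fraction of (a, b, x) triples, with x in the category of a, where d(a, x) < d(b, x) fails."""
    triples = list(triples)
    errors = sum(1 for a, b, x in triples if not distance(a, x) < distance(b, x))
    return errors / len(triples)

def pairwise_accuracy(score_pairs):
    """Lexical and syntactic metrics: share of pairs where the correct item scores higher.

    score_pairs holds (score_of_correct_item, score_of_incorrect_item) tuples.
    """
    score_pairs = list(score_pairs)
    hits = sum(1 for good, bad in score_pairs if good > bad)
    return hits / len(score_pairs)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Semantic metric: Pearson correlation computed on the ranks of both score lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

For the semantic metric, `xs` would hold the model's similarity for each word pair and `ys` the corresponding human judgements.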

4 Baseline models

To represent a variety of participants, two baseline models have been produced (code and detailed instructions to reproduce our results are available online). One uses a low budget (72 GPU hours); the other uses a high budget (165 GPU hours), corresponding to the following submission tracks:

  • Track A: Low budget models. Low budget submissions use smaller models which can be trained with lower GPU memory and training hours (about 100 GPU hours maximum).

  • Track B: High budget models. High budget submissions use more GPU memory and/or more training time.

Our baselines are directly inspired by the baselines used in the audio-only track. The main difference is that we incorporate a visually grounded (VG) model to learn our speech representations. Those representations are then fed to the language model through K-means clustering. The low-budget baseline completely replaces the contrastive predictive model (CPC) with the VG model. The high-budget baseline, on the other hand, adds the VG model on top of the CPC model.

4.1 Data

The baselines are trained with two datasets:

  • SpokenCOCO [hsu_text-free_2020] for the VG models.

  • LibriSpeech [Panayotov2015librispeech] for K-means clustering (100h from the train-clean-100 subset) and the language models (960h from the train-clean-100, train-clean-360 and train-other-500 subsets).

Image features used to train the VG model are extracted from the preclassification layer of a frozen ResNet-152 model [he_deep_2016] pretrained on ImageNet [deng_imagenet_2009]. We follow [merkx_language_2019] and use the mean feature vector over ten crops of each image.

The acoustic feature vectors used by the low-budget baseline are composed of 12 mel-frequency cepstral coefficients (MFCCs) and log energy, with first and second derivatives, resulting in 39-dimensional vectors. They are computed over windows of 25 ms of speech, with 10 ms shift. The high-budget baseline uses features extracted from a pretrained CPC model that works on raw waveform.
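The arithmetic behind the 39-dimensional vectors and the frame rate can be sketched as below. This is not the baseline's feature extractor: the 13 static coefficients (12 MFCCs plus log energy) are taken as given, the derivatives use a plain central difference rather than the regression window found in most toolkits, and no padding of the signal is assumed when counting frames.

```python
def num_frames(duration_ms, window_ms=25, shift_ms=10):
    """Number of complete analysis windows in a signal, assuming no padding."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms

def deltas(frames):
    """Central-difference derivative over time, repeating the edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out

def add_derivatives(static):
    """Concatenate static features with their first and second derivatives."""
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]

# Five dummy frames of 13 static coefficients each (12 MFCCs + log energy).
static = [[float(t)] * 13 for t in range(5)]
features = add_derivatives(static)
frames_in_one_second = num_frames(1000)
```

With 13 static coefficients per frame, each output vector has 13 × 3 = 39 dimensions.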

4.2 Architecture

4.2.1 Low-budget baseline

The VG model follows the speech-image architecture described in [chrupala_symbolic_2019, higy_textual_2020]. It is composed of visual and speech encoders which each extract fixed length embeddings from the visual and audio input respectively.

The image encoder is composed of a single linear layer projecting the image features into a shared semantic embedding space (dimension 2048), followed by a normalization layer ($\ell_2$ norm).

The speech encoder is composed of a 1D convolutional layer (kernel size 6, stride 2 and 64 output channels), followed by bidirectional gated recurrent units (GRUs) [cho_learning_2014] (4 layers, hidden state of dimension 1024). A vectorial attention layer is then used to convert the variable length input sequence to a fixed-size vector of dimension 2048. Finally, a normalization layer ($\ell_2$ norm) is applied.
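The pooling step at the end of the speech encoder can be sketched as follows. We read "vectorial attention" as one attention weight per time step and per feature dimension, which is an assumption about the cited architecture rather than a statement of it. In the actual model the attention scores would come from a small learned network; here they are passed in precomputed, and all function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def vectorial_attention_pool(states, scores):
    """Pool a T x D sequence into one D-dimensional vector.

    For every dimension d, the scores are normalised over time with a softmax
    and used to take a weighted sum of the states in that dimension.
    """
    dims = len(states[0])
    pooled = []
    for d in range(dims):
        weights = softmax([row[d] for row in scores])
        pooled.append(sum(w * row[d] for w, row in zip(weights, states)))
    return pooled

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

# Two time steps, two dimensions; equal scores give a plain average over time.
states = [[1.0, 2.0], [3.0, 4.0]]
scores = [[0.0, 0.0], [0.0, 0.0]]
pooled = vectorial_attention_pool(states, scores)
embedding = l2_normalize(pooled)
```

Whatever the length of the input sequence, the output has the dimensionality of a single recurrent state, which is what allows the utterance embedding to be compared with the image embedding.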

Once the model is trained, we extract activations of the recurrent layer of the speech encoder and cluster them with K-means (50 clusters). The quantized activations are then used to train a language model corresponding to the BERT-small architecture from the audio-only track.
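The quantization step can be sketched with a small pure-Python K-means. This is an illustration under simplifying assumptions, not the baseline implementation: the toy data use two clusters in two dimensions where the baseline uses 50 clusters over recurrent-layer activations, and the initialisation, iteration count and function names are our own choices.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k], x))

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm with centroids initialised from k random data points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(centroids, p)].append(p)
        for idx, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[idx] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def quantize(frames, centroids):
    """Replace each activation vector by the index of its nearest centroid."""
    return [nearest(centroids, f) for f in frames]

# Toy activations forming two well-separated groups.
activations = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids = kmeans(activations, k=2)
units = quantize(activations, centroids)
```

The resulting sequence of cluster indices plays the role of pseudo-text on which the language model is trained.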

4.2.2 High-budget baseline

The high-budget baseline is essentially the same as the low-budget baseline, with the following exceptions:

  • The VG model is trained on features extracted from the pretrained CPC-small model from the audio-only track. The activations of the last recurrent layer of the CPC model are used.

  • The language model corresponds to the BERT-big architecture from the audio-only track.

4.3 Evaluation

We evaluated the models on the four metrics introduced in [nguyen2020zero]. The phonetic scores are computed with cosine distance on continuous representations (i.e., before quantization) extracted from the (resp. ) recurrent layer of the VG (resp. CPC) model for the low-budget (resp. high-budget) baseline. Lexical and syntactic metrics rely on pseudo-probabilities obtained from the language model of each pipeline. Finally, the semantic scores are computed using activations extracted from the last (resp. ) recurrent layer of the language model for the low-budget (resp. high-budget) audio-only baseline, while the output of the attention layer was used for both visually-grounded models. All semantic scores are obtained with max pooling and the cosine distance to evaluate the similarity between activations. (For the visually-grounded models, the choice of the pooling mechanism does not affect the results, as the output of the attention layer is a fixed-length vector and not a sequence.)

Table 1 summarizes the results obtained with the two visually-grounded baselines as well as the two baselines from the audio-only track.

While the audio-only baselines tend to perform better than their visually-grounded counterparts on phonetic, lexical and syntactic metrics, the latter obtain higher semantic scores overall. One exception is the un-weighted semantic similarity score on the test set, where both visually-grounded baselines get very low scores. Upon inspection, this can be attributed to instability in the scores obtained on the smallest sub-datasets of the test set. This motivated the introduction of a weighted version of the metric (where each sub-score is weighted by the size of the dataset it is obtained from), which is much more stable.
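The difference between the two aggregation schemes can be sketched as follows. The sub-scores and dataset sizes in the example are invented for illustration and are not taken from the challenge data; the function names are likewise our own.

```python
def unweighted_mean(sub_scores):
    """Plain average of the per-dataset scores, ignoring dataset size."""
    return sum(score for score, _ in sub_scores) / len(sub_scores)

def weighted_mean(sub_scores):
    """Average of the per-dataset scores, each weighted by its dataset size."""
    total_size = sum(size for _, size in sub_scores)
    return sum(score * size for score, size in sub_scores) / total_size

# Hypothetical (score, number_of_word_pairs) entries: one large, one small sub-dataset.
sub_scores = [(0.10, 900), (-0.50, 100)]
plain = unweighted_mean(sub_scores)
weighted = weighted_mean(sub_scores)
```

In this made-up example a noisy score on the small sub-dataset drags the plain average below zero, while the weighted average stays close to the score of the large sub-dataset.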

We can also observe that the high-budget visually-grounded baseline, by using CPC-small as a feature extractor, largely closes the gap with the high-budget audio-only baseline on the phonetic, lexical and syntactic metrics (especially the last two).

Track              Budget  Set   Phonetic                    Lexical  Syntactic  Semantic
                                 Within        Across                            Un-weighted     Weighted
                                 clean  other  clean  other                      synth.  libri.  synth.  libri.
Audio-only         Low     dev   0.03   0.05   0.04   0.08   0.61     0.52       4.42    7.07    4.42    7.07
                           test  0.03   0.05   0.04   0.08   0.61     0.53       7.35    2.38    7.31    5.82
                   High    dev   0.03   0.05   0.04   0.08   0.68     0.56       6.25    4.35    6.25    4.35
                           test  0.03   0.05   0.04   0.08   0.68     0.56       5.17    2.48    3.19    1.32
Visually-grounded  Low     dev   0.09   0.10   0.11   0.15   0.53     0.53       9.65    12.61   9.65    12.61
                           test  0.08   0.11   0.11   0.15   0.53     0.53       9.71    0.16    12.57   12.55
                   High    dev   0.06   0.07   0.07   0.11   0.67     0.55       9.60    15.09   9.60    15.09
                           test  0.05   0.07   0.07   0.12   0.67     0.55       9.99    -0.10   13.46   10.96

Table 1: Overall performance of the audio-only and the visually grounded baselines on dev and test sets. Phonetic scores are reported in terms of within- and across-speaker ABX error rates (lower is better). Lexical and syntactic scores represent accuracies computed on the pseudo-probability of spotting the right stimuli (higher is better). The semantic similarity score is reported as the Spearman’s rank correlation coefficient between the distance scores returned by the model and the true human scores (higher is better). Numbers in bold indicate best scores on the development set for each metric.

5 Data

5.1 Training data

Training data are not provided. Instead, ZR-2021VG participants may use any publicly available, private, or proprietary data (audio or visual stream or snapshots; instructional videos with voiceovers; see Appendix A for examples) to train their systems. However, synthetic speech that is generated from text is not allowed. We strongly encourage the use of public data, as this improves interpretability and cumulativity of results. In addition, the following corpora cannot be used because they are part of the evaluation set (for more details, see ZR2021’s challenge description [nguyen2020zero]):

  • sWUGGY (from ZR2021)

  • sSIMI (from ZR2021)

  • sBLIMP (from ZR2021)

A good starting point for people new to working with audio-visual data is the SpokenCOCO dataset, which was also used to train our baselines.

Participants are responsible for documenting precisely the training data they used in any final submissions, as part of their Methods section, including any pre-processing (such as correcting or adding voice activity detection or denoising). We also ask that participants explicitly describe their training data in terms of how similar they are to natural human visual and audio flow. For instance, LibriSpeech is read speech and thus more formal and simpler than natural audio flow, whereas the VanDam corpora [vandam_homebank_2015, vandam_homebank_2015-1] were captured with a wearable device, thus representing natural audio conditions from a first-person perspective.

5.2 Development and test data

Scoring will be done by the ZR team, as part of their general yearly challenge. Dev/test data are therefore not defined here. Please see [nguyen2020zero] for information.

6 Evaluation rules

ZR-2021VG is an open evaluation challenge. Participants must agree to respect the following rules:

  • You should describe training and system data thoroughly, as well as the computational resources used, for your final systems in any submission. Please see Appendix C for the full list.

  • You should do your best to share training data and code publicly, for example by depositing them in a scientific archive such as the Open Science Framework, which contains a GitHub plugin.

  • You should not use the prohibited corpora listed in Section 5.1.

  • You should not perform manual/human investigation of the evaluation data such as listening, manual segmentation, transcription, or other forms of human annotation. Examples of prohibited practices are speech recognition with a language model, phone(me) recognition with a supervised classifier, or using written captions or synthesized-speech captions.

  • You can use any automatically derived information as long as that system was not trained with linguistic labels, for example speaker diarization and denoising, or speaker or language identification. You can also use spoken captions of pictures.

  • Training must be done in an unsupervised fashion within the linguistic domain; i.e., no linguistic labels allowed, including those generated via ASR systems (since these have themselves been trained using linguistic labels).

  • One exception to the general rule of no supervision is for visual features that may be pre-trained with object labels (such as the ImageNet labels), but we categorically forbid the use of full captions, and we caution against using this exception to allow linguistic supervision to seep in. If you are uncertain, please contact the organizers.

Failure to abide by the rules above can lead to disqualification from the challenge, removal of existing submissions, and public remonstrations.

Submissions will be ranked per track (see Section 4) on each of the four metrics described in Section 3.3 separately.

Appendix A Ideas for training data

As in ZR-2021, the following are training options for clean audio:

  • LibriSpeech (the standard subsets: clean-100, clean-360, other-500)

  • Libri-light: small (600h), medium (6kh) or large (60kh).

In addition, we suggest the following audio flow corpora:

The following are some ideas for training options for clean visual data:

Here are training options for audio-video flow:

  • YouCook II, 2K videos, 176 hours [zhou2018towards]

  • How2, 80k instructional videos (about 2000h) with associated English subtitles and summaries that participants should NOT use. [sanabria18how2]

  • HowTo100M, 136M video clips with captions sourced from 1.2M YouTube instructional videos (15 years of videos) covering 23k activities: cooking, hand crafting, personal care, gardening [miech19howto100m]

  • AVA Spoken Activity Datasets (densely labeled for NOSPEECH, CLEANSPEECH, SPEECHWITHMUSIC, SPEECHWITHNOISE). Extracted from 192 movies publicly available on YouTube (multiple languages); 45 hours total [AVA]

Audio-image flow:

  • MIT Places 205: 400,000 spoken audio captions, each of which describes a different Places image; collected from over 2,500 different speakers and covering a 40,000 word vocabulary. The average caption duration is approximately 10 seconds. Used in [harwath2019learning].

  • Flickr8K: contains 40,000 read audio captions describing 8,000 images from Flickr. Used in [harwath2015deep].

  • SpokenCOCO: contains approximately 600,000 read audio captions describing images from the MSCOCO dataset. Used in [hsu_text-free_2020].

Appendix B Q&A

Q: Is it possible to use instructional videos with voiceovers?

A: Yes!

Q: I’m not interested in creating a model for the lower levels (phonetics), I only care about word learning.

A: You are welcome to use other unsupervised models at the beginning of your pipeline, for instance by taking our baseline and replacing the late components with your own ideas for the “semantic” level sections.

Q: Can I use other data, beyond audio and visual?

A: Provided it’s not text labels, you probably can. For instance, a model that is trained with (1) an image paired with (2) an audio file and (3) the time-course of human gazes on the image would be acceptable.

Appendix C Checklist for submission

Have you:

  • played by the rules, as in Section 6?

  • clearly stated in your paper what your budget was?

  • reported on all 4 evaluation conditions, at least in an online appendix to your paper?

  • described the data you used for training (in terms of sources as well as in terms of quantity)?

  • described at least one of your models in terms of complexity and computing resources?