Semantic keyword spotting by learning from images and speech

10/05/2017 ∙ by Herman Kamper, et al. ∙ Stellenbosch University ∙ Toyota Technological Institute at Chicago

We consider the problem of representing semantic concepts in speech by learning from untranscribed speech paired with images of scenes. This setting is relevant in low-resource speech processing, robotics, and human language acquisition research. We use an external image tagger to generate soft labels, which serve as targets for training a neural model that maps speech to keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic keyword spotting, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60% on its top ten semantic retrievals. Compared to a model trained on transcriptions, our model matches human judgements better by some measures, especially in retrieving non-verbatim semantic matches.




1 Introduction

The last few years have seen great advances in automatic speech recognition (ASR). However, current methods require large amounts of transcribed speech data, which are available only for a small fraction of the world’s languages [Besacier et al.2014]. This has prompted work on speech models that, instead of using exact transcriptions, can learn from weaker forms of supervision, e.g., known word pairs [Synnaeve et al.2014a, Settle et al.2017], translation text [Duong et al.2016, Bansal et al.2017, Weiss et al.2017], or unordered word labels [Palaz et al.2016]. Here we consider the setting where untranscribed speech is paired with images. Such visual context could be used to ground speech when it is not possible to obtain transcriptions, e.g., for endangered or unwritten languages [Chrupała et al.2017]. In robotics, co-occurring audio and visual signals could be combined to learn new commands [Luo et al.2008, Krunic et al.2009, Taniguchi et al.2016].

Our work builds on a line of recent studies [Synnaeve et al.2014b, Harwath et al.2016, Chrupała et al.2017] that use natural images of scenes paired with spoken descriptions. Neither the spoken nor the visual input is labelled. Most approaches map the images and speech into some common space (§2.3), allowing images to be retrieved using speech and vice versa. Although useful, such models cannot predict (written) labels for the input speech.

Kamper et al. [2017] proposed a model trained on images and speech that maps speech to text labels. They used a visual tagger to obtain soft text labels for each training image, and then trained a neural network to map speech to these soft targets. Without observing any parallel speech and text, the resulting model could be applied as a keyword spotter, predicting which utterances in a search collection contain a given written keyword. The authors observed that the model often confuses semantically related words, which count as errors in keyword spotting, but could still be useful in search applications.

Our work here is inspired by these observations. We define a new task, semantic keyword spotting, where the aim is to retrieve all utterances in a speech collection that are semantically relevant to a given query keyword, irrespective of whether that keyword occurs exactly in an utterance or not. E.g., given the query ‘children’, the goal is to return not only utterances containing the word ‘children’, but also utterances about children, like ‘young boys playing soccer in the park’. To our knowledge, speech data with this type of semantic labelling does not exist, so we collect and release a new data set. Using this data, we present an extensive analysis of an improved version of the model of [Kamper et al.2017], and compare it to several new alternative models for the task of semantic keyword spotting.

Our main finding is that the predictions from the vision-speech model match well with human judgements, leading to competitive scores as a semantic keyword spotter, in particular in retrieving non-exact semantic matches. Specifically, the model even outperforms a “cheating” model (trained on transcriptions) when searching for utterances that are semantic but not verbatim matches to the keyword. We conclude that visual context can play an important role in learning to map speech to semantics.

2 Related work

2.1 Language acquisition

Cognitive scientists have long been interested in how infants use sensory input to learn the mapping of words to real-world entities [Yu and Smith2007, Cunillera et al.2010, Thiessen2010], with computational models providing one way to specify and test theories. Roy and Pentland [2002] and Yu and Ballard [2004] were some of the first to consider real speech input, followed by more recent work such as [Ten Bosch and Cranen2007, Aimetti2009, Driesen and Van hamme2011]. However, these studies simplify the problem by using discrete labels to represent the visual context [Räsänen and Rasilo2015], the spoken input [Gelderloos and Chrupała2016], or both [Siskind1996, Frank et al.2009]; infants do not have access to such idealised input.

Our model operates on natural images paired with real unlabelled speech. One assumption we make is that a visual tagger is available for processing training images. Our focus is not on cognitive modelling, but this assumption is linked to the question of whether visual category acquisition precedes word learning in infants [Clark2004]. We leave the cognitive implications of our model for future work.

2.2 Joint modelling of vision and text

Joint modelling of images and text has received much recent attention. One approach is to map images and text into a common vector space where related instances are close to each other, e.g., for text-based image retrieval [Socher and Fei-Fei2010, Weston et al.2011, Hodosh et al.2013, Karpathy et al.2014]. Image captioning has also been studied extensively, where the goal is to produce a natural language description for a visual scene [Farhadi et al.2010, Yang et al.2011, Kulkarni et al.2013, Bernardi et al.2016]. Most recent approaches use a convolutional neural network (CNN) to convert an input image to a latent representation, which is then fed to a recurrent neural network to produce a sequence of output words [Donahue et al.2015, Fang et al.2015, Karpathy and Fei-Fei2015, Vinyals et al.2015, Xu et al.2015, Chen and Zitnick2015]. Our work uses spoken rather than written language, and we consider a semantic speech retrieval task.

There has also been work on using vision to explicitly capture aspects of text semantics. Semantics are difficult to annotate, so most studies evaluate models using soft human ratings on tasks such as word similarity [Feng and Lapata2010, Silberer and Lapata2012, Bruni et al.2014], word association [Bruni et al.2012, Roller and Schulte Im Walde2013] or concept categorisation [Silberer and Lapata2014]. We also use human responses, but for the task of semantic keyword spotting, which is more closely related to typical speech tasks.

2.3 Joint modelling of vision and speech

Some recent work has shown that ASR can be improved when additional visual features are available from the scene in which speech occurs [Sun et al.2016, Gupta et al.2017]. These studies consider fully supervised ASR, and typically the scene is not described by the speech but is complementary to it. Our aim instead is to use the visual modality to learn from matching but untranscribed speech.

This is the setting considered by Synnaeve et al. [2014b] and Harwath et al. [2016], who used natural images of scenes paired with unlabelled spoken descriptions to learn neural mappings of images and speech into a common space. This approach allows images to be retrieved using speech and vice versa. This is useful, e.g., in applications for tagging mobile phone images with spoken descriptions [Hazen et al.2007, Anguera et al.2008]. The joint neural mapping approach has subsequently been used for spoken word and object segmentation [Harwath and Glass2017], and the learned representations have been analysed in a variety of ways [Chrupała et al.2017, Drexler and Glass2017]. Despite these developments, this prior work does not give an explicit mapping of speech to textual labels.

The model of Kamper et al. [2017], which we extend directly here, can make such labelled predictions: a visual tagger is used to obtain soft labels for a given training image, and a neural network is then trained to map input speech to these targets. The resulting speech model attempts to predict which (written) words are present in a spoken utterance, ignoring word order. The model can be used as a keyword spotter, retrieving utterances in a speech collection that contain a given keyword. It was found that the model also retrieves semantic matches for a query, not only exact matches. However, Kamper et al. [2017] did not formalise the semantic search task, compare results to human judgements, or extensively analyse the performance of the model against multiple other systems. Here we present an extensive analysis using several alternative models on a new semantic speech task.

2.4 Semantic text retrieval

In textual information retrieval, the task is to find text documents in a collection that are relevant to a given written keyword, irrespective of whether the keyword occurs exactly in the document. This can be accomplished using query expansion, where additional words or phrases similar in meaning to the query are used to match relevant documents [Xu and Croft1996, Graupmann et al.2005]. Classic approaches use co-occurrences of words, or resources such as WordNet [Miller1995], to expand the query list, while recent methods use word embeddings [Diaz et al.2016, Roy et al.2016].

We also consider a semantic search task, but on speech rather than text. We do use text-based methods to obtain an upper bound on performance using the transcriptions of the speech collection (§5.2).

Figure 1: A model for visually grounding untranscribed speech. A trained visual tagger produces soft outputs $\hat{\boldsymbol{y}}_{\mathrm{vis}}$ given an input image $I$, which are then used as targets for the speech network $f(X)$ fed with the corresponding spoken caption $X$.

3 A keyword prediction model trained on images and untranscribed speech

Given a corpus of parallel images and spoken captions, neither with textual labels, we train a spoken keyword prediction model using a visual tagging system to produce soft labels for the speech network.

The overall model is illustrated in Figure 1. Each training image $I$ is paired with a spoken caption $X = \boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T$, where each frame $\boldsymbol{x}_t$ is an acoustic feature vector, e.g., mel-frequency cepstral coefficients (MFCCs) [Davis and Mermelstein1980]. We use a vision system to tag $I$ with soft textual labels, giving targets $\hat{\boldsymbol{y}}_{\mathrm{vis}}$ to train the speech network $f(X)$. The resulting network can then predict which keywords are present for a given utterance $X$, disregarding the order, quantity, or locations of the keywords in the input speech. The possible keywords (i.e. the vocabulary) are implicitly specified by the visual tagger, and no transcriptions are used during training. When applying the trained $f(X)$ as a keyword spotter, only speech is used without any visual input.

3.1 Model details

If we knew which words occur in training utterance $X$, we could construct a multi-hot vector $\boldsymbol{y}_{\mathrm{bow}} \in \{0,1\}^W$, with $W$ the vocabulary size and each dimension $y_{\mathrm{bow},w}$ a binary indicator for whether word $w$ occurs in $X$. In [Palaz et al.2016], transcriptions were used to obtain exactly this type of ideal bag-of-words (BoW) labelling where, in contrast to typical ASR supervision, the order and quantity of words are ignored. Instead of a transcription for $X$, we only have access to the paired image $I$. We use a multi-label visual tagging system which, instead of binary indicators, produces soft targets $\hat{\boldsymbol{y}}_{\mathrm{vis}} \in [0,1]^W$, with $\hat{y}_{\mathrm{vis},w} = P(w \mid I, \gamma)$ the estimated probability of word $w$ being present given image $I$ under vision model parameters $\gamma$. In Figure 1, $\hat{y}_{\mathrm{vis},w}$ would ideally be close to 1 for $w$ corresponding to words such as ‘jumping’, ‘man’, ‘snow’ and ‘snowboard’, and close to 0 for irrelevant dimensions. This visual tagger is fixed: during training of the speech network $f(X)$, as described below, the vision parameters $\gamma$ are never updated.

Given $\hat{\boldsymbol{y}}_{\mathrm{vis}}$ as target, we train the speech model $f(X)$ (Figure 1, right). This model (with parameters $\theta$) consists of a CNN over the speech $X$ with a final sigmoidal layer so that $f(X) \in [0,1]^W$. We interpret each dimension of the output as $f_w(X) = P(w \mid X, \theta)$. Note that $f(X)$ is not a distribution over the output vocabulary, since any number of keywords can be present in an utterance; rather, it is a multi-label classifier where each dimension $f_w(X)$ can have any value in $[0,1]$. We train the speech model using the summed cross-entropy loss, which (for a single training example) is:

$$L\big(f(X), \hat{\boldsymbol{y}}_{\mathrm{vis}}\big) = -\sum_{w=1}^{W} \Big\{ \hat{y}_{\mathrm{vis},w} \log f_w(X) + \big(1 - \hat{y}_{\mathrm{vis},w}\big) \log\big[1 - f_w(X)\big] \Big\} \qquad (1)$$

If we had $\hat{y}_{\mathrm{vis},w} \in \{0,1\}$, as in $\boldsymbol{y}_{\mathrm{bow}}$, this would be the summed log loss of $W$ binary classifiers. The idea of using a pre-trained visual network to provide the supervision for another modality was also used with a similar loss in [Aytar et al.2016], where video was paired with general audio (not speech).
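As an illustration of the summed cross-entropy loss with soft targets, the following is a minimal pure-Python sketch for a single training example. The function name, the clipping constant `eps` and the toy values are assumptions made for this example and are not taken from the paper's implementation.

```python
import math


def summed_cross_entropy(predictions, targets, eps=1e-12):
    """Summed binary cross-entropy between sigmoid outputs and soft targets.

    predictions: per-keyword outputs of the speech network, each in [0, 1]
    targets:     per-keyword soft labels from the visual tagger, each in [0, 1]
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    loss = 0.0
    for p, t in zip(predictions, targets):
        # Clip the prediction so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss


# Toy three-word vocabulary; all values are made up for illustration.
soft_targets = [0.9, 0.1, 0.5]
matching_outputs = [0.9, 0.1, 0.5]
mismatched_outputs = [0.1, 0.9, 0.5]
print(summed_cross_entropy(matching_outputs, soft_targets))
print(summed_cross_entropy(mismatched_outputs, soft_targets))
```

With soft targets the loss is smallest when the outputs equal the targets, but it does not reach zero there: the minimum is the summed binary entropy of the targets. With hard 0/1 targets it reduces to the usual summed log loss.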

3.2 The visual tagging system

In image classification, the task is to choose one (object) class from a closed set [Deng et al.2009], while in image captioning, the goal is to produce a natural language description (§2.2). In contrast to both these tasks, we require a visual tagging system [Barnard et al.2003, Guillaumin et al.2009, Chen et al.2013] that predicts an unordered set of words (nouns, adjectives, verbs) that accurately describe aspects of the scene (Figure 1, left). This is a multi-label binary classification task.

We train our visual tagger on data from the Flickr30k [Young et al.2014] and MSCOCO [Lin et al.2014] data sets, which consist of images each with five written captions. We combine the entire Flickr30k with the training portion of MSCOCO, and remove any images that occur in the parallel image-speech data used in our experiments (§5). The result is a training set of around 106k images, significantly more than the 25k used in [Kamper et al.2017]. For each image, a single target BoW vector $\boldsymbol{y}_{\mathrm{bow}}$ is constructed by combining all five captions after removing stop words. Element $y_{\mathrm{bow},w}$ is an indicator for whether word $w$ occurs in any of the five captions for that image, where $w$ is one of the $W = 1000$ most common content words in the combined set of image captions.
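The construction of the BoW targets from captions can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list, whitespace tokenisation and the function names are hypothetical, since the text does not specify how these steps were implemented.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the actual list is not given).
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "is", "with"}


def content_words(caption):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in caption.lower().split() if w not in STOP_WORDS]


def build_vocabulary(captions_per_image, vocab_size):
    """Pick the `vocab_size` most common content words over all captions."""
    counts = Counter()
    for captions in captions_per_image:
        for caption in captions:
            counts.update(content_words(caption))
    return [word for word, _ in counts.most_common(vocab_size)]


def bow_target(captions, vocabulary):
    """Multi-hot vector: 1 if the word occurs in any caption of the image."""
    present = set()
    for caption in captions:
        present.update(content_words(caption))
    return [1 if word in present else 0 for word in vocabulary]


# Two toy images with two captions each (the real data has five per image).
toy_captions = [
    ["a man jumping on a snowboard", "man in the snow"],
    ["a dog running on the beach", "the dog is running"],
]
vocabulary = build_vocabulary(toy_captions, vocab_size=3)
for captions in toy_captions:
    print(dict(zip(vocabulary, bow_target(captions, vocabulary))))
```

Word counts and word order within a caption are discarded on purpose, which is what makes the target a bag-of-words indicator vector.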

We follow the common practice of using a pre-trained vision representation for processing the images. Specifically, we use VGG-16 [Simonyan and Zisserman2014], trained on around 1.3M images [Deng et al.2009]; we replace the final classification layer with four 2048-unit ReLU layers, followed by a final sigmoidal layer for predicting word occurrence (Figure 1, left). The visual tagger, with parameters $\gamma$, is then trained on the combined Flickr30k and MSCOCO data using the summed log loss (1), with the tagger output $\hat{\boldsymbol{y}}_{\mathrm{vis}}$ as the prediction and the caption-derived BoW vector $\boldsymbol{y}_{\mathrm{bow}}$ as the target. The VGG-16 parameters are fixed; only the additional fully connected layers are updated.

Previous joint image and speech models [Synnaeve et al.2014b, Harwath et al.2016, Chrupała et al.2017] also employ pre-trained visual representations. Here we take this approach even further by using the textual classification output of a trained vision system. Although we train (and then fix) the visual tagger ourselves, we ensure that none of the training data overlaps with the parallel image-speech data used in our speech model, so the model does not obtain even indirect access to text labels.

4 Semantic keyword spotting

We are interested in whether our model can be used to determine the semantic concepts present in a speech utterance. I.e., can we use the model to search a speech collection for mentions of a particular semantic concept? We formalise this task and collect a new data set for evaluation.

4.1 Task description

In the speech technology community, keyword spotting is the task of retrieving utterances in a search collection that contain spoken instances of a given written keyword [Wilpon et al.1990, Szöke et al.2005, Garcia and Gish2006]. In this work, we define a new task called semantic keyword spotting. Instead of matching keywords exactly, the task is to retrieve all utterances that would be relevant if you searched for that keyword, irrespective of whether that keyword occurs in the utterance or not. E.g., given the query ‘sidewalk’, a model should return not only utterances containing the word exactly, but also speech like ‘an old couple crossing a street’.

4.2 Data set

For the sentence below, select all words that could be used to search for the described scene. Sentence: a skateboarder in a light green shirt.  dogs   beach   children   white   swimming   wearing   skateboard   None of the above
Figure 2: An example Amazon Mechanical Turk (AMT) job for semantically annotating a given transcription of a spoken sentence with a set of keywords. In this case, ‘skateboard’ was selected by all five annotators and ‘wearing’ by four (the other keywords were not selected).

As far as we know, there are no speech data sets labelled in this way, so we collect our own. Specifically, we extend the corpus of [Harwath and Glass2015], which consists of parallel images and spoken captions (we use the same data in §5). The data comes with transcriptions, but it has not been labelled semantically. For a subset of the speech in the corpus, we use Amazon Mechanical Turk (AMT) to collect semantic labels from human annotators.

As our keywords, we select a set of 70 random words from the transcriptions of the training portion of the corpus, ignoring stop words. The test portion of the spoken caption data consists of 1000 images, each with five spoken captions; we collect semantic labels for one randomly selected spoken caption from each of these 1000 images. A single AMT job consists of the transcription of a single utterance (describing a scene) with a list of seven potential keywords from which an annotator could select any number, as illustrated in Figure 2. To cover all 70 keywords, a given sentence is repeated over ten jobs. Since the question of semantic relatedness (between a given sentence and keyword) is inherently ambiguous, we have five workers annotate each utterance.

Count 0 1 2 3 4 5
Proportion (%) 83.3 6.9 3.0 2.6 2.5 1.6
Table 1: The proportion of (sentence, keyword) pairings selected by a given number of annotators.

To analyse the agreement between annotators, we construct a count matrix $C$, where element $c_{ij}$ gives the number of annotators that selected keyword $j$ for sentence $i$. Each entry in the count matrix is therefore between zero and five. Taking together all the matrix entries, Table 1 gives the proportion of entries for each possible count. Annotators agree most often about the absence of a keyword for a given sentence (83.3% of all entries in the count matrix are zero), while there are very few keywords that all annotators agree are semantically related (1.6%).

In order to evaluate a semantic keyword spotting model against the human annotations, one option is to combine the human judgements into a single hard label. On the other hand, the fact that there is a wide range of opinions among the human annotators indicates that semantic relevance may be inherently “soft”, motivating evaluation by comparing against the proportion of annotators that agree with a given label. We consider both options in our experiments.

To obtain a hard label of whether a keyword is semantically relevant or irrelevant for a sentence, we take the majority decision: if three or more annotators selected that keyword, we label that keyword as relevant for that sentence; otherwise we label it as irrelevant. Given this assignment, we can use our count matrix to determine what proportion of annotators agree with the majority decision (whether relevant or irrelevant): we find that 95.8% agree with the hard decision (bearing in mind the skew towards negative assignments).
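The majority labelling and the agreement figure can be sketched in a few lines. The function names and the toy count matrix below are assumptions for illustration; only the proportions passed to `agreement_from_histogram` are taken from Table 1.

```python
def majority_labels(counts, num_annotators=5):
    """Hard label per (sentence, keyword): 1 if a majority selected the keyword."""
    threshold = num_annotators // 2 + 1  # three out of five annotators
    return [[1 if c >= threshold else 0 for c in row] for row in counts]


def agreement_with_majority(counts, num_annotators=5):
    """Fraction of individual annotator decisions that match the majority label."""
    threshold = num_annotators // 2 + 1
    agree = 0
    total = 0
    for row in counts:
        for c in row:
            # Relevant by majority: the c annotators who selected it agree.
            # Irrelevant by majority: the (num_annotators - c) others agree.
            agree += c if c >= threshold else num_annotators - c
            total += num_annotators
    return agree / total


def agreement_from_histogram(proportions, num_annotators=5):
    """Same quantity, computed from the share of entries with each count."""
    threshold = num_annotators // 2 + 1
    agree = sum(
        p * (c if c >= threshold else num_annotators - c)
        for c, p in enumerate(proportions)
    )
    return agree / (num_annotators * sum(proportions))


# Toy count matrix: two sentences, three keywords (made-up values).
toy_counts = [[0, 5, 2], [4, 0, 1]]
print(majority_labels(toy_counts))
print(agreement_with_majority(toy_counts))

# Proportions (%) for counts 0 to 5, as listed in Table 1.
print(agreement_from_histogram([83.3, 6.9, 3.0, 2.6, 2.5, 1.6]))
```

Applied to the rounded proportions of Table 1, the histogram version gives a value close to the 95.8% reported above; any small difference comes from rounding in the table.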

We also calculated the proportion of annotators that agree with the majority individually for each of the keywords. Most keywords had similar agreement scores, but three keywords (‘one’, ‘person’ and ‘plays’) had substantially worse agreement than the others. We therefore excluded these three, leaving 67 keywords in the data set.

The result of our data collection effort is 1000 spoken utterances, each annotated with a set of keywords that could be used to search for that utterance.

5 Experimental setup and evaluation

We train our model on the corpus of parallel images and spoken captions of [Harwath and Glass2015], containing 8000 images with five spoken captions each. The audio comprises around 37 hours of active speech (non-silence). The data comes with train, development and test splits containing 30 000, 5000 and 5000 utterances, respectively. As described in §4.2, we obtained semantic labels for 1000 of the test utterances. We parameterise the speech audio as 13 MFCCs concatenated with first and second order derivatives, giving 39-dimensional input vectors. 99.5% of the utterances are 8 s or shorter, and utterances longer than this are truncated.

Training images are passed through the visual tagger (§3.2), producing soft targets for training the keyword prediction model on the unlabelled speech (§3.1), as shown in Figure 1. We refer to the resulting model as VisionSpeechCNN, which during testing is presented only with spoken input (no image). We are interested in semantic keyword spotting (§4), but labels for this task are only available for test utterances. We therefore optimise the hyper-parameters of the model using exact keyword spotting on development data. VisionSpeechCNN has the following resulting structure (also in Figure 1): 1-D ReLU convolution with 64 filters over 9 frames; max pooling over 3 units; 1-D ReLU convolution with 256 filters over 10 units; max pooling over 3 units; 1-D ReLU convolution with 1024 filters over 11 units; max pooling over all units; 3000-unit fully-connected ReLU; and the 1000-unit sigmoid output. Based on experiments on development data, we train for a maximum of 25 epochs with early stopping using Adam optimization [Kingma and Ba2014] with a learning rate of and a batch size of eight.

5.1 Evaluation

To apply VisionSpeechCNN as a semantic keyword spotter, we use its output $f_w(X)$ as a score for how relevant an utterance $X$ is given the keyword $w$. Although the model produces scores for all words in its output vocabulary, which is implicitly specified by the visual tagger, we are only interested in those dimensions corresponding to the test keywords. The baseline and cheating models (described below) similarly predict a relevance score for each utterance given a specific keyword.

We compare a model’s predictions to semantic labels obtained from human annotators (§4.2) using several metrics. To obtain a hard labelling from a model, we set a threshold $\alpha$, and label all keywords for which $f_w(X) > \alpha$ as relevant. By comparing this to the ground truth semantic labels (according to majority annotator agreement), precision and recall can be calculated; to measure performance independent of $\alpha$, we report average precision (AP), the area under the precision-recall curve as $\alpha$ is varied. Instead of using the hard ground truth semantic labels, the soft scores $f_w(X)$ can also be compared directly to the number of annotators that selected the keyword $w$ for utterance $X$: we use Spearman’s $\rho$ to measure the correlation between the rankings of these two variables, as is common in work on word similarity [Agirre et al.2009, Hill et al.2015]. The remaining metrics are standard in (exact) keyword spotting, based on how a model ranks utterances in the test data from most to least relevant for each keyword [Hazen et al.2009, Zhang and Glass2009]: precision at ten ($P@10$) is the average precision of the ten highest-scoring proposals; precision at $N$ ($P@N$) is the average precision of the top $N$ proposals, with $N$ the number of true occurrences of the keyword; and equal error rate (EER) is the average error rate at which false acceptance and rejection rates are equal.

Apart from Spearman’s $\rho$, all of these metrics can also be used to evaluate exact keyword spotting.
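The ranking-based metrics can be sketched for a single keyword as below; the reported figures would then be averages over keywords. The function names and toy scores are assumptions for illustration, and `average_precision` uses the common non-interpolated estimate of the area under the precision-recall curve, which may differ in detail from the computation used for the reported results.

```python
def rank_by_score(scores, labels):
    """Pair each score with its label and sort from highest to lowest score."""
    return sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)


def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring utterances that are relevant."""
    top = rank_by_score(scores, labels)[:k]
    return sum(label for _, label in top) / k


def precision_at_n(scores, labels):
    """Precision at N, with N the number of relevant utterances for the keyword."""
    n = sum(labels)
    return precision_at_k(scores, labels, n) if n > 0 else 0.0


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each relevant utterance."""
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(rank_by_score(scores, labels), start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy scores for four utterances and one keyword (1 = relevant, 0 = irrelevant).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(precision_at_k(toy_scores, toy_labels, k=2))
print(precision_at_n(toy_scores, toy_labels))
print(average_precision(toy_scores, toy_labels))
```

Because only the ordering of the scores matters, none of these functions needs a threshold, which is why they are convenient for comparing models whose scores are on different scales.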

5.2 Baselines and cheating models

We consider a number of baselines as well as “cheating” models, which use idealised information not available to VisionSpeechCNN.

Prior-based baselines.

TextPrior uses the unigram probability of each keyword estimated from the transcriptions of the training portion of the spoken captions corpus. This will indicate how much better our model does than simply hypothesising common words. Similarly, the baseline VisionTagPrior is obtained by passing all training images from the spoken captions corpus through the trained visual tagger (§3.2), and then taking the average over all images. This indicates how our model compares to simply predicting common visual labels.


VisionSpeechCNN is trained to predict soft visual tags. One question is whether it therefore learns to ignore any aspect of the acoustics that does not contribute to predicting the visual target. The model VisionCNN is an attempt to test this: as the representation for each test utterance, it passes through the visual tagger the true image paired with that utterance. Since it uses ideal information, it is our first cheating model. If VisionSpeechCNN were to perfectly predict image tags (ignoring acoustics that do not contribute to visual prediction), then VisionCNN would be an upper bound on performance. But in reality our model could do better or worse than VisionCNN, since the speech contains some information not in the images and training does not generalise perfectly.


Instead of soft targets from an image tagger, the SupervisedBoWCNN cheating model uses transcriptions to obtain ideal BoW supervision (§3.1): targets are constructed for the 1000 most common words in the transcriptions of the 30 000 speech training utterances (ignoring stop words) and the loss (1) is used for training, with the speech network output $f(X)$ scored against the transcription-derived BoW vector $\boldsymbol{y}_{\mathrm{bow}}$, i.e. $L\big(f(X), \boldsymbol{y}_{\mathrm{bow}}\big)$. Other than ideal supervision, the model has the same structure and training procedure as VisionSpeechCNN.

Text-based cheating models.

Suppose we had a perfect ASR system, converting input speech to text without errors. How well could we do at semantic keyword spotting using this error-free text? To answer this, we consider two text-based semantic retrieval methods (§2.4). The first is based on WuP similarity, named after Wu and Palmer [1994], which scores the semantic relatedness between two words according to the path length between them in the WordNet lexical hierarchy [Miller1995]. For our task, the TextWuP model is based on the closest match (in WuP) between a keyword and each of the words in a transcribed utterance. Our second text-based cheating method is based on word embeddings. Specifically, we use the Paragram XXL embedding method of [Wieting et al.2015, Wieting et al.2016], which was developed for semantic sentence similarity prediction. For our task, the TextParaphEmbed cheating model calculates the cosine similarity between a keyword embedding and the Paragram sentence embedding of an utterance.
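The scoring step of an embedding-based retrieval method such as TextParaphEmbed can be sketched as a cosine-similarity ranking. The three-dimensional vectors below are made up for illustration and are not Paragram embeddings; the function names and utterance identifiers are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_utterances(keyword_vec, utterance_vecs):
    """Return utterance ids sorted from most to least similar to the keyword."""
    scores = {
        uid: cosine_similarity(keyword_vec, vec)
        for uid, vec in utterance_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed values, chosen only to make the ranking obvious).
keyword_embedding = [1.0, 0.0, 0.0]
utterance_embeddings = {
    "utt_a": [0.9, 0.1, 0.0],
    "utt_b": [0.0, 1.0, 0.0],
    "utt_c": [0.5, 0.5, 0.0],
}
print(rank_utterances(keyword_embedding, utterance_embeddings))
```

The similarity itself serves as the relevance score for a (keyword, utterance) pair, so the same ranking metrics apply to this model as to the speech-based ones.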

Keyword Top retrieved utterance Count
bike a dirt biker rides through some trees 4 / 5
carrying small dog running in the grass with a toy in its mouth 2 / 5
children a group of young boys playing soccer 4 / 5
face a man in a gray shirt climbs a large rock wall 2 / 5
field two white dogs running in the grass together 3 / 5
hair two women and a man smile for the camera 0 / 5
jumps biker jumps off of ramp 5 / 5
large a group of people on a zig path through the mountains 1 / 5
ocean man falling off a blue surfboard in the ocean 5 / 5
race a red and white race car racing on a dirt racetrack 5 / 5
riding a surfer rides the waves 4 / 5
sitting a baby eats and has food on his face 1 / 5
snowy a skier catches air over the snow 5 / 5
swimming a woman holding a young boy slide down a water slide into a pool 3 / 5
young a little girl in a swing laughs 4 / 5
Table 2: For a selection of keywords, the retrieved utterance rated highest by VisionSpeechCNN. The number of annotators (out of five) that selected the keyword for that utterance is also shown; a count below three indicates an incorrect semantic retrieval according to the majority labelling.

6 Experimental results and analysis

For a first qualitative view, Table 2 shows the top retrievals when we use VisionSpeechCNN to do semantic keyword spotting for a selection of keywords. The number of annotators that marked the utterance as relevant to the keyword is also shown, with counts below three indicating incorrect retrievals according to the ground truth (i.e. majority) semantic labelling. Out of the 15 results shown, ten retrievals are correct.

The quantitative metrics for exact and semantic keyword spotting for VisionSpeechCNN and all the baseline and cheating models are shown in Table 3. In both exact and semantic keyword spotting, VisionSpeechCNN outperforms the baseline models across all metrics. The baseline models, VisionSpeechCNN, and VisionCNN all perform better at semantic than at exact keyword spotting. In contrast, the transcription-based cheating models (rows 5 to 7) perform better on $P@10$, but worse on all other semantic search metrics. $P@10$ only measures precision of the highest ranked utterances, while the other metrics combine precision and recall; thus, the transcription-based cheating models struggle to retrieve semantic matches compared to exact matches, while VisionSpeechCNN and VisionCNN recall more semantic matches. In terms of absolute performance, the transcription-based models still perform better at semantic keyword spotting on the metrics based on hard ground truth labels. However, for Spearman’s $\rho$, which gives credit even if a prediction does not match the majority of annotations, VisionCNN outperforms all other models, followed closely by VisionSpeechCNN. Visual context is clearly beneficial in matching soft human ratings. Next, we further analyse the models by addressing the following questions.

                       Exact keyword spotting (%)       Semantic keyword spotting (%)
Model                  P@10   P@N    EER    AP          P@10   P@N    EER    AP     Spearman's ρ
Baseline models:
1. TextPrior            2.8    3.4   50.0    8.7         6.1    7.0   50.0   11.4   10.8
2. VisionTagPrior       2.8    3.4   50.0    7.0         6.1    7.0   50.0   13.6   12.5
3. VisionSpeechCNN     38.5   30.8   19.6   26.9        58.8   39.7   23.9   39.4   32.4
Cheating models:
4. VisionCNN           31.0   26.2   22.1   22.2        54.2   38.9   22.8   37.4   33.8
5. SupervisedBoWCNN    84.9   74.7    5.6   87.3        88.1   50.3   23.8   51.3   21.9
6. TextWuP             65.4   67.3    2.6   75.2        80.3   63.0   19.4   60.9   25.2
7. TextParaphEmbed     80.0   72.1    3.5   67.7        88.8   64.0   14.3   60.1   31.6
Table 3: Exact and semantic keyword spotting performance for VisionSpeechCNN (row 3), compared against the baseline (rows 1 and 2) and cheating (rows 4 to 7) models. Boldface indicates both the top-scoring non-cheating model (rows 1 to 3) as well as the best cheating model (rows 4 to 7) for each of the metrics.

Does VisionSpeechCNN only output common words?

The baseline models (rows 1 and 2) respectively assign scores to keywords according to their occurrence in the training transcriptions and the average visual tagger output. VisionSpeechCNN outperforms both these models across all metrics.

Does the model do more than map acoustics to images?

One possibility is that VisionSpeechCNN might (conceptually) map speech to the closest image and then output the corresponding visual tagger prediction, in effect ignoring signals in the acoustics that are irrelevant to producing the visual output. To see if this is the case, we compare with VisionCNN (row 4), which represents a test utterance by passing the paired test image through the visual tagger. VisionSpeechCNN outperforms VisionCNN on all metrics except semantic EER and Spearman’s $\rho$. This indicates that VisionSpeechCNN does not simply reproduce the output from the visual tagger (used to supervise it); it actually achieves superior performance for all exact keyword spotting metrics and most of the semantic metrics. Again, Spearman’s $\rho$ takes the actual annotator counts into account, and here VisionCNN performs best.

Is there a benefit to using visual context over transcriptions?

Figure 3: t-SNE visualisations of acoustic embeddings of isolated words for ten keyword types. A penultimate 256-dimensional bottleneck layer was added to each of the two models; segmented word tokens from the development data were then passed through the models, with the bottleneck layers giving the acoustic word embeddings shown here.

This work is motivated by settings in which text transcriptions are not available while images may be. However, it is interesting also to consider whether there is some benefit to the training images beyond serving as weak labels. Might the visual grounding actually provide better supervision than text for some purposes?

To answer this question in the context of semantic keyword spotting, we compare rows 3 and 5 of Table 3. In exact keyword spotting, VisionSpeechCNN lags behind SupervisedBoWCNN, which is trained on ground truth transcriptions. However, when moving from exact to semantic keyword spotting, VisionSpeechCNN achieves absolute improvements of roughly 10% to 20% in P@10, P@N and AP. In contrast, SupervisedBoWCNN performs around 20% worse on all metrics apart from P@10. VisionSpeechCNN also outperforms SupervisedBoWCNN in terms of Spearman’s ρ, so it better matches the soft human judgements. This indicates that there is actually some benefit to the visual context in training beyond serving as a weak form of transcription.
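The exact-to-semantic changes discussed above can be checked directly from the Table 3 entries. The short sketch below copies the values for rows 3 and 5, reading the four metric columns shared by both tasks as P@10, P@N, EER and AP (an assumption about the column order), and computes the absolute change for each metric.

```python
# Values copied from Table 3 (percentages).
# Assumed metric order within each task: P@10, P@N, EER, AP.
METRICS = ("P@10", "P@N", "EER", "AP")
TABLE3 = {
    "VisionSpeechCNN": {
        "exact": (38.5, 30.8, 19.6, 26.9),
        "semantic": (58.8, 39.7, 23.9, 39.4),
    },
    "SupervisedBoWCNN": {
        "exact": (84.9, 74.7, 5.6, 87.3),
        "semantic": (88.1, 50.3, 23.8, 51.3),
    },
}


def exact_to_semantic_change(row):
    """Absolute change (semantic minus exact) per metric, in percentage points."""
    return {
        metric: round(sem - ex, 1)
        for metric, ex, sem in zip(METRICS, row["exact"], row["semantic"])
    }


changes = {model: exact_to_semantic_change(row) for model, row in TABLE3.items()}
for model, delta in changes.items():
    print(model, delta)
```

Note that EER is an error rate, so a positive change in that column is a degradation rather than an improvement.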

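| Model | Unweighted P@N | Exact | Semantic |
| --- | --- | --- | --- |
| VisionSpeechCNN | 47.5 | 22.3 | 25.3 |
| VisionCNN | 44.7 | 19.2 | 25.5 |
| SupervisedBoWCNN | 50.0 | 38.4 | 11.7 |

Table 4: Unweighted semantic P@N can be broken down as $P@N = P@N_{\text{exact}} + P@N_{\text{sem}}$, thereby separating out the contributions from exact and exclusively semantic matches. Scores are given as percentages (%).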

Despite the benefit of visual training, SupervisedBoWCNN still performs better on semantic keyword spotting measured against hard labels, and next we analyse this effect. Any exact keyword match is also a correct semantic match. The improvement of 20.3% in P@10 for VisionSpeechCNN when moving from exact to semantic search is therefore due to additional semantic matches, compared to the 3.2% improvement of SupervisedBoWCNN. P@10, however, only considers the top 10 predictions, so it is a poor measure of recall. P@N captures recall, since N is the number of true occurrences of a keyword. However, P@N cannot be broken into separate components for semantic and exact matches, since it is averaged over keywords. We therefore define an unweighted P@N as $\sum_k n_k / \sum_k N_k$, where $N_k$ is the number of occurrences of keyword $k$ in the ground truth labels, and $n_k$ is the number of correct utterances in the top $N_k$ predictions of a model. Count $n_k$ can then be broken into exact and semantic matches, such that $n_k = n_k^{\text{exact}} + n_k^{\text{sem}}$ (here $n_k^{\text{sem}}$ considers only non-exact semantic matches). This metric is used in the analysis of Table 4. For reference, in the 1000 test utterances, there are 4253 ground truth labelled keywords of which 1798 are exact and 2455 are non-exact semantic matches.
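The decomposition described above can be sketched as follows. This is a minimal illustration that assumes the unweighted score pools counts over all keywords before dividing (rather than averaging per-keyword precisions); the function name, the data layout and the toy data are hypothetical.

```python
def unweighted_p_at_n(ranked, exact, semantic):
    """Pooled P@N with separate exact and non-exact semantic contributions.

    ranked:   keyword -> list of utterance ids, best-scoring first
    exact:    keyword -> set of utterance ids containing the keyword verbatim
    semantic: keyword -> set of utterance ids judged semantically relevant
    """
    total_occurrences = 0
    exact_hits = 0
    semantic_hits = 0
    for keyword, ranking in ranked.items():
        exact_k = exact.get(keyword, set())
        # Every exact match also counts as a semantic match.
        relevant_k = semantic.get(keyword, set()) | exact_k
        n_true = len(relevant_k)  # number of true occurrences of this keyword
        for utt in ranking[:n_true]:
            if utt in exact_k:
                exact_hits += 1
            elif utt in relevant_k:
                semantic_hits += 1  # non-exact semantic match only
        total_occurrences += n_true
    return {
        "total": (exact_hits + semantic_hits) / total_occurrences,
        "exact": exact_hits / total_occurrences,
        "semantic": semantic_hits / total_occurrences,
    }


# Toy example with two keywords and five utterances (hypothetical data).
ranked = {"bike": ["u1", "u3", "u5", "u2"], "ball": ["u4", "u1", "u5"]}
exact = {"bike": {"u1"}, "ball": {"u4", "u5"}}
semantic = {"bike": {"u1", "u2", "u3"}, "ball": {"u4", "u5"}}
result = unweighted_p_at_n(ranked, exact, semantic)
```

Because the two components share one denominator, they sum to the total, which is what allows the split into exact and exclusively semantic contributions.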

Table 4 shows that the unweighted P@N of SupervisedBoWCNN is dominated by correct exact predictions, with a contribution of only 11.7% from semantic matches. In contrast, VisionSpeechCNN makes more correct semantic (25.3%) than exact (22.3%) predictions. The visual model VisionCNN has the highest proportion of semantic matches by a small margin. The non-exact match results closely track the Spearman’s ρ results in Table 3 (last column).

For a qualitative comparison of VisionSpeechCNN and SupervisedBoWCNN, we passed a set of isolated segmented spoken words through both models. Figure 3 shows t-SNE embeddings [Van der Maaten and Hinton2008] of representations obtained from a 256-dimensional bottleneck layer (used for computational reasons) added between the 3000-dimensional ReLU layer and the final output of both models. Although words are well-separated in the case of the transcription-supervised model (a), the visually supervised representations are more successful in mapping semantically related spoken words to similar embeddings (b): related spoken words like ‘bike’, ‘rides’ and ‘riding’ have similar embeddings, as do ‘air’ and ‘jumps’, and ‘football’ and ‘ball’ (which is also closer to ‘soccer’).

What is the best we could do?

The cheating models TextWuP and TextParaphEmbed (rows 6 and 7, Table 3) represent the setting where we have access to a perfect ASR system. On the metrics that use hard ground truth labels, these two models perform best. Despite this, these models are outperformed by both VisionCNN and VisionSpeechCNN in terms of Spearman’s ρ. As noted above, the visually trained models are particularly strong in matching non-exact semantic keywords.

Are the text-based models an alternative to human judgements?

Given that the image-speech corpus has transcriptions, we may ask whether it is necessary to collect human annotations at all, or whether they can be generated automatically by a semantic model like our Text cheating models. Although the Text methods match the hard human labels much better than other approaches, they are far from perfect. Furthermore, when evaluated against the soft annotator counts, VisionCNN and VisionSpeechCNN actually do better, indicating that the Text models cannot replace human judgements.

7 Conclusion

We investigated how a model that learns from parallel images and unlabelled speech captures aspects of semantics in speech. We collected a new data set for a new task, semantic keyword spotting, where the aim is to retrieve utterances that are semantically relevant given a query keyword. Without seeing any parallel speech and text, the vision-speech model achieves a semantic P@10 of almost 60%. Although a model trained on transcriptions is superior on some metrics, the vision-speech model retrieves more than double the number of non-verbatim semantic matches and is a better predictor of the actual soft human ratings.

Since visual context seems to provide different information to transcriptions for semantic keyword spotting, future work could consider how these two supervision signals could be combined in cases where both are available. We would also like to explore whether our model could be used to localise the parts of the input signal giving rise to a particular (semantic) label, similarly to analyses in earlier studies that inspired this work [Palaz et al.2016, Harwath and Glass2017]. Finally, future work could consider whether text-based retrieval models could be improved enough to obtain semantic labels automatically when transcriptions are available.

We encourage interested readers to make use of our data set and associated task, which we make available at:


We would like to thank Shane Settle, John Wieting, and Michael Roth for helpful discussions and input.


  • [Agirre et al.2009] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proc. HLT-NAACL.
  • [Aimetti2009] G. Aimetti. 2009. Modelling early language acquisition skills: Towards a general statistical learning mechanism. In Proc. EACL.
  • [Anguera et al.2008] X. Anguera, J. Xu, and N. Oliver. 2008. Multimodal photo annotation and retrieval on a mobile phone. In Proc. ICMIR.
  • [Aytar et al.2016] Y. Aytar, C. Vondrick, and A. Torralba. 2016. SoundNet: Learning sound representations from unlabeled video. In Proc. NIPS.
  • [Bansal et al.2017] S. Bansal, H. Kamper, A. Lopez, and S. J. Goldwater. 2017. Towards speech-to-text translation without speech recognition. In Proc. EACL.
  • [Barnard et al.2003] K. Barnard, P. Duygulu, D. Forsyth, N. d. Freitas, D. M. Blei, and M. I. Jordan. 2003. Matching words and pictures. J. Mach. Learn. Res., 3:1107–1135.
  • [Bernardi et al.2016] R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res., 55:409–442.
  • [Besacier et al.2014] L. Besacier, E. Barnard, A. Karpov, and T. Schultz. 2014. Automatic speech recognition for under-resourced languages: A survey. Speech Commun., 56:85–100.
  • [Bruni et al.2012] E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran. 2012. Distributional semantics in technicolor. In Proc. ACL.
  • [Bruni et al.2014] E. Bruni, N.-K. Tran, and M. Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res., 49:1–47.
  • [Chen and Zitnick2015] X. Chen and C. L. Zitnick. 2015. Mind’s eye: A recurrent visual representation for image caption generation. In Proc. CVPR.
  • [Chen et al.2013] M. Chen, A. X. Zheng, and K. Q. Weinberger. 2013. Fast image tagging. In Proc. ICML.
  • [Chrupała et al.2017] G. Chrupała, L. Gelderloos, and A. Alishahi. 2017. Representations of language in a model of visually grounded speech signal. arXiv preprint arXiv:1702.01991.
  • [Clark2004] E. V. Clark. 2004. How language acquisition builds on cognitive development. Trends Cogn. Sci., 8(10):472–478.
  • [Cunillera et al.2010] T. Cunillera, E. Camara, M. Laine, and A. Rodriguez-Fornells. 2010. Speech segmentation is facilitated by visual cues. Q. J. Exp. Psychol., 63(2):260–274.
  • [Davis and Mermelstein1980] S. Davis and P. Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, Signal Process., 28(4):357–366.
  • [Deng et al.2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proc. CVPR.
  • [Diaz et al.2016] F. Diaz, B. Mitra, and N. Craswell. 2016. Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891.
  • [Donahue et al.2015] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proc. CVPR.
  • [Drexler and Glass2017] J. Drexler and J. Glass. 2017. Analysis of audio-visual features for unsupervised speech recognition. In Proc. GLU.
  • [Driesen and Van hamme2011] J. Driesen and H. Van hamme. 2011. Modelling vocabulary acquisition, adaptation and generalization in infants using adaptive Bayesian PLSA. Neurocomputing, 74(11):1874–1882.
  • [Duong et al.2016] L. Duong, A. Anastasopoulos, D. Chiang, S. Bird, and T. Cohn. 2016. An attentional model for speech translation without transcription. In Proc. HLT-NAACL.
  • [Fang et al.2015] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. 2015. From captions to visual concepts and back. In Proc. CVPR.
  • [Farhadi et al.2010] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proc. ECCV.
  • [Feng and Lapata2010] Y. Feng and M. Lapata. 2010. Visual information in semantic representation. In Proc. HLT-NAACL.
  • [Frank et al.2009] M. C. Frank, N. D. Goodman, and J. B. Tenenbaum. 2009. Using speakers’ referential intentions to model early cross-situational word learning. Psychol. Sci., 20(5):578–585.
  • [Garcia and Gish2006] A. Garcia and H. Gish. 2006. Keyword spotting of arbitrary words using minimal speech resources. In Proc. ICASSP.
  • [Gelderloos and Chrupała2016] L. Gelderloos and G. Chrupała. 2016. From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning. In Proc. COLING.
  • [Graupmann et al.2005] J. Graupmann, R. Schenkel, and G. Weikum. 2005. The SphereSearch engine for unified ranked retrieval of heterogeneous XML and web documents. In Proc. VLDB.
  • [Guillaumin et al.2009] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. 2009. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Proc. ICCV.
  • [Gupta et al.2017] A. Gupta, Y. Miao, L. Neves, and F. Metze. 2017. Visual features for context-aware speech recognition. In Proc. ICASSP.
  • [Harwath and Glass2015] D. Harwath and J. Glass. 2015. Deep multimodal semantic embeddings for speech and images. In Proc. ASRU.
  • [Harwath and Glass2017] D. Harwath and J. R. Glass. 2017. Learning word-like units from joint audio-visual analysis. In Proc. ACL.
  • [Harwath et al.2016] D. Harwath, A. Torralba, and J. R. Glass. 2016. Unsupervised learning of spoken language with visual context. In Proc. NIPS.
  • [Hazen et al.2007] T. J. Hazen, B. Sherry, and M. Adler. 2007. Speech-based annotation and retrieval of digital photographs. In Proc. Interspeech.
  • [Hazen et al.2009] T. J. Hazen, W. Shen, and C. White. 2009. Query-by-example spoken term detection using phonetic posteriorgram templates. In Proc. ASRU.
  • [Hill et al.2015] F. Hill, R. Reichart, and A. Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Comput. Linguist., 41(4).
  • [Hodosh et al.2013] M. Hodosh, P. Young, and J. Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res., 47:853–899.
  • [Kamper et al.2017] H. Kamper, S. Settle, G. Shakhnarovich, and K. Livescu. 2017. Visually grounded learning of keyword prediction from untranscribed speech. In Proc. Interspeech.
  • [Karpathy and Fei-Fei2015] A. Karpathy and L. Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proc. CVPR.
  • [Karpathy et al.2014] A. Karpathy, A. Joulin, and L. Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Proc. NIPS.
  • [Kingma and Ba2014] D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Krunic et al.2009] V. Krunic, G. Salvi, A. Bernardino, L. Montesano, and J. Santos-Victor. 2009. Affordance based word-to-meaning association. In Proc. ICRA.
  • [Kulkarni et al.2013] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2891–2903.
  • [Lin et al.2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common objects in context. In Proc. ECCV.
  • [Luo et al.2008] J. Luo, B. Caputo, A. Zweig, J.-H. Bach, and J. Anemüller. 2008. Object category detection using audio-visual cues. In Proc. ICVS.
  • [Miller1995] G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
  • [Palaz et al.2016] D. Palaz, G. Synnaeve, and R. Collobert. 2016. Jointly learning to locate and classify words using convolutional networks. In Proc. Interspeech.
  • [Räsänen and Rasilo2015] O. Räsänen and H. Rasilo. 2015. A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychol. Rev., 122(4):792–829.
  • [Roller and Schulte Im Walde2013] S. Roller and S. Schulte Im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proc. EMNLP.
  • [Roy and Pentland2002] D. K. Roy and A. P. Pentland. 2002. Learning words from sights and sounds: A computational model. Cognitive Sci., 26(1):113–146.
  • [Roy et al.2016] D. Roy, D. Paul, M. Mitra, and U. Garain. 2016. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608.
  • [Settle et al.2017] S. Settle, K. Levin, H. Kamper, and K. Livescu. 2017. Query-by-example search with discriminative neural acoustic word embeddings. In Proc. Interspeech.
  • [Silberer and Lapata2012] C. Silberer and M. Lapata. 2012. Grounded models of semantic representation. In Proc. EMNLP.
  • [Silberer and Lapata2014] C. Silberer and M. Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proc. ACL.
  • [Simonyan and Zisserman2014] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [Siskind1996] J. M. Siskind. 1996. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1):39–91.
  • [Socher and Fei-Fei2010] R. Socher and L. Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Proc. CVPR.
  • [Sun et al.2016] F. Sun, D. Harwath, and J. R. Glass. 2016. Look, listen, and decode: Multimodal speech recognition with images. In Proc. SLT.
  • [Synnaeve et al.2014a] G. Synnaeve, T. Schatz, and E. Dupoux. 2014a. Phonetics embedding learning with side information. In Proc. SLT.
  • [Synnaeve et al.2014b] G. Synnaeve, M. Versteegh, and E. Dupoux. 2014b. Learning words from images and speech. In NIPS Workshop Learn. Semantics.
  • [Szöke et al.2005] I. Szöke, P. Schwarz, P. Matejka, L. Burget, M. Karafiát, M. Fapso, and J. Cernockỳ. 2005. Comparison of keyword spotting approaches for informal continuous speech. In Proc. Interspeech.
  • [Taniguchi et al.2016] T. Taniguchi, T. Nagai, T. Nakamura, N. Iwahashi, T. Ogata, and H. Asoh. 2016. Symbol emergence in robotics: A survey. Adv. Robotics, 30(11-12):706–728.
  • [Ten Bosch and Cranen2007] L. Ten Bosch and B. Cranen. 2007. A computational model for unsupervised word discovery. In Proc. Interspeech.
  • [Thiessen2010] E. D. Thiessen. 2010. Effects of visual information on adults’ and infants’ auditory statistical learning. Cognitive Sci., 34(6):1093–1106.
  • [Van der Maaten and Hinton2008] L. Van der Maaten and G. Hinton. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res., 9(Nov):2579–2605.
  • [Vinyals et al.2015] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In Proc. CVPR.
  • [Weiss et al.2017] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Proc. Interspeech.
  • [Weston et al.2011] J. Weston, S. Bengio, and N. Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In Proc. IJCAI.
  • [Wieting et al.2015] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Trans. ACL, 3:345–358.
  • [Wieting et al.2016] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Proc. ICLR.
  • [Wilpon et al.1990] J. G. Wilpon, L. R. Rabiner, C.-H. Lee, and E. Goldman. 1990. Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Trans. Acoust. Speech Signal Process., 38(11):1870–1878.
  • [Wu and Palmer1994] Z. Wu and M. Palmer. 1994. Verbs semantics and lexical selection. In Proc. ACL.
  • [Xu and Croft1996] J. Xu and W. B. Croft. 1996. Query expansion using local and global document analysis. In Proc. SIGIR.
  • [Xu et al.2015] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proc. ICML.
  • [Yang et al.2011] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proc. EMNLP.
  • [Young et al.2014] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. ACL, 2:67–78.
  • [Yu and Ballard2004] C. Yu and D. H. Ballard. 2004. A multimodal learning interface for grounding spoken language in sensory perceptions. ACM T. Appl. Perception, 1(1):57–80.
  • [Yu and Smith2007] C. Yu and L. B. Smith. 2007. Rapid word learning under uncertainty via cross-situational statistics. Psychol. Sci., 18(5):414–420.
  • [Zhang and Glass2009] Y. Zhang and J. R. Glass. 2009. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proc. ASRU.