
Object Referring in Visual Scene with Spoken Language

Object referring has important applications, especially for human-machine interaction. While it has received great attention, the task is mainly tackled with written language (text) as input rather than spoken language (speech), which is more natural. This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach. Objects are annotated with their locations in images, text descriptions and speech descriptions. This makes the datasets ideal for multi-modality learning. The approach is developed by carefully breaking the ORSpoken problem down into three sub-problems and introducing task-specific vision-language interactions at the corresponding levels. Experiments show that our method outperforms competing methods consistently and significantly. The approach is also evaluated in the presence of audio noise, showing the efficacy of the proposed vision-language interaction methods in counteracting background noise.



1 Introduction

Recent years have witnessed great advances in subfields of Artificial Intelligence such as Computer Vision, Natural Language Processing and Speech Recognition. The success obtained in these individual fields motivates formulating and addressing more AI-complete research problems. This is evidenced by the recent trend towards a joint understanding of vision and language in tasks such as image/video captioning [42], visual question answering [1], and object referring [24]. This work addresses the task of object referring (OR).

OR is heavily used in human communication: the speaker issues a referring expression; the co-observers then identify the referred object and continue the dialog. Future AI machines, such as cognitive robots and autonomous cars, are expected to have the same capacities for effective human-machine interaction. OR has received increasing attention in recent years, with large-scale datasets compiled [17, 24, 45] and sophisticated learning approaches developed [24, 14, 28, 33, 43]. Despite this progress, OR has mainly been tackled with clean, carefully written text as input rather than with speech. This requirement largely hinders the deployment of OR in real applications such as assistive robots and automated cars, as we humans express ourselves more naturally in speech than in writing. This work addresses object referring with spoken language (ORSpoken) under constrained conditions, in the hope of inspiring research under more realistic settings.

This work makes two main contributions: (i) we compile two datasets for ORSpoken, one for assistive robots and the other for automated cars; and (ii) we develop a novel approach for ORSpoken, by carefully breaking the problem down into three sub-problems and introducing goal-directed interactions between vision and language therein. All data are manually annotated with the locations of objects in images, and with text and speech descriptions of the objects. This makes the datasets ideal for research on learning with multiple modalities. As for the approach, ORSpoken is decomposed into three sub-tasks: Speech Recognition (SR) to transcribe speech to text, Object Proposal to propose candidates for the referred object, and Instance Detection to identify the referred target among all proposed candidates. This architecture is chosen mainly to: 1) make better use of the existing resources in these ‘subfields’; and 2) better exploit multi-modality information with the simplified learning goals. The pipeline of our approach is sketched in Figure 1.


Figure 1: The illustration of our approach for Object Referring with Spoken Language. Given an image and the speech expression, our approach first transcribes the speech via its Visually Grounded Speech Recognition module. It then proposes class-specific candidates by its Language-Grounded Object Proposal module. Finally, the method identifies the referred object via its Language-Grounded Instance Detection module. The inputs of our method are marked in blue and the outputs in red.

More specifically, we ground SR to the visual context for accurate speech transcription, as speech content and image content are often correlated. For instance, the transcription ‘window …’ is more sensible than ‘widow …’ in the visual context of a living room. For Object Proposal, we ground it to the transcribed text so that candidates are mainly proposed from the referred object class. For instance, for the expression ‘the big window in the middle’ in Figure 1, it proposes windows instead of objects of other classes such as sofa. Lastly, for Instance Detection, an elaborate model is learned to identify the referred object among the proposed candidates, for example the middle window out of the four proposed windows in Figure 1.

We evaluate the method on the two compiled datasets. Experiments show that the proposed method outperforms competing methods consistently for both expression transcription and for object localization. This work also studies the performance of the method in the presence of different levels of audio noise. The introduced modules for language-vision interaction are found especially useful for large noise. These interaction components are general and can be applied to other relevant tasks as well, such as speech-based visual question answering, and speech-based image captioning.

2 Related Work

Language-based Object Detection. Language-based object detection (LOD) has been tackled under different names in Computer Vision. Notable ones are referring expressions [45, 24], phrase localization [33, 43], grounding of textual phrases [35], language-based object retrieval [14] and segmentation [13]. Recent research on LOD can be put into two groups: 1) learning embedding functions [9, 16] for effective interaction between vision and language; 2) modeling contextual information to better understand a speaker’s intent, be it global context [24, 14] or local context among ‘similar’ objects [28, 45, 24]. Our work uses natural speech rather than clean text as the input.

Similar work has been conducted in the Robotics community. For instance, a joint model is learned from gestures, language, and visual scenes for human-robot interaction in [25]; different network architectures have been investigated for interpreting contextually grounded natural language commands [3]; and the appropriateness of natural language dialogue with assistive robots is examined in [19]. While sharing similarities with these, our work differs in that our method works with wild, natural scenes instead of rather controlled lab environments.

Joint Speech-Visual Analysis. Our work also shares similarity with visually grounded speech understanding. Notable examples include visually grounded instruction understanding [18, 27], audio-visual speech recognition [30], and ‘unsupervised’ speech understanding via a joint embedding with images [11, 5]. The first line of research investigates how visual cues can help ground high-level commands to low-level actions. The second uses image processing capabilities in lip reading to aid speech recognition systems. The last aims to reduce the amount of supervision needed for speech understanding by projecting spoken utterances and images to a joint semantic space. Our work exploits the visual context/environment for better instruction recognition.

Speech-based User Interface. Speech is a primary means of human communication. Speech-based User Interfaces have received great attention in recent years. This is evidenced by the surge of commercial products such as Apple’s Siri, Microsoft’s Cortana, Amazon’s Echo, and Google Now, and by a large body of academic publications [6, 22, 29, 26, 15, 44]. Speaking has been shown to be faster than typing on mobile devices for exchanging information [36]. There are academic works [39, 20, 4, 7] which use speech for image description and annotation. The main merit is that speech is very natural and capable of conveying rich content, lively emotion, and human intelligence, all in a hands-free and eyes-free manner. Our work enhances Speech-based User Interfaces with better object localization capability.

3 Approach

Given an image and a speech expression for the target object, the goal of our method is 1) to transcribe the speech expression into text and 2) to localize the referred object in the image with an axis-aligned bounding box. As described, our method consists of three components: a) Visually Grounded Speech Recognition (VGSR) (Section 3.1) for transcribing speech to text, b) Language-Grounded Object Proposal (LOP) (Section 3.2) to propose object candidates, and c) Language-Grounded Instance Detection (LID) (Section 3.3) for localizing the specific object instance. See Figure 1 for an illustration.

Figure 2: The pipeline of Visually Grounded Speech Recognition (VGSR). Given an image and a speech input, we output the transcribed text for the speech using the image as context. We use the Google API to yield a diverse set of transcribed texts. A ranking model then scores each text given the contextual features from the image, and the highest-ranked one is chosen.

3.1 Visually Grounded Speech Recognition (VGSR)

Given human-generated speech and its context in the form of an image, the main objective of this part is to output the transcribed text for the given speech. To better show insights, we keep the method simple in this work by adopting a holistic ranking approach. In particular, given an image and a speech input referring to an object, a list of alternative transcriptions is generated by a speech recognition (SR) engine. We then learn a joint vision-language model to choose the most sensible one based on the visual context.

For a given speech input, the SR engine yields a diverse set of transcribed texts by Diverse Beam Search, as shown in Figure 2. Here, speech recognition results are proposed agnostic to the visual context. Nevertheless, we have the prior information that the speech refers to an object in the image. Hence, we learn a similarity model to rank the speech recognition results conditioned on the image. The model is learned to score the set of transcribed alternatives using a regression loss. The inputs of the model are the embedded representations of image and text.

We use the Google Cloud Speech API service for transcribing speech to text. We also tried the Kaldi framework, but it generally gives worse results. Image features are extracted with the CNN of the VGG network [38], while textual features come from a two-layer LSTM with 300 hidden nodes per layer. The image and textual features are then combined by element-wise multiplication and passed through a fully connected layer before the final layer, a single-node regression layer that outputs the score for each speech recognition alternative.

For training, we use the alternative transcriptions of all the synthetic speech files of the GoogleRef train set. As candidate objective scores, we consider BLEU [31], ROUGE [23], METEOR [2] and CIDEr [41], which are well-known evaluation metrics in the NLP domain. In Figure 3, we plot the histograms of the metric scores after mapping them to a common range. The Google speech alternatives differ only slightly from the ground-truth expressions, so we experiment to determine which metric captures these subtle differences. In Figure 3, the less skewed the score distribution, the more sensitive the metric is to the subtle differences. CIDEr seems to have a low skew. Consequently, we use CIDEr scores for training the model, because they are spread more evenly over the entire range of scores than those of the other three metrics, as shown in Figure 3. At inference, for a given image and set of alternatives, we use the output regression scores to rank the alternatives. This constitutes the VGSR.
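The scoring and re-ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vectors stand in for the learned VGG and LSTM embeddings, the weights `w_hidden` and `w_out` and the function names `fusion_score` and `rerank` are hypothetical, and the ReLU between the two layers is an assumption the text does not confirm.

```python
from typing import List, Sequence


def fusion_score(img_feat: Sequence[float],
                 txt_feat: Sequence[float],
                 w_hidden: List[List[float]],
                 w_out: Sequence[float]) -> float:
    """Score one (image, transcription) pair.

    Element-wise product of the two embeddings, one fully connected
    layer (the ReLU is an assumption), then a single regression node.
    """
    fused = [i * t for i, t in zip(img_feat, txt_feat)]
    hidden = [max(0.0, sum(w * f for w, f in zip(row, fused)))
              for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))


def rerank(alternatives: List[str], scores: List[float]) -> List[str]:
    """Sort transcription alternatives by predicted score, best first."""
    order = sorted(range(len(alternatives)),
                   key=lambda k: scores[k], reverse=True)
    return [alternatives[k] for k in order]


# Toy example with made-up 2-d features and weights (not learned values).
img = [1.0, 0.5]
texts = {
    "the big window in the middle": [0.9, 0.8],
    "the big widow in the middle": [0.2, 0.1],
}
w_hidden = [[1.0, 1.0], [0.5, -0.5]]
w_out = [1.0, 1.0]

alts = list(texts)
scores = [fusion_score(img, texts[a], w_hidden, w_out) for a in alts]
best = rerank(alts, scores)[0]
```

In the actual system the scores would come from the trained regression network, and `best` would be the transcription passed on to the later stages.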

Figure 3: Frequency distributions of scores evaluated by comparing the text alternatives by ASR for the recorded speech (with 10% added noise) to ground-truth expressions.

3.2 Language-Grounded Object Proposal (LOP)

Given an image and a text expression, the aim of Language Grounded Object Proposal (LOP) is to propose a set of object candidates that belong to the referred class.

There are numerous object proposal techniques in the literature. For instance, [14] uses EdgeBox [47] for the object proposals; [24] and [16] use the faster RCNN (FRCNN) object detector [34] and recently Mask-RCNN [12] to propose the candidates. However, a direct use of the object proposal methods or object detectors is far from being optimal. General Object Proposal and Detection methods such as EdgeBox and FRCNN are expression agnostic. Because of this, candidates can be proposed from any object class, including those which are totally irrelevant. This leads to an inefficient use of the candidate budget.

We develop a method for expression-based candidate proposals, with the aim of filtering out candidates from irrelevant classes at this stage. We train an LSTM-based model to associate language expressions with the relevant object class. In particular, we train a two-layer LSTM with the same number of hidden nodes in each layer. The input of the model is the embedded representation of the expressions from the pre-trained word2vec model [32]. The output layer is a classification layer that assigns expressions to the classes of the referred objects. The model lets us rank proposals not only by detection scores, but also by the relevance scores of their classes to the expressions.

We first evaluate the accuracy of the class association by our LSTM model. On the GoogleRef dataset, it achieves an accuracy of 90.5% on the validation set. The high accuracy of the class association implies that LOP is promising. Due to this high accuracy, it is reasonable to take the most confident one as the relevant class and the rest as irrelevant ones. Given all the detection candidates from the FRCNN detector, we use our LSTM association model to filter out all the detections from the irrelevant classes, and only pass those from the relevant class to the subsequent component for instance identification.
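The filtering rule in the paragraph above can be sketched as below. The `Detection` type, the function name `language_grounded_proposals`, and the example boxes and probabilities are hypothetical; in the described system the class probabilities would come from the trained LSTM classifier and the detections from FRCNN.

```python
from typing import Dict, List, NamedTuple, Tuple


class Detection(NamedTuple):
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)
    label: str                      # detector class name
    score: float                    # detector confidence


def language_grounded_proposals(detections: List[Detection],
                                class_probs: Dict[str, float],
                                budget: int) -> List[Detection]:
    """Keep detections of the most confident class for the expression.

    The most probable class is treated as relevant and all others as
    irrelevant; survivors are ranked by detector score and cut to `budget`.
    """
    relevant = max(class_probs, key=class_probs.get)
    kept = [d for d in detections if d.label == relevant]
    kept.sort(key=lambda d: d.score, reverse=True)
    return kept[:budget]


# Toy example: an expression about a window in a living-room image.
detections = [
    Detection((0, 0, 10, 10), "window", 0.90),
    Detection((20, 0, 30, 10), "sofa", 0.95),
    Detection((40, 0, 50, 10), "window", 0.60),
    Detection((60, 0, 70, 10), "window", 0.80),
]
class_probs = {"window": 0.9, "sofa": 0.1}
proposals = language_grounded_proposals(detections, class_probs, budget=2)
```

Note that the high-scoring sofa is dropped even though its detector confidence is the largest, which is the intended effect of spending the candidate budget only on the referred class.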

3.3 Language-Grounded Instance Detection (LID)

Given an image, a referring expression and a set of bounding-box object proposals in the image, the aim of Language-Grounded Instance Detection (LID) is to identify the exact instance referred to by the expression.

We employ the generative approach developed by Hu et al. [14] for this task. Here, the model learns a scoring function that takes as input features from the object candidate regions, their spatial configurations, and the whole image as global context, along with the given referring expression, and outputs scores for all candidate regions. The model chooses the candidate that maximizes the generative probability of the expression for the target object. The model architecture is the same as that of [14]. We extract fc7 features of the VGG-16 net [38] from the bounding-box location for object features and from the whole image for the global contextual feature. Textual features are extracted using an embedding layer, which is also learned in an end-to-end fashion. For Object Referring with Spoken Language (ORSpoken), we use the transcribed text results from VGSR as the input referring expressions for LID (cf. Figure 1).

4 Dataset

Creating large datasets for Object Referring with Spoken Language (ORSpoken) is very labor intensive, given that one needs to annotate the object location, write a language description and record a speech description. A few choices are made in this work to scale the data collection: 1) following previous work on language-based OR [24, 14], we use a simple bounding box to indicate the location of an object; and 2) following the success of synthetic data in many other vision tasks [8, 10], we construct hybrid speech recordings which contain realistic-sounding synthetic speech for training and real human speech for evaluation. Note that the real speech is still recorded in constrained environments, as explained below. Two datasets are used in this work, one focusing on scenarios for assistive robots and the other on automated cars.

4.1 GoogleRef

Object Annotation. As the first testbed, we choose to enrich the standard object referring dataset GoogleRef [24] with speech. GoogleRef provides images with referring expressions, split into a training set and a testing set.

Synthetic Speech. Amazon Polly (with the voice of Joanna) is used to generate the corresponding speech files for all referring descriptions. This TTS system produces high-quality, realistic-sounding speech. It is nevertheless still simpler than real human speech due to the lack of tempo variation and ambient noise. We have thus recorded real speech for the testing set of GoogleRef. All speech files are in wav format, at 16 kHz.

Table 1: Statistics of the two datasets (columns: Dataset, #Images, #Objects, Synthetic Speech, Real Speech, On-site Noise; rows: GoogleRef, DrivingRef).

Real Speech. We recorded real speech for all the referring expressions of GoogleRef by crowd-sourcing via Amazon Mechanical Turk (AMT). In the AMT web interface, we show five text expressions for each Human Intelligence Task (HIT), and workers are instructed to speak out and record each of the expressions separately before submitting the HIT. Workers are asked to speak as if they were talking to a robot. After each recording, we display the waveform of the recorded speech and an audio playback tool so that workers can listen to their recordings. We perform automatic validation by checking the length and volume of the recorded speech to reject erroneous recordings due to inappropriate use of microphones: a threshold on the amplitude of the audio signal ensures that the recording attains a certain volume level. Our automatic filter accepts the HIT only when all five recordings are sufficiently long and their amplitude reaches the threshold. To further improve the quality of the recordings, we also ran an initial round of recordings to qualify workers; only the qualified workers are admitted to our final recordings.
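A check of this kind can be sketched as follows. The thresholds `MIN_SECONDS` and `MIN_PEAK`, the function names, and the use of peak amplitude as the volume measure are assumptions made for illustration; the text does not state the actual values or the exact volume statistic.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000   # Hz, matching the 16 kHz wav files
MIN_SECONDS = 1.0      # hypothetical minimum duration
MIN_PEAK = 0.1         # hypothetical minimum peak amplitude, samples in [-1, 1]
RECORDINGS_PER_HIT = 5


def recording_ok(samples: Sequence[float],
                 sample_rate: int = SAMPLE_RATE,
                 min_seconds: float = MIN_SECONDS,
                 min_peak: float = MIN_PEAK) -> bool:
    """Accept a recording that is long enough and loud enough."""
    duration = len(samples) / sample_rate
    peak = max((abs(s) for s in samples), default=0.0)
    return duration >= min_seconds and peak >= min_peak


def hit_accepted(recordings: List[Sequence[float]]) -> bool:
    """Accept a HIT only if all five recordings pass the check."""
    return (len(recordings) == RECORDINGS_PER_HIT
            and all(recording_ok(r) for r in recordings))


# Toy signals: one valid, one too short, one too quiet.
loud = [0.5] * SAMPLE_RATE
short = [0.5] * 100
quiet = [0.01] * SAMPLE_RATE
```

A silent or clipped-off recording from a misconfigured microphone fails one of the two conditions, so the whole HIT is rejected and can be re-recorded.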

Noisy Speech. To further study the performance of the method, we add different levels of noise to the speech files. The noise we mix with the original speech files is selected randomly from the Urban8K dataset [37]. This dataset contains 10 categories of ambient noise: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. For each original audio file, a random noise file is selected and combined with it to produce a noisy audio expression (Eq. 1), with the two signals weighted according to the chosen noise level. The noise audio files are resampled to 16 kHz to match the sampling rate of the speech audio files. Both files are normalized before being combined so that their contributions are strictly proportional to the chosen noise level. We study 3 noise levels: 0%, 10%, and 30%.
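As an illustration of such mixing, the sketch below blends a speech signal with a noise signal at a given noise level. The exact combination formula is not reproduced here, so the convex combination `(1 - level) * speech + level * noise`, the peak normalization, and the looping of short noise clips are all assumptions, and `peak_normalize` and `mix_noise` are hypothetical names.

```python
from typing import List, Sequence


def peak_normalize(signal: Sequence[float]) -> List[float]:
    """Scale a signal so that its largest absolute sample is 1."""
    peak = max((abs(s) for s in signal), default=0.0)
    if peak == 0.0:
        return [float(s) for s in signal]
    return [s / peak for s in signal]


def mix_noise(speech: Sequence[float],
              noise: Sequence[float],
              level: float) -> List[float]:
    """Blend normalized speech and noise; `level` is the noise share in [0, 1].

    Assumed form: (1 - level) * speech + level * noise.
    The noise clip is looped if it is shorter than the speech.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be in [0, 1]")
    if not noise:
        raise ValueError("noise must not be empty")
    speech_n = peak_normalize(speech)
    noise_n = peak_normalize(noise)
    return [(1.0 - level) * s + level * noise_n[k % len(noise_n)]
            for k, s in enumerate(speech_n)]


# Toy signals; real inputs would be 16 kHz sample arrays.
speech = [0.0, 0.5, -1.0]
noise = [2.0, -2.0]
clean = mix_noise(speech, noise, 0.0)
noisy_30 = mix_noise(speech, noise, 0.3)
```

With this form, a level of 0% returns the normalized speech unchanged, and 10% or 30% give the noise a proportionally larger share of the mixed signal.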

Figure 4: Driving route of our DrivingRef, along with two exemplar visual scenes and the corresponding in-vehicle noise characteristics. Panels: (a) driving route, (b) highway, (c) tunnel.

4.2 DrivingRef Dataset

Object Annotation. In order to have more realistic noise and to evaluate our method in human-vehicle communication, we have created a new dataset, DrivingRef, for driving scenarios. We drove a 90-minute route covering highways, residential areas, busy city streets and tunnels. Driving scenes were recorded by a camera mounted on the windshield of the car, and the in-vehicle noise was recorded by a microphone mounted on the dashboard. Figure 4 shows the driving route, along with samples of the visual scenes and the corresponding in-vehicle noise. We sampled frames uniformly from the entire length of the recorded video. For annotation, we crowd-sourced to AMT the task of annotating bounding boxes for objects in the sampled frames. In the AMT interface, we show the frame, and workers are asked to draw tight bounding boxes around the objects they would like to refer to. For each bounding box annotated, workers are asked to type a truthful, informative, relevant, and brief expression so that co-observers can find the referred objects easily and unambiguously. For each image, we ask them to annotate four objects along with their referring expressions. In total, we have annotated 1000 objects in 250 diverse images chosen out of the 10,000 recorded images.

Synthetic & Real Speech. We recorded the text descriptions the same way as described in Section 4.1.

Noisy Speech. In-vehicle noise characteristics depend on many factors such as the driving environment (e.g. highway, tunnel), the driving state (e.g. speed, acceleration), and in-vehicle sound sources (e.g. background music). During the course of data collection for DrivingRef, we also recorded the noise inside the vehicle. We added this recorded real in-vehicle noise to the speech according to Eq. 1. Timestamps of the images and of the recorded noise audio are used to obtain the in-place noise clips for the corresponding images (visual scenes). In total, we have qualified 23 workers for the recording task. The statistics of the two datasets can be found in Table 1.
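Selecting the in-place noise clip for a frame from timestamps can be sketched as follows. The function name `noise_clip_for_frame`, the choice to start the clip at the frame time (rather than centering it), and the parameters are assumptions for illustration only.

```python
from typing import List, Sequence


def noise_clip_for_frame(noise: Sequence[float],
                         frame_time_s: float,
                         clip_seconds: float,
                         audio_start_s: float = 0.0,
                         sample_rate: int = 16_000) -> List[float]:
    """Cut the noise recorded right after a frame was captured.

    `frame_time_s` and `audio_start_s` share one clock; the clip starts
    at the frame time (an assumption) and is truncated at the end of audio.
    """
    start = int(round((frame_time_s - audio_start_s) * sample_rate))
    start = max(0, start)
    end = min(len(noise), start + int(round(clip_seconds * sample_rate)))
    return list(noise[start:end])


# Toy example at 10 samples per second so indices are easy to follow.
recorded = list(range(100))
clip = noise_clip_for_frame(recorded, frame_time_s=2.0, clip_seconds=1.5,
                            sample_rate=10)
```

The clip obtained this way reflects the driving conditions visible in the frame (for example tunnel versus highway), which is the point of aligning the two recordings by timestamp.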



Speech     Noise  Method      METEOR  ROUGE  CIDEr  BLEU1
Synthetic  0%     Google API  0.570   0.880  7.416  0.890
Synthetic  0%     VGSR        0.591   0.906  7.806  0.910
Synthetic  10%    Google API  0.536   0.838  6.901  0.846
Synthetic  10%    VGSR        0.544   0.854  7.048  0.861
Synthetic  30%    Google API  0.437   0.683  5.410  0.679
Synthetic  30%    VGSR        0.455   0.714  5.720  0.709
Real       0%     Google API  0.391   0.661  4.700  0.662
Real       0%     VGSR        0.409   0.693  5.058  0.689
Real       10%    Google API  0.364   0.619  4.255  0.618
Real       10%    VGSR        0.386   0.654  4.689  0.651
Real       30%    Google API  0.286   0.486  3.162  0.464
Real       30%    VGSR        0.305   0.521  3.526  0.495
Table 2: Comparison of speech recognition results with ground truth text using different standard evaluation metrics, evaluated on GoogleRef dataset.


Speech     Noise  Method      METEOR  ROUGE  CIDEr  BLEU1
Synthetic  0%     Google API  0.548   0.869  6.922  0.868
Synthetic  0%     VGSR        0.577   0.899  7.735  0.899
Synthetic  10%    Google API  0.534   0.855  6.63   0.857
Synthetic  10%    VGSR        0.542   0.868  7.114  0.871
Synthetic  30%    Google API  0.499   0.816  6.149  0.820
Synthetic  30%    VGSR        0.517   0.837  6.703  0.837
Real       0%     Google API  0.372   0.688  4.331  0.674
Real       0%     VGSR        0.392   0.716  4.746  0.697
Real       10%    Google API  0.360   0.670  4.175  0.655
Real       10%    VGSR        0.380   0.705  4.648  0.682
Real       30%    Google API  0.290   0.549  3.159  0.516
Real       30%    VGSR        0.303   0.574  3.449  0.536
Table 3: Comparison of speech recognition results with ground truth text using different standard evaluation metrics, evaluated on DrivingRef dataset.

Figure 5: Comparison of Ours(VGSR) and Ours(VGSR+LOP) with the baseline method on a) GoogleRef and b) DrivingRef. We show the accuracy of object localization for the LID task from (i) text and (ii) transcribed text from synthetic and real speech at different noise levels. Ours(VGSR) adds VGSR to the baseline method, and Ours(VGSR+LOP) additionally includes LOP.

5 Experiments

We first evaluate the overall performance of our approach, and then conduct detailed evaluation for the two components Visually Grounded Speech Recognition (VGSR) and Language Grounded Object Proposal (LOP).

5.1 Evaluation Metric

For the final performance of our method for Object Referring with Spoken Language (ORSpoken), we use Acc@1, following [14]. Acc@1 refers to the percentage of top-scoring candidates (rank-1) that are true detections. A candidate is considered a true detection if the IoU between the predicted bounding box and the ground-truth box is more than 0.5. We evaluate the performance of VGSR with standard criteria: METEOR [2], ROUGE [23], BLEU [31] and CIDEr [41]. The results of VGSR are compared against the ground-truth referring expressions written by humans, from which we record/generate the speech files. For LOP, we use Intersection over Union (IoU) and recall to evaluate the performance.
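The localization metric described above can be sketched as follows: IoU between axis-aligned boxes, and the fraction of rank-1 predictions whose IoU with the ground truth exceeds 0.5. The function names `iou` and `acc_at_1` and the `(x1, y1, x2, y2)` box format are choices made for this sketch.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), x2 > x1, y2 > y1


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_1(top_predictions: Sequence[Box],
             ground_truths: Sequence[Box],
             threshold: float = 0.5) -> float:
    """Fraction of rank-1 predictions with IoU above `threshold`."""
    if not ground_truths:
        raise ValueError("need at least one ground-truth box")
    hits = sum(iou(p, g) > threshold
               for p, g in zip(top_predictions, ground_truths))
    return hits / len(ground_truths)


# Toy example: one exact match and one prediction shifted by half a box.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
gts = [(0, 0, 2, 2), (1, 0, 3, 2)]
score = acc_at_1(preds, gts)
```

The second prediction overlaps its target with IoU 1/3, below the 0.5 threshold, so only one of the two predictions counts as a true detection.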

GoogleRef (top row):
Speech: laptop the bed man is using. GT: laptop the bald man is using
Speech: green t-shirt talking to two other men. GT: a man in green t shirt talking to two other men
Speech: the silver vehicle next to a green truck. GT: a silver vehicle next to a green truck
Speech: keyboard on the laptop. GT: keyboard on the laptop

DrivingRef (bottom row):
Speech: red car moves ahead on. GT: red car moves ahead on left
Speech: silver car parked close to us. GT: silver car parked close to us on our right
Speech: blue signboard on the top left most. GT: blue signboard on the top left most side
Speech: lane. GT: gray car just moves past the traffic signal on the left lane

Figure 6: Qualitative results on GoogleRef (top) and DrivingRef (bottom) with real speech. The transcribed speech and the corresponding predicted box are shown in red; the ground-truth text and the corresponding bounding box are shown in green.

5.2 Object Referring with Spoken Language

Since there is no previous work to compare to, we compare with a baseline method. For the baseline model, we use the Google API to transcribe the speech data to text. We then use the object referring model of [14] to select among the object candidates proposed by the FRCNN detector. For our approach, as described in Section 3, we use the VGSR model to transcribe the speech data to text. Following that, we use the same object referring model of [14], but we propose the object candidates using our method LOP (as in Figure 9). We also use text-based object referring as a reference for both the baseline method and our method. That is, clean text is given as input to the two approaches without employing speech recognition. The evaluation is conducted for both synthetic speech and real speech, and at different noise levels.

We illustrate the results of ORSpoken in Figure 5a for the GoogleRef dataset and in Figure 5b for the DrivingRef dataset. We observe that detection from text remains an upper bound for the object localization accuracy, since all other cases use text transcribed from either synthetic or real speech of the same text. From Figure 5, we observe that Ours(VGSR) and Ours(VGSR+LOP) clearly outperform the baseline model in all cases on both the GoogleRef and DrivingRef datasets. Ours(VGSR+LOP) performs better than the baseline model in all the cases of synthetic and real speech with 0%, 10% and 30% noise. For instance, Case 1: Acc@1 improves by 1.526% for 10 proposals from baseline to Ours for the synthetic speech without 30% noise in Figure 5a; Case 2: Acc@1 improves by 3.925% for 10 proposals from baseline to Ours for real speech data with 30% noise. Similar observations can be made for the DrivingRef dataset in Figure 5b. From Figure 5, we can infer that each component of our approach (VGSR and LOP) contributes to the improvement of the overall performance on the ORSpoken task. The gain is larger for real speech than for synthetic speech, which suggests that our method is suitable for the real world.

We show qualitative results of our approach on the GoogleRef and DrivingRef datasets in Figure 6. The object detection results are represented as bounding boxes, along with the corresponding transcribed text and ground-truth text. Failure cases are shown in the last column. For GoogleRef, the object detection model (LID) fails, as the predicted box is on a different ‘keyboard’, while in the second row (DrivingRef) the speech recognition itself fails to produce an appropriate transcription.

Apart from our pipelined approach as described in Section 3, we also made an attempt towards an end-to-end solution to the task of ORSpoken. We tried to learn directly from raw speech and image, avoiding the intermediate stage of speech recognition, similar to the ‘end-to-end’ approach proposed in [46] for speech-based visual question answering. We extract the same set of visual features as in our pipelined approach. For the speech part, we pass the raw speech signal through a series of 1D convolutional layers and an LSTM layer that outputs the vector representation of the speech. We observe that this straightforward solution does not perform better than our pipelined method. We attribute this to the lack of a large dataset for learning an end-to-end model for the complex task of ORSpoken. Our pipelined approach has less complex sub-tasks (VGSR and LOP) and is able to leverage the available resources for speech recognition and object recognition.

5.3 Visually Grounded Speech Recognition (VGSR)

We conducted experiments with VGSR on the GoogleRef and DrivingRef datasets. GoogleRef comprises referring expressions for images from the MSCOCO dataset, with separate sets for training and testing. For DrivingRef, we split the referring expressions 50-50 for training and testing. As described in Section 4, we generate synthetic and real speech data from the referring expressions and add 10% and 30% noise to the speech data for various experiments. The speech data are transcribed back to text expressions using the Google API. For each speech input, we receive 5 alternatives from the API (in some cases fewer; we find that this happens when the speech recognition engine is very confident in the answers provided). We evaluate these transcribed text results of Speech Recognition (SR) using standard metrics including METEOR, ROUGE, CIDEr and BLEU1, as shown in Table 2. Google API in Table 2 denotes the results of the rank-1 transcribed text among the alternatives obtained. Comparing the rank-1 SR results from the Google API under multiple noise levels, we notice that the performance of SR decreases as the noise level increases, as shown in Table 2. We also see that SR performs better on synthetic speech data than on real speech data. This is due to the complexity of real speech, which involves ambient noise, tempo variations, emotions, and accents.

With VGSR, we re-rank the set of text alternatives obtained from the SR API, conditioned on the visual context. In Figure 7, we compare the SR performance of the rank-1, rank-2 and rank-3 transcribed text results from the Google API with the performance of VGSR. This experiment is conducted on GoogleRef real speech data with a 30% noise level. We observe that Ours(VGSR) outperforms all the rank-based selections from the API. For a detailed analysis, we conducted experiments with VGSR over synthetic and real speech data and over multiple noise levels, on the GoogleRef dataset (Table 2) and the DrivingRef dataset (Table 3). VGSR performs better than the Google API in all cases on both datasets. For the training of the VGSR regression network, we use ground-truth labels calculated from the CIDEr metric as described in Section 3.1. Thus, our experiments show that VGSR performs better than SR without the context of the visual input. We believe that in the future more complex models can be learned to jointly optimize over speech and image context for even better performance.

Figure 7: Comparison of Ours (VGSR) to other baselines on the GoogleRef dataset; random selection (Random) and the rank-1, rank-2 and rank-3 results provided by the Google SR API are considered. CIDEr is used for the evaluation; larger is better.

5.4 Language-grounded Object Proposal (LOP)

The task of LOP is to propose a set of bounding box locations of the object candidates in the image, given the image and a referring text expression of an object.

As described in Section 3.2, given the text description and the image, LOP uses FRCNN detections to create the first set of proposals. FRCNN is trained for the object detection task on the MSCOCO dataset for GoogleRef and on the Cityscapes dataset for DrivingRef. These choices are made because object classes are roughly shared between MSCOCO and GoogleRef, and between Cityscapes and DrivingRef. LOP then filters the detections to keep only candidates of the relevant class, as predicted by a specifically trained LSTM model, producing class-specific object proposals.
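
The filtering step can be illustrated with a minimal sketch. It assumes that the detector output is available as labelled, scored boxes and that the class predicted from the expression is already given as a string; the `Detection` type and the function `language_filtered_proposals` are hypothetical names, and the LSTM classifier itself is not reproduced.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Detection:
    """One detector output: box as (x1, y1, x2, y2), class label, confidence."""
    box: Tuple[float, float, float, float]
    label: str
    score: float


def language_filtered_proposals(detections: Sequence[Detection],
                                predicted_class: str,
                                budget: int) -> List[Detection]:
    """Keep detections of the class predicted from the expression.

    `predicted_class` stands in for the output of the language model;
    the survivors are ranked by detector confidence and cut to `budget`.
    """
    in_class = [d for d in detections if d.label == predicted_class]
    in_class.sort(key=lambda d: d.score, reverse=True)
    return in_class[:budget]


if __name__ == "__main__":
    detections = [
        Detection((10, 10, 60, 120), "person", 0.60),
        Detection((80, 40, 200, 110), "car", 0.80),
        Detection((150, 20, 190, 130), "person", 0.90),
    ]
    # Expression such as "the woman on the right" -> class "person" (assumed).
    for d in language_filtered_proposals(detections, "person", budget=5):
        print(d)
```

Because out-of-class boxes are removed before ranking, a fixed candidate budget is spent only on objects that the expression could plausibly refer to.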

Figure 8: Comparison of object proposal techniques: a) EdgeBox, b) FRCNN and c) LOP (ours). Top: Recall vs IoU overlap ratio; Bottom: Recall vs Number of Proposals. The evaluation is performed on the GoogleRef validation set, which consists of 9536 objects.
Figure 9: LID results (Acc@1) on the GoogleRef dataset over the number of object proposals. The comparison is made when different object proposal techniques are used: a) EdgeBox, b) FRCNN and c) LOP (ours).

Following the object proposal literature [47, 40], we evaluate LOP by the average recall of target objects under a fixed IoU criterion, for multiple candidate budgets. Figure 8 shows the results of LOP on the GoogleRef dataset, with a comparison to EdgeBox and FRCNN, which are agnostic to referring expressions. From the left column of Figure 8, we observe that the recall of LOP over all IoUs deteriorates less than that of the other methods when we compare the plots of number of proposals being and . In the right column, we see that the recall of LOP is high at both IoUs of and over the number of candidate proposals. The figure shows that LOP outperforms EdgeBox and FRCNN consistently.
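
As an illustration of this evaluation protocol, the sketch below computes recall at a fixed IoU threshold under a candidate budget: a target object counts as recalled if any of its top-ranked proposals overlaps it sufficiently. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names and the example threshold and budget values are placeholders, not the settings used for Figure 8.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(targets: Sequence[Box],
                  proposals: Sequence[List[Box]],
                  threshold: float,
                  budget: int) -> float:
    """Fraction of targets matched by one of their top-`budget` proposals.

    `proposals[i]` is the ranked proposal list for the image of `targets[i]`.
    """
    if not targets:
        return 0.0
    hits = sum(
        any(iou(gt, p) >= threshold for p in ranked[:budget])
        for gt, ranked in zip(targets, proposals)
    )
    return hits / len(targets)


if __name__ == "__main__":
    targets = [(0, 0, 10, 10), (0, 0, 10, 10)]
    proposals = [
        [(20, 20, 30, 30), (0, 0, 10, 10)],  # correct box ranked second
        [(5, 0, 15, 10)],                    # overlaps the target by one third
    ]
    for budget in (1, 2):
        print(budget, recall_at_iou(targets, proposals, threshold=0.5, budget=budget))
```

Sweeping `threshold` at a fixed budget gives a recall-versus-IoU curve, and sweeping `budget` at a fixed threshold gives a recall-versus-number-of-proposals curve.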

It is also interesting to see how LOP improves the performance of existing Language-grounded Instance Detection (LID) approaches. Here, we compare our LOP with EdgeBox and FRCNN using the system of Hu et al. [14], which uses the EdgeBox proposal technique followed by an image captioning model. In Figure 9, we keep the captioning model fixed and compare the different proposal techniques. We see from Figure 9 that LOP outperforms both EdgeBox and FRCNN detections for the task of LID. We experimented with varying numbers of object proposals = 1, 5, 10, 30 and 100. Using the metric for object detection (given object proposals), we see that LOP improves by % over FRCNN detection proposals and by % over EdgeBox proposals, as in Figure 9. The corresponding improvements of LOP over EdgeBox and FRCNN can also be observed for 5, 30 and 100 candidate proposals. Thus, LOP limits the number of poor candidate proposals passed to the subsequent, more complex instance detection model (LID in our case). The significance of LOP becomes more apparent when a larger number of object proposals is passed: in Figure 9, the margin over FRCNN clearly increases as the number of candidate proposals increases.

6 Discussion and Conclusion

In this work, we have proposed a solution to the problem of Object Referring in Visual Scenes with Spoken Language (ORSpoken). We have tackled the problem by presenting two datasets and a novel method. The data, annotated in three modalities, are ideal for multi-modality learning. This work has shown how the complicated ORSpoken task can be broken down into three sub-problems, and how to develop goal-oriented vision-language interaction models for the corresponding sub-tasks. Extensive experiments show that our proposed approaches and models are superior to competing methods at all levels: from expression transcription, to object candidate proposal, to object referring in visual scenes with speech expressions.

We infer the following: a) when speech is relevant to the visual context, our Visually Grounded Speech Recognition significantly outperforms standalone speech recognition; b) by using language information, our Language-grounded Object Proposal generates in-class object proposals instead of proposals of general classes, as done by previous methods such as EdgeBox and FRCNN; c) the contributions of a) and b) are complementary, and their combination yields a promising solution to Object Referring in Visual Scenes with Spoken Language. The aim of this work is to provide insights into this new research direction. Code and data will be made publicly available.

7 Acknowledgement

The work has been supported by Toyota via the research project TRACE-Zürich.


  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
  • [2] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. 2005.
  • [3] Y. Bisk, D. Yuret, and D. Marcu. Natural language communication with robots. In NAACL HLT, 2016.
  • [4] M.-M. Cheng, S. Zheng, W.-Y. Lin, V. Vineet, P. Sturgess, N. Crook, N. J. Mitra, and P. Torr. Imagespirit: Verbal guided image parsing. ACM Trans. Graph., 34(1):3:1–3:11, 2014.
  • [5] G. Chrupała, L. Gelderloos, and A. Alishahi. Representations of language in a model of visually grounded speech signal. arXiv preprint arXiv:1702.01991, 2017.
  • [6] P. R. Cohen and S. L. Oviatt. The role of voice input for human-machine communication. Proceedings of the National Academy of Sciences, 92(22):9921–9927, 1995.
  • [7] D. Dai. Towards Cost-Effective and Performance-Aware Vision Algorithms. PhD thesis, 2016.
  • [8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015.
  • [9] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
  • [10] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [11] D. Harwath, A. Torralba, and J. Glass. Unsupervised learning of spoken language with visual context. In Advances in Neural Information Processing Systems, pages 1858–1866, 2016.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.
  • [13] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In European Conference on Computer Vision, pages 108–124. Springer, 2016.
  • [14] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, pages 4555–4564, 2016.
  • [15] J. A. Jacko. Human computer interaction handbook: Fundamentals, evolving technologies, and emerging applications. CRC press, 2012.
  • [16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015.
  • [17] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. Referitgame: Referring to objects in photographs of natural scenes.
  • [18] T. Kollar, S. Tellex, D. Roy, and N. Roy. Grounding Verbs of Motion in Natural Language Commands to Robots, pages 31–47. 2014.
  • [19] V. A. Kulyukin. On natural language dialogue with assistive robots. In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-robot Interaction, HRI ’06, pages 164–171, New York, NY, USA, 2006. ACM.
  • [20] G. P. Laput, M. Dontcheva, G. Wilensky, W. Chang, A. Agarwala, J. Linder, and E. Adar. Pixeltone: a multimodal interface for image editing. In CHI, 2013.
  • [21] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [22] B. Lee, M. Hasegawa-johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang. Avicar: Audio-visual speech corpus in a car environment. In Interspeech, 2004.
  • [23] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries.
  • [24] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016.
  • [25] C. Matuszek, L. Bo, L. Zettlemoyer, and D. Fox. Learning from unscripted deictic gesture and language for human-robot interactions. In AAAI, 2014.
  • [26] T. Mishra and S. Bangalore. Qme!: A speech-based question-answering system on mobile devices. In NAACL, 2010.
  • [27] D. K. Misra, J. Sung, K. Lee, and A. Saxena. Tell me dave: Context-sensitive grounding of natural language to manipulation instructions. IJRR, 35(1-3):281–300, 2016.
  • [28] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
  • [29] C. Nass and S. Brave. Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. The MIT Press, 2005.
  • [30] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
  • [31] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.
  • [32] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [33] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive linguistic cues. CoRR, 2016.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • [35] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, pages 817–834. Springer, 2016.
  • [36] S. Ruan, J. O. Wobbrock, K. Liou, A. Ng, and J. Landay. Speech is 3x faster than typing for english and mandarin text entry on mobile devices. arXiv preprint arXiv:1608.07323, 2016.
  • [37] J. Salamon, C. Jacoby, and J. P. Bello. A dataset and taxonomy for urban sound research. In ACM MM, pages 1041–1044, 2014.
  • [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [39] R. Srihari and Z. Zhang. Show&tell: a semi-automated image annotation system. MultiMedia, IEEE, 7(3):61–71, 2000.
  • [40] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1879–1886. IEEE, 2011.
  • [41] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, pages 4566–4575, 2015.
  • [42] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • [43] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase localization. In European Conference on Computer Vision, pages 696–711. Springer, 2016.
  • [44] F. Weng, P. Angkititrakul, E. E. Shriberg, L. Heck, S. Peters, and J. H. L. Hansen. Conversational in-vehicle dialog systems: The past, present and future. IEEE Signal Processing Magazine, 2016.
  • [45] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
  • [46] T. Zhang, D. Dai, T. Tuytelaars, M.-F. Moens, and L. Van Gool. Speech-based visual question answering. arXiv preprint arXiv:1705.00464, 2017.
  • [47] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.