Grounding Object Detections With Transcriptions

by   Yasufumi Moriya, et al.

A vast amount of audio-visual data is available on the Internet thanks to video streaming services, to which users upload their content. However, it is difficult to exploit this data for supervised statistical models because it lacks labels. Generating labels for such an amount of data through human annotation is expensive, time-consuming and prone to annotation errors. In this paper, we propose a method to automatically extract entity-video frame pairs from a collection of instruction videos by using speech transcriptions and videos. We conduct experiments on image recognition and visual grounding tasks on the automatically constructed entity-video frame dataset of How2. The models are evaluated on a new manually annotated portion of the How2 dev5 and val sets and on the Flickr30k dataset. This work constitutes a first step towards meta-algorithms capable of automatically constructing task-specific training sets.



1 Introduction

Figure 1: Our proposed approaches to automatically labelling video frames with objects based on time-aligned speech transcription. The labelled video frames can be used to fine-tune a VGG-16 object classification convolutional neural network model.

Audio-visual “in the wild” data is widely available on the web through multiple distribution channels (e.g. YouTube). Despite this accessibility, the lack of manual annotations makes this data unusable for supervised machine learning models. In particular, neural networks require large amounts of labelled data to estimate their parameters reliably. Crowd-sourcing platforms such as Amazon Mechanical Turk and Figure8 offer services that assign an annotation task to remote workers. However, such services are expensive and time-consuming, and quality control over annotators' output is difficult. There have been previous attempts to overcome this limitation by exploiting audio-visual data for event detection systems and cooking procedures (Yu et al., 2014; Malmaud et al., 2015). However, exploiting “in the wild” data for object detection tasks has rarely been explored, and most systems rely on a corpus that has gone through partial to full manual annotation. In this work, we present object recognition and visual grounding systems that are developed from automatically generated labels (see Figures 1 and 2).

Specifically, nouns in the speech transcription are paired with the video frames corresponding to the time-stamps of those nouns. The video frames aligned with a noun (entity) are used to train object recognition and visual grounding models (Section 2). Both object recognition and visual grounding are established tasks in computer vision, but such systems are rarely developed from audio-visual data. Our approach to generating labels for the tasks and a summary of the dataset are described in Section 3. Experiments (Section 5) on the How2 dataset (Sanabria et al., 2018) show that it is possible to bridge speech and vision through object recognition and visual grounding.


Our contributions can be summarized as follows:

  • We propose a novel task: grounding speech transcriptions to entities in video.

  • We collected object-grounding annotations for the test and development set of the How2 dataset, an “in the wild” audio-visual corpus with speech transcriptions. The data is publicly available.

  • We present a set of object recognition experiments that show how our method is able to automatically create datasets from “in the wild” data without the need of human intervention.

  • Our results on the How2 dataset show that speech transcriptions can actually be grounded to visual entities.

2 Task overview

We demonstrate our automatically constructed visual-entity dataset on object recognition and weakly-supervised visual grounding tasks. Object recognition is the task of categorising a given image or video frame into one of the 445 classes in the How2 dataset. Weakly-supervised visual grounding is the task of drawing a bounding box around an object given an image or video frame and a target entity.

As notation, each entity is denoted as $e_i$, where $e_i$ is the $i$-th entity. For each video frame $v_{ij}$, where $v_{ij}$ is the $j$-th video frame of entity $e_i$, region proposals of the entity are derived. Each of the region proposals is denoted as $r_{ijk}$, where $r_{ijk}$ is the $k$-th region proposal of video frame $v_{ij}$.

2.1 Object recognition

The use of convolutional neural networks has become the dominant approach to object recognition. For domain adaptation to a new dataset such as How2, one approach is transfer learning, where an existing model is fine-tuned on the new dataset. Specifically, an off-the-shelf convolutional neural network (CNN) model is used as a feature extractor, and a one-layer neural network is trained to map a feature to $C$ pre-defined object classes. Formally:

$$x_{ij} = \mathrm{CNN}(v_{ij}), \qquad z_{ij} = W x_{ij} + b$$

where $x_{ij}$ is a visual feature extracted from a video frame $v_{ij}$ using a convolutional neural network $\mathrm{CNN}$, and $z_{ij} \in \mathbb{R}^{C}$ holds one score per class. Although each video frame is paired with one entity, it is possible that one video frame shows multiple objects. To capture this effect, we experimented with both one-to-one and one-to-many training. For one-to-one training, the model is encouraged to find a single entity for a video frame using a softmax function, $\hat{y}_{ij} = \mathrm{softmax}(z_{ij})$, and cross entropy loss is computed given the prediction. For one-to-many training, the model is expected to find all objects that appear in the video to which the current video frame $v_{ij}$ belongs. Therefore, instead of softmax and cross entropy loss, a sigmoid function is applied after the linear layer, and binary cross entropy loss is computed:

$$\hat{y}_{ij} = \sigma(z_{ij}), \qquad L_{\mathrm{BCE}} = -\frac{1}{C} \sum_{c=1}^{C} \left[ y_{ij,c} \log \hat{y}_{ij,c} + (1 - y_{ij,c}) \log (1 - \hat{y}_{ij,c}) \right]$$

where $y_{ij,c} = 1$ if entity $c$ occurs in the video containing $v_{ij}$ and $0$ otherwise.
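The difference between the two training criteria can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the toy scores, and the choice to average the binary cross entropy over classes are assumptions made only for illustration.

```python
import math

def softmax(logits):
    """Turn class scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """One-to-one training: exactly one entity label per video frame."""
    return -math.log(softmax(logits)[target_index])

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(logits, target_indices):
    """One-to-many training: every entity of the source video is a positive.

    Averaging over classes is an assumption of this sketch.
    """
    targets = set(target_indices)
    total = 0.0
    for c, z in enumerate(logits):
        p = sigmoid(z)
        y = 1.0 if c in targets else 0.0
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(logits)

# Toy scores for 4 classes (hypothetical values).
logits = [2.0, 0.5, -1.0, 0.1]
single_loss = cross_entropy(logits, target_index=0)
multi_loss = binary_cross_entropy(logits, target_indices=[0, 1])
```

With softmax, raising the probability of one class necessarily lowers the others, whereas the sigmoid treats each class independently, which is why it suits frames that may show several objects.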


2.2 Weakly-supervised visual grounding

Figure 2: Our weakly-supervised visual grounding model uses the word timestamps provided by the ASR forced alignment to select a positive frame. We then sample a negative frame from a random video. A region proposal network suggests region proposals. We then extract a feature representation with a ResNet-152 model. These feature representations are later used for a ranking loss. ASR stands for Automatic Speech Recognition and RPN for Region Proposal Network.

We base our visual grounding on the multiple instance learning (MIL) approach of Huang et al. (2018) and on the reconstruction approach of Rohrbach et al. (2016). This section describes those two approaches as adapted to our dataset.

2.2.1 Multiple instance learning

For each video frame $v_{ij}$ and given target entity $e_i$, the MIL model finds the index $k^*$ of the region proposal most likely to correspond to the object as follows:

$$k^* = \arg\max_{k} \; \phi(r_{ijk})^{\top} \psi(e_i)$$

where $r_{ijk}$ is the $k$-th region proposal of the frame, $\phi$ is a visual encoder that transforms a cropped region of an image into visual features, and $\psi$ is a text encoder that embeds the name of an entity into a dense vector.


A visual encoder $\phi$ consists of a convolutional neural network $\mathrm{CNN}$ and a linear layer $W_v$. Similarly, a text encoder $\psi$ first encodes an entity using a word embedding model $\mathrm{Emb}$, and a linear layer $W_e$ further transforms the word embedding into a textual feature. When optimising the MIL visual grounding model, only the linear layers $W_v$ and $W_e$ are updated, and the weights of $\mathrm{CNN}$ and $\mathrm{Emb}$ are frozen.

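The MIL visual grounding model is optimised through automatically induced entity-video frame pairs. The intuition is that any of the region proposals should be more strongly associated with the aligned entity than with a randomly selected entity. Similarly, any of the entities should be more strongly associated with the aligned video frame than with a randomly selected video frame. To express this formally, the score of a video frame $v$ and an entity $e$ is the score of the best-matching region proposal:

$$S(v, e) = \max_{k} \; \phi(r_k)^{\top} \psi(e)$$

where $r_k$ is the $k$-th region proposal of frame $v$, $\phi$ is the visual encoder and $\psi$ is the text encoder, and the loss function $L_{\mathrm{MIL}}$ is defined as follows:

$$L_{\mathrm{MIL}} = \sum_{(v, e)} \left[ \max\bigl(0,\; S(v, e') - S(v, e) + \Delta\bigr) + \max\bigl(0,\; S(v', e) - S(v, e) + \Delta\bigr) \right]$$

where $e'$ is a randomly selected entity, $v'$ is a randomly selected video frame, and $\Delta$ is the margin.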
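The ranking objective described in this section can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the dot-product similarity, the function names, and the toy feature vectors are assumptions, while the margin of 0.01 is the delta reported in the experimental setup.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def frame_entity_score(region_feats, entity_feat):
    """Score a (frame, entity) pair by its best-matching region proposal.

    Returns the score and the index of that proposal, which is also the
    grounding prediction at test time. Dot-product similarity is assumed.
    """
    scores = [dot(r, entity_feat) for r in region_feats]
    best_k = max(range(len(scores)), key=scores.__getitem__)
    return scores[best_k], best_k

def ranking_loss(pos_regions, pos_entity, neg_regions, neg_entity, delta=0.01):
    """Hinge-style loss: the aligned pair should beat both mismatched pairs."""
    s_pos, _ = frame_entity_score(pos_regions, pos_entity)
    s_neg_entity, _ = frame_entity_score(pos_regions, neg_entity)  # random entity
    s_neg_frame, _ = frame_entity_score(neg_regions, pos_entity)   # random frame
    return (max(0.0, s_neg_entity - s_pos + delta)
            + max(0.0, s_neg_frame - s_pos + delta))

# Toy 2-dimensional features (hypothetical values), two proposals per frame.
pos_regions = [[1.0, 0.0], [0.2, 0.1]]
neg_regions = [[0.0, 1.0], [0.1, 0.3]]
pos_entity = [1.0, 0.0]
neg_entity = [0.0, 1.0]

loss_aligned = ranking_loss(pos_regions, pos_entity, neg_regions, neg_entity)
loss_swapped = ranking_loss(neg_regions, pos_entity, pos_regions, neg_entity)
```

In this toy setting the correctly aligned pair already wins by more than the margin, so its loss is zero, while swapping the frames produces a positive loss that training would push down.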


2.2.2 Attention and reconstruction

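Another approach to detecting the bounding box of a target entity is to compute attention weights given an entity and region proposals, and to take the region proposal with the highest attention weight. This is formulated as follows:

$$k^* = \arg\max_{k} \; f_{\mathrm{att}}(r_{ijk}, e_i)$$

where $f_{\mathrm{att}}$ is an attention function that computes attention weights given the concatenation of the visual feature of a region proposal and the embedded entity. The attention weight $\alpha_k$ for the $k$-th region proposal is computed as follows:

$$\alpha_k = \frac{\exp\bigl(w_a^{\top} [\phi(r_{ijk}); \psi(e_i)] + b_a\bigr)}{\sum_{k'} \exp\bigl(w_a^{\top} [\phi(r_{ijk'}); \psi(e_i)] + b_a\bigr)}$$

where a linear layer $(w_a, b_a)$ transforms the concatenation of the visual features $\phi(r_{ijk})$ and the textual features $\psi(e_i)$, and its output is passed to the softmax function.

Optimisation of this model can be done by reconstructing the original embedded entity from the visual features of the region proposals to which the attention weights are applied:

$$\hat{e}_i = W_r \sum_{k} \alpha_k \, \phi(r_{ijk}) + b_r, \qquad L_{\mathrm{rec}} = \frac{1}{D} \sum_{d=1}^{D} \bigl(\psi(e_i)_d - \hat{e}_{i,d}\bigr)^2$$

where $(W_r, b_r)$ is a linear layer that reconstructs an embedded entity from the aggregated visual features of the region proposals, and $D$ is the dimensionality of embedded entities. $L_{\mathrm{rec}}$ is essentially the mean squared error function, which compares each dimension of the embedded entity $\psi(e_i)$ with the embedded entity $\hat{e}_i$ reconstructed from the visual features.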
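A minimal sketch of the attention and reconstruction steps is given below, assuming already-encoded region and entity features. All names, weights, and toy values are hypothetical; only the overall structure (one linear attention layer, softmax, weighted sum, linear reconstruction, mean squared error) follows the description above.

```python
import math

def linear(weights, bias, x):
    """Apply a linear layer given as a list of weight rows plus a bias vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_and_reconstruct(region_feats, entity_emb, w_att, b_att, w_rec, b_rec):
    # One attention logit per proposal from the concatenation [visual; text].
    logits = [linear([w_att], [b_att], r + entity_emb)[0] for r in region_feats]
    alphas = softmax(logits)
    # Attention-weighted sum of the region features.
    dim = len(region_feats[0])
    attended = [sum(a * r[d] for a, r in zip(alphas, region_feats))
                for d in range(dim)]
    # Reconstruct the entity embedding and compare with mean squared error.
    recon = linear(w_rec, b_rec, attended)
    loss = sum((e - h) ** 2 for e, h in zip(entity_emb, recon)) / len(entity_emb)
    best_k = max(range(len(alphas)), key=alphas.__getitem__)
    return best_k, alphas, loss

# Toy example: 2 proposals, 2-dimensional features (hypothetical values).
region_feats = [[1.0, 0.0], [0.0, 1.0]]
entity_emb = [1.0, 0.0]
w_att, b_att = [1.0, -1.0, 0.0, 0.0], 0.0
w_rec, b_rec = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

best_k, alphas, loss = attend_and_reconstruct(
    region_feats, entity_emb, w_att, b_att, w_rec, b_rec)
```

No bounding box labels are needed here: the only training signal is how well the attended visual features reproduce the entity embedding, and the proposal with the largest attention weight is read off as the prediction.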

The original paper encodes textual phrases using a recurrent neural network (Rohrbach et al., 2016). However, the entities extracted from the How2 dataset consist of at most two words, so there is no need to apply a sequential model to obtain textual features. For that reason, a word embedding model is used instead of the recurrent neural network.

3 Dataset

In this work, we use the How2 dataset to train our weakly-supervised visual grounding model (Sanabria et al., 2018). The dataset consists of 300 hours of speech with corresponding transcriptions and crowd-sourced Portuguese translations. In this work, we crowd-sourced and release grounding ground-truth labels for the dev5 and val sets.

3.1 Comparison to other visual datasets

We reiterate that our visual grounding model is trained on the videos and speech transcriptions of the How2 dataset. Most existing computer vision datasets consist either of images and image descriptions, such as Microsoft COCO (Lin et al., 2014), Flickr30k (Plummer et al., 2017) and ReferItGame (Kazemzadeh et al., 2014), or of videos and video descriptions, such as ActivityNet (Fabian Caba Heilbron & Niebles, 2015) and YouCookII (Zhou et al., 2018b). In contrast, How2 videos and speech transcriptions did not go through any annotation process, except for the evaluation set. Collecting videos and speech transcriptions is easier than having images or videos annotated with textual phrases. Furthermore, when no speech transcriptions are available for videos, transcriptions can be obtained using automatic speech recognition. This makes it easy to expand the set of entities that a visual grounding model learns, which is not straightforward when an annotation process is involved in dataset development.

3.2 Extraction of entities

This section describes our approach to extracting entity-image pairs from How2 videos and speech transcriptions.

  • Force-align speech transcriptions with videos to obtain time stamp of each word uttered in videos

  • Run Stanford Core NLP tool on speech transcription to assign part-of-speech tags to words (Manning et al., 2014)

  • Filter out nouns or noun phrases that are not part of ImageNet labels to retain only visible nouns (Russakovsky et al., 2015)

  • Extract video frames at the end timestamp of nouns retained in the previous step

The intuition is that when entities are uttered in speech, they are likely to appear in the visual stream. We retained entities that appear at least 5 times in the training set of How2.
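The filtering steps above can be sketched as follows, assuming forced alignment and part-of-speech tagging have already been run. The input tuple format, the function name, and the small `visible_nouns` set (a stand-in for the ImageNet label list) are hypothetical; the sketch handles single-word nouns only and relies on Penn Treebank noun tags starting with "NN".

```python
from collections import Counter

def extract_pairs(aligned_words, visible_nouns, min_count=5):
    """Return (video_id, entity, frame_time) tuples for retained nouns.

    aligned_words: iterable of (video_id, word, pos_tag, start_sec, end_sec).
    The frame time is the end timestamp of the noun, as in the steps above.
    """
    candidates = [
        (video_id, word.lower(), end)
        for video_id, word, tag, _start, end in aligned_words
        if tag.startswith("NN") and word.lower() in visible_nouns
    ]
    # Keep only entities that occur at least `min_count` times.
    counts = Counter(entity for _, entity, _ in candidates)
    return [(vid, ent, t) for vid, ent, t in candidates
            if counts[ent] >= min_count]

# Toy transcript (hypothetical values).
toy = [("vid1", "Knife", "NN", 0.5 + i, 0.9 + i) for i in range(5)]
toy.append(("vid1", "horse", "NN", 7.0, 7.4))  # seen once: dropped
toy.append(("vid1", "cut", "VB", 8.0, 8.2))    # not a noun: dropped
toy.append(("vid1", "idea", "NN", 9.0, 9.3))   # not a visible noun: dropped

pairs = extract_pairs(toy, visible_nouns={"knife", "horse"})
```

Each returned timestamp would then be passed to a frame extractor to obtain the image paired with the entity.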

3.3 Annotation of evaluation set

In order to evaluate the accuracy of our visual grounding model, we annotated with object bounding boxes the video frames extracted from the dev5 and val sets of How2 using the procedure in Section 3.2. Approximately 5,000 video frames were reviewed by Amazon Mechanical Turk workers, who were asked to draw bounding boxes around a target entity. As entities from speech transcriptions are not guaranteed to appear in videos, the workers could choose the “Nothing to label” button. The quality of annotation was manually verified by the authors.

3.4 Summary of automatically constructed dataset

This section gives a summary of the automatically constructed dataset. Overall, there are 139,867 video frames extracted from the training set, and 2,301 video frames from the dev5 and val sets. The number of entities used as labels is 533 when singular and plural forms are treated as distinct, and 445 when they are merged. The numbers reported from now on assume that singular and plural variants are merged into the same class. The 10 most frequent entities in train and dev5+val are summarised in Table 1. As can be seen in the table, the most frequent entities mostly refer to human body parts in both the train and dev5+val sets. Apart from body parts, the dataset includes “ball” (3,306 times in train and 30 times in dev5+val), “horse” (429 times in train and 22 times in dev5+val), and “knife” (327 times in train and 7 times in dev5+val), since the instruction videos demonstrate a variety of activities such as sports, animal care, and cooking.

Initially, 5,267 video frames from the dev5 and val sets were prepared for human annotation with a target entity label. However, the quality of annotation was not satisfactory on some video frames for the following reasons: (1) a video frame contains the target entity, but the bounding boxes are not correctly drawn or “Nothing to label” was selected, and (2) a video frame does not contain the target entity, but “Nothing to label” was not selected and bounding boxes were drawn. After excluding the rejected submissions, 4,775 video frames remained, of which 2,331 were “Nothing to label” and 2,444 were annotated. Since the dev5 and val sets contain some entities that do not appear in the train set, the total number of video frames for evaluation is 2,301. It is reported that target entities in the YouCook validation set appear in 51.32% of the total frames (Shi et al., 2019). By comparison, our automatic data construction approach collected 51.12% positive labels, which is comparable to the quality of YouCook videos, whose video descriptions were manually created for the visual grounding task.

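| rank | train entity | train frequency | dev5+val entity | dev5+val frequency |
|---|---|---|---|---|
| top1 | hand | 5,576 | hand | 131 |
| top2 | foot | 4,224 | foot | 69 |
| top3 | people | 3,420 | hair | 68 |
| top4 | back | 3,334 | body | 64 |
| top5 | ball | 3,306 | arm | 58 |
| top6 | leg | 3,026 | leg | 58 |
| top7 | water | 2,595 | people | 56 |
| top8 | body | 2,579 | shoulder | 38 |
| top9 | top | 2,455 | skin | 38 |
| top10 | line | 2,430 | face | 36 |

Table 1: Top 10 most frequent entities in train and dev5+val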

4 Related work

Semi-supervised Classification

Supervised object detection and classification models have been broadly studied (Ren et al., 2015; He et al., 2017; Redmon et al., 2016; Redmon & Farhadi, 2017). These approaches, however, rely on large annotated datasets that are usually expensive to create and prone to error. To overcome this difficulty, there have been numerous works on weakly-supervised models for object and event classification (Zhuang et al., 2017; Guo et al., 2018; Yu et al., 2014; Chang et al., 2015). More concretely, Zhuang et al. (2017) and Guo et al. (2018) exploit images crawled from the web to construct large-scale image classification datasets without any human intervention. Their methods are based on reducing the negative impact that web labels can have. More closely related to this work, Yu et al. (2014) and Chang et al. (2015) use the alignments between videos collected “in the wild” and their transcriptions to localize events and use them for training a multimodal event detection algorithm.


Multimodal retrieval, arguably also called grounding, has been studied in many previous works in the language processing community (Harwath et al., 2018a; Kamper et al., 2017; Arandjelovic & Zisserman, 2018; Aytar et al., 2017). More concretely, Harwath et al. (2018a) and Kamper et al. (2017) use images as an anchor modality to retrieve a textual representation from a speech query. On the other hand, Arandjelovic & Zisserman (2018) and Aytar et al. (2017) propose a more general retrieval approach in which queries can be made across multiple modalities.

From a more specific grounding perspective, Malmaud et al. (2015) propose an approach to align cooking recipes with ASR transcripts using a hidden Markov model and a keyword spotting system. In the image domain, Karpathy & Fei-Fei (2015) generate natural language descriptions of regions by aligning image bounding boxes to their descriptions with a neural approach. More recently, Rohrbach et al. (2016) and Harwath et al. (2018b) ground images to their textual and acoustic (i.e. speech) descriptions, respectively. Our grounding approach (Section 2.2) is inspired by Rohrbach et al. (2016). Shi et al. (2019), Zhou et al. (2019, 2018a), Huang et al. (2018), Zhao et al. (2018) and Ephrat et al. (2018) propose approaches for grounding speech and textual representations to videos. More concretely, Zhao et al. (2018) and Ephrat et al. (2018) ground sound to image regions and use the result a posteriori for sound source localization and isolation. On the other hand, Shi et al. (2019), Zhou et al. (2019, 2018a) and Huang et al. (2018) ground textual video descriptions to video. All previously cited works use clean, controlled, human-labeled datasets such as YouCookII (Huang et al., 2018), RoboWatch (Sener et al., 2015) or ActivityNet (Zhou et al., 2019), as opposed to this work, where we use How2, a dataset collected “in the wild”.

In parallel with this work, Amrani et al. (2019) propose a similar approach to automatically creating object recognition datasets with How2. There are four fundamental differences between Amrani et al. (2019) and our work. Firstly, Amrani et al. (2019) evaluate their approach on a newly defined test set that contains videos from the train set, which might affect the generalizability of the approach. Our experiments use the official data partitions of How2 (Sanabria et al., 2018). Secondly, their approach is limited to 11 categories, whereas we use up to 533 classes (445 when singular and plural entities are merged). Thirdly, Amrani et al. (2019) only test their approach in a grounding scenario. In our work, apart from grounding, we perform experiments on object recognition. Finally, Amrani et al. (2019) limit their experimental results to the How2 dataset, whereas we analyze the generalizability of our approach by also reporting results on Flickr30k.

5 Experiments

5.1 Flickr30k

To evaluate the generalisability of object recognition and visual grounding models developed from our visual-entity How2 corpus, we adopted the Flickr30k dataset as an additional evaluation corpus (Plummer et al., 2017). The dataset consists of more than 30,000 images, of which 1,000 images each are assigned to the validation and test sets. For each image, there are 5 image descriptions created by Amazon Mechanical Turk workers. Furthermore, the dataset is annotated with bounding boxes that correspond to phrases in the image descriptions (e.g., “A stop sign”). For evaluation, we only use the validation and test sets of the dataset, except that we fine-tuned a VGG16 model for object recognition on the training data for comparison with models trained on How2.

For object recognition, each phrase of the image descriptions is compared against the How2 visual entities. When a phrase shares an entity with the set of How2 entities, the Flickr30k image is labelled with that entity. When a phrase contains multiple entities that exist in the How2 entities, we take the last word of the phrase (e.g., “corner” from “A street corner”). In the Flickr30k validation set, 6,927 phrases contain a single entity in common with the How2 entities and 321 phrases contain multiple. In the Flickr30k test set, 6,868 phrases contain a single entity in common with the How2 entities and 370 phrases contain multiple. For 10 images in the validation and test sets, none of the phrases share an entity with How2. For visual grounding, bounding box labels are extracted using the same approach, but some phrases are not annotated with bounding boxes, because not every phrase can be captured by a bounding box. There are 3,188 bounding boxes for the test set and 3,232 for the validation set.
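The phrase-to-entity mapping can be sketched as below. This is an assumption-laden illustration: the function name and tokenisation are hypothetical, only single-word entities are matched, and "take the last word" is interpreted as choosing the last word of the phrase that is also a How2 entity.

```python
def phrase_to_entity(phrase, how2_entities):
    """Map a Flickr30k phrase to a How2 entity, or None if nothing matches."""
    words = [w.strip(".,!?\"'").lower() for w in phrase.split()]
    matches = [w for w in words if w in how2_entities]
    if not matches:
        return None
    # With several matching entities, prefer the last one in the phrase,
    # which is usually the head noun in English noun phrases.
    return matches[-1]

# Toy entity set (hypothetical subset of the How2 entities).
how2_entities = {"street", "corner", "sign", "hand"}

label_multi = phrase_to_entity("A street corner", how2_entities)
label_single = phrase_to_entity("A stop sign", how2_entities)
label_none = phrase_to_entity("Two dogs", how2_entities)
```

Phrases that return `None` contribute no label, which corresponds to the images whose phrases share no entity with How2.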

5.2 Experimental setup

For object recognition, we use the VGG16 model as a feature extractor. While this is not a state-of-the-art model, it has shown reasonable accuracy on the task, and fine-tuning finishes more quickly than with other models because of the number of parameters the model has. The first linear classification layer transforms input image features of 4,096 dimensions to 256 dimensions. The features then go through a ReLU layer and a dropout layer with a dropout rate of 0.5. Finally, the last linear layer transforms its input to 445 dimensions, the number of unique entities in the How2 visual-entity corpus. The mini-batch size was set to 64, and the learning rate was 0.0001. The Adam optimizer was used to update the weights of the classification layers. The model was trained for 20 epochs.

For visual grounding, 20 candidate object bounding boxes were extracted from both How2 and Flickr30k using the region proposal network of the Mask R-CNN implementation provided by Facebook Research (Massa & Girshick, 2018). We chose the region proposal network of the Mask R-CNN model that has ResNeXt101 as its backbone. This model was pre-trained on the COCO 2017 dataset. For each candidate bounding box, the ResNet-152 model pre-trained on the ImageNet dataset extracts 2,048-dimensional visual features. The word embedding model was trained on the speech transcriptions of the How2 training set using the fastText model. The size of the word embedding was set to 100.

The MIL model consists of one visual feature encoder and one word embedding encoder. The visual feature encoder has one fully connected layer that transforms 2,048-dimensional features into 512 dimensions, followed by ReLU and dropout with rate 0.2. An additional fully connected layer is then applied to the visual features, and the final dimensionality of the features is 100. The word embedding encoder is one fully connected layer that transforms 100-dimensional vectors to 100-dimensional vectors. The model was trained for 20 epochs with learning rate 0.00001 using the Adam optimizer, and delta was set to 0.01. We observed that a higher learning rate easily led the model to become stuck in local minima.

The reconstruction model consists of one visual feature encoder and one word embedding encoder. The visual feature encoder transforms 2,048 dimensional vectors into 100 dimensional vectors, and the word embedding encoder encodes 100 dimensional input to 100 dimensional output. Those features are then concatenated and go through the attention layer. The model was trained for 20 epochs and learning rate was set to 0.001 using the Adam optimizer.

5.3 Object recognition results

| model | top1 | top5 | top10 | top20 |
|---|---|---|---|---|
| single | 9.95 | 25.27 | 36.47 | 49.44 |
| multi | 8.81 | 23.69 | 34.81 | 49.11 |
| flickr30k | 0.76 | 7.00 | 11.73 | 18.64 |

Table 2: Results of image classification models on the dev5+val set of How2. Results are reported as top-k accuracy. “single” is a model fine-tuned with softmax, targeting one label for each example, and “multi” is a model fine-tuned with sigmoid, targeting multiple labels for each example. “flickr30k” is a model fine-tuned on Flickr30k image-label pairs for comparison with the other models.

Table 2 summarises the results of image classification on the How2 dev5+val set. As described in Section 2.1, the VGG16 model was fine-tuned using two different training criteria. An additional model was fine-tuned on Flickr30k data using multi-label training, as each Flickr image contains multiple entities in its descriptions. The results are reported as top-k accuracy, as each video frame is annotated with one object in the How2 dev5+val set.
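For reference, top-k accuracy with a single gold label per frame can be computed as in the following sketch. The function name and toy scores are hypothetical, and ties between scores are broken by class index.

```python
def top_k_accuracy(score_rows, gold_labels, k):
    """Percentage of examples whose gold class is among the k highest scores."""
    hits = 0
    for scores, gold in zip(score_rows, gold_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(gold_labels)

# Toy predictions over 3 classes for 3 frames (hypothetical values).
score_rows = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
gold_labels = [1, 1, 0]

top1 = top_k_accuracy(score_rows, gold_labels, k=1)
top2 = top_k_accuracy(score_rows, gold_labels, k=2)
top3 = top_k_accuracy(score_rows, gold_labels, k=3)
```

Because a frame often shows several objects while only one is labelled, accuracy naturally rises as k grows, which is the pattern seen in Table 2.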

As can be seen in Table 2, the model trained with softmax and a single label per example showed better accuracy than the model trained with sigmoid and multiple labels. It is understandable that top-1 accuracy is just under 10%, as instruction videos typically capture multiple objects (e.g., a speaker and various body parts). In fact, when the top 5 predictions of the “single” model are considered together, accuracy increases to 25%, and further to just under 50% when taking the top 20. It also turned out that the model fine-tuned on Flickr data had difficulty recognising objects in the How2 dataset, even though the labels of the Flickr data are derived from image descriptions created by human annotators.

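| split | model | mAP@5 | mAP@10 | mAP@20 |
|---|---|---|---|---|
| validation | single | 6.47 | 7.48 | 8.34 |
| validation | multi | 7.86 | 8.74 | 9.54 |
| validation | flickr30k | 37.11 | 41.74 | 44.34 |
| test | single | 6.45 | 7.53 | 8.37 |
| test | multi | 7.22 | 8.25 | 8.98 |
| test | flickr30k | 35.69 | 39.91 | 42.5 |

Table 3: Results of object recognition models on validation and test set of Flickr30k. Results are reported in mean average precision (mAP) of top k recognised entities.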

Table 3 shows the object recognition results on the Flickr30k dataset. Since each image in Flickr30k can have multiple labels, the models are evaluated through the mean average precision (mAP) of the top-k predictions. Using “multi” label training led to better generalisation to another dataset: the “multi” system consistently shows better mAP than the “single” system. As expected, the model fine-tuned on Flickr30k labels performed much better than the models trained on How2.
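The exact mAP formula is not spelled out in the text, so the sketch below uses one common definition as an assumption: average precision at k sums the precision at each rank where a relevant label appears and divides by min(k, number of relevant labels). The function names and toy labels are hypothetical.

```python
def average_precision_at_k(ranked_labels, relevant, k):
    """AP@k for one image under the assumed definition."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(k, len(relevant))
    return precision_sum / denom if denom else 0.0

def mean_average_precision_at_k(all_ranked, all_relevant, k):
    """Mean of AP@k over all images."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(all_ranked, all_relevant)]
    return sum(aps) / len(aps)

# Toy predictions for two images (hypothetical values).
all_ranked = [["hand", "ball", "dog"], ["ball", "hand"]]
all_relevant = [{"hand", "dog"}, {"hand"}]

ap_first = average_precision_at_k(all_ranked[0], all_relevant[0], k=3)
map_at_3 = mean_average_precision_at_k(all_ranked, all_relevant, k=3)
```

Unlike top-k accuracy, this measure rewards placing all of an image's labels near the top of the ranking, which suits the multi-label Flickr30k setting.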

Figure 3 shows a few outputs of the “multi” system fine-tuned on How2. As can be seen in the figure, objects that are not among the most frequent classes in the How2 training set were predicted with high probabilities. For example, “knife” occurred 327 times and “fret” 399 times. This suggests that automatic construction of a visual dataset using speech transcriptions and videos offers reliable labels for object recognition.

Figure 3: Example outputs of the VGG-16 object recognition model trained on our automatically generated training set.

5.4 Visual grounding results

| model | IoU 0.5 | IoU 0.3 | IoU 0.1 |
|---|---|---|---|
| upperbound | 43.6 | 58.8 | 82.4 |
| random | 7.6 | 18.0 | 40.6 |
| reconstruction | 7.9 | 19.2 | 40.2 |
| MIL | 9.3 | 22.6 | 49.3 |

Table 4: Results of weakly-supervised models on the dev5 and val set of How2. Upper-bound is the best score models can achieve given the candidate object bounding boxes.

Table 4 summarises the results of the visual grounding systems trained on the How2 dataset. The results are reported as accuracy at several Intersection over Union (IoU) thresholds between a predicted candidate region and a gold standard region. When a predicted candidate region overlapped with any of the gold standard regions above the threshold, the prediction was regarded as positive. Because predictions are restricted to the candidate region proposals, the best accuracy achievable at each threshold is reported in the table as the upper bound.
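The evaluation criterion can be made concrete with a short sketch. The box format (x1, y1, x2, y2), the function names, and the use of "greater than or equal to" at the threshold are assumptions of this illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_hit(pred_box, gold_boxes, threshold):
    """A prediction is positive if it overlaps any gold box enough."""
    return any(iou(pred_box, g) >= threshold for g in gold_boxes)

def upper_bound_hit(proposals, gold_boxes, threshold):
    """True if at least one candidate proposal could have been a hit."""
    return any(is_hit(p, gold_boxes, threshold) for p in proposals)

# Toy boxes (hypothetical values): the overlap is 1 unit out of a union of 7.
pred = (0.0, 0.0, 2.0, 2.0)
gold = [(1.0, 1.0, 3.0, 3.0)]
overlap = iou(pred, gold[0])
hit_loose = is_hit(pred, gold, threshold=0.1)
hit_strict = is_hit(pred, gold, threshold=0.3)
```

Averaging `is_hit` over all evaluation frames gives the accuracy of a model at a threshold, and averaging `upper_bound_hit` gives the corresponding upper bound.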

As can be seen in Table 4, the reconstruction approach is only slightly better than the random baseline at IoU 0.5 and 0.3, and marginally worse at IoU 0.1 (40.2 vs. 40.6). MIL consistently produced the highest accuracy at every threshold.

Table 5 shows the results of the visual grounding systems applied to Flickr30k data. In general, a higher upper bound was obtained for Flickr30k data. This is possibly because the region proposal network was trained on the COCO 2017 data, whose domain is similar to that of Flickr30k. The results show that both the reconstruction and MIL approaches produced better accuracy than the random baseline. Furthermore, the MIL system was consistently the best at every threshold, as in the evaluation on the How2 dataset. Overall, even though the approaches used were not state-of-the-art, the models could learn visual-textual relationships from labels automatically generated from How2. Further improvement can be expected by adopting a more sophisticated visual grounding approach or by removing false alarm video frames.

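| split | model | IoU 0.5 | IoU 0.3 | IoU 0.1 |
|---|---|---|---|---|
| validation | upperbound | 65.6 | 76.1 | 89.8 |
| validation | random | 15.6 | 27.2 | 47.0 |
| validation | reconstruction | 18.4 | 31.8 | 54.1 |
| validation | MIL | 18.9 | 35.6 | 58.5 |
| test | upperbound | 65.7 | 75.9 | 90.0 |
| test | random | 15.4 | 28.6 | 45.6 |
| test | reconstruction | 18.9 | 30.7 | 51.6 |
| test | MIL | 19.6 | 36.4 | 58.1 |

Table 5: Results of weakly-supervised models on validation and test set of Flickr30k. Upper-bound is the best score models can achieve given candidate object bounding boxes.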

6 Conclusion

In this paper, we have investigated the use of audio-visual data “in the wild” to induce image labels for object recognition and visual grounding tasks. Overall, we have collected over 500 types of labels from 300 hours of videos. We demonstrated that simple approaches can learn relationships between vision and entities derived from speech transcriptions. This shows the advantage of removing manual label annotation from visual dataset construction. Future work includes refining the visual grounding approaches to improve accuracy on the How2 dataset.


  • Amrani et al. (2019) Amrani, E., Ben-Ari, R., Hakim, T., and Bronstein, A. Toward self-supervised object detection in unlabeled videos. arXiv preprint arXiv:1905.11137, 2019.
  • Arandjelovic & Zisserman (2018) Arandjelovic, R. and Zisserman, A. Objects that sound. In European Conference on Computer Vision (ECCV), pp. 435–451, 2018.
  • Aytar et al. (2017) Aytar, Y., Vondrick, C., and Torralba, A. See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932, 2017.
  • Chang et al. (2015) Chang, X., Yu, Y.-L., Yang, Y., and Hauptmann, A. G. Searching persuasively: Joint event detection and evidence recounting with limited supervision. In ACM international conference on Multimedia, pp. 581–590. ACM, 2015.
  • Ephrat et al. (2018) Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., and Rubinstein, M. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
  • Fabian Caba Heilbron & Niebles (2015) Fabian Caba Heilbron, Victor Escorcia, B. G. and Niebles, J. C. Activitynet: A large-scale video benchmark for human activity understanding. In Computer Vision and Pattern Recognition (CVPR), pp. 961–970, 2015.
  • Guo et al. (2018) Guo, S., Huang, W., Zhang, H., Zhuang, C., Dong, D., Scott, M. R., and Huang, D. Curriculumnet: Weakly supervised learning from large-scale web images. In European Conference on Computer Vision (ECCV), pp. 135–150, 2018.
  • Harwath et al. (2018a) Harwath, D., Chuang, G., and Glass, J. Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4969–4973. IEEE, 2018a.
  • Harwath et al. (2018b) Harwath, D., Recasens, A., Surís, D., Chuang, G., Torralba, A., and Glass, J. Jointly discovering visual objects and spoken words from raw sensory input. In European Conference on Computer Vision (ECCV), pp. 659–677, 2018b.
  • He et al. (2017) He, K., Gkioxari, G., Dollar, P., and Girshick, R. Mask r-cnn. In International Conference on Computer Vision (ICCV), pp. 2961–2969, 2017.
  • Huang et al. (2018) Huang, D. A., Buch, S., Dery, L., Garg, A., Fei-Fei, L., and Niebles, J. C. Finding “it”: Weakly-supervised, reference-aware visual grounding in instructional videos. In International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5948–5957, 2018.
  • Kamper et al. (2017) Kamper, H., Settle, S., Shakhnarovich, G., and Livescu, K. Visually grounded learning of keyword prediction from untranscribed speech. In Interspeech, pp. 3677–3681, 2017.
  • Karpathy & Fei-Fei (2015) Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Computer Vision and Pattern Recognition (CVPR), pp. 3128–3137, 2015.
  • Kazemzadeh et al. (2014) Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T. L. Referit game: Referring to objects in photographs of natural scenes. In Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755, 2014.
  • Malmaud et al. (2015) Malmaud, J., Huang, J., Rathod, V., Johnston, N., Rabinovich, A., and Murphy, K. What’s cookin’? interpreting cooking videos using text, speech and vision. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 143–152, 2015.
  • Manning et al. (2014) Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60, 2014.
  • Massa & Girshick (2018) Massa, F. and Girshick, R. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch, 2018. Accessed: 07 June 2019.
  • Plummer et al. (2017) Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision, 123(1):74–93, 2017.
  • Redmon & Farhadi (2017) Redmon, J. and Farhadi, A. Yolo9000: Better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR), July 2017.
  • Redmon et al. (2016) Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR), June 2016.
  • Ren et al. (2015) Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pp. 91–99, 2015.
  • Rohrbach et al. (2016) Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., and Schiele, B. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision (ECCV), pp. 817–834, 2016.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • Sanabria et al. (2018) Sanabria, R., Caglayan, O., Palaskar, S., Elliott, D., Barrault, L., Specia, L., and Metze, F. How2: a large-scale dataset for multimodal language understanding. In Workshop on Visually Grounded Interaction and Language (ViGIL). NeurIPS, 2018.
  • Sener et al. (2015) Sener, O., Zamir, A. R., Savarese, S., and Saxena, A. Unsupervised semantic parsing of video collections. In International Conference on Computer Vision (ICCV), pp. 4480–4488, 2015.
  • Shi et al. (2019) Shi, J., Xu, J., Gong, B., and Xu, C. Not all frames are equal: Weakly-supervised video grounding with contextual similarity and visual clustering losses. In Computer Vision and Pattern Recognition (CVPR) (to appear), 2019.
  • Yu et al. (2014) Yu, S.-I., Jiang, L., and Hauptmann, A. Instructional videos for unsupervised harvesting and learning of action examples. In 22nd ACM International Conference on Multimedia, pp. 825–828, 2014.
  • Zhao et al. (2018) Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., and Torralba, A. The sound of pixels. In European Conference on Computer Vision (ECCV), pp. 570–586, 2018.
  • Zhou et al. (2018a) Zhou, L., Louis, N., and Corso, J. J. Weakly-supervised video object grounding from text by loss weighting and object interaction. In British Machine Vision Conference, 2018a.
  • Zhou et al. (2018b) Zhou, L., Xu, C., and Corso, J. J. Towards automatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pp. 7590–7598, 2018b.
  • Zhou et al. (2019) Zhou, L., Kalantidis, Y., Chen, X., Corso, J. J., and Rohrbach, M. Grounded video description. In Computer Vision and Pattern Recognition (CVPR) (to appear), 2019.
  • Zhuang et al. (2017) Zhuang, B., Liu, L., Li, Y., Shen, C., and Reid, I. Attend in groups: a weakly-supervised deep learning framework for learning from web data. In Computer Vision and Pattern Recognition (CVPR), pp. 1878–1887, 2017.