Video Description: A Survey of Methods, Datasets and Evaluation Metrics

Automatic video description is useful for assisting the visually impaired, human computer interaction, robotics and video indexing. The past few years have seen a surge of research interest in this area due to the unprecedented success of deep learning in computer vision and natural language processing. Numerous methods, datasets and evaluation measures have been proposed in the literature, calling for a comprehensive survey to better focus research efforts in this flourishing direction. This paper answers exactly this need by surveying state-of-the-art approaches including deep learning models; comparing benchmark datasets in terms of their domain, number of classes, and repository size; and identifying the pros and cons of various evaluation metrics such as BLEU, ROUGE, METEOR, CIDEr, SPICE and WMD. Our survey shows that video description research has a long way to go before it can match human performance and that the main reasons for this shortfall are twofold. Firstly, existing datasets do not adequately represent the diversity in open domain videos and complex linguistic structures. Secondly, current measures of evaluation are not aligned with human judgement. For example, the same video can have very different, yet correct descriptions. We conclude that there is a need for improvement in evaluation measures as well as datasets in terms of size, diversity and annotation accuracy because they directly influence the development of better video description models. From an algorithmic point of view, diagnosis of the description quality is challenging because of the difficulty of assessing the level of contribution from visual features compared to the bias that comes naturally from the language model adopted.




1 Introduction

Describing a short video in natural language is a trivial task for most people, but a very challenging one for machines. Automatic video description involves understanding many entities and detecting their occurrences in a video using computer vision techniques. These entities include background scene, humans, objects, human actions, human-object interactions, human-human interactions, other events, and the order in which events occur. All this information must then be articulated in comprehensible and grammatically correct text using Natural Language Processing (NLP) techniques. Over the past few years, these two traditionally independent fields, Computer Vision (CV) and NLP, have joined forces to address the upsurge of research interest in understanding and describing images and videos. Special issues of journals are published focusing on language in vision [10] and workshops uniting the two areas have also been held regularly at both NLP and CV conferences [16, 17, 18, 106].

Automatic video description has many applications in human-robot interaction, automatic video subtitling and video surveillance. It can be used to help the visually impaired by generating verbal descriptions of surroundings through speech synthesis, or by automatically generating and reading out film descriptions. Currently, these are achieved through very costly and time-consuming manual processes. Another application is the description of sign language videos in natural language. Video description can also generate written procedures for humans or service robots by automatically converting actions in a demonstration video into simple instructions, for example, assembling furniture, installing a CD-ROM, making coffee or changing a flat tyre [14, 32].

Fig. 1: A basic framework for deep learning based video captioning. A visual model encodes the video frames into a vector space. The language model takes the visual vector and word embeddings as input to generate the sentence that describes the input visual content.

The advancement of video description opens up enormous opportunities in many application domains. It is envisaged that in the near future, we will be able to interact with robots in the same manner as with humans [136]. If video description is advanced to the stage of being able to comprehend events unfolding in the real world and render them in spoken words, service robots or smartphone apps will be able to understand human actions and other events and converse with humans in a much more meaningful and coherent manner. For example, they could answer a user’s question as to where they left their wallet or discuss what they should cook for dinner. In industry settings, they could potentially remind a worker of any actions or procedures that are missing from a routine operation. The recent release of a dialogue dataset, Talk the Walk [166], has introduced yet another interesting application where a natural language dialogue between a guide and a tourist helps the tourist to reach a previously unseen location on a map using perception, action and interaction modeling.

Leveraging the recent developments in deep neural networks for NLP and CV, and the increased availability of large multi-modal datasets, automatically generating stories from pixels is no longer science fiction. This growing body of work has mainly originated from the robotics community and can be labeled broadly as language grounded meaning from vision to robotic perception [138]. Related research areas include connecting words to pictures [26, 27, 44], narrating images in natural language sentences [51, 91, 97] and understanding natural language instructions for robotic applications [66, 107, 153]. Another closely related field is Visual Information Retrieval (VIR), which takes a visual (image, drawing or sketch), text (tags, keywords or complete sentence) or mixed visual and text query to perform content based search. Thanks to the release of the benchmark datasets MS COCO [100] and Flickr30k [181], research in image captioning and retrieval [46, 83, 50, 105], and image question answering [104, 19, 128, 185] has also become very active.

Automatically generating natural language sentences describing the video content has two components: understanding the visual content and describing it in grammatically correct natural language sentences. Figure 1 shows a simple deep learning based video captioning framework. The task of video description is relatively more challenging than image captioning because not all objects in the video are relevant to the description, for example detected objects that do not play any role in the observed activity [25]. Moreover, video description methods must additionally capture the speed and direction of relevant objects as well as causality among events, actions, and objects. Finally, events in videos can be of varying lengths and may even overlap [87]. See Figure 2 for an example. The piano recital spans almost the entire duration of the video, whereas the applause is a very short event that takes place only at the end. The example illustrates differences between three related areas of research, namely, image captioning, video captioning and dense video captioning. In this example, image captioning techniques recognize the event as mere clapping, whereas it is actually applause that resulted from a previous event, the piano playing.

Figure 3 summarizes related research under the umbrella of Visual Description. The classification is based on whether the input is still images (Image Captioning) or multi-frame short videos (Video Captioning). Note, however, that short video captioning is very different from video auto-transcription, where audio and speech are the main focus. Video captioning concerns mainly the visual content as opposed to the audio signals. In particular, Video Description extends video captioning with the aim of providing a more detailed account of the visual contents in the video.

Below we define some of the terminology used in this paper.

  • Visual Description: The unifying concept encompassing (see Fig. 3) the automatic generation of single or multiple natural language sentences that convey the information in still images or video clips.

  • Video Captioning: Conveying the information of a video clip as a whole through a single automatically generated natural language sentence based on the premise that short video clips usually contain one main event [56, 179, 161, 22, 118, 46].

  • Video Description: Automatically generating multiple natural language sentences that provide a narrative of a relatively longer video clip. The descriptions are more detailed and may be in the form of paragraphs. Video description is sometimes also referred to as story telling or paragraph generation [184, 131].

  • Dense Video Captioning: Detecting and conveying information about all, possibly overlapping, events of different lengths in a video using one natural language sentence per event. As illustrated in Fig. 2, dense video captioning localizes events in time [87, 180, 124, 175] and generates sentences that are not necessarily coherent. Video description, on the other hand, gives a more detailed account of one or more events in a video clip using multiple coherent sentences without having to localize individual events.

Fig. 2: Illustration of differences between image captioning, video captioning and dense video captioning. Image (video frame) captioning describes each frame with a single sentence. Video captioning describes the complete video with one sentence. In dense video captioning, each event in video is temporally detected and described by a single sentence eventually resulting in multiple sentences localized in time but not necessarily coherent.

Video captioning research started with the classical template based approaches in which Subject (S), Verb (V), and Object (O) are detected separately and then joined using a sentence template. These approaches are referred to as SVO-Triplets [85, 25]. However, the advent of deep learning and the tremendous advancements in CV and NLP have equally affected the area of video captioning. Hence, the latest approaches follow deep learning based architectures [161, 134] that encode the visual features with 2D/3D-CNN and use LSTM/GRU to learn the sequence. The output of both approaches is either a single sentence [177, 117], or multiple sentences [131, 25, 146, 184, 42, 79] per video clip. Early research on video description mostly focused on domain specific short video clips with limited vocabularies of objects and activities [25, 42, 85, 78, 136, 182]. Description of open domain and relatively longer videos remains a challenge, as it needs large vocabularies and training data. Methods that follow the CNN-LSTM/GRU framework differ from each other mainly in the types of CNNs and language models (vanilla RNN, LSTM, and GRUs) they employ, as well as in how they pass the extracted visual features to the language model (at the first time step only or at all time steps). Later methods progressed by introducing additional transformations on top of the standard encoder-decoder framework. These transformations include the attention mechanism [179], where the model learns which part of the video to focus on, sequence learning [161], which models a sequence of video frames with the sequence of words in the corresponding sentence, semantic attributes [56, 118], which exploit the visual semantics in addition to CNN features, and joint modeling of visual content with compositional text [117]. More recently, the video based visual description problem has evolved towards dense video captioning and video story telling. New datasets have also been introduced to progress along these lines.

Fig. 3: Classification of visual content description. This survey focuses on video only and not images.

When it comes to performance comparison, quantitative evaluation of video description systems is not straightforward. Currently, automatic evaluations are typically performed using machine translation and image captioning metrics, including Bilingual Evaluation Understudy (BLEU) [119], Recall Oriented Understudy for Gisting Evaluation (ROUGE) [99], Metric for Evaluation of Translation with Explicit Ordering (METEOR) [23], Consensus based Image Description Evaluation (CIDEr) [159], and the recently proposed Semantic Propositional Image Captioning Evaluation (SPICE) [15] and Word Mover’s Distance (WMD) [93] metrics. Section 5.1 presents these measures. Here, we give a brief overview to establish motivation for our survey. BLEU is a precision-based metric, which accounts for precise matching of n-grams in the generated and ground truth references. METEOR, on the other hand, first creates an alignment between the two sentences by comparing exact tokens, stemmed tokens and paraphrases. It also takes into consideration the semantically similar matches using WordNet synonyms. ROUGE, similar to BLEU, has different n-grams based versions and computes recall for the generated sentences and the reference sentences. CIDEr is a human-consensus-based evaluation metric, which was developed specifically for evaluating image captioning methods but has also been used in video description tasks. WMD makes use of word embeddings (semantically meaningful vector representations of words) and compares two texts using the Earth Mover’s Distance (EMD). This metric is relatively less sensitive to word order and synonym changes in a sentence and, like CIDEr and METEOR, it provides high correlation with human judgments. Lastly, SPICE is a more recent metric that correlates more with human judgment of semantic quality as compared to previously reported metrics. It compares the semantic information of two sentences by matching their content in dependency parse trees. 
These metrics capture very different performance measures for the same method and are not perfectly aligned with human judgments. Also, due to the hand engineered nature of these metrics, their scores are unstable when the candidate sentence is perturbed with synonyms, word order, length and redundancy. Hence, there is a need for an evaluation metric that is learned from training data to score in harmony with human judgments in describing videos with diverse content.
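To make the notion of a precision-based n-gram metric concrete, the following minimal Python sketch computes the clipped n-gram precision that underlies BLEU for one candidate sentence and one reference. It is illustrative only: the function names and the example sentences are assumptions made here, and full BLEU additionally combines several n-gram orders through a geometric mean and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical sentences: both describe the same event correctly.
reference = "a man is playing a piano"
candidate = "a man plays a piano"

p1 = clipped_precision(candidate, reference, 1)  # unigram precision
p2 = clipped_precision(candidate, reference, 2)  # bigram precision
```

In this toy case the candidate is a valid description, yet "plays" does not match "playing" exactly, so the bigram precision drops noticeably. This is one way to see why exact n-gram matching can disagree with human judgment.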

The current literature lacks a comprehensive and systematic survey that covers different aspects of video description research including methods, dataset characteristics, evaluation measures, benchmark results and related competitions and video Q&A challenges. We fill this gap and present a comprehensive survey of the literature. We first highlight the important applications and major trends of video description in Section 1 and then classify automatic video description methods into three groups, giving an overview of the models from each group in Section 2. In Section 3, we elaborate on the available video description datasets used for benchmarking. In Section 4, we present the details of video competitions and challenges. Furthermore, we review the evaluation metrics that are used for quantitative analysis of the generated descriptions in Section 5. In Section 6, benchmark results achieved through the aforementioned methods are compared and discussed. In Section 7, we discuss the possible future directions and finally Section 8 concludes our survey and discusses some insights into the findings.

2 Video Description Methods

Video description literature can be divided into three main phases. The first is the classical methods phase, in which pioneering visual description research employed classical CV and NLP methods to first detect entities (objects, actions, scenes) in videos and then fit them to standard sentence templates. The second is the statistical methods phase, which employed statistical methods to deal with relatively larger datasets and lasted for a relatively short time. The third is the deep learning phase, which is the current state of the art and is believed to have the potential to solve the open domain automatic video description problem. Below, we give a detailed survey of the methods in each category.

2.1 Classical Methods

The SVO (Subject, Verb, Object) tuple based methods are among the first successful methods used specifically for video description. However, research efforts were made long before to describe visual content in natural language, albeit not explicitly for captioning or description. The first ever attempt goes back to Koller et al. [86] in 1991, who developed a system that was able to characterize the motion of vehicles in real traffic scenes using natural language verbs. Later, in 1997, Brand et al. [32] dubbed this the “Inverse Hollywood Problem” (in Hollywood a script, i.e. a description, is converted into video, whereas here the problem is the opposite), and described a series of actions as semantic tag summaries in order to develop a storyboard from instructional videos. They also developed a system, “video gister”, that was able to heuristically parse the videos into a series of key actions and generate a script that describes the actions detected in the video. They also generated key frames depicting the detected causal events and defined the series of events in a semantic representation, e.g. add by enter, motion, detach and remove by attach, move, leave. Video gister was limited to only one human arm (actor) interacting with non-liquid objects and was able to understand only five actions (touch, put, get, add, remove).

Returning to SVO tuple based methods, these tackle the video description generation task in two stages. The first stage, known as content identification, focuses on visual recognition and classification of the main objects in the video clip. These typically include the performer or actor, the action and the object of that action. The second stage involves sentence generation, which maps the objects identified in the first stage to Subject, Verb and Object (hence the name SVO) and fills in handcrafted templates to produce grammatically sound sentences. These templates are created using grammar or rule-based systems, which are only effective in very constrained environments, i.e. short clips or videos with a limited number of objects and actions.

Numerous methods have been proposed for detecting objects, humans, actions, and events in videos. Below we summarize the recognition techniques used in Stage I of the SVO tuple based approaches.

  • Object Recognition: Object recognition in SVO approaches was performed typically using conventional methods, including model-based shape matching through edge detection or color matching [85], HAAR features matching [165], context-based object recognition [157], Scale Invariant Feature Transform (SIFT) [102], discriminatively trained part-based models [55] and Deformable Parts Model (DPM) [53, 54].

  • Human and Activity Detection: Human detection methods employed features such as Histograms of Oriented Gradient (HOG) [40] followed by SVM. For activity detection, features like Spatiotemporal Interest Points such as Histogram of Oriented Optical Flow (HOOF) [33], Bayesian Networks (BN) [73], Dynamic Bayesian Networks (DBNs) [60], Hidden Markov Models (HMM) [28], state machines [86], and PNF Networks [122] have been used by SVO approaches.

  • Integrated Approaches: Instead of detecting the description-relevant entities separately, Stochastic Attribute Image Grammar (SAIG) [193] and Stochastic Context Free Grammars (SCFG) [111], allow for compositional representation of visual entities present in a video, an image or a scene based on their spatial and functional relations. Using the visual grammar, the content of an image is first extracted as a parse graph. A parsing algorithm is then used to find the best scoring entities that describe the video. In other words, not all entities present in a video are of equal relevance, which is a distinct feature of this class of methods compared to the aforementioned approaches.

For Stage II, sentence generation, a variety of methods have been proposed including HALogen representation [94], Head-driven Phrase Structure Grammar (HPSG) [123], and planner and surface realizer [127]. The primary common task of these methods is to define templates. A template is a user-defined language structure containing placeholders. In order to function properly, a template comprises three parts: lexicon, grammar and template rules.

Lexicon represents vocabulary that describes high level video features. Template rules are user-defined rules guiding the selection of appropriate lexicons for sentence generation. Grammar defines linguistic rules to describe the structure of expressions in a language, ensuring that a generated sentence is syntactically correct. Using production rules, Grammar can generate a large number of various configurations from a relatively small vocabulary.

In template based approaches, a sentence is generated by fitting the most important entities to each of the categories required by the template, e.g. subject, verb, object, and place. Entities and actions recognized in the content identification stage are used as lexicons. Correctness of the generated sentence is ensured by Grammar. Figure 4 presents examples of some popular templates used for sentence generation in template based approaches. Figure 5 gives a timeline of how the classical methods evolved over time whereas below we provide a survey of SVO methods by grouping them into three categories namely, subject (human) focused, action and object focused and methods that use the SVO approach on open domain videos. Note that the division boundaries are frequently blurred between these categories.
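The interplay of lexicon, template rules and grammar can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation of any surveyed method: the templates, the slot names, and the two toy grammar helpers (`with_article`, `third_person`) are hypothetical, and the entity dictionaries stand in for the output of the content identification stage.

```python
# Order in which slots appear in a sentence.
SLOT_ORDER = ("subject", "verb", "object", "place")

# Hypothetical template rules: which sentence pattern to use for which
# combination of detected slots.
TEMPLATES = {
    ("subject", "verb"): "{subject} {verb}.",
    ("subject", "verb", "object"): "{subject} {verb} {object}.",
    ("subject", "verb", "object", "place"): "{subject} {verb} {object} in {place}.",
}


def with_article(noun):
    """Toy grammar rule: prepend an indefinite article to a singular noun."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def third_person(verb):
    """Toy grammar rule: third-person singular form of a regular verb."""
    if verb.endswith(("s", "sh", "ch", "x", "o")):
        return verb + "es"
    return verb + "s"


def generate(entities):
    """Fill the template that matches the detected slots."""
    slots = tuple(k for k in SLOT_ORDER if entities.get(k))
    template = TEMPLATES[slots]  # raises KeyError if no template covers the slots
    words = {}
    for slot in slots:
        word = entities[slot]
        if slot == "verb":
            words[slot] = third_person(word)
        elif slot == "place":
            words[slot] = "the " + word
        else:
            words[slot] = with_article(word)
    sentence = template.format(**words)
    return sentence[0].upper() + sentence[1:]


s1 = generate({"subject": "person", "verb": "ride", "object": "bicycle"})
s2 = generate({"subject": "person", "verb": "slice", "object": "onion", "place": "kitchen"})
```

The sketch also shows why such systems only work in constrained settings: every new slot combination needs its own template, and every irregular word needs its own grammar rule.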

Fig. 4: An example of various templates used for sentence generation from videos. Subject, verb, and object are used to fill in these templates. The verb is obtained from action/activity detection methods using spatio-temporal features, whereas the subject and object are obtained from object detection methods using spatial features.

(1) Subject (Human) Focused: In 2002, Kojima et al. [85] proposed one of the earliest methods designed specifically for video captioning. This method focuses primarily on describing videos of one person performing one action only. To detect humans in a scene, they calculated the probability of a pixel coming from the background or the skin region using the values and distributions of pixel chromaticity. Once a human’s head and hands are detected, the human posture is estimated by considering three kinds of geometric information, i.e. the position of the head and hands and the direction of the head. For example, to obtain the head direction, the detected head image is compared against a list of pre-collected head models and a threshold is used to decide on the matching head direction. For object detection, they applied two-way matching, i.e. shape-based matching and pixel based color matching, to a list of predefined known objects. Actions detected are all related to object handling and the difference image is used to detect actions such as putting an object down or lifting an object up. To generate the description in sentences, pre-defined case frames and verb patterns as proposed by Nishida et al. [114, 113] are used. A case frame is a type of frame expression used for representing the relationship between cases, which are classified into 8 categories. The frequently used ones are agent, object, and locus. For example, “a person walks from the table to the door”, is represented as:

[PRED:walk, AG:person, GO-LOC:by(door), SO-LOC:front(table)],

where PRED is the predicate for action, AG is the agent or actor, GO-LOC is the goal location and SO-LOC is the source location. A list of semantic primitives describing movements is defined and organized using body action state transitions. For example, if moving is detected and the speed is fast, then the activity state is transitioned from moving to running. They also distinguish durative actions (e.g. walk) from instantaneous actions (e.g. stand up). The major drawback of their approach is that it cannot be easily extended to more complex scenarios such as multiple actors, incorporating temporal information, and capturing causal relationships between events. The heavy reliance on the correctness of the manually created activity concept hierarchy and state transition model also prevents it from being used in practical situations.
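A case frame and the state transition rule above can be represented with very simple data structures. The following Python sketch is only an illustration under our own assumptions: the helper names and the speed threshold `FAST_SPEED` are hypothetical and are not taken from Kojima et al. [85].

```python
def make_case_frame(pred, agent, goal=None, source=None):
    """Build a case frame as a dict keyed by the slot names used in the text."""
    frame = {"PRED": pred, "AG": agent}
    if goal is not None:
        frame["GO-LOC"] = goal
    if source is not None:
        frame["SO-LOC"] = source
    return frame


def format_case_frame(frame):
    """Render a case frame in the bracketed notation used in the text."""
    return "[" + ", ".join(f"{slot}:{value}" for slot, value in frame.items()) + "]"


# Hypothetical threshold (arbitrary units) above which movement counts as fast.
FAST_SPEED = 2.0


def refine_action(state, speed):
    """Apply the transition rule: 'moving' becomes 'running' when the speed is fast."""
    if state == "moving" and speed > FAST_SPEED:
        return "running"
    return state


frame = make_case_frame("walk", "person", goal="by(door)", source="front(table)")
rendered = format_case_frame(frame)
```

Every additional action, slot, or transition has to be added by hand in such a representation, which mirrors the reliance on manually created concept hierarchies noted above.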

Fig. 5: Evolution of classical methods over time. In general the focus of these methods moved from subjects (humans) to actions and objects and then to open domain videos containing all three SVO categories.

Hakeem et al. [67] addressed the shortcomings of Kojima et al.’s [85] work and proposed an extended case framework (CASE) using hierarchical CASE representations. They incorporated multiple agent events, temporal information, and causal relationships between the events to describe the events in natural language. They introduced a case-list to incorporate multiple agents in AG, [PRED:move, AG:{person1, person2},...]. Moreover, they incorporated temporal information into CASE using temporal logic to encode the relationship between sub-events. As some events are conditional on other events, they also captured causal relationships between events. For example, in the sentence “a man played piano and the crowd applauded”, the applause occurred because the piano was played: [CAUSE: [PRED:play, D:crowd, FAC:applaud]].

Khan et al. [79] introduced a framework to describe human related contents such as actions (limited to five only) and emotions in videos using natural language sentences. They implemented a suite of conventional image processing techniques, including face detection [90], emotion detection [103], action detection [28], non-human object detection [165] and scene classification [82], to extract the high level entities of interest from video frames. These include humans, objects, actions, gender, position and emotion. Since their approach encapsulates human related actions, the human is rendered as Subject and the objects upon which the action is performed are rendered as Object. A template based approach is adopted to generate natural language sentences based on the detected entities. They evaluated the method on a dataset of 50 snippets, each spanning 5 to 20 seconds. Out of 50, 20 snippets were human close-ups and 30 showed human activities such as stand, walk, sit, run and wave. The primary focus of their research was on activities involving a human interacting with some objects. Hence, their method does not generate any description until a human is detected in the video. The method cannot identify actions with subtle movements (such as smoking and drinking) or interactions among humans.

(2) Action and Object Focused: Lee et al. [95] proposed a method for semantically annotating visual content in three sequential stages namely, image parsing, event inference and language generation. An “image parsing engine” using stochastic attribute image grammar (SAIG) [193] is employed to produce a visual vocabulary i.e. a list of visual entities present in the frame along with their relationships. This output is then fed into an “event inference engine”, which extracts semantic and contextual information of visual events, along with their relationships. Video Event Markup Language (VEML) [112] is used to represent semantic information. In the final stage, head-driven phrase structure grammar (HPSG) [123] is used to generate text description from the semantic representation. Compared to Kojima et al. [85], grammar-based methods can infer and annotate a wider range of scenes and events. Ten streams of urban traffic and maritime scenes over a period of 120 minutes, containing more than 400 moving objects are used for evaluation. Some detected events include “entering the scene, moving, stopping, turning, approaching traffic intersection, watercraft approaching maritime markers and land areas and scenarios where one object follows the other” [95]. Recall and Precision rates are employed to evaluate the accuracy of the events that are detected with respect to manually labeled ground truth. Due to poor estimation of the motion direction from low number of perspective views, their method does not perform well on “turning” events.

Hanckmann et al. [69] proposed a method to automatically describe events involving multiple actions (7 on average), performed by one or more individuals. Unlike Khan et al. [79], human-human interactions are taken into account in addition to human-object interactions. Bag-of-features (48 in total) are collected as action detectors [30] for detecting and classifying actions in a video. The description generator subsequently describes the verbs relating the actions to the scene entities. It finds the appropriate actors among objects or persons and connects them to the appropriate verbs. In contrast to Khan et al. [79], who assume that the subject is always a person, Hanckmann et al. [69] generalize subjects to include vehicles as well. Furthermore, the set of human actions is much richer. Compared to the five verbs in Khan et al. [79], they have 48 verbs capturing a diverse range of actions such as approach, arrive, bounce, carry and catch.

Barbu et al. [25] generated sentence descriptions for short videos of highly constrained domains consisting of 70 object classes, 48 action classes and a vocabulary of 118 words. They rendered a detected object and action as noun and verb respectively. Adjectives are used for the object properties and prepositions are used for their spatial relationships. Their approach comprises three steps. In the first step, object detection [54] is carried out on each frame, limited to 12 detections per frame to avoid over-detection. Second, object tracking [155, 145] is performed to increase the precision. Third, the optimal set of detections is chosen using dynamic programming. Verb labels corresponding to actions in the videos are then produced using Hidden Markov Models (HMMs). After getting the verb, all tracks are merged to generate template based sentences that comply with grammar rules.

Fig. 6: Example of the Subject-Verb-Object-Place (SVOP) [154] approach where confidences are obtained by integrating probabilities from visual recognition system, with statistics from out of domain English text corpora to determine the most likely SVOP tuple. The red block shows low probability given to a correct object by the visual system that is rectified by the high probability from the linguistic model.

Despite the reasonably accurate lingual descriptions generated for videos in constrained environments, the aforementioned methods have trouble scaling to accommodate the increased number of objects and actions in open domain and large video corpora. To incorporate all the relevant concepts, these methods require customized detectors for each entity. Furthermore, the texts generated by the methods of the time were mostly lists of keywords put together using grammars and templates without any semantic verification. To address the lack of semantic verification, Das et al. [42] proposed a hybrid method that produces content of high relevance compared to simple keyword annotation methods. They borrowed ideas from image captioning techniques. This hybrid model comprises three steps arranged in a hierarchical manner. First, in a bottom up approach, keywords are predicted using low level video features. In this approach they first find a proposal distribution over the training set of vocabulary using multimodal latent topic models. Then, by using grammar rules and parts of speech (POS) tagging, the most probable subjects, objects and verbs are selected. Second, in a top down approach, a set of concepts is detected and stitched together. A tripartite graph template is then used for converting the stitched concepts to a natural language description. Finally, for semantic verification, they produced a ranked set of natural language sentences by comparing the predicted keywords with the detected concepts. Quantitative evaluation of this hybrid method shows that it was able to generate more relevant content compared to its predecessors [25, 78].

(3) SVO Methods for Open Domain Videos: While most of the previously mentioned works are restricted to constrained domains, Krishnamoorthy et al. [88] led the early work on describing open domain videos. They used selected open domain YouTube videos; however, the subjects and objects were limited to the 20 entities that were available in the classifier training set. Their main contribution is the introduction of text mining over web-scale text corpora to aid the selection of the best SVO tuple and thereby improve sentence coherence.

In addition to focusing on open domain videos and utilizing web-scale text corpora, Guadarrama et al. [65] and Thomason et al. [154] started dealing with relatively larger vocabularies. In contrast to Krishnamoorthy et al. [88], instead of using only the 20 objects in the PASCAL dataset [49], all videos of the YouTube corpus are used for the detection of 241 objects, 45 subjects, and 218 verbs. To describe short YouTube videos, Guadarrama et al. [65] proposed a novel language driven approach. They introduced “zero-shot” verb recognition for selecting verbs unseen in the training set. For example, if the subject is “person”, the object is “car” and the model-predicted verb is “move”, then the most suitable verb would be “drive”. Thomason et al. [154] used visual recognition techniques on YouTube videos for probabilistic estimation of subjects, verbs, and objects. Their approach is illustrated in Figure 6. The object and action classifiers were trained on ImageNet [141]. In addition to detecting subjects, verbs and objects, places (12 scenes) where actions are performed, e.g. kitchen or playground, are also identified. To further improve the accuracy of assigning visually detected entities to the right category, probabilities derived from language statistics over four “out of domain” English text corpora (English Gigaword, British National Corpus (BNC), ukWac and WaCkypedia EN) are used to enhance the confidence of word-category alignment for sentence generation. A small “in domain” corpus comprising human-annotated sentences for the video description dataset is also constructed and incorporated into the sentence generation stage. Co-occurring bi-gram (SV, VO, and OP) statistics for the candidate SVOP tuples are calculated using both the “out of domain” and the “in domain” corpora, and are used in a Factor Graph Model (FGM) to predict the most probable SVO and place combination. Finally, the detected SVOP tuple is used to generate an English sentence through a template based approach.
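The interplay between visual confidences and bi-gram language statistics described above can be illustrated with a small sketch. It is not the Factor Graph Model of [154]: it simply enumerates all candidate tuples and adds log visual confidences to log bi-gram scores for the SV, VO and OP pairs. All words, probabilities, the `floor` value for unseen bi-grams, and the function names are hypothetical and chosen only to reproduce the situation in Figure 6, where a correct object with low visual confidence is recovered by the language statistics.

```python
import math
from itertools import product

# Hypothetical per-slot visual confidences (S, V, O, P).
VISUAL = {
    "S": {"person": 0.9, "dog": 0.1},
    "V": {"ride": 0.7, "walk": 0.3},
    "O": {"car": 0.6, "motorbike": 0.4},  # correct object gets the lower score
    "P": {"road": 0.8, "kitchen": 0.2},
}

# Hypothetical bi-gram statistics for (S,V), (V,O) and (O,P) pairs.
BIGRAM = {
    ("person", "ride"): 0.5, ("person", "walk"): 0.4, ("dog", "walk"): 0.5,
    ("ride", "motorbike"): 0.6, ("ride", "car"): 0.05,
    ("motorbike", "road"): 0.5, ("car", "road"): 0.5,
}


def score(tup, visual, bigram, use_language=True, floor=0.01):
    """Log-score of one SVOP tuple; unseen bi-grams get a small floor value."""
    s, v, o, p = tup
    total = sum(math.log(visual[slot][word]) for slot, word in zip("SVOP", tup))
    if use_language:
        for pair in ((s, v), (v, o), (o, p)):
            total += math.log(bigram.get(pair, floor))
    return total


def best_svop(visual, bigram, use_language=True):
    """Exhaustively search all candidate tuples and return the best one."""
    candidates = product(*(visual[slot] for slot in "SVOP"))
    return max(candidates, key=lambda t: score(t, visual, bigram, use_language))


vision_only = best_svop(VISUAL, BIGRAM, use_language=False)
with_language = best_svop(VISUAL, BIGRAM, use_language=True)
print(vision_only)
print(with_language)
```

With vision alone the object slot follows the stronger detector response, whereas adding the bi-gram terms favours the object that co-occurs plausibly with the verb. Exhaustive enumeration is only practical for toy vocabularies, which is one reason a structured model is needed at realistic sizes.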

Classical methods focused mainly on the detection of pre-defined entities and events separately. These methods then tried to describe the detected entities and events using template based sentences. However, to describe open domain videos or those with more events and entities, classical methods must employ object and action detection techniques for each entity which is unrealistic due to the computational complexity. Moreover, template based descriptions are insufficient to describe all possible events in videos given the linguistic complexity and diversity. Consequently, these methods failed to describe semantically rich videos.

Fig. 7: Deep learning based video description techniques in the literature comprise two main stages. The first stage involves visual content extraction, whose output is represented either by a fixed length vector or by dynamic vectors. The second stage takes the visual representation vectors from the first stage as input for text generation and produces single or multiple sentences.

2.2 Statistical Methods

Naïve SVO tuple rule-based engineering approaches are indeed inadequate to describe open domain videos and large datasets, such as YouTubeClips [35], TACoS-MultiLevel [131], MPII-MD [133], and M-VAD [156]. These datasets contain very large vocabularies as well as tens of hours of videos. There are three important differences between these open domain datasets and previous ones. Firstly, open domain videos contain an unforeseeably diverse set of subjects, objects, activities and places. Secondly, due to the sophisticated nature of human languages, such datasets are often annotated with multiple viable meaningful descriptions. Thirdly, the videos to be described are often long, potentially stretching through many hours. Describing such videos with multiple sentences or even paragraphs therefore becomes more desirable.

To avoid the tedious efforts required in rule-based engineering methods, Rohrbach et al. [136] proposed a machine learning method to convert visual content into natural language. They used parallel corpora of videos and associated annotations. Their method follows a two step approach. First, it learns to represent the video as intermediate semantic labels using maximum a posteriori (MAP) estimation. Then, it translates the semantic labels into natural language sentences by using techniques borrowed from Statistical Machine Translation (SMT) [84]. In this machine translation approach, the intermediate semantic label representation is the source language while the expected annotations are regarded as the target language.

For the object and activity recognition stages, the research moved from earlier threshold-based detection [85] to manual feature engineering and traditional classifiers [88, 42, 65, 154]. For the sentence generation stage, an uptake of machine learning methods can be observed in recent years to address the issue of large vocabulary. This is also evidenced by the trend in recent methods that use models for lexical entries that are learned in a weakly supervised [131, 136, 178, 183] or fully supervised [39, 65, 88, 150] fashion. However, the separation of the two stages makes this camp of methods incapable of capturing the interplay of visual features and linguistic patterns, let alone learning a transferable state space between visual artifacts and linguistic representations. In the next section, we review the deep learning methods and discuss how they address the scalability, language complexity and domain transferability issues faced by open domain video description.

Fig. 8: Summary of deep learning based video description methods. Most methods employ mean pooling of frame representations to represent a video. More advanced methods use attention mechanisms, semantic attribute learning, and/or employ a sequence-to-sequence approach. These methods differ in whether the visual features are fed only at first time step or all time steps of the language model.

2.3 Deep Learning Models

The whirlwind success of deep learning in almost all sub-fields of computer vision has also revolutionized video description approaches. In particular, Convolutional Neural Networks (CNNs) [89] are the state of the art for modeling visual data and excel at tasks such as object recognition [89, 148, 152]. Long Short-Term Memory (LSTM) networks and the more general deep Recurrent Neural Networks (RNNs), on the other hand, are now dominating the area of sequence modeling, setting new benchmarks in machine translation [151, 38], speech recognition [63] and the closely related task of image captioning [46, 164]. While conventional methods struggle to cope with large-scale, more complex and diverse datasets for video description, researchers have combined these deep nets in various configurations with promising performances.

As shown in Figure 7, the deep learning approaches to video description can also be divided into two sequential stages, namely, visual content extraction and text generation. However, in contrast to the SVO Tuple Methods in Section 2.1, where lexical word tokens are generated as a result of the first stage through visual content extraction, visual features represented by fixed or dynamic real-valued vectors are produced instead. This is often referred to as the video encoding stage. CNNs, RNNs or Long Short-Term Memory (LSTM) networks are used in this encoding stage to learn these visual features, which are then used in the second stage for text generation, also known as the decoding stage. For decoding, different flavours of RNNs are used, such as deep RNN, bi-directional RNN, LSTM or Gated Recurrent Units (GRU). The resulting description can be a single sentence or multiple sentences. Figure 8 illustrates a typical end-to-end video description system with encoder-decoder stages. The encoding part is followed by transformations such as mean pooling, temporal encoding or attention mechanisms to represent the visual content. Some methods apply sequence-to-sequence learning and/or semantic attribute learning in their frameworks. The aforementioned mechanisms have been used in different combinations by contemporary methods. We group the literature based on the different combinations of deep learning architectures for encoding and decoding stages, namely:

  • CNN - RNN Video Description, where convolution architectures are used for visual encoding and recurrent structures are used for decoding. This is the most common architecture employed in deep learning based video description methods;

  • RNN - RNN Video Description, where recurrent networks are used for both stages; and

  • Deep reinforcement networks, the relatively new research area for video description.
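The two-stage encoder-decoder structure shared by the groups above can be sketched in a few lines. The following is only a toy illustration under stated assumptions: the encoder is plain mean pooling, the decoder is a vanilla RNN with greedy word selection, the vocabulary and dimensions are made up, and the weights are random and untrained, so the generated words are meaningless. It shows the data flow only and is not the architecture of any method surveyed here.

```python
import numpy as np

# Hypothetical toy vocabulary with begin/end-of-sentence markers.
VOCAB = ["<bos>", "<eos>", "a", "man", "is", "cooking"]


def encode(frames):
    """Stage 1: map per-frame features (T, D) to one fixed-length vector (D,)."""
    return frames.mean(axis=0)


def decode(video_vec, params, max_len=10):
    """Stage 2: greedy word-by-word generation with a vanilla RNN."""
    W_v, W_e, W_h, W_o = params
    h = np.tanh(W_v @ video_vec)  # visual features enter only at the first step
    token = VOCAB.index("<bos>")
    words = []
    for _ in range(max_len):
        h = np.tanh(W_e[token] + W_h @ h)   # update state from previous word
        token = int(np.argmax(W_o @ h))     # pick the highest-scoring word
        if VOCAB[token] == "<eos>":
            break
        words.append(VOCAB[token])
    return words


rng = np.random.default_rng(0)
D, H = 8, 16  # feature and hidden sizes, chosen arbitrarily
params = (
    rng.normal(size=(H, D)),           # W_v: video vector -> initial state
    rng.normal(size=(len(VOCAB), H)),  # W_e: word embeddings
    rng.normal(size=(H, H)),           # W_h: recurrent weights
    rng.normal(size=(len(VOCAB), H)),  # W_o: state -> vocabulary scores
)
frames = rng.normal(size=(30, D))      # stand-in for CNN frame features
caption = decode(encode(frames), params)
print(caption)
```

Feeding the visual vector only at the first time step is one of the two options mentioned in the caption of Fig. 8; the alternative would add it to the state update at every step.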

2.3.1 CNN-RNN Video Description

Given its success in computer vision and simplicity, CNN is still by far the most popular network structure used for visual encoding. The encoding process can be broadly categorized into fixed-size and variable-size video encoding.

Donahue et al. [46] were the first to use deep neural networks to solve the video captioning problem. They proposed three architectures for video description. Their model assumes that CRF based predictions of subjects, objects, and verbs are available after a full pass over the complete video. This allows the architecture to observe the complete video at each time step. The first architecture, LSTM encoder-decoder with CRF max, is motivated by the statistical machine translation (SMT) based video description approach by Rohrbach et al. [136] mentioned earlier in Section 2.2. Recognizing the state of the art machine translation performance of LSTMs, the SMT module in [136] is replaced with a stacked LSTM comprising two layers for encoding and decoding. Similar to [151], the first LSTM layer encodes the one-hot vector of the input sentence allowing for variable-length inputs. The final hidden representation from the first encoder stage is then fed into the decoder stage to generate a sentence by producing one word per time step. Another variant of the architecture, LSTM decoder with CRF max, incorporates max predictions. This architecture encodes the semantic representation into a fixed length vector. Similar to image description, the LSTM is able to see the whole visual content at every time step. An advantage of the LSTM is that it is able to incorporate probability vectors during training as well as testing. This virtue is exploited in the third variant of the architecture, LSTM decoder with CRF probabilities. Instead of using max predictions as in the second variant (LSTM decoder with CRF max), this architecture incorporates probability distributions. Although the LSTM outperformed the SMT based approach of [136], it was still not trainable in an end-to-end fashion.
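The difference between the "CRF max" and "CRF probabilities" decoder inputs can be made concrete with a small sketch. The label sets, the probabilities and the function names below are hypothetical; the only point is how a fixed length input vector is built from per-role predictions in each variant.

```python
# Hypothetical label sets for the three roles.
SUBJECTS = ["person", "hand"]
VERBS = ["cut", "pour", "wash"]
OBJECTS = ["carrot", "knife", "bowl"]


def one_hot_of_max(probs):
    """Keep only the most probable label of a role as a one-hot vector."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]


def encode_crf_max(crf):
    """'CRF max' style input: concatenated one-hot vectors of the arg-max labels."""
    return [x for probs in crf for x in one_hot_of_max(probs)]


def encode_crf_probabilities(crf):
    """'CRF probabilities' style input: concatenated full distributions."""
    return [p for probs in crf for p in probs]


# Hypothetical CRF output: one distribution per role (subject, verb, object).
crf = [[0.9, 0.1], [0.5, 0.4, 0.1], [0.45, 0.15, 0.40]]

max_input = encode_crf_max(crf)
prob_input = encode_crf_probabilities(crf)
print(max_input)
print(prob_input)
```

Both vectors have the same fixed length, but the max variant discards the near tie between "carrot" (0.45) and "bowl" (0.40), whereas the probability variant passes that uncertainty on to the decoder.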

In contrast to the work by Donahue et al. [46], where an intermediate role representation was adopted, Venugopalan et al. [162] presented the first end-to-end trainable network architecture for generating natural language descriptions of videos. Their model is able to simultaneously learn the semantic as well as grammatical structure of the associated language. Moreover, Donahue et al. [46] presented results on domain specific cooking videos comprising pre-defined objects and actors. On the other hand, Venugopalan et al. [162] reported results on open domain YouTube Clips [34]. To avoid supervised intermediate representations, they connected an LSTM directly to the output of the CNN. The CNN extracts visual features whereas the LSTM models the sequence dynamics. They transformed a short video into a fixed length visual input using a CNN model [75] that is slightly different from AlexNet [89]. The CNN model [75] was learned using the ILSVRC-2012 object classification dataset (comprising 1.2M images), which is a subset of ImageNet [141]. It provides a robust and efficient way, without manual feature selection, to initialize object recognition in the videos. They sampled every tenth frame in the video and extracted features for all sampled frames from the fc7 layer of the CNN. Furthermore, they represented a complete video by averaging all the extracted frame-wise feature vectors into a single vector. These feature vectors are then fed into a two-layered LSTM [64]. The feature vectors from the CNN form the input to the first layer of the LSTM. A second LSTM layer is stacked on top of the first LSTM layer, where the hidden state of the first LSTM layer becomes the input to the second LSTM unit for caption generation. In essence, transforming multiple frame-based feature vectors into a single aggregated video-based vector reduces the video description problem to an image captioning one. This end-to-end model performed better than the previous video description systems at the time and was able to effectively generate the sequence without any templates. However, as a result of simple averaging, valuable temporal information of the video, such as the order of appearance of any two objects, is lost. Therefore, this approach is only suitable for generating captions for short clips with a single major action in the clip.
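The frame sampling and averaging step described above, and the loss of temporal order it causes, can be shown with a minimal sketch. Random numbers stand in for real fc7 activations, the 4096-dimensional width is an assumption based on AlexNet-style networks, and the function names are illustrative only.

```python
import numpy as np


def sample_frames(frame_features, stride=10):
    """Keep every `stride`-th frame feature, starting from the first frame."""
    return frame_features[::stride]


def mean_pool(sampled):
    """Average frame-wise features (T, D) into one video-level vector (D,)."""
    return sampled.mean(axis=0)


rng = np.random.default_rng(0)
features = rng.normal(size=(120, 4096))  # stand-in for per-frame fc7 features

sampled = sample_frames(features)        # every tenth frame
forward = mean_pool(sampled)
backward = mean_pool(sampled[::-1])      # same frames, reversed order

print(sampled.shape, forward.shape)
print(np.allclose(forward, backward))
```

Because the mean is invariant to the order of its inputs, a clip played forwards and the same clip played backwards yield the same representation, which is exactly the temporal information the paragraph above notes as lost.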

Open domain videos are rich in complex interactions among actors and objects. Representation of such videos using a temporally averaged single feature vector is, therefore, prone to produce clutter. Consequently, the descriptions produced are bound to be inadequate because valuable temporal ordering information of events is not captured in the representation. With the success of C3D [158] in capturing spatio-temporal action dynamics in videos, Li et al. [179] proposed a novel 3D-CNN to model the spatio-temporal information in videos. Their 3D-CNN is based on GoogLeNet [152] and pre-trained on an activity recognition dataset. It captures local fine motion information between consecutive frames. This local motion information is subsequently summarized and preserved through higher-level representations by modeling a video as a 3D spatio-temporal cuboid. It is further represented by concatenation of HoG, HoF and MbH [41, 168]. These transformations not only help capture local motion features but also reduce the computation of the subsequent 3D CNN. For global temporal structure, a temporal attention mechanism is proposed and adapted from soft attention [21]. Using 3D CNN and attention mechanisms in RNN, they were able to improve results. Recently, GRU-EVE [13] was proposed as an effective and computationally efficient technique for video captioning. GRU-EVE uses a standard GRU for language modeling but with Enriched Visual Encoding as follows. It applies the Short Fourier Transform on 2D/3D-CNN features in a hierarchical manner to encapsulate the spatio-temporal video dynamics. The visual features are further enriched with high level semantics of the detected objects and actions in the video. Interestingly, the enriched features obtained by applying the Short Fourier Transform on 2D-CNN features alone [13] outperform C3D [158] features.
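To make the idea of temporal soft attention mentioned above concrete, the sketch below computes one attention step over a sequence of frame features. It uses the common additive scoring form; the exact parameterization in [179] is not reproduced here, and the dimensions, the random weights and the function names are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def temporal_attention(features, hidden, W, U, w):
    """One soft-attention step.

    features: (T, D) frame features, hidden: (H,) decoder state,
    W: (A, H), U: (A, D), w: (A,) are (here random) attention parameters.
    Returns the context vector (D,) and the attention weights (T,).
    """
    scores = np.tanh(W @ hidden + features @ U.T) @ w  # relevance per frame, (T,)
    alpha = softmax(scores)                            # weights sum to one
    context = alpha @ features                         # weighted sum of frames
    return context, alpha


rng = np.random.default_rng(0)
T, D, H, A = 12, 32, 16, 8  # arbitrary sizes
features = rng.normal(size=(T, D))
hidden = rng.normal(size=H)
W, U, w = rng.normal(size=(A, H)), rng.normal(size=(A, D)), rng.normal(size=A)

context, alpha = temporal_attention(features, hidden, W, U, w)
print(alpha.round(3))
```

Unlike mean pooling, which weights every frame equally, the weights here depend on the current decoder state, so different frames can dominate the context vector at different words of the caption.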

Unlike the fixed video representation models discussed above, variable visual representation models are able to directly map input videos comprising different numbers of frames to variable length outputs (words or sentences), and are successful in modeling various complex temporal dynamics. Venugopalan et al. [161] proposed an architecture to address the variable representation problem for both the input (video frames) and the output (sentence) stage. For that purpose they used a two-layered LSTM framework, where the sequence of video frames is input to the first layer of the LSTM. The hidden state of the first LSTM layer forms the input to the second layer of the LSTM. The output of the second LSTM layer is the associated caption. The LSTM parameters are shared in both stages. Although sequence-to-sequence learning had previously been used in machine translation [151], this is the first method [161] to use a sequence-to-sequence approach in video captioning. Later methods have adopted a similar framework, with minor variations including attention mechanisms [179], building a common visual-semantic embedding [117] or using out of domain knowledge either with language models [160] or visual classifiers [132].

While deep learning has achieved much better results compared to previously used classifier based approaches, most methods aimed at producing one sentence from a video clip containing only one major event. In real-world applications, videos generally contain more than a single event. Describing such multi-event and semantically rich videos with only one sentence ends up being overly simplified, and hence, uninformative. For example, instead of saying “someone sliced the potatoes with a knife, chopped the onions into pieces and put the onions and potatoes into the pot”, a single sentence generation method would probably say “someone is cooking”. Yu et al. [184] proposed a hierarchical recurrent neural network (h-RNN) that applies attention mechanisms on both the temporal and spatial aspects. They focused on the sentence decoder and introduced a hierarchical framework that comprises a sentence generator and, on top of that, a paragraph generator. First, a Gated Recurrent Unit (GRU) layer takes video features as input and generates a single short sentence. The other recurrent layer generates paragraphs using context and the sentence vectors obtained from the sentence generator. The paragraph generator thus captures the dependencies between sentences and generates a paragraph of sentences that are related. Recently, Krishna et al. [87] introduced the concept of dense-captioning of events in a video and employed action detection techniques to predict the temporal intervals. They proposed a model to extract multiple events with one single pass of a video, attempting to describe the detected events simultaneously. This is the first work of its kind detecting and describing multiple and overlapping events in a video. However, the model did not achieve significant improvement on the captioning benchmark.

2.3.2 RNN - RNN Video Description

Although not as popular as the CNN-RNN framework, another approach is to also encode the visual information using RNNs. Srivastava et al. [149] use one LSTM to extract features from video frames (i.e. encoding) and then pass the feature vector through another LSTM for decoding. They also introduced some variants of their models and predicted the future sequences from the previous frames. The authors adopted a machine translation model [151] for visual recognition but could not achieve significant improvement in classification accuracy.

Yu et al. [184] proposed a similar approach and used two RNN structures for the video description task. Their configuration is a hierarchical decoder with multiple Gated Recurrent Units (GRU) for sentence generation. The output of this decoder is then fed to a paragraph generator which models the time dependencies between the sentences while focusing on linguistic aspects. The authors improved the state-of-the-art results for video description; however, their method is inefficient for videos involving fine-grained activities and small interactive objects.

2.3.3 Deep Reinforcement Learning Models

Deep Reinforcement Learning (DRL) has out-performed humans in many real-world games. In DRL, artificially intelligent agents learn from the environment through trial and error and adjust learning policies purely from environmental rewards or punishments. DRL approaches have been popularized by Google DeepMind [110, 109] since 2013. Due to the absence of a straightforward cost function, learning mechanisms in this approach are considerably harder to devise as compared to traditional supervised techniques. Two distinct challenges are evident in reinforcement learning when compared with conventional supervised approaches: (1) The model does not have full access to the function being optimized. It has to query the function through interaction. (2) The interaction with the environment is state based where the present input depends on previous actions. The choice of reinforcement learning algorithms then depends on the scope of the problem at hand. For example, variants of the Hierarchical Reinforcement Learning (HRL) framework have been applied to Atari games [92, 163]. Similarly, different variants of DRL have been used to meet the challenging requirements of image captioning [129] as well as video description [120, 172, 96, 121, 37].

Wang et al. [172] proposed a fully-differentiable neural network architecture using reinforcement learning for video description. Their method follows a general encoder-decoder framework. The encoding stage captures the video frame features using ResNet-152 [71]. The frame level features are processed through a two stage encoder, i.e. a low level LSTM [142] followed by a high level LSTM [72]. For decoding, they employed HRL to generate the natural language description word by word. The HRL agent comprises three components: a high level manager that sets goals, a low level worker that accomplishes the tasks set by the manager, and an internal critic that ascertains whether a task has been accomplished and informs the manager accordingly so that it can update the goals. The process iterates until the end-of-sentence token is reached. This method is demonstrated to be capable of capturing more details of the video content, thus generating more fine-grained descriptions. However, this method has shown very little improvement over existing baseline methods.

In 2018, Chen et al. [37] proposed an RL based model that selects key informative frames to represent a complete video, in an attempt to minimize noise and unnecessary computations. Key frames are selected such that they maximize visual diversity and minimize the textual discrepancy. Hence, a compact subset of 6-8 frames on average can represent a full video. Evaluated against several popular benchmarks, it was demonstrated that video captions can be produced without performance degradation but at a significantly reduced computational cost. The method did not use motion features for encoding, a design trade-off between speed and accuracy.

DRL based methods are gaining popularity and have shown comparable results in video description. Due to their unconventional learning methodology, DRL methods are unlikely to suffer from paucity of labelled training data, hardware constraints and overfitting problems. Therefore, these methods are expected to flourish.

3 Datasets

The availability of labeled datasets for video description has been the main driving force behind the fast advancement of this research area. In this survey, we summarize the characteristics of these datasets and give an overview in Table I. The datasets are categorized into four main classes, namely Cooking, Movies, Videos in the Wild and Social Media. In most of the datasets, a single caption is assigned per video, except for a few datasets which contain multiple sentences or even paragraphs per video snippet.


| Dataset | Domain | # classes | # videos | avg len | # clips | # sent | # words | vocab | len (hrs) |
|---|---|---|---|---|---|---|---|---|---|
| MSVD [34] | open | 218 | 1970 | 10 sec | 1,970 | 70,028 | 607,339 | 13,010 | 5.3 |
| MPII Cooking [135] | cooking | 65 | 44 | 600 sec | - | 5,609 | - | - | 8.0 |
| YouCook [42] | cooking | 6 | 88 | - | Nil | 2,688 | 42,457 | 2,711 | 2.3 |
| TACoS [126] | cooking | 26 | 127 | 360 sec | 7,206 | 18,227 | 146,771 | 28,292 | 15.9 |
| TACoS-MLevel [131] | cooking | 1 | 185 | 360 sec | 14,105 | 52,593 | 2,000 | - | 27.1 |
| MPII-MD [133] | movie | - | 94 | 3.9 sec | 68,337 | 68,375 | 653,467 | 24,549 | 73.6 |
| M-VAD [156] | movie | - | 92 | 6.2 sec | 48,986 | 55,904 | 519,933 | 17,609 | 84.6 |
| MSR-VTT [177] | open | 20 | 7,180 | 20 sec | 10,000 | 200,000 | 1,856,523 | 29,316 | 41.2 |
| Charades [147] | human | 157 | 9,848 | 30 sec | - | 27,847 | - | - | 82.01 |
| VTW [188] | open | - | 18,100 | 90 sec | - | 44,613 | - | - | 213.2 |
| YouCook II [191] | cooking | 89 | 2,000 | 316 sec | 15.4k | 15.4k | - | 2,600 | 176.0 |
| ActyNet Cap [87] | open | - | 20,000 | 180 sec | - | 100,000 | 1,348,000 | - | 849.0 |
| ANet-Entities [190] | social media | - | 14,281 | 180 sec | 52k | - | - | - | - |
| VideoStory [59] | social media | - | 20k | - | 123k | 123k | - | - | 396.0 |

TABLE I: Standard datasets for benchmarking video description methods.

3.1 Cooking

3.1.1 MP-II Cooking

Max Planck Institute for Informatics (MP-II) Cooking dataset [135] comprises 65 fine grained cooking activities, performed by 12 participants preparing 14 dishes such as fruit salad and cake. The data are recorded in the same kitchen with a camera installed on the ceiling. The 65 cooking activities include “wash hands”, “put in bowl”, “cut apart” and “take out from drawer”. When the person is not in the scene for 30 frames (one second) or is performing an activity that is not annotated, a “background activity” is generated. These fine grained activities, for example “cut slices”, “pour”, or “spice”, are differentiated by movements with low inter-class and high intra-class variability. In total, the dataset comprises 44 videos (888,775 frames), with an average length per clip of approximately 600 seconds. The dataset spans a total play length of 8 hours for all videos, and has 5,609 annotations.

3.1.2 YouCook

The YouCook dataset [42] consists of 88 YouTube cooking videos of different people cooking various recipes. The background (kitchen/scene) is different in most of the videos. This dataset represents a more challenging visual problem than the MP-II Cooking [135] dataset that is recorded with a fixed camera view point in the same kitchen and with the same background. The dataset is divided into six different cooking styles, for example grilling, baking etc. For machine learning, the training set contains 49 videos and the test set contains 39 videos. Frame wise annotations of objects and actions are also provided for the training videos. The object categories for the dataset include “utensils”, “bowls” and “food” etc. Amazon Mechanical Turk (AMT) was employed for human generated multiple natural language descriptions of each video. Each AMT worker provided at least three sentences per video as a description, and on average 8 descriptions were collected per video. See Figure 9(b) for example clips and descriptions.

3.1.3 TACoS

Textually Annotated Cooking Scenes (TACoS) is a subset of MP-II Composites [137]. TACoS was further processed to provide coherent textual descriptions for high quality videos. Note that MP-II Composites contains more videos but fewer activities than MP-II Cooking [135]. It contains 212 high resolution videos with 41 cooking activities. Videos in the MP-II Composites dataset range in length from 1 to 23 minutes with an average length of 4.5 minutes. The TACoS dataset was constructed by filtering MP-II Composites, restricting it to only those activities that involve manipulation of cooking ingredients and have at least 4 videos for the same activity. As a result, TACoS contains 26 fine grained cooking activities in 127 videos. AMT workers were employed to align the sentences and associated videos, for example “preparing carrots”, “cutting a cucumber” or “separating eggs”. For each video, 20 different textual descriptions were collected. The dataset comprises 11,796 sentences containing 17,334 action descriptions. A total of 146,771 words are used in the dataset. Almost 50% of the words, i.e. 75,210, describe the content, for example nouns, verbs and adjectives. These words include a vocabulary of 28,292 verb tokens. The dataset also provides the alignment of sentences describing activities by obtaining approximate time stamps where each activity starts and ends. Figure 9(d) shows some example clips and descriptions.

3.1.4 TACoS-MultiLevel

TACoS Multilevel [131] corpus annotations were also collected via AMT workers on the TACoS corpus [126]. For each video in the TACoS corpus, three levels of descriptions were collected that include: (1) detailed description of video with no more than 15 sentences per video; (2) a short description that comprises 3-5 sentences per video; and finally (3) a single sentence description of the video. Annotation of the data is provided in the form of tuples such as object, activity, tool, source and target with a person always being the subject. See Figure 9(e) for example clips and descriptions.

3.1.5 YouCook II

The YouCook-II dataset [191] consists of 2000 videos uniformly distributed over 89 recipes. The cooking videos are sourced from YouTube and offer all challenges of open domain videos such as variations in camera position, camera motion and changing backgrounds. The complete dataset spans a total play time of 175.6 hrs and has a vocabulary of 2600 words. The videos are further divided into 3-16 segments per video with an average of 7.7 segments per video elaborating procedural steps. Individual segment length varies from 1 to 264 seconds. All segments are temporally localized and annotated. The average length of each video is 316 seconds reaching up to a maximum of 600 seconds. The dataset is randomly split into train, validation and test sets with the ratio of 66%:23%:10% respectively.

Fig. 9: Example video frames (3 non-consecutive frames per clip) and captions from the various benchmark video description datasets. C1-C5 represent the associated (exemplary) captions from the dataset.

3.2 Movies

3.2.1 MPII-MD

MPII-Movie Description Corpus [133] contains transcribed audio descriptions extracted from 94 Hollywood movies. These movies are subdivided into 68,337 clips with an average length of 3.9 seconds paired with 68,375 sentences amounting to almost one sentence per clip. Every clip is paired with one sentence that is extracted from the script of the movie and the audio description data. The Audio Descriptions (ADs) were collected by first retrieving the audio streams from the movies using MakeMKV and Subtitle Edit. These audio streams were then transcribed using a crowd sourced transcription service [3]. Then the transcribed texts were aligned with associated spoken sentences using their time stamps. In order to remove the misalignments of audio content with the visual content itself, each sentence was also manually aligned with the corresponding video clip. During the manual alignment process, sentences describing content not present in the video clip were also filtered out. The audio description track is an added feature in the dataset that tries to describe the visual content to help visually impaired persons. The total time span of the dataset videos is almost 73.6 hours, and the corpus contains 653,467 words with a vocabulary of 24,549 (Table I). Example clips and descriptions are shown in Figure 9(f).

3.2.2 M-VAD

Montreal Video Annotation Dataset (M-VAD) [156] is based on the Descriptive Video Service (DVS) and contains 48,986 video clips from 92 different movies. Each clip spans 6.2 seconds on average and the total duration of the complete dataset is 84.6 hours. The total number of sentences is 55,904, with a few clips associated with more than one sentence. The vocabulary of the dataset spans about 17,609 words (Nouns-9,512: Verbs-2,571: Adjectives-3,560: Adverbs-857). The dataset split consists of 38,949, 4,888 and 5,149 video clips for training, validation and testing respectively. See Figure 9(g) for example clips and descriptions.

3.3 Social Media

3.3.1 VideoStory

VideoStory [59] is a multi-sentence description dataset comprising 20k social media videos. This dataset aims to address story narration or description generation for long videos that may not be sufficiently illustrated with a single sentence. Each video is paired with at least one paragraph. The average number of temporally localized sentences per paragraph is 4.67. There are a total of 26245 paragraphs in the dataset comprising 123k sentences with an average of 13.32 words per sentence. On average, each paragraph covers 96.7% of video content. The dataset contains about 22% temporal overlap between co-occurring events. The dataset has a training, validation and test split of 17908, 999, and 1011 videos respectively and also proposes a blind test set comprising 1039 videos. Each training video is accompanied by one paragraph; however, videos in the validation and test sets have three paragraphs each for evaluation. Annotations for the blind test set are not released and are only available on a server for benchmarking different methods.

3.3.2 ActivityNet Entities

ActivityNet Entities dataset (or ANet-Entities) [190] is the first video dataset with entity grounding and annotations. This dataset is built on the training and validation splits of the ActivityNet Captions dataset [87], but with different captions. In this dataset, noun phrases (NPs) of video descriptions have been grounded to bounding boxes in the video frames. The dataset comprises 14281 annotated videos, 52k video segments with at least one noun phrase annotated per segment, and 158k bounding boxes with annotations. The dataset employs a training set (10k) similar to that of ActivityNet Captions. However, the validation set of ActivityNet Captions is randomly and evenly split into ANet-Entities validation (2.5k) and testing (2.5k) sets.

3.4 Videos in the Wild

3.4.1 MSVD

Microsoft Video Description (MSVD) dataset [34] comprises 1,970 YouTube clips with human annotated sentences. This dataset was also annotated by AMT workers. The audio is muted in all clips to avoid bias from lexical choices in the descriptions. Furthermore, videos containing subtitles or overlaid text were removed during the quality control process of the dataset formulation. Finally, manual filtering was carried out over the submitted videos to ensure that each video met the prescribed criteria and was free of inappropriate and ambiguous content. The duration of each video in this dataset is typically between 10 and 25 seconds, mainly showing one activity. The dataset comprises multilingual (such as Chinese, English, German etc.) human generated descriptions. On average, there are 41 single sentence descriptions per clip. This dataset has been frequently used by the research community, as detailed in the Results Section 6. Almost all research groups have split this dataset into training, validation and testing partitions of 1200, 100 and 670 videos respectively. Figure 9(a) shows example clips and descriptions from the MSVD dataset.

3.4.2 MSR-VTT

MSR-Video to Text (MSR-VTT) [177] contains a wide variety of open domain videos for the video captioning task. It comprises 7180 videos subdivided into 10,000 clips. The clips are grouped into 20 different categories. An example is shown in Figure 9(c). The dataset is divided into 6513 training, 497 validation and 2990 test videos. Each video is annotated with 20 reference captions written by AMT workers. In terms of the number of clips with multiple associated sentences, this is one of the largest video captioning datasets. In addition to video content, this dataset also contains audio information that can potentially be used for multimodal research.

3.4.3 Charades

The Charades dataset [147] contains 9848 videos of daily indoor household activities. These videos were recorded by 267 AMT workers from three different continents. The workers were given scripts describing actions and objects and were required to follow the scripts to perform the actions with the specified objects. The objects and actions used in the scripts are from a fixed vocabulary. Videos are recorded in 15 different indoor scenes and restricted to 46 objects and 157 action classes. The dataset comprises 66500 annotations describing the 157 actions. It also provides 41104 labels for its 46 object classes. Moreover, it contains 27847 descriptions covering all the videos. The videos in the dataset depict daily life activities with an average duration of 30 seconds. The dataset is split into 7985 and 1863 videos for training and testing respectively.

3.4.4 VTW

Video Titles in the Wild (VTW) [188] contains 18100 video clips with an average of 1.5 minutes duration per clip. Each clip is described with one sentence only. However, it incorporates a diverse vocabulary, where on average one word appears in not more than two sentences across the whole dataset. Besides the single sentence per video, the dataset also provides accompanying descriptions (known as augmented sentences) that describe information not present in the visual content of the clip. The dataset is proposed for video title generation as opposed to video content description but can also be used for language-level understanding tasks including video question answering.

3.4.5 ActivityNet Captions

ActivityNet Captions dataset [87] contains 100k dense natural language descriptions of about 20k videos from ActivityNet [193] that correspond to approximately 849 hours. On average, each description is composed of 13.48 words and covers about 36 seconds of video. There are multiple descriptions for every video and when combined together, these descriptions cover 94.6% content present in the entire video. In addition, 10% temporal overlap makes the dataset especially interesting and challenging for studying multiple events occurring at the same time. An example of this dataset is given in Figure 9(h).

4 Video Description Competitions

Another major driving force of the fast-paced development in video description research comes from the many competitions and challenges organized by companies and conferences in recent years. Some of the major competitions are listed below.

| Dataset split | # movies | # clips | # words | # sentences | Avg. length (sec) | Total length (hrs) |
|---|---|---|---|---|---|---|
| LSMDC Training | 153 | 91,908 | 913,841 | 91,941 | 4.9 | 124.90 |
| LSMDC Validation | 12 | 6,542 | 63,789 | 6,542 | 5.2 | 9.50 |
| LSMDC Public Test | 17 | 10,053 | 87,147 | 10,053 | 4.2 | 11.60 |
| LSMDC Blind Test | 20 | 9,578 | 83,766 | 9,578 | 4.5 | 12.00 |
| LSMDC (Total) | 202 | 118,081 | 1,148,543 | 118,081 | 4.8 | 158.00 |

TABLE II: LSMDC Dataset Statistics.

4.1 LSMDC

The Large Scale Movie Description Challenge (LSMDC) [4] started in 2015 in conjunction with ICCV 2015, and as an ECCV workshop in 2016. The Challenge comprises a test set that is released publicly and a blind test set that is withheld. A server is provided to automatically evaluate [11] results. The challenge consists of three primary tasks i.e. Movie Description, Annotation/Retrieval and Fill-in-the-Blank. Since 2017, the MovieQA challenge has also been included in LSMDC in addition to the previous three tasks.

The dataset for this challenge was first introduced in the ICCV 2015 workshop [4]. The LSMDC dataset combines two benchmark datasets, M-VAD [156] and MPII-MD [133], which were initially collected independently (see Section 3.2). The two datasets were merged for this Challenge, with overlaps removed to avoid repetition of the same movie in the test and training sets. Further, the manual alignments performed on MPII-MD were also removed from the validation and the test sets. The dataset was then augmented with clips only (without aligned annotations) from 20 additional movies to make up the blind test set of the Challenge. These additional clips were added for evaluation only. The final LSMDC dataset has 118,081 video clips extracted from 202 unique movies. It has approximately one sentence per clip. Names of characters in the reference captions are replaced with the token word “SOMEONE”. The dataset is further split into 91908 training clips, 6542 validation clips, 10053 public test clips and a blind (withheld) test set of 9578 clips. The average clip length is approximately 4.8 seconds. The training set captions consist of 22,829 unique words. A summary of the LSMDC dataset can be found in Table II.

A survey of benchmark results on video description (Section 6) shows that LSMDC has emerged as the most challenging dataset, as is evident from the poor performance of several models. As mentioned in the dataset section (Section 3.2), natural language descriptions of movie clips are typically sourced from movie scripts and audio descriptions, so misalignments between captions and videos often occur when the text refers to objects that appeared just before or after the cutting point of a clip. Misalignment is certainly a key contributing factor to the poor performance observed on this dataset. The submission protocol of the challenge is similar to that of the MSCOCO Image Captioning Challenge [36], and uses the same protocol for automatic evaluation. Human evaluation is used to select the final winner. The latest results of automatic evaluation on LSMDC are publicly available [12].

4.2 MSR-VTT

In 2016, to further motivate and challenge the academic and industrial research communities, Microsoft started the Microsoft Research - Video to Text (MSR-VTT) [5] competition, aiming to bring together computer vision and language researchers. The dataset used for this competition is MSR-VTT [177], described in the dataset section (Section 3.4). The participants of the competition are asked to develop a video to text model using the MSR-VTT dataset. External datasets, either public or private, can be used to improve object, action, scene and event detection, as long as the external data used are explicitly cited and explained in the submission file.

Fig. 10: Example video frames from TRECVID-VTT dataset. (a) Frames from the Easy-Video category and (b) frames from the Hard-Video category.

Unlike LSMDC, the MSR-VTT challenge focuses only on the video to text task. This challenge requires a competing algorithm to automatically generate at least one natural language sentence that describes the most informative part of the video. Accuracy is benchmarked against human generated captions during the evaluation stage. The evaluation is based on an automatically computed score using multiple common metrics such as BLEU@4, METEOR, ROUGE-L and CIDEr-D. Details of these metrics are given in Section 5. Like LSMDC, human evaluations are also used to rank the generated sentences.

4.3 TRECVID

Text Retrieval Conference (TREC) is a series of workshops emphasizing various subareas of Information Retrieval (IR) research. In particular, the TREC Video Retrieval Evaluation (TRECVID) [2] workshops, started in 2001, are dedicated to research efforts on content-based exploitation of digital videos. The primary areas of interests include “semantic indexing, video summarization, video copy detection, multimedia event detection and ad-hoc video search” [2]. Since TREC-2016, Video to Text Description (VTT) [20] using natural language has also been included in the challenge tasks.

The TRECVID-2017 VTT task used a dataset of over 50K automatically collected Twitter Vine videos, where each clip spans approximately 6 seconds. The task is performed on a manually annotated subset that consists of 1,880 Twitter Vine videos. The dataset is further divided into four groups, G2, G3, G4 and G5, based on the number of descriptions (2 to 5) per video. Furthermore, each video is tagged as easy or hard according to the difficulty of describing it. Example frames from the VTT dataset are shown in Figure 10.

TRECVID uses metrics such as METEOR, BLEU and CIDEr (details in Section 5) for automatic evaluation, in addition to a newly introduced metric referred to as Semantic Text Similarity (STS) [68]. As the name suggests, STS measures the semantic similarity of the generated and reference descriptions. Human evaluations are also employed to gauge the quality of the automatically generated descriptions, following the Direct Assessment (DA) [62] method. Due to its high reliability, DA is now employed as the official ranking method for machine translation benchmark evaluations [29]. In DA based video description evaluation, human assessors are shown video-sentence pairs and rate how well the sentence describes the events in the video on a scale of 0 to 100 [61].

4.4 ActivityNet Challenge

ActivityNet Dense-Captioning Events in Videos  [8] was first introduced in 2017 as a task of the ActivityNet Large Scale Activity Recognition Challenge [9, 58], running as a CVPR Workshop since 2016. This task studies the detection and description of multiple events in a video. In the ActivityNet Captions Dataset, multiple descriptions along with time-stamps are provided for each video clip, where each description covers a unique portion of the clip. Together, multiple events in that clip can be covered and narrated using the set of sentences. The events may be of variable durations (long or short) or even overlap. Details of this dataset are given in Section 3.4.5 and Table I.

Server based evaluations [6] are performed for this challenge. The quality of the generated captions is measured using the BLEU, METEOR and CIDEr metrics. The latest results for the challenge are also publicly available and can be found online [7].

5 Evaluation Metrics

Evaluations performed over machine generated captions/descriptions of videos can be divided into Automatic Evaluations and Human Evaluations. Automatic evaluations are performed using six different metrics which were originally designed for machine translation and image captioning. These metrics are BLEU [119], ROUGE [99], METEOR [23], CIDEr [159], WMD [93] and SPICE [15]. Below, we discuss these metrics in detail as well as their limitations and reliability. Human Evaluations are performed because of the unsatisfactory performance of automatic metrics, given that there are numerous different ways to correctly describe the same video.

5.1 Automatic Sentence Generation Evaluation

Evaluation of video descriptions, automatically or manually generated, is challenging because there is no specific ground truth or “right answer” that can be taken as a reference for benchmarking accuracy. A video can be correctly described by a wide variety of sentences that may differ not only syntactically but also in terms of semantic content. Consider, for instance, the sample from the MSVD dataset shown in Figure 11, where several ground truth captions are available for the same video clip. Note that each caption describes the clip in an equally valid but different way, with varied focus and levels of detail, ranging from “jet”, “commercial airplane” to “South African jet”, from “flying”, “soaring” to “banking”, and lastly from “air”, “blue sky” to “clear sky”.

Fig. 11: An example from MSVD [34] dataset with the associated ground truth captions. Note how the same video clip has been described very differently. Each caption describes the activity wholly or partially in a different way.

For automatic evaluation, when comparing the generated sentences with ground truth descriptions, three evaluation metrics are borrowed from machine translation, namely, Bilingual Evaluation Understudy (BLEU) [119], Recall Oriented Understudy of Gisting Evaluation (ROUGE) [99] and Metric for Evaluation of Translation with Explicit Ordering (METEOR) [23]. Consensus based Image Description Evaluation (CIDEr) [159] and Semantic Propositional Image Captioning Evaluation (SPICE) [15] are two other recently introduced metrics specifically designed for image captioning tasks that are also being used for automatic evaluation of video description. Table III gives an overview of the metrics included in this survey. In addition to these automatic evaluation metrics, human evaluations are also employed to determine the performance of automated video description algorithms.

5.1.1 Bilingual Evaluation Understudy (BLEU, 2002)

BLEU [119] is a popular metric used to quantify the quality of machine generated text, where quality is measured as the correspondence between machine and human outputs. BLEU scores take into account the overlap between predicted unigrams (single words) or higher order n-grams (sequences of adjacent words) and a set of one or more reference sentences. According to BLEU, a high-scoring description should match the ground truth sentence in length, in word choice and in word order. BLEU evaluation will score 1 for an exact match. Note that the more reference sentences there are in the ground truth per video, the higher the chances of a higher BLEU score. It is primarily designed to evaluate text at a corpus level and, therefore, its use as an evaluation metric over individual sentences may not be fair. BLEU is calculated as

$$\log \text{BLEU} = \min\left(1 - \frac{l_r}{l_c},\, 0\right) + \sum_{n=1}^{N} w_n \log p_n .$$

In the above equation, $l_r / l_c$ is the ratio between the lengths of the corresponding reference corpus and the candidate description, $w_n$ are positive weights, and $p_n$ is the modified n-gram precision, so that the second term is the logarithm of the weighted geometric average of the modified n-gram precisions. While the second term computes the actual match score, the first term is a brevity penalty that penalizes descriptions that are shorter than the reference description.
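The brevity penalty and the weighted sum of log precisions described above can be sketched in a few lines of Python. This is a minimal sentence-level illustration, not the reference implementation: it assumes uniform weights, whitespace-tokenized input, no smoothing, and the reference length closest to the candidate length. The function names are chosen here for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    the largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights 1 / max_n (an assumption)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; this sketch applies no smoothing
    cand_len = len(candidate)
    # Reference length closest to the candidate length (ties: the shorter one).
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - cand_len), length))
    log_brevity = min(1.0 - ref_len / cand_len, 0.0)   # first term
    log_match = sum(math.log(p) / max_n for p in precisions)  # second term
    return math.exp(log_brevity + log_match)


reference = "an elderly man is playing piano in front of a crowd in an anteroom".split()
short = "a man is playing piano".split()
print(bleu(reference, [reference]))  # identical sentences
print(bleu(short, [reference]))      # heavily shortened candidate
```

For the shortened candidate the brevity term dominates: all five unigrams match, yet the score falls to roughly 0.12 because the candidate has 5 tokens against 14 in the reference.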

5.1.2 Recall Oriented Understudy for Gisting Evaluation (ROUGE, 2004)

ROUGE [99] was proposed in 2004 to evaluate text summaries. It calculates the recall score of the generated sentences with respect to the reference sentences using n-grams. Similar to BLEU, ROUGE is also computed by varying the n-gram count. However, unlike BLEU, which is based on precision, ROUGE is based on recall. Besides the n-gram variant ROUGE-N, there are other versions known as ROUGE-L (Longest Common Subsequence), ROUGE-W (Weighted Longest Common Subsequence), ROUGE-S (Skip-Bigram Co-Occurrence Statistics) and ROUGE-SU (an extension of ROUGE-S). We refer the reader to the original paper for details. The version used in image and video captioning evaluation is ROUGE-L, which computes recall and precision scores of the longest common subsequence (LCS) between the generated sentence and each reference sentence. The metric compares common subsequences of words in the candidate and reference sentences. The intuition is that a longer LCS of the candidate and reference sentences corresponds to a higher similarity between the two summaries. The words need not be consecutive but should be in sequence. ROUGE-N is computed as

$$\text{ROUGE-N} = \frac{\sum_{S \in R} \sum_{g_n \in S} C_m(g_n)}{\sum_{S \in R} \sum_{g_n \in S} C(g_n)},$$

where $n$ is the n-gram length, $g_n$ is an n-gram, $C_m(g_n)$ represents the highest number of n-grams that are present in the candidate as well as the ground truth summaries, $C(g_n)$ is the number of occurrences of $g_n$ in a reference summary, and $R$ stands for the reference summaries.

The LCS-based F-measure score is computed to find how similar a summary $A$ of length $m$ is to a summary $B$ of length $n$, where $A$ is a sentence from the ground truth summary and $B$ is a sentence from the candidate generated summary. The recall $R_{lcs}$, precision $P_{lcs}$ and F-score $F_{lcs}$ are calculated as

$$R_{lcs} = \frac{LCS(A, B)}{m}, \qquad P_{lcs} = \frac{LCS(A, B)}{n}, \qquad F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}},$$

where $LCS(A, B)$ is the length of the longest common subsequence between $A$ and $B$, and $\beta = P_{lcs} / R_{lcs}$. The LCS-based F-measure score $F_{lcs}$ is known as the ROUGE-L score. ROUGE-L is 1 when $A = B$, and zero when $A$ and $B$ have nothing in common, i.e. $LCS(A, B) = 0$.

One of the advantages of ROUGE-L is that it does not require consecutive matches of words but employs in-sequence matches within a sentence. Moreover, pre-defining the n-gram length is not required, as this is automatically handled by the LCS.
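The LCS-based score described above can be sketched as follows. This is a minimal illustration for a single reference and a single candidate, assuming whitespace-tokenized input. The `beta` parameter is exposed so that a fixed value can be supplied; when it is left as `None`, the sketch uses the ratio of precision to recall, as in the text. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    computed by dynamic programming over one row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=None):
    """LCS-based F-measure between one reference and one candidate."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    if beta is None:
        beta = precision / recall  # definition used in the text
    return ((1 + beta ** 2) * recall * precision
            / (recall + beta ** 2 * precision))


reference = "an elderly man is playing piano in front of a crowd in an anteroom".split()
short = "a man is playing piano".split()
print(lcs_length(reference, short))        # "man is playing piano"
print(rouge_l(reference, reference))       # identical sentences
print(rouge_l(reference, short, beta=1.2)) # fixed beta, an assumed value
```

With the assumed fixed value `beta=1.2`, the shortened candidate scores about 0.39: the precision is 4/5 but the recall is only 4/14.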

| Metric Name | Designed For | Methodology |
|---|---|---|
| BLEU [119] | Machine translation | n-gram precision |
| ROUGE [99] | Document summarization | n-gram recall |
| METEOR [23] | Machine translation | n-gram with synonym matching |
| CIDEr [159] | Image captioning | tf-idf weighted n-gram similarity |
| SPICE [15] | Image captioning | Scene-graph synonym matching |
| WMD [93] | Document similarity | Earth mover distance on word2vec |

TABLE III: Summary of metrics used for video description evaluation.

5.1.3 Metric for Evaluation of Translation with Explicit Ordering (METEOR, 2005)

METEOR [23] was proposed to address the shortcomings of BLEU [119]. Instead of the exact lexical match required by BLEU, METEOR introduced semantic matching. METEOR uses WordNet [52], a lexical database of the English language, to account for various match levels, including exact word matches, stemmed word matches, synonym matches and paraphrase matches.

METEOR score computation is based on how well the generated and reference sentences are aligned. Each sentence is taken as a set of unigrams and alignment is done by mapping unigrams of the candidate and reference sentences. During mapping, a unigram in the candidate sentence (or reference sentence) should map either to one unigram in the reference sentence (or candidate sentence) or to none. If multiple alignments between the two sentences are possible, the alignment with the fewest crossings is preferred. Once the alignment is finalized, the METEOR score is calculated.

Initially, the unigram based precision score $P$ is calculated as $P = m_{cr} / m_{ct}$. Here $m_{cr}$ represents the number of unigrams co-occurring in both the candidate and the reference sentences, and $m_{ct}$ corresponds to the total number of unigrams in the candidate sentence. Then the unigram based recall score $R$ is calculated as $R = m_{cr} / m_{rt}$, where $m_{cr}$ is defined as before and $m_{rt}$ is the number of unigrams in the reference sentence. Further, the precision and recall scores are used to compute the F-score using the following equation:

$$F_{mean} = \frac{10\, P R}{R + 9 P}.$$

The precision, recall and F-score measures account for unigram based congruity and do not cater for n-grams. The n-gram based similarities are used to calculate the penalty for the alignment between the candidate and reference sentences. This penalty takes into account the non-adjacent mappings between the two sentences. The penalty is calculated by grouping the unigrams into the minimum number of chunks. A chunk includes unigrams that are adjacent in the candidate as well as the reference sentence. If a generated sentence is an exact match to the reference sentence, there will be only one chunk. The penalty is computed as

$$\text{Penalty} = \frac{1}{2} \left( \frac{N_c}{N_u} \right)^{3},$$

where $N_c$ represents the number of chunks and $N_u$ corresponds to the number of unigrams grouped together. The METEOR score for the sentence is then computed as

$$M = F_{mean}\, (1 - \text{Penalty}).$$

The corpus level score can be computed using the same equation with aggregated values of all the arguments, i.e. $P$, $R$ and the penalty. In the case of multiple reference sentences, the maximum METEOR score over all pairs of the generated sentence and a reference sentence is taken. To date, the correlation of the METEOR score with human judgments is better than that of the BLEU score. Moreover, Elliot et al. [48] also found METEOR to be a better evaluation metric than contemporary metrics. Their conclusion is based on Spearman’s correlation of automatic evaluation metrics against human judgments.
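The precision, recall, F-score and chunk penalty steps above can be sketched as follows. This is a deliberately simplified illustration, not the METEOR tool itself: it assumes exact word matches only (no stemming, WordNet synonyms or paraphrases) and replaces the fewest-crossings alignment search with a greedy heuristic that prefers to extend the current chunk. All function names are illustrative.

```python
def align(candidate, reference):
    """Greedy exact-match alignment. Returns (candidate_index,
    reference_index) pairs; each reference word is used at most once."""
    used = set()
    pairs = []
    prev = None  # reference index of the previous match, if any
    for i, word in enumerate(candidate):
        options = [j for j, ref_word in enumerate(reference)
                   if ref_word == word and j not in used]
        if not options:
            prev = None
            continue
        # Prefer the position that continues the current chunk.
        j = prev + 1 if prev is not None and prev + 1 in options else options[0]
        used.add(j)
        pairs.append((i, j))
        prev = j
    return pairs


def count_chunks(pairs):
    """Number of runs of matches adjacent in both sentences."""
    chunks = 0
    last = None
    for i, j in pairs:
        if last is None or i != last[0] + 1 or j != last[1] + 1:
            chunks += 1
        last = (i, j)
    return chunks


def meteor(candidate, reference):
    """Simplified METEOR score for one candidate and one reference."""
    pairs = align(candidate, reference)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(pairs) / matches) ** 3
    return f_mean * (1 - penalty)


print(meteor("the cat sat".split(), "the cat sat".split()))  # one chunk
print(meteor("sat the cat".split(), "the cat sat".split()))  # two chunks
```

Both candidates contain exactly the reference words, so precision and recall are 1 in both cases; only the chunk penalty separates them, which is how word order enters the score.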

5.1.4 Consensus based Image Description Evaluation (CIDEr, 2015)

CIDEr [159] is a recently introduced evaluation metric for image captioning task. It evaluates the consensus between a predicted sentence and reference sentences of the corresponding image. It performs stemming and converts all the words from candidate as well as reference sentences into their root forms e.g. stems, stemmer, stemming, and stemmed to their root word stem. CIDEr treats each sentence as a set of n–grams containing 1 to 4 words. To encode the consensus between predicted sentence and reference sentence, it measures the co-existence frequency of n-grams in both sentences. Finally, n–grams that are very common among the reference sentences of all the images are given lower weight, as they are likely to be less informative about the image content, and more biased towards lexical structure of the sentences. The weight for each n–gram is computed using Term Frequency Inverse Document Frequency (TF-IDF) [130]. The term TF puts higher weightage on frequently occurring n–grams in the reference sentence of the image, whereas IDF puts lower weightage on commonly appearing n–grams across the whole dataset.

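Finally, for a candidate sentence $c_i$ and the reference sentences $S_i = \{s_{i1}, \dots, s_{im}\}$ of the same video, the CIDEr score for n-grams of length $n$ is computed as

$$\text{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j} \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert},$$

where $g^n(c_i)$ is a vector representing all n-grams of length $n$ and $\lVert g^n(c_i) \rVert$ is the magnitude of $g^n(c_i)$. The same is true for $g^n(s_{ij})$. Further, CIDEr uses higher order n-grams (the higher the order, the longer the sequence of words) to capture the grammatical properties and richer semantics of the text. To this end, it combines the scores of different n-grams using the following equation:

$$\text{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n \, \text{CIDEr}_n(c_i, S_i).$$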

The most popular version of CIDEr in image and video description evaluation is CIDEr-D, which incorporates a few modifications to the originally proposed CIDEr to prevent high scores for captions that fail badly in human judgments. Firstly, stemming is removed to ensure that the correct forms of words are used; otherwise, multiple forms of a verb (singular, plural etc.) are mapped to the same token, producing high scores for incorrect sentences. Secondly, CIDEr-D ensures that repeating high-confidence words in a sentence does not produce a high score, as it does in the original CIDEr even when the sentence does not make sense. This is done by introducing a Gaussian penalty over the length difference between the candidate and reference sentences and by clipping the n-gram counts to the number of occurrences in the reference sentence. The latter ensures that the desired sentence length cannot be reached merely by repeating high-confidence words to get a high score. These changes make the metric more robust and ensure its high correlation with human judgments [159].
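The basic CIDEr computation (TF-IDF weighted n-gram vectors compared by cosine similarity, averaged over references and over n-gram orders) can be sketched as below. This is a minimal sketch of the original CIDEr idea only: it assumes uniform weights over n = 1 to 4, no stemming, and a document frequency floor of 1 for unseen n-grams, and it implements none of the CIDEr-D modifications (Gaussian length penalty, count clipping). The toy corpus and all names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf_vector(tokens, n, doc_freq, num_videos):
    """Sparse TF-IDF vector over n-grams. IDF counts in how many videos
    an n-gram appears in at least one reference."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    vector = {}
    for gram, count in counts.items():
        idf = math.log(num_videos / max(1, doc_freq[gram]))
        vector[gram] = (count / total) * idf
    return vector


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm > 0 else 0.0


def cider(candidate, references, corpus, max_n=4):
    """candidate: token list; references: token lists for the same video;
    corpus: list of reference sets (one per video) used for the IDF term."""
    score = 0.0
    for n in range(1, max_n + 1):
        doc_freq = Counter()
        for video_refs in corpus:
            grams = set()
            for ref in video_refs:
                grams.update(ngram_counts(ref, n))
            doc_freq.update(grams)  # each video counts once per n-gram
        cand_vec = tfidf_vector(candidate, n, doc_freq, len(corpus))
        sims = [cosine(cand_vec, tfidf_vector(ref, n, doc_freq, len(corpus)))
                for ref in references]
        score += (sum(sims) / len(sims)) / max_n
    return score


# Hypothetical two-video corpus, for illustration only.
video_a = ["a man is playing piano".split(), "a man plays the piano".split()]
video_b = ["a dog runs in the park".split()]
corpus = [video_a, video_b]
print(cider("a man is playing piano".split(), video_a, corpus))
print(cider("a dog runs in the park".split(), video_a, corpus))
```

In this toy corpus the words "a" and "the" occur in the references of both videos, so their IDF is zero and they contribute nothing; this is the down-weighting of uninformative n-grams that the TF-IDF term is meant to provide.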

Fig. 12: Components of the WMD metric between a query $D_0$ and two sentences $D_1$ and $D_2$ with the same BOW distance. $D_1$, with the smaller distance of 1.07, matches the query $D_0$ better than $D_2$, with distance 1.63. The arrows show the flow between two words and are labeled with their distance contribution. Figure adapted from [93].

| Variation | Description | B | M | R | C |
|---|---|---|---|---|---|
| reference | an elderly man is playing piano in front of a crowd in an anteroom | 1 | 1 | 1 | 10 |
| candidate | an elderly man is showing how to play piano in front of a crowd in a hall room | 0.47 | 0.45 | 0.70 | 0.53 |
| synonyms | an old man is demonstrating how to play piano in front of a crowd in a hall room | 0.37 | 0.40 | 0.64 | 0.43 |
| redundancy | an elderly man is showing how to play piano in front of a crowd in a hall room with a woman | 0.40 | 0.44 | 0.65 | 0.47 |
| word order | an elderly man in front of a crowd is showing how to play piano in a hall room | 0.30 | 0.39 | 0.57 | 0.35 |
| short length | a man is playing piano | 0.12 | 0.22 | 0.39 | 0.49 |

TABLE IV: Variations in automatic evaluation metric scores with four types of changes made to the candidate sentence, i.e. replacing words with their synonyms, adding redundancy to the sentence, changing the word order, and shortening the sentence length. The first row shows the upper bound scores of BLEU-4, METEOR, ROUGE and CIDEr, represented by B, M, R and C respectively.

5.1.5 Word Mover’s Distance (WMD, 2015)

The WMD [93] makes use of word embeddings, which are semantically meaningful vector representations of words learnt from text corpora. WMD measures the dissimilarity between two text documents. Two captions with different words may still have the same semantic meaning. On the other hand, it is possible for multiple captions to have the same attributes, objects and relations while still having very different meanings. WMD was proposed to address this problem, because word embeddings are good at capturing semantic meaning and, thanks to the distributed vector representations of words, are easier to compute with than WordNet. The distance between two texts is cast as an Earth Mover’s Distance (EMD) [140], typically used in transportation problems to calculate travel cost, computed over word2vec embeddings [108].

In this metric, each caption or description is represented by a bag-of-words histogram that includes all but the start and stop words. The magnitude of each bag-of-words histogram is then normalized. To account for semantic similarities that exist between pairs of words, the WMD metric uses the Euclidean distance in the word2vec embedding space. The distance between two documents or captions is then defined as the minimum cost required to move all words of one caption to the other. Figure 12 illustrates an example WMD calculation process. The WMD is modelled as a special case of EMD [140] and is then solved by linear optimization. Compared to BLEU, ROUGE and CIDEr, WMD is less sensitive to word order or synonym swapping. Further, similar to CIDEr and METEOR, it shows high correlation with human judgments.
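The transport formulation can be illustrated on a restricted special case. The sketch below is not a general WMD solver: it assumes both captions contain the same number of distinct words, each occurring once, so that every word carries the same normalized weight. Under that assumption the optimal transport plan is a one-to-one matching, which is found here by brute force over permutations (feasible only for very short captions). The 2-D embeddings are hypothetical stand-ins for word2vec vectors, and the function name is illustrative.

```python
import math
from itertools import permutations


def wmd_equal_length(doc_a, doc_b, embedding):
    """Exact WMD for two documents with the same number of distinct words,
    each appearing once (uniform weights 1/n on both sides)."""
    if len(set(doc_a)) != len(doc_a) or len(set(doc_b)) != len(doc_b):
        raise ValueError("this sketch assumes no repeated words")
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch assumes equally long documents")
    n = len(doc_a)
    # Euclidean distance between every pair of word vectors.
    cost = [[math.dist(embedding[a], embedding[b]) for b in doc_b]
            for a in doc_a]
    # With uniform weights the optimum is a permutation (one-to-one matching).
    best = min(sum(cost[i][perm[i]] for i in range(n))
               for perm in permutations(range(n)))
    return best / n


# Hypothetical 2-D embeddings: related words are placed close together.
embedding = {
    "obama": (0.0, 0.0), "president": (0.0, 0.5), "band": (0.0, 3.0),
    "speaks": (2.0, 0.0), "greets": (2.0, 0.5), "gave": (2.0, 3.0),
    "media": (4.0, 0.0), "press": (4.0, 0.5), "concert": (4.0, 3.0),
    "illinois": (6.0, 0.0), "chicago": (6.0, 0.5), "japan": (6.0, 3.0),
}
query = ["obama", "speaks", "media", "illinois"]
close = ["president", "greets", "press", "chicago"]
far = ["band", "gave", "concert", "japan"]
print(wmd_equal_length(query, close, embedding))
print(wmd_equal_length(query, far, embedding))
```

Neither comparison sentence shares a single word with the query, so their bag-of-words overlap is identical, yet the embedding-based transport cost separates them. General inputs with unequal lengths or repeated words require a linear programming solver for the transport problem.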

5.1.6 Semantic Propositional Image Captioning Evaluation (SPICE, 2016)

SPICE [15] is the most recently proposed evaluation metric for image and video descriptions. SPICE measures the similarity between the scene graph tuples parsed from the machine generated descriptions and the ground truth. The semantic scene graph encodes objects, their attributes and relationships through a dependency parse tree. A scene graph tuple $G(c)$ of a caption $c$ consists of semantic tokens such as object classes $O(c)$, relation types $E(c)$ and attribute types $K(c)$,

$$G(c) = \langle O(c), E(c), K(c) \rangle.$$

SPICE is computed based on the F1-score between the tuples of the machine generated descriptions and the ground truth. Like METEOR, SPICE also uses WordNet to find and treat synonyms as positive matches. Although the SPICE score has not been employed much in the current literature, one obvious limiting factor on its performance could be the quality of the parsing. For instance, in the sentence “white dog swimming through river”, a failure case could be the word “swimming” being parsed as an “object” and the word “dog” being parsed as an “attribute”, resulting in a very bad score.
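The F1 step, and the effect of a parsing failure on it, can be sketched as below. The sketch assumes the scene graph tuples have already been extracted by a parser and matches them exactly, without the WordNet synonym matching mentioned above. The tuples are hand-written, hypothetical parses of the example sentence, and the function name is illustrative.

```python
def tuple_f1(candidate_tuples, reference_tuples):
    """F1-score between two collections of scene graph tuples
    (exact matching only; no synonym handling)."""
    candidate = set(candidate_tuples)
    reference = set(reference_tuples)
    matched = len(candidate & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(candidate)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tuples for "white dog swimming through river":
# objects are 1-tuples, attributes 2-tuples, relations 3-tuples.
reference = [("dog",), ("river",), ("dog", "white"),
             ("dog", "swim through", "river")]
good_parse = list(reference)
# Failure case: "swimming" parsed as the object, "dog" as its attribute.
bad_parse = [("swimming",), ("river",), ("swimming", "dog")]

print(tuple_f1(good_parse, reference))
print(tuple_f1(bad_parse, reference))
```

In the failure case only the tuple for "river" survives, so one mis-parsed word removes most of the matches even though the caption itself is correct.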

5.2 Human Evaluations

Given the lack of reference captions and the low correlation of automated evaluation metrics with human judgments, human evaluations are also often used to judge the quality of machine generated captions. Human evaluations may be performed either by crowd-sourced workers, such as AMT workers, or by specialist judges, as in some competitions. Such human evaluations can be further structured using measurements such as Relevance or Grammar Correctness. In relevance based evaluation, video content relevance is given subjective scores, with the highest score given to the “Most Relevant” sentence and the minimum score to the “Least Relevant”. The scores of two sentences cannot be the same unless the sentences are identical. In approaches where grammar correctness is measured, the sentences are graded on grammatical correctness without showing the video content to the evaluators, in which case more than one sentence may have the same score.

5.3 Limitations of Evaluation Metrics

Like video description itself, evaluation of the machine generated sentences is a difficult task. There is no metric specifically designed for evaluating video description; instead, machine translation and image captioning metrics have been extended for this task. These automatic metrics compute a score given reference and candidate sentences. This paradigm has a serious problem: there can be several different ways to describe the same video, all correct at the same time, depending upon “what has been described” (content selection) and “how it has been described” (realization). These metrics fail to incorporate all these variations and are, therefore, far from perfect. Various studies [80, 171] have examined how metric scores behave under different conditions. In Table IV, we perform experiments similar to [80] but with an additional variation of short length. First, the original caption was evaluated against itself to analyze the maximum possible score achievable by each metric (first row of Table IV). Next, minor modifications were introduced in the candidate sentence to measure how the evaluation metrics behave. It was observed that all metric scores dropped when some words were replaced with their synonyms, BLEU and CIDEr being the most affected. This is apparently due to the failure to match synonyms. Further experiments revealed that the metrics were generally stable when the sentence was perturbed with a few additional words. However, changing the word order in a sentence was found to alter the scores of BLEU and CIDEr significantly (by about 36% and 34% relative to the candidate row of Table IV) and those of ROUGE and METEOR to a lesser extent (by about 19% and 13%). On the other hand, WMD and SPICE were found to be robust to word order changes [80]. Lastly, reducing the sentence length significantly affected the BLEU, METEOR and ROUGE scores but had little effect on the CIDEr score, i.e. the scores were reduced by 74%, 51%, 44% and 7% respectively.
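The relative reductions quoted above follow directly from Table IV. The short sketch below recomputes them from the table values, taking the candidate row as the baseline for every variation. The variable and function names are illustrative.

```python
# Scores copied from Table IV (B = BLEU-4, M = METEOR, R = ROUGE, C = CIDEr).
candidate = {"B": 0.47, "M": 0.45, "R": 0.70, "C": 0.53}
variations = {
    "synonyms":     {"B": 0.37, "M": 0.40, "R": 0.64, "C": 0.43},
    "redundancy":   {"B": 0.40, "M": 0.44, "R": 0.65, "C": 0.47},
    "word order":   {"B": 0.30, "M": 0.39, "R": 0.57, "C": 0.35},
    "short length": {"B": 0.12, "M": 0.22, "R": 0.39, "C": 0.49},
}


def relative_drop(before, after):
    """Percentage reduction of a score relative to its baseline value."""
    return 100.0 * (before - after) / before


drops = {
    name: {metric: relative_drop(candidate[metric], score)
           for metric, score in scores.items()}
    for name, scores in variations.items()
}

for name, row in drops.items():
    print(name, {metric: round(value, 1) for metric, value in row.items()})
```

For the short length row this gives reductions of roughly 74.5%, 51.1%, 44.3% and 7.5% for BLEU, METEOR, ROUGE and CIDEr, which agree with the quoted figures up to rounding.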

5.4 Reliability of Evaluation Metrics

A good method to evaluate video descriptions is to compare the machine generated descriptions with the ground truth descriptions annotated by humans. However, as shown in Figure 11, the reference captions can vary among themselves and can represent only a few samples out of all valid descriptions of the same video clip. Having more reference captions creates a better solution space and hence leads to more reliable evaluation.

Another aspect of the evaluation problem is the syntactic variations in candidate sentences. The same problem also exists in the well studied field of machine translation. In this case, a sentence in a source language can be translated into various sentences in a target language. Syntactically different sentences may still have the same semantic content.

In a nutshell, evaluation metrics assess the suitability of a caption to the visual input by comparing how well the candidate caption matches the reference caption(s). The agreement of the metric scores with human judgments (i.e. the gold standard) improves as the number of reference captions increases [159]. Numerous studies [159, 116, 184, 161] also found that CIDEr, WMD, SPICE and METEOR have higher correlations with human judgments and are regarded as superior amongst the contemporary metrics. WMD and SPICE are very recent automatic caption evaluation metrics and have not been studied extensively in the literature at the time of this survey.
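The effect of the number of reference captions can be sketched with a toy score. The example below assumes a simple unigram F1 overlap and aggregates over references by taking the best match; real metrics aggregate references in their own ways, so this is only an illustration of why a larger reference set can never hurt such a score. The function names and captions are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall (toy similarity score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_match_score(candidate, references):
    """Score a candidate by its best-matching reference (assumed aggregation)."""
    return max(unigram_f1(candidate, ref) for ref in references)


# Hypothetical candidate and reference captions for one clip.
candidate = "a guy is playing a guitar"
references = [
    "a man is playing a guitar",
    "someone strums an instrument",
    "a guy is playing a guitar on stage",
]

# Score the same candidate against growing reference sets.
scores = [best_match_score(candidate, references[:k])
          for k in range(1, len(references) + 1)]
for k, score in enumerate(scores, start=1):
    print(f"{k} reference(s): {score:.3f}")
```

With a best-match aggregation the score is non-decreasing in the number of references: a valid candidate that is penalized against one reference may be matched closely by another.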

| Techniques / Models / Methods | Year | Dataset | BLEU | METEOR | CIDEr | ROUGE |
|---|---|---|---|---|---|---|
| RBS+RBS & RF-TP+RBS [69] | 2012 | MSVD | SVO Accuracy | | | |
| SVO-LM (VE) [88] | 2013 | MSVD | 0.45±0.05 | 0.36±0.27 | | |
| FGM [154] | 2014 | MSVD | SVOP Accuracy | | | |
| LSTM-YT [162] | 2015 | MSVD | 33.3 | 29.1 | - | - |
| TA [179] | 2015 | MSVD | 41.9 | 29.6 | 51.67 | - |
| S2VT [161] | 2015 | MSVD | - | 29.8 | - | - |
| h-RNN [184] | 2016 | MSVD | 49.9 | 32.6 | 65.8 | - |
| MM-VDN [176] | 2016 | MSVD | 37.6 | 29.0 | - | - |
| Glove + Deep Fusion Ensemble [160] | 2016 | MSVD | 42.1 | 31.4 | - | - |
| S2FT [101] | 2016 | MSVD | - | 29.9 | - | - |
| HRNE [116] | 2016 | MSVD | 43.8 | 33.1 | - | - |
| GRU-RCN [22] | 2016 | MSVD | 43.3 | 31.6 | 68.0 | - |
| LSTM-E [117] | 2016 | MSVD | 45.3 | 31.0 | - | - |
| SCN-LSTM [56] | 2017 | MSVD | 51.1 | 33.5 | 77.7 | - |
| LSTM-TSA [118] | 2017 | MSVD | 52.8 | 33.5 | 74.0 | - |
| TDDF [189] | 2017 | MSVD | 45.8 | 33.3 | 73.0 | 69.7 |
| BAE [24] | 2017 | MSVD | 42.5 | 32.4 | 63.5 | - |
| PickNet [37] | 2018 | MSVD | 46.1 | 33.1 | 76.0 | 69.2 |
| M3 [170] | 2018 | MSVD | 52.8 | 33.3 | - | - |
| RecNet [167] | 2018 | MSVD | 52.3 | 34.1 | 80.3 | 69.8 |
| TSA-ED [173] | 2018 | MSVD | 51.7 | 34.0 | 74.9 | - |
| GRU-EVE [13] | 2019 | MSVD | 47.9 | 35.0 | 78.1 | 71.5 |

TABLE V: Performance of video captioning methods on MSVD dataset. Higher scores are better in all metrics. The best score for each metric is shown in bold.

6 Benchmark Results

We summarize the benchmark results of various techniques on each video description dataset. We group the methods by the dataset on which they reported results and then order them chronologically. For multiple variants of the same model, only the best reported results are listed here; for a detailed analysis of each method and its variants, the original paper should be consulted. In addition, where multiple n-gram scores are reported for the BLEU metric, we have chosen only the BLEU@4 results as this is the closest to human evaluations. From Tables V to XI, we can see that most methods have reported results on the MSVD dataset, followed by MSR-VTT, M-VAD, MPII-MD, and ActivityNet Captions. The popularity of MSVD can be attributed to the diverse nature of YouTube videos and the large number of reference captions. MPII-MD, M-VAD, MSR-VTT and ActivityNet Captions are popular because of their size and their inclusion in competitions (see Section 4).

| Techniques / Models / Methods | Year | Dataset | BLEU | METEOR | CIDEr | ROUGE |
|---|---|---|---|---|---|---|
| SMT(SR) + Prob I/P [131] | 2014 | TACoS MLevel | 28.5 | - | - | - |
| CRF + LSTM-Decoder [46] | 2015 | TACoS MLevel | 28.8 | - | - | - |
| h-RNN [184] | 2016 | TACoS MLevel | 30.5 | 28.7 | 160.2 | - |
| JEDDi-Net [175] | 2018 | TACoS MLevel | 18.1 | 23.85 | 103.98 | 50.85 |

TABLE VI: Performance of video captioning methods on TACoS-MLevel dataset. Higher scores are better in all metrics. The best score for each metric is shown in bold.

| Techniques / Models / Methods | Year | Dataset | BLEU | METEOR | CIDEr | ROUGE |
|---|---|---|---|---|---|---|
| Temporal-Attention (TA) [179] | 2015 | M-VAD | 0.7 | 5.7 | 6.1 | - |
| S2VT [161] | 2015 | M-VAD | - | 6.7 | - | - |
| Visual-Labels [132] | 2015 | M-VAD | - | 6.4 | - | - |
| HRNE [116] | 2016 | M-VAD | 0.7 | 6.8 | - | - |
| Glove + Deep Fusion Ensemble [160] | 2016 | M-VAD | - | 6.8 | - | - |
| LSTM-E [117] | 2016 | M-VAD | - | 6.7 | - | - |
| LSTM-TSA [118] | 2017 | M-VAD | - | 7.2 | - | - |
| BAE [24] | 2017 | M-VAD | - | 7.3 | - | - |

TABLE VII: Performance of video captioning methods on M-VAD dataset.

| Techniques / Models / Methods | Year | Dataset | BLEU | METEOR | CIDEr | ROUGE |
|---|---|---|---|---|---|---|
| S2VT [161] | 2015 | MPII-MD | - | 7.1 | - | - |
| Visual-Labels [132] | 2015 | MPII-MD | - | 7.0 | - | - |
| SMT [133] | 2015 | MPII-MD | - | 5.6 | - | - |
| Glove + Deep Fusion Ensemble [160] | 2016 | MPII-MD | - | 6.8 | - | - |
| LSTM-E [117] | 2016 | MPII-MD | - | 7.3 | - | - |
| LSTM-TSA [118] | 2017 | MPII-MD | - | 8.0 | - | - |
| BAE [24] | 2017 | MPII-MD | 0.8 | 7.0 | 10.8 | 16.7 |

TABLE VIII: Performance of video captioning methods on MPII-MD dataset.

| Techniques / Models / Methods | Year | Dataset | BLEU | METEOR | CIDEr | ROUGE |
|---|---|---|---|---|---|---|
| Alto [144] | 2016 | MSR-VTT | 39.8 | 26.9 | 45.7 | 59.8 |
| VideoLab [125] | 2016 | MSR-VTT | 39.1 | 27.7 | 44.4 | 60.6 |
| RUC-UVA [47] | 2016 | MSR-VTT | 38.7 | 26.9 | 45.9 | 58.7 |
| v2t-navigator [76] | 2016 | MSR-VTT | 40.8 | 28.2 | 44.8 | 61.1 |
| TDDF [189] | 2017 | MSR-VTT | 37.3 | 27.8 | 43.8 | 59.2 |
| DenseVidCap [143] | 2017 | MSR-VTT | 41.4 | 28.3 | 48.9 | 61.1 |
| CST-GT-None [121] | 2017 | MSR-VTT | 44.1 | 29.1 | 49.7 | 62.4 |
| PickNet [37] | 2018 | MSR-VTT | 38.9 | 27.2 | 42.1 | 59.5 |
| HRL [172] | 2018 | MSR-VTT | 41.3 | 28.7 | 48.0 | 61.7 |
| M3 [170] | 2018 | MSR-VTT | 38.1 | 26.6 | - | - |
| RecNet [167] | 2018 | MSR-VTT | 39.1 | 26.6 | 42.7 | 59.3 |
| GRU-EVE [13] | 2019 | MSR-VTT | 38.3 | 28.4 | 48.1 | 60.7 |

TABLE IX: Performance of video captioning methods on MSR-VTT dataset.

Another key observation is that earlier works have mainly reported results in terms of subject, verb, object (SVO) and in some cases place (scene) detection accuracies in the video, whereas more recent works started to report sentence level matches using the automatic evaluation metrics. Considering the diverse nature of the datasets and the limitations of automatic evaluation metrics, we analyze the results of different methods using four popular metrics namely BLEU, METEOR, CIDEr and ROUGE.

| Techniques / Models / Methods | Year | Dataset | BLEU | METEOR | CIDEr | ROUGE |
|---|---|---|---|---|---|---|
| Dense-Cap Model [87] | 2017 | ActivityNet Cap | 3.98 | 9.5 | 24.6 | - |
| LSTM-A+PG+R [180] | 2017 | ActivityNet Cap | - | 12.84 | - | - |
| TAC [124] | 2017 | ActivityNet Cap | - | 9.61 | - | - |
| JEDDi-Net [175] | 2018 | ActivityNet Cap | 1.63 | 8.58 | 19.88 | 19.63 |
| DVC [98] | 2018 | ActivityNet Cap | 1.62 | 10.33 | 25.24 | - |
| Bi-SST [169] | 2018 | ActivityNet Cap | 2.30 | 9.60 | 12.68 | 19.10 |
| Masked Transformer [192] | 2018 | ActivityNet Cap | 2.23 | 9.56 | - | - |

TABLE X: Performance of video captioning methods on ActivityNet Captions dataset.

| Techniques / Models / Methods | Year | Dataset | BLEU | METEOR | CIDEr | ROUGE |
|---|---|---|---|---|---|---|
| CT-SAN [187] | 2016 | LSMDC | 0.8 | 7.1 | 10.0 | 15.9 |
| GEAN [186] | 2017 | LSMDC | - | 7.2 | 9.3 | 15.6 |
| HRL [172] | 2018 | Charades | 18.8 | 19.5 | 23.2 | 41.4 |
| TSA-ED [173] | 2018 | Charades | 13.5 | 17.8 | 20.8 | - |
| Masked Transformer [192] | 2018 | YouCook-II | 1.13 | 5.90 | - | - |

TABLE XI: Performance of video captioning methods on various benchmark datasets.

Table V summarizes results for the MSVD dataset. GRU-EVE [13] achieves the best performance on the METEOR and ROUGE metrics and the second best on the CIDEr metric, whereas LSTM-TSA [118] and M3 [170] report the best BLEU scores. RecNet [167] has the best CIDEr score and the second best BLEU score. As shown in Table VI, on the TACoS Multilevel dataset, h-RNN [184] has the best results on all the metrics it reports, i.e. BLEU, METEOR and CIDEr. This method does not provide a ROUGE score.

On the more challenging M-VAD dataset, the reported results (Table VII) are very poor overall. So far, only Temporal-Attention [179] and HRNE [116] have reported results using the BLEU metric, with a BLEU score of 0.7 each. All the papers using this dataset report METEOR results; so far BAE [24] has produced the best METEOR score, followed by LSTM-TSA [118]. HRNE [116] and Glove+Deep Fusion Ensemble [160] share third place for METEOR score.

MPII-MD is another very challenging dataset and still has very low benchmark results, as shown in Table VIII, similar to the M-VAD dataset. Only BAE [24] has reported BLEU score for this dataset. LSTM-TSA [118] has achieved the best METEOR score followed by LSTM-E [117] and S2VT [161] at second and third place respectively. No other paper using this dataset has reported CIDEr and ROUGE score except BAE [24].

Results on another popular dataset, MSR-VTT, are overall better than those on the M-VAD and MPII-MD datasets. As shown in Table IX, CST-GT-None [121] has reported the highest score on all four metrics, i.e. BLEU, METEOR, CIDEr and ROUGE. DenseVidCap [143] and HRL [172] respectively report the second and third best scores on the BLEU metric. GRU-EVE [13] reports the third best score on the METEOR and CIDEr metrics.

Results on ActivityNet Captions, another recent and popular dataset, are presented in Table X. This dataset was primarily introduced for dense video captioning and is gaining popularity very quickly. On this dataset, the Dense-Cap Model [87] ranks first in terms of BLEU score. The best METEOR score is reported by LSTM-A+PG+R [180]. The highest scores on the CIDEr and ROUGE metrics are achieved by DVC [98] and JEDDi-Net [175] respectively. Finally, in Table XI, we report two results each for LSMDC and Charades and only one result for the YouCook-II dataset. YouCook-II is also a recent dataset and has not been reported much in the literature.

We summarize the best performing methods for each dataset along with their published scores. The tables group methods by the dataset(s) used. Hence, one can infer the difficulty level of the datasets by comparing the scores of the same methods across datasets, and the popularity of a particular dataset from the number of methods that have reported results on it.

7 Future and Emerging Directions

Automatic video description has come very far since the pioneering methods, especially after the adoption of deep learning. Although the performance of existing methods is still far below that of humans, the gap is diminishing at a steady rate and there is still ample room for algorithmic improvements. Here, we list several possible future and emerging directions that have the potential to advance this research area.

Visual Reasoning: Although video VQA is still in a nascent stage, beyond VQA lies the visual reasoning problem, which is a very promising field to explore further. Here the model is made not just to answer a particular question but to reason about why it chose that particular answer. For example, in a video where a roadside with parking marks is shown and the question is “Can a vehicle be parked here?”, the model answers correctly, “Yes”. The next question is “Why?”, to which the model reasons that there is a parking sign on the road, which means it is legal to park here. Another example is the explanations generated by self driving cars [81], where the system keeps the passengers informed by generating natural language descriptions of the reasons behind its decisions, e.g. to slow down or to take a turn. An example of a visual reasoning model is the MAC Network [74], which is able to think and reason, giving promising results on CLEVR [77], a visual reasoning dataset.

Visual Dialogue: Similar to audio dialogue (e.g. Siri, Hello Google, Alexa and ECHO), visual dialogue [43] is another promising and flourishing field, especially in an era where we look forward to interacting with robots. In visual dialogue, given a video, a model is asked a series of questions sequentially in a dialogue/conversation manner. The model tries to answer these questions, whether rightly or wrongly. This is different from visual reasoning, where the model explains the reasons that led it to choose particular answers.

Audio and Video: While the majority of computer vision research on video description has proceeded without the help of audio, audio is naturally present in most videos. Audio can help in video description by providing background information, for instance, the sound of a train, the ocean or traffic when there is no visual cue of their presence. Audio can additionally provide semantic information, for example, who the person is or what they are saying on the other side of the phone. It can also provide clues about the story and context, and sometimes explicitly mentions the object or action, complementing the video information. Therefore, using audio in video description models can be expected to improve performance [70, 115].

External Knowledge: In video description, most of the time we are comparing the performance with humans who have extensive out of domain or prior knowledge. When humans watch a clip and describe it, most of the time they don’t rely solely on the visual (or even the audio) content. Instead, they additionally employ their background knowledge. Similarly, augmenting video description techniques with prior external knowledge [174] would be an interesting and promising approach. This approach has shown significantly better performance in visual question answering methods and is likely to improve video description accuracy.

Addressing the Finite Model Capacity: Existing methods try to perform end-to-end training while using as much data as possible for better learning. However, this approach is inherently limited because, no matter how big the training dataset becomes, it will never cover the combinatorial complexity of real world events. Therefore, learning to use data, rather than learning the data itself, is more important and may help improve the performance of upcoming systems.

Video Description for Subtitle Generation: In conjunction with machine translation, video captioning may be used for automatic video subtitling. Currently, subtitling is a manual, time consuming, and very costly process. This line of research is not only beneficial for entertainment, one of the largest industries in the world, but will potentially also help improve comprehension of audiovisual material by the visually and hearing impaired, and by second language learners.

Automatic Evaluation Measures: So far video description has relied on automatic metrics designed for machine translation and image captioning tasks. To date there is no purpose designed automatic evaluation metric for video description (or even captioning). Although metrics designed for image captioning are relevant, they have their limitations. This problem is going to be exacerbated in the future by dense video captioning and story telling tasks. There is a need for an evaluation metric that is closer to human judgments and that can encapsulate the diversity of realizations of visual content. A promising research direction is to use machine learning to learn such a metric rather than hand engineer it.

8 Conclusion

We presented the first comprehensive literature survey of video description research, starting from the classical methods that are based on Subject-Verb-Object (SVO) tuples to more sophisticated statistical and deep learning based methods. We reviewed popular benchmark datasets that are commonly used for training and testing these models and discussed international competitions/challenges that are regularly held to promote the video description research. We discussed, in detail, the available automatic evaluation metrics for video description, highlighting their attributes and limitations. We presented a comprehensive summary of results obtained by recent methods on the benchmark datasets using all metrics. These results not only show the relative performance of existing methods but also highlight the varying difficulty levels of the datasets and the robustness and trustworthiness of the evaluation metrics. Finally, we put forward some recommendations for future research directions that are likely to push the boundaries of this research area.

From an algorithm design perspective, although LSTMs have shown competitive caption generation performance, the interpretability and intelligibility of the underlying models are low. Specifically, it is hard to determine how much the visual features have contributed to the generation of a specific word compared to the bias that comes naturally from the language model adopted. This problem is exacerbated when the aim is to diagnose the generation of erroneous captions. For example, when we see a caption “red fire hydrant” generated by a video description model from a frame containing a “white fire hydrant”, it is difficult to ascertain whether the color feature is incorrectly encoded by the visual feature extractor or whether the error is due to the bias of the language model towards “red fire hydrants”. Future research must focus on improving diagnostic mechanisms to pinpoint the problematic part of the architecture so that it can be improved or replaced.

Our survey shows that a major bottleneck hindering progress along this line of research is the lack of effective and purposely designed video description evaluation metrics. Current metrics have been adopted either from machine translation or image captioning and fall short in measuring the quality of machine generated video captions and their agreement with human judgments. One way to improve these metrics is to increase the number of reference sentences. We believe that purpose built metrics that are learned from the data itself are the key to advancing video description research.

Some challenges come from the diverse nature of the videos themselves. For instance, multiple activities in a video, where captions represent only some activities, could lead to low video description performance of a model. Similarly, longer duration videos pose further challenges since most action features, such as trajectory features and C3D features [158], can only encode short term actions and are dependent on video segment lengths. Most feature extractors are suitable only for static or smoothly changing images and hence struggle to handle abrupt scene changes. Current methods simplify the visual encoding part by representing videos or frames holistically. Attention models may need to be explored further to focus on spatially and temporally significant parts of the video. Similarly, temporal modeling of the visual features itself is quite rudimentary in existing methods. Most methods either use mean pooling, which completely discards the temporal information, or use the C3D model, which can only model 15 frames. Future research should focus on designing better temporal modeling architectures that preferably learn in an end-to-end fashion rather than disentangling the visual description from the temporal model and the temporal modeling from language description.
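The loss of temporal information under mean pooling can be seen in a minimal sketch. The frame features below are made-up two-dimensional vectors (real frame features are typically high-dimensional CNN activations), and `mean_pool` is a hypothetical helper written for this illustration.

```python
def mean_pool(frames):
    """Average per-frame feature vectors into a single clip-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


# Toy per-frame features for a three-frame clip (hypothetical values).
forward = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# The same frames played in reverse order.
backward = list(reversed(forward))

pooled_forward = mean_pool(forward)
pooled_backward = mean_pool(backward)

print(pooled_forward)   # clip representation of the original clip
print(pooled_backward)  # clip representation of the reversed clip
print(forward != backward and pooled_forward == pooled_backward)
```

The two clips differ in frame order, yet their pooled representations coincide, so any downstream language model receives no cue to distinguish, for example, an action from its reverse.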


The authors acknowledge Marcus Rohrbach (Facebook AI Research) for his valuable input. The research was supported by ARC Discovery Grant DP160101458 and DP150102405.


  • [1]
  • [2] TREC Video Retrieval Evaluation (TRECVID) Challenge.
  • [3] Casting words transcription service, 2014.
  • [4] Describing and Understanding Video and the Large Scale Movie Description Challenge (LSMDC), 2015.
  • [5] Microsoft Research - Video to Text (MSR-VTT) Challenge, 2016.
  • [6] Activity Net Captions Challenge, Evaluations, 2017.
  • [7] Activity Net Captions Challenge, Results, 2017.
  • [8] Activity Net Captions Challenge, Task5: Dense-Captioning Events in Videos, 2017.
  • [9] Activity Net Challenge, 2017.
  • [10] Language in Vision, 2017.
  • [11] The Large Scale Movie Description Challenge (LSMDC) Online Evaluations, 2017.
  • [12] The Large Scale Movie Description Challenge (LSMDC) Online Results, 2017.
  • [13] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani and A. Mian. 2019. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. In IEEE CVPR.
  • [14] J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien. 2016. Unsupervised learning from narrated instruction videos. In IEEE CVPR.
  • [15] P. Anderson, B. Fernando, M. Johnson, and S. Gould. 2016. Spice: Semantic propositional image caption evaluation. In IEEE ECCV.
  • [16] B. Andrei, E. Georgios, H. Daniel, M. Krystian, N. Siddharth, X. Caiming, and Z. Yibiao. 2015. A Workshop on Language and Vision at CVPR 2015.
  • [17] B. Andrei, M. Tao, N. Siddharth, Z. Quanshi, S. Nishant, L. Jiebo, and S. Rahul. 2018. A Workshop on Language and Vision at CVPR 2018.
  • [18] R. Anna, T. Atousa, R. Marcus, P. Christopher, L. Hugo, C. Aaron, and S. Bernt. 2015. The Joint Video and Language Understanding Workshop at ICCV 2015.
  • [19] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. 2015. VQA: Visual question answering. In IEEE ICCV.
  • [20] G. Awad, J. Fiscus, M. Michel, D. Joy, W. Kraaij, A. F. Smeaton, G. Quénot, M. Eskevich, R. Aly, G. J. F. Jones, et al. 2016. Evaluating Video Search, Video Event Detection, Localization and Hyperlinking, TRECVID 2016.
  • [21] D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, (2014).
  • [22] N. Ballas, L. Yao, C. Pal, and A. Courville. 2015. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, (2015).
  • [23] S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. ACL workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization. 65-72.
  • [24] L. Baraldi, C. Grana, and R. Cucchiara. 2017. Hierarchical Boundary-Aware Neural Encoder for Video Captioning. In IEEE CVPR.
  • [25] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, et al. 2012. Video in sentences out. arXiv preprint arXiv:1204.2742, (2012).
  • [26] K. Barnard, P. Duygulu, D. Forsyth, N. D. Freitas, D. M. Blei, and M. I. Jordan. 2003. Matching words and pictures. Journal of Machine Learning Research 3, Feb (2003), 1107-1135.
  • [27] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller, and D. A. Forsyth. 2004. Names and faces in the news. In IEEE CVPR.
  • [28] A. F. Bobick and A. D. Wilson. 1997. A state-based approach to the representation and recognition of gesture. IEEE TPAMI 19, 12 (1997), 1325-1337.
  • [29] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, S. Huang, M. Huck, P. Koehn, Q. Liu, V. Logacheva, et al. 2017. 2nd Conference on Machine Translation. 169-214.
  • [30] G. Burghouts, H. Bouma, R. D. Hollander, S V. D. Broek, and K. Schutte. 2012. Recognition of 48 human behaviors from video. In Int. Symp. Optronics in Defense and Security, OPTRO.
  • [31] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In IEEE CVPR.
  • [32] M. Brand. 1997. The “Inverse hollywood problem”: from video to scripts and storyboards via causal analysis. In AAAI/IAAI. Citeseer, 132-137.
  • [33] R. Chaudhry, A. Ravichandran, G. Hager, R. Vidal. 2009. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, CVPR 2009.
  • [34] D. Chen and W. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In ACL: Human Language Technologies-Volume 1. ACL, 190-200.
  • [35] D. Chen, W. Dolan, S. Raghavan, T. Huynh, and R. Mooney. 2010. Collecting highly parallel data for paraphrase evaluation. In JAIR: - Volume 37. ACL, 397-435.
  • [36] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, (2015).
  • [37] Y. Chen, S. Wang, W. Zhang, and Q. Huang. 2018. Less Is More: Picking Informative Frames for Video Captioning. arXiv preprint arXiv:1803.01457, (2018).
  • [38] K. Cho, B. V. Merriënboer, D. Bahdanau, and Y. Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, (2014).
  • [39] J. Corso. 2015. GBS: Guidance by Semantics-Using High-Level Visual Inference to Improve Vision-Based Mobile Robot Localization. Technical Report. State Univ of New York at Buffalo Amherst.
  • [40] N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on CVPR.
  • [41] N. Dalal, B. Triggs, and C. Schmid. 2006. Human detection using oriented histograms of flow and appearance. In IEEE ECCV.
  • [42] P. Das, C. Xu, R. F. Doell, and J. J. Corso. 2013. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In IEEE CVPR.
  • [43] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, and D. Batra. 2017. Visual Dialog. In IEEE CVPR.
  • [44] J. Deng, K. Li, M. Do, H. Su, and L. Fei-Fei. 2009. Construction and analysis of a large scale image ontology. Vision Sciences Society 186, 2 (2009).
  • [45] D. Ding, F. Metze, S. Rawat, P. F. Schulam, S. Burger, E. Younessian, L. Bao, M. G. Christel, and A. Hauptmann. 2012. Beyond audio and video retrieval: towards multimedia summarization. In 2nd ACM International Conference on Multimedia Retrieval (ICMR).
  • [46] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. 2015. Long-term RCNN for visual recognition and description. In IEEE CVPR.
  • [47] J. Dong, X. Li, W. Lan, Y. Huo, and C. G. M. Snoek. 2016. Early embedding and late reranking for video captioning. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 1082-1086.
  • [48] D. Elliott and F. Keller. 2014. Comparing automatic evaluation measures for image description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Short Papers, Vol. 452. 457.
  • [49] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010. The pascal visual object classes (voc) challenge. IJCV 88, 2 (2010), 303-338.
  • [50] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. 2015. From captions to visual concepts and back. In IEEE CVPR.
  • [51] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. 2010. Every picture tells a story: Generating sentences from images. In IEEE ECCV.
  • [52] C. Fellbaum. 1998. WordNet. Wiley Online Library.
  • [53] P. Felzenszwalb, D. McAllester, and D. Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. In IEEE CVPR.
  • [54] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. 2010. Cascade object detection with deformable part models. In IEEE CVPR.
  • [55] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. 2010. Object detection with discriminatively trained part-based models. IEEE TPAMI 32, 9 (2010), 1627-1645.
  • [56] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. 2017. Semantic Compositional Networks for visual captioning. In IEEE CVPR.
  • [57] A. George, B. Asad, F. Jonathan, J. David, D. Andrew, M. Willie, M. Martial, S. Alan, G. Yvette, and K. Wessel. 2017. TRECVID 2017: Evaluating Ad-hoc and Instance Video Search, Events Detection, Video Captioning, and Hyperlinking. In Proceedings of TRECVID 2017.
  • [58] B. Ghanem, J. Niebles, C. Snoek, F. Heilbron, H. Alwassel, R. Khrisna, V. Escorcia, K. Hata, and S. Buch. 2017. ActivityNet Challenge 2017 Summary. arXiv preprint arXiv:1710.08011, (2017).
  • [59] S. Gella, M. Lewis, and M. Rohrbach. 2018. A Dataset for Telling the Stories of Social Media Videos. In Proc of the 2018 Conference on Empirical Methods in Natural Language Processing. 968–974.
  • [60] S. Gong and T. Xiang. 2003. Recognition of group activities using dynamic probabilistic networks. In IEEE ICCV.
  • [61] Y. Graham, G. Awad, and A. Smeaton. 2017. Evaluation of Automatic Video Captioning Using Direct Assessment. arXiv preprint arXiv:1710.10586, (2017).
  • [62] Y. Graham, T. Baldwin, A. Moffat, and J. Zobel. 2017. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering 23, 1 (2017), 3-30.
  • [63] A. Graves and N. Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1764-1772.
  • [64] A. Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In IEEE Int Conference on Acoustics, Speech and Signal Processing (ICASSP). 6645-6649.
  • [65] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. 2013. Recognizing and describing activities using semantic hierarchies and zero-shot recognition. In IEEE ICCV.
  • [66] S. Guadarrama, L. Riano, D. Golland, D. Go, Y. Jia, D. Klein, P. Abbeel, T. Darrell, et al. 2013. Grounding spatial relations for human-robot interaction. In Intelligent Robots and Systems (IROS). 1640-1647.
  • [67] A. Hakeem, Y. Sheikh, and M. Shah. 2004. CASE: A Hierarchical Event Representation for the Analysis of Videos. In AAAI. 263-268.
  • [68] L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese. 2013. UMBC-EBIQUITY-CORE: semantic textual similarity systems. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Vol. 1. 44-52.
  • [69] P. Hanckmann, K. Schutte, and G. J. Burghouts. 2012. Automated textual descriptions for a wide range of video events with 48 human actions. In IEEE ECCV.
  • [70] D. Harwath, A. Recasens, D. Suris, G. Chuang, A. Torralba, and J. Glass. 2018. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input. In IEEE ECCV.
  • [71] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In IEEE CVPR.
  • [72] S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735-1780.
  • [73] S. Hongeng, F. Brémond, and R. Nevatia. 2000. Bayesian framework for video surveillance application. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, Vol. 1. IEEE, 164-170.
  • [74] D. A. Hudson and C. D. Manning. 2018. Compositional Attention Networks for Machine Reasoning. In ICLR.
  • [75] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675-678.
  • [76] Q. Jin, J. Chen, S. Chen, Y. Xiong, and A. Hauptmann. 2016. Describing videos using multi-modal fusion. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 1087-1091.
  • [77] J. Johnson, B. Hariharan, L. V. D. Maaten, L. Fei-Fei, C. L. Zitnick, R. Girshick. 2017. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In IEEE CVPR.
  • [78] M. U. G. Khan and Y. Gotoh. 2012. Describing video contents in natural language. In Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. ACL, 27-35.
  • [79] M. U. G. Khan, L. Zhang, and Y. Gotoh. 2011. Human focused video description. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops).
  • [80] M. Kilickaya, A. Erdem, N. Ikizler-Cinbis, and E. Erdem. 2016. Re-evaluating automatic metrics for image captioning. arXiv preprint arXiv:1612.07600, (2016).
  • [81] J. Kim, A. Rohrbach, T. Darrell, J. Canny, Z. Akata. 2018. Textual Explanations for Self-Driving Vehicles, ECCV 2018.
  • [82] W. Kim, J. Park, and C. Kim. 2010. A novel method for efficient indoor–outdoor image classification. Journal of Signal Processing Systems 61, 3 (2010), 251-258.
  • [83] R. Kiros, R. Salakhutdinov, and R. S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, (2014).
  • [84] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. ACL, 177-180.
  • [85] A. Kojima, T. Tamura, and K. Fukunaga. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. IJCV 50, 2 (2002), 171-184.
  • [86] D. Koller, N. Heinze, and H. Nagel. 1991. Algorithmic characterization of vehicle trajectories from image sequences by motion verbs. In IEEE Computer Society Conference on CVPR. 90-95.
  • [87] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. 2017. Dense-Captioning Events in Videos. arXiv:1705.00754, (2017).
  • [88] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. 2013. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. In AAAI, Vol. 1. 2.
  • [89] A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105.
  • [90] P. Kuchi, P. Gabbur, P. S. Bhat, and S. S. David. 2002. Human face detection and tracking using skin color modeling and connected component operators. IETE Journal of Research 48, 3-4 (2002), 289–293.
  • [91] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2011. Baby talk: Understanding and generating image descriptions. In IEEE CVPR.
  • [92] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems. 3675-3683.
  • [93] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning (ICML).
  • [94] I. Langkilde-Geary and K. Knight. Halogen input representation.
  • [95] M. W. Lee, A. Hakeem, N. Haering, and S. Zhu. 2008. Save: A framework for semantic annotation of visual events. In IEEE Computer Society Conference on CVPR Workshops. 1-8.
  • [96] L. Li and B. Gong. 2018. End-to-End Video Captioning with Multitask Reinforcement Learning. arXiv preprint arXiv:1803.07950, (2018).
  • [97] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. 2011. Composing simple image descriptions using web-scale n-grams. In Conference on Computational Natural Language Learning (CoNLL).
  • [98] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei. 2018. Jointly Localizing and Describing Events for Dense Video Captioning. In IEEE CVPR.
  • [99] C. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
  • [100] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common Objects in Context. In IEEE ECCV.
  • [101] Y. Liu and Z. Shi. 2016. Boosting video description generation by explicitly translating from frame-level captions. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 631-634.
  • [102] D. G. Lowe. 1999. Object recognition from local scale-invariant features. In IEEE ICCV.
  • [103] I. Maglogiannis, D. Vouyioukas, and C. Aggelopoulos. 2009. Face detection and recognition of natural human emotion using Markov random fields. Personal and Ubiquitous Computing, Vol. 13, 1, 95-101.
  • [104] M. Malinowski and M. Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems. 1682-1690.
  • [105] J. Mao, X. Wei, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille. 2015. Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In IEEE ICCV.
  • [106] M. Mitchell, I. Misra, T.-H. Huang, and F. Ferraro. 2018. Storytelling Workshop and Visual Storytelling Challenge at NAACL 2018.
  • [107] C. Matuszek, D. Fox, and K. Koscher. 2010. Following directions using statistical machine translation. In 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI).
  • [108] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111-3119.
  • [109] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, (2013).
  • [110] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, Vol. 518, 7540, 529.
  • [111] D. Moore and I. Essa. 2002. Recognizing multitasked activities from video using stochastic context-free grammar. In AAAI/IAAI. 770-776.
  • [112] R. Nevatia, J. Hobbs, and B. Bolles. 2004. An ontology for video event representation. In CVPR Workshop. 119-119.
  • [113] F. Nishida and S. Takamatsu. 1982. Japanese-English translation through internal expressions. In Proceedings of the 9th conference on Computational linguistics-Volume 1. Academia Praha, 271-276.
  • [114] F. Nishida, S. Takamatsu, T. Tani, and T. Doi. 1988. Feedback of correcting information in post editing to a machine translation system. In Proceedings of the 12th conference on Computational linguistics-Volume 2. ACL, 476-481.
  • [115] A. Owens and A. A. Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. In IEEE ECCV.
  • [116] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. 2016. Hierarchical recurrent neural encoder for video representation with application to captioning. In IEEE CVPR.
  • [117] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In IEEE CVPR.
  • [118] Y. Pan, T. Yao, H. Li, and T. Mei. 2017. Video Captioning With Transferred Semantic Attributes. In IEEE CVPR.
  • [119] K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on ACL. 311-318.
  • [120] R. Pasunuru and M. Bansal. 2017. Reinforced video captioning with entailment rewards. arXiv preprint arXiv:1708.02300, (2017).
  • [121] S. Phan, G. E. Henter, Y. Miyao, and S. Satoh. 2017. Consensus-based Sequence Training for Video Captioning. arXiv preprint arXiv:1712.09532, (2017).
  • [122] C. S. Pinhanez and A. F. Bobick. 1998. Human action detection using PNF propagation of temporal constraints. In IEEE Computer Society Conference on CVPR.
  • [123] C. Pollard and I. A. Sag. 1994. Head-driven phrase structure grammar. University of Chicago Press.
  • [124] J. Qin, C. Shizhe, C. Jia, Chen, and H. Alexander. 2017. RUC-CMU: System Descriptions for the Dense Video Captioning Task. arXiv preprint arXiv:1710.08011, (2017).
  • [125] V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks, M. Rohrbach, and K. Saenko. 2016. Multimodal video description. In Proceedings of ACM on Multimedia Conference. ACM, 1092-1096.
  • [126] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, Vol. 1, 25-36.
  • [127] E. Reiter and R. Dale. 2000. Building natural language generation systems. Cambridge university press.
  • [128] M. Ren, R. Kiros, and R. Zemel. 2015. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems. 2953-2961.
  • [129] Z. Ren, X. Wang, N. Zhang, X. Lv, and L. Li. 2017. Deep reinforcement learning-based image captioning with embedding reward. arXiv preprint arXiv:1704.03899, (2017).
  • [130] S. Robertson. 2004. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of documentation, Vol. 60, 5, 503–520.
  • [131] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. 2014. Coherent multi-sentence video description with variable level of detail. In German Conference on Pattern Recognition.
  • [132] A. Rohrbach, M. Rohrbach, and B. Schiele. 2015. The long-short story of movie description. In German Conference on Pattern Recognition. 209-221.
  • [133] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. 2015. A dataset for movie description. In IEEE CVPR.
  • [134] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. 2017. Movie description. IJCV, Vol. 123, 1, 94-120.
  • [135] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. 2012. A database for fine grained activity detection of cooking activities. In IEEE CVPR.
  • [136] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. 2013. Translating video content to natural language descriptions. In IEEE ICCV.
  • [137] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele. 2012. Script data for attribute-based recognition of composite activities. In IEEE ECCV.
  • [138] D. Roy. 2005. Semiotic schemas: A framework for grounding language in action and perception. Artificial Intelligence, Vol. 167(1-2), 170-205.
  • [139] D. Roy and E. Reiter. 2005. Connecting Language to the World. Artificial Intelligence, Vol. 167(1-2), 1-12.
  • [140] Y. Rubner, C. Tomasi, and L. J. Guibas. 2000. The earth mover’s distance as a metric for image retrieval. IJCV, Vol. 40, 2, 99-121.
  • [141] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. 2015. ImageNet large scale visual recognition challenge. IJCV, Vol. 115, 3, 211-252.
  • [142] M. Schuster and K. K. Paliwal. 1997. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing, Vol. 45, 11, 2673-2681.
  • [143] Z. Shen, J. Li, Z. Su, M. Li, Y. Chen, Y. Jiang, and X. Xue. 2017. Weakly Supervised Dense Video Captioning. In IEEE CVPR.
  • [144] R. Shetty and J. Laaksonen. 2016. Frame-and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 1073-1076.
  • [145] J. Shi and C. Tomasi. 1994. Good features to track. In IEEE CVPR.
  • [146] A. Shin, K. Ohnishi, and T. Harada. 2016. Beyond caption to narrative: Video captioning with multiple sentences. In IEEE International Conference on Image Processing (ICIP).
  • [147] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In IEEE ECCV.
  • [148] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, (2014).
  • [149] N. Srivastava, E. Mansimov, and R. Salakhutdinov. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning (ICML). 843-852.
  • [150] C. Sun and R. Nevatia. 2014. Semantic aware video transcription using random forest classifiers. In IEEE ECCV.
  • [151] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104-3112.
  • [152] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In IEEE CVPR.
  • [153] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy. 2011. Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation. In AAAI.
  • [154] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. 2014. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild. In Coling, Vol. 2, 5, 9.
  • [155] C. Tomasi and T. Kanade. 1991. Detection and tracking of point features.
  • [156] A. Torabi, C. Pal, H. Larochelle, and A. Courville. 2015. Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070, (2015).
  • [157] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. 2003. Context-based vision system for place and object recognition. In IEEE ICCV.
  • [158] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. 2014. C3D: Generic Features for Video Analysis. CoRR abs/1412.0767, (2014).
  • [159] R. Vedantam, C. L. Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In IEEE CVPR.
  • [160] S. Venugopalan, L. A. Hendricks, R. Mooney, and K. Saenko. 2016. Improving LSTM-based video description with linguistic knowledge mined from text. arXiv preprint arXiv:1604.01729, (2016).
  • [161] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. 2015. Sequence to sequence-video to text. In IEEE ICCV.
  • [162] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. 2014. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, (2014).
  • [163] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. 2017. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, (2017).
  • [164] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In IEEE CVPR.
  • [165] P. Viola and M. Jones. 2001. Rapid object detection using a boosted cascade of simple features. In IEEE CVPR.
  • [166] H. de Vries, K. Shuster, D. Batra, D. Parikh, J. Weston, and D. Kiela. 2018. Talk the Walk: Navigating New York City through Grounded Dialogue. CoRR abs/1807.03367, (2018).
  • [167] B. Wang, L. Ma, W. Zhang, and W. Liu. 2018. Reconstruction Network for Video Captioning. In IEEE CVPR.
  • [168] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. 2009. Evaluation of local spatio-temporal features for action recognition. In BMVC 2009-British Machine Vision Conference. BMVA Press, 124-1.
  • [169] J. Wang, W. Jiang, L. Ma, W. Liu, and Y. Xu. 2018. Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning. In IEEE CVPR.
  • [170] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan. 2018. M3: Multimodal Memory Modelling for Video Captioning. In IEEE CVPR.
  • [171] J. K. Wang and R. Gaizauskas. 2016. Cross-validating Image Description Datasets and Evaluation Metrics. In Proceedings of 10th Language Resources and Evaluation Conference. European Language Resources Association, 3059-3066.
  • [172] X. Wang, W. Chen, J. Wu, Y. Wang, and W. Y. Wang. 2017. Video Captioning via Hierarchical Reinforcement Learning. arXiv preprint arXiv:1711.11135, (2017).
  • [173] X. Wu, G. Li, Q. Cao, Q. Ji, and L. Lin. 2018. Interpretable Video Captioning via Trajectory Structured Localization. In IEEE CVPR.
  • [174] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. 2016. Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources. In IEEE CVPR.
  • [175] H. Xu, B. Li, V. Ramanishka, L. Sigal, and K. Saenko. 2018. Joint Event Detection and Description in Continuous Video Streams. arXiv preprint arXiv:1802.10250, (2018).
  • [176] H. Xu, S. Venugopalan, V. Ramanishka, M. Rohrbach, and K. Saenko. 2015. A multi-scale multiple instance video description network. arXiv preprint arXiv:1505.05914, (2015).
  • [177] J. Xu, T. Mei, T. Yao, and Y. Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In IEEE CVPR.
  • [178] R. Xu, C. Xiong, W. Chen, and J. J. Corso. 2015. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. In AAAI, Vol. 5, 6.
  • [179] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. 2015. Describing videos by exploiting temporal structure. In IEEE ICCV.
  • [180] T. Yao, Y. Li, Z. Qiu, F. Long, Y. Pan, D. Li, and T. Mei. 2017. MSR Asia MSM at ActivityNet Challenge 2017: Trimmed Action Recognition, Temporal Action Proposals and Dense-Captioning Events in Videos.
  • [181] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. ACL, Vol. 2, 67-78.
  • [182] H. Yu and J. M. Siskind. 2013. Grounded Language Learning from Video Sentences. In ACL(1). 53-63.
  • [183] H. Yu and J. M. Siskind. 2015. Learning to Describe Video with Weak Supervision by Exploiting Negative Sentential Information. In AAAI. 3855-3863.
  • [184] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In IEEE CVPR.
  • [185] L. Yu, E. Park, A. C. Berg, and T. L. Berg. 2015. Visual madlibs: Fill in the blank description generation and question answering. In IEEE ICCV.
  • [186] Y. Yu, J. Choi, Y. Kim, K. Yoo, S. Lee, and G. Kim. 2017. Supervising Neural Attention Models for Video Captioning by Human Gaze Data. In IEEE CVPR.
  • [187] Y. Yu, H. Ko, J. Choi, and G. Kim. 2016. End-to-end concept word detection for video captioning, retrieval, and question answering. arXiv preprint arXiv:1610.02947, (2016).
  • [188] K. Zeng, T. Chen, J. C. Niebles, and M. Sun. 2016. Title Generation for User Generated Videos. In IEEE ECCV.
  • [189] X. Zhang, K. Gao, Y. Zhang, D. Zhang, J. Li, and Q. Tian. 2017. Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description. In IEEE CVPR.
  • [190] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach. 2018. Grounded Video Description. arXiv preprint arXiv:1812.06587, (2018).
  • [191] L. Zhou, C. Xu, and J. J. Corso. 2018. Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [192] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong. 2018. End-to-End Dense Video Captioning with Masked Transformer. In IEEE CVPR.
  • [193] S. Zhu and D. Mumford. 2007. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, Vol. 2, 4, 259-362.