
Unified Embedding and Metric Learning for Zero-Exemplar Event Detection

Event detection in unconstrained videos is conceived as content-based video retrieval with two modalities: textual and visual. Given a text describing a novel event, the goal is to rank related videos accordingly. This task is zero-exemplar: no video examples are given for the novel event. Related works train a bank of concept detectors on external data sources. These detectors predict confidence scores for test videos, which are ranked and retrieved accordingly. In contrast, we learn a joint space in which the visual and textual representations are embedded. The space casts a novel event as a probability of pre-defined events. Also, it learns to measure the distance between an event and its related videos. Our model is trained end-to-end on the publicly available EventNet. When applied to the TRECVID Multimedia Event Detection dataset, it outperforms the state-of-the-art by a considerable margin.


1 Introduction

TRECVID Multimedia Event Detection (MED) [1, 2] is a retrieval task for event videos, with the reputation of being realistic. It comes in two flavors: few-exemplar and zero-exemplar, where the latter means that no video example is known to the model. Although expecting a few examples seems reasonable, in practice this implies that the user must already have an index of any possible query, making it very limited. In this paper, we focus on event video search with zero exemplars.

Retrieving videos of never-seen events, such as "renovating home", without any video exemplar poses several challenges. One challenge is how to bridge the gap between the visual and the textual semantics [3, 4, 5]. One approach [3, 6, 7, 8, 9, 10] is to learn a dictionary of concept detectors on external data sources. Then, scores for test videos are predicted using these detectors, and the test videos are ranked and retrieved accordingly. The inherent weakness of this approach is that the representation of a test video is reduced to a limited vocabulary from the concept dictionary. Another challenge is how to overcome the domain difference between training and test events. While Semantic Query Generation (SQG) [3, 8, 9, 11] mitigates this challenge by extracting keywords from the event query, it does not address how relevant these keywords are to the event itself. For example, the keyword "person" is not as relevant to the event "car repair" as it is to "flash mob gathering".

Figure 1:

We pose the problem of zero-exemplar event detection as learning from a repository of pre-defined events. Given video exemplars of events "removing drywall" or "fit wall tiles", one may detect a novel event "renovate home" as a probability distribution over the pre-defined events.

Our key observation for zero-exemplar events is that they generally have strong semantic correlations [12, 13] with other, possibly seen, events. For instance, the novel event "renovating home" is related to "fit wall tiles", "remove drywall", or even to "paint door". Novel events can, therefore, be cast onto a repository of prior events, for which knowledge sources in various forms are available beforehand, such as videos, as in EventNet [14], or articles, as in WikiHow [15]. Not only do these sources provide video examples of a large, but still limited, set of events, they also provide an association between the text description of an event and its corresponding videos. A text article can describe the event in words: what it is about, what its details are, and what its semantics are. We note that such a visual-textual repository of events may serve as a knowledge source by which we can interpret novel event queries.

For Zero-exemplar Event Detection (ZED), we propose a neural model with the following novelties:

  1. We formulate a unified embedding for multiple modalities (e.g. visual and textual) that enables a contrastive metric for maximum discrimination between events.

  2. A textual embedding poses the representation of a novel event as a probability distribution over predefined events, such that it spans a much larger space of admissible expressions.

  3. We exploit a single data source, comprising pairs of event articles and related videos. Such a single source enables end-to-end learning from individual multi-modal pairs.

We empirically show that our novelties result in performance improvements. We evaluate the model on TRECVID Multimedia Event Detection (MED) 2013 [1] and 2014 [2]. Our results show a significant improvement over the state-of-the-art.

2 Related Work

Figure 2: Three families of methods for zero-exemplar event detection: (a), (b) and (c). They build on top of feature representations learned a priori (i.e. initial representations), such as CNN features for a video or word2vec features for an event text query. In a post-processing step, the distance is measured between the embedded features. In contrast, our model falls into a new family, depicted in (d), for it learns a unified embedding with a metric loss using a single data source.

We identify three families of methods for ZED, as in figure 2 (a), (b) and (c).

Visual Embedding and Textual Retrieval. As in figure 2(a), given a video represented by a visual feature and a related text represented by a textual feature, a visual model is trained to project the video feature such that its distance to the text representation is minimized. At test time, video ranking and retrieval are done using a distance metric between the projected test video and the test query representation.

[16, 17] project the visual feature of a web video into a term-vector representation of the video's textual title. However, during training, the model makes use of the text queries of the test events to learn a better term-vector representation. Consequently, this limits the generalization to novel event queries.

Textual Embedding and Visual Retrieval. As in figure 2(b), a given text query is projected, using a pre-trained or learned language model, into a space in which videos can be scored and retrieved.

[18] makes use of freely available, weakly tagged web videos. It then propagates tags to a test video from its nearest neighbors. Methods [7, 8, 9, 10, 3] follow a similar approach. Given a text query, Semantic Query Generation (SQG) extracts the concepts most related to the test query. Then, pre-trained concept detectors predict probability scores for a test video. Aggregating these probabilities results in the final video score, upon which videos are ranked and retrieved. [9] learns a weighted average of the concept scores.

The shortcoming of this family is that expressing a video as probability scores of a few concepts under-represents it. Any concept that exists in the video but is missing from the concept dictionary is thus unrepresented.
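To make the mechanics of this family concrete, the following is a minimal sketch of generic concept-based scoring; it is an illustration under our own assumptions (the vocabulary, the word-similarity function and the detector scores are placeholders), not the exact procedure of [7, 8, 9, 10, 3].

```python
import numpy as np

def concept_based_score(query_keywords, vocabulary, word_sim, detector_scores, k=5):
    """Generic concept-based retrieval score for one video.

    query_keywords : keywords extracted from the event query (the SQG step).
    vocabulary     : list of concept names covered by the detector bank.
    word_sim       : callable (word, concept) -> semantic similarity, e.g. word2vec cosine.
    detector_scores: dict concept -> probability predicted for the video.
    """
    # 1) Select the k concepts most related to the query keywords.
    relevance = [max(word_sim(w, c) for w in query_keywords) for c in vocabulary]
    top = np.argsort(relevance)[-k:]

    # 2) Aggregate the detector probabilities of the selected concepts
    #    (a plain average here; [9] learns a weighted average instead).
    return float(np.mean([detector_scores[vocabulary[i]] for i in top]))
```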

Visual-Textual Embedding and Semantic Retrieval. As in figure 2(c), visual and textual models are trained to project both the visual and the textual features into a common semantic space. At test time, the ranking score is the distance between the projections in that semantic space.

[19] projects video concepts into a high-dimensional lexicon space. Separately, it projects concept-based features into the same space, which overcomes the lexicon mismatch between the query and the video concepts.

[20] embeds a fusion of low- and mid-level visual features into a distributional semantic manifold [21, 22]. In a separate step, it embeds text-based concepts into the manifold.

Figure 3: Model overview. We use a dataset of event categories and videos, where each event has a text article and a few videos. Given a video with a text title, belonging to an event with an article, we extract their features. At the top, a textual network learns to classify the title feature into one of the event categories. In the middle, we borrow this network to embed the event article's feature. Then, at the bottom, a visual network learns to embed the video feature such that the distance between the two embeddings is minimized in the learned metric space.

The third family, see figure 2(c), is superior to the others, see figure 2(a), (b). However, one drawback of [19, 20] is that the visual and the textual features are embedded separately. This leads to another drawback: the distance between the embedded features has to be measured in a post-processing step (e.g. cosine similarity).

Unified Embedding and Metric Learning Retrieval. Our method falls into a new family, see figure 2(d), and it overcomes the shortcomings of [19, 20] as follows. It is trained on a single data source, enabling a unified embedding of features from multiple modalities into a metric space. Consequently, the distance between the embedded features is measured by the model itself in the learned metric space.

Auxiliary Methods. Independent of the previous works, the following techniques have been used to improve results: self-paced reranking [23], pseudo-relevance feedback [24], manual intervention in the event query [25], early fusion of features (action [26, 27, 28, 29, 30] or acoustic [31, 32, 33]) or late fusion of concept scores [17]. All these contributions may be applied on top of our method.

Visual Representation. ConvNets [34, 35, 36, 37] provide frame-level representations. To turn them into a video-level counterpart, the literature uses: (i) frame-level filtering [38], (ii) vector encoding [39, 40], (iii) learned pooling and recounting [10, 41], (iv) average pooling [16, 17]. Also, low-level action [28, 29], mid-level action [26, 27] or acoustic [31, 32, 33] features can be used. Textual Representation. To represent text, the literature uses: (i) sequential models [42], (ii) continuous word-space representations [22, 43], (iii) topic models [44, 45], (iv) dictionary-space representations [17].

3 Method

3.1 Overview

Our goal is zero-exemplar retrieval of event videos with respect to their relevance to a novel textual description of an event. More specifically, for the zero-exemplar video dataset $\mathcal{V}$ and given any future, textual event description $q$, we want to learn a model $r(\cdot, \cdot;\, \theta)$ that ranks the videos according to their relevance to $q$, namely:

$\mathcal{R}(\mathcal{V} \mid q) = \operatorname{argsort}_{\,v_j \in \mathcal{V}}\; r(v_j, q;\, \theta)$   (1)

3.2 Model

Since we focus on the zero-exemplar setting, we cannot expect any training data directly relevant to the test queries. As such, we cannot directly optimize our model for the parameters in eq. (3). In the absence of any direct data, we resort to external knowledge databases. More specifically, we propose to cast future novel query descriptions as a convex combination of known query descriptions in external databases, where we can measure their relevance to the database videos.

We start from a dataset organized by an event taxonomy, where we neither expect nor require the events to overlap with any future event queries. The dataset is composed of a set of events, where each event $e$ is associated with a textual article description $a_e$ that analyzes different aspects of the event, such as: (i) the typical appearance of subjects and objects, (ii) its procedures, and (iii) the steps towards completing the task associated with it. The dataset contains a total of $M$ videos, with $v_j$ denoting the $j$-th video in the dataset and $t_j$ its metadata, e.g. the title of the video. A video $v_j$ is associated with an event label $e_j$ and the article description $a_{e_j}$ of the event it belongs to. Since multiple videos belong to the same event, they share the article description of that event.

The ultimate goal of our model is zero-exemplar search for event videos. Namely, provided unknown text queries by the user, we want to retrieve those videos that are relevant. We illustrate our proposed model during training in figure 3. The model is composed of two components: a textual embedding $f_T$ and a visual embedding $f_V$. Our ultimate goal is the ranking of videos with respect to their relevance to a query description or, in pairwise terms, deciding for any two videos which one is more relevant to the query.

Let us assume a pair of videos $v_i$, $v_j$ and a query description $q$, where video $v_i$ is more relevant to the query than $v_j$. Our goal is a model that learns to put videos in the correct relative order, namely to rank $v_i$ before $v_j$. This is equivalent to a model that learns visual-textual embeddings such that $D_i < D_j$, where $D_i$ is the distance between the visual-textual embeddings of $v_i$ and $q$, and $D_j$ is the same for $v_j$. Since we want to compare distances between pairs, we pose the learning of our model as the minimization of a contrastive loss [46]:

$\mathcal{L}_{con}(v_j, q) = \tfrac{1}{2}\big[\, y_j\, D_j^2 + (1 - y_j)\, \max(0,\, m - D_j)^2 \,\big]$   (2)
$D_j = \lVert\, f_T(q;\, \theta_T) - f_V(v_j;\, \theta_V) \,\rVert_2$   (3)

where $f_T(q;\, \theta_T)$ is the projection of the query description into the unified metric space, parameterized by $\theta_T$, $f_V(v_j;\, \theta_V)$ is the projection of a video onto the same space, parameterized by $\theta_V$, $y_j$ is a target variable that equals 1 when the $j$-th video is relevant to the query description and 0 otherwise, and $m$ is a margin. Naturally, to optimize eq. (2), we first need to define the projections $f_T$ and $f_V$ in eq. (3).
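As a concrete illustration, a minimal PyTorch sketch of the contrastive objective of eq. (2) and (3); the margin value, the batch layout and the function name are our own assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, video_emb, y, margin=1.0):
    """Contrastive loss in the unified space (sketch of eq. 2-3).

    text_emb : (B, d) embedded query/article descriptions, f_T(q).
    video_emb: (B, d) embedded videos, f_V(v).
    y        : (B,) 1 if the video is relevant to the description, else 0.
    margin   : assumed margin value; the paper does not report it.
    """
    y = y.float()
    d = F.pairwise_distance(text_emb, video_emb)            # D_j, eq. (3)
    loss = 0.5 * (y * d.pow(2) +                             # pull relevant pairs together
                  (1 - y) * F.relu(margin - d).pow(2))       # push irrelevant pairs apart
    return loss.mean()
```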

Textual Embedding. The textual embedding component of our model, $f_T$, is illustrated in figure 3 (top). This component is dedicated to learning a projection of a textual input, including any future event query, onto the unified space. Before detailing $f_T$, however, we note that the textual embedding can be employed not only with event article descriptions, but also with any other textual information that might be associated with the dataset videos, such as textual metadata. Although we expect a video title not to be as descriptive as the associated article, it may still offer some discriminative information, as previously shown [16, 17], which can be associated with the event category.

We model the textual embedding as a shallow (two-layer) multi-layer perceptron (MLP). For the first layer we employ a ReLU nonlinearity. The second layer serves a dual purpose. First, it projects the article description of an event onto the unified space. This projection is category-specific, namely different videos that belong to the same event share the same projection. Second, it can project any video-specific textual metadata into the unified space. We therefore propose to also embed the title $t_j$, which is uniquely associated with a video $v_j$, not with an event category. To this end, we opt for a softmax nonlinearity for the second layer, followed by an additional logistic loss term that penalizes misprediction of the title with respect to the video's event label $e_j$, namely

$\mathcal{L}_{cls}(t_j, e_j) = -\log\, p\,(e_j \mid t_j;\, \theta_T)$   (4)
where $p(\cdot \mid t_j;\, \theta_T)$ is the softmax output of the second layer over the event categories.

Overall, the textual embedding is trained with a dual loss in mind. The first loss term, see eq. (2) and (3), takes care that the final network learns event-relevant textual projections. The second loss term, see eq. (4), takes care that the final network does not overfit to the particular event article descriptions. The latter is crucial because the event article descriptions in the training set will not overlap with the future event queries, since we are in a zero-exemplar retrieval setting. As such, training the textual embedding to be optimal only for these event descriptions would likely result in severe overfitting. Our goal and hope is that the final textual embedding model captures both event-aware and video-discriminative textual features.
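A sketch of how the textual branch could look, assuming the 2500-2500-500 layout reported in section 4.2; the use of PyTorch and the class and function names are our own choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextualEmbedding(nn.Module):
    """Two-layer textual MLP f_T: 2500-D LSI feature -> 500-D unified space."""

    def __init__(self, in_dim=2500, hid_dim=2500, out_dim=500):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hid_dim)   # first layer, ReLU
        self.fc2 = nn.Linear(hid_dim, out_dim)  # second layer, dual purpose

    def forward(self, x):
        h = F.relu(self.fc1(x))
        logits = self.fc2(h)
        # The softmax output serves both as the embedding in the unified space
        # and as a distribution over the event categories.
        return F.softmax(logits, dim=-1), logits

# Title classification term (sketch of eq. 4): cross-entropy of the logits
# against the video's event label.
def title_loss(logits, event_label):
    return F.cross_entropy(logits, event_label)
```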

Visual Embedding. The visual embedding component of our model, $f_V$, is illustrated in figure 3 (bottom). This component is dedicated to learning a projection from the visual input, namely the videos in our zero-exemplar dataset, into the unified metric space. The goal is to project videos belonging to semantically similar events into a similar region of the space. We model the visual embedding as a shallow (two-layer) multi-layer perceptron with tanh nonlinearities, applied to any visual feature of a video.
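The visual branch can be sketched analogously, assuming the 2048-2048-500 layout of section 4.2 and the tanh nonlinearities stated here (section 4.2 mentions ReLU for the hidden layer instead); again, names and framework are our assumptions.

```python
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    """Two-layer visual MLP f_V: 2048-D ResNet pool5 feature -> 500-D unified space."""

    def __init__(self, in_dim=2048, hid_dim=2048, out_dim=500):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hid_dim)
        self.fc2 = nn.Linear(hid_dim, out_dim)

    def forward(self, x):
        # tanh nonlinearities, as stated in this paragraph
        return torch.tanh(self.fc2(torch.tanh(self.fc1(x))))
```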

End-to-End Training. At each training forward pass, the model is given a triplet of data inputs: an event article description, a related video and the video's title. From eq. (2) and (3) we observe that the visual embedding is encouraged to minimize its distance to the output of the textual embedding. In the end, all the modules of the proposed model are differentiable. Therefore, we train our model in an end-to-end manner by minimizing the following objective

$\mathcal{L} = \mathcal{L}_{con} + \mathcal{L}_{cls}$   (5)

For the triplet input, we rely on external representations, since our ultimate goal is zero-exemplar search. Strictly speaking, a visual input is represented as a CNN [35] feature vector, while textual inputs are represented as LSI [45] or Doc2Vec [43] feature vectors. However, given that these external representations rely on neural network architectures, if needed, they could also be further fine-tuned. We choose to freeze the CNN and Doc2Vec modules to speed up training. Finally, in this paper we refer to our main model, with unified embedding, as the unified embedding model.
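Putting the pieces together, a hedged sketch of one training step minimizing eq. (5), building on the sketches above (f_T, f_V, contrastive_loss, title_loss); the unweighted sum of the two losses, the optimizer and the learning rate are our assumptions.

```python
import torch

# f_T, f_V: TextualEmbedding / VisualEmbedding instances as sketched above.
# Features are pre-extracted and frozen: x_article, x_title (2500-D LSI),
# x_video (2048-D ResNet pool5); y = 1 for matching pairs, 0 otherwise.
optimizer = torch.optim.Adam(list(f_T.parameters()) + list(f_V.parameters()), lr=1e-4)

def train_step(x_article, x_title, x_video, event_label, y):
    optimizer.zero_grad()
    article_emb, _ = f_T(x_article)     # borrow f_T to embed the event article
    _, title_logits = f_T(x_title)      # classify the video title (eq. 4)
    video_emb = f_V(x_video)            # embed the video (eq. 3)
    loss = contrastive_loss(article_emb, video_emb, y) + title_loss(title_logits, event_label)
    loss.backward()
    optimizer.step()
    return loss.item()
```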

Inference. After training, we fix the parameters $\theta_T$ and $\theta_V$. At test time, we set the relevance function $r$ from eq. (1) to be the distance function from eq. (3). Hence, at test time, we compute the Euclidean distance in the learned metric space between the embeddings of a test video and the novel event description, and rank the test videos in ascending order of this distance.
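At test time the ranking thus reduces to a nearest-neighbour search in the unified space; a minimal sketch, assuming pre-extracted LSI and ResNet features and the embedding modules sketched above.

```python
import numpy as np
import torch

def rank_videos(query_lsi, video_feats, f_T, f_V):
    """Rank test videos by Euclidean distance to the embedded novel query.

    query_lsi  : (2500,) LSI feature of the novel event description.
    video_feats: (N, 2048) ResNet pool5 features of the test videos.
    """
    with torch.no_grad():
        q, _ = f_T(torch.as_tensor(query_lsi, dtype=torch.float32).unsqueeze(0))
        v = f_V(torch.as_tensor(video_feats, dtype=torch.float32))
    dist = torch.cdist(q, v).squeeze(0).numpy()   # Euclidean distance in the metric space
    return np.argsort(dist)                        # most relevant videos first
```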

4 Experiments

4.1 Datasets

Before delving into the details of our experiments, first we describe the external knowledge sources we use.

Training dataset. We leverage videos and articles from publicly available datasets. EventNet [14] is a dataset of 90k event videos, harvested from YouTube and categorized into 500 events in hierarchical form according to the events' ontology. Each event category contains, on average, around 180 videos. Each video is coupled with a text title, a few tags and the related event's ontology.

We exploit the fact that all events in EventNet are harvested from WikiHow [15] – a website for How-To articles covering a wide spectrum of human activities. For instance: “How to Feed a Dog” or “How to Arrange Flowers”. Thus, we crawl WikiHow to get the articles related to all the events in EventNet.

Test dataset. As the task is zero-exemplar, the test sets are different from the training set. While EventNet serves as the training set, the following serve as the test sets: TRECVID MED-13 [1] and MED-14 [2]. In detail, they are datasets of event videos comprising about 27k videos. There are two versions, MED-13 and MED-14, with 20 events each. Since 10 events overlap, there are 30 different events in total. Each event is coupled with a short textual description (title and definition).

4.2 Implementation Details

Video Features. To represent a video, we uniformly sample a frame every second. Then, using ResNet [35], we extract pool5 CNN features for the sampled frames. We then average-pool the frame-level features to get the video-level feature. We experimented with different features from different CNN models: ResNet (prob, fc1000), VGG [37] (fc6, fc7), GoogLeNet [47] (pool5, fc1024), and Places365 [48] (fc6, fc7, fc8), and found ResNet pool5 to be the best. We only use ResNet pool5 and do not fuse multiple CNN features.
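A sketch of this video-level feature extraction, assuming frames have already been sampled at one frame per second and decoded as PIL images; the ResNet depth and the preprocessing values are our assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Frame-level CNN without the final classification layer -> 2048-D pool5 features.
resnet = models.resnet50(pretrained=True)   # the paper uses ResNet; the depth is our assumption
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def video_feature(frames):
    """frames: list of PIL images sampled at 1 fps. Returns a 2048-D video-level feature."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        feats = backbone(batch).flatten(1)   # (num_frames, 2048) pool5 features
    return feats.mean(dim=0)                 # average pooling over frames
```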

Text Features. We choose topic modeling [44, 45], as it is well suited for long (and sometimes noisy) text articles. We train an LSI topic model [45] on the Wikipedia corpus [49]. We experimented with different numbers of latent topics and found 2500 to be the best. We also experimented with other textual representations such as LDA [44], SkipThought [50] and Doc2Vec [43]. To extract a feature from an event article or a video title, we first preprocess the text using standard NLP steps: tokenization, lemmatization and stemming. Then we extract 2500-D LSI features for the article and the title, respectively. The same steps apply to MED text queries.
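A sketch of this LSI pipeline with gensim, assuming a pre-tokenized corpus; the tf-idf weighting is our assumption, and the lemmatization and stemming steps mentioned above are omitted for brevity.

```python
from gensim import corpora, models

def train_lsi(tokenized_docs, num_topics=2500):
    """Train an LSI topic model on a tokenized corpus (e.g. a Wikipedia dump)."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    tfidf = models.TfidfModel(bow_corpus)   # optional tf-idf weighting (our assumption)
    lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=num_topics)
    return dictionary, tfidf, lsi

def lsi_feature(text, dictionary, tfidf, lsi, num_topics=2500):
    """Project a text (event article, video title or MED query) onto the LSI space."""
    bow = dictionary.doc2bow(text.lower().split())   # crude tokenization for brevity
    dense = [0.0] * num_topics
    for topic_id, value in lsi[tfidf[bow]]:
        dense[topic_id] = value
    return dense
```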

Model Details. Our visual and textual embeddings are learned on top of the aforementioned visual and textual features. The textual embedding is a 1-hidden-layer MLP classifier with ReLU for the hidden layer, softmax for the output, a logistic loss, and 2500-2500-500 neurons for the input, hidden and output layers, respectively. Similarly, the visual embedding is a 1-hidden-layer MLP regressor with ReLU for the hidden layer, a contrastive loss, and 2048-2048-500 neurons for the input, hidden and output layers, respectively. Our code is made public (github.com/noureldien/unified_embedding) to support further research.

4.3 Textual Embedding

(a) LSI Features
(b) Embedded Features
Figure 4: Our textual embedding (b) maps MED to EventNet events better than LSI features. Each dot in the matrix shows the similarity between MED and EventNet events.

Here, we qualitatively demonstrate the benefit of the textual embedding. Figure 4 shows the similarity matrix between MED and EventNet events. Each dot represents how similar a MED event is to each EventNet event. It shows that our embedding (right) is better than LSI (left) at mapping MED to EventNet events. For example, LSI wrongly maps "9: getting a vehicle unstuck" to "256: launch a boat", while our embedding correctly maps it to "170: drive a car". Also, our embedding maps with higher confidence than LSI, as for "16: doing homework or study".

(a) LSI Features
(b) Embedded Features
Figure 5: For 20 events of MED-14, our textual embedding (right) is more discriminant than the LSI feature representation (left). Each dot in the matrix shows how similar an event is to all the others.

Figure 5 shows the similarity matrix for MED events, where each dot represents how related a MED event is to all the others. Our textual embedding (right) is more discriminant than the LSI feature representation (left). For example, the LSI representation shows a high semantic correlation between events "34: fixing musical instrument" and "40: tuning musical instrument", while our embedding discriminates between them.

(a) MED-13, visual-only embedding baseline.
(b) MED-13, separate embedding baseline.
(c) MED-13, unified embedding (ours).
(d) MED-14, visual-only embedding baseline.
(e) MED-14, separate embedding baseline.
(f) MED-14, unified embedding (ours).
Figure 6: We visualize the video embeddings of the unified embedding model and of the baseline models (visual-only and separate embeddings). Each sub-figure shows how discriminant the representation of the embedded videos is. Each dot represents a projected video, while each pentagon shape represents a projected event description. We use t-SNE to visualize the result.

Next, we quantitatively demonstrate the benefit of the textual embedding. In contrast to the main model, see section 3, we investigate a visual-only baseline, where we discard the textual embedding and consider only the visual embedding. We project a video onto the LSI representation of the related event article. Thus, this baseline falls into the first family of methods, see figure 2(a). It is optimized using a mean-squared error (MSE) loss, see eq. (6). The result of this baseline is reported in section 5, table 1.

$\mathcal{L}_{mse}(v_j) = \lVert\, f_V(v_j;\, \theta_V) - x_{e_j} \,\rVert_2^2$   (6)
where $x_{e_j}$ is the LSI representation of the article of the event that video $v_j$ belongs to.

We also train a second visual-only baseline, which is similar to the aforementioned one, except that instead of the MSE loss of eq. (6) it uses a contrastive loss, as follows:

$\mathcal{L}'_{con}(v_j) = \tfrac{1}{2}\big[\, y_j\, {D'_j}^2 + (1 - y_j)\, \max(0,\, m - D'_j)^2 \,\big], \quad D'_j = \lVert\, x_{e} - f_V(v_j;\, \theta_V) \,\rVert_2$   (7)
where $x_e$ is the LSI representation of an event article and $y_j = 1$ only when video $v_j$ belongs to event $e$.

4.4 Unified Embedding and Metric Learning

In this experiment, we demonstrate the benefit of the unified embedding. In contrast to our model presented in section 3, we investigate a separate-embedding baseline that does not learn a joint embedding. Instead, it learns the visual and the textual projections separately. We model each projection as a shallow (two-layer) MLP trained to classify its input into the 500 event categories, using a logistic loss, the same as eq. (4).

We conduct another experiment to demonstrate the benefit of learning a metric space. In contrast to our model presented in section 3, we investigate a non-metric baseline, where we discard the metric-learning layer. Consequently, this baseline learns the visual embedding as a shallow (two-layer) multi-layer perceptron with tanh nonlinearities. Also, we replace the contrastive loss of eq. (2) with a mean-squared error loss, namely

$\mathcal{L}''_{mse}(v_j) = \lVert\, f_T(a_{e_j};\, \theta_T) - f_V(v_j;\, \theta_V) \,\rVert_2^2$   (8)
where $a_{e_j}$ is the article of the event that video $v_j$ belongs to.

During retrieval, this baseline embeds a test video and a novel text query onto the common space using the visual and the textual embeddings, respectively. Then, in a post-processing step, the retrieval score for the video is the cosine distance between the two embeddings. All test videos are scored, ranked and retrieved accordingly. The results of the aforementioned separate-embedding and non-metric baselines are reported in table 1.

Comparing Different Embeddings. In the previous experiments, we investigated several baselines of the unified embedding model, namely the visual-only embedding, the separate visual-textual embedding and the non-metric visual-textual embedding. Here, we compare the results of these embeddings qualitatively. As shown in figure 6, we use these baselines to embed the event videos of the MED-13 and MED-14 datasets into the corresponding spaces. At the same time, we project the textual descriptions of the events onto the same spaces. Then, we use t-SNE [51] to visualize the result on a 2D manifold. As seen, the unified embedding, see sub-figures 6(c) and 6(f), learns more discriminant representations than the other baselines, see sub-figures 6(a), 6(b), 6(d) and 6(e). The same observation holds for both the MED-13 and MED-14 datasets.

4.5 Mitigating Noise in EventNet

Based on quantitative and qualitative analysis, we conclude that EventNet is noisy. Not only are the videos unconstrained, but some of the video samples are also irrelevant to their event categories. The EventNet dataset [14] is accompanied by a 500-category CNN classifier, which achieves top-1 and top-5 accuracies of 30.67% and 53.27%, respectively. Since events in EventNet are structured as an ontological hierarchy, there is a total of 19 high-level categories. The classifier achieves top-1 and top-5 accuracies of 38.91% and 57.67%, respectively, over these high-level categories.

Based on these observations, we prune EventNet to remove noisy videos. To this end, we first represent each video as the average pooling of its ResNet pool5 features. Then, we follow conventional 5-fold cross-validation with 5 rounds. For each round, we split the dataset into 5 subsets: 4 subsets for training and the last for pruning. We then train a 2-layer MLP for classification. After training, we forward-pass the videos of the held-out subset and rule out the misclassified ones.
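A sketch of this pruning protocol with scikit-learn, assuming the 2048-D video-level features of section 4.2; the MLP size and training settings are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

def prune_eventnet(X, y, n_splits=5):
    """Keep only videos whose held-out prediction matches their EventNet label.

    X: (N, 2048) video-level ResNet features; y: (N,) event labels (500 classes).
    """
    keep = np.zeros(len(y), dtype=bool)
    for train_idx, held_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        clf = MLPClassifier(hidden_layer_sizes=(2048,), max_iter=50)  # 2-layer MLP, settings assumed
        clf.fit(X[train_idx], y[train_idx])
        keep[held_idx] = clf.predict(X[held_idx]) == y[held_idx]      # rule out misclassified videos
    return keep
```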

The intuition behind pruning is that we would rather learn salient event concepts from fewer video samples than learn noisy concepts from more samples. Pruning reduced the total number of videos by 26%, from 90.2k to 66.7k. This pruned dataset is what we use in all our experiments.

4.6 Latent Topics in LSI

When training the LSI topic model on the Wikipedia corpus, a crucial parameter is the number of latent topics the model constructs. We observe that the performance improves as the number of latent topics increases. The main reason is that the larger the number of topics, the more discriminant the LSI feature. Figure 7 confirms this observation.

Figure 7: Similarity matrix between LSI features of MED-14 events. The more latent topics in the LSI model, the higher the feature dimension and the more discriminant the feature.

5 Results

Evaluation metric. Since we are addressing, in essence, an information retrieval task, we rely on the average precision (AP) per event, and mean average precision (mAP) per dataset. We follow the standard evaluation method as in the relevant literature  [1, 2, 52].
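For reference, a minimal sketch of this evaluation using scikit-learn's average precision; the official TRECVID scoring protocol [1, 2] may differ in its details.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(relevance_per_event, scores_per_event):
    """relevance_per_event[e]: binary ground-truth labels of all test videos for event e.
    scores_per_event[e]   : the model's ranking scores for the same videos."""
    aps = [average_precision_score(rel, scr)
           for rel, scr in zip(relevance_per_event, scores_per_event)]
    return float(np.mean(aps)), aps   # mAP per dataset, AP per event
```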

Comparing against model baselines. In table 1, we report the mAP scores of our model baselines, previously discussed in the experiments, see section 4. The table clearly shows the marginal contribution of each novelty of the proposed method.

Baseline                   Loss                             Metric  MED13  MED14
Visual-only embedding      MSE, eq. (6)                     no      11.90  10.76
Visual-only embedding      contrastive, eq. (7)             yes     13.29  12.31
Separate embedding         logistic, eq. (4)                no      15.60  13.49
Non-metric embedding       MSE, eq. (8)                     no      15.92  14.36
Unified embedding (ours)   contrastive + logistic, eq. (5)  yes     17.86  16.67
Table 1: Comparison between the unified embedding and the other baselines. The unified embedding model achieves the best results on the MED-13 and MED-14 datasets.

Comparing against related work. We report the performance of our method, the unified embedding model on TRECVID MED-13 and MED-14 datasets. When compared with the related works, our method improves over the state-of-the-art by a considerable margin, as shown in table 2 and figure 8.

Method MED13 MED14
TagBook [18] TMM '15 12.90 05.90
Discovery [7] IJCAI '15 09.60
Composition [8] AAAI '16 12.64 13.37
Classifiers [9] CVPR '16 13.46 14.32
VideoStory [17] PAMI '16 15.90 05.20
VideoStory [17] PAMI '16 (with motion features and expert query) 20.00 08.00
This Paper (unified embedding) 17.86 16.67
Table 2: Performance comparison between our model and related works. We report the mean average precision (mAP%) for MED-13 and MED-14 datasets.
(a) MED-13 Dataset
(b) MED-14 Dataset
Figure 8: Event detection accuracies: per-event average precision (AP%) and per-dataset mean average precision (mAP%) for the MED-13 and MED-14 datasets. We compare our results against TagBook [18], Discovery [7], Composition [8], Classifiers [9] and VideoStory [17].

It is important to point out that the first VideoStory entry in table 2 uses only object feature representations, so it is comparable to our method. The second entry, however, additionally uses motion feature representations and an expert text query (i.e. the term-importance matrix in [17]). To rule out the effect of using different datasets and features, we retrain VideoStory and report the results in table 3. Clearly, better CNN features and more video exemplars in the training set can improve the model's accuracy, but our method still improves over VideoStory when trained on the same dataset and using the same features. Other works (Classifiers [9], Composition [8]) use both image and action concept classifiers. Nonetheless, our method improves over them using only object-centric CNN feature representations.

Method Training Set CNN Feat. MED14
VideoStory VideoStory46k [17] GoogLeNet 08.00
VideoStory FCVID [53] GoogLeNet 11.84
VideoStory EventNet [14] GoogLeNet 14.52
VideoStory EventNet [14] ResNet 15.80
This Paper EventNet [14] ResNet 16.67
Table 3: Our method improves over VideoStory when trained on the same dataset and using the same feature representation.

6 Conclusion

In this paper, we presented a novel approach for detecting events in unconstrained web videos in a zero-exemplar fashion. Rather than learning separate embeddings from cross-modal datasets, we proposed a unified embedding into which several modalities are jointly projected. This enables end-to-end learning. On top of this, we exploited the fact that zero-exemplar detection is posed as a retrieval task and proposed to learn a metric space. This enables measuring the similarities between the embedded modalities in this very space.

We experimented with these novelties and demonstrated how they contribute to improving the performance. We complemented this with improvements over the state-of-the-art by a considerable margin on the MED-13 and MED-14 datasets.

However, the question remains: how can we discriminate between the two MED events "34: fixing musical instrument" and "40: tuning musical instrument"? We argue that temporal modeling of human actions in videos is an absolute necessity to achieve such fine-grained event recognition. In future research, we would like to focus on human-object interaction in videos and how to model it temporally.

Acknowledgment

We thank Dennis Koelma, Masoud Mazloom and Cees Snoek ({kolema,m.mazloom,cgmsnoek}@uva.nl) for lending their insights and technical support for this work.

References

  • [1] Paul Over, George Awad, Jon Fiscus, Greg Sanders, and Barbara Shaw. Trecvid 2013–an introduction to the goals, tasks, data, evaluation mechanisms, and metrics. In TRECVID Workshop, 2013.
  • [2] Paul Over, Jon Fiscus, Greg Sanders, David Joy, Martial Michel, George Awad, Alan Smeaton, Wessel Kraaij, and Georges Quénot. Trecvid 2014–an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID Workshop, 2014.
  • [3] Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura, and Alexander G Hauptmann. Bridging the ultimate semantic gap: A semantic search engine for internet videos. In ICMR, 2015.
  • [4] Amirhossein Habibian, Thomas Mensink, and Cees GM Snoek. Composite concept discovery for zero-shot video event detection. In ICMR, 2014.
  • [5] Amirhossein Habibian, Thomas Mensink, and Cees GM Snoek. Discovering semantic vocabularies for cross-media retrieval. In ICMR, 2015.
  • [6] Masoud Mazloom, Efstrastios Gavves, and Cees G. M. Snoek. Conceptlets: Selective semantics for classifying video events. In IEEE TMM, 2014.
  • [7] Xiaojun Chang, Yi Yang, Alexander G Hauptmann, Eric P Xing, and Yao-Liang Yu. Semantic concept discovery for large-scale zero-shot event detection. In IJCAI, 2015.
  • [8] Xiaojun Chang, Yi Yang, Guodong Long, Chengqi Zhang, and Alexander G Hauptmann. Dynamic concept composition for zero-example event detection. In arXiv, 2016.
  • [9] Xiaojun Chang, Yao-Liang Yu, Yi Yang, and Eric P Xing. They are not equally reliable: Semantic event search using differentiated concept classifiers. In IEEE CVPR, 2016.
  • [10] Yi-Jie Lu. Zero-example multimedia event detection and recounting with unsupervised evidence localization. In ACM MM, 2016.
  • [11] Lu Jiang, Shoou-I Yu, Deyu Meng, Yi Yang, Teruko Mitamura, and Alexander G Hauptmann. Fast and accurate content-based semantic search in 100m internet videos. In ACM MM, 2015.
  • [12] Thomas Mensink, Efstratios Gavves, and Cees Snoek. Costa: Co-occurrence statistics for zero-shot classification. In IEEE CVPR, 2014.
  • [13] E. Gavves, T. E. J. Mensink, T. Tommasi, C. G. M. Snoek, and T. Tuytelaars. Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In IEEE ICCV, 2015.
  • [14] Guangnan Ye, Yitong Li, Hongliang Xu, Dong Liu, and Shih-Fu Chang. Eventnet: A large scale structured concept library for complex event detection in video. In ACM MM, 2015.
  • [15] Wikihow. http://wikihow.com.
  • [16] Amirhossein Habibian, Thomas Mensink, and Cees GM Snoek. Videostory: A new multimedia embedding for few-example recognition and translation of events. In ACM MM, 2014.
  • [17] Amirhossein Habibian, Thomas Mensink, and Cees GM Snoek. Videostory embeddings recognize events when examples are scarce. In IEEE TPAMI, 2016.
  • [18] Masoud Mazloom, Xirong Li, and Cees Snoek. Tagbook: A semantic video representation without supervision for event detection. In IEEE TMM, 2015.
  • [19] Shuang Wu, Sravanthi Bondugula, Florian Luisier, Xiaodan Zhuang, and Pradeep Natarajan. Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In IEEE CVPR, 2014.
  • [20] Mohamed Elhoseiny, Jingen Liu, Hui Cheng, Harpreet Sawhney, and Ahmed Elgammal. Zero-shot event detection by multimodal distributional semantic embedding of videos. In arXiv, 2015.
  • [21] Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. In arXiv, 2013.
  • [22] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • [23] Lu Jiang, Deyu Meng, Teruko Mitamura, and Alexander G Hauptmann. Easy samples first: Self-paced reranking for zero-example multimedia search. In ACM MM, 2014.
  • [24] Lu Jiang, Teruko Mitamura, Shoou-I Yu, and Alexander G Hauptmann. Zero-example event search using multimodal pseudo relevance feedback. In ICMR, 2014.
  • [25] Arnav Agharwal, Rama Kovvuri, Ram Nevatia, and Cees GM Snoek. Tag-based video retrieval by embedding semantic content in a continuous word space. In IEEE WACV, 2016.
  • [26] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. C3d: generic features for video analysis. In arXiv, 2014.
  • [27] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [28] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In IEEE CVPR, 2011.
  • [29] J Uijlings, IC Duta, Enver Sangineto, and Nicu Sebe. Video classification with densely extracted hog/hof/mbh features: an evaluation of the accuracy/computational efficiency trade-off. In IJMIR, 2015.
  • [30] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. In IEEE TPAMI, 2016.
  • [31] Lindasalwa Muda, Mumtaj Begam, and I Elamvazuthi. Voice recognition algorithms using mel frequency cepstral coefficient (mfcc) and dynamic time warping (dtw) techniques. In arXiv, 2010.
  • [32] Anurag Kumar and Bhiksha Raj. Audio event detection using weakly labeled data. In arXiv, 2016.
  • [33] Liping Jing, Bo Liu, Jaeyoung Choi, Adam Janin, Julia Bernd, Michael W Mahoney, and Gerald Friedland. A discriminative and compact audio representation for event detection. In ACM MM, 2016.
  • [34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
  • [35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In arXiv, 2015.
  • [36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In arXiv, 2014.
  • [38] Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, and Tao Mei. You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In IEEE CVPR, 2016.
  • [39] Karen Simonyan, Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Fisher vector faces in the wild. In BMVC, 2013.
  • [40] Relja Arandjelovic and Andrew Zisserman. All about vlad. In IEEE CVPR, 2013.
  • [41] Pascal Mettes, Jan C van Gemert, Spencer Cappallo, Thomas Mensink, and Cees GM Snoek. Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting. In ICMR, 2015.
  • [42] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • [43] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
  • [44] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. In JMLR, 2003.
  • [45] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. In JASIS, 1990.
  • [46] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE CVPR, 2005.
  • [47] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE CVPR, 2015.
  • [48] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, and Aude Oliva. Places: An image database for deep scene understanding. In arXiv, 2016.
  • [49] Wikipedia, 2016. http://wikipedia.com.
  • [50] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In NIPS, 2015.
  • [51] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. In JMLR, 2008.
  • [52] Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, and Alexander C Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ICMR, 2011.
  • [53] Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, and Shih-Fu Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. In IEEE TPAMI, 2017.