Probabilistic Semantic Retrieval for Surveillance Videos with Activity Graphs

12/17/2017 ∙ by Yuting Chen, et al. ∙ Boston University ∙ Amazon ∙ Systems & Technology Research

We present a novel framework for finding complex activities matching user-described queries in cluttered surveillance videos. The wide diversity of queries coupled with unavailability of annotated activity data limits our ability to train activity models. To bridge the semantic gap we propose to let users describe an activity as a semantic graph with object attributes and inter-object relationships associated with nodes and edges, respectively. We learn node/edge-level visual predictors during training and, at test-time, propose to retrieve activity by identifying likely locations that match the semantic graph. We formulate a novel CRF based probabilistic activity localization objective that accounts for mis-detections, mis-classifications and track-losses, and outputs a likelihood score for a candidate grounded location of the query in the video. We seek groundings that maximize overall precision and recall. To handle the combinatorial search over all high-probability groundings, we propose a highest precision subtree algorithm. Our method outperforms existing retrieval methods on benchmarked datasets.




I Introduction

Many routine tasks such as activity detection, anomaly detection and user-defined activity recognition & retrieval in surveillance videos currently require significant human attention. The goal of this paper is to develop exploratory search tools for rapid analysis by human operators.

We focus on retrieval of activity that matches analyst or user described semantic activity (ADSA) from surveillance videos. Surveillance videos pose two unique issues: (a) wide query diversity; (b) the presence of many unrelated, co-occurring activities that share common components.

(a) U-Turn
(b) False Alarm
Fig. 1: An example of the need for relationships between components of an action, in this case retrieving a u-turn in wide-area motion imagery (WAMI) data. Even for this simple, single-object activity, relationships between detections are important to define the activity. Ignoring relationships between detections in (b), notably that the perceived components of the “u-turn” are due to two different vehicles, yields a false alarm.

The wide diversity of ADSAs limits our ability to collect sufficient training data for different activities and learn activity models for a complete list of ADSAs. Methods that can transfer knowledge from detailed activity descriptions to the visual domain are required. As noted in [1], while it would be desirable to learn to map textual descriptions to a semantic graph, this by itself is an active area of research. To handle query diversity, we focus on a novel intermediate approach, wherein a user represents an activity as a semantic graph (see Fig. 2) with object attributes and inter-object semantic relationships associated with nodes and edges respectively. We propose to bridge the relationship semantic gap by learning relationship concepts with annotated data. At the object/node-level, we utilize existing state-of-the-art methods to train detectors, classifiers and trackers to obtain detected outputs, class labels, track data and other low-level outputs. This approach is practical because, in surveillance, the vocabulary of low-level components of a query is typically limited and can be assumed to be known in advance.

Our next challenge is to identify candidate groundings. By a grounding [1], we mean a mapping from archive video spatio-temporal locations to query nodes (see also Sec. II). Finding groundings of a query in a video is a combinatorial problem that requires searching over different candidate patterns that match the query. The difficulty arises from many unrelated co-occurring activities that share node and edge attributes. Additionally, the outputs of low-level detectors, classifiers and trackers are inevitably error-prone, leading to mis-detections, mis-classifications, and loss of tracks. Uncertainties can also arise due to the semantic gap. Consequently, efficient methods that match the activity graph with high confidence in the face of uncertainty are required.

This paper extends our preliminary work on activity retrieval [2] with a novel probabilistic framework to score the likelihood of groundings and explicitly account for visual-domain errors and uncertainties. [2] proposes to identify likely candidate groundings as a ranked subgraph matching problem. By leveraging the fact that the attributes and relationships in the query have different levels of discriminability, a novel maximally discriminative spanning tree (MDST) is generated as a relaxation of the actual activity graph to quickly minimize the number of possible matches to the query while guaranteeing the desired recall rate. In [2], the activity graph that describes semantic activity requires a fixed and manual description of node attributes and edge relationships, which relies heavily on domain knowledge and is prone to noise from lower-level pre-processing algorithms. In this paper, we propose a probabilistic framework based on a CRF model of semantic activity that combines the activity graph with the confidence/margin outputs of our learned component-level classifiers, and outputs a likelihood score for each candidate grounding. Similarly, we pose the combinatorial problem of identifying likely candidate groundings as a constrained optimization problem of maximizing precision at a desired recall rate. To solve this problem we propose a successive refinement scheme that recursively attempts to find candidate matches at different levels of confidence. For a given level of confidence, we show that a two-step approach based on first finding subtrees of the activity graph that are guaranteed to have high precision, followed by a tree-based dynamic programming recursion to find the matches, leads to efficient solutions. Our method outperforms bag of objects/attributes approaches [3], demonstrating that objects/attributes are weak signatures for activity in surveillance videos, unlike other cases [4, 5, 6, 7]. We also compare against approaches [2] based on manually encoding node/edge level relationships to bridge the visual domain gap and demonstrate that our semantic learning combined with probabilistic matching outperforms such methods.

Fig. 2: Overview of Proposed Probabilistic Semantic Retrieval Approach (see Sec. I and Sec. V)

I-A Related Work

Many methods have been proposed for video retrieval.

Classification Methods: Many video retrieval methods [8, 9, 10, 3, 11, 12, 13] at run-time take a video snippet (temporal video segment) as input and output a score based on how well it matches the desired activity. During training, activity classifiers for video snippets are learnt [8, 9, 10, 3] using fully labeled training data. In this context several recent works have proposed deep neural network approaches for learning representations for actions and events [4, 5, 6, 7]. These works leverage the fact that in some applications object/attributes provide good visual signatures for characterizing activity. In contrast to these methods we do not utilize activity-level training data. Furthermore, while these methods are suited for situations where an activity manifests as a dominant signature in the video snippet, they are ill-suited for situations where the activity signature is weak, namely, the activity occurs among many other unrelated co-occurring activities, which is the typical scenario in surveillance problems.

Zero-Shot Methods: More recently, zero-shot methods have been applied to several visual tasks such as event detection [14, 15, 16], action recognition [17], image tagging [18], action localization [19], and image classification [20]. These methods share the same advantage with our work in that activity level training data associated with the desired activity is not required. Nevertheless, zero-shot methods are trained based on source domain descriptions for a subset of activities that allow for forging links between activity components, which can then be leveraged for classification of unseen activity at test-time. Furthermore, the current set of approaches are only suitable in scenarios where the activity has strong visual signatures in low-clutter environments.

Activity Graphs: It is worth pointing out that several works [3, 11, 12, 13] have developed structured activity representations, but they use fully annotated data as mentioned earlier. Lin et al. [3] describe a bipartite object/attribute matching method. Shu et al. [11] describe AND-OR-Graphs based on aggregating sub-events for test-time activity recognition. Similar to classification based methods, these approaches only work well when the desired activity is dominant over a video snippet.

The proposed method is closely related to our preliminary work [2]. There, activities are manually represented as graph queries. Ground-truth data is utilized to reduce video to a large annotated graph. A ranked subgraph matching algorithm is used to find matches in the video archive graph. In this way the object-level semantic gap was avoided. The relationship semantic gap is handled manually (for instance, nearness, proximity, etc. are entered manually in terms of pixel distances). This is somewhat cumbersome because relationships are often context dependent (see Footnote 3). It is primarily a deterministic subgraph matching solution that does not handle visual distortions such as mis-detections and tracker failures well. In contrast, we formulate a probabilistic activity graph that explicitly accounts for visual distortions, bridges the semantic gap through learning low-level concepts, and proposes an efficient probabilistic scoring scheme based on CRFs.

CRF Models for Retrieval: Our proposed CRF framework closely resembles CRF models that are employed for semantic image retrieval in Johnson et al. [1, 21]. They propose scene graphs to represent objects and relationships between them, and train a CRF model using fully annotated training data. These CRF models on fully trained data thus can also incorporate knowledge of typical global scenes and context in addition to low-level node/edge predictions. In contrast, our premise is that, in the video problem, we do not have adequate training data across all desired activities. In addition, unlike images, missed detections and track losses have substantial impact in video retrieval. Finally, the spatio-temporal scale and size of the surveillance videos, and the presence of unrelated co-occurring activities, lead to a probabilistic and combinatorial search problem.

II Activity Models

The goal of semantic activity retrieval is to spatio-temporally ground semantically described activities in large videos. As no examples of activities are provided, a semantic framework is necessary to represent the search activity. To capture activities involving multiple objects over potentially large temporal scale, we need a flexible framework capable of representing both the objects involved in the activity as well as relationships between these objects. To capture both the components of the activity as well as their relationships, we use an activity graph to define a query.

An activity graph is an abstract model for representing a user-described activity that captures object entities, their attributes, and spatio-temporal relationships between objects. An activity graph provides a detailed description of video activity because it admits diverse sets of objects, attributes, and relationships. Graphs represent a natural approach to representing activities involving interaction between multiple objects. For example, consider the following activity:

Two men are meeting so one can give the other a backpack. They will meet and talk first, then they will go to a vehicle and drive away. One man is wearing a red shirt, the other is wearing a green shirt, and their vehicle is a blue sedan.

The above description can be represented as a composition of atomic elements, element descriptions, relationships between elements, and relationship descriptions. For example, the activity can be described by 4 atomic elements with specific descriptions, a person wearing red (P1), a person wearing green (P2), an object representing a backpack (O), and a blue car (V). Using these elements, the activity can be described by the interactions between these elements: initially, P1 and O are near each other, then P1, P2, and O are near each other. The three objects P1, P2, and O move near V, then O, P1, and P2 enter V, and finally, V moves.

Formally, an activity graph is composed of nodes, each representing a realization of an object at an instance of time, and edges, representing relationships between nodes.

We adapt the notation used in scene recognition [1] and assume we are given a set of object classes $\mathcal{C}$, a set of attributes $\mathcal{A}$ associated with each object, and a set of relationships $\mathcal{R}$ between objects.

An activity graph is defined as the tuple $G = (O, E)$. $O = \{o_1, \dots, o_n\}$ denotes the nodes in the graph, with each node characterized by its class and attributes, $o_i = (c_i, A_i) \in \mathcal{C} \times 2^{\mathcal{A}}$. Similarly, $E \subseteq O \times \mathcal{R} \times O$ denotes the set of edges between nodes of the graph, with each edge $(o_i, r, o_j)$ characterized by its associated relationships $r \in R_{ij}$, where $R_{ij} \subseteq \mathcal{R}$ represents the set of relationships between objects $o_i$ and $o_j$.

Differing from image retrieval, edges in an activity graph represent not only spatial displacement, but additionally temporal displacement as well as identity information, to capture concepts such as “the same person is near the vehicle later.” Similarly, attributes associated with nodes also include time-dependent attributes such as velocity.
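As a minimal sketch of how such a query might be held in memory, the block below encodes a simplified version of the backpack example as nodes (class plus attribute set) and labeled edges. The class names `Node` and `ActivityGraph`, the node identifiers, and the particular choice of edges are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str                        # hypothetical identifier, e.g. "P1"
    cls: str                         # object class c
    attrs: frozenset = frozenset()   # attribute set A


@dataclass
class ActivityGraph:
    nodes: dict = field(default_factory=dict)   # name -> Node
    edges: list = field(default_factory=list)   # (name_i, relationship, name_j)

    def add_node(self, name, cls, attrs=()):
        self.nodes[name] = Node(name, cls, frozenset(attrs))

    def add_edge(self, src, rel, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, rel, dst))


def backpack_query():
    """Simplified encoding of the backpack hand-off example."""
    g = ActivityGraph()
    g.add_node("P1", "person", {"red"})
    g.add_node("P2", "person", {"green"})
    g.add_node("O", "object", {"backpack"})
    g.add_node("V", "vehicle", {"blue"})
    g.add_node("V_later", "vehicle", {"blue", "moving"})
    g.add_edge("P1", "near", "O")
    g.add_edge("P1", "near", "P2")
    g.add_edge("P2", "near", "V")
    # Temporal and identity edges: the same vehicle, observed later, moving.
    g.add_edge("V", "same entity", "V_later")
    g.add_edge("V", "later", "V_later")
    return g


query = backpack_query()
```

Representing the same physical vehicle as two nodes joined by "same entity" and "later" edges is one way to express time-dependent attributes within a static graph.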

In searching for activities, we seek to ground the activity graph to a video segment, that is, to associate each node and edge in our activity graph with parts of the video denoted by spatio-temporal bounding boxes $B$. For the nodes of an activity graph, $O$, and a set of bounding boxes $B$, a grounding $\gamma: O \to B$ is a mapping between nodes and bounding boxes. Note that mapping nodes to bounding boxes is sufficient to map the graph to the video segment, as the edges are implicitly mapped by $\gamma$. For a grounding $\gamma$, we denote the bounding box that element $o$ is mapped to by $\gamma$ as $\gamma_o$.

In this framework, the problem of semantic activity retrieval is equivalent to choosing a grounding for the activity graph. In Section III, we formulate an approach to efficiently ground an activity graph in a large archive video.

Representing text as an activity graph requires mapping of nouns to objects and understanding of relationships, activities, and interaction between elements. This work is out of the scope of this paper, and to better demonstrate the efficacy of our approach to retrieval, we focus solely on the problem of spatio-temporally locating activities in videos given a human-generated activity graph. In practice, these activity graphs are composed of components that are semantically interpretable to humans.

III Activity Retrieval by Graph Grounding

Our goal is to find an activity in a large archive video. To this end, we seek to find a grounding of an activity graph, representing the query activity, in the archive video. To solve this problem, we must address two main sub-problems: how to evaluate the grounding between an activity graph and a set of object bounding boxes (generated from object proposal approaches like [22]), and how to search over a large archive of bounding boxes in order to find the highest scoring grounding. We first present an approach to evaluate a grounding between an activity graph and bounding boxes, then present an approach to efficiently reason over a large archive video to find the optimal groundings.

III-A Evaluating Activity Graph Grounding

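To evaluate the grounding between an activity graph and a set of bounding boxes, we consider a maximum a posteriori inference scheme. For a graph $G = (O, E)$, set of bounding boxes $B$, and grounding $\gamma$, we consider the maximum a posteriori probability, that is $P(\gamma \mid G, B)$. We consider a conditional random field (CRF) model [23],

$$P(\gamma \mid G, B) = \prod_{o \in O} P(\gamma_o \mid o) \prod_{(o, r, o') \in E} P(\gamma_o, \gamma_{o'} \mid r). \tag{1}$$

Given that we are in a zero-shot setting, we consider uniform distributions over bounding boxes, $P(\gamma_o)$, and activity graph nodes, $P(o)$. From Bayes’ rule, the conditional probability can be expressed

$$P(\gamma_o \mid o) = \frac{P(o \mid \gamma_o)\, P(\gamma_o)}{P(o)}.$$

Our goal is to find the maximum a posteriori grounding,

$$\gamma^{*} = \arg\max_{\gamma} \prod_{o \in O} P(o \mid \gamma_o) \prod_{(o, r, o') \in E} P(r \mid \gamma_o, \gamma_{o'}). \tag{2}$$

Note that due to the uniform distribution assumptions on $P(\gamma_o)$ and $P(o)$, these terms are constant and are ignored in finding the maximum a posteriori grounding.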
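To make the scoring rule concrete, the following sketch evaluates a candidate grounding as a sum of log node probabilities and log edge probabilities, which is the logarithm of a product-form CRF score with uniform priors dropped. The function name `log_score`, the lookup tables, and all probability values are hypothetical and serve only as an illustration.

```python
import math


def log_score(nodes, edges, grounding, node_prob, edge_prob):
    """Log of the product of node and edge probabilities for one grounding."""
    total = 0.0
    for o in nodes:
        p = node_prob(o, grounding[o])
        if p <= 0.0:
            return float("-inf")     # an impossible node match rules the grounding out
        total += math.log(p)
    for o, r, o2 in edges:
        p = edge_prob(r, grounding[o], grounding[o2])
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


# Toy query: a person near a car. All probabilities below are made up.
nodes = ["person", "car"]
edges = [("person", "near", "car")]
node_table = {("person", "b1"): 0.9, ("person", "b2"): 0.2, ("car", "b3"): 0.8}
edge_table = {("near", "b1", "b3"): 0.7, ("near", "b2", "b3"): 0.9}


def node_prob(o, box):
    return node_table.get((o, box), 0.0)


def edge_prob(r, box1, box2):
    return edge_table.get((r, box1, box2), 0.0)


candidates = [{"person": "b1", "car": "b3"}, {"person": "b2", "car": "b3"}]
best = max(candidates, key=lambda g: log_score(nodes, edges, g, node_prob, edge_prob))
best_score = log_score(nodes, edges, best, node_prob, edge_prob)
```

Working in log space avoids numerical underflow when a query has many nodes and edges, and it does not change which grounding attains the maximum.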

III-A1 Learning Node & Edge Level Probability Models

To evaluate the maximum a posteriori probability of a grounding, the distributions $P(o \mid \gamma_o)$ and $P(r \mid \gamma_o, \gamma_{o'})$ need to be estimated. The distribution $P(o \mid \gamma_o)$ represents the probability that the bounding box specified by $\gamma_o$ has the class, $c$, and attributes, $A$, associated with node $o = (c, A)$. We assume that the probabilities of class and attributes are independent, and therefore we can model this as a product of distributions:

$$P(o \mid \gamma_o) = P(c \mid \gamma_o) \prod_{a \in A} P(a \mid \gamma_o).$$

Estimating each of these probabilities is accomplished by learning an object detector or attribute classifier, with the output margin mapped to a probability using a logistic model,

$$P(c \mid \gamma_o) = \frac{1}{1 + \exp\left(\alpha_c f_c(\gamma_o) + \beta_c\right)},$$

where $f_c(\gamma_o)$ is the output margin of the detector for class $c$, and $\alpha_c, \beta_c$ are two scalar parameters that can be set heuristically or learned with Platt scaling.
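The margin-to-probability mapping can be sketched as follows, assuming the standard Platt form $1/(1+\exp(\alpha f + \beta))$ with a negative $\alpha$ so that larger margins give larger probabilities. The function names and the default parameter values are illustrative assumptions, not values from the paper.

```python
import math


def margin_to_probability(margin, alpha=-1.0, beta=0.0):
    """Logistic map 1 / (1 + exp(alpha * margin + beta)), evaluated stably."""
    z = alpha * margin + beta
    if z >= 0:
        e = math.exp(-z)             # avoids overflow for large positive z
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def node_probability(class_margin, attribute_margins, alpha=-1.0, beta=0.0):
    """Product of class and attribute probabilities under the independence assumption."""
    p = margin_to_probability(class_margin, alpha, beta)
    for m in attribute_margins:
        p *= margin_to_probability(m, alpha, beta)
    return p


p_confident = margin_to_probability(2.0)
p_doubtful = margin_to_probability(-2.0)
p_node = node_probability(2.0, [0.0])   # class margin 2.0, one attribute at margin 0
```

In practice $\alpha$ and $\beta$ would be fit per detector on held-out data; the defaults here simply give a symmetric sigmoid.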


Similarly, we learn semantic relationship classifiers on features from detected object pairs, and estimate the edge distribution $P(r \mid \gamma_o, \gamma_{o'})$ as in the case of object probabilities. Our perspective is that the vocabulary typically used to describe complex activity by analysts is known a priori (“carry, together, with, near”). In [2], relationships were manually annotated, which is convenient for the subsequent matching stage since everything is deterministic. However, this limits the robustness of the method across different scenarios. For example, consider the relationship near between two objects. The manual approach is to set a pixel-distance threshold for identifying the “near” property for two objects. However, the semantic meaning of near is strongly dependent on the context of the two objects. Near in the context of moving vehicles is different from near for stationary vehicles or for two persons.

The details of the node and edge probability models we used are described in Sec. V.

III-B Efficient Grounding in Large Graphs

In the previous section, we presented an approach to estimating the conditional probability of a grounding for a given activity graph. Although estimating the probability of a specific grounding can be efficiently achieved, a combinatorially large number of possible groundings exist between an activity graph and collection of bounding boxes. Furthermore, due to long surveillance videos, the collection of extracted bounding boxes is generally large.

In order to efficiently find the maximum a posteriori grounding of an activity graph in a video, we instead consider the following optimization problem:

$$\max_{\gamma} \prod_{o \in O} \mathbb{1}\left[P(o \mid \gamma_o) \ge \tau_o\right] \prod_{(o, r, o') \in E} \mathbb{1}\left[P(r \mid \gamma_o, \gamma_{o'}) \ge \tau_r\right]. \tag{4}$$

Note that for the proper setting of thresholds $\tau_o$ and $\tau_r$, the solution of (4) is equivalent to the solution of (2). In the case where the parameters are set below this optimal set of parameters, the solution is non-unique, with a set of possible groundings returned, one of which is the solution to (2). By scoring the groundings that maximize (4) according to the objective of (2), we are able to find the optimal grounding from this subset.

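Our goal is to find a set of thresholds that maximize precision subject to a recall constraint. For a grounding $\gamma$, we define $F_\tau(\gamma)$ as the value of the objective of (4), that is

$$F_\tau(\gamma) = \prod_{o \in O} \mathbb{1}\left[P(o \mid \gamma_o) \ge \tau_o\right] \prod_{(o, r, o') \in E} \mathbb{1}\left[P(r \mid \gamma_o, \gamma_{o'}) \ge \tau_r\right].$$

Let $y \in \{0, 1\}$ denote whether or not a grounding corresponds to the desired activity. For a set of thresholds $\tau$, the precision can be expressed as the probability of a grounding corresponding to the desired activity having an objective value, $F_\tau(\gamma)$, equal to $1$, divided by the probability of any grounding having an objective value equal to $1$, that is:

$$\mathrm{Prec}(\tau) = \frac{P\left(F_\tau(\gamma) = 1,\ y = 1\right)}{P\left(F_\tau(\gamma) = 1\right)}.$$

We assume the probability of a grounding corresponding to the desired activity is significantly smaller than the probability of a grounding not corresponding to the desired activity, allowing for the approximation:

$$\mathrm{Prec}(\tau) \approx \frac{P\left(F_\tau(\gamma) = 1 \mid y = 1\right)}{P\left(F_\tau(\gamma) = 1 \mid y = 0\right)} \cdot \frac{P(y = 1)}{P(y = 0)}.$$

Similarly, we can express the recall rate as

$$\mathrm{Rec}(\tau) = P\left(F_\tau(\gamma) = 1 \mid y = 1\right).$$

We therefore seek to maximize the approximate precision subject to the recall rate being greater than some value $\eta$:

$$\max_{\tau} \ \frac{P\left(F_\tau(\gamma) = 1 \mid y = 1\right)}{P\left(F_\tau(\gamma) = 1 \mid y = 0\right)} \cdot \frac{P(y = 1)}{P(y = 0)} \quad \text{s.t.} \quad P\left(F_\tau(\gamma) = 1 \mid y = 1\right) \ge \eta.$$

Note that the ratio $P(y = 1)/P(y = 0)$ is an unknown quantity dependent on the archive video; however, as this quantity is a constant, its value does not affect the optimization. Assuming independence of the nodes and edges of the activity graph (and their attributes), the remaining conditional probabilities can be estimated by evaluating detector performance, with thresholds chosen given detector performance.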
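The threshold choice can be illustrated for a single detector: among thresholds that keep recall on validation positives above a target, pick the one with the largest ratio of true-positive rate to false-positive rate. This is a simplified, single-component sketch of the constrained problem above; the function name `pick_threshold`, the smoothing floor on the false-positive rate, and the scores are assumptions made for the example.

```python
def pick_threshold(pos_scores, neg_scores, min_recall):
    """Return (threshold, ratio, recall) maximizing TPR/FPR subject to recall >= min_recall."""
    floor = 1.0 / (len(neg_scores) + 1)      # avoids division by zero when FPR is 0
    best = None
    for t in sorted(set(pos_scores)):
        recall = sum(s >= t for s in pos_scores) / len(pos_scores)
        if recall < min_recall:
            continue                          # violates the recall constraint
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        ratio = recall / max(fpr, floor)
        if best is None or ratio > best[1]:
            best = (t, ratio, recall)
    return best


# Hypothetical validation scores for one relationship classifier.
positives = [0.9, 0.8, 0.7, 0.4]
negatives = [0.6, 0.5, 0.3, 0.2, 0.1]
threshold, ratio, recall = pick_threshold(positives, negatives, min_recall=0.75)
```

Because the unknown prior ratio is a constant factor, it can be left out of the comparison between thresholds, which is why only the two conditional rates appear in the code.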

Solving the optimization problem in (4) has the potential to be significantly more efficient than solving the optimization problem in (2) through the use of branch-and-bound approaches, particularly due to the ability to aggressively bound by eliminating any solution subtrees where one node or edge does not meet the associated threshold. Unfortunately, despite the potential improvement in efficiency, solving this problem is still combinatorially hard and may be computationally infeasible, particularly for a large collection of bounding boxes.

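Rather than directly solving this problem, consider a subgraph of $G$ that we denote as $\hat{G}$, with the nodes and edges of the subgraph denoted $\hat{O}$ and $\hat{E}$. Consider the problem of finding a grounding for this subgraph:

$$\max_{\gamma} \prod_{o \in \hat{O}} \mathbb{1}\left[P(o \mid \gamma_o) \ge \tau_o\right] \prod_{(o, r, o') \in \hat{E}} \mathbb{1}\left[P(r \mid \gamma_o, \gamma_{o'}) \ge \tau_r\right], \tag{5}$$

where $\tau_o$ and $\tau_r$ are the node and edge thresholds of (4).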
For this subgraph matching problem, we make the following observation:

Theorem III.1.

Any grounding of the graph $G$ that maximizes (4) is also a subgraph grounding that maximizes (5).

Thm. III.1 implies that the set of groundings that maximize (5) includes all groundings that also maximize (4). Therefore, the set of groundings that maximize (5) has a recall rate no lower than that of (4), though the precision rate may be decreased.
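The containment stated by the theorem can be checked by brute force on a toy problem: every grounding that passes all threshold tests of the full graph also passes the tests of any subgraph, because the subgraph imposes a subset of the constraints. The boxes, the pass/fail predicates, and the function name below are hypothetical.

```python
from itertools import product


def feasible_groundings(nodes, edges, boxes, node_ok, edge_ok):
    """Enumerate groundings in which every node and edge passes its threshold test."""
    result = set()
    for assignment in product(boxes, repeat=len(nodes)):
        g = dict(zip(nodes, assignment))
        if all(node_ok(o, g[o]) for o in nodes) and all(
            edge_ok(r, g[a], g[b]) for a, r, b in edges
        ):
            result.add(assignment)
    return result


nodes = ["p", "c", "o"]
full_edges = [("p", "near", "c"), ("p", "near", "o"), ("o", "near", "c")]
tree_edges = full_edges[:2]                  # a spanning tree of the query graph
# Each box is (class label, 1-D position); values are invented for the example.
boxes = [("person", 0), ("person", 40), ("car", 3), ("car", 45),
         ("object", 1), ("object", -4)]
wanted = {"p": "person", "c": "car", "o": "object"}


def node_ok(o, box):
    return box[0] == wanted[o]


def edge_ok(r, box1, box2):
    return abs(box1[1] - box2[1]) <= 5       # stand-in for "near probability >= threshold"


full = feasible_groundings(nodes, full_edges, boxes, node_ok, edge_ok)
relaxed = feasible_groundings(nodes, tree_edges, boxes, node_ok, edge_ok)
```

Here the relaxed set contains one extra grounding that the dropped edge would have rejected, which is the loss of precision the text refers to.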

Thm. III.1 leads to an efficient approach to solving (4). Rather than searching for the full graph $G$, we instead consider a subgraph $\hat{G}$ that can be efficiently searched for. From Thm. III.1, all subgraphs of $G$ retain the recall rate of (4); however, the choice of subgraph directly impacts the precision rate of the set of groundings that maximize (5). We therefore propose selecting a Highest Precision Subgraph (HPS), defined as the subgraph of $G$ with a minimal expected number of groundings that maximize (5).

In particular, we attempt to find an HPS from the set of spanning trees of $G$, as tree search can be efficiently solved using dynamic programming [25]. From our model in (1), we assume that each edge is distributed independently. Therefore, we can find the HPS from the set of spanning trees of $G$ by finding a spanning tree over the graph that minimizes the likelihood that an edge probability is above the associated threshold,

$$\min_{\hat{E}} \prod_{(o, r, o') \in \hat{E}} \Pr_{(b, b') \sim \mathcal{D}}\left[P(r \mid b, b') \ge \tau_r\right], \tag{6}$$

where $\hat{E}$ is restricted to be the set of edges that yield a valid spanning tree over $G$ and $\mathcal{D}$ is the distribution over relationships in the video. In practice, $\mathcal{D}$ can be efficiently approximated by randomly sampling bounding box pairs and estimating their distribution. Solving the optimization in (6) can be done efficiently, as the problem can be mapped to a minimum spanning tree problem. We explain the details of the algorithm in Sec. IV.
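A minimal sketch of the sampling approximation: draw random pairs of archive boxes and count how often a relationship's probability clears its threshold. The function name, the toy 1-D "near" model, and the sample size are assumptions for illustration only.

```python
import random


def estimate_match_rate(boxes, rel_prob, tau, n_samples=2000, seed=0):
    """Monte Carlo estimate of the chance that a random box pair passes the edge threshold."""
    rng = random.Random(seed)                 # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(n_samples):
        b1, b2 = rng.sample(boxes, 2)         # two distinct boxes
        if rel_prob(b1, b2) >= tau:
            hits += 1
    return hits / n_samples


# Toy archive: 100 boxes described only by a 1-D position.
archive = list(range(100))


def near_prob(x, y):
    return 1.0 if abs(x - y) <= 5 else 0.0    # hypothetical stand-in for a learned classifier


near_rate = estimate_match_rate(archive, near_prob, tau=0.5)
always_rate = estimate_match_rate(archive, lambda x, y: 1.0, tau=0.5)
```

For this toy archive the exact rate is 970/9900, roughly 0.098, so the estimate should fall close to that value; relationships with smaller rates are the more discriminative ones.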

IV Highest Precision Spanning Tree Algorithm

The goal of our algorithm is to efficiently retrieve the optimal grounding for an activity graph $G$. We accomplish this in two steps. First, we calculate the highest precision subgraph for the activity graph $G$. The selected subgraph minimizes (6) among all the spanning trees of $G$, and thus filters out as many infeasible groundings as possible. Then we develop a Highest Precision Subtree Matching (HPSM) approach to find the optimal groundings that maximize (5). Finally, we recover the ranked groundings with respect to the original activity graph $G$.

IV-A Highest Precision Subtree Selection

Given an activity graph $G$, we first reduce the video data to the set of potentially relevant nodes and edges by building a coarse archive graph out of the spatio-temporal bounding boxes $B$. For every node $o \in O$, we retrieve the set of corresponding locations that satisfy the class $c$ and attributes $A$ that characterize it. Similarly, we retrieve the corresponding edges.

Despite the substantial cost savings, the down-sampled coarse graph is still a large graph with a collection of hundreds of thousands of bounding boxes. We therefore select an HPS of the activity graph $G$ so as to minimize the time spent performing an expensive search.

The choice of which spanning tree to select has significant run-time implications. The creation of a spanning tree involves the removal of edges from $G$, and not all edges are created equal. The edges in the HPS should be the set of edges that minimizes the likelihood in (6). To solve the optimization, we first compute a set of weights indicating the discriminative power of the edges, then calculate the spanning tree which minimizes the total edge weight.

IV-A1 Weight Computation

During the archival process, we assign probabilities $p(c)$, $p(a)$, and $p(r)$ to each class, attribute and relationship that we store. These functions denote the probability that a randomly chosen class, attribute or relationship in the archive video is a match to the class $c$, attribute $a$ or relationship $r$. Relationships, in particular, have greater power to be discriminative because the set of potential relationships is defined over pairs of nodes and is therefore much larger than the set of nodes.

The set of edges that minimizes (6) is associated with the most discriminative relationships, so that it yields the fewest possible mappings in the coarse archive graph. In the VIRAT dataset [26], while the “person near car” relationship is normally very discriminative, 80% of the dataset is shot in parking lots, where people are frequently near cars. Objects disappear near cars far less frequently; thus, a tree rooted at the “object disappears” node and connecting through the “near” edge to the “car” node has fewer potential matches than one starting elsewhere.

We compute empirical values for the relationship match probability $p(r)$ during the archival process by computing the percentage of relationships of each type that have appeared in the videos. If relationships have not been seen, they are assumed to be nondiscriminative and assigned the largest (least discriminative) value. If it is later determined that these relationships are discriminative, we can revise our estimate of $p(r)$.

IV-A2 Highest Precision Spanning Tree

From our model in (1), we assume that each edge is distributed independently. Absent additional information indicating the distribution of relationships in the video corpus, we also assume that all relationships are generated independently.

Since relationships and edges in an activity graph $G = (O, E)$ are independent, the total edge weight of the graph is:

$$W(G) = \prod_{(o, r, o') \in E} p(r),$$

where $p(r)$ is the empirical match probability of relationship $r$ in the archive.
As noted for (5), we search using an HPS instead of the original query graph in order to reduce the computational complexity. The HPS that results in the fewest possible groundings is our novel highest precision spanning tree.

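Definition (Highest Precision Spanning Tree (HPST)).

We call a spanning tree $\hat{T}$ an HPST of activity graph $G$ with edge weights $p(r)$, if the subtree satisfies

$$\prod_{(o, r, o') \in \hat{T}} p(r) = \min_{T \in \mathcal{T}(G)} \prod_{(o, r, o') \in T} p(r),$$

where $\mathcal{T}(G)$ denotes the set of all possible spanning trees induced from activity graph $G$.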

This is exactly the definition of a minimum spanning tree: when the total weight is a product of independent edge probabilities, taking logarithms turns it into a sum of edge weights. By minimizing the total edge weight, we achieve a Highest Precision Spanning Tree, which solves the optimization in (6). We use Kruskal’s algorithm to calculate the HPST that should produce the fewest possible matches.
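A minimal sketch of this selection step, using Kruskal's algorithm with a union-find structure over edges sorted by log match probability. The function name and the edge probabilities are hypothetical; only the use of Kruskal's algorithm comes from the text.

```python
import math


def highest_precision_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm on edges (u, relationship, v, p) with 0 < p <= 1.

    Minimizing the sum of log p is equivalent to minimizing the product of p.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]     # path halving
            n = parent[n]
        return n

    tree = []
    for u, rel, v, p in sorted(weighted_edges, key=lambda e: math.log(e[3])):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                  # edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((u, rel, v, p))
    return tree


# Hypothetical match probabilities estimated from an archive.
query_nodes = ["person", "object", "car"]
query_edges = [
    ("person", "near", "car", 0.30),          # common in parking lots
    ("object", "near", "car", 0.02),          # rare, hence discriminative
    ("person", "near", "object", 0.05),
]
hpst = highest_precision_spanning_tree(query_nodes, query_edges)
tree_weight = math.prod(p for _, _, _, p in hpst)
```

The frequent "person near car" edge is the one left out, so the search proceeds along the two rarer relationships.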

IV-B Highest Precision Subtree Matching (HPSM)

Given a coarse archive graph and the HPST, we seek to select the maximum a posteriori grounding from all possible groundings. We solve for the optimal grounding between the HPST and the archive graph in two steps. In the first step, we construct a matching graph of the possible groundings. Then we find the optimal grounding from the matching graph.

IV-B1 Matching Graph Creation

We build a matching graph in which each node is a tuple denoting a proposed assignment between a node in the HPST and a node in the coarse archive graph, and each edge denotes the relationship between the two assignments. All the assignments in the matching graph satisfy the node and edge thresholds described in (5), so that we can rule out impossible mappings.

We create the matching graph by first adding assignments for the root, then adding assignments to its successors that satisfy both the node and edge relationships described in the HPST. We then set the score thresholds for nodes and edges to the minimum acceptable scores and find a set of mappings that maximize (5). The proper setting of thresholds ensures that no feasible grounding is ruled out in the filtering process. This process is described in Algorithm 1, and the expected number of mappings scales as a product of the edge match probabilities and the size of the archive data.

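procedure CreateMatchingGraph(HPST, coarse archive graph, node and edge thresholds)
    Iterate from root to leaves
    for all nodes of the HPST do
        Compute the groundings to this node:
            archive nodes whose node score meets the node threshold
        if the node has a parent in the HPST then
            for all groundings of the parent do
                keep the groundings whose edge score with the parent grounding meets the edge threshold
            end for
        else
            add every grounding of the node as a root assignment
        end if
    end for
end procedure
Algorithm 1 Create Matching Graph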
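The root-to-leaves construction can be sketched in Python as below. This is an illustrative reading of the procedure rather than the authors' code: the function signature, the dictionary-based tree encoding, and the toy archive are all assumptions.

```python
from collections import deque


def create_matching_graph(children, root, archive, node_score, edge_score,
                          tau_node, tau_edge):
    """Return (assignments per query node, links between parent and child assignments)."""

    def candidates(q):
        return [b for b in archive if node_score(q, b) >= tau_node]

    assignments = {root: candidates(root)}
    links = []
    queue = deque([root])
    while queue:                               # breadth-first, root to leaves
        parent = queue.popleft()
        for child in children.get(parent, []):
            kept = []
            for cb in candidates(child):
                partners = [pb for pb in assignments[parent]
                            if edge_score(parent, child, pb, cb) >= tau_edge]
                if partners:                   # keep only assignments reachable from the parent
                    kept.append(cb)
                    links.extend(((parent, pb), (child, cb)) for pb in partners)
            assignments[child] = kept
            queue.append(child)
    return assignments, links


# Toy archive: box id -> (class label, 1-D position). Values are invented.
archive = {1: ("person", 0), 2: ("person", 50), 3: ("car", 2), 4: ("car", 100)}
wanted = {"p": "person", "c": "car"}
children = {"p": ["c"]}                        # query tree: person --near--> car


def node_score(q, b):
    return 1.0 if archive[b][0] == wanted[q] else 0.0


def edge_score(parent, child, pb, cb):
    return 1.0 if abs(archive[pb][1] - archive[cb][1]) <= 5 else 0.0


assignments, links = create_matching_graph(
    children, "p", archive, node_score, edge_score, tau_node=0.5, tau_edge=0.5)
```

Note that person 2 remains among the root assignments even though no car is near it; in this sketch such dead ends are only eliminated during the later leaves-to-root pass.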

IV-B2 Retrieval with HPSM

After traversing from root to leaves to create a matching graph, we then traverse it from leaves to root to determine the optimal solution for each root node assignment. To evaluate the matching score of a grounding, we use the maximum a posteriori probability described in (2), where the score is the product of the node distributions $P(o \mid \gamma_o)$ and the edge distributions $P(r \mid \gamma_o, \gamma_{o'})$. For each leaf node in the HPST, we merge its assignments in the matching graph with their parents, keeping the one which has the best score. We repeat this process until only mappings to root nodes are left, and then sort these root nodes by the score of the best tree which uses them. This process is described in Algorithm 2, and its cost is linear in the size of the matching graph.

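procedure OptimizeGroundings(matching graph)
    Iterate from leaves to root
    for all nodes of the HPST do
        if the node has a parent in the HPST then
            for all parent assignments where an edge to an assignment of this node exists do
                merge into the parent the assignment of this node with the best score
            end for
        end if
    end for
end procedure
Algorithm 2 Solve for Optimal Groundings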
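The leaves-to-root pass is a standard max-product dynamic program on a tree. The sketch below computes, for each root assignment, the score of the best full tree grounding that uses it; the recursive formulation, the function names, and all probabilities are assumptions chosen for illustration.

```python
def best_subtree_scores(q, children, cands, node_p, edge_p):
    """Map each candidate box of query node q to the best score of the subtree rooted at q."""
    child_tables = {c: best_subtree_scores(c, children, cands, node_p, edge_p)
                    for c in children.get(q, [])}
    table = {}
    for b in cands[q]:
        score = node_p(q, b)
        for c, child_table in child_tables.items():
            # Keep only the best compatible child assignment (max-product merge).
            best_child = max(
                (edge_p(q, c, b, cb) * s for cb, s in child_table.items()),
                default=0.0,
            )
            score *= best_child
        if score > 0.0:
            table[b] = score
    return table


# Toy query tree person -> car with made-up probabilities.
children = {"p": ["c"]}
cands = {"p": [1, 2], "c": [3, 4]}
node_table = {("p", 1): 0.9, ("p", 2): 0.6, ("c", 3): 0.8, ("c", 4): 0.5}
edge_table = {(1, 3): 0.7, (1, 4): 0.1, (2, 3): 0.2, (2, 4): 0.9}


def node_p(q, b):
    return node_table.get((q, b), 0.0)


def edge_p(parent, child, pb, cb):
    return edge_table.get((pb, cb), 0.0)


root_scores = best_subtree_scores("p", children, cands, node_p, edge_p)
ranked = sorted(root_scores.items(), key=lambda kv: kv[1], reverse=True)
```

Each parent-child pair is examined once per level, so the work grows with the number of candidate pairs rather than with the number of complete groundings.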

This process yields a set of groundings for each potential activity, generally on the order of the number of true groundings in the data. We then iterate through each grounding and filter out groundings that have poor scores for the edges not present in the HPST. In this way, we attempt to recover the grounding results for the original problem in (2) from the approximated problem (5). This allows us to have the speed of the HPSM approach and the effective quality of the full graph grounding results.

V Implementation

In this section, we present implementation details of our approach. Fig. 2 shows an overview of our system. At a high-level, it operates as follows: as an archive video is recorded, detectors are applied to extract bounding boxes of objects of interest. These bounding boxes are then fused through a tracker and classified, yielding tracklets of objects that are stored in a table along with some simple attributes. During query-time, an analyst provides an ADSA query by an activity graph. From this query, an HPST, in particular a tree, is found according to (6). The set of groundings that maximize (5) are then found in the table. These groundings are scored and returned according to (1).

ADSA Query Vocabulary

We construct a vocabulary that corresponds to nodes and edges in the ADSA activity graph to allow for semantic descriptions of queries. We consider three classes of items: person, object, and vehicle. Each item has a set of attributes that can be included in the query such as size, appearing, disappearing, and speed. Between each of these items, we define the following relationship attributes: same entity, near, not near, and later. The set of items and attributes can be expanded to include additional or more specific classes/descriptors. Due to the limited variety of objects in the datasets, we limit ourselves to simple semantic descriptors to prevent dominance of attributes in returns. By limiting the descriptiveness of attributes in our queries, we demonstrate retrieval capability in the presence of possible confusers. For a query such as “two people loading an object into a pink truck”, a method that leverages primarily the color is not sufficiently general to handle ADSAs that do not include strong attribute descriptions.

Detection and Tracking

We demonstrate the proposed method on two datasets. For the high-quality VIRAT Ground dataset [26], we use Piotr’s Computer Vision Matlab Toolbox [27] to extract detections and then fuse them into tracklets [28]. In the case of the low-resolution WAMI AFRL data [2], we apply algorithms designed specifically for aerial data [29, 30, 31].

Relationship Learning

We learn semantic relationships by training a classifier on annotated positive and negative relationship examples of object pairs. For example, the relationship descriptor “near” between two items is found by training a classifier on features of two objects such as size, aspect ratios, distance between objects, etc. on a set of annotated examples of items that are near and are not near. Linear SVMs [32] are used to learn the relationships and provide the probabilities.
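As an illustration of how a linear score over pairwise features can be turned into a probability, the sketch below builds a few pair features and applies a Platt-style sigmoid [24]. The feature choice, the weights, and the sigmoid parameters `a` and `b` are all hand-set assumptions for the example; in practice they would come from training on annotated pairs, and the authors' exact feature set and calibration may differ.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height) in pixels


def pair_features(box_a: Box, box_b: Box) -> List[float]:
    """Scale-normalized distance, both aspect ratios, and the area ratio."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    dist = math.hypot(xa - xb, ya - yb)
    scale = math.sqrt(wa * ha) + math.sqrt(wb * hb)
    return [dist / scale, wa / ha, wb / hb, (wa * ha) / (wb * hb)]


def linear_score(weights: Sequence[float], bias: float,
                 features: Sequence[float]) -> float:
    """Decision value of a linear classifier."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def platt_probability(score: float, a: float, b: float) -> float:
    """Platt scaling: P(positive | score) = 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))


# Hand-set (not learned) parameters: only the normalized distance matters here.
weights, bias = [-4.0, 0.0, 0.0, 0.0], 2.0
a, b = -1.0, 0.0

person = (100.0, 100.0, 20.0, 40.0)
close_object = (110.0, 100.0, 20.0, 40.0)
far_object = (400.0, 100.0, 20.0, 40.0)

p_close = platt_probability(
    linear_score(weights, bias, pair_features(person, close_object)), a, b)
p_far = platt_probability(
    linear_score(weights, bias, pair_features(person, far_object)), a, b)
```

Normalizing the distance by object size is one simple way to make "near" less sensitive to camera perspective than a fixed pixel threshold.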


Many of our queries require maintaining identity over long periods of time, while tracked data inevitably contains lost tracks. We thus leverage re-identification (re-ID) algorithms by applying a linear classifier over the outer product of features from a pair of tracklets, and we train SVMs to learn this classifier. The classifier is applied universally, independent of context, pose, illumination, etc.
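A linear classifier over the outer product of two feature vectors is equivalent to a bilinear form between them. The sketch below shows this equivalence on toy vectors; the names `w`, `W`, `x`, and `y` are notation introduced here for illustration, and the values are made up rather than learned.

```python
from typing import List, Sequence


def outer_product_features(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Flattened outer product x y^T, in row-major order."""
    return [xi * yj for xi in x for yj in y]


def reid_score(w: Sequence[float], x: Sequence[float], y: Sequence[float],
               bias: float = 0.0) -> float:
    """Linear classifier applied to the outer-product feature vector."""
    phi = outer_product_features(x, y)
    return sum(wk * pk for wk, pk in zip(w, phi)) + bias


def bilinear_form(W: Sequence[Sequence[float]], x: Sequence[float],
                  y: Sequence[float]) -> float:
    """x^T W y, which equals reid_score when w is W flattened row by row."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


# Toy features for two tracklets (for example aspect ratio, size, location).
x = [1.0, 2.0]
y = [3.0, 4.0, 5.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
w = [value for row in W for value in row]

score_linear = reid_score(w, x, y)
score_bilinear = bilinear_form(W, x, y)
```

Both computations give the same score, so the weight vector can be read as a matrix that measures how features of one tracklet interact with features of the other.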

In practice, we have extremely limited training data that are properly annotated for the complex re-ID models [33, 34, 35], so we use elementary target features like bounding box aspect ratios, locations, size, etc.

VI Experiments

We perform semantic video retrieval experiments on the VIRAT Ground 2.0 dataset [26] and the AFRL Benchmark WAMI Data [2]. Given a set of activity graph queries, each algorithm is asked to return a ranked list of groundings in the archive video based on their likelihood scores. Each grounding is then represented by the minimal bounding spatio-temporal volume of the involved bounding boxes. For the VIRAT dataset, where ground truth is provided, standard precision-recall curves are produced by varying the scoring threshold. For the unannotated AFRL data, a human operator evaluates the precision of the top-k returns by watching the corresponding spatio-temporal window of the video. Each return is marked as a true detection if the overlap of the returned spatio-temporal volume and the true spatio-temporal volume is larger than 50% of the union.
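The overlap criterion can be sketched as an intersection-over-union test on axis-aligned spatio-temporal volumes. The tuple layouts and function names below are assumptions made for this example, and time is treated as a continuous axis.

```python
from typing import Iterable, Tuple

Volume = Tuple[float, float, float, float, float, float]  # (x1, y1, t1, x2, y2, t2)
FrameBox = Tuple[float, float, float, float, float]       # (x1, y1, x2, y2, t)


def bounding_volume(boxes: Iterable[FrameBox]) -> Volume:
    """Minimal spatio-temporal volume enclosing all per-frame boxes."""
    boxes = list(boxes)
    return (
        min(b[0] for b in boxes), min(b[1] for b in boxes), min(b[4] for b in boxes),
        max(b[2] for b in boxes), max(b[3] for b in boxes), max(b[4] for b in boxes),
    )


def volume(v: Volume) -> float:
    x1, y1, t1, x2, y2, t2 = v
    return max(0.0, x2 - x1) * max(0.0, y2 - y1) * max(0.0, t2 - t1)


def st_iou(a: Volume, b: Volume) -> float:
    """Intersection over union of two spatio-temporal volumes."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
             min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]))
    i = volume(inter)
    u = volume(a) + volume(b) - i
    return i / u if u > 0 else 0.0


def is_true_detection(returned: Volume, truth: Volume) -> bool:
    """True if the overlap exceeds 50% of the union."""
    return st_iou(returned, truth) > 0.5


truth = (0.0, 0.0, 0.0, 10.0, 10.0, 10.0)
good_return = (0.0, 0.0, 0.0, 10.0, 10.0, 8.0)    # IoU = 800 / 1000
shifted_return = (5.0, 0.0, 0.0, 15.0, 10.0, 10.0)  # IoU = 500 / 1500
```

With these toy volumes the first return counts as a true detection and the second does not.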

As stated in Sec. I-A, most of the related methods [8, 9, 10, 3, 11, 12, 13] are not applicable to our setup as they retrieve relevant videos from a collection of short video snippets. We compare our performance with two approaches, a bag-of-words (BoW) scheme and a Manually Specified Graph Matching (MSGM) scheme. BoW [3, 11] collects objects, object attributes and relationships into a bag and ignores the structural relationships. To identify groundings, a bipartite matching scheme is utilized to find an assignment between the bag of words and a video snippet. We use our trained models for node-level concepts in this context. For the MSGM method [2], we quantify relationships by manually annotating data using bounding boxes for objects and then utilize the subgraph matching of [2] on test data.

VI-A Baseline Performance

We first show the baseline performance of three methods on human-annotated track data of the VIRAT Ground 2.0 dataset [26] with a set of seven queries. The VIRAT dataset is composed of 40 gigabytes of surveillance videos, capturing 11 scenes of moving people and vehicles interacting. Resolution varies, with about 50×100 pixels representing a pedestrian, and around 200×200 pixels for vehicles.

Fig. 3: Retrieval performance. (a) Average ROC curve on VIRAT with human annotated data. (b) Average ROC curve on VIRAT with tracked data. (c) Average precision w.r.t. number of returns on AFRL dataset.

As shown in Table I and Fig. 3(a), the proposed approach outperforms BoW and MSGM. On human annotated tracked data, where we assume no uncertainty at the object level, we can see that both MSGM and the proposed method significantly outperform BoW. The queries all include some level of structural constraints between objects; for example, there is an underlying distance constraint for the people, car and object involved in object deposit. In a cluttered surveillance video where multiple activities occur at the same time, an algorithm that solves for a bipartite matching between people, car and objects while ignoring the global spatial relationships between them can choose unrelated agents from different activities, resulting in low detection accuracy for BoW. This shows that global structural relationships, rather than isolated object-level descriptors, are important. The performance gap between MSGM [2] and our method indicates the importance of semantic concept learning and probabilistic reasoning over manually specified relationships and deterministic matching.

| Query | BoW | MSGM | Proposed |
|---|---|---|---|
| Person dismount | 15.33 | 78.26 | 83.93 |
| Person mount | 21.37 | 70.61 | 83.94 |
| Object deposit | 26.39 | 71.34 | 85.69 |
| Object take-out | 8.00 | 72.70 | 80.07 |
| 2 person deposit | 14.43 | 65.09 | 74.16 |
| 2 person take-out | 19.31 | 80.00 | 90.00 |
| Group Meeting | 25.20 | 82.35 | 88.24 |
| Average | 18.58 | 74.34 | 83.72 |

TABLE I: Area-Under-Curve (AUC) of precision-recall curves on VIRAT dataset with human annotated bounding boxes for Bag-of-Words approach (BoW [3]), Manually Specified Graph Matching (MSGM [2]), and our proposed approach.

| Query | BoW | MSGM | Proposed (Re-ID) | Proposed (RL) | Proposed (Full) |
|---|---|---|---|---|---|
| Person dismount | 6.27 | 22.51 | 21.69 | 25.98 | 30.51 |
| Person mount | 1.38 | 20.98 | 23.12 | 29.41 | 35.98 |
| Object deposit | 7.90 | 46.27 | 47.79 | 47.62 | 49.13 |
| Object take-out | 16.80 | 34.92 | 35.32 | 41.98 | 42.12 |
| 2 person deposit | 3.38 | 46.11 | 49.44 | 50.83 | 50.83 |
| 2 person take-out | 15.27 | 48.03 | 48.03 | 49.28 | 49.28 |
| Group Meeting | 23.53 | 30.80 | 39.51 | 30.80 | 47.64 |
| Average | 10.65 | 35.66 | 37.84 | 39.41 | 43.64 |

TABLE II: Area-Under-Curve (AUC) of precision-recall curves on VIRAT dataset with automatically detected and tracked data for BoW, MSGM, and our proposed approach with only re-ID (Re-ID), with only relationship learning (RL), and the full system (Full) with both re-ID and relationship learning.

VI-B Probabilistic Reasoning with Noisy Input Data

We perform an ablative analysis of our approach with detected and tracked bounding boxes in Table II and Fig. 3(b). To demonstrate the effect of re-ID and relationship learning, we report performance with only re-ID, with only relationship learning, and with both re-ID and relationship learning.

Performance of all three methods degrades on tracked data due to missed detections, misclassifications and track errors. While ours degrades significantly, we still outperform existing methods. (Footnote 1: Significant performance degradation with track data has also been observed in the context of activity classification even when fully annotated data is available [11] for training an a priori known set of activities.) For BoW, performance loss is large for the first six queries, due to reasons explained in Sec. VI-A. Note that with BoW the group meeting query does not suffer significant degradation since it is more node-dominant than other queries (i.e., bipartite matching identifies multiple people present at the same time, which is a strong indicator of a meeting, particularly in the absence of other co-occurring confusers).

Clutter vs. Visual Distortion

On human annotated bounding boxes, we achieve an average AUC of 83.72%. This indicates that our method performs well in cluttered video free of visual distortions. The drop to 43.64% on tracked data is directly due to visual distortions introduced by missed detections, misclassifications and loss of tracks. This suggests that while our method compensates for some of the visual distortions, it is still important to improve detection, classification and tracking techniques.

To visualize the importance of re-ID and relationship learning, we show examples of falsely returned MSGM outputs in Fig. 4. In Figs. 4(a) and 4(c), the objects are detected and tracked, and both MSGM and our approach yield correct returns. In Fig. 4(b), the suitcase is temporarily occluded by the vehicle. MSGM returns this as an example of object take-out, as the suitcase is falsely described as appearing after being occluded by the vehicle. Our proposed approach incorporates re-ID to classify the suitcase as the same suitcase as prior to the occlusion; the suitcase is therefore not described as appearing and the example is rejected. Similarly, Fig. 4(d) shows an MSGM false return for person mount, where a pedestrian walks by a car before the associated track is broken due to shadows. A manually input deterministic distance for near across all perspectives leads to returning this as an example of person mount. In contrast, our approach, which learns an adaptive definition of near, identifies this relationship as not near and correctly rejects this as an example of person mount.

Fig. 4: Retrieval examples. (a) Detected obj. take-out. (b) Rejected false obj. take-out. (c) Detected mount return. (d) Rejected false mount return. (e) Clutter (VIRAT). (f) Clutter (AFRL). Panels (b) and (d) show examples of activities falsely returned by MSGM as obj. take-out and person mount. These activities are correctly rejected by our proposed approach. In (b), re-ID correctly stitches the suitcase tracks from before and after occlusion by the vehicle, and therefore does not return this as an example of obj. take-out. In (d), MSGM describes the person as near the vehicle, whereas our proposed approach does not, and therefore our approach does not return this as an example of person mount. Panels (e) and (f) demonstrate clutter present in the data, necessitating a retrieval system capable of reasoning over objects, attributes, and relationships.

VI-C Exploratory Search on WAMI Benchmark

| Query | BoW P@5 | BoW P@10 | BoW P@20 | MSGM P@5 | MSGM P@10 | MSGM P@20 | Proposed P@5 | Proposed P@10 | Proposed P@20 |
|---|---|---|---|---|---|---|---|---|---|
| Car starts | 0.80 | 0.80 | 0.75 | 0.40 | 0.30 | 0.45 | 0.80 | 0.80 | 0.75 |
| Person mount | 0.60 | 0.80 | 0.50 | 0.80 | 0.80 | 0.60 | 1.00 | 0.80 | 0.75 |
| Car stops | 0.60 | 0.70 | 0.60 | 0.40 | 0.40 | 0.40 | 0.60 | 0.70 | 0.60 |
| Person dismount | 0.60 | 0.60 | 0.50 | 0.60 | 0.60 | 0.55 | 0.60 | 0.60 | 0.70 |
| Car suspicious stop | 0.60 | 0.40 | 0.20 | 0.60 | 0.60 | 0.65 | 1.00 | 1.00 | 0.90 |
| Car following | 0.40 | 0.40 | 0.30 | 0.60 | 0.50 | 0.55 | 0.80 | 0.70 | 0.70 |
| Car following + stop | 0.60 | 0.40 | 0.40 | 0.80 | 0.50 | 0.60 | 1.00 | 0.70 | 0.80 |
| Car following + dismount | 0.60 | 0.50 | 0.30 | 0.80 | - | - | 1.00 | 1.00 | - |

TABLE III: Precision @ top-k return results for AFRL aerial benchmark dataset.

The AFRL Benchmark WAMI data is from a wide-area persistent surveillance sensor flying over 4 sq. km in Yuma, AZ. It contains 110 minutes of large (8000 × 8000), low-contrast, low frame rate (1.5 fps), low resolution (0.25 m/pixel) grayscale imagery. Vehicles and people occupy approximately 50-150 and 10 pixels, respectively, leading to noisy detector/tracker outputs.

We search for queries of varying complexity. Simple queries like car starts and car stops, where a stationary car starts moving or a moving vehicle comes to a prolonged stop, can be described by a single node with corresponding attributes. Person mount and person dismount are built on top of the single-node car queries by adding a person getting into or out of a vehicle. Complex queries like car suspicious stop search for a car that comes to a stop for a period of time and then continues moving. Finally, we search for composite queries: car following + stop, a car following activity immediately followed by a car suspicious stop activity, and car following + dismount, a car following activity immediately followed by a person dismount activity.

We compare performance of our proposed approach to BoW and MSGM in Table III and Fig. 3(c). Ground truth labeling is unavailable for this dataset, so we report performance as precision at k for k = 5, 10, 20.
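For reference, precision at the top-k returns (the P@5, P@10 and P@20 columns of Table III) can be computed from an operator's per-return judgements as in the sketch below. The function name and the choice to return `None` when fewer than k results exist are assumptions of this example, and the judgement list is made up for illustration.

```python
from typing import Optional, Sequence


def precision_at_k(judgements: Sequence[bool], k: int) -> Optional[float]:
    """Fraction of the top-k ranked returns judged correct.

    `judgements` is ordered by decreasing likelihood score. Returns None when
    fewer than k returns are available (an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(judgements) < k:
        return None
    return sum(1 for j in judgements[:k] if j) / k


# Illustrative operator judgements for the ten highest-ranked returns.
marks = [True, True, False, True, True, True, False, True, False, True]

p5 = precision_at_k(marks, 5)
p10 = precision_at_k(marks, 10)
p20 = precision_at_k(marks, 20)
```

Here `p20` is `None` because only ten returns were judged.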

Dominant vs. Weak Attributes

We can see that BoW outperforms MSGM for simple queries like car starts and car stops, where a dominant object attribute signature is present. This is reasonable since BoW learns attribute classifiers for car starting or stopping based on the speed of the vehicle, while MSGM uses a manually specified speed constraint. In contrast, when multiple agents are involved, and the structural relationships between agents therefore compose part of the query, MSGM outperforms BoW. This suggests the need for reasoning with relationships between objects to capture the activity. The proposed approach combines the attribute learning from BoW with the additional ability to learn semantic relationships, and as such outperforms both BoW and MSGM. Our performance gain is more significant on complex composite queries like car following + stop or car following + dismount, which demonstrates the benefits from different components of our system.

Co-occurring Activities

Figs. 4(e) and 4(f) demonstrate the importance of grounding when many other unrelated co-occurring activities are present in the data, which leads to significant degradation with BoW-based approaches. For these scenarios, a retrieval system must be able to reason with objects, attributes and relationships to find the correct grounding that matches the query.

VII Summary

In this paper, we incorporate similarity learning components into the problem of semantic activity retrieval in large surveillance videos. We represent semantic queries by activity graphs and propose a novel probabilistic approach to efficiently identify potential spatio-temporal locations to ground activity graphs in cluttered video. Our experiments show superior performance over methods that fail to consider structural relationships between objects or ignore input data noise and domain-specific variance. The proposed method is robust to visual distortions and capable of suppressing clutter that is inevitable in surveillance videos.


  • [1] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei, “Image retrieval using scene graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3668–3678.
  • [2] G. Castañón, Y. Chen, Z. Zhang, and V. Saligrama, “Efficient activity retrieval through semantic graph queries,” in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference.   ACM, 2015, pp. 391–400.
  • [3] D. Lin, S. Fidler, C. Kong, and R. Urtasun, “Visual semantic search: Retrieving videos via complex textual queries,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 2657–2664.
  • [4] Z. Xu, Y. Yang, and A. G. Hauptmann, “A discriminative CNN video representation for event detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1798–1807.
  • [5] M. Jain, J. van Gemert, and C. Snoek, “What do 15,000 object categories tell us about classifying and localizing actions?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 46–55.
  • [6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
  • [7] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014.
  • [8] K. Tang, L. Fei-Fei, and D. Koller, “Learning latent temporal structure for complex event detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 1250–1257.
  • [9] Y. Yang, Z. Ma, Z. Xu, S. Yan, and A. G. Hauptmann, “How related exemplars help complex event detection in web videos?” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2104–2111.
  • [10] Z. Ma, Y. Yang, N. Sebe, and A. G. Hauptmann, “Knowledge adaptation with partially shared features for event detection using few exemplars,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1789–1802, 2014.
  • [11] T. Shu, D. Xie, B. Rothrock, S. Todorovic, and S. Chun Zhu, “Joint inference of groups, events and human roles in aerial videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4576–4584.
  • [12] W. Choi and S. Savarese, “Understanding collective activities of people from videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1242–1257, 2014.
  • [13] T. E. Choe, H. Deng, F. Guo, M. W. Lee, and N. Haering, “Semantic video-to-video search using sub-graph grouping and matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.
  • [14] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan, “Zero-shot event detection using multi-modal fusion of weakly supervised concepts,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2665–2672.
  • [15] X. Chang, Y. Yang, A. G. Hauptmann, E. P. Xing, and Y.-L. Yu, “Semantic concept discovery for large-scale zero-shot event detection,” in Proceedings of the 24th International Conference on Artificial Intelligence, 2015, pp. 2234–2240.
  • [16] M. Elhoseiny, J. Liu, H. Cheng, H. Sawhney, and A. Elgammal, “Zero-shot event detection by multimodal distributional semantic embedding of videos,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.
  • [17] C. Gan, M. Lin, Y. Yang, Y. Zhuang, and A. G. Hauptmann, “Exploring semantic inter-class relationships (sir) for zero-shot action recognition,” in Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015, pp. 3769–3775.
  • [18] Y. Zhang, B. Gong, and M. Shah, “Fast zero-shot image tagging,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 5985–5994.
  • [19] M. Jain, J. C. van Gemert, T. Mensink, and C. G. M. Snoek, “Objects2action: Classifying and localizing actions without any video example,” in Proceedings of the IEEE International Conference on Computer Vision, December 2015.
  • [20] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 951–958.
  • [21] D. F. Fouhey and C. L. Zitnick, “Predicting object dynamics in scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 2019–2026.
  • [22] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “Bing: Binarized normed gradients for objectness estimation at 300fps,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293.
  • [23] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the 18th International Conference on Machine Learning, 2001.
  • [24] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
  • [25] G. D. Castanon, A. L. Caron, V. Saligrama, and P.-m. Jodoin, “Exploratory search of long surveillance videos,” in Proceedings of the 20th Annual ACM International Conference on Multimedia.   ACM, 2012, pp. 309–318.
  • [26] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis et al., “A large-scale benchmark dataset for event recognition in surveillance video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2011, pp. 3153–3160.
  • [27] P. Dollár, “Piotr’s Computer Vision Matlab Toolbox (PMT),”
  • [28] A. Andriyenko, K. Schindler, and S. Roth, “Discrete-continuous optimization for multi-target tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 1926–1933.
  • [29] J. Xiao, H. Cheng, H. Sawhney, and F. Han, “Vehicle detection and tracking in wide field-of-view aerial video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2010, pp. 679–684.
  • [30] S. Wu, S. Das, Y. Tan, J. Eledath, and A. Z. Chaudhry, “Multiple target tracking by integrating track refinement and data association,” in Proceedings of the 15th International Conference on Information Fusion.   IEEE, 2012, pp. 1254–1260.
  • [31] A. Divakaran, Q. Yu, A. Tamrakar, H. S. Sawhney, J. Zhu, O. Javed, J. Liu, H. Cheng, and J. Eledath, “Real-time object detection, tracking and occlusion reasoning,” May 23 2014, US Patent App. 14/286,305.
  • [32] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “Liblinear: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
  • [33] A. Das, A. Chakraborty, and A. K. Roy-Chowdhury, “Consistent re-identification in a camera network,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 330–345.
  • [34] T. Wang, S. Gong, X. Zhu, and S. Wang, “Person re-identification by discriminative selection in video ranking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2501–2514, Dec 2016.
  • [35] W. S. Zheng, S. Gong, and T. Xiang, “Towards open-world person re-identification by one-shot group-based verification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 591–606, March 2016.