I Introduction
Many routine tasks in surveillance video, such as activity detection, anomaly detection, and user-defined activity recognition and retrieval, currently require significant human attention. The goal of this paper is to develop exploratory search tools that enable rapid analysis by human operators.
We focus on retrieval of activity that matches an analyst- or user-described semantic activity (ADSA) from surveillance videos. Surveillance videos pose two unique issues: (a) wide query diversity; (b) the presence of many unrelated, co-occurring activities that share common components.
The wide diversity of ADSAs limits our ability to collect sufficient training data for different activities and to learn activity models for a complete list of ADSAs. Methods that can transfer knowledge from detailed activity descriptions to the visual domain are required. As noted in [1], while it would be desirable to learn to map textual descriptions to a semantic graph, this is itself an active area of research. To handle query diversity, we focus on a novel intermediate approach, wherein a user represents an activity as a semantic graph (see Fig. 2) with object attributes and inter-object semantic relationships associated with nodes and edges, respectively. We propose to bridge the relationship semantic gap by learning relationship concepts from annotated data. At the object/node level, we utilize existing state-of-the-art methods to train detectors, classifiers, and trackers to obtain detected outputs, class labels, track data, and other low-level outputs. This approach is practical because, in surveillance, the vocabulary of low-level components of a query is typically limited and can be assumed to be known in advance.
Our next challenge is to identify candidate groundings. By a grounding [1], we mean a mapping from archive video spatiotemporal locations to query nodes (see also Sec. II). Finding groundings of a query in a video is a combinatorial problem that requires searching over different candidate patterns that match the query. The difficulty arises from the many unrelated co-occurring activities that share node and edge attributes. Additionally, the outputs of low-level detectors, classifiers, and trackers are inevitably error-prone, leading to misdetections, misclassifications, and loss of tracks. Uncertainties can also arise due to the semantic gap. Consequently, efficient methods that match the activity graph with high confidence in the face of uncertainty are required.
This paper extends our preliminary work on activity retrieval [2] with a novel probabilistic framework that scores the likelihood of groundings and explicitly accounts for visual-domain errors and uncertainties. [2] proposes to identify likely candidate groundings via a ranked subgraph matching problem. By leveraging the fact that the attributes and relationships in the query have different levels of discriminability, a novel maximally discriminative spanning tree (MDST) is generated as a relaxation of the actual activity graph to quickly minimize the number of possible matches to the query while guaranteeing the desired recall rate. In [2], the activity graph that describes semantic activity requires a fixed, manual description of node attributes and edge relationships, which relies heavily on domain knowledge and is prone to noise from lower-level preprocessing algorithms. In this paper, we propose a probabilistic framework based on a CRF model of semantic activity that combines the activity graph with the confidence/margin outputs of our learned component-level classifiers, and outputs a likelihood score for each candidate grounding. Similarly, we pose the combinatorial problem of identifying likely candidate groundings as a constrained optimization problem of maximizing precision at a desired recall rate. To solve this problem we propose a successive refinement scheme that recursively attempts to find candidate matches at different levels of confidence. For a given level of confidence, we show that a two-step approach, first finding subtrees of the activity graph that are guaranteed to have high precision, then running a tree-based dynamic programming recursion to find the matches, leads to efficient solutions. Our method outperforms bag-of-objects/attributes approaches [3], demonstrating that objects/attributes are weak signatures for activity in surveillance videos, unlike other cases [4, 5, 6, 7].
We compare against approaches [2] based on manually encoding node/edge level relationships to bridge the visual domain gap and demonstrate that our semantic learning combined with probabilistic matching outperforms such methods.
I-A Related Work
Many methods have been proposed for video retrieval.
Classification Methods: Many video retrieval methods [8, 9, 10, 3, 11, 12, 13] at runtime take a video snippet (temporal video segment) as input and output a score based on how well it matches the desired activity. During training, activity classifiers for video snippets are learned [8, 9, 10, 3] using fully labeled training data. In this context, several recent works have proposed deep neural network approaches for learning representations for actions and events [4, 5, 6, 7]. These works leverage the fact that in some applications objects/attributes provide good visual signatures for characterizing activity. In contrast to these methods, we do not utilize activity-level training data. Furthermore, while these methods are suited for situations where an activity manifests as a dominant signature in the video snippet, they are ill-suited for situations where the activity signature is weak, namely, where the activity occurs among many other unrelated co-occurring activities, which is the typical scenario in surveillance problems.
Zero-Shot Methods: More recently, zero-shot methods have been applied to several visual tasks such as event detection [14, 15, 16], action recognition [17], image tagging [18], action localization [19], and image classification [20]. These methods share with our work the advantage that activity-level training data associated with the desired activity is not required. Nevertheless, zero-shot methods are trained on source-domain descriptions for a subset of activities, which allow for forging links between activity components that can then be leveraged for classification of unseen activity at test time. Furthermore, the current set of approaches is only suitable in scenarios where the activity has strong visual signatures in low-clutter environments.
Activity Graphs: It is worth pointing out that several works [3, 11, 12, 13] have developed structured activity representations, but they use fully annotated data as mentioned earlier. Lin et al. [3] describe a bipartite object/attribute matching method. Shu et al. [11] describe AND-OR graphs based on aggregating sub-events for test-time activity recognition. Similar to classification-based methods, these approaches only work well when the desired activity is dominant over a video snippet.
The proposed method is closely related to our preliminary work [2]. There, activities are manually represented as graph queries, and ground-truth data is utilized to reduce the video to a large annotated graph. A ranked subgraph matching algorithm is used to find matches in the video archive graph. In this way the object-level semantic gap is avoided. The relationship semantic gap is handled manually (for instance, nearness, proximity, etc. are entered manually in terms of pixel distances). This is somewhat cumbersome because relationships are often context dependent (see Footnote 3). [2] is primarily a deterministic subgraph matching solution that does not handle visual distortions such as misdetections and tracker failures well. In contrast, we formulate a probabilistic activity graph that explicitly accounts for visual distortions, bridge the semantic gap by learning low-level concepts, and propose an efficient probabilistic scoring scheme based on CRFs.
CRF Models for Retrieval:
Our proposed CRF framework closely resembles the CRF models employed for semantic image retrieval by Johnson et al. [1, 21]. They propose scene graphs to represent objects and the relationships between them, and train a CRF model using fully annotated training data. Trained on fully annotated data, these CRF models can also incorporate knowledge of typical global scenes and context in addition to low-level node/edge predictions. In contrast, our premise is that, in the video problem, we do not have adequate training data across all desired activities. In addition, unlike images, missed detections and track losses have a substantial impact in video retrieval. Finally, the spatiotemporal scale and size of surveillance videos, and the presence of unrelated co-occurring activities, lead to a probabilistic and combinatorial search problem.

II Activity Models
The goal of semantic activity retrieval is to spatiotemporally ground semantically described activities in large videos. As no examples of activities are provided, a semantic framework is necessary to represent the search activity. To capture activities involving multiple objects over potentially large temporal scale, we need a flexible framework capable of representing both the objects involved in the activity as well as relationships between these objects. To capture both the components of the activity as well as their relationships, we use an activity graph to define a query.
An activity graph is an abstract model of a user-described activity that captures object entities, their attributes, and the spatiotemporal relationships between objects. An activity graph provides a detailed description of video activity because it admits diverse sets of objects, attributes, and relationships. Graphs are a natural way to represent activities involving interaction between multiple objects. For example, consider the following activity:
Two men are meeting so one can give the other a backpack. They will meet and talk first, then they will go to a vehicle and drive away. One man is wearing a red shirt, the other is wearing a green shirt, and their vehicle is a blue sedan.
The above description can be represented as a composition of atomic elements, element descriptions, relationships between elements, and relationship descriptions. For example, the activity can be described by 4 atomic elements with specific descriptions, a person wearing red (P1), a person wearing green (P2), an object representing a backpack (O), and a blue car (V). Using these elements, the activity can be described by the interactions between these elements: initially, P1 and O are near each other, then P1, P2, and O are near each other. The three objects P1, P2, and O move near V, then O, P1, and P2 enter V, and finally, V moves.
Formally, an activity graph is composed of nodes, each representing a realization of an object at an instance of time, and edges, representing relationships between nodes.
We adapt the notation used in scene recognition [1] and assume we are given a set of object classes C, a set of attributes A associated with each object, and a set of relationships R between objects. An activity graph is defined as the tuple G = (O, E). O denotes the nodes of the graph, O = {o_1, ..., o_n}, with each node o_i characterized by its class c_i ∈ C and attributes A_i ⊆ A. Similarly, E denotes the set of edges between nodes of the graph, with each edge (o_i, o_j) characterized by its associated relationships R_ij ⊆ R, where R_ij represents the set of relationships between objects o_i and o_j.
Differing from image retrieval, edges in an activity graph represent not only spatial displacement, but additionally temporal displacement as well as identity information, to capture concepts such as “the same person is near the vehicle later.” Similarly, attributes associated with nodes also include timedependent attributes such as velocity.
In searching for activities, we seek to ground the activity graph in a video segment, that is, to associate each node and edge in our activity graph with parts of the video denoted by spatiotemporal bounding boxes. For the nodes O of an activity graph and a set of bounding boxes B, a grounding γ is a mapping between nodes and bounding boxes. Note that mapping nodes to bounding boxes is sufficient to map the graph to the video segment, as the edges are implicitly mapped by γ. For a grounding γ, we denote the bounding box that element o_i is mapped to by γ(o_i).
In this framework, the problem of semantic activity retrieval is equivalent to choosing a grounding for the activity graph. In Section III, we formulate an approach to efficiently grounding an activity graph in a large archive video.
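The graph-plus-grounding representation above can be sketched as a small data structure. This is a minimal illustration, with class names (`Node`, `BoundingBox`) and fields chosen for exposition rather than taken from the paper's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """A query node: an object at an instance of time, with class and attributes."""
    obj_class: str
    attributes: frozenset = frozenset()

@dataclass(frozen=True)
class BoundingBox:
    """A spatiotemporal bounding box: frame index plus pixel extent."""
    frame: int
    x: int
    y: int
    w: int
    h: int

# An activity graph: nodes, and edges labeled with relationship sets.
nodes = [Node("person", frozenset({"red-shirt"})),
         Node("vehicle", frozenset({"blue", "sedan"}))]
edges = {(0, 1): {"near"}}

# A grounding maps each query-node index to an archive bounding box;
# edges are implicitly grounded once their endpoints are mapped.
grounding = {0: BoundingBox(frame=12, x=40, y=60, w=30, h=80),
             1: BoundingBox(frame=12, x=90, y=55, w=120, h=70)}

grounded_edges = {(i, j): (grounding[i], grounding[j]) for (i, j) in edges}
```

Note that only the node mapping is stored explicitly; the grounded edges are derived from it, mirroring the implicit edge mapping described above.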
Representing text as an activity graph requires mapping nouns to objects and understanding relationships, activities, and interactions between elements. That problem is outside the scope of this paper; to better demonstrate the efficacy of our approach to retrieval, we focus solely on the problem of spatiotemporally locating activities in videos given a human-generated activity graph. In practice, these activity graphs are composed of components that are semantically interpretable to humans.
III Activity Retrieval by Graph Grounding
Our goal is to find an activity in a large archive video. To this end, we seek to find a grounding of an activity graph, representing the query activity, in the archive video. To solve this problem, we must address two main subproblems: how to evaluate the grounding between an activity graph and a set of object bounding boxes (generated from object proposal approaches such as [22]), and how to search over a large archive of bounding boxes in order to find the highest-scoring grounding. We first present an approach to evaluating a grounding between an activity graph and bounding boxes, then present an approach to efficiently reasoning over a large archive video to find the optimal groundings.
III-A Evaluating Activity Graph Grounding
To evaluate the grounding between an activity graph and a set of bounding boxes, we consider a maximum a posteriori inference scheme. For a graph G, set of bounding boxes B, and grounding γ, we consider the maximum a posteriori probability P(γ | G, B). We consider a conditional random field (CRF) model [23],
P(γ | G, B) ∝ ∏_{o_i ∈ O} P(γ(o_i) | o_i) · ∏_{(o_i, o_j) ∈ E} P(γ(o_i), γ(o_j) | R_ij).    (1)
Given that we are in a zero-shot setting, we consider uniform distributions over bounding boxes, P(B), and activity graph nodes, P(G). From Bayes' rule, the conditional probability can be expressed in terms of these distributions. Our goal is to find the maximum a posteriori grounding,
γ* = argmax_γ P(γ | G, B).    (2)
Note that due to the uniform distribution assumptions on P(B) and P(G), these terms are constant and can be ignored in finding the maximum a posteriori grounding.
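The factorized objective can be illustrated with a toy scorer that ranks candidate groundings by the product of their node- and edge-level probabilities, accumulated in log space for numerical stability. The function name and the probability values are hypothetical:

```python
import math

def log_map_score(node_probs, edge_probs):
    """Log-score of one grounding under the factorized model: the sum of
    log node-level probabilities and log edge-level probabilities,
    equivalent to the product of the probabilities themselves."""
    return (sum(math.log(p) for p in node_probs)
            + sum(math.log(p) for p in edge_probs))

# Two candidate groundings of the same two-node, one-edge query:
# the higher-scoring one is the (approximate) MAP grounding.
g1 = log_map_score(node_probs=[0.9, 0.8], edge_probs=[0.7])
g2 = log_map_score(node_probs=[0.6, 0.5], edge_probs=[0.4])
best = max((g1, "g1"), (g2, "g2"))
```

In practice the per-node and per-edge probabilities come from the learned detectors and relationship classifiers described next.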
III-A1 Learning Node & Edge Level Probability Models
To evaluate the maximum a posteriori probability of a grounding, the node-level distributions P(γ(o_i) | o_i) and edge-level distributions P(γ(o_i), γ(o_j) | R_ij) need to be estimated. The distribution P(γ(o_i) | o_i) represents the probability that the bounding box specified by γ(o_i) has the class c_i and attributes A_i associated with node o_i. We assume that the probabilities of class and attributes are independent, and therefore we can model this as a product of distributions:

P(γ(o_i) | o_i) = P(c_i | γ(o_i)) · ∏_{a ∈ A_i} P(a | γ(o_i)).    (3)
Estimating each of these probabilities is accomplished by learning an object detector or attribute classifier, with the output margin mapped to a probability using a logistic model,

P(c_i | γ(o_i)) = 1 / (1 + exp(a · s_c + b)),

where s_c is the output margin of the detector for class c_i, and a and b are two scalar parameters that can be set heuristically or learned with Platt scaling [24]. Similarly, we learn semantic relationship classifiers on features from detected object pairs, and estimate the edge distribution as in the case of object probabilities. Our perspective is that the vocabulary typically used by analysts to describe complex activity is known a priori ("carry, together, with, near"). In [2], relationships were manually specified, which was convenient for the subsequent matching stage since everything was deterministic. However, this limits the method's robustness across scenarios. For example, consider the relationship near between two objects. The manual way is to set a pixel-distance threshold for identifying the "near" property between two objects. However, the semantic meaning of near is strongly dependent on the context of the two objects: near for moving vehicles is different from near for stationary vehicles or for two persons.
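The margin-to-probability mapping can be sketched as follows; the default parameter values here are illustrative stand-ins, not learned Platt-scaling coefficients:

```python
import math

def platt_probability(margin, a=-1.0, b=0.0):
    """Map a raw classifier margin to a probability with a logistic model,
    P = 1 / (1 + exp(a * margin + b)).  With a < 0, larger (more confident)
    margins yield probabilities closer to 1.  The scalars a and b can be
    set heuristically or fit on held-out margins (Platt scaling)."""
    return 1.0 / (1.0 + math.exp(a * margin + b))
```

With the defaults, a margin of zero maps to probability 0.5, and the mapping is monotonically increasing in the margin.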
The details of the node and edge probability models we used are described in Sec. V.
III-B Efficient Grounding in Large Graphs
In the previous section, we presented an approach to estimating the conditional probability of a grounding for a given activity graph. Although the probability of a specific grounding can be estimated efficiently, a combinatorially large number of possible groundings exists between an activity graph and a collection of bounding boxes. Furthermore, because surveillance videos are long, the collection of extracted bounding boxes is generally large.
In order to efficiently find the maximum a posteriori grounding of an activity graph in a video, we instead consider the following optimization problem:
(4) 
Note that for the proper setting of the node and edge thresholds, τ_o and τ_e, the solution of (4) is equivalent to the solution of (2). When the thresholds are set below these optimal values, the solution is non-unique, with a set of possible groundings returned, one of which is the solution to (2). By scoring the groundings that maximize (4) according to the objective of (2), we are able to find the optimal grounding from this subset.
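The two-stage scheme described above, threshold filtering followed by rescoring the survivors with the full objective, can be sketched on toy probability lists. The function name `filtered_best` and its inputs are hypothetical:

```python
def filtered_best(groundings, tau_node, tau_edge):
    """Keep only groundings whose every node and edge probability clears
    its threshold, then rank the survivors by the full product objective.
    groundings: list of (node_probs, edge_probs) tuples."""
    def full_score(g):
        score = 1.0
        for p in g[0] + g[1]:
            score *= p
        return score

    survivors = [g for g in groundings
                 if all(p >= tau_node for p in g[0])
                 and all(p >= tau_edge for p in g[1])]
    return max(survivors, key=full_score) if survivors else None

candidates = [([0.9, 0.8], [0.7]),
              ([0.95, 0.2], [0.9]),   # fails the node threshold below
              ([0.85, 0.8], [0.75])]
winner = filtered_best(candidates, tau_node=0.5, tau_edge=0.5)
```

Lowering the thresholds only enlarges the survivor set, so the full-objective rescoring still recovers the same optimum.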
Our goal is to find a set of thresholds that maximize precision subject to a recall constraint. For a grounding γ, we define s(γ) as the value of the objective of (4).
Let y(γ) denote whether or not a grounding corresponds to the desired activity. For a set of thresholds, the precision can be expressed as the probability of a grounding that corresponds to the desired activity attaining the maximal objective value s(γ), divided by the probability of any grounding attaining that value:
We assume the probability of a grounding corresponding to the desired activity is significantly smaller than the probability of a grounding not corresponding to the desired activity, allowing for the approximation:
Similarly, we can express the recall rate as
We therefore seek to maximize the approximate precision subject to the recall rate being at least some value r:
Note that the ratio is an unknown quantity dependent on the archive video; however, as this quantity is a constant, its value does not affect the optimization. Assuming independence of the nodes and edges of the activity graph (and their attributes), the remaining conditional probabilities can be estimated by evaluating detector performance, with thresholds chosen accordingly.
Solving the optimization problem in (4) has the potential to be significantly more efficient than solving the optimization problem in (2) through the use of branch-and-bound approaches, particularly due to the ability to bound aggressively by eliminating any solution subtree in which one node or edge does not meet the associated threshold. Unfortunately, despite the potential improvement in efficiency, this problem is still combinatorially hard and may be computationally infeasible, particularly for a large collection of bounding boxes.
Rather than directly solving this problem, consider a subgraph of G that we denote as G' = (O', E'). Consider the problem of finding a grounding for this subgraph:
(5) 
For this subgraph matching problem, we make the following observation:
Theorem III.1.
Thm. III.1 implies that the set of groundings that maximize (5) includes all groundings that also maximize (4). Therefore, the set of groundings that maximize (5) retains the full recall rate, though the precision may be decreased.
Thm. III.1 leads to an efficient approach to solving (4). Rather than searching for the full graph G, we instead consider a subgraph that can be searched efficiently. From Thm. III.1, all subgraphs of G retain the full recall rate; however, the choice of subgraph directly impacts the precision of the set of groundings that maximize (5). We therefore propose selecting a Highest Precision Subgraph (HPS), defined as the subgraph of G with a minimal expected number of groundings that maximize (5).
In particular, we attempt to find an HPS from the set of spanning trees of G, as tree search can be solved efficiently using dynamic programming [25]. From our model in (1), we assume that each edge is distributed independently. Therefore, we can find the HPS from the set of spanning trees of G by finding a spanning tree over the graph that minimizes the likelihood that an edge probability is above the associated threshold:
(6) 
where the edge set is restricted to those yielding a valid spanning tree over G, and P_V is the distribution over relationships in the video. In practice, P_V can be efficiently approximated by randomly sampling bounding box pairs and estimating their distribution. Solving the optimization in (6) can be done efficiently, as the problem maps to a minimum spanning tree problem. We explain the details of the algorithm in Sec. IV.
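Since (6) maps to a minimum spanning tree problem, a standard Kruskal implementation suffices. This is a generic sketch, with edge weights standing in for the (log) background match rates of each relationship, so that the minimum-weight tree keeps the most discriminative edges:

```python
def kruskal_mst(num_nodes, weighted_edges):
    """Minimum spanning tree via Kruskal's algorithm with union-find.
    weighted_edges: list of (weight, u, v) tuples.  Returns the chosen
    edges as (u, v) pairs.  With weights equal to background match rates
    of relationships, this tree is the highest-precision spanning tree."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(weighted_edges):  # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                        # only join separate components
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Three query nodes; "near" between 0 and 1 is common (weight 0.9),
# the other two relationships are rare and thus discriminative.
tree = kruskal_mst(3, [(0.9, 0, 1), (0.1, 1, 2), (0.2, 0, 2)])
```

Here the common edge (0, 1) is dropped and the two rare edges are kept, which is exactly the precision-maximizing choice.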
IV Highest Precision Spanning Tree Algorithm
The goal of our algorithm is to efficiently retrieve the optimal grounding for an activity graph G. We accomplish this in two steps. First, we calculate the highest precision subgraph for G; the selected subgraph minimizes (6) among all spanning trees of G, and thus filters out as many infeasible groundings as possible. Then we develop a Highest Precision Subtree Matching (HPSM) approach to find the optimal groundings that maximize (5). Finally, we recover the ranked groundings with respect to the original activity graph G.
IV-A Highest Precision Subtree Selection
Given an activity graph G, we first reduce the video data to the set of potentially relevant nodes and edges by building a coarse archive graph out of the spatiotemporal bounding boxes B. For every node o_i, we retrieve the set of corresponding locations that satisfy the class and attributes c_i and A_i. Similarly, we retrieve the corresponding edges.
Despite these substantial cost savings, the downsampled coarse graph is still large, containing hundreds of thousands of bounding boxes. We therefore select an HPS of the activity graph so as to minimize the time spent performing an expensive search.
The choice of which spanning tree to select has significant runtime implications. Creating a spanning tree involves removing edges from G, and not all edges are created equal. The edges in the HPS should be the set of edges that minimizes the likelihood in (6). To solve the optimization, we first compute a set of weights indicating the discriminative power of the edges, then calculate the spanning tree that minimizes the total edge weight.
IV-A1 Weight Computation
During the archival process, we assign probabilities p_c, p_a, and p_r to each class, attribute, and relationship that we store. These denote the probability that a randomly chosen class, attribute, or relationship in the archive video matches the class c, attribute a, or relationship r. Relationships, in particular, have greater discriminative power because the set of potential relationship instances, defined over pairs of objects, is large.
The set of edges that minimizes (6) is associated with the most discriminative relationships, so that it yields the fewest possible mappings in the coarse archive graph. For example, in the VIRAT [26] dataset, while the "person near car" relationship is normally very discriminative, 80% of the dataset is shot in parking lots, where people are frequently near cars. Objects disappear near cars far less frequently; thus, a tree rooted at the "object disappears" node and connecting through the "near" edge to the "car" node has fewer potential matches than one starting elsewhere.
We compute empirical values for p_r during the archival process by computing the frequency with which each relationship appears in the videos. If a relationship has not been seen, it is assumed to be non-discriminative and assigned a default value. If it is later determined that such relationships are discriminative, we can revise our estimate of p_r.
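Estimating these empirical relationship frequencies by sampling can be sketched as follows; the function name and the conservative default weight for unseen relationships are assumptions of this sketch:

```python
from collections import Counter

def relationship_weights(sampled_relationships, vocabulary, default=1.0):
    """Estimate p_r for each relationship in the query vocabulary as its
    empirical frequency among sampled archive object pairs.  Relationships
    never observed get a non-discriminative default weight until evidence
    says otherwise (matching the conservative assumption in the text)."""
    counts = Counter(sampled_relationships)
    total = max(len(sampled_relationships), 1)
    return {r: (counts[r] / total if counts[r] else default)
            for r in vocabulary}

# 8 of 10 sampled pairs are "near" (common, weak), 2 are "later";
# "carry" was never sampled, so it keeps the default weight.
weights = relationship_weights(["near"] * 8 + ["later"] * 2,
                               ["near", "later", "carry"])
```

These weights then serve directly as the edge weights minimized by the spanning-tree selection.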
IV-A2 Highest Precision Spanning Tree
From our model in (1), we assume that each edge is distributed independently. Absent additional information indicating the distribution of relationships in the video corpus, we also assume that all relationships are generated independently. Since relationships and edges in an activity graph G = (O, E) are independent, the total edge weight of the graph is:
(7) 
As noted in (5), we search using an HPS instead of the original query graph in order to reduce computational complexity. The HPS that results in the fewest possible groundings is our novel highest precision spanning tree.
Definition (Highest Precision Spanning Tree (HPST)).
We call a spanning tree T* an HPST of activity graph G with edge weights w_e if it satisfies

T* = argmin_{T ∈ T(G)} Σ_{e ∈ T} w_e,    (8)

where T(G) denotes the set of all possible spanning trees induced from the activity graph G.
This is exactly the definition of a minimum spanning tree. By minimizing the total edge weight, we obtain a Highest Precision Spanning Tree T*, which solves the optimization in (6). We use Kruskal's algorithm to calculate the HPST, which should produce the fewest possible matches.
IV-B Highest Precision Subtree Matching (HPSM)
Given a coarse graph and the HPST T*, we seek to select the maximum a posteriori grounding from all possible groundings. We solve for the optimal grounding between the HPST and the archive graph in two steps. In the first step, we construct a matching graph M of the possible groundings. Then we find the optimal grounding from M.
IV-B1 Matching Graph Creation
We build a matching graph M, where each node is a tuple denoting a proposed assignment between a node in T* and a node in the coarse archive graph, and each edge denotes the relationship between two assignments. All assignments in M satisfy the node and edge thresholds described in (5), so that we can rule out impossible mappings.
We create M by first adding assignments for the root, then adding assignments for its successors that satisfy both the node and edge relationships described in T*. We then set the score thresholds τ_o and τ_e to the minimum scores for nodes and edges and find a set of mappings that maximize (5). The proper setting of the thresholds ensures that no feasible grounding is ruled out in the filtering process. This process is described in Algorithm 1, and the expected number of mappings scales with the size of the archive data.
IV-B2 Retrieval with HPSM
After traversing from root to leaves to create the matching graph, we traverse it from leaves to root to determine the optimal solution for each root-node assignment. To evaluate the matching score of a grounding, we use the maximum a posteriori probability described in (2), where the score is the product of the node and edge distributions. For each leaf node in M, we merge nodes with their parents, keeping the one with the best score. We repeat this process until only mappings to root nodes are left, and then sort these root nodes by the score of the best tree that uses them. This process is described in Algorithm 2, with complexity proportional to the size of the matching graph.
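The leaves-to-root merge can be written as a dynamic program over the matching graph. This is a simplified sketch that returns only the best total log score; the data layout (`tree_children`, `cand_scores`, `edge_scores`) is a hypothetical encoding, not the paper's Algorithm 2:

```python
def best_grounding_score(tree_children, cand_scores, edge_scores, root=0):
    """Leaves-to-root dynamic program over a matching graph.
    tree_children[q]: child query-nodes of query-node q in the HPST.
    cand_scores[q]: {candidate: log node score} for query-node q.
    edge_scores[(q, child)]: {(cand_q, cand_child): log edge score}.
    Returns the best total log score over all groundings of the tree."""
    def solve(q):
        children = tree_children.get(q, [])
        child_tables = [solve(c) for c in children]  # leaves solved first
        best = {}
        for cand, score in cand_scores[q].items():
            total, feasible = score, True
            for child, table in zip(children, child_tables):
                # Keep only the best child assignment compatible with cand.
                choices = [table[cc]
                           + edge_scores[(q, child)].get((cand, cc),
                                                         float("-inf"))
                           for cc in table]
                if not choices or max(choices) == float("-inf"):
                    feasible = False
                    break
                total += max(choices)
            if feasible:
                best[cand] = total
        return best

    return max(solve(root).values())

# Two-node tree: root 0 with candidates {a, b}, child 1 with candidate x.
score = best_grounding_score(
    tree_children={0: [1]},
    cand_scores={0: {"a": 0.0, "b": -1.0}, 1: {"x": 0.0}},
    edge_scores={(0, 1): {("a", "x"): -0.5, ("b", "x"): 0.0}},
)
```

Each query-tree node is visited once and each candidate pair per edge once, which is what makes tree-structured queries efficient to ground.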
This process yields a set of groundings for each potential activity, generally on the order of the number of true groundings in the data. We then iterate through the groundings and filter out those with poor scores on the edges not present in the HPST T*. In this way, we recover the grounding results for the original problem (2) from the approximated problem (5). This gives us the speed of the HPSM approach together with the quality of full graph grounding.
V Implementation
In this section, we present implementation details of our approach. Fig. 2 shows an overview of our system. At a high level, it operates as follows: as an archive video is recorded, detectors are applied to extract bounding boxes of objects of interest. These bounding boxes are then fused through a tracker and classified, yielding tracklets of objects that are stored in a table along with some simple attributes. At query time, an analyst provides an ADSA query as an activity graph. From this query, an HPST is found according to (6). The set of groundings that maximize (5) is then found in the table. These groundings are scored and returned according to (1).
ADSA Query Vocabulary
We construct a vocabulary that corresponds to nodes and edges in the ADSA activity graph to allow for semantic descriptions of queries. We consider three classes of items: person, object, and vehicle. Each item has a set of attributes that can be included in the query, such as size, appearing, disappearing, and speed. Between each of these items, we define the following relationship attributes: same entity, near, not near, and later. The set of items and attributes can be expanded to include additional or more specific classes/descriptors. Due to the limited variety of objects in the datasets, we limit ourselves to simple semantic descriptors to prevent attributes from dominating the returns. By limiting the descriptiveness of attributes in our queries, we demonstrate retrieval capability in the presence of possible confusers. For a query such as "two people loading an object into a pink truck," a method that leverages primarily the color is not sufficiently general to handle ADSAs that do not include strong attribute descriptions.
Detection and Tracking
We demonstrate the proposed method on two datasets. For the high-quality VIRAT ground dataset [26], we use Piotr's Computer Vision Matlab Toolbox [27] to extract detections and then fuse them into tracklets [28]. In the case of the low-resolution WAMI AFRL data [2], we apply algorithms designed specifically for aerial data [29, 30, 31].

Relationship Learning
We learn semantic relationships by training a classifier on annotated positive and negative relationship examples of object pairs. For example, the relationship descriptor "near" between two items is learned by training a classifier on features of the two objects, such as sizes, aspect ratios, and the distance between them, using a set of annotated examples of items that are and are not near. Linear SVMs [32] are used to learn the relationships and provide the probabilities.
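A pairwise relationship scorer of this kind might look as follows. The features (center distance normalized by object size, area ratio) are plausible choices, and the linear weights are hand-set stand-ins for a trained SVM with Platt-scaled outputs, not learned values:

```python
import math

def pair_features(box_a, box_b):
    """Features for a candidate object pair.  Boxes are (x, y, w, h):
    center distance normalized by the larger object dimension, plus the
    area ratio of the two objects."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    scale = max(box_a[2], box_a[3], box_b[2], box_b[3])
    dist = math.hypot(ax - bx, ay - by) / scale
    ratio = (box_a[2] * box_a[3]) / (box_b[2] * box_b[3])
    return [dist, ratio]

def near_probability(box_a, box_b, weights=(-2.0, 0.0), bias=3.0):
    """Linear score over pair features mapped through a logistic, standing
    in for a linear SVM with Platt-scaled outputs.  The weights and bias
    here are illustrative, not learned from annotated examples."""
    margin = sum(w * f for w, f in
                 zip(weights, pair_features(box_a, box_b))) + bias
    return 1.0 / (1.0 + math.exp(-margin))
```

Because the distance feature is normalized by object size, the same scorer behaves differently for small pedestrians and large vehicles, which is the context dependence that a fixed pixel threshold misses.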
ReID
Many of our queries require maintaining identity over long periods of time, while tracked data inevitably contains lost tracks. We thus leverage re-identification (reID): we apply a linear classifier over the outer product of features from a pair of tracklets, trained as an SVM. This classifier is applied universally, independent of context, pose, illumination, etc.
VI Experiments
We perform semantic video retrieval experiments on the VIRAT Ground 2.0 dataset [26] and the AFRL Benchmark WAMI data [2]. Given a set of activity graph queries, each algorithm is asked to return a ranked list of groundings in the archive video based on their likelihood scores. Each grounding is then represented by the minimal bounding spatiotemporal volume of the involved bounding boxes. For the VIRAT dataset, where ground truth is provided, standard precision-recall curves are produced by varying the scoring threshold. For the unannotated AFRL data, a human operator evaluates the precision of the top-k returns by watching the corresponding spatiotemporal window of the video. Each return is marked as a true detection if the overlap of the returned spatiotemporal volume and the true spatiotemporal volume is larger than 50% of their union.
As stated in Sec. I-A, most related methods [8, 9, 10, 3, 11, 12, 13] are not applicable to our setup, as they retrieve relevant videos from a collection of short video snippets. We compare our performance with two approaches: a bag-of-words (BoW) scheme and a Manually Specified Graph Matching (MSGM) scheme. BoW [3, 11] collects objects, object attributes, and relationships into a bag and ignores the structural relationships. To identify groundings, a bipartite matching scheme is utilized to find an assignment between the bag of words and a video snippet. We use our trained models for node-level concepts in this context. For the MSGM method [2], we quantify relationships by manually annotating data using bounding boxes for objects and then apply the subgraph matching of [2] on test data.
VI-A Baseline Performance
We first show the baseline performance of the three methods on human-annotated track data of the VIRAT Ground 2.0 dataset [26] with a set of seven queries. The VIRAT dataset is composed of 40 gigabytes of surveillance videos capturing 11 scenes of moving people and vehicles interacting. Resolution varies, with roughly 50-100 pixels representing a pedestrian and around 200×200 pixels for vehicles.
As shown in Table I and Fig. 2(a), the proposed approach outperforms BoW and MSGM. On human-annotated track data, where we assume no uncertainty at the object level, both MSGM and the proposed method significantly outperform BoW. All queries include some level of structural constraint between objects; for example, there is an underlying distance constraint between the people, car and object involved in object deposit. In a cluttered surveillance video where multiple activities occur at the same time, an algorithm that solves a bipartite matching between people, cars and objects while ignoring the global spatial relationships between them can pick unrelated agents from different activities, resulting in the low detection accuracy of BoW. This shows that global structural relationships, rather than isolated object-level descriptors, are important. The performance gap between MSGM [2] and our method indicates the importance of semantic concept learning and probabilistic reasoning over manually specified relationships and deterministic matching.
Table I: AUC (%) per query on human-annotated track data (VIRAT).

| Query | BoW | MSGM | Proposed |
| --- | --- | --- | --- |
| Person dismount | 15.33 | 78.26 | 83.93 |
| Person mount | 21.37 | 70.61 | 83.94 |
| Object deposit | 26.39 | 71.34 | 85.69 |
| Object takeout | 8.00 | 72.70 | 80.07 |
| 2 person deposit | 14.43 | 65.09 | 74.16 |
| 2 person takeout | 19.31 | 80.00 | 90.00 |
| Group Meeting | 25.20 | 82.35 | 88.24 |
| Average | 18.58 | 74.34 | 83.72 |
Table II: AUC (%) per query on detected and tracked data; the proposed method is ablated into reID-only, relationship-learning-only (RL), and full variants.

| Query | BoW | MSGM | Proposed (reID) | Proposed (RL) | Proposed (Full) |
| --- | --- | --- | --- | --- | --- |
| Person dismount | 6.27 | 22.51 | 21.69 | 25.98 | 30.51 |
| Person mount | 1.38 | 20.98 | 23.12 | 29.41 | 35.98 |
| Object deposit | 7.90 | 46.27 | 47.79 | 47.62 | 49.13 |
| Object takeout | 16.80 | 34.92 | 35.32 | 41.98 | 42.12 |
| 2 person deposit | 3.38 | 46.11 | 49.44 | 50.83 | 50.83 |
| 2 person takeout | 15.27 | 48.03 | 48.03 | 49.28 | 49.28 |
| Group Meeting | 23.53 | 30.80 | 39.51 | 30.80 | 47.64 |
| Average | 10.65 | 35.66 | 37.84 | 39.41 | 43.64 |
VI-B Probabilistic Reasoning with Noisy Input Data
We perform an ablative analysis of our approach with detected and tracked bounding boxes in Table II and Fig. 2(b). To demonstrate the effect of reID and relationship learning, we report performance with only reID, with only relationship learning, and with both reID and relationship learning.
The performance of all three methods degrades on tracked data due to missed detections, misclassifications and track errors. While ours degrades significantly, we still outperform existing methods trained for an a priori known set of activities.^1 For BoW, the performance loss is large for the first six queries, for the reasons explained in Sec. VI-A. Note that with BoW the group meeting query does not suffer significant degradation, since it is more node-dominant than the other queries (i.e., bipartite matching identifies multiple people present at the same time, which is a strong indicator of a meeting, particularly in the absence of other co-occurring confusers).

^1 Significant performance degradation with track data has also been observed in the context of activity classification, even when fully annotated data is available [11].
Clutter vs. visual distortion
On human-annotated bounding boxes, we achieve an average AUC of 83.72%, indicating that our method performs well in cluttered video free of visual distortions. Our performance drop to 43.64% on tracked data is directly due to the visual distortions introduced by missed detections, misclassifications and lost tracks. This suggests that while our method compensates for some of the visual distortions, it remains important to improve detection, classification and tracking techniques.
To visualize the importance of reID and relationship learning, we show examples of falsely returned MSGM outputs in Fig. 4. In Figs. 3(a) and 3(c), the objects are correctly detected and tracked, and both MSGM and our approach yield correct returns. In Fig. 3(b), the suitcase is temporarily occluded by the vehicle. MSGM returns this as an example of object takeout, as the suitcase is falsely described as appearing after being occluded by the vehicle. Our proposed approach uses reID to identify the suitcase after the occlusion as the same suitcase as before it; the suitcase is therefore not described as appearing, and the example is rejected. Similarly, Fig. 3(d) shows an MSGM false return for person mount, where a pedestrian walks by a car before the associated track is broken by shadows. A manually specified deterministic distance threshold for near, applied across all perspectives, leads to returning this as an example of person mount. In contrast, our approach learns an adaptive definition of near, identifies this relationship as not near, and correctly rejects the example.
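The contrast between a fixed distance threshold and a learned, soft "near" concept can be illustrated with a one-feature logistic model fit to annotated distance/label pairs. The data, learning rate, and model form below are illustrative assumptions, not the paper's exact estimator:

```python
import math

def fit_near(dists, labels, lr=0.5, steps=2000):
    """Fit P(near | d) = sigmoid(a*d + b) by gradient descent on
    annotated (distance, is_near) pairs."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for d, y in zip(dists, labels):
            p = 1.0 / (1.0 + math.exp(-(a * d + b)))
            ga += (p - y) * d       # gradient of log-loss w.r.t. a
            gb += (p - y)           # gradient of log-loss w.r.t. b
        a -= lr * ga / len(dists)
        b -= lr * gb / len(dists)
    return a, b

# Illustrative annotations: shorter distances labeled near (1).
dists  = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [1,   1,   1,   1,   0,   0,   0,   0]
a, b = fit_near(dists, labels)
p_near = lambda d: 1.0 / (1.0 + math.exp(-(a * d + b)))
```

A hard threshold forces a yes/no decision at one fixed distance in all perspectives, whereas the soft score p_near(d) can feed a probabilistic matcher and be adapted per scene from annotations, which is the behavior the rejected person mount example relies on.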
VI-C Exploratory Search on WAMI Benchmark
Table III: Precision at k (P@5 / P@10 / P@20) on the AFRL WAMI benchmark; missing entries are left as "–".

| Query | BoW (P@5 / P@10 / P@20) | MSGM (P@5 / P@10 / P@20) | Proposed (P@5 / P@10 / P@20) |
| --- | --- | --- | --- |
| Car starts | 0.80 / 0.80 / 0.75 | 0.40 / 0.30 / 0.45 | 0.80 / 0.80 / 0.75 |
| Person mount | 0.60 / 0.80 / 0.50 | 0.80 / 0.80 / 0.60 | 1.00 / 0.80 / 0.75 |
| Car stops | 0.60 / 0.70 / 0.60 | 0.40 / 0.40 / 0.40 | 0.60 / 0.70 / 0.60 |
| Person dismount | 0.60 / 0.60 / 0.50 | 0.60 / 0.60 / 0.55 | 0.60 / 0.60 / 0.70 |
| Car suspicious stop | 0.60 / 0.40 / 0.20 | 0.60 / 0.60 / 0.65 | 1.00 / 1.00 / 0.90 |
| Car following | 0.40 / 0.40 / 0.30 | 0.60 / 0.50 / 0.55 | 0.80 / 0.70 / 0.70 |
| Car following + stop | 0.60 / 0.40 / 0.40 | 0.80 / 0.50 / 0.60 | 1.00 / 0.70 / 0.80 |
| Car following + dismount | 0.60 / 0.50 / 0.30 | 0.80 / – / – | 1.00 / 1.00 / – |
The AFRL Benchmark WAMI data comes from a wide-area persistent surveillance sensor flying over 4 sq. km in Yuma, AZ. It contains 110 minutes of large (8000×8000 pixels), low-contrast, low-frame-rate (1.5 fps), low-resolution (0.25 m/pixel) grayscale imagery. Vehicles and people occupy roughly 50-150 and 10 pixels, respectively, leading to noisy detector/tracker outputs.
We search for queries of varying complexity. Simple queries like car starts and car stops, where a stationary car starts moving or a moving vehicle comes to a prolonged stop, can be described by a single node with corresponding attributes. Person mount and person dismount build on the single-node car queries by adding a person getting into or out of a vehicle. The complex query car suspicious stop searches for a car that comes to a stop for a period of time and then continues moving. Finally, we search for composite queries: car following + stop, a car following activity immediately followed by a car suspicious stop activity, and car following + dismount, a car following activity immediately followed by a person dismount activity.
We compare the performance of our proposed approach to BoW and MSGM in Table III and Fig. 2(c). Ground truth labeling is unavailable for this dataset, so we report performance as precision at k (P@k) for k ∈ {5, 10, 20}.
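The P@k metric itself is straightforward; as a minimal sketch (the relevance judgments below are illustrative, not actual operator data):

```python
def precision_at_k(ranked_hits, k):
    """Fraction of the top-k returned groundings judged correct;
    ranked_hits is a relevance list (1 = true detection, 0 = false)
    ordered by decreasing likelihood score."""
    top = ranked_hits[:k]
    return sum(top) / len(top)

# Illustrative operator judgments for ten ranked returns.
hits = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
p5, p10 = precision_at_k(hits, 5), precision_at_k(hits, 10)
```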
Dominant vs. Weak Attributes
We can see that BoW outperforms MSGM for simple queries like car starts and car stops, where a dominant object-attribute signature is present. This is expected, since BoW learns attribute classifiers for car starting or stopping based on the speed of the vehicle, while MSGM uses a manually specified speed constraint. In contrast, when multiple agents are involved and structural relationships between agents thus compose part of the query, MSGM outperforms BoW. This suggests the need to reason with relationships between objects to capture the activity. The proposed approach combines the attribute learning of BoW with the additional ability to learn semantic relationships, and as such outperforms both BoW and MSGM. Our performance gain is most significant on complex composite queries like car following + stop or car following + dismount, which demonstrates the benefits of the different components of our system.
Co-occurring Activities
Figs. 3(e) and 3(f) demonstrate the importance of grounding when many unrelated co-occurring activities are present in the data, which leads to significant degradation for BoW-based approaches. In such scenarios, a retrieval system must reason jointly over objects, attributes and relationships to find the correct grounding that matches the query.
VII Summary
In this paper, we incorporate similarity learning components into the problem of semantic activity retrieval in large surveillance videos. We represent semantic queries by activity graphs and propose a novel probabilistic approach to efficiently identify potential spatiotemporal locations that ground activity graphs in cluttered video. Our experiments show superior performance over methods that fail to consider structural relationships between objects or that ignore input-data noise and domain-specific variance. The proposed method is robust to visual distortions and capable of suppressing the clutter that is inevitable in surveillance videos.
References

 [1] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei, “Image retrieval using scene graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3668–3678.
 [2] G. Castañón, Y. Chen, Z. Zhang, and V. Saligrama, “Efficient activity retrieval through semantic graph queries,” in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference. ACM, 2015, pp. 391–400.
 [3] D. Lin, S. Fidler, C. Kong, and R. Urtasun, “Visual semantic search: Retrieving videos via complex textual queries,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 2657–2664.
 [4] Z. Xu, Y. Yang, and A. G. Hauptmann, “A discriminative CNN video representation for event detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1798–1807.
 [5] M. Jain, J. van Gemert, and C. Snoek, “What do 15,000 object categories tell us about classifying and localizing actions?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 46–55.

 [6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
 [7] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014.
 [8] K. Tang, L. FeiFei, and D. Koller, “Learning latent temporal structure for complex event detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1250–1257.
 [9] Y. Yang, Z. Ma, Z. Xu, S. Yan, and A. G. Hauptmann, “How related exemplars help complex event detection in web videos?” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2104–2111.
 [10] Z. Ma, Y. Yang, N. Sebe, and A. G. Hauptmann, “Knowledge adaptation with partially shared features for event detection using few exemplars,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1789–1802, 2014.
 [11] T. Shu, D. Xie, B. Rothrock, S. Todorovic, and S.-C. Zhu, “Joint inference of groups, events and human roles in aerial videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4576–4584.
 [12] W. Choi and S. Savarese, “Understanding collective activities of people from videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1242–1257, 2014.
 [13] T. E. Choe, H. Deng, F. Guo, M. W. Lee, and N. Haering, “Semantic video-to-video search using subgraph grouping and matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.
 [14] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan, “Zero-shot event detection using multimodal fusion of weakly supervised concepts,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2665–2672.

 [15] X. Chang, Y. Yang, A. G. Hauptmann, E. P. Xing, and Y.-L. Yu, “Semantic concept discovery for large-scale zero-shot event detection,” in Proceedings of the 24th International Conference on Artificial Intelligence, 2015, pp. 2234–2240.
 [16] M. Elhoseiny, J. Liu, H. Cheng, H. Sawhney, and A. Elgammal, “Zero-shot event detection by multimodal distributional semantic embedding of videos,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.
 [17] C. Gan, M. Lin, Y. Yang, Y. Zhuang, and A. G. Hauptmann, “Exploring semantic inter-class relationships (SIR) for zero-shot action recognition,” in Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015, pp. 3769–3775.
 [18] Y. Zhang, B. Gong, and M. Shah, “Fast zero-shot image tagging,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 5985–5994.
 [19] M. Jain, J. C. van Gemert, T. Mensink, and C. G. M. Snoek, “Objects2action: Classifying and localizing actions without any video example,” in Proceedings of the IEEE International Conference on Computer Vision, December 2015.
 [20] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 951–958.
 [21] D. F. Fouhey and C. L. Zitnick, “Predicting object dynamics in scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 2019–2026.

 [22] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “BING: Binarized normed gradients for objectness estimation at 300fps,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293.
 [23] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the 18th International Conference on Machine Learning, 2001.
 [24] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
 [25] G. D. Castanon, A. L. Caron, V. Saligrama, and P.-M. Jodoin, “Exploratory search of long surveillance videos,” in Proceedings of the 20th Annual ACM International Conference on Multimedia. ACM, 2012, pp. 309–318.
 [26] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis et al., “A large-scale benchmark dataset for event recognition in surveillance video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2011, pp. 3153–3160.
 [27] P. Dollár, “Piotr’s Computer Vision Matlab Toolbox (PMT),” https://github.com/pdollar/toolbox.
 [28] A. Andriyenko, K. Schindler, and S. Roth, “Discrete-continuous optimization for multi-target tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1926–1933.
 [29] J. Xiao, H. Cheng, H. Sawhney, and F. Han, “Vehicle detection and tracking in wide field-of-view aerial video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 679–684.
 [30] S. Wu, S. Das, Y. Tan, J. Eledath, and A. Z. Chaudhry, “Multiple target tracking by integrating track refinement and data association,” in Proceedings of the 15th International Conference on Information Fusion. IEEE, 2012, pp. 1254–1260.
 [31] A. Divakaran, Q. Yu, A. Tamrakar, H. S. Sawhney, J. Zhu, O. Javed, J. Liu, H. Cheng, and J. Eledath, “Real-time object detection, tracking and occlusion reasoning,” May 23, 2014, US Patent App. 14/286,305.
 [32] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
 [33] A. Das, A. Chakraborty, and A. K. Roy-Chowdhury, “Consistent re-identification in a camera network,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 330–345.
 [34] T. Wang, S. Gong, X. Zhu, and S. Wang, “Person re-identification by discriminative selection in video ranking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2501–2514, Dec 2016.
 [35] W.-S. Zheng, S. Gong, and T. Xiang, “Towards open-world person re-identification by one-shot group-based verification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 591–606, March 2016.