1 Introduction
Human activity recognition comprises a range of open challenges and is a very active research area [Aggarwal and Ryoo 2011, Vishwakarma and Agrawal 2013, Sukthankar et al. 2014], spanning topics from visual recognition of individual behavior [Poppe 2010], through pairwise interactions among individuals participating in different roles in a joint activity [Barbu et al. 2012, Kwak, Han, and Han 2013] and coordinated sequences of actions as expressions of planned activity [Geib and Goldman 2009], to multiple groups of individuals interacting across broad time scales. In this paper, we address the last of these, presenting a framework for automatically constructing an interpretation of high-level human activity structure as observed in surveillance video, across multiple, interleaved instances of activities. We assume that lower-level visual processing provides high-quality tracks of individuals moving through the scene. Our goal is to construct accurate descriptions of the events in the video at different levels of granularity, based on the tracks alone. We develop a probabilistic generative model that combines multiple features that to our knowledge have not been previously incorporated into a single framework for joint inference. To wit: (1) Activities have composite structure with roles representing semantically distinct aspects of the overall activity structure. (2) Activities are described hierarchically and recursively, entailing multiple levels of granularity both in time and membership. (3) Arbitrarily sized groups of actors participate in activities and fulfill roles. (4) Hierarchical descriptions and temporally changing groupings constitute the best joint explanation of the full set of individual trajectories, as found via posterior probabilistic inference.
The rest of the paper is organized as follows. In the next section, we review prior research, with a focus on work modeling group membership, hierarchically structured activities, and the identification of roles. In Section 3 we present our probabilistic generative model. In Section 4 we present an MCMC sampling framework for performing joint inference using the model. In Section 5 we evaluate the model on synthetic and real-world data from the VIRAT [Oh et al. 2011] and UCLA Aerial Event [Shu et al. 2015] video data sets, demonstrating the model’s expressive power and effectiveness. We conclude with a discussion of future work.
2 Related Work
A number of researchers have proposed models that distinguish the different roles that individuals play in a coordinated activity [Ryoo and Aggarwal 2011, Barbu et al. 2012, Lan, Sigal, and Mori 2012, Kwak, Han, and Han 2013]. These models capture the semantics of activities with component structure. It can be difficult to scale role identification in scenes with an arbitrary number of individuals, in particular while properly handling identification of nonparticipants [Kwak, Han, and Han 2013]. A consequence of our joint inference over role assignments and groups is that our model naturally distinguishes and separates participants of different activities playing different roles.
Considerable work has been devoted to developing more expressive models in which activities are decomposed into hierarchical levels of description, across different spatial and temporal granularities. Some prior models account for a specific number of hierarchical levels of description, with up to 3 levels being a popular choice [Choi and Savarese 2012, Lan et al. 2010, Chang, Krahnstoever, and Ge 2011, Cheng et al. 2014]. Other models permit a potentially greater number of levels of activity description, but the activity hierarchy is fixed prior to inference [Kwak, Han, and Han 2013, Garate et al. 2014, Zaidenberg, Boulay, and Bremond 2012]. In only a few cases, including our model, are the levels of activity description assessed during inference [Lin et al. 2010, Ryoo and Aggarwal 2011].
A third branch of work has been devoted to modeling activities not just among individuals, but involving groups of actors. In some models, activities include groups, but interactions are still considered between individuals within groups [Choi and Savarese 2012, Lan et al. 2010, Zhang et al. 2013, Odashima et al. 2012, Zhu et al. 2011]. Other models allow for activities to be performed between groups themselves [Chang, Krahnstoever, and Ge 2011, Chang et al. 2010]. Still others, including our model, take group activity modeling a step further, allowing for arbitrary numbers of participants in groups, provided they satisfy group membership criteria while performing the activity [Lin et al. 2010, Kwak, Han, and Han 2013, Shu et al. 2015, Zhang et al. 2012].
Two recent papers, one by Ryoo and Aggarwal [2011] and one by Shu et al. [2015], come closest to accommodating the combination of features in our model. Both have developed methods for simultaneous inference of group structure with roles in hierarchically structured activities. One key difference between the two is that the former, like our model, can flexibly accommodate multiple levels of activity description in the course of inference, while the latter is restricted to two fixed levels of description during inference (one for group activities, and the other for their member roles). A crucial difference with our approach is that we consider the possibility of detecting multiple occurrences of the same type of activity description in a scene, and these descriptions are integrated into a single probabilistic joint model in order to influence inference of the overall activity structure. While activities described by Ryoo and Aggarwal [2011] have hierarchical component structure with roles, their model does not accommodate more than one top-level activity (e.g., assault, meet, robbery) occurring during the course of the video, nor changes in group membership by individuals over such high-level activities.
3 Model
We present a probabilistic generative model describing how coordinated activities by groups of individuals give rise to the observed physical trajectories of the actors involved. In Section 3.1 we introduce some terminology. Then, in Section 3.2 we give precise definitions of the representations employed by our model for activities, groups, activity sequences, and spatial trajectories. Next, in Section 3.3, we define the factors in the joint probability distribution that comprise the generative model. Finally, in Section 3.4 we define the specific activities that are used to model the scenes that we use for evaluation in Section 5.
3.1 Terminology
Activities are optionally composed of other activities. For example, the canonical meet activity in our model consists of multiple participating groups, each of which must moveto the location of the meeting, unless they are already there, and then possibly wait (stand) until one or more other groups arrive. Engaging in a moveto activity requires the group to carry out a sequence of walk and run activities. The semantics of a walk or run is, in turn, characterized by the group’s physical trajectory in space.
A scene is described as an activity tree, which has a recursive structure, much as a syntactic tree describes the recursive phrase structure of a natural language sentence, but with an added temporal component. An example of such a tree is depicted in Figure 1(a). We refer to the more abstract, “nonterminal” activities that are defined in terms of other activities as intentional, and the “leaf” activities that are defined in terms of observable movement as physical.
Participants in an intentional activity are divided into subgroups, each of which plays a particular role with respect to the parent activity. Carrying out a particular role entails engaging in a sequence of subactivities, each of which may be physical or intentional. In intentional activities, the group may be decomposed further into smaller subgroups, each with their own role in the subactivity, and so on. The sequence of subactivities performed by participants in a role is constrained (either deterministically or stochastically) according to the dynamics associated with the role. In cases where a role in a parent activity may be realized as a sequence that includes the same activity type as the parent, the structure is recursive, and allows for arbitrary nesting of activities at different time scales. For example, in Figure 1, a group of individuals participating in a meet activity performs a submeet of their own.
Physical activities are associated with group trajectories, which are coupled either via shared membership or when an intentional activity requires their coordination. Each individual has an observed trajectory that is generated as a sequence of connected subtrajectories, one for each physical activity in which that individual participates, each constrained to be near the respective group trajectory. An example of an observed set of individual trajectories is shown in Figure 1(c).
An important feature of our model is that group trajectories are not explicitly represented during inference. The assertion that there is some group trajectory induces correlations among the individual trajectories of the members, but we average over all possible group trajectories (marginalizing them out) when computing the posterior probability of a description. This allows activity descriptions with different numbers of groups to be compared based on the posterior probabilities of their activity trees alone, without needing to deal with probability densities with different numbers of dimensions.
3.2 Representation
Activities
Formally, an activity is a tuple consisting of an activity label (e.g., walk) and the start and end times of the activity. The simplest activities are physical activities, e.g., walking, running, and standing, which directly constrain the motion of a group of individuals over an interval of time. For example, a run activity is expected to yield trajectories with speeds corresponding to typical human running. Intentional activities, by contrast, include meet and moveto. The physical activity labels and the intentional activity labels together make up the complete set of activity labels.
Groups
Each activity has a set of participants, whose number is the size of the group. This set is partitioned into subgroups, where the number of subgroups is bounded only by the number of individuals in the group. When specifying the probability model, it is convenient to work instead with indicator variables, one per participant, each indicating which of the subgroups that participant is affiliated with. Note that the indicator variables contain exactly the same information as the partition.
Activity Sequences
Each subgroup within an activity performs a sequence of subactivities. For example, a subgroup in a meet might perform a sequence of two subactivities: first, moveto a designated meeting location, and then stand in that location while meeting with other subgroups, who may have also approached that location. Alternatively, one of the subgroups could be involved in a side meeting, with further subgroups that approach and merge with each other before their union merges with another subgroup of the top-level meet. Figure 1 provides such an example.
The sequences of subactivities performed by subgroups are governed by roles. Each subgroup is assigned a role from a set defined by the parent activity. Roles govern the dynamics of the sequence of subactivities that each group carries out, via a set of parameters associated with that role. The parameters for a role specify which activities are allowed in the activity sequence of a group carrying out that role, as well as hard or soft constraints on the order in which the activities occur. The subgroup of the example meet activity above would be assigned the role of APPROACHER, which prescribes a moveto followed by a stand. In this case the constraint on the order of subactivities is deterministic, with the only degrees of freedom being the times at which transitions take place. These constraints can be represented by a Markov chain with a degenerate initial distribution and a transition matrix with only one nonzero off-diagonal entry per column. More general initial and transition distributions will give rise to softer constraints on activity sequences.
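The Markov-chain view of a role can be sketched in a few lines of Python. This is only an illustration under stated assumptions: time is discretized into frames, the names INITIAL, TRANSITION, and sample_activity_sequence are hypothetical, the probability values are made up, and the transition table is indexed as current-then-next, which may be the transpose of the convention used in the text.

```python
import random

# Hypothetical parameters for an APPROACHER-like role; the numbers are assumptions.
INITIAL = {"moveto": 1.0, "stand": 0.0}   # degenerate initial distribution
TRANSITION = {                            # TRANSITION[current][next]
    "moveto": {"moveto": 0.9, "stand": 0.1},
    "stand": {"moveto": 0.0, "stand": 1.0},
}


def draw(dist, rng):
    """Draw one key from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample_activity_sequence(start, end, rng):
    """Return (label, start, end) tuples that tile the frames start..end."""
    label = draw(INITIAL, rng)
    segments, segment_start = [], start
    for frame in range(start + 1, end + 1):
        next_label = draw(TRANSITION[label], rng)
        if next_label != label:           # a jump closes the current segment
            segments.append((label, segment_start, frame - 1))
            label, segment_start = next_label, frame
    segments.append((label, segment_start, end))
    return segments


segments = sample_activity_sequence(0, 49, random.Random(0))
print(segments)
```

With these parameters the order of labels is fixed (a moveto, optionally followed by a stand) and only the transition time varies; replacing the zeros and ones with intermediate probabilities would give the softer constraints mentioned above.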
Trajectories
Ultimately, each physical activity, realized over a time interval, is associated with a group trajectory, which is an array with two rows and one column per time index, specifying a 2-dimensional position on the ground plane for each time index in the interval, endpoints included. The group trajectory represents the central tendency of the members’ individual trajectory segments during that interval. Since an individual’s path depends on the sequence of activities it participates in, each individual trajectory consists of a sequence of segments, consecutive pairs of which must be connected at transition points, where a transition point is both the end of one segment and the start of the next.
3.3 Generative Model
We now describe the generative process for activities. The high-level process has three steps: (1) recursive expansion of intentional activities, (2) generation of group trajectories for the set of physical activities, and (3) generation of individual trajectories conditioned on the group assignments and group trajectories.
Overview
In the first step of activity generation, each intentional activity gives rise to one or more child activity sequences: one for each subgroup of participants involved in the parent activity. Each child sequence is assigned a role, based on the parent activity type. Subgroups and role assignments occur jointly. The choice of role governs the sequence of activities that the subgroup engages in, by specifying a Markov transition function. Each segment of an activity sequence may be a physical activity or another intentional activity. Each intentional activity in the sequence is recursively expanded until only physical activities are generated.
As a working example for this stage of the process, we consider the meet activity at node 1, at the root of the tree in Figure 1(a). Node 1 has two child sequences corresponding to the two subgroups involved in the meeting, both carrying out the APPROACHER role. One of those child sequences consists of just a single moveto activity, while the other consists of two activities: a meet, followed by a moveto. In general, a special top-level root activity, a “free-for-all” (FFA), comprises all actors and has a duration over the entire video. All other activities are children of the root FFA. (To simplify the example tree in Figure 1(a), we removed the parent FFA.) The details of this tree expansion are given in “Generating the Activity Tree”, below.
We make a conditional independence assumption by supposing that the contents of a parent activity fully specify the distribution of possible child sequences, and that child sequences are conditionally independent of each other given their parent. Since child sequences can contain other intentional activities, activity generation is a recursive process, which bottoms out when no intentional activities are generated.
In the second step of the generative process, group trajectories are generated for each physical activity. This process must satisfy two constraints: (a) physical activity trajectories that share members and that border in time must be spatially connected; and (b) groups that need to physically interact as coparticipants in an intentional activity, such as a meet, must have trajectories that intersect at the appropriate points in time. Due to these constraints it is not feasible to generate group trajectories conditionally independently given the activity tree. Instead they are generated jointly according to a global Gaussian Process with a covariance kernel that depends on the activity tree in such a way as to enforce the key constraints. The details of this process are given in “Generating Group Trajectories”, below.
For the example tree in Figure 1(a), four group trajectory segments are needed, one for each physical activity leaf node. Since walk activity 9 is part of a meeting in which its participants must meet with the participants in walk activity 7, the group trajectories for 7 and 9 must end in the same location. Similarly, walk 4 must end where stand 5 is located.
In the final step of the generative process, the individual trajectories are realized, conditioned on the set of group trajectories. Here, conditional independence is possible, with each individual’s trajectory depending only on the sequence of group trajectories for physical activities in which that individual is a participant. This process is detailed in Section “Generating Individual Trajectories”.
Generating the Activity Tree
Let be a parent intentional activity, where indexes the set of activities. Its participant set (by relabeling, we assume that ) is divided into subgroups, where the th participant of is assigned to group , and the distinct realized groups are numbered through . We let , which defines a partition of into subgroups, , with . Subgroup is assigned role , and we define .
The th subgroup, which has participants and role , produces an activity sequence according to the stochastic process associated with , which has parameters , a Markov transition function, and , an initial activity distribution. Denote the resulting sequence by , where is the number of jumps generated by the process, and the are activity tuples, , where . Figure 2 illustrates the graphical model of this production.
To summarize, the grouping, role assignment, and child activity sequences within an activity are generated according to a joint distribution, which factors as
(1) 
In (1), we assume that roles are assigned independently across subgroups, so that the role factor is a product of per-subgroup terms. Additionally, we model the partition of the participants into subgroups using a Chinese Restaurant Process (CRP), whose concentration parameter depends on the activity label; that is, the partition factor is the CRP mass function with that parameter. Each segment of a child sequence that is an intentional activity is expanded and its members are subdivided recursively according to (1), with that segment in place of the parent activity and its participants in place of the parent’s participants. Once all expansions contain exclusively physical activities, the recursion has bottomed out. The resulting tree consists of all intentional activities, all physical activities, and, for each intentional activity, its membership partition, role assignments, and subgroup activity sequences. The prior distribution of this complete tree is
(2) 
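The Chinese Restaurant Process used for the subgroup partition can be illustrated with a short Python sketch. It is a generic CRP, not the authors' implementation: the function names and the concentration value are hypothetical, and subgroups are numbered from 0 in order of first appearance rather than from 1.

```python
import math
import random


def sample_crp_partition(n, alpha, rng):
    """Seat n participants one at a time; returns a subgroup index per participant."""
    assignments, sizes = [], []
    for _ in range(n):
        # Join existing subgroup k with weight sizes[k], or open a new one with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments


def crp_log_prob(assignments, alpha):
    """Log probability mass of a partition given in order-of-appearance labels."""
    sizes, log_p = [], 0.0
    for i, k in enumerate(assignments):
        if k == len(sizes):               # participant i opens a new subgroup
            log_p += math.log(alpha)
            sizes.append(1)
        else:                             # participant i joins subgroup k
            log_p += math.log(sizes[k])
            sizes[k] += 1
        log_p -= math.log(i + alpha)
    return log_p


z = sample_crp_partition(5, alpha=1.0, rng=random.Random(0))
print(z, crp_log_prob(z, alpha=1.0))
```

A larger concentration parameter makes new subgroups more likely, which is one way an activity label could favor many small subgroups over a few large ones.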
Generating Group Trajectories
The leaves of the activity tree are all physical activities, each of which is associated with a group trajectory. In general, the endpoints of different group trajectories are not independent given the activity tree, since they may be constrained to start or end at the same location. Consequently, we define a joint distribution on all of the group trajectory endpoints, and, conditioned on their endpoints, we treat their interiors as independent.
We model the interiors as realizations of Gaussian processes [Rasmussen and Williams 2006] with the squared-exponential kernel function. This results in trajectories that are generally smooth, but flexible enough to allow for different kinds of motion. We use different scale parameters depending on the activity, which determine the rate of change of the trajectory.
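To make the idea of an interior conditioned on its endpoints concrete, the sketch below applies standard Gaussian process conditioning to one coordinate of a trajectory. It assumes a zero-mean process with unit variance and noise-free endpoints; the function names and the scale value are hypothetical and not taken from the text.

```python
import math


def sq_exp(t1, t2, scale):
    """Squared-exponential covariance between two time indices (unit variance)."""
    return math.exp(-0.5 * ((t1 - t2) / scale) ** 2)


def interior_given_endpoints(t, t0, x0, t1, x1, scale):
    """Mean and variance at time t of a zero-mean GP conditioned on two endpoints."""
    k01 = sq_exp(t0, t1, scale)
    det = 1.0 - k01 * k01                 # determinant of the 2x2 endpoint covariance
    ka, kb = sq_exp(t, t0, scale), sq_exp(t, t1, scale)
    # Weights solve K w = k_star, with K = [[1, k01], [k01, 1]].
    w0 = (ka - k01 * kb) / det
    w1 = (kb - k01 * ka) / det
    mean = w0 * x0 + w1 * x1
    var = 1.0 - (w0 * ka + w1 * kb)
    return mean, var


mid_mean, mid_var = interior_given_endpoints(10, 0, 2.0, 20, 2.0, scale=10.0)
end_mean, end_var = interior_given_endpoints(0, 0, 2.0, 20, 2.0, scale=10.0)
print(mid_mean, mid_var, end_mean, end_var)
```

At an endpoint the conditional variance collapses to zero, while in the middle of the interval it is largest; the scale controls how quickly the interior is allowed to move away from the endpoints.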
We specify dependencies among the set of trajectory endpoints by first defining an undirected weighted graph over the endpoints. We use this graph to construct a constraint matrix over transition points by interpreting the sum of the weights on the shortest path between two nodes as distances. We then apply a positive semidefinite isotropic covariance kernel pointwise to the distance matrix to transform the distances into covariances.
Each physical activity has a group trajectory, which is a sequence of ground-plane positions. Abusing notation slightly, we refer to the endpoints of a group trajectory and to the corresponding nodes in the graph interchangeably. We introduce three kinds of edges on the graph: temporal, transitional, and compositional. Two nodes are connected by a temporal edge when they belong to the same physical activity. The start of one activity is connected to the end of another activity by a transitional edge when they correspond to the same moment in time and the corresponding activities share at least one participant. Finally, two endpoints are connected by a compositional edge if they correspond to the same moment in time and have a common ancestor that specifies they must coincide, e.g., in a meet activity, all participants must end in the same location. All transitional and compositional edges have weight (or “distance”) zero, corresponding to the constraint that the connected nodes must correspond to the same trajectory position. The weight assigned to temporal edges is a function of the time elapsed during the intervening physical activity and the scale associated with the activity label (e.g., “slower-moving” activities have lower weights, corresponding to a stronger dependence between the positions of their endpoints). Figure 3 shows an example graph.
Having defined the graph, we can compute a distance matrix for the set of physical activity endpoints, where each entry is the sum of the weights along the shortest path in the graph between the corresponding pair of nodes. If no path exists between two nodes, the distance is set to infinity. We then transform the distance matrix into a covariance matrix by applying the covariance function entrywise. The locations of the group trajectory endpoints are then jointly normally distributed with this covariance matrix. Conditioned on the endpoints, the interiors are mutually independent, so that
(3) 
where each interior is the vector of points of a group trajectory between its endpoints. Each interior is generated according to the Gaussian process for the corresponding activity; i.e., conditioned on its endpoints, it is normally distributed. Finally, the distribution over the physical trajectories factorizes as
(4)
Using the fact that the factors in (4) are normally distributed, we can easily see that the joint distribution over the physical trajectories is also normal.
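The construction of the endpoint covariance from the constraint graph can be sketched as follows. This is an illustration only: the node numbering and edge weights are invented, the all-pairs shortest paths are computed with Floyd-Warshall, and the squared-exponential form of the isotropic kernel is an assumption, since the text does not say which kernel is applied to the distances.

```python
import math


def shortest_path_distances(n, edges):
    """All-pairs shortest-path distances (Floyd-Warshall) on an undirected graph."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i, j, w in edges:
        d[i][j] = d[j][i] = min(d[i][j], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def covariance_from_distances(d, variance=1.0):
    """Apply an isotropic kernel entrywise (squared exponential, an assumption)."""
    return [[variance * math.exp(-0.5 * x * x) for x in row] for row in d]


# Hypothetical graph: nodes 0-1 and 2-3 are the start and end of two walks that
# must end at the same place; nodes 4-5 belong to an unrelated stand.
edges = [
    (0, 1, 1.0),   # temporal edge of the first walk
    (2, 3, 2.0),   # temporal edge of the second walk
    (1, 3, 0.0),   # compositional edge: both walks end at the meeting point
    (4, 5, 0.5),   # temporal edge of the stand
]
dist = shortest_path_distances(6, edges)
cov = covariance_from_distances(dist)
print(dist[0][2], cov[1][3], cov[0][4])
```

Zero-weight edges yield the maximum covariance (the two endpoints share a position), and disconnected endpoints get infinite distance and therefore zero covariance.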
Generating Individual Trajectories
As described above, each individual participates in a sequence of physical activities, each of which has an associated group trajectory. The individual trajectory consists of one segment per physical activity in the sequence, where each segment spans the same temporal interval as the corresponding group trajectory. Given these group trajectories, the individual trajectory segments are mutually independent, so that
(5) 
where each factor depends on the label of the corresponding activity in the sequence.
We also use a Gaussian process for individual trajectories, but fix the mean to the (given) group trajectory. Each individual segment is therefore normally distributed around the corresponding group trajectory segment, with a covariance matrix whose entries are the covariance function evaluated at pairs of frames of the segment, using the scale corresponding to the activity label associated with the segment. See Figure 4 for an example graphical model of this distribution.
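Drawing an individual segment around a given group segment can be sketched with a small pure-Python sampler. The assumptions are loud ones: the x and y coordinates are drawn independently, a small jitter term is added to the diagonal for numerical stability, and all names and parameter values are hypothetical.

```python
import math
import random


def sq_exp(i, j, scale, variance):
    """Squared-exponential covariance between frames i and j."""
    return variance * math.exp(-0.5 * ((i - j) / scale) ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def sample_individual_segment(group_xy, scale, variance, rng, jitter=1e-6):
    """Draw one individual segment whose mean is the group segment (list of (x, y))."""
    n = len(group_xy)
    cov = [[sq_exp(i, j, scale, variance) + (jitter if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    offsets = []
    for _ in range(2):                    # x and y drawn independently (an assumption)
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        offsets.append([sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)])
    return [(gx + offsets[0][i], gy + offsets[1][i])
            for i, (gx, gy) in enumerate(group_xy)]


group = [(float(t), 0.0) for t in range(10)]   # a group walking along the x axis
person = sample_individual_segment(group, scale=3.0, variance=0.01, rng=random.Random(0))
print(person[:3])
```

The variance controls how far individuals stray from the group path, and the scale controls how smoothly their offset from the group changes over the segment.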
3.4 Specific Activities
In this work, we limit ourselves to six specific activities, three intentional (free-for-all, meet, and moveto) and three physical (stand, walk, and run). The free-for-all (FFA) has a single role, which allows all activities to take place. A meet activity assigns nonzero probability to two roles, APPROACHER and WAITER; an APPROACHER performs meets (recursively) and movetos, and a WAITER only performs stands. Moveto only produces one role, MOVER, which switches uniformly at random between the three physical activities. Finally, each physical activity has its own scale parameter.
4 Inference
Given a set of individual trajectories, such as those depicted in Figure 1(c), we wish to find an activity tree, such as that depicted in Figure 1(a), that best describes them. Specifically, we wish to maximize the posterior probability of an activity tree given the observed data
(6) 
where the prior is given by (2), and the likelihood is
(7) 
The integrands are given by (4) and (5). An integral of this form cannot, in general, be computed analytically. However, since every factor in (7) is a normal pdf, the likelihood is also normal, which makes the evaluation of (6) straightforward.
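The structure of (6) and (7), and the reason the integral stays tractable, can be written out as below. The notation is assumed rather than taken from the source: \Lambda stands for the activity tree, \mathbf{Y} for the observed individual trajectories, \mathbf{X} for the group trajectories, and \mathbf{A} for a hypothetical fixed matrix that copies each group trajectory position to the individuals following it.

```latex
% Posterior over trees and marginal likelihood (assumed notation).
p(\Lambda \mid \mathbf{Y}) \;\propto\; p(\Lambda)\, p(\mathbf{Y} \mid \Lambda),
\qquad
p(\mathbf{Y} \mid \Lambda)
  = \int p(\mathbf{Y} \mid \mathbf{X}, \Lambda)\, p(\mathbf{X} \mid \Lambda)\, d\mathbf{X}.

% Standard linear-Gaussian marginalization: a Gaussian prior on X combined
% with a Gaussian conditional on Y gives a Gaussian marginal on Y.
\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}_X), \quad
\mathbf{Y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{A}\mathbf{X}, \mathbf{K}_Y)
\;\Longrightarrow\;
\mathbf{Y} \sim \mathcal{N}\!\left(\mathbf{A}\boldsymbol{\mu},\;
  \mathbf{A}\mathbf{K}_X\mathbf{A}^{\top} + \mathbf{K}_Y\right).
```

Under these assumptions, evaluating the likelihood of a tree amounts to evaluating one multivariate normal density whose covariance is determined by the tree.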
We cannot find the maximizing activity tree analytically. Instead, we draw samples from the posterior (6) using the Metropolis-Hastings (MH) algorithm, and keep the sample with the highest posterior probability. At each iteration, we draw a candidate tree from a proposal distribution conditioned on the current sample, and accept the candidate with the standard MH acceptance probability [Neal 1993].
The ability of MH to efficiently explore the space depends largely on the choice of the proposal distribution. Although there has been work on MCMC sampling on tree models [Chipman, George, and McCulloch 2002, Pratola 2013], there is no general approach that can be applied to any model. Consequently, we employ a proposal distribution that is specific to our model.
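The sampling loop itself is generic, and a toy version may help fix ideas. In the sketch below an integer state stands in for the activity tree, log_target stands in for the unnormalized log posterior, and a symmetric random walk stands in for the tree-editing proposal; all of these are illustrative substitutes, not the model's actual components.

```python
import math
import random


def log_target(x):
    """Unnormalized log posterior of a toy state; stands in for the tree posterior."""
    return -0.5 * (x - 7) ** 2


def propose(x, rng):
    """Symmetric random-walk proposal; stands in for the tree-editing moves."""
    return x + rng.choice((-1, 1))


def metropolis_hastings(x0, n_iter, rng):
    """Run MH from x0 and return the visited state with the highest log target."""
    current, best = x0, x0
    for _ in range(n_iter):
        candidate = propose(current, rng)
        # For a symmetric proposal the acceptance ratio reduces to the target ratio.
        log_ratio = log_target(candidate) - log_target(current)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = candidate
            if log_target(current) > log_target(best):
                best = current
    return best


best_x = metropolis_hastings(0, 2000, random.Random(0))
print(best_x)
```

With an asymmetric proposal, such as moves that are biased toward detector output, the acceptance ratio would also need the ratio of reverse to forward proposal probabilities.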
4.1 Proposal distribution
Our proposal mechanism is composed of sampling moves which perform edits to the current hypothesized activity tree to produce a new tree sample. When drawing a sample from the proposal distribution, we choose a move uniformly at random to apply. When applying a move, we must make sure that the resulting tree is valid (e.g., start and end times must be consistent, and activity sequences must be possible given the role), which requires bookkeeping that is beyond the scope of this document.
We have also developed a set of bottom-up activity detectors to help explore the space efficiently. These detectors provide rough estimates of groupings of individuals at each frame, and activities being performed by each group (see Section 4.1, “Detectors”). We use these detectors in two ways. First, we initialize the sampler to a state obtained by transforming the output of the detectors to an activity tree. Additionally, we bias our proposal distribution toward groups and activities found by the detectors. For example, when proposing a merge move, we might choose participants which are predicted to be in a group by some activity detector.
Sampling moves
During inference, we employ the following moves (see Figure 5 for an illustration).
(a) Birth/death: A birth move inserts an intentional activity node between an intentional activity and some of its children. We randomly choose a set of sibling activity sequences whose parent is an intentional activity, and insert a new intentional activity node (whose label is also chosen at random), such that the new node becomes the parent of the chosen sequences and a child of their original parent. In a death move, we randomly choose an intentional node, remove it from the tree, and connect its child nodes to its parent.
(b) Merge/split: In a merge move, we take two sibling activity nodes with the same label and combine them into a single activity: the new node has the same label, and its participants are the union of the two original groups. The split move performs the opposite operation, taking a node and splitting it into two nodes with the same activity label, assigning each participant of the original node to one of the two new nodes uniformly at random.
(c) Sequence/unsequence: A sequence move concatenates two temporally nonoverlapping sibling activity sequences into a single new sequence. An unsequence move randomly selects a split point at which to separate a sequence.
(d) Relabel: The relabel move randomly changes the label of a node. Note that the new label must be valid, e.g., we cannot assign a physical activity label to an intentional activity node.
Detectors
The bottom-up detectors provide an estimate of how individuals in the video are grouped across time, as well as the physical activity they are performing.
At each frame, we cluster individuals into groups using their trajectories on the ground plane. We apply the density-based spatial clustering of applications with noise (DBSCAN) [Ester et al. 1996] algorithm independently on each frame, where our feature is composed of the position and velocity of an individual at that frame, both of which are obtained from the smoothed trajectories (which are assumed to be given). Importantly, we keep track of individual identities over time by recording the actors involved in each group in the previous frame and assigning the cluster found in the following frame where the majority of individuals in that new set are still involved. Given the groups as computed above, we want to identify the physical activities of their individuals (e.g., walk, run). For this we use a hidden Markov model, where the observation function is a naive Bayes model with each individual’s speed modeled by a Gamma distribution, and a transition function that prefers staying within the current activity.
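The per-frame grouping step can be illustrated with a minimal pure-Python DBSCAN. This is a teaching sketch rather than the detector used in the experiments: the feature layout (x, y, vx, vy), the unweighted Euclidean distance over position and velocity, and the values of eps and min_pts are all assumptions.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one cluster id per point, with -1 marking noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts: # j is a core point, so keep expanding
                queue.extend(reachable)
    return labels


# One frame of hypothetical features (x, y, vx, vy), one row per individual.
frame = [
    (0.0, 0.0, 1.0, 0.0), (0.5, 0.0, 1.0, 0.0), (0.0, 0.5, 1.0, 0.1),
    (10.0, 10.0, 0.0, 0.0), (10.5, 10.0, 0.0, 0.0),
    (20.0, 0.0, -1.0, 0.0),
]
groups = dbscan(frame, eps=1.5, min_pts=2)
print(groups)
```

Because velocity is part of the feature, two people who pass close to each other while moving in opposite directions are less likely to be grouped than two people walking side by side.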
5 Experiments and Results
We evaluate the model in two ways. First, we demonstrate the model’s expressive power in inferring two different types of complexly structured scenarios from synthetic data. In the first, groups of individuals engage in activities and disband, forming different groups over time. The second demonstrates a recursively structured activity, in which one meeting is a component of a higher-level meeting. We then evaluate the model on real data; specifically on two publicly available group activity datasets, VIRAT [Oh et al. 2011] and the UCLA aerial event dataset [Shu et al. 2015].

5.1 Evaluation
Performance is measured in terms of how well activities are labeled in the scene, and how well individuals are grouped, irrespective of activity label. In the following, we compare the ground truth activity tree with the inferred activity tree.
Activity Labeling
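For each activity and video frame, we compare the set of individuals performing that activity at that frame in the ground truth tree with the corresponding set in the inferred tree. We first define performance counts in terms of an individual at a frame, then compute overall counts and associated performance measures.
For an individual at a given frame, consider the set of individuals which have the same label as that individual in the ground truth tree, and define the corresponding set similarly for the inferred tree. The false positives for the individual are the members of the inferred set that are not in the ground truth set, the false negatives are the members of the ground truth set that are not in the inferred set, the true positives are the members of both sets, and the true negatives are the individuals in neither set, taken from the set of all individuals.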
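The per-individual counts translate directly into set operations. The sketch below handles a single frame and assumes each individual carries exactly one label per tree at the level being compared, and that an individual counts as sharing a label with itself; the function names and the example labels are hypothetical.

```python
def confusion_sets(j, truth, inferred):
    """Per-individual sets at one frame; inputs map individual id -> activity label."""
    everyone = set(truth)
    same_truth = {i for i in everyone if truth[i] == truth[j]}
    same_inferred = {i for i in everyone if inferred[i] == inferred[j]}
    return {
        "tp": same_truth & same_inferred,
        "fp": same_inferred - same_truth,
        "fn": same_truth - same_inferred,
        "tn": everyone - (same_truth | same_inferred),
    }


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from counts, returning 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = {1: "walk", 2: "walk", 3: "stand", 4: "stand"}
inferred = {1: "walk", 2: "stand", 3: "stand", 4: "stand"}

# Overall counts: sum the per-individual set sizes over all individuals.
totals = {"tp": 0, "fp": 0, "fn": 0}
for person in truth:
    sets = confusion_sets(person, truth, inferred)
    for key in totals:
        totals[key] += len(sets[key])

scores = precision_recall_f1(totals["tp"], totals["fp"], totals["fn"])
print(totals, scores)
```

Summing the same counts over all frames of a video, rather than a single frame, would give video-level precision and recall.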
Grouping
We follow a similar approach when evaluating grouping performance. For each person, we compare the groups to which that person belongs in the ground truth and inferred trees at each frame. Since the two trees could have different depths and topologies, it is not necessarily clear which groups should be compared with which; however, every individual is part of exactly one physical activity group at each frame, as well as one highest-level node. Consequently, we only compare groups at these two levels of the tree, without penalty for difference in activities within a level. Thus, once we determine the group associated with a person at a given frame at the physical (resp. highest) activity level of each tree, we compute the score as before.
5.2 Results
The performance of our algorithm is summarized in Tables 1 and 2. Table 1 shows the activity labeling precision and recall on the two synthetic scenes (SYNTH1 and SYNTH2), a video sequence obtained from the VIRAT dataset, and four different video sequences from the UCLA aerial event dataset (UAED). In Table 2 we see the performance as measured by our grouping evaluation metric described above.
Table 1: Activity labeling precision, recall, and F1 per activity.

SYNTH1     STAND  WALK  MOVETO  MEET  FFA
Precision  0.59   0.84  0.63    0.71  1.00
Recall     1.00   0.94  0.71    0.75  1.00
F1         0.74   0.88  0.67    0.73  1.00

SYNTH2     STAND  WALK  MOVETO  MEET  FFA
Precision  0.87   1.00  0.94    1.00
Recall     0.50   0.71  0.73    0.56
F1         0.63   0.83  0.82    0.72

VIRAT      STAND  WALK  MOVETO  MEET  FFA
Precision  0.85   0.81  0.73    0.90  1.0
Recall     0.51   0.88  0.89    0.80  1.0
F1         0.64   0.85  0.81    0.85  1.0

UAED       STAND  WALK  MOVETO  MEET  FFA
Precision  0.96   0.89  0.71    0.75  0.97
Recall     0.82   0.99  0.67    0.64  0.73
F1         0.86   0.94  0.62    0.62  0.77
Synthetic data
The synthetic dataset comprises two videos, where a video is a set of trajectories on the ground plane. In the first, SYNTH1, five actors participate in a series of meetings, where participants repeatedly change group memberships across 20 frames. The second (SYNTH2) features five actors meeting, with four of them participating in a side meeting before joining the global meeting. As Table 1 shows, our model performs very well on high-level activities, such as meet, even when presented with nested structure.
Virat
We also evaluate on real data, specifically frames 2520 to 3960 of video 2 of the VIRAT dataset. This video features seven people participating in two meetings, where groups exchange members several times. Fig. 6 shows two frames of the video, along with the inferred activity tree. Our model correctly recognizes the two meetings, as well as all of the groups at the highest level of description. The activity labeling results (Table 1) show perfect FFA performance, and the grouping results (Table 2) show a perfect highestlevel intentional activity score. As before, there is divergence from ground truth at the physical activity level, but this does not affect the grouping score.
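Table 2. Grouping precision, recall, and F1 at the physical (PHYS) and highest-level intentional (INT) activity levels.

           SYNTH1       SYNTH2       VIRAT        UAED
           PHYS  INT    PHYS  INT    PHYS  INT    PHYS  INT
Precision  0.86  1.0    1.00  1.0    0.98  1.0    0.95  0.93
Recall     1.00  1.0    0.69  1.0    0.85  1.0    0.99  0.89
F1         0.92  1.0    0.82  1.0    0.91  1.0    0.97  0.89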
UCLA aerial event dataset
We extracted four video sequences from the UCLA aerial event dataset (UAED). More specifically, we searched for subsequences of videos which featured properties like activity nesting, groups interchanging members, etc. The result is four video sequences of roughly frames each. As we can see in Table 1, which shows the overall precision and recall scores, for all four videos, our algorithm performs reasonably well across all activities. Note the relatively low recall score of the meet activity, which is due in large part to one very long missed meet in the third video sequence. Similarly, Table 2 shows that our algorithm performs well at finding groups of individuals at both the physical and intentional activity levels.
6 Discussion
We have presented a probabilistic generative model of complex multi-agent activities over arbitrary time scales. The activities specify component roles between groups of actors and accommodate unboundedly deep recursive, hierarchical structure. The model accommodates arbitrary groups participating in activity roles, describing both between-group and between-individual interactions. Physical and intentional (higher-level description) activities explain hierarchical correlations among individual trajectories. To our knowledge, no existing model of track-based activity recognition provides this expressiveness in a joint model.
The modeling framework is naturally extensible. We are currently undertaking several extensions: (1) developing additional activities, including following, exchanging items, and interacting with vehicles and building entrances; (2) adding prior knowledge about the spatial layout of the scene, such as roads, sidewalks, impassable buildings, and other spatial features that influence behavior, which naturally constrains what activities are possible and should improve both accuracy and speed by reducing the search space; (3) using our model as a prior for a 3D Bayesian tracker [Brau et al.2013]; and (4) connecting natural language to activity descriptions. Because our model accommodates activity descriptions across multiple events and tracks individual participation throughout, it provides opportunities for building natural language narratives about activities at different levels of granularity.
Acknowledgements
This research was supported by grants under the DARPA Mind’s Eye program W911NF10C0081 (subcontract to iRobot, 92003) and the DARPA SSIM program W911NF1020064. We give special thanks to Paul R. Cohen and Christopher Geyer for helpful discussions and advice.
References
 [Aggarwal and Ryoo2011] Aggarwal, J. K., and Ryoo, M. S. 2011. Human activity analysis: A review. ACM Computing Surveys (CSUR) 43(3).
 [Barbu et al.2012] Barbu, A.; Bridge, A.; Burchill, Z.; Coroian, D.; Dickinson, S.; Fidler, S.; Michaux, A.; Mussman, S.; Siddharth, N.; Salvi, D.; Schmidt, L.; Shangguan, J.; Siskind, J. M.; Waggoner, J.; Wang, S.; Wei, J.; Yin, Y.; and Zhang, Z. 2012. Video in sentences out. In UAI 2012.
 [Brau et al.2013] Brau, E.; Guan, J.; Simek, K.; del Pero, L.; Dawson, C. R.; and Barnard, K. 2013. Bayesian 3D tracking from monocular video. In ICCV 2013.
 [Chang et al.2010] Chang, M.-C.; Krahnstoever, N.; Lim, S.-N.; and Yu, T. 2010. Group level activity recognition in crowded environments across multiple cameras. In AVSS, 56–63.
 [Chang, Krahnstoever, and Ge2011] Chang, M.-C.; Krahnstoever, N.; and Ge, W. 2011. Probabilistic group-level motion analysis and scenario recognition. In ICCV 2011, 747–754. IEEE.
 [Cheng et al.2014] Cheng, Z.; Qin, L.; Huang, Q.; Yan, S.; and Tian, Q. 2014. Recognizing human group action by layered model with multiple cues. Neurocomputing 136:124–135.
 [Chipman, George, and McCulloch2002] Chipman, H.; George, E.; and McCulloch, R. 2002. Bayesian treed models. Machine Learning 48(1–3):299–320.
 [Choi and Savarese2012] Choi, W., and Savarese, S. 2012. A unified framework for multi-target tracking and collective activity recognition. In ECCV 2012, 215–230.
 [Ester et al.1996] Ester, M.; Kriegel, H.-P.; Sander, J.; and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 226–231. AAAI Press.
 [Garate et al.2014] Garate, C.; Zaidenberg, S.; Badie, J.; and Bremond, F. 2014. Group tracking and behavior recognition in long video surveillance sequences. In VISIGRAPP 2014.
 [Geib and Goldman2009] Geib, C. W., and Goldman, R. P. 2009. A probabilistic plan recognition algorithm based on plan tree grammars. Artificial Intelligence 173:1101–1132.
 [Kwak, Han, and Han2013] Kwak, S.; Han, B.; and Han, J. H. 2013. Multi-agent event detection: Localization and role assignment. In CVPR.
 [Lan et al.2010] Lan, T.; Wang, Y.; Yang, W.; and Mori, G. 2010. Beyond actions: Discriminative models for contextual group activities. In Advances in neural information processing systems, 1216–1224.
 [Lan, Sigal, and Mori2012] Lan, T.; Sigal, L.; and Mori, G. 2012. Social roles in hierarchical models for human activity recognition. In CVPR.
 [Lin et al.2010] Lin, W.; Sun, M.-T.; Poovendran, R.; and Zhang, Z. 2010. Group event detection with a varying number of group members for video surveillance. Circuits and Systems for Video Technology, IEEE Transactions on 20(8):1057–1067.
 [Neal1993] Neal, R. M. 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical report.
 [Odashima et al.2012] Odashima, S.; Shimosaka, M.; Kaneko, T.; Fukui, R.; and Sato, T. 2012. Collective activity localization with contextual spatial pyramid. In Computer Vision–ECCV 2012. Workshops and Demonstrations, 243–252. Springer.
 [Oh et al.2011] Oh, S.; Hoogs, A.; Perera, A.; Cuntoor, N.; Chen, C.-C.; Lee, J. T.; Mukherjee, S.; Aggarwal, J.; Lee, H.; Davis, L.; et al. 2011. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR 2011, 3153–3160. IEEE.
 [Poppe2010] Poppe, R. 2010. A survey on vision-based human action recognition. Journal of Image and Vision Computing 28(6):976–990.
 [Pratola2013] Pratola, M. T. 2013. Efficient Metropolis-Hastings proposal mechanisms for Bayesian regression tree models.
 [Rasmussen and Williams2006] Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. The MIT Press.
 [Ryoo and Aggarwal2011] Ryoo, M. S., and Aggarwal, J. K. 2011. Stochastic representation and recognition of high-level group activities. International Journal of Computer Vision 93(2):183–200.
 [Shu et al.2015] Shu, T.; Xie, D.; Rothrock, B.; Todorovic, S.; and Zhu, S.-C. 2015. Joint inference of groups, events and human roles in aerial videos. In CVPR.
 [Sukthankar et al.2014] Sukthankar, G.; Goldman, R. P.; Geib, C. W.; Pynadath, D. V.; and Bui, H. H., eds. 2014. Plan, Activity, and Intent Recognition: Theory and Practice. Morgan Kaufmann.
 [Vishwakarma and Agrawal2013] Vishwakarma, S., and Agrawal, A. 2013. A survey on activity recognition and behavior understanding in video surveillance. The Visual Computer 29(10):983–1009.
 [Zaidenberg, Boulay, and Bremond2012] Zaidenberg, S.; Boulay, B.; and Bremond, F. 2012. A generic framework for video understanding applied to group behavior recognition. In AVSS, 136–142. IEEE.
 [Zhang et al.2012] Zhang, C.; Yang, X.; Zhu, J.; and Lin, W. 2012. Parsing collective behaviors by hierarchical model with varying structure. In ACMMM 2012, 1085–1088. ACM.
 [Zhang et al.2013] Zhang, Y.; Qin, L.; Yao, H.; Xu, P.; and Huang, Q. 2013. Beyond particle flow: Bag of trajectory graphs for dense crowd event recognition. In ICIP, 3572–3576.
 [Zhu et al.2011] Zhu, G.; Yan, S.; Han, T. X.; and Xu, C. 2011. Generative group activity analysis with quaternion descriptor. In Advances in Multimedia Modeling. Springer. 1–11.