The need for high-level analytics of large streams of video data has arisen in recent years in many practical commercial, law enforcement, and military applications [2, 3, 4, 5]. Examples include human activity recognition, video summarization and indexing, human-machine teaming, and human behavior tracking and monitoring. Human activity recognition requires recognizing both the objects in the scene and the body movements in order to correctly identify the activity with the help of context. In the case of video summarization, one should be able to semantically segment and summarize the visual content in terms of context. This enables efficient indexing of large amounts of video data, which allows for easy query and retrieval. For effective human-machine (robot) teaming, autonomous systems should be able to understand human teammates’ gestures as well as recognize various human activities taking place in the field to gain situational awareness of the scene. Further, an important step in building visual tracking systems is to design ontologies and vocabularies for human activity and environment representations [8, 9]. All these tasks necessitate automated learning of movements of the human body or human action attributes.
Human activities consist of a sequence of actions that can be represented hierarchically, as shown in Fig. 1. The bottom level of the hierarchy consists of the fine-resolution description of an action, i.e., movement of the human body (e.g., right arm moves up, left arm moves up, torso bending, and legs moving apart), and can be called an action attribute. At the middle level, a sequence of these attributes forms a human action. The human actions and their interactions form an activity, while a sequence of activities forms an event. An important advantage of the hierarchical model is that such structures go hand in hand with semantic or syntactic approaches and they provide us with the flexibility to generate summaries of long video sequences at different resolutions based on the needs of the end application.
In this work, we focus on the bottom two layers of the human activity hierarchical model, i.e., human actions and their representations using attributes. One possible way of obtaining such representations is to manually specify the action attributes and assign training video sequences to each attribute. Another way is to manually annotate training video sequences by labeling movements. Any human action in a test sequence can then be described using such user-defined attributes. Both of these approaches fall into the supervised category. However, a set of user-defined action attributes may not completely describe all the human actions in given data. Also, manual assignment of training data for each of the action attributes is time consuming, if not impossible, for large datasets. Another issue with supervised learning of action attributes is that there might be attributes that were not seen in the training data by the system, but that might be seen in the field. For these reasons, unsupervised techniques to learn action attributes have been investigated [12, 13, 14]. These methods learn action attributes by clustering low-level features based on their co-occurrence in training videos. However, video data are not usually well distributed around cluster centers and hence, the cluster statistics may not be sufficient to accurately represent the attributes.
Motivated by the premise that high-dimensional video data usually lie in a union of low-dimensional subspaces, instead of being uniformly distributed in the high-dimensional ambient space, we propose to represent human action attributes based on the union-of-subspaces (UoS) model. The hypothesis of the UoS model is that each action attribute can be represented by a subspace. We conjecture that the action attributes represented by subspaces can encode more variations within an attribute compared to the representations obtained using co-occurrence statistics [12, 13, 14]. The task of unsupervised learning of the UoS underlying data of interest is often termed subspace clustering [17, 15, 16], which involves learning a graph associated with the data and then applying spectral clustering on the graph to infer the clustering of data. Recently, a sparse subspace clustering (SSC) technique based on solving an $\ell_1$-minimization problem has been developed. This has been extended into a hierarchical structure to learn subspaces at multiple resolutions. To capture the global structure of data, low-rank representation (LRR) models with and without sparsity constraints have also been proposed. Liu et al. extended LRR by incorporating manifold regularization into the LRR framework. It has been proved that LRR can achieve perfect subspace clustering results under the condition that the subspaces underlying the data are independent [21, 16]. However, this condition is hard to satisfy in many real situations. To handle the case of disjoint subspaces, Tang et al. extended LRR by imposing restrictions on the structure of the solution, called structure-constrained LRR (SC-LRR). Low-rank subspace learning has also been extended to multidimensional data for action recognition. However, there are fundamental differences between our work and that line of work because the main objective of our proposed approach is to learn action attributes in an unsupervised manner.
I-A Our Contributions
Existing LRR-based subspace clustering techniques use spectral clustering as a post-processing step on the graph generated from a low-rank coefficient matrix, but the relationship between the coefficient matrix and the segmentation of data is seldom considered, which can lead to sub-optimal results. Our first main contribution in this regard is the introduction of a novel low-rank representation model, termed the clustering-aware structure-constrained LRR (CS-LRR) model, to obtain optimal clustering of human action attributes from a large collection of video sequences. We formulate the CS-LRR learning problem by introducing spectral clustering into the optimization program. The second main contribution of this paper is a hierarchical extension of our CS-LRR model for unsupervised learning of human action attributes from the data at different resolutions without assuming any knowledge of the number of attributes present in the data. Once the graph is learned from the CS-LRR model, we segment it by applying hierarchical spectral clustering to obtain action attributes at different resolutions. The proposed approach is called hierarchical clustering-aware structure-constrained LRR (HCS-LRR).
The block diagram of the system that uses the HCS-LRR algorithm to learn human action attributes is shown in Fig. 2. A large stream of video data is taken as input and features such as silhouettes, frame-by-frame spatial features like histograms of oriented gradients (HOG), and spatio-temporal features like motion boundary histogram (MBH) are extracted from the input. The data samples in this high-dimensional feature space are given as the input to the HCS-LRR algorithm and human action attributes are obtained as the output. One of the main applications of learning the attributes based on the UoS model is semantic summarization of long video sequences. The attributes at different levels of the hierarchy can be labeled by an expert-in-the-loop by visualizing the first few basis vectors of each attribute (subspace) in the form of images. Once the labeled attributes are available, any long video sequence of human activity can then be semantically summarized at different levels of granularity based on the requirements of an application. Another major application of learning the attributes is human action recognition. A human action or activity can be represented as a sequence of transitions from one attribute to another, and hence, can be represented by a subspace transition vector. Even though multiple actions can share action attributes, each action or activity can be uniquely represented by its subspace transition vector. A classifier can be trained based upon these transition vectors to classify an action in a test video sequence into one of the actions in the training data. Our final contribution involves developing frameworks for both semantic summarization and human action recognition from the HCS-LRR model. Our results confirm the superiority of HCS-LRR in comparison to a number of state-of-the-art subspace clustering approaches.
I-B Notational Convention and Organization
The following notation will be used throughout the rest of this paper. We use non-bold letters to represent scalars, bold lowercase letters to denote vectors/sets, and bold uppercase letters to denote matrices. The -th element of a vector is denoted by and the -th element of a matrix is denoted by . The -th row and -th column of a matrix are denoted by and , respectively. Given two sets and , denotes the submatrix of corresponding to the rows and columns indexed by andand of appropriate dimensions, respectively.
The only vector norm used in this paper is the norm, which is represented by . We use a variety of norms on matrices. The and norms are denoted by and , respectively. The norm is defined as . The spectral norm of a matrix , i.e., the largest singular value of , is denoted by . The Frobenius norm and the nuclear norm (the sum of singular values) of a matrix are denoted by and , respectively. Finally, the Euclidean inner product between two matrices is , where and denote transpose and trace operations, respectively.
The rest of the paper is organized as follows. Section II introduces our feature extraction approach for human action attribute learning. In Section III, we mathematically formulate the CS-LRR model and present the algorithm based on the CS-LRR model. The hierarchical structure of the CS-LRR model is described in Section IV. In Sections V and VI, we discuss the approaches for semantic description of long video sequences and action recognition using learned action attributes, respectively. We then present experimental results in Section VII, which is followed by concluding remarks in Section VIII.
II Feature Extraction for Attribute Learning
The main focus of our work is to learn meaningful human action attributes from large streams of video data in an unsupervised manner. The first step in our proposed framework is to extract feature descriptors from an action interest region in which the human performs the action. The action interest region of each frame of an action sequence is determined by a bounding box. In our work, we learn action attributes using two local visual descriptors: HOG (histograms of oriented gradients) and MBH (motion boundary histogram). To extract HOG descriptors, we divide the action interest region into a grid of blocks, each of size pixels. A HOG feature is then extracted for each block, with orientations quantized into 9 bins. Therefore, the HOG feature of every frame can be stored in a matrix , where denotes the number of blocks in the action interest region, and the HOG feature vector of each block corresponds to a row in . We vectorize HOG features and normalize each vector to unit norm, forming individual data samples in a matrix , where and denotes the total number of frames in the videos, as shown in Fig. 3.
The MBH descriptor represents the oriented gradients computed separately from the vertical and horizontal components of the optical flow, which makes it robust to camera and background motion. To extract MBH descriptors, we first split the optical flow field into two scalar fields corresponding to the horizontal and vertical components, which can be regarded as “images” of the motion components. Similar to HOG descriptor extraction, we divide the action interest region of each optical flow component image into a grid of blocks, each of size pixels. Spatial derivatives are then computed for each block in each optical flow component and orientation information is quantized into 9 bins. Instead of using MBH features of the optical flow field between every two video frames separately, we aggregate MBH features of every adjacent optical flow fields (computed between video frames) and the sum is used as the feature of these optical flow fields. Therefore, the MBH feature of every adjacent optical flow fields corresponds to a matrix , where denotes the number of blocks in the action interest region, and the MBH feature vector of each block corresponds to a row in . We again vectorize the MBH features and normalize each vector to unit norm, forming the MBH feature matrix , where and again denotes the total number of feature descriptors; see Fig. 3. Given all the features extracted from the video data, we aim to learn action attributes based on the UoS model, which will be described in the following sections.
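The vectorize-and-normalize step shared by the HOG and MBH pipelines can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `frames_to_samples`, the row-by-row flattening order, and the reading of "unit norm" as the Euclidean norm are all assumptions, and the per-block histograms below are random stand-ins for real descriptor output.

```python
import numpy as np

def frames_to_samples(frame_features):
    """Turn per-frame (n_blocks x n_bins) histograms into unit-norm columns.

    Each input matrix has one row per block (hypothetical layout); the
    output has one column per frame, as in a data matrix of samples.
    """
    columns = []
    for feat in frame_features:
        v = np.asarray(feat, dtype=float).reshape(-1)  # vectorize row by row
        norm = np.linalg.norm(v)                       # Euclidean norm (assumed)
        columns.append(v / norm if norm > 0 else v)    # leave all-zero frames as is
    return np.stack(columns, axis=1)

# Stand-in data: 4 frames, 6 blocks per frame, 9 orientation bins per block.
rng = np.random.default_rng(0)
frames = [rng.random((6, 9)) for _ in range(4)]
X = frames_to_samples(frames)
```

With 6 blocks and 9 bins, each sample has 54 entries, so `X` has one 54-dimensional unit-norm column per frame.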
III Clustering-Aware Structure-Constrained Low-Rank Representation
In this section, we propose our clustering-aware structure-constrained LRR (CS-LRR) model for learning the action attributes using the feature descriptors extracted from the video data. We begin with a brief review of LRR and SC-LRR since our CS-LRR model extends these models.
III-A Brief Review of LRR and SC-LRR
Consider a collection of feature vectors in , , that are drawn from a union of low-dimensional subspaces of dimensions . The task of subspace clustering is to segment the data points according to their respective subspaces. Low-rank representation (LRR) is a recently proposed subspace clustering method [21, 16] that aims to find the lowest-rank representation for the data using a predefined dictionary to reveal the intrinsic geometric structure of the data. Mathematically, LRR can be formulated as the following optimization problem :
where is a predefined dictionary that linearly spans the data space and is the low-rank representation of the data over . In practice, the observed data are often corrupted by noise. To better handle the noisy data, the LRR problem can be formulated as
where is a matrix representing the approximation errors of the data, the nuclear norm is the convex relaxation of the rank operator, and indicates a certain regularization strategy involving . In , the $\ell_{2,1}$ norm is used for regularization because of its robustness against corruptions and outliers in data. Finally, is a positive parameter that sets the tradeoff between the low rankness of and the representation fidelity. In , the whole sample set is used as the dictionary for clustering, which takes advantage of the self-expressiveness property of data. Once the matrix is available, a symmetric non-negative similarity matrix can be defined as , where denotes the element-wise absolute value operation. Finally, spectral clustering can be performed on to get the final clustering results.
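As a small illustration of the last step, a symmetric non-negative similarity matrix can be built from a coefficient matrix by symmetrizing its element-wise absolute value. The averaging form used here, (|Z| + |Z|^T)/2, is a common choice in the LRR literature and is an assumption in this sketch, as is the function name `affinity_from_coefficients`.

```python
import numpy as np

def affinity_from_coefficients(Z):
    """Symmetric, non-negative affinity from a square coefficient matrix.

    Uses (|Z| + |Z|^T) / 2, an assumed (but common) symmetrization.
    """
    A = np.abs(np.asarray(Z, dtype=float))  # element-wise absolute value
    return 0.5 * (A + A.T)                  # enforce symmetry

# Toy coefficient matrix for two samples.
Z = np.array([[0.0, -2.0],
              [1.0,  0.0]])
W = affinity_from_coefficients(Z)
```

The resulting matrix can be passed to any spectral clustering routine that accepts a precomputed affinity.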
It has been proved that LRR can achieve a block-diagonal (up to permutation) solution under the condition that the subspaces underlying the data are independent [21, 16]. However, clustering of disjoint subspaces is more desirable in many real situations. (Heuristically, a collection of subspaces is said to be independent if the bases of all the subspaces are linearly independent, whereas the subspaces are said to be disjoint if every pair of subspaces is independent; we refer the reader to  for formal definitions.) To improve upon LRR for disjoint subspace clustering, Tang et al. proposed the structure-constrained LRR (SC-LRR) model, whose learning can be formulated as follows:
where and are penalty parameters, is a predefined weight matrix associated with the data, and denotes the Hadamard product. It has been shown in  that by designing some predefined weight matrices, the optimal solution of is block-diagonal for disjoint subspaces when the data are noiseless. In general, the matrix imposes restrictions on the solution by penalizing affinities between data samples from different clusters, while rewarding affinities between data samples from the same cluster. The sample set is again selected as the dictionary in  for clustering.
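For reference, the three programs reviewed in this subsection are commonly written as follows in the LRR literature. The symbols are assumptions chosen for this sketch rather than notation confirmed by the text: $\mathbf{X}$ is the data matrix, $\mathbf{A}$ the dictionary, $\mathbf{Z}$ the coefficient matrix, $\mathbf{E}$ the error matrix, $\mathbf{B}$ the predefined weight matrix, $\odot$ the Hadamard product, $\alpha$ and $\lambda$ the penalty parameters, and $\|\cdot\|_{\iota}$ a generic error regularizer such as the $\ell_{2,1}$ norm.

```latex
\begin{align*}
\text{(noiseless LRR)} \quad
  & \min_{\mathbf{Z}} \ \operatorname{rank}(\mathbf{Z})
    \quad \text{s.t.} \quad \mathbf{X} = \mathbf{A}\mathbf{Z}, \\
\text{(noisy LRR)} \quad
  & \min_{\mathbf{Z},\mathbf{E}} \ \|\mathbf{Z}\|_{*} + \lambda \|\mathbf{E}\|_{\iota}
    \quad \text{s.t.} \quad \mathbf{X} = \mathbf{A}\mathbf{Z} + \mathbf{E}, \\
\text{(SC-LRR)} \quad
  & \min_{\mathbf{Z},\mathbf{E}} \ \|\mathbf{Z}\|_{*}
    + \alpha \|\mathbf{B} \odot \mathbf{Z}\|_{1}
    + \lambda \|\mathbf{E}\|_{\iota}
    \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{Z} + \mathbf{E}.
\end{align*}
```

The second program replaces the rank with its convex surrogate, the nuclear norm, and the third takes the sample set itself as the dictionary, i.e., $\mathbf{A} = \mathbf{X}$.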
III-B CS-LRR Model
Almost all the existing subspace clustering methods follow a two-stage approach: (i) learning the coefficient matrix from the data and (ii) applying spectral clustering on the affinity matrix to segment the data. This two-step approach may lead to sub-optimal clustering results because the final clustering result is independent of the optimization problem that is used to obtain the coefficient matrix. We hypothesize that by making the final clustering result dependent on the generation of the optimal coefficient matrix, we will be able to obtain better clustering results. Specifically, suppose we have the coefficient matrix for SC-LRR. Then one can define an affinity matrix as . We obtain the clustering of the data by applying spectral clustering on , which solves the following problem:
where is a binary matrix indicating the cluster membership of the data points, i.e., , if lies in subspace and otherwise. Here, is a diagonal matrix with its diagonal elements defined as . The solution of consists of the eigenvectors of the Laplacian matrix associated with its smallest eigenvalues. Note that the objective function of (4) can also be written as
where . In order to capture the relation between and , we imagine that the exact segmentation matrix is known. It can be observed that if and lie in different subspaces, i.e., , then we would like to have for better clustering. Therefore, we can use (5) to quantify the disagreement between and .
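The link between the trace objective of spectral clustering and a pairwise "disagreement" between affinities and cluster memberships rests on a standard graph Laplacian identity: for a symmetric affinity matrix W with degree matrix V, tr(F^T (V - W) F) equals one half of the sum over all pairs (i, j) of w_ij times the squared distance between rows i and j of F. The sketch below checks this numerically and computes the relaxed solution from the smallest eigenvectors of the Laplacian. The names `spectral_embedding` and `pairwise_disagreement` are hypothetical, and the unnormalized Laplacian is an assumption.

```python
import numpy as np

def spectral_embedding(W, k):
    """Eigenvectors of the unnormalized Laplacian V - W for the k smallest eigenvalues."""
    V = np.diag(W.sum(axis=1))            # degree matrix
    eigvals, eigvecs = np.linalg.eigh(V - W)  # eigh returns ascending eigenvalues
    return eigvecs[:, :k]

def pairwise_disagreement(W, F):
    """0.5 * sum_ij w_ij * ||F[i] - F[j]||^2, with F[i] the i-th row of F."""
    sq_dist = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
    return 0.5 * float(np.sum(W * sq_dist))

# Two disconnected groups of samples: {0, 1} and {2, 3, 4}.
W = np.zeros((5, 5))
W[:2, :2] = 1.0
W[2:, 2:] = 1.0
F = spectral_embedding(W, 2)
trace_form = float(np.trace(F.T @ (np.diag(W.sum(axis=1)) - W) @ F))
```

For this block-diagonal affinity, rows of `F` coincide within a group, so both forms of the objective are numerically zero. Large affinities between samples with different rows of `F` are exactly what the pairwise form penalizes.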
The ground truth segmentation matrix is of course unknown in practice. In order to penalize the “disagreement” between and , we propose a clustering-aware structure-constrained LRR (CS-LRR) model obtained by solving the following problem:
where , and are penalty parameters. Similar to , the -th entry of is defined as , where is the mean of all ’s. The CS-LRR model in (III-B) requires knowledge of the number of subspaces . Initially, we assume to have knowledge of an upper bound on , which we denote by , and we use in (III-B) to learn the representation matrix. In practice, however, one cannot assume knowledge of this parameter a priori. Therefore, we also develop a hierarchical clustering technique to automatically determine , which will be discussed in Section IV. The CS-LRR model encourages consistency between the representation coefficients and the subspace segmentation by making the similarity matrix more block-diagonal, which can help spectral clustering achieve the best results.
III-C Solving CS-LRR
where and are matrices of Lagrangian multipliers and is a penalty parameter. The optimization of (III-C) can be done iteratively by minimizing with respect to , , and one at a time, with all other variables being fixed. Note that we also update accordingly once we have updated. The constraint in (III-C) is imposed independently in each step of updating .
Update while fixing other variables: When other variables are fixed, the problem of updating in the -th iteration () is equivalent to minimizing the following function:
where . However, this variant of the problem does not have a closed-form solution. Nonetheless, in the spirit of the linearized alternating direction method (LADM), can also be minimized by solving the following problem:
where is the partial differential of with respect to and is a constant satisfying . For this problem, . Then the closed-form solution for is given as
where denotes the singular value thresholding operator.
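The singular value thresholding operator is the proximal operator of the nuclear norm: it soft-thresholds the singular values of its argument and reassembles the matrix. A minimal NumPy sketch follows; the function name `svt` and the threshold argument `tau` are illustrative, and how the threshold is tied to the penalty and linearization constants of the update is not encoded here.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: shrink each singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # soft-threshold the singular values
    return (U * s_shrunk) @ Vt            # scale the columns of U, then recombine

# Toy example: singular values 3 and 1, threshold 2.
M = np.diag([3.0, 1.0])
M_thresholded = svt(M, 2.0)
```

Singular values at or below the threshold are set to zero, which is how this step lowers the rank of the iterate.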
Update while fixing other variables: When other variables are fixed, the problem of updating is
which has the following closed-form solution:
where the -th entry of is given by with . After this, we update by setting , .
Update while fixing other variables: When other variables are fixed, the problem of updating is
Defining , this problem has a closed-form solution that involves eigendecomposition of . In particular, the columns of are given by the eigenvectors of associated with its smallest eigenvalues.
Update while fixing other variables: When other variables are fixed, the problem of updating can be written as
where . For HOG features, we define to be the approximation error with respect to (the matrix version of ) and set the error term to ensure robustness against “corruptions” in the orientation of each HOG feature descriptor; this is because the background information is included in the feature vector. Then (14) can be decomposed into independent subproblems. In order to update , we first convert the vector to a matrix and then solve the following problem:
where is the reshaped “image” of the vector . This problem can be solved using [21, Lemma 3.2]. For MBH features, since the noise due to background motion is eliminated, we simply set the error term ; then (14) can be written as
IV Hierarchical Subspace Clustering Based on CS-LRR Model
We now introduce a hierarchical subspace clustering algorithm based on CS-LRR approach for learning action attributes at multiple levels of our UoS model and for automatically determining the final number of attributes present in a high-dimensional dataset without prior knowledge. To begin, we introduce some notation used in this section. We define to be the set containing the indexes of all ’s that are assigned to the -th subspace at the -th level () of the hierarchical structure, and let be the corresponding set of signals, where is the number of signals in . Let denote the number of subspaces at the -th level, then we have and for all ’s. The subspace underlying is denoted by and the dimension of the subspace is denoted by .
We first apply Algorithm 1 to obtain the optimal representation coefficient matrix . Then we set the coefficients below a given threshold to zero, and we denote the final representation matrix by . By defining the affinity matrix , we proceed with our hierarchical clustering procedure as follows. We begin by applying spectral clustering based on at the first level (), which divides into two subspaces with clusters such that , and we use ’s () to denote the indexes of signals in each cluster. At the second level, we perform spectral clustering based on and separately and divide each () into 2 clusters, yielding 4 clusters with (). Using the signals in (), we estimate the four subspaces underlying ’s by identifying their orthonormal bases. To be specific, we obtain the eigendecomposition of the covariance matrix such that , where is a diagonal matrix () and . Then the dimension of the subspace , denoted by , is estimated based on the energy threshold, i.e., , where is a predefined threshold and is set close to 1 for better representation. The orthonormal basis of can then be written as . After this step, we end up with 4 clusters with their corresponding indexes and associated orthonormal bases .
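The energy-threshold rule for estimating a subspace can be sketched as follows. This is an illustration under assumptions, not the authors' code: the covariance is taken as the uncentered product Y Y^T of the cluster's signals, the dimension is the smallest number of leading eigenvalues whose cumulative share of the total reaches the threshold, and the names `estimate_subspace` and `gamma` are hypothetical.

```python
import numpy as np

def estimate_subspace(Y, gamma=0.95):
    """Return an orthonormal basis and dimension for the columns of Y.

    The dimension is the smallest d whose leading eigenvalues carry at
    least a fraction gamma of the total energy (assumed rule).
    """
    C = Y @ Y.T                                     # uncentered covariance (assumed)
    eigvals, eigvecs = np.linalg.eigh(C)            # ascending order
    order = np.argsort(eigvals)[::-1]               # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)    # guard against tiny negatives
    eigvecs = eigvecs[:, order]
    energy = np.cumsum(eigvals) / np.sum(eigvals)   # cumulative energy fraction
    d = min(int(np.searchsorted(energy, gamma)) + 1, len(eigvals))
    return eigvecs[:, :d], d

# Signals in R^5 that live in the span of the first two coordinate axes.
Y = np.zeros((5, 4))
Y[0, :2] = 3.0
Y[1, 2:] = 2.0
U, d = estimate_subspace(Y, gamma=0.95)
```

Here the leading eigenvalue carries about 69% of the energy, so one dimension is not enough and the estimate is two. Setting the threshold closer to 1 retains more directions and gives a more faithful but higher-dimensional subspace.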
For every , we decide whether or not to further divide each single cluster (i.e., subspace) at the -th level into two clusters (subspaces) at the ()-th level based on the following principle. We use a binary variable to indicate whether the cluster is further divisible at the next level or not. If it is, we set , otherwise . We initialize for all ’s (). Consider the cluster at the -th level and assume there already exist clusters at the ()-th level derived from . If , the ()-th cluster at the ()-th level will be the same as ; thus, we simply set , , , and . If , we first split into two sub-clusters by applying spectral clustering on , and we use () to be the set containing the indexes of the signals in . Then we find the subspaces () underlying ’s respectively using the aforementioned strategy, while their dimensions and orthonormal bases are denoted by ’s and ’s, respectively. After this step, we compute the relative representation error of every signal in () using the parent subspace basis and the child subspace basis , which are defined as and , respectively. We use and to denote the mean of ’s and ’s in (), respectively. Finally, we say is divisible if () the relative representation errors of the signals using the child subspace are less than the representation errors calculated using the parent subspace by a certain threshold, i.e., for either or 2, and () the dimensions of the two child subspaces meet a minimum requirement, that is, . Here, and are user-defined parameters and are set to avoid redundant subspaces. When either or decreases, we tend to have more subspaces at every level . Assuming the two conditions are satisfied, the cluster is then divided by setting (, ) and (, ). The bases of the subspaces at the ()-th level are set by and . If the above conditions are not satisfied, we set , , , and to indicate , i.e., , is a leaf cluster and this cluster will not be divided any further. This process is repeated until we reach a predefined maximum level in the hierarchy denoted by .
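The two-part divisibility test can be sketched as below. Several details are assumptions made only for illustration: the relative representation error is taken as the squared residual after orthogonal projection divided by the squared signal norm, the error condition is read as an absolute gap between parent and child mean errors, and the names `relative_error`, `is_divisible`, `err_gap`, and `d_min` are hypothetical.

```python
import numpy as np

def relative_error(Y, U):
    """Mean relative residual of the columns of Y after projection onto span(U).

    Assumes U has orthonormal columns and Y has no all-zero columns.
    """
    residual = Y - U @ (U.T @ Y)
    num = np.sum(residual ** 2, axis=0)
    den = np.sum(Y ** 2, axis=0)
    return float(np.mean(num / den))

def is_divisible(children, U_parent, child_bases, err_gap, d_min):
    """Decide whether a parent cluster should be split into its two children."""
    # Dimension condition: both child subspaces must be large enough.
    if min(U.shape[1] for U in child_bases) < d_min:
        return False
    # Error condition: at least one child must beat the parent by err_gap.
    for Y_child, U_child in zip(children, child_bases):
        gain = relative_error(Y_child, U_parent) - relative_error(Y_child, U_child)
        if gain >= err_gap:
            return True
    return False

# Toy clusters along two coordinate axes; the parent basis lies between them.
Y1 = np.array([[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]])
Y2 = np.array([[0.0, 0.0], [1.0, 3.0], [0.0, 0.0]])
U_parent = np.array([[1.0], [1.0], [0.0]]) / np.sqrt(2.0)
U1 = np.array([[1.0], [0.0], [0.0]])
U2 = np.array([[0.0], [1.0], [0.0]])
split = is_divisible([Y1, Y2], U_parent, [U1, U2], err_gap=0.1, d_min=1)
```

In this toy case the parent explains only half of each signal's energy while each child explains all of it, so the split is accepted. Raising `err_gap` or `d_min` makes splits rarer, which matches the remark that smaller parameter values yield more subspaces.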
The hierarchical subspace clustering algorithm based on CS-LRR model for any level is described in Algorithm 2, which we term HCS-LRR. It is worth noting that the maximum number of leaf clusters is in this setting, which we set as a key input parameter of Algorithm 1.
V Attribute Visualization and Semantic Summarization
Given the learned subspaces at different levels of the hierarchical structure, our next goal is to develop a method that helps an expert-in-the-loop to visualize the learned human action attributes, give them semantic labels, and use the labeled attributes to summarize long video sequences of human activities in terms of language at different resolutions. As we have shown previously in , if frame-by-frame silhouette features are used for learning the human action attributes, the attributes (subspaces) can be easily visualized by reshaping the first few vectors of the orthonormal bases of the subspaces into an image format and displaying the scaled versions of these images. However, if other spatial or spatio-temporal features like HOG or MBH are used, the attributes or the subspaces learned using HCS-LRR algorithm cannot be visualized directly by just reshaping each dimension of the subspace in the feature domain.
V-A Visualization of Attributes Using HOG Features
In the case of HOG features, inspired by HOGgles , we propose an algorithm to visualize the learned attributes by mapping them back to the pixel domain. In particular, we are interested in building a mapping between the pixel (image) space and the HOG feature space and use this mapping to transform the bases of the HOG feature subspaces into the image space and visualize the attributes. An algorithm based on paired dictionary learning is used to develop this mapping. Concretely, let be the collection of vectorized patches of size pixels from video frames, and be the corresponding HOG feature vectors of the patches in . Here, the dimensionality of the HOG features of each patch depends on the choice of the number of bins in the histogram. For better visualization quality, we extract 18-bin contrast specific HOG features. Hence, in this work. Then, two dictionaries, i.e., overcomplete bases whose columns span the data space, and consisting of atoms are learned to represent image space and HOG feature space, respectively, such that the sparse representation of any in terms of should be the same as that of in terms of . Similar to , paired dictionary learning problem can be considered as solving the following optimization problem:
where denotes the sparse code with respect to /, and and denote the -th column of and , respectively.
Equation (V-A) can be simplified into a standard dictionary learning and sparse coding problem and can be solved