Many of the most successful classification frameworks for videos employ generative models of visual processes, where videos are modeled as distributions of descriptors. Prominent examples are Local Binary Patterns in Three Orthogonal Plains (LBP-TOP)  and Bags of Systems (BoS) [2, 3, 4]. Typically, the descriptors in question are local in the spatiotemporal domain, but the distributions are global
, neglecting their spatial and temporal order. This has proven successful in many classification problems, however, there are scenarios where such a procedure could turn out suboptimal for several reasons. Spatiotemporally local descriptors are supposed to capture the dynamics locally in space and time. This is sensible for the recognition of dynamic textures on a small scale, but for large-scale dynamic textures or real-world dynamic scenes, the temporal dynamics on a global scale is a more distinguishing feature: for instance, a traffic scene is more characterized by the appearance and disappearance of vehicles than by the movement of the trees on the roadside. Furthermore, breaking up the global temporal order of the overall visual process can be problematic in cases where the appearance of the frames changes as a whole over the course of time, e.g. when observing outdoor scenes under changing weather conditions. In such cases, the video can have semantic features that may get lost by destroying the temporal order. On the other hand, breaking up the global spatial order contradicts the everyday observation that looking at single frames of a visual process often suffices to distinguish between dynamic scenes or high-resolution textures. In such cases, employing well established still image feature extraction methods on the isolated frames can be a sensible step in the feature extraction of the overall visual process.
Remarkably, for still image textures and still image scenes, generative, distribution-based models have proven their efficacy at several occasions. Texture images, being often thought of as realizations of stochastic processes , have a long standing tradition of distribution based models [6, 7, 8, 9]. Meanwhile, the concept of BoS is inspired by the Bag of Words (BoW)  paradigm, where images are described by the frequency of previously learned features contained in them. BoW based methods have been successfully employed in the task of distinguishing still-image scenes [11, 12]
. Moreover, outstanding performance on dynamic scene recognition can be achieved when ”bags” of spatiotemporal features are computed on a temporally local scale. For instance, the authors of propose computing several temporally localized bags of oriented filter response features from videos and produce outstanding results. The classification is performed by a majority vote that encompasses all of the computed bags in a video.
We conclude that, employing generative, distribution based models for the individual frames, or alternatively, localized collections of frames of visual processes such as dynamic textures and dynamic scenes is a promising approach. In the classical case, including many BoW based approaches, these models are histograms. Alternatively, they can also be represented by Fisher Vectors
or statistical moments of a parametrized distribution model. We will focus on the classical view in the following, even though the proposed methods can be easily generalized to the other perspectives. This leads to the assumption that individual frames of a visual process can be well modeled by histograms. However, a model treating visual processes as sequences of histograms neglects their temporal dynamics and thus fails to generalize from one sample to the whole process.
A remedy is to derive a dynamic model for the temporal evolution of the histograms.
We refer to such an approach as System of Bags (SoB) in reference to the successful Bags of Systems. Unlike Bags of Systems, where dynamic systems on spatially and temporally local scale are computed and ”bags” thereof on a global scale are created, we compute histograms on a temporally local but spatially global scale and model their evolution over time by means of a dynamic system. Fig. 1 visualizes this distinction. The left side of the picture shows the procedure of BoS and related generative models. A video is first divided into spatiotemporal cubes and for each cube, a word, e.g. a stack of dynamic system parameters
, is computed. Afterwards, a descriptor distribution over the complete video is estimated as the final representation of the video. The right side of the figure shows the procedure of generating a SoB. Distributions of descriptors are computed on temporally local but a spatially global scale. The temporal progression of these distributions is successively modeled as a dynamic system. Typically, the terms bag and word imply the usage of a learned codebook. However, this technique can be applied independently of codebooks and thus will be referred to as SoBs even in cases where the histograms were not computed with respect to a codebook.
I-a Related Work and our Contribution
The temporal evolution of histograms can not be well described by linear dynamic systems. Instead, we employ the concept of Kernel Linear Dynamic Systems
(KLDS) which model the observations in a kernel feature space. The parameters describing the KLDS are the representations of the visual processes employed in his work. Using these descriptors in the context of supervised learning requires the definition of a dissimilarity measure. The available literature on KLDSs offers a manageable number of dissimilarity measures that perform insufficiently on the SoB descriptors used in this work, as will be shown in the experimental SectionIV, which may be due to the fact that they put too much emphasis on the dynamic part of the KLDS parameters, whereas in the setting discussed in this work, static information, i.e. the information not related to the temporal context has a considerable significance. The KLDS itself was introduced in  and motivated by the recognition of dynamic textures. As a dissimilarity measure, a kernelized version of the widely adopted Martin distance  was applied.
Modeling visual processes as SoBs i.e. KLDSs of histograms was employed in  for the classification of human actions. The authors modeled videos of human actions as streams of histograms of optical flow (HOOF). As a similarity measure, a Binet-Cauchy Maximum Singular Value kernel was applied. The work was further enhanced in , where human interactions were targeted. More generally, bags and histograms as representations of samples of multidimensional time signals were used both for human action  and dynamic scene  recognition, but the temporal order was neglected in both cases.
The novelty of this work is to explore the framework of SoB in the context of dynamic scene and large-scale dynamic texture recognition and to develop an appropriate dissimilarity measure. To this end, we adopt the framework of the alignment distance from  and develop a kernelized version suitable for the comparison of SoBs. Unlike the mentioned dissimilarity measures for SoBs, impact of the static and dynamic components can be chosen depending on the employed generative image model. Besides, its property of being the square of a metric and its simple definition based on the Frobenius distance allows for the definition of Fréchet means . This is crucial for the classification via the nearest class center (NCC) . A part of this work is dedicated to the computation of abstract means of sets of KLDSs for avoiding the memory burden of nearest neighbor (1-NN) classification.
The computation of alignment distances involves modeling the appearance of a visual process as an equivalence class of points on a Stiefel manifold. This is closely related to Grassmann based models. The authors of  model visual processes as points on kernelized Grassmann manifolds, while our approach employs kernelized Stiefel manifolds in a similar manner. In particular, the authors propose to model the spaces of video frames, or of temporally localized features extracted from them, as sparse codes via points on Grassmann manifolds. Furthermore, a kernel dictionary learning approach for dynamic texture recognition was employed in .
Boldfaced uppercase letters, e.g. denote matrices and boldfaced lowercase letters e.g denote vectors. Bold italic letters like or
refer to any members of metric spaces. The identity matrix inis written as and a vector of ones as . For submatrices, the colon notation of Matlab is adopted, e.g. for the left-upper square submatrix of
. Singular value and Eigenvalue decompositions are assumed to be sorted in a descending manner.
Ii Systems of Bags
Ii-a Visual Processes as Streams of Histograms
Given an ordered set of vectorized samples of a multidimensional signal in time, let us assume that temporally local dynamics are negligible for the assignment of a class. This assumption can not be kept up when it comes to determining the sense of rotation of a wheel or a windmill from video footage but can be usually assumed to be valid for telling one scene video apart from another. In such cases, it is sensible to convert the signal to a matrix of temporally localized feature vectors that capture the distinguishable characteristics on a spatially global but temporally local scale.
Classification requires the generalization from one set of signals of one class to other signals of the same class. Since the entities in our model are temporally ordered sets of features, our aim is to learn a generative model that describes how the feature vectors develop over time. For many image classification scenarios, histogram-based feature vectors have proven successful. We do not pose any constraints on the histograms except that their entries are nonnegative and their -norm is 1. Let be a sample matrix of histogram vectors that were observed from a visual process over time. A common temporal model - one that is particularly popular in the modeling of dynamic textures  - is a linear dynamic system (LDS) typically modeled as
where is the expected value of the observations and is the observation matrix which, together with maps the internal state space vector to its respective observation at time , unperturbed by noise. The observation noise vector describes the model error. In the state space, the state transition matrix models the expected temporal evolution of the state vector and the term accounts for the process noise. The noise terms are assumed to be i.i.d. Gaussian. The parameters , and describe the predictable part of the dynamics of the system and are thus a natural choice for a feature representation of it.
Ii-B Kernelized Linear Dynamic Systems
Sets of histograms can not be well modeled by linear vector spaces due to the intrinsic structure of histogram manifolds  and thus a non-linear model is preferred. Let the function be a feature space mapping of histograms to a feature space corresponding to an appropriate histogram kernel
In other words, for two histograms , the inner product fo their feature space mapping in the Hilbert space can be written as
We assume that operates on matrices and returns a matrix of kernel values for each pair of columns of the input matrices. A number of kernels are available for probabilistic models and, in particular, histograms. Among the most popular are the Bhattacharrya kernel, the histogram intersection kernel  and the -kernel. We will restrict ourselves to the -kernel which is defined as
for a pair of histograms , where denote the supports of and , respectively.
Since kernel feature spaces are separable , we can think of linear operators that map from real-valued euclidean vectors to as matrices, where by matrix a touple of elements in is meant. Specifically, a matrix represents the operator
Beyond that, for two matrices , we define the product
From this follow the definitions of the respective Frobenius inner product for and the Frobenius norm
A KLDS is defined as
Unlike equation (1), the observation is not modeled directly, but in terms of its feature space mapping . The matrix , denoted feature space observer in the following, is typically not directly available, since is usually not given and described implicitly via the kernel trick. The same holds for the feature space bias . However, given a set of observations such that
holds, let us define
Then there exists a coefficient matrix and a coefficient vector for which the following relations are valid.
Thus, the KLDS (8) can be equivalently described via and , along with and a sample matrix as follows.
Given two feature space observers and feature space biases , described by the sample matrices and the coefficient parameters , the relations
can be concluded. For the kernel (4), the set of feature representations of histograms is bounded. This follows directly from the fact that for any histogram, the canonical norm induced by (4) is . We can thus assume, that the underlying system is stable. This is imposed by constraining the spectral norm of the state transition matrix to
An algorithm based on Kernel PCA to extract and from a set of observations of a system was provided in . Algorithm 1 incorporates the procedure. The state space dimension has to be fixed manually. It should be large enough to capture the variation of the data which can be measured by observing the magnitudes of the eigenvalues of the Gram matrix , and small enough to make the computation of the state transition matrix feasible.
Iii Kernelized Alignment Distances
Iii-a State Space Bases and Invariance
Given two KLDSs described by and
, respectively, one of the most elementary machine learning tasks is comparing the two by means of a dissimilarity measure. A natural choice is a linear combination of the squared Frobenius distances of the dynamic parameters:
Note that (15) differs from the formulation in  by the term and the missing of the process noise covariance. Since we assume that the temporally local histograms are already quite discriminative for each visual process, it is sensible to consider the feature space bias in the distance measure. By contrast, the process noise is of little importance and will be neglected. The parameters and are real and positive. They incorporate the discriminatory power of each aspect of deterministic part of the dynamics. Their choice is always a matter of consideration and depends on the specific problem. When a sufficient number of training samples are available, cross validation can be used for finding the best values. Some general guidelines can be inferred from the roles the KLDS parameters play in the motion equation (8). In particular, large values for should be used for scenarios, where the appearance has little discriminatory power, e.g. when similar objects with diverse movements are to be distinguished. A value well above for should be employed, when isolated frames of the videos provide much discriminatory power and a value close to for the opposite case.
To facilitate computations, we rewrite (15) in terms of the trace product and split it up as
The advantage of (15) is that the Frobenius distance is well studied and easy to interpret. However, this choice has the drawback that it is ambiguous. To see this, let be invertible. For an arbitrary parameter touple , consider the transformation
Substituting it into (12) and neglecting the noise terms indicates that it describes the same dynamics as , but does not vanish in general. This is undesirable, since a distance measure should account for ambiguities of particular representations. It is possible to partially accommodate these ambiguities by imposing the constraint that the columns of the feature space observer must be orthogonal, i.e
For any representation , a change of state space basis can be found such that the transformation (19) satisfies this constraint. Furthermore, this constraint is fulfilled for any representation extracted with Algorithm 1. We formalize it by defining the set of valid, stable KLDSs as
For any , a transformation is a member of if and only if is orthogonal. Furthermore, for any two descriptors and , it can be shown that
holds and equation (17) simplifies to
Iii-B Alignment Distance for KLDSs
In the following we define a non-ambiguous dissimilarity measure on KLDSs. The remaining ambiguity of with respect to orthogonal transformations suggests operating on equivalence classes of , rather than itself. Let denote the quotient of induced by the equivalence relation
A dissimilarity measure on is
The square root of (25) is actually a metric on . To see the positive definiteness, we first observe that its square is real and non-negative by definition. Neither can it be zero unless there is an orthogonal , such that , since otherwise would be smaller than in (25), due to the Cauchy-Schwarz inequality. But this makes and member of the same equivalence class of (24). The symmetry property follows directly form (22):
As for the triangle inequality, consider the three systems , and , with and being the orthogonal matrices that solve (25) for the respective pairs. Since is a metric on , we can conclude the relation
We refer to as the alignment metric, as opposed to its square , alignment distance. The reason for this distinction is that the metric is helpful for a mathematical interpretation, while for the definition and implementation of the algorithms only the distance is of interest.
Iii-C Jacobi-type Method for Computing the Alignment Distance
This subsection aims at solving (25), which boils down to finding an orthogonal maximizer of
The set of is not connected, but consists of the two connected components and . For the sake of simplicity, we treat these two cases separately. The authors of  propose to compute the alignment distance of classical LDSs by writing matrices in as products of Givens rotations, i.e. as
in which describes a matrix that performs a rotation in the plane spanned by the coordinate axes and :
The real numbers denote the sine and cosine of the rotation angle and appear in the rows and columns with the indexes and , respectively. Consequentially, they conform
In each iteration, Algorithm 2 sweeps through all possible combinations of two-dimensional rotations and determines the sine minimizing the cost function (25) for each one of them. It repeats the procedure until a complete sweep does not significantly alter the cost function. The scalar optimization problem
can be solved analytically. With and , it can be reformulated as
It can be observed that (33) is quadratic in and . Since constant offsets are irrelevant for maximization, we can write it as
where the factors can be determined by writing the trace products as sums of matrix elements  and eliminating the constant offset. Substituting the second constraint yields
If the optimum is not at the boundaries of , it is at a critical point, i.e. at a point with a vanishing derivative. Setting the first derivative to 0 produces
The reformulation is given by multiplying with and applying the third binomial formula. The resulting formula is the quartic equation
for which closed-form solution formulas exist  or Newton type methods can be applied. If the global maximum is not unique, the candidate with the smallest value for must be selected. Once is found, so is . The sign of is finally determined by evaluating for and .
While we need the optimizer on , Algorithm 2 searches only one of its connected components. Thus, a necessary condition for the solution of (25) is that the determinant of the determinant of the initialization has the correct sign. Practically, this implies that the algorithm needs to be run at least twice with initializations both in and .
Iii-D Convergence Properties
The purpose of this subsection is to take a closer a look at Algorithm 2 from the point of view of convergence and numerical complexity. For a discussion of convergence properties of Jacobi-type methods in general, the reader is referred to . The optimization problem (25) is non-convex due to the restriction of to be in . Although Jacobi methods perform reasonably well in practice, unfortunately we lack a proof of global convergence for this particular method.
be the cost function value at iteration of Algorithm 2 for any initialization. The sequence is bounded above and increases monotonically. Hence, it converges to a limit . In other words, the algorithm always terminates. The execution time is dictated by the solution of (37), for which fast and robust methods exist but which has to be performed times per iteration.
An essential question is, what the algorithm returns. Since we can not guarantee that it always returns a global maximum, we want at least to make sure that the result is (close to) a critical point of . This is shown in Theorem 1. One claim that is made is that can be further increased at a non-critical point on by multiplying with a Givens rotation. The claim follows from the observation that for any , the set
spans the tangent space of at which consists of products of
with skew-symmetric matrices.
Let be the sequence of orthogonal matrices generated by the outer while loop of Algorithm 2. The sequence is bounded with respect to the Frobenius norm. If the order of each sweep is chosen randomly, each accumulation point is almost certainly a critical point of the cost function .
Since is bounded w.r.t the Frobenius norm, so is .
Let be the limit of the cost function value sequence and the monotonically increasing index mapping belonging to a convergent subsequence of . The sequence converges to a limit point with . Assume that is not a critical point of . This implies, that we can find a Givens rotation , such that
Since the sequence converges to 0, so does the sequence
, because the Frobenius norm is invariant under orthogonal transformations. The mapping is continuuous w.r.t. the Frobenius norm and thus there is an index such that
holds for all . However, at each iteration the algorithm starts the sweep by performing a Givens rotation. If the affected indexes for this first Givens rotation are chosen randomly, the algorithm almost certainly will eventually choose and thus yield the point with , where and are chosen such that is maximized, at the beginning of a sweep. For such a point, the inequality
holds. This is not possible, because and is an upper bound for any element of .
Iii-E Fréchet Means of sets of KLDSs
Due to the metric property, the alignment distance allows for the employment Fréchet means on subsets of . Consider a finite subset of a space equipped with a metric . The Fréchet mean  of is the minimizer
This may look like an abstract concept at first, but it is a natural way to choose a representative point out of a set of candidates. We want this representative point not to be far away of any of the reference points, which is why the squared sum of metrics to all these reference points is minimized. By doing so, we make sure through the triangle inequality, that the metric from a test sample to the representative point differs as little as possible from the metric from a test sample to any reference point. Fig. 2 visualizes this intuition: if we consider a reference point , a testing point and a representative point , the triangle inequality yields
A small value for thus makes sure that is a sensible approximation of for proximity based classification tasks.
Consider the finite set . The Fréchet mean of is the minimizer of the average of alignment distances to the KLDSs:
The minimization problem (45) can be approached iteratively, until the cost function
can not be further reduced.
Assume that at a given iteration , an approximate solution was determined and let be the minimizer of , i.e. the maximizer of , for each . Substituting this into (45) yields
This minimization problem can be solved separately for , and the rest of the parameters. With this in mind, let us split up the problem. For , this yields