Code for reproducing the experiments from the paper "Dynamic Texture Recognition via Nuclear Distances on Kernelized Scattering Histogram Spaces"
Distance-based dynamic texture recognition is an important research field in multimedia processing with applications ranging from retrieval to segmentation of video data. Based on the conjecture that the most distinctive characteristic of a dynamic texture is the appearance of its individual frames, this work proposes to describe dynamic textures as kernelized spaces of frame-wise feature vectors computed using the Scattering transform. By combining these spaces with a basis-invariant metric, we get a framework that produces competitive results for nearest neighbor classification and state-of-the-art results for nearest class center classification.
The term temporal or dynamic texture refers to a class of visual processes consisting of a temporally evolving image texture. Typical examples of dynamic textures are ocean waves, flags moving in the wind, or flames of a bonfire. In order to classify, retrieve or cluster dynamic textures, but also for purposes such as spatial or temporal video segmentation or identifying patterns in a dynamic scene, an expressive way to represent dynamic textures and to measure similarity between them is of great help. Since a dynamic texture is defined by the appearance of its individual frames, as well as the dynamics of their temporal progression, descriptors usually aim at capturing the behavior along the spatial and temporal axes.
Quite often, the dynamics are of considerably lesser importance for distinguishing videos of temporal textures than the appearance of individual frames. For instance, to tell apart a forest from an ocean, we do not need to consider whether the video was shot in stormy or calm weather conditions. In fact, dynamic textures can usually be recognized easily by observing isolated video frames. In this work, we therefore neglect the issue of appropriately describing the temporal transitions. Instead, we treat each dynamic texture as a set of image feature vectors describing individual frames. We then proceed to model each set as a finite-dimensional subspace of a kernel feature space that is represented as a coefficient matrix of an orthogonal basis. Since the same vector space is spanned by infinitely many orthogonal bases, we employ a basis-invariant metric.
This work incorporates two observations from previous publications. The first observation is that when we apply the Scattering transform to a texture image, its coefficient distributions constitute an expressive feature representation that performs well on recognition tasks when combined with probability product kernels [1]. The second observation is that visual processes can be well described as spaces computed by applying Kernel PCA [2] on distribution-based descriptors of the video frames, when employed together with a distance measure that is invariant to the bases of said spaces [3, 4]. The contributions of this work are as follows. First, we propose a feature extraction method that performs a Scattering transform on an image texture and computes a histogram from each Scattering subband. These histograms are then concatenated into a Scattering histogram vector and a Mercer kernel on pairs of these vectors is defined. Second, we introduce an algorithm which expects a dynamic texture video and returns a coefficient matrix that describes the kernel subspace containing the Scattering histogram vectors computed from the video. Third, we define a metric on pairs of these coefficient matrices. Finally, we describe an algorithm for computing Fréchet means [3] on finite sets of coefficient matrices with respect to the defined metric.
The classical approach to dynamic texture recognition is to first compute a linear autoregressive state-space model and then define a distance measure [5] on its parameters. Another common approach is to use multi-scale spatio-temporal filter responses of the dynamic texture video, e.g. the 2D+T Curvelet Transform [6] or the spatio-temporal receptive fields [7]. Beyond that, a considerable number of recognition methods in recent years are based on the idea of collecting features that are designed for 2D images from a video by applying them in three orthogonal planes (TOP) of the video cuboid [8, 9]. Methods based on computing sparse representations have also demonstrated remarkable performance [10]. Interestingly, while neural networks play a significant role in visual process recognition nowadays [11], classical end-to-end deep learning is more the exception than the rule. One reason for this might be that even today it is not straightforward to collect enough dynamic texture sequences to train a deep neural net with superior performance.
Simply put, a Scattering transform [12] is a convolutional neural net (CNN) with fixed weights and without channel recombination, where the absolute value operator is consistently used as the activation function. Consider an operator that creates a subband decomposition of the input, and applies the absolute value operator to each non-lowpass output. The first element of the output tuple, written as in the following, is a lowpass representation of the input. The other signals contained in the tuple can be subjected to the operator again, yielding further subband decompositions with the absolute value applied to the bandpass signals. This procedure can be repeated several times, yielding a tree structure like in Figure 1, where the black circles denote the absolute values of bandpass signals and the white circles denote lowpass signals. The -depth Scattering transform of a signal is the collection of (only) the lowpass signals created by constructing a tree with layers. These signals are referred to as subbands in the following. Each Scattering subband of has the form
(1) |
with and . The Scattering transform of results in a tuple containing all subbands. For the subband at the top layer (), let us fix the notation . As an optional step, the Scattering subbands can be normalized [1, 13] in order to decorrelate them. Let us define the normalized Scattering transform as a collection of subbands with
(2) |
where denotes the signal average, and
(3) |
for .
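To make the tree construction concrete, the following sketch builds such a tree for a one-dimensional signal. It is only an illustration under stated assumptions: the function names `toy_operator` and `toy_scattering` are hypothetical, and the filters (a moving average and two difference filters) are simple stand-ins for the wavelet filters of an actual Scattering transform. The optional normalization is omitted.

```python
import numpy as np

def toy_operator(x):
    """Toy subband decomposition: one lowpass output and |bandpass| outputs.

    The filters are hypothetical stand-ins (a moving average and two
    difference filters), not the wavelets of an actual Scattering transform.
    """
    lowpass = np.convolve(x, np.ones(4) / 4.0, mode="same")
    fine = np.abs(np.convolve(x, np.array([1.0, -1.0]) / 2.0, mode="same"))
    coarse = np.abs(
        np.convolve(x, np.array([1.0, 1.0, -1.0, -1.0]) / 4.0, mode="same")
    )
    return lowpass, [fine, coarse]

def toy_scattering(x, depth):
    """Collect only the lowpass signals of a tree with `depth` layers."""
    subbands = []
    layer = [np.asarray(x, dtype=float)]
    for _ in range(depth):
        next_layer = []
        for signal in layer:
            lowpass, bandpass_abs = toy_operator(signal)
            subbands.append(lowpass)         # white circle: kept as a subband
            next_layer.extend(bandpass_abs)  # black circles: decomposed again
        layer = next_layer
    return subbands
```

With two bandpass outputs per node, a tree with three layers yields 1 + 2 + 4 = 7 subbands in this toy setting.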
In [1], a texture retrieval method based on Scattering subband distributions has been proposed, in which texture images are described by the Weibull distribution parameters computed from the coefficients of each subband via maximum likelihood. We can thus conclude that the subband-wise distributions capture distinctive features from image textures. The framework presented in the following is based on the same feature extraction mechanism, but the distributions are modeled as histograms. Given a dynamic texture sequence of vectorized video frames, let us assume that we have applied the Scattering transform to each one of the frames and computed a histogram of from each subband. To describe a video, we are thus given a matrix
(4) |
with Scattering histogram vectors as its columns, where is the number of Scattering subbands.
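A minimal sketch of how such a matrix could be assembled with NumPy is given below. The helper names, the shared `value_range` for the bin edges, and the normalization of each histogram to unit sum are assumptions made for illustration; the text above specifies neither bin edges nor normalization.

```python
import numpy as np

def scattering_histogram_vector(subbands, n_bins, value_range):
    """Concatenate one normalized histogram per subband into a single vector."""
    parts = []
    for s in subbands:
        counts, _ = np.histogram(np.ravel(s), bins=n_bins, range=value_range)
        # Normalize to unit sum; guard against an empty histogram.
        parts.append(counts / max(counts.sum(), 1))
    return np.concatenate(parts)

def video_to_histogram_matrix(frames_subbands, n_bins, value_range):
    """Stack the per-frame vectors as columns (one column per frame).

    frames_subbands: list over frames, each a list of subband arrays.
    """
    cols = [
        scattering_histogram_vector(sb, n_bins, value_range)
        for sb in frames_subbands
    ]
    return np.stack(cols, axis=1)
```

Using the same bin edges for all frames keeps the histograms of different frames and videos comparable, which is why a fixed `value_range` is passed in rather than estimated per frame.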
Assuming that the Scattering histogram vectors are an appropriate way to represent the individual frames, the entire set of the columns in should capture the essential information of the whole dynamic texture, if we neglect the dynamics and the order of the frames. What we aim for is thus a compact way to represent this set. However, the classical approach of performing a principal component analysis (PCA) and representing by the principal subspace would likely fail, because low-dimensional linear vector spaces are not a fitting model for sets of histograms. This is why histogram data is often mapped to a feature space using the kernel trick prior to further processing [4]. Following the insights from [1], we employ the Bhattacharyya kernel which, for two probability distributions over the sample space , is defined as
(5) |
Let us make the (simplifying) assumption that the coefficient distributions of different Scattering subbands are independent. Then we can evaluate the Bhattacharyya kernel on a pair of Scattering histogram vectors by computing
(6) |
For the sake of readability, we fix the notation
(7) |
with , for the matrix-wise application of .
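Under the independence assumption, the joint distribution factorizes over the subbands, so the Bhattacharyya kernel of two Scattering histogram vectors becomes a product of per-subband Bhattacharyya coefficients, each being the sum over the bins of the square roots of the products of two histograms. The following sketch evaluates this for all pairs of columns of two matrices. The function name `bhattacharyya_gram` and the data layout (equally sized, unit-sum histogram blocks stacked subband by subband) are assumptions.

```python
import numpy as np

def bhattacharyya_gram(X, Y, n_bins):
    """Gram matrix of the product-of-subbands Bhattacharyya kernel.

    X, Y: arrays of shape (n_subbands * n_bins, n_frames) whose columns are
    concatenated, per-subband normalized histograms.
    """
    n_sub = X.shape[0] // n_bins
    sx = np.sqrt(X).reshape(n_sub, n_bins, -1)
    sy = np.sqrt(Y).reshape(n_sub, n_bins, -1)
    # Per-subband Bhattacharyya coefficients: shape (n_sub, n_x, n_y).
    per_subband = np.einsum("sbi,sbj->sij", sx, sy)
    # Independence assumption: multiply the coefficients over the subbands.
    return per_subband.prod(axis=0)
```

For unit-sum histograms, every entry lies between 0 and 1, and the kernel of a vector with itself equals 1.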
Using and , we could now proceed to apply Kernel PCA to compute a low-dimensional subspace of a feature space corresponding to . Kernel PCA is typically carried out by performing a truncated Eigenvalue or Singular Value Decomposition (SVD) of the Gram matrix and returns a coefficient matrix that, together with and , parameterizes an orthogonal basis spanning the -dimensional principal subspace of the feature space representation of [2]. The orthogonality of a basis described by a parameter pair can be easily verified using the condition
(8) |
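As an illustration, the following sketch computes such a coefficient matrix from a Gram matrix via an eigenvalue decomposition and scales the leading eigenvectors so that `A.T @ K @ A` is the identity, which is how the orthogonality condition is read here. The symbols `K`, `A`, `k` and the function name are chosen for this sketch, and the Gram matrix is not centered, which is an assumption.

```python
import numpy as np

def kernel_pca_coefficients(K, k):
    """Return A (n x k) such that A.T @ K @ A is the identity.

    K: symmetric positive semi-definite Gram matrix (n x n).
    k: subspace dimension; the k largest eigenvalues must be positive.
    """
    eigvals, eigvecs = np.linalg.eigh(K)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    lam, V = eigvals[top], eigvecs[:, top]
    return V / np.sqrt(lam)                    # scale column j by 1/sqrt(lam_j)
```

Scaling by the inverse square roots of the eigenvalues is what turns orthonormal eigenvectors of the Gram matrix into coefficients of an orthonormal basis in the feature space.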
However, storing the whole matrix for each video sequence is rarely sustainable. So in addition to performing Kernel PCA, we include a Nyström interpolation [14] step to approximate the computed subspace by a small subset of columns of , for which the inequality holds. To do so, we collect representative columns from in a matrix and choose the coefficient matrix , such that an appropriate metric between and , is minimized. We choose the metric
(9) |
which measures the squared error between the basis vectors in the feature space. Since we can assume that both involved bases are orthogonal, i.e. Eq. (8) holds for and , we can write it as
(10) |
We thus define
(11) |
Due to Eq. (8), the solution of Eq. (11) must have the form , where denote the SVD factors of and is a matrix with orthogonal columns, i.e. . Eq. (11) thus boils down to
(12) |
which can be solved using the SVD.
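The approximation step can be sketched as follows, under our reading of the problem: given the Gram matrix `K_zz` of the representative columns, the cross Gram matrix `K_zx` between the representatives and all columns, and Kernel PCA coefficients `A`, find `B` with `B.T @ K_zz @ B = I` that maximizes the trace of `A.T @ K_zx.T @ B`. All names are hypothetical, and the eigenvalue threshold `eps` is an added safeguard against rank deficiency, not something stated in the text.

```python
import numpy as np

def nystroem_coefficients(K_zz, K_zx, A, eps=1e-10):
    """Coefficients B with B.T @ K_zz @ B = I that best match the basis (X, A).

    K_zz: Gram matrix of the m representatives (m x m).
    K_zx: cross Gram matrix between representatives and all columns (m x n).
    A:    Kernel PCA coefficients (n x k) with A.T @ K_xx @ A = I.
    """
    sig, U = np.linalg.eigh(K_zz)
    keep = sig > eps * sig.max()                # drop numerically zero directions
    inv_sqrt = U[:, keep] / np.sqrt(sig[keep])  # maps to an orthonormal basis
    M = inv_sqrt.T @ K_zx @ A                   # shape (r, k), requires r >= k
    P, _, Qt = np.linalg.svd(M, full_matrices=False)
    return inv_sqrt @ (P @ Qt)                  # orthonormal-column factor of M
```

The product `P @ Qt` has orthonormal columns and maximizes the trace of its product with `M.T`, so the returned basis is the closest one, in the squared-error sense, among all orthogonal bases expressible through the representatives.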
Computing the subspace parameters of a dynamic texture is described in Algorithm 1.
Note that the parameters computed by Algorithm 1 are not unique. Consider a parameter pair describing an orthogonal basis in the feature space. Then, all pairs in the set
(13) |
describe different orthogonal bases of the same space. A semantically meaningful distance measure on kernel subspaces should be invariant to orthogonal transformations of . This is not the case for Eq. (10): Given a pair , we observe
(14) |
even though both and pairs describe the same subspace. We overcome this ambiguity by choosing always such that is minimized. We denote the result
(15) |
the kernelized Nuclear distance, as it is obtained by replacing the trace in Eq. (10) by the nuclear norm. On two matrix pairs , the Nuclear distance is computed as
(16) |
Eq. (16) is actually a special case of the kernelized Alignment distance [15, 3]. As such, it inherits its metric property on the equivalence classes in Eq. (13). Specifically, , i.e. the square root of Eq. (16) is symmetric, positive definite and fulfills the triangle inequality [3].
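A minimal sketch of this distance is given below. It assumes that, for two orthogonal bases of dimension k, the squared error between the basis vectors equals 2k minus twice the trace of `A.T @ K_xy @ B`, and that the Nuclear distance replaces this trace by the nuclear norm (the sum of singular values). The function and argument names are hypothetical.

```python
import numpy as np

def nuclear_distance_sq(A, K_xy, B):
    """Squared kernelized Nuclear distance between the bases (X, A) and (Y, B).

    A:    coefficients of the first basis (n_x x k), orthogonal in feature space.
    K_xy: cross Gram matrix between the two sets of columns (n_x x n_y).
    B:    coefficients of the second basis (n_y x k), orthogonal in feature space.
    """
    k = A.shape[1]
    cross = A.T @ K_xy @ B                         # k x k basis inner products
    nuclear_norm = np.linalg.svd(cross, compute_uv=False).sum()
    return max(2.0 * k - 2.0 * nuclear_norm, 0.0)  # clip tiny negative round-off
```

Because singular values do not change when `cross` is multiplied by an orthogonal matrix, the result is the same for every basis of the same subspace, whereas the trace-based expression is not.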
Certain scenarios, such as clustering or nearest class center (NCC) classification, require the notion of an average for finite sets of descriptors. Unfortunately, the set of kernel subspaces with dimension is a non-trivial manifold, which means we cannot compute a mean using the arithmetic average. Thanks to Eq. (16), however, we are operating on a metric space, which means that we can define averages using the Fréchet mean. Consider a set . Its Fréchet mean is given by
(17) |
To solve the optimization problem, we follow the approach in [3]. We approximate by computing cluster centers using the -means algorithm on the columns of . Next, is approached using an alternating scheme with two steps for each iteration. The first step consists of finding a set of orthogonal matrices , such that
(18) |
is fulfilled, for every . This can be easily achieved using a projection onto the orthogonal group. The second step is approximating by computing
(19) |
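The projection onto the orthogonal group used in the first step can be sketched as follows. This is the standard SVD-based construction for square matrices; the function name is hypothetical, and how the matrix `M` is formed from the coefficient matrices is not shown here.

```python
import numpy as np

def project_to_orthogonal_group(M):
    """Closest orthogonal matrix to the square matrix M in the Frobenius norm."""
    U, _, Vt = np.linalg.svd(M)
    # Replacing all singular values by 1 yields the orthogonal factor.
    return U @ Vt
```

The returned matrix `Q` also maximizes the trace of `Q.T @ M` over all orthogonal matrices, and the maximum equals the nuclear norm of `M`.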
These two steps are repeated until convergence of the loss function in Eq. (17). Due to space constraints, we omit pseudo-code for the final algorithm and refer the reader to [3] for a detailed description of a similar algorithm.
The DynTex database [16] is a collection of high-resolution RGB texture videos. Three splits have been compiled for classification benchmarking.
DynTex Alpha is composed of 60 videos divided into the 3 classes Sea (20 videos), Grass (20), and Trees (20).
DynTex Beta is composed of 162 videos divided into the 10 classes Sea (20), Vegetation (20), Trees (20), Flags (20), Calm Water (20), Fountains (20), Smoke (16), Escalator (7), Traffic (9) and Rotation (10).
DynTex Gamma is composed of 264 videos divided into the 10 classes Flowers (29), Sea (38), Naked trees (25), Foliage (35), Escalator (7), Calm water (30), Flags (31), Grass (23), Traffic (9) and Fountains (37). Some works [6] use a different version of this split containing 275 videos.
We perform 1-NN (Nearest Neighbor) and NCC classification experiments on the three splits. To this end, Kernelized Scattering Histogram Spaces (KSHS) are computed from each video using Algorithm 1 with . The Scattering transform is computed using Kymatio [17] on CUDA with the arguments L=4, J=4. Histograms with are calculated from both regular and normalized (KNSHS) subbands. For NCC classification, a Fréchet mean is computed from each class using the procedure described in Section 5 with and the leave-one-out protocol proposed in [6]. Parameters were chosen using grid search. Code for reproducing the experimental results is available online at https://github.com/alexandersagel/kshs.
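Once all pairwise distances between the video descriptors are available, nearest neighbor classification reduces to a lookup in the distance matrix. The following sketch evaluates 1-NN accuracy in a leave-one-out fashion; the function name and the use of leave-one-out for the 1-NN case are assumptions, and the exact protocol of [6] may differ.

```python
import numpy as np

def loo_1nn_accuracy(D, labels):
    """Leave-one-out 1-NN accuracy from a symmetric pairwise distance matrix.

    D:      (n x n) matrix with D[i, j] the distance between samples i and j.
    labels: length-n sequence of class labels.
    """
    D = np.array(D, dtype=float)      # copy, so the caller's matrix is untouched
    np.fill_diagonal(D, np.inf)       # a sample may not be its own neighbor
    labels = np.asarray(labels)
    predicted = labels[np.argmin(D, axis=1)]
    return float(np.mean(predicted == labels))
```

For NCC classification, the same lookup would be applied to the distances between each test descriptor and the class-wise Fréchet means instead of the individual training descriptors.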
Method | Alpha 1-NN | Alpha NCC | Beta 1-NN | Beta NCC | Gamma 1-NN | Gamma NCC |
LBP-TOP | 96.7 % | - | 85.8 % | - | 84.9 % | - |
PCANet-TOP | 96.7 % | - | 90.7 % | - | 89.4 % | - |
DFS | - | 83.6 % | - | 65.2 % | - | 60.8 % |
2D+T C. | - | 85.0 % | - | 67.0 % | - | -* |
OTDL | - | 86.6 % | - | 69.0 % | - | 64.2 % |
CLSP-TOP | 95.0 % | - | 92.0 % | - | 91.3 % | - |
STRF N-jet | 100.0 % | - | 93.8 % | - | 91.2 % | - |
B3DF | 96.7 % | 90.0 % | 90.1 % | 74.1 % | -* | -* |
SOE-NET | 98.3 % | 96.7 % | 96.9 % | 86.4 % | -* | -* |
SoB+Align | 98.3 % | 88.3 % | 90.1 % | 75.3 % | 79.9 % | 67.1 % |
KSHS+Ncl. | 98.3 % | 96.7 % | 88.9 % | 88.3 % | 88.6 % | 86.7 % |
KNSHS+Ncl. | 98.3 % | 96.7 % | 93.2 % | 90.1 % | 91.3 % | 89.8 % |
*Reported results refer to 275-video version of DynTex Gamma.
Table 1 shows the 1-NN and NCC classification results in comparison with LBP-TOP [8, 11], PCANet-TOP [18], Dynamic Fractal Spectrum (DFS) [19, 10], Spatiotemporal Curvelet Transform (2D+T Curvelet) [6], Orthogonal Tensor Dictionary Learning (OTDL) [10], Completed Local Structure Patterns in Three Orthogonal Planes (CLSP-TOP) [9], Spatio-temporal Receptive Fields (STRF N-jet) [7], Binarized 3D features (B3DF) [20], Spatiotemporal Oriented Energy Network (SOE-NET) [21], and Systems of Bags with the Alignment Distance (SoB+Align) [3]. Results that have not been reported or are not applicable are indicated by ’-’.
In line with the results in [1], normalizing the Scattering coefficients improves the recognition performance: in no experimental setting does KNSHS perform worse than KSHS. One explanation is that normalized coefficients are closer to fulfilling the independence assumption made in Section 3. Overall, KNSHS in combination with the Nuclear distance competes well with current approaches. For 1-NN classification, it yields the same success rate as CLSP-TOP on the Gamma split, and is outperformed by STRF N-jet on the Alpha split and by STRF N-jet and SOE-NET on the Beta split of DynTex. More remarkable are the results for NCC classification. Not only does our method yield higher success rates than many of the state-of-the-art approaches in the literature, but it does so with a notably small performance gap with regard to 1-NN classification. This could be due to using Fréchet means as class centers: Because of the triangle inequality, the distance from a test point to the Fréchet mean of a set is always a good approximation of the distance from the test point to any point in said set [3]. Hence, NCC should yield results similar to those of 1-NN.
In this work, we have proposed a Scattering-based feature extraction method for dynamic textures using Kernel PCA, and described a distance that accounts for the non-uniqueness of the extracted features. Additionally, we have briefly outlined a procedure to compute abstract averages from finite sets of such features. We have evaluated the proposed method on 1-NN and NCC classification and have observed state-of-the-art results for the latter scenario. The capabilities of computing expressive features and measuring the distance between pairs thereof, in addition to computing abstract averages, are also useful in related recognition tasks, such as retrieval, clustering or even segmentation of video data.
2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 1932–1939.
“Dynamic texture and scene classification by transferring deep image features,” Neurocomputing, vol. 171, pp. 1230–1241, 2016.
Journal of Machine Learning Research, vol. 6, no. Dec, pp. 2153–2175, 2005.