Code for reproducing the experiments from the paper "Dynamic Texture Recognition via Nuclear Distances on Kernelized Scattering Histogram Spaces"
Distance-based dynamic texture recognition is an important research field in multimedia processing with applications ranging from retrieval to segmentation of video data. Based on the conjecture that the most distinctive characteristic of a dynamic texture is the appearance of its individual frames, this work proposes to describe dynamic textures as kernelized spaces of frame-wise feature vectors computed using the Scattering transform. By combining these spaces with a basis-invariant metric, we get a framework that produces competitive results for nearest neighbor classification and state-of-the-art results for nearest class center classification.
The term temporal or dynamic texture refers to a class of visual processes consisting of a temporally evolving image texture. Typical examples of dynamic textures are ocean waves, flags moving in the wind, or flames of a bonfire. In order to classify, retrieve or cluster dynamic textures, but also for purposes such as spatial or temporal video segmentation or identifying patterns in a dynamic scene, an expressive way to represent dynamic textures and to measure similarity between them is of great help. Since a dynamic texture is defined by the appearance of its individual frames, as well as the dynamics of their temporal progression, descriptors usually aim at capturing the behavior along the spatial and temporal axes.
Quite often, the dynamics are of considerably lesser importance for distinguishing videos of temporal textures than the appearance of individual frames. For instance, to tell a forest apart from an ocean, we do not need to consider whether the video was shot in stormy or calm weather conditions. In fact, dynamic textures can usually be recognized easily by observing isolated video frames. In this work, we therefore neglect the issue of appropriately describing the temporal transitions. Instead, we treat each dynamic texture as a set of image feature vectors describing individual frames. We then proceed to model each set as a finite-dimensional subspace of a kernel feature space that is represented as a coefficient matrix of an orthogonal basis. Since any such space is spanned by infinitely many orthogonal bases, we employ a basis-invariant metric.
This work incorporates two observations from previous publications. The first observation is that when we apply the Scattering transform to a texture image, its coefficient distributions constitute an expressive feature representation that performs well on recognition tasks when combined with probability product kernels. The second observation is that visual processes can be well described as spaces computed by applying Kernel PCA to distribution-based descriptors of the video frames, when employed together with a distance measure that is invariant to the bases of said spaces [3, 4].
The contributions of this work are as follows. First, we propose a feature extraction method that performs a Scattering transform on an image texture and computes a histogram from each Scattering subband. These histograms are then concatenated into a Scattering histogram vector and a Mercer kernel on pairs of these vectors is defined. Second, we introduce an algorithm which expects a dynamic texture video and returns a coefficient matrix that describes the kernel subspace containing the Scattering histogram vectors computed from the video. Next, we define a metric on pairs of these coefficient matrices. Finally, we describe an algorithm for computing Fréchet means on finite sets of coefficient matrices with respect to the defined metric.
The classical approach to dynamic texture recognition is to first compute a linear autoregressive state-space model and then define a distance measure on its parameters. Another common approach is to use multi-scale spatio-temporal filter responses of the dynamic texture video, e.g. the 2D+T Curvelet Transform or spatio-temporal receptive fields. Beyond that, a considerable number of recognition methods in recent years are based on the idea of collecting features that are designed for 2D images from a video by applying them in three orthogonal planes (TOP) of the video cuboid [8, 9]. Methods based on computing sparse representations have also demonstrated remarkable performance. Interestingly, while neural networks play a significant role in visual process recognition nowadays, classical end-to-end deep learning is more the exception than the rule. One reason for this might be that even today it is not straightforward to collect enough dynamic texture sequences to train a deep neural net with superior performance.
Simply put, a Scattering transform is a convolutional neural net (CNN) with fixed weights and without channel recombination, where the absolute value operator is consistently used as the activation function. Consider an operator that creates a subband decomposition of the input, and applies the absolute value operator to each non-lowpass output. The first element of the output tuple, written as in the following, is a lowpass representation of the input. The other signals contained in the tuple can be subjected to the operator again, yielding further subband decompositions with the absolute value applied to the bandpass signals. This procedure can be repeated several times, yielding a tree structure like in Figure 1, where the black circles denote the absolute values of bandpass signals and the white circles denote lowpass signals. The -depth Scattering transform of a signal is the collection of (only) the lowpass signals created by constructing a tree with layers. These signals are referred to as subbands in the following. Each Scattering subband of has the form
with and . The Scattering transform of results in a tuple containing all subbands. For the subband at the top layer (), let us fix the notation . As an optional step, the Scattering subbands can be normalized [1, 13] in order to decorrelate them. Let us define the normalized Scattering transform as a collection of subbands with
where denotes the signal average, and
In , a texture retrieval method based on Scattering subband distributions has been proposed, in which texture images are described by the Weibull distribution parameters computed from the coefficients of each subband via maximum likelihood. We can thus conclude that the subband-wise distributions capture distinctive features from image textures. The framework presented in the following is based on the same feature extraction mechanism, but the distributions are modeled as histograms. Given a dynamic texture sequence of vectorized video frames, let us assume that we have applied the Scattering transform to each one of the frames and computed a histogram of from each subband. To describe a video, we are thus given a matrix
with Scattering histogram vectors as its columns, where is the number of Scattering subbands.
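The construction of this matrix can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names, the bin count `n_bins` and the histogram range `value_range` are assumptions made here, and random arrays stand in for the Scattering subbands that a library such as Kymatio would provide.

```python
import numpy as np

def scattering_histogram_vector(subbands, n_bins=16, value_range=(0.0, 1.0)):
    """Concatenate one normalized histogram per subband into a single vector.

    subbands: array of shape (S, H, W), one 2-D coefficient map per subband.
    n_bins, value_range: illustrative assumptions, not values from the paper.
    """
    hists = []
    for band in subbands:
        counts, _ = np.histogram(band.ravel(), bins=n_bins, range=value_range)
        # Normalize so that every subband histogram sums to one.
        hists.append(counts / max(counts.sum(), 1))
    return np.concatenate(hists)

def video_histogram_matrix(frames_subbands, **kwargs):
    """Stack the histogram vectors of all frames as columns of one matrix."""
    cols = [scattering_histogram_vector(sb, **kwargs) for sb in frames_subbands]
    return np.stack(cols, axis=1)

# Toy stand-in for Scattering output: 5 frames, 3 subbands of 8x8 coefficients.
rng = np.random.default_rng(0)
toy = rng.random((5, 3, 8, 8))
X = video_histogram_matrix(toy, n_bins=4)
```

Each column of `X` then describes one frame, and each block of `n_bins` consecutive rows holds the histogram of one subband.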
Assuming that the Scattering histogram vectors are an appropriate way to represent the individual frames, the entire set of the columns in should capture the essential information of the whole dynamic texture, if we neglect the dynamics and the order of the frames. What we aim for is thus a compact way to represent this set. However, the classical approach of performing a principal component analysis (PCA) and representing the set by the principal subspace would likely fail, because low-dimensional linear vector spaces are not a fitting model for sets of histograms. This is why histogram data is often mapped to a feature space using the kernel trick prior to further processing. Following the insights from , we employ the Bhattacharyya kernel which, for two probability distributions $p, q$ over the sample space $\Omega$, is defined as
$$k(p, q) = \int_{\Omega} \sqrt{p(x)\, q(x)}\, \mathrm{d}x.$$
Let us make the (simplifying) assumption that the coefficient distributions of different Scattering subbands are independent. Then we can evaluate the Bhattacharyya kernel on a pair of Scattering histogram vectors by computing
For the sake of readability, we fix the notation
with , for the matrix-wise application of .
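Under the independence assumption, the kernel on two Scattering histogram vectors factorizes into a product of per-subband Bhattacharyya coefficients, each of which is the sum of the square roots of the bin-wise products. The following sketch illustrates this together with the matrix-wise application. The function names and the argument `n_subbands` are hypothetical, and each subband histogram is assumed to sum to one.

```python
import numpy as np

def bhattacharyya_product_kernel(a, b, n_subbands):
    """Product over subbands of the discrete Bhattacharyya coefficient.

    a, b: concatenated histogram vectors of equal length, each made of
    n_subbands histograms that individually sum to one.
    """
    a = np.asarray(a, dtype=float).reshape(n_subbands, -1)
    b = np.asarray(b, dtype=float).reshape(n_subbands, -1)
    # Discrete Bhattacharyya coefficient per subband: sum_i sqrt(a_i * b_i).
    per_band = np.sqrt(a * b).sum(axis=1)
    return float(np.prod(per_band))

def kernel_matrix(X, Y, n_subbands):
    """Matrix-wise application: entry (i, j) compares column i of X with column j of Y."""
    return np.array(
        [[bhattacharyya_product_kernel(x, y, n_subbands) for y in Y.T] for x in X.T]
    )

# Two toy vectors with 2 subbands of 2 bins each.
a = np.array([0.5, 0.5, 1.0, 0.0])
b = np.array([0.5, 0.5, 0.0, 1.0])
G = kernel_matrix(np.stack([a, b], axis=1), np.stack([a, b], axis=1), n_subbands=2)
```

Here `a` and `b` agree on the first subband but have disjoint support on the second, so their kernel value is zero, while every vector has kernel value one with itself.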
Using the histogram matrix and the kernel, we could now proceed to apply Kernel PCA to compute a low-dimensional subspace of the corresponding feature space. Kernel PCA returns a coefficient matrix that, together with the histogram matrix and the kernel, parameterizes an orthogonal basis spanning the low-dimensional principal subspace of the feature space representation of the frames. The orthogonality of a basis described by such a parameter pair can be easily verified using the condition
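As an illustration, the sketch below computes such a coefficient matrix from a Gram matrix and checks orthogonality. It assumes that the condition takes the usual form `A.T @ K @ A = I`, where `K` is the Gram matrix of the frame descriptors and `A` the coefficient matrix; these names, the omission of centering, and the toy linear kernel are assumptions made for this example.

```python
import numpy as np

def kernel_pca_coefficients(K, dim):
    """Return A such that the basis Phi(X) @ A is orthonormal, i.e. A.T @ K @ A = I.

    K: symmetric positive semi-definite Gram matrix of the frame descriptors.
    dim: subspace dimension. No centering is applied (a simplifying assumption).
    """
    eigvals, eigvecs = np.linalg.eigh(K)      # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:dim]     # indices of the largest ones
    lam, V = eigvals[idx], eigvecs[:, idx]
    return V / np.sqrt(lam)                   # scale each column by 1/sqrt(lambda)

def is_orthonormal(A, K, tol=1e-8):
    """Check the orthogonality condition A.T @ K @ A = I."""
    return np.allclose(A.T @ K @ A, np.eye(A.shape[1]), atol=tol)

rng = np.random.default_rng(0)
F = rng.random((6, 10))
K = F.T @ F                                   # toy linear-kernel Gram matrix, 10 "frames"
A = kernel_pca_coefficients(K, dim=3)
```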
However, storing the whole matrix for each video sequence is rarely sustainable.
So in addition to performing Kernel PCA, we include a Nyström interpolation step to approximate the computed subspace by a small subset of columns of , for which the inequality holds. To do so, we collect representative columns from in a matrix and choose the coefficient matrix , such that an appropriate metric between and , is minimized. We choose the metric
which measures the squared error between the basis vectors in the feature space. Since we can assume that both involved bases are orthogonal, i.e. Eq. (8) holds for and , we can write it as
We thus define
which can be solved using the SVD.
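One way to carry out such an SVD-based solution is sketched below. It is not taken from the paper: it assumes that the problem amounts to maximizing `trace(A.T @ K_xz @ B)` subject to `B.T @ K_zz @ B = I`, which is equivalent to minimizing the squared error between two orthonormal bases. All names are hypothetical, and the first four frames serve as arbitrary representatives.

```python
import numpy as np

def nystroem_coefficients(K_zz, K_zx, A):
    """Coefficients B on representatives Z that best match the basis (X, A).

    Maximizes trace(A.T @ K_xz @ B) subject to B.T @ K_zz @ B = I.
    K_zz: Gram matrix of the representatives (assumed positive definite).
    K_zx: cross Gram matrix between representatives and all frames.
    """
    w, V = np.linalg.eigh(K_zz)
    K_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T   # K_zz^{-1/2}
    M = K_inv_sqrt @ K_zx @ A
    # Orthogonal Procrustes step: Q = U @ Vt maximizes trace(Q.T @ M).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return K_inv_sqrt @ U @ Vt

rng = np.random.default_rng(1)
F = rng.random((6, 10))
K = F.T @ F                                  # toy Gram matrix of 10 "frames"
lam, V = np.linalg.eigh(K)
A = V[:, -2:] / np.sqrt(lam[-2:])            # orthonormal 2-D basis on all frames
B = nystroem_coefficients(K[:4, :4], K[:4, :], A)
```

The substitution `B = K_zz^{-1/2} Q` turns the constraint into `Q.T @ Q = I`, which is why a single SVD suffices.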
Computing the subspace parameters of a dynamic texture is described in Algorithm 1.
Note that the parameters computed by Algorithm 1 are not unique. Consider a parameter pair describing an orthogonal basis in the feature space. Then, all pairs in the set
describe different orthogonal bases of the same space. A semantically meaningful distance measure on kernel subspaces should be invariant to orthogonal transformations of . This is not the case for Eq. (10): Given a pair , we observe
even though both and pairs describe the same subspace. We overcome this ambiguity by choosing always such that is minimized. We denote the result
the kernelized Nuclear distance, as it is obtained by replacing the trace in Eq. (10) by the nuclear norm. On two matrix pairs , the Nuclear distance is computed as
Eq. (16) is actually a special case of the kernelized Alignment distance [15, 3]. As such, it inherits its metric property on the equivalence classes in Eq. (13). Specifically, , i.e. the square root of Eq. (16) is symmetric, positive definite and fulfills the triangle inequality.
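The basis invariance can be checked numerically. The sketch below assumes that, for two orthonormal bases with `k` columns each, the trace form in Eq. (10) reads `2k - 2 * trace(A1.T @ K12 @ A2)`, so that the squared Nuclear distance becomes `2k - 2 * ||A1.T @ K12 @ A2||_*`. The function name and the toy data are illustrative.

```python
import numpy as np

def nuclear_distance_sq(A1, A2, K12):
    """Squared kernelized Nuclear distance between two orthonormal bases.

    A1, A2: coefficient matrices with k columns each.
    K12: cross Gram matrix between the two underlying sample sets.
    """
    k = A1.shape[1]
    M = A1.T @ K12 @ A2
    # The nuclear norm (sum of singular values) is unchanged when A1 or A2
    # is multiplied from the right by an orthogonal matrix.
    return max(2.0 * k - 2.0 * np.linalg.norm(M, ord="nuc"), 0.0)

rng = np.random.default_rng(0)
F = rng.random((6, 10))
K = F.T @ F
lam, V = np.linalg.eigh(K)
A = V[:, -3:] / np.sqrt(lam[-3:])                  # orthonormal basis: A.T @ K @ A = I
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal matrix

d_same = nuclear_distance_sq(A, A @ Q, K)          # same subspace, rotated basis
d_trace = 2 * 3 - 2 * np.trace(A.T @ K @ A @ Q)    # trace form for comparison
```

`d_same` vanishes up to rounding because both parameter pairs describe the same subspace, whereas the trace form `d_trace` generally does not.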
Certain scenarios, such as clustering or nearest class center (NCC) classification, require the notion of an average for finite sets of descriptors. Unfortunately, the set of kernel subspaces with dimension is a non-trivial manifold, which means we cannot compute a mean using the arithmetic average. Thanks to Eq. (16), we are operating on a metric space, which means that we can define averages using the Fréchet mean. Consider a set . Its Fréchet mean is given by
To solve the optimization problem, we follow the approach in . We approximate by computing cluster centers using the k-means algorithm on the columns of . Next, is approached using an alternating scheme with two steps for each iteration. The first step consists of finding a set of orthogonal matrices , such that
is fulfilled, for every . This can be easily achieved using a projection onto the orthogonal group. The second step is approximating by computing
These two steps are repeated until convergence of the loss function in Eq. (17). Due to space constraints, we omit pseudo-code for the final algorithm and refer the reader to  for a detailed description of a similar algorithm.
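The projection onto the orthogonal group used in the first step can be sketched as follows. The sketch assumes that this step is an orthogonal Procrustes problem, i.e. that for a square matrix `M` (for instance the product of a coefficient matrix, a cross Gram matrix and the current mean estimate) we seek the orthogonal `Q` maximizing `trace(Q.T @ M)`. The function name is hypothetical.

```python
import numpy as np

def project_to_orthogonal_group(M):
    """Orthogonal matrix Q maximizing trace(Q.T @ M), i.e. the polar factor of M."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
Q = project_to_orthogonal_group(M)

aligned_trace = np.trace(Q.T @ M)
nuclear_norm = np.linalg.norm(M, ord="nuc")
```

With this choice of `Q`, the trace attains the nuclear norm of `M`, so that after alignment the trace form and the nuclear form of the distance coincide.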
The DynTex database  is a collection of high-resolution RGB texture videos. Three splits have been compiled for classification benchmarking.
DynTex Alpha is composed of 60 videos divided into the 3 classes Sea (20 videos), Grass (20), and Trees (20).
DynTex Beta is composed of 162 videos divided into the 10 classes Sea (20), Vegetation (20), Trees (20), Flags (20), Calm Water (20), Fountains (20), Smoke (16), Escalator (7), Traffic (9) and Rotation (10).
DynTex Gamma is composed of 264 videos divided into the 10 classes Flowers (29), Sea (38), Naked trees (25), Foliage (35), Escalator (7), Calm water (30), Flags (31), Grass (23), Traffic (9) and Fountains (37). Some works  use a different version of this split containing 275 videos.
We perform 1-NN (Nearest Neighbor) and NCC classification experiments on the three splits. To this end, Kernelized Scattering Histogram Spaces (KSHS) are computed from each video using Algorithm 1 with . The Scattering transform is computed using Kymatio on CUDA with the arguments L=4, J=4. Histograms with are calculated from both regular and normalized (KNSHS) subbands. For NCC classification, a Fréchet mean is computed from each class using the procedure described in Section 5 with and the leave-one-out protocol proposed in . Parameters were chosen using grid search. Code for reproducing the experimental results is available online at https://github.com/alexandersagel/kshs.
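Once all pairwise distances are available, nearest neighbor classification reduces to a lookup in the distance matrix. The following sketch assumes a leave-one-out evaluation in which each video is classified by its nearest other video. The function name, the distance values and the labels are made up for illustration.

```python
def loo_1nn_accuracy(dist, labels):
    """Leave-one-out 1-NN accuracy from a symmetric pairwise distance matrix."""
    n = len(labels)
    correct = 0
    for i in range(n):
        # Nearest other sample; the sample itself is excluded.
        j = min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        correct += labels[j] == labels[i]
    return correct / n

# Toy distance matrix for four videos from two classes.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
labels = ["sea", "sea", "grass", "grass"]
accuracy = loo_1nn_accuracy(dist, labels)
```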
| Method | Alpha 1-NN | Alpha NCC | Beta 1-NN | Beta NCC | Gamma 1-NN | Gamma NCC |
|---|---|---|---|---|---|---|
| LBP-TOP | 96.7 % | - | 85.8 % | - | 84.9 % | - |
| PCANet-TOP | 96.7 % | - | 90.7 % | - | 89.4 % | - |
| DFS | - | 83.6 % | - | 65.2 % | - | 60.8 % |
| 2D+T C. | - | 85.0 % | - | 67.0 % | - | -* |
| OTDL | - | 86.6 % | - | 69.0 % | - | 64.2 % |
| CLSP-TOP | 95.0 % | - | 92.0 % | - | 91.3 % | - |
| STRF N-jet | 100.0 % | - | 93.8 % | - | 91.2 % | - |
| B3DF | 96.7 % | 90.0 % | 90.1 % | 74.1 % | -* | -* |
| SOE-NET | 98.3 % | 96.7 % | 96.9 % | 86.4 % | -* | -* |
| SoB+Align | 98.3 % | 88.3 % | 90.1 % | 75.3 % | 79.9 % | 67.1 % |
| KSHS+Ncl. | 98.3 % | 96.7 % | 88.9 % | 88.3 % | 88.6 % | 86.7 % |
| KNSHS+Ncl. | 98.3 % | 96.7 % | 93.2 % | 90.1 % | 91.3 % | 89.8 % |
*Reported results refer to 275-video version of DynTex Gamma.
Table 1 shows the 1-NN and NCC classification results in comparison with LBP-TOP [8, 11], PCANet-TOP, Dynamic Fractal Spectrum (DFS) [19, 10], Spatiotemporal Curvelet Transform (2D+T Curvelet), Orthogonal Tensor Dictionary Learning (OTDL), Completed Local Structure Patterns in Three Orthogonal Planes (CLSP-TOP), Spatio-temporal Receptive Fields (STRF N-jet), Binarized 3D features (B3DF), Spatiotemporal Oriented Energy Network (SOE-NET), and Systems of Bags with the Alignment Distance (SoB+Align). Results that have not been reported or are not applicable are indicated by '-'.
In line with the results in , normalizing the Scattering coefficients improves the recognition performance: in no experimental setting does KNSHS perform worse than KSHS. One explanation is that normalized coefficients are closer to fulfilling the independence assumption made in Section 3. Overall, KNSHS in combination with the Nuclear distance competes well with current approaches. For 1-NN classification, it yields the same success rate as CLSP-TOP on the Gamma split, is outperformed by STRF N-jet on the Alpha and Beta splits of DynTex, and is matched by SOE-NET on Alpha and outperformed by it on Beta. More remarkable are the results for NCC classification. Not only does our method yield higher success rates than many of the state-of-the-art approaches in the literature, but it does so with a remarkably small performance gap relative to 1-NN classification. This could be due to using Fréchet means as class centers: because of the triangle inequality, the distance from a test point to the Fréchet mean of a set is always a good approximation of the distance from the test point to any point in said set . Hence, NCC should yield results similar to 1-NN.
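The triangle inequality argument can be made explicit. The symbols are chosen here for illustration: $d$ denotes the metric, $t$ a test point, $x$ any element of a class and $m$ the Fréchet mean of that class.

```latex
% t: test point, x: any member of a class, m: the class center (Fréchet mean)
\begin{align}
d(t, x) &\le d(t, m) + d(m, x), \\
d(t, m) &\le d(t, x) + d(x, m), \\
\Rightarrow \quad \lvert d(t, x) - d(t, m) \rvert &\le d(x, m).
\end{align}
```

The approximation error is thus bounded by the distance between the class member and the class center, so it is small whenever the members of a class lie close to their Fréchet mean.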
In this work, we have proposed a Scattering-based feature extraction method for dynamic textures using Kernel PCA, and described a distance that accounts for the non-uniqueness of the extracted features. Additionally, we have briefly outlined a procedure to compute abstract averages from finite sets of such features. We have evaluated the proposed method on 1-NN and NCC classification and have observed state-of-the-art results for the latter scenario. The capability of computing expressive features and measuring the distance between pairs thereof, in addition to computing abstract averages, is also useful in related recognition tasks, such as retrieval, clustering or even segmentation of video data.
“Dynamic texture and scene classification by transferring deep image features,” Neurocomputing, vol. 171, pp. 1230–1241, 2016.
Journal of Machine Learning Research, vol. 6, no. Dec, pp. 2153–2175, 2005.