Dynamic Texture Recognition via Nuclear Distances on Kernelized Scattering Histogram Spaces

02/01/2021 ∙ by Alexander Sagel, et al. ∙ fortiss

Distance-based dynamic texture recognition is an important research field in multimedia processing with applications ranging from retrieval to segmentation of video data. Based on the conjecture that the most distinctive characteristic of a dynamic texture is the appearance of its individual frames, this work proposes to describe dynamic textures as kernelized spaces of frame-wise feature vectors computed using the Scattering transform. By combining these spaces with a basis-invariant metric, we get a framework that produces competitive results for nearest neighbor classification and state-of-the-art results for nearest class center classification.




1 Introduction

The term temporal or dynamic texture refers to a class of visual processes consisting of a temporally evolving image texture. Typical examples of dynamic textures are ocean waves, flags moving in the wind, or flames of a bonfire. In order to classify, retrieve or cluster dynamic textures, but also for purposes such as spatial or temporal video segmentation or identifying patterns in a dynamic scene, an expressive way to represent dynamic textures and to measure similarity between them is of great help. Since a dynamic texture is defined by the appearance of its individual frames, as well as the dynamics of their temporal progression, descriptors usually aim at capturing the behavior along the spatial and temporal axes.

Quite often, the dynamics are of considerably lesser importance for distinguishing videos of temporal textures than the appearance of individual frames. For instance, to tell a forest apart from an ocean, we do not need to consider whether the video was shot in stormy or calm weather conditions. In fact, dynamic textures can usually be recognized easily by observing isolated video frames. In this work, we therefore neglect the issue of appropriately describing the temporal transitions. Instead, we treat each dynamic texture as a set of image feature vectors describing individual frames. We then proceed to model each set as a finite-dimensional subspace of a kernel feature space that is represented by a coefficient matrix of an orthogonal basis. Since every such subspace is spanned by infinitely many orthogonal bases, we employ a basis-invariant metric.

This work incorporates two observations from previous publications. The first observation is that when we apply the Scattering transform to a texture image, its coefficient distributions constitute an expressive feature representation that performs well on recognition tasks when combined with probability product kernels [1]. The second observation is that visual processes can be well described as spaces computed by applying Kernel PCA [2] on distribution-based descriptors of the video frames, when employed together with a distance measure that is invariant to the bases of said spaces [3, 4]. The contributions of this work are as follows. First, we propose a feature extraction method that performs a Scattering transform on an image texture and computes a histogram from each Scattering subband. These histograms are then concatenated into a Scattering histogram vector and a Mercer kernel on pairs of these vectors is defined. Second, we introduce an algorithm which expects a dynamic texture video and returns a coefficient matrix that describes the kernel subspace containing the Scattering histogram vectors computed from the video. Third, we define a metric on pairs of these coefficient matrices. Finally, we describe an algorithm for computing Fréchet means [3] on finite sets of coefficient matrices with respect to the defined metric.

The classical approach to dynamic texture recognition is to first compute a linear autoregressive state-space model and then define a distance measure [5] on its parameters. Another common approach is to use multi-scale spatio-temporal filter responses of the dynamic texture video, e.g. the 2D+T Curvelet Transform [6] or spatio-temporal receptive fields [7]. Beyond that, a considerable number of recognition methods in recent years are based on the idea of collecting features that are designed for 2D images from a video by applying them in three orthogonal planes (TOP) of the video cuboid [8, 9]. Methods based on computing sparse representations have also demonstrated remarkable performance [10]. Interestingly, while neural networks play a significant role in visual process recognition nowadays [11], classical end-to-end deep learning is more the exception than the rule. One reason for this might be that even today it is not straightforward to collect enough dynamic texture sequences to train a deep neural net with superior performance.

2 Scattering Subband Histograms

Figure 1: Scattering tree produced by successive application of the decomposition operator $W$ to the input signal $x$. Lowpass signals are depicted as white nodes. Once the tree is computed, only the white nodes are kept as the representation of the input signal.

Simply put, a Scattering transform [12] is a convolutional neural net (CNN) with fixed weights and without channel recombination, where the absolute value operator is consistently used as the activation function. Consider an operator $W$ that creates a subband decomposition of the input by convolving it with a lowpass filter $\phi$ and a set of bandpass filters $\psi_\lambda$, $\lambda \in \Lambda$, and applies the absolute value operator to each non-lowpass output. The first element of the output tuple, written as $x * \phi$ in the following, is a lowpass representation of the input $x$. The other signals contained in the tuple can be subjected to the operator again, yielding further subband decompositions with the absolute value applied to the bandpass signals. This procedure can be repeated several times, yielding a tree structure like in Figure 1, where the black circles denote the absolute values of bandpass signals and the white circles denote lowpass signals. The depth-$M$ Scattering transform of a signal $x$ is the collection of (only) the lowpass signals created by constructing a tree of depth $M$. These signals are referred to as subbands in the following. Each Scattering subband of $x$ has the form

$$S[\lambda_1, \dots, \lambda_m]\, x = \big|\, \cdots \big|\, |x * \psi_{\lambda_1}| * \psi_{\lambda_2} \big| \cdots * \psi_{\lambda_m} \big| * \phi \tag{1}$$

with $0 \le m \le M$ and $\lambda_1, \dots, \lambda_m \in \Lambda$. The Scattering transform of $x$ results in a tuple containing all subbands. For the subband at the top layer ($m = 0$), let us fix the notation $S[\emptyset]\, x = x * \phi$. As an optional step, the Scattering subbands can be normalized [1, 13] in order to decorrelate them. The normalized Scattering transform is the collection of subbands obtained by rescaling the subbands with the help of signal averages, as described in [1, 13].

In [1], a texture retrieval method based on Scattering subband distributions has been proposed, in which texture images are described by the Weibull distribution parameters computed from the coefficients of each subband via maximum likelihood. This suggests that the subband-wise distributions capture distinctive features of image textures. The framework presented in the following is based on the same feature extraction mechanism, but the distributions are modeled as histograms. Given a dynamic texture sequence of $N$ vectorized video frames, let us assume that we have applied the Scattering transform to each of the frames and computed a histogram with $b$ bins from each subband. To describe a video, we are thus given a matrix

$$X = [h_1, \dots, h_N] \in \mathbb{R}^{bP \times N} \tag{4}$$

with Scattering histogram vectors as its columns, where $P$ is the number of Scattering subbands.
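As a minimal sketch of this feature extraction step, the following Python code builds the matrix of Scattering histogram vectors from precomputed subbands. It assumes that the Scattering subbands of every frame are already available as NumPy arrays (for instance from a Scattering library). The function names, the fixed histogram range `value_range` and the toy data are illustrative choices that do not appear in the paper.

```python
import numpy as np

def scattering_histogram_vector(subbands, bins, value_range):
    """Concatenate one normalized histogram per subband into a single vector."""
    parts = []
    for band in subbands:
        counts, _ = np.histogram(np.ravel(band), bins=bins, range=value_range)
        total = counts.sum()
        if total > 0:
            # Normalize so that each block is a discrete probability distribution.
            parts.append(counts / total)
        else:
            # Assumption: fall back to a uniform block if no coefficient is in range.
            parts.append(np.full(bins, 1.0 / bins))
    return np.concatenate(parts)

def video_to_histogram_matrix(frames_subbands, bins, value_range):
    """Stack the histogram vectors of all frames as columns (shape: bins*P x N)."""
    columns = [scattering_histogram_vector(sb, bins, value_range)
               for sb in frames_subbands]
    return np.stack(columns, axis=1)

# Toy data: 5 frames with 3 hypothetical non-negative subbands each.
rng = np.random.default_rng(0)
frames = [[np.abs(rng.normal(size=(8, 8))) for _ in range(3)] for _ in range(5)]
X = video_to_histogram_matrix(frames, bins=4, value_range=(0.0, 4.0))
```

Each column of `X` then consists of three blocks of four bins, and every block sums to one.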

3 Representation via Kernel Subspaces

Assuming that the Scattering histogram vectors are an appropriate way to represent the individual frames, the entire set of the columns in $X$ should capture the essential information of the whole dynamic texture, if we neglect the dynamics and the order of the frames. What we aim for is thus a compact way to represent this set. However, the classical approach of performing a principal component analysis (PCA) and representing $X$ by the principal subspace would likely fail, because low-dimensional linear vector spaces are not a fitting model for sets of histograms. For this reason, histogram data is often mapped to a feature space using the kernel trick prior to further processing [4]. Following the insights from [1], we employ the Bhattacharyya kernel which, for two probability distributions $p, q$ over the sample space $\Omega$, is defined as

$$k_B(p, q) = \int_\Omega \sqrt{p(\omega)\, q(\omega)}\; \mathrm{d}\omega. \tag{5}$$

Let us make the (simplifying) assumption that the coefficient distributions of different Scattering subbands are independent. Then we can evaluate the Bhattacharyya kernel on a pair of Scattering histogram vectors $h, h' \in \mathbb{R}^{bP}$, each consisting of $P$ concatenated histograms with $b$ bins, by computing

$$k(h, h') = \prod_{i=1}^{P} \sum_{j=1}^{b} \sqrt{h_{(i-1)b + j}\; h'_{(i-1)b + j}}. \tag{6}$$

For the sake of readability, we fix the notation

$$K(X, Y) = \big[k(h_i, h'_j)\big]_{i,j} \in \mathbb{R}^{N \times N'} \tag{7}$$

with $X = [h_1, \dots, h_N] \in \mathbb{R}^{bP \times N}$, $Y = [h'_1, \dots, h'_{N'}] \in \mathbb{R}^{bP \times N'}$, for the matrix-wise application of $k$.
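The product form of the kernel can be illustrated with a short NumPy sketch. It assumes that each histogram vector is a concatenation of equally sized, normalized histograms, one per subband. The function names and the argument `n_subbands` are hypothetical and not taken from the paper's repository.

```python
import numpy as np

def bhattacharyya_product_kernel(h1, h2, n_subbands):
    """Product over subbands of the Bhattacharyya coefficients of the blocks."""
    a = np.sqrt(np.asarray(h1, dtype=float)).reshape(n_subbands, -1)
    b = np.sqrt(np.asarray(h2, dtype=float)).reshape(n_subbands, -1)
    return float(np.prod(np.sum(a * b, axis=1)))

def gram_matrix(X, Y, n_subbands):
    """Matrix-wise application of the kernel to the columns of X and Y."""
    # Shape (P, b, N): square roots of the histogram blocks of every column.
    RX = np.sqrt(X).reshape(n_subbands, -1, X.shape[1])
    RY = np.sqrt(Y).reshape(n_subbands, -1, Y.shape[1])
    # Per-subband Bhattacharyya coefficients, shape (P, N, N').
    per_band = np.einsum('pbi,pbj->pij', RX, RY)
    # Independence assumption: multiply the coefficients over the subbands.
    return per_band.prod(axis=0)

# Toy data: P=2 subbands, b=3 bins, N=4 frames, each block normalized to sum 1.
rng = np.random.default_rng(0)
raw = rng.random((2, 3, 4))
X = (raw / raw.sum(axis=1, keepdims=True)).reshape(6, 4)
G = gram_matrix(X, X, n_subbands=2)
```

For normalized histograms, the kernel of a vector with itself equals one, so the diagonal of `G` serves as a quick sanity check.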

Using $X$ and $k$, we could now proceed to apply Kernel PCA to compute a low-dimensional subspace of a feature space corresponding to $k$. Kernel PCA is typically carried out by performing a truncated eigenvalue or singular value decomposition (SVD) of the $N \times N$ Gram matrix $K(X, X)$ of the $N$ columns of $X$ and returns a coefficient matrix $A \in \mathbb{R}^{N \times n}$ that, together with $X$ and $k$, parameterizes an orthogonal basis spanning the $n$-dimensional principal subspace of the feature space representation of $X$ [2]. The orthogonality of a basis described by a parameter pair $(X, A)$ can be easily verified using the condition

$$A^\top K(X, X)\, A = I_n. \tag{8}$$
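However, storing the whole matrix $X$ for each video sequence is rarely sustainable. So in addition to performing Kernel PCA, we include a Nyström interpolation [14] step to approximate the computed subspace by a small subset of $s$ columns of $X$, for which the inequality $n \le s \le N$ holds. To do so, we collect $s$ representative columns from $X$ in a matrix $\tilde X$ and choose the coefficient matrix $\tilde A \in \mathbb{R}^{s \times n}$ such that an appropriate metric between the Kernel PCA result $(X, A)$ and $(\tilde X, \tilde A)$ is minimized. We choose the metric

$$d_{\mathrm F}^2\big((X, A), (Y, B)\big) = \|\Phi(X) A - \Phi(Y) B\|_F^2, \tag{9}$$

where $\Phi$ maps each column to the feature space, which measures the squared error between the basis vectors in the feature space. Since we can assume that both involved bases are orthogonal, i.e. Eq. (8) holds for $(X, A)$ and $(Y, B)$, we can write it as

$$d_{\mathrm F}^2\big((X, A), (Y, B)\big) = 2n - 2\,\mathrm{tr}\big(A^\top K(X, Y)\, B\big), \tag{10}$$

where $K(X, Y)$ is the Gram matrix between the columns of $X$ and $Y$. We thus define

$$\tilde A = \arg\min_{\hat A \in \mathbb{R}^{s \times n}} d_{\mathrm F}^2\big((X, A), (\tilde X, \hat A)\big) \quad \text{s.t.} \quad \hat A^\top K(\tilde X, \tilde X)\, \hat A = I_n. \tag{11}$$

Due to Eq. (8), the solution of Eq. (11) must have the form $\tilde A = U \Sigma^{-1/2} Q$, where $U, \Sigma$ denote the SVD factors of $K(\tilde X, \tilde X) = U \Sigma U^\top$ and $Q$ is a matrix with orthogonal columns, i.e. $Q^\top Q = I_n$. Eq. (11) thus boils down to

$$\max_{Q \in \mathbb{R}^{s \times n}} \mathrm{tr}\big(A^\top K(X, \tilde X)\, U \Sigma^{-1/2} Q\big) \quad \text{s.t.} \quad Q^\top Q = I_n, \tag{12}$$

which can be solved using the SVD.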
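The two steps above, Kernel PCA followed by the Nyström-type reduction to a subset of columns, can be sketched in NumPy as follows. The sketch works directly on precomputed Gram matrices, skips feature-space centering, and discards numerically vanishing singular values. These are simplifying assumptions, and all function and variable names are illustrative.

```python
import numpy as np

def kernel_pca_coefficients(K_xx, n):
    """Return A (N x n) with A.T @ K_xx @ A = I_n for the n leading directions."""
    eigvals, eigvecs = np.linalg.eigh(K_xx)
    idx = np.argsort(eigvals)[::-1][:n]          # indices of the n largest eigenvalues
    return eigvecs[:, idx] / np.sqrt(eigvals[idx])

def nystroem_coefficients(K_ss, K_sx, A, eps=1e-10):
    """Coefficients on the column subset: U Sigma^{-1/2} Q with Q.T @ Q = I_n."""
    U, sigma, _ = np.linalg.svd(K_ss)            # K_ss is symmetric positive semidefinite
    keep = sigma > eps * sigma[0]                # drop numerically zero directions
    W = U[:, keep] / np.sqrt(sigma[keep])        # U Sigma^{-1/2}
    M = W.T @ K_sx @ A                           # tr(Q.T @ M) is to be maximized
    P, _, Rt = np.linalg.svd(M, full_matrices=False)
    Q = P @ Rt                                   # orthogonal polar factor of M
    return W @ Q

# Toy data: a linear kernel on random features stands in for the real Gram matrix.
rng = np.random.default_rng(0)
F = rng.normal(size=(12, 20))
K_xx = F @ F.T
A = kernel_pca_coefficients(K_xx, n=3)
subset = np.arange(0, 12, 2)                     # hypothetical representative columns
K_ss = K_xx[np.ix_(subset, subset)]
K_sx = K_xx[subset, :]
A_tilde = nystroem_coefficients(K_ss, K_sx, A)
error = 2 * 3 - 2 * np.trace(A.T @ K_sx.T @ A_tilde)   # trace form of the squared error
```

When the subset contains all columns and the Gram matrix has full rank, the second function returns the Kernel PCA coefficients themselves, which is a convenient consistency check.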

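Input: Video sequence of $N$ frames, histogram size $b$, sampling size $s$, target dimension $n$
1 Apply the Scattering transform to each frame;   // Scattering transform
2 Collect the subband histograms of all frames in $X$;   // Subband histograms
3 Collect $s$ representative columns of $X$ in $\tilde X$;   // Subsampling
4 Compute the SVD of the Gram matrix of $X$, truncated to rank $n$;   // Kernel PCA
5 Compute $A$ from the truncated SVD factors such that Eq. (8) holds;
6 Compute $\tilde A$ by solving Eq. (12);   // Eq. (12)
Output: Kernel subspace parameters $(\tilde X, \tilde A)$
Algorithm 1 Kernel Subspace Computation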

Computing the subspace parameters of a dynamic texture is described in Algorithm 1.

4 Nuclear Distance on Kernel Subspaces

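Note that the parameters computed by Algorithm 1 are not unique. Consider a parameter pair $(X, A)$, with $n$ columns in $A$, describing an orthogonal basis in the feature space. Then, all pairs in the set

$$\{(X, AQ) \mid Q \in O(n)\}, \tag{13}$$

where $O(n)$ denotes the group of orthogonal $n \times n$ matrices, describe different orthogonal bases of the same space. A semantically meaningful distance measure on kernel subspaces should be invariant to orthogonal transformations of $A$. This is not the case for Eq. (10), written $d_{\mathrm F}^2$ in the following: Given a pair $(X, A)$ and some $Q \in O(n)$ with $Q \neq I_n$, we observe

$$d_{\mathrm F}^2\big((X, A), (X, AQ)\big) = 2n - 2\,\mathrm{tr}\big(A^\top K(X, X)\, A\, Q\big) = 2n - 2\,\mathrm{tr}(Q) > 0, \tag{14}$$

even though both pairs $(X, A)$ and $(X, AQ)$ describe the same subspace. We overcome this ambiguity by always choosing $Q$ such that $d_{\mathrm F}^2$ is minimized. We denote the result

$$d_{\mathrm N}^2\big((X, A), (Y, B)\big) = \min_{Q \in O(n)} d_{\mathrm F}^2\big((X, A), (Y, BQ)\big) \tag{15}$$

the kernelized Nuclear distance, as it is obtained by replacing the trace in Eq. (10) by the nuclear norm. On two matrix pairs $(X, A), (Y, B)$, the Nuclear distance is computed as

$$d_{\mathrm N}^2\big((X, A), (Y, B)\big) = 2n - 2\,\big\|A^\top K(X, Y)\, B\big\|_*, \tag{16}$$

where $K(X, Y)$ is the Gram matrix between the columns of $X$ and $Y$. Eq. (16) is actually a special case of the kernelized Alignment distance [15, 3]. As such, it inherits its metric property on the equivalence classes in Eq. (13). Specifically, $d_{\mathrm N}$, i.e. the square root of Eq. (16), is symmetric, positive definite and fulfills the triangle inequality [3].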
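The difference between the trace-based error and the Nuclear distance can be illustrated numerically. The sketch below assumes that the cross Gram matrix and both coefficient matrices are given, and it uses a linear kernel on random features as a stand-in for the kernel of Section 3. The function names are illustrative.

```python
import numpy as np

def trace_distance_sq(A, K_xy, B):
    """Squared feature-space error between two orthogonal bases (trace form)."""
    n = A.shape[1]
    return 2.0 * n - 2.0 * np.trace(A.T @ K_xy @ B)

def nuclear_distance_sq(A, K_xy, B):
    """Same expression with the trace replaced by the nuclear norm."""
    n = A.shape[1]
    singular_values = np.linalg.svd(A.T @ K_xy @ B, compute_uv=False)
    # Clip tiny negative values caused by rounding.
    return max(2.0 * n - 2.0 * singular_values.sum(), 0.0)

# Toy data: one subspace described by two different orthogonal bases.
rng = np.random.default_rng(0)
F = rng.normal(size=(10, 15))
K = F @ F.T
eigvals, eigvecs = np.linalg.eigh(K)
A = eigvecs[:, -3:] / np.sqrt(eigvals[-3:])       # satisfies A.T @ K @ A = I_3
Q = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])                  # rotation with trace 1
d_trace = trace_distance_sq(A, K, A @ Q)
d_nuclear = nuclear_distance_sq(A, K, A @ Q)
```

In this example the trace form evaluates to $2 \cdot 3 - 2\,\mathrm{tr}(Q) = 4$ although both bases span the same subspace, whereas the Nuclear distance vanishes up to rounding.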

5 Fréchet Means as Abstract Averages

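Certain scenarios, such as clustering or nearest class center (NCC) classification, require the notion of an average for finite sets of descriptors. Unfortunately, the set of kernel subspaces with dimension $n$ is a non-trivial manifold, which means we cannot compute a mean using the arithmetic average. Thanks to Eq. (16), however, we are operating on a metric space, which means that we can define averages using the Fréchet mean. Consider a set $\{(X_i, A_i)\}_{i=1}^{m}$. Its Fréchet mean is given by

$$(\bar X, \bar A) = \arg\min_{(X, A)} \sum_{i=1}^{m} d_{\mathrm N}^2\big((X, A), (X_i, A_i)\big), \tag{17}$$

where $d_{\mathrm N}^2$ denotes the squared Nuclear distance of Eq. (16). To solve the optimization problem, we follow the approach in [3]. We approximate $\bar X$ by computing cluster centers using the k-means algorithm on the columns of $[X_1, \dots, X_m]$. Next, $\bar A$ is approached using an alternating scheme with two steps for each iteration. The first step consists of finding a set of orthogonal matrices $Q_1, \dots, Q_m$, such that

$$d_{\mathrm F}^2\big((\bar X, \bar A), (X_i, A_i Q_i)\big) = d_{\mathrm N}^2\big((\bar X, \bar A), (X_i, A_i)\big) \tag{18}$$

is fulfilled for every $i$, where $d_{\mathrm F}^2$ denotes the trace-based squared error of Eq. (10). This can be easily achieved using a projection onto the orthogonal group. The second step is approximating $\bar A$ by computing

$$\bar A \leftarrow \arg\min_{A} \sum_{i=1}^{m} d_{\mathrm F}^2\big((\bar X, A), (X_i, A_i Q_i)\big) \quad \text{s.t.} \quad A^\top K(\bar X, \bar X)\, A = I_n. \tag{19}$$

These two steps are repeated until convergence of the loss function in Eq. (17). Due to space constraints, we omit pseudo-code for the final algorithm and refer the reader to [3] for a detailed description of a similar algorithm.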
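Since the paper omits pseudo-code, the following NumPy code is only a sketch of how such an alternating scheme could look, not the authors' implementation. It assumes that the cluster centers are already fixed, so that only the Gram matrix of the centers (`K_cc`) and the cross Gram matrices between the centers and each set member (`K_ci`) are needed. The initialization, the stopping rule (a fixed number of iterations) and all names are assumptions.

```python
import numpy as np

def _constrained_maximizer(K_cc, C, eps=1e-10):
    """argmax of tr(A.T @ C) subject to A.T @ K_cc @ A = I."""
    U, sigma, _ = np.linalg.svd(K_cc)
    keep = sigma > eps * sigma[0]
    W = U[:, keep] / np.sqrt(sigma[keep])
    P, _, Rt = np.linalg.svd(W.T @ C, full_matrices=False)
    return W @ (P @ Rt)

def frechet_mean_coefficients(K_cc, K_ci, A_list, n_iter=20):
    """Alternate between aligning the members and updating the mean coefficients."""
    n = A_list[0].shape[1]
    # Assumed initialization: best approximation of the first member.
    A_bar = _constrained_maximizer(K_cc, K_ci[0] @ A_list[0])
    losses = []
    for _ in range(n_iter):
        C = np.zeros_like(A_bar)
        loss = 0.0
        for K, A in zip(K_ci, A_list):
            # Step 1: orthogonal Q maximizing tr(A_bar.T @ K @ A @ Q).
            P, s, Rt = np.linalg.svd(A_bar.T @ K @ A)
            Q = Rt.T @ P.T
            loss += 2.0 * n - 2.0 * s.sum()      # squared Nuclear distance
            C += K @ A @ Q
        losses.append(loss)
        # Step 2: update the mean coefficients for fixed alignments.
        A_bar = _constrained_maximizer(K_cc, C)
    return A_bar, losses

# Toy data: a linear kernel on random features replaces the real kernel.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 30)) for _ in range(3)]
centers = rng.normal(size=(6, 30))               # stand-in for k-means centers
A_list = []
for F in feats:
    w, V = np.linalg.eigh(F @ F.T)
    A_list.append(V[:, -2:] / np.sqrt(w[-2:]))
K_cc = centers @ centers.T
K_ci = [centers @ F.T for F in feats]
A_bar, losses = frechet_mean_coefficients(K_cc, K_ci, A_list)
```

Both steps minimize the same objective over one block of variables at a time, so the recorded loss values do not increase from one iteration to the next.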

6 Experiments

The DynTex database [16] is a collection of high-resolution RGB texture videos. Three splits have been compiled for classification benchmarking.

DynTex Alpha is composed of 60 videos divided into the 3 classes Sea (20 videos), Grass (20), and Trees (20).

DynTex Beta is composed of 162 videos divided into the 10 classes Sea (20), Vegetation (20), Trees (20), Flags (20), Calm Water (20), Fountains (20), Smoke (16), Escalator (7), Traffic (9) and Rotation (10).

DynTex Gamma is composed of 264 videos divided into the 10 classes Flowers (29), Sea (38), Naked trees (25), Foliage (35), Escalator (7), Calm water (30), Flags (31), Grass (23), Traffic (9) and Fountains (37). Some works [6] use a different version of this split containing 275 videos.

We perform 1-NN (nearest neighbor) and NCC classification experiments on the three splits. To this end, Kernelized Scattering Histogram Spaces (KSHS) are computed from each video using Algorithm 1. The Scattering transform is computed using Kymatio [17] on CUDA with the arguments L=4, J=4. Histograms are calculated from both regular and normalized (KNSHS) subbands. For NCC classification, a Fréchet mean is computed from each class using the procedure described in Section 5 and the leave-one-out protocol proposed in [6]. Parameters were chosen using grid search. Code for reproducing the experimental results is available online at https://github.com/alexandersagel/kshs.
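Once all pairwise distances are available, the 1-NN evaluation reduces to a few lines. The sketch below assumes a precomputed square distance matrix and a leave-one-out evaluation in which each video is classified by its nearest other video. Whether the paper evaluates 1-NN in exactly this way is an assumption, and the function names and toy values are hypothetical.

```python
def leave_one_out_1nn(distances, labels):
    """Predict the label of every sample from its nearest other sample."""
    predictions = []
    for i, row in enumerate(distances):
        candidates = [j for j in range(len(row)) if j != i]
        nearest = min(candidates, key=lambda j: row[j])
        predictions.append(labels[nearest])
    return predictions

def classification_rate(predictions, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(p == t for p, t in zip(predictions, labels))
    return hits / len(labels)

# Toy distance matrix for four videos from two classes.
D = [[0.0, 1.0, 5.0, 6.0],
     [1.0, 0.0, 4.0, 7.0],
     [5.0, 4.0, 0.0, 2.0],
     [6.0, 7.0, 2.0, 0.0]]
labels = ["sea", "sea", "grass", "grass"]
predictions = leave_one_out_1nn(D, labels)
rate = classification_rate(predictions, labels)
```

In the toy example every video has its nearest neighbor in its own class, so the classification rate is 1.0.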

              Alpha               Beta                Gamma
Method        1-NN      NCC       1-NN      NCC       1-NN      NCC
LBP-TOP       96.7 %    -         85.8 %    -         84.9 %    -
PCANet-TOP    96.7 %    -         90.7 %    -         89.4 %    -
DFS           -         83.6 %    -         65.2 %    -         60.8 %
2D+T C.       -         85.0 %    -         67.0 %    -         -*
OTDL          -         86.6 %    -         69.0 %    -         64.2 %
CLSP-TOP      95.0 %    -         92.0 %    -         91.3 %    -
STRF N-jet    100.0 %   -         93.8 %    -         91.2 %    -
B3DF          96.7 %    90.0 %    90.1 %    74.1 %    -*        -*
SOE-NET       98.3 %    96.7 %    96.9 %    86.4 %    -*        -*
SoB+Align     98.3 %    88.3 %    90.1 %    75.3 %    79.9 %    67.1 %
KSHS+Ncl.     98.3 %    96.7 %    88.9 %    88.3 %    88.6 %    86.7 %
KNSHS+Ncl.    98.3 %    96.7 %    93.2 %    90.1 %    91.3 %    89.8 %

*Reported results refer to 275-video version of DynTex Gamma.

Table 1: Classification rate on DynTex subsets

Table 1 shows the 1-NN and NCC classification results in comparison with LBP-TOP [8, 11], PCANet-TOP [18], Dynamic Fractal Spectrum (DFS) [19, 10], Spatiotemporal Curvelet Transform (2D+T Curvelet) [6], Orthogonal Tensor Dictionary Learning (OTDL) [10], Completed Local Structure Patterns in Three Orthogonal Planes (CLSP-TOP) [9], Spatio-temporal Receptive Fields (STRF N-jet) [7], Binarized 3D features (B3DF) [20], Spatiotemporal Oriented Energy Network (SOE-NET) [21], and Systems of Bags with the Alignment Distance (SoB+Align) [3]. Results that have not been reported or are not applicable are indicated by '-'.

In line with the results in [1], normalizing the Scattering coefficients improves the recognition performance: in no experimental setting does KNSHS perform worse than KSHS. One explanation is that normalized coefficients are closer to fulfilling the independence assumption made in Section 3. Overall, KNSHS in combination with the Nuclear distance competes well with current approaches. For 1-NN classification, it yields the same success rate as CLSP-TOP on the Gamma split, and is outperformed by STRF N-jet on the Alpha and Beta splits and by SOE-NET on the Beta split of DynTex. More remarkable are the results for NCC classification. Not only does our method yield higher success rates than many of the state-of-the-art approaches in the literature, but it does so with a notably small performance gap with regard to 1-NN classification. This could be due to using Fréchet means as class centers: because of the triangle inequality, the distance from a test point to the Fréchet mean of a set approximates the distance from the test point to any point in said set, up to the distance between that point and the mean [3]. Hence, NCC should yield results similar to 1-NN whenever the class members lie close to their Fréchet mean.
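The triangle-inequality argument can be spelled out in one line. The symbols are chosen here for illustration only: $d$ is the metric, $t$ a test point, $\bar c$ the Fréchet mean of a class and $c$ any member of that class.

```latex
d(t, c) \le d(t, \bar c) + d(\bar c, c)
\quad\text{and}\quad
d(t, \bar c) \le d(t, c) + d(c, \bar c)
\;\Longrightarrow\;
\big| d(t, c) - d(t, \bar c) \big| \le d(c, \bar c).
```

The distance to the class center therefore deviates from the distance to a class member by no more than the spread of that member around the mean.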

7 Conclusion

In this work, we have proposed a Scattering-based feature extraction method for dynamic textures using Kernel PCA, and described a distance that accounts for the non-uniqueness of the extracted features. Additionally, we have briefly outlined a procedure to compute abstract averages from finite sets of such features. We have evaluated the proposed method on 1-NN and NCC classification and have observed state-of-the-art results for the latter scenario. The capability of computing expressive features and measuring the distance between pairs thereof, in addition to computing abstract averages, is also useful in related recognition tasks, such as retrieval, clustering or even segmentation of video data.


  • [1] Alexander Sagel, Dominik Meyer, and Hao Shen, “Texture retrieval using scattering coefficients and probability product kernels,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 506–513.
  • [2] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
  • [3] Alexander Sagel and Martin Kleinsteuber, “Alignment distances on systems of bags,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2018.
  • [4] Rizwan Chaudhry, Avinash Ravichandran, Gregory Hager, and René Vidal, “Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 1932–1939.
  • [5] Payam Saisan, Gianfranco Doretto, Ying Nian Wu, and Stefano Soatto, “Dynamic texture recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2001, vol. 2, pp. II–II.
  • [6] Sloven Dubois, Renaud Péteri, and Michel Ménard, “Characterization and recognition of dynamic textures based on the 2d+ t curvelet transform,” Signal, Image and Video Processing, vol. 9, no. 4, pp. 819–830, 2015.
  • [7] Ylva Jansson and Tony Lindeberg, “Dynamic texture recognition using time-causal and time-recursive spatio-temporal receptive fields,” Journal of Mathematical Imaging and Vision, vol. 60, no. 9, pp. 1369–1398, 2018.
  • [8] Guoying Zhao and Matti Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 915–928, 2007.
  • [9] Thanh Tuan Nguyen, Thanh Phuong Nguyen, and Frédéric Bouchara, “Completed local structure patterns on three orthogonal planes for dynamic texture recognition,” in 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA). IEEE, 2017, pp. 1–6.
  • [10] Yuhui Quan, Yan Huang, and Hui Ji, “Dynamic texture recognition via orthogonal tensor dictionary learning,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 73–81.
  • [11] Xianbiao Qi, Chun-Guang Li, Guoying Zhao, Xiaopeng Hong, and Matti Pietikäinen, “Dynamic texture and scene classification by transferring deep image features,” Neurocomputing, vol. 171, pp. 1230–1241, 2016.
  • [12] Stéphane Mallat, “Group invariant scattering,” Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
  • [13] Joakim Andén and Stéphane Mallat, “Deep scattering spectrum,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114–4128, 2014.
  • [14] Petros Drineas and Michael W Mahoney, “On the nyström method for approximating a gram matrix for improved kernel-based learning,” Journal of Machine Learning Research, vol. 6, no. Dec, pp. 2153–2175, 2005.
  • [15] Bijan Afsari, Rizwan Chaudhry, Avinash Ravichandran, and René Vidal, “Group action induced distances for averaging and clustering linear dynamical systems with applications to the analysis of dynamic scenes,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2208–2215.
  • [16] Renaud Péteri, Sándor Fazekas, and Mark J. Huiskes, “DynTex : a Comprehensive Database of Dynamic Textures,” Pattern Recognition Letters, 2010, http://projects.cwi.nl/dyntex/.
  • [17] Mathieu Andreux, Tomás Angles, Georgios Exarchakis, Roberto Leonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka, Stéphane Mallat, Joakim Andén, Eugene Belilovsky, et al., “Kymatio: Scattering transforms in python.,” Journal of Machine Learning Research, vol. 21, no. 60, pp. 1–6, 2020.
  • [18] Shervin Rahimzadeh Arashloo, Mehdi Chehel Amirani, and Ardeshir Noroozi, “Dynamic texture representation using a deep multi-scale convolutional network,” Journal of Visual Communication and Image Representation, vol. 43, pp. 89–97, 2017.
  • [19] Yong Xu, Yuhui Quan, Haibin Ling, and Hui Ji, “Dynamic texture classification using dynamic fractal analysis,” in International Conference on Computer Vision (ICCV), 2011, pp. 1219–1226.
  • [20] Xiaochao Zhao, Yaping Lin, Li Liu, Janne Heikkilä, and Wenming Zheng, “Dynamic texture classification using unsupervised 3d filter learning and local binary encoding,” IEEE Transactions on Multimedia, vol. 21, no. 7, pp. 1694–1708, 2019.
  • [21] Isma Hadji and Richard P Wildes, “A spatiotemporal oriented energy network for dynamic texture recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3066–3074.