In recent years, data analysis methods have increasingly been used to automate the extraction of relevant information from data collections. However, most data analysis algorithms suffer from overfitting and numerical problems when patterns have more than a few dimensions. To apply these data analysis methods appropriately, a feature extraction pre-processing stage is required, which removes collinearity among variables and reduces data dimensionality. For this reason, feature extraction techniques, and in particular Multivariate Analysis (MVA) methods [1, 2], have been successfully applied in many different applications of machine learning, such as biomedical engineering [3, 4], remote sensing [5, 6], or chemometrics.
MVA aggregates a family of methods that build a new set of features by extracting highly correlated projections of data representations in the input and target spaces. Well-known representatives of these methods are Principal Component Analysis (PCA), Partial Least Squares (PLS) approaches, and Canonical Correlation Analysis (CCA). PCA creates a new data representation space by finding the directions of largest input variance, providing an optimal set of features in terms of mean-squared reconstruction error. Unlike other MVA methods, PCA works in an unsupervised manner, i.e., it only considers the input data and does not take into account any knowledge of the target space. The PLS approach refers to a family of techniques which, in their general form, project both input and target variables to a new space, generating a set of projected features known as latent variables. The criterion used to extract these latent variables varies within the approach, but the general objective is to maximize the covariance of the two projected expressions. In CCA, the aim is to find linear projections of the input and target data that maximize the correlation between the projected data.
In this paper, we focus on a fourth MVA method known as Orthonormalized PLS (OPLS), also known as semipenalized CCA, multilinear regression (MLR), or reduced-rank regression. OPLS is known to be optimal in the mean square error sense for performing multilinear regression [14, 15]; thus, it is very competitive as a pre-processing step in classification and regression problems [5, 15, 16]. Several works have also tried to establish the connections between OPLS and other MVA and discriminative methods (see, e.g., [13, 17], which proved that OPLS and CCA provide the same solution in a balanced classification task if the target matrix is binary coded, or the work proposing a generalized framework for component analysis).
In this paper, we consider extensions of OPLS that favor the interpretability of the extracted features. In particular, when energy or spectral signals are involved, positivity constraints should be imposed on the projection vectors, so that the extracted features can be interpreted as the energy contained in an equalized frequency band, and the projection vectors themselves can be seen as a sort of filter bank. This interpretation is useful, e.g., in audio or image data applications, where data processing is usually carried out in the frequency domain. Note that, in this paper, we pursue interpretable solutions, understanding interpretability as the capability of the machine to reveal to the expert the reasons why a solution has been reached.
We can find in the recent machine learning literature other algorithms preserving the non-negativity of the solution. One of the most popular is Non-Negative Matrix Factorization (NMF), which has been applied to source separation, music transcription, or spectral data analysis, among others. Another, perhaps less explored, approach consists of incorporating a non-negativity constraint in the solution of MVA methods. For instance, non-negative PCA has been used for blind positive source separation; non-negative PLS (NPLS) for understanding Nuclear Magnetic Resonance (NMR) spectroscopy data; non-negative CCA (NCCA) for audiovisual source separation; and the positive constrained OPLS algorithm for music instrument and genre classification.
In contrast to NMF approaches, an additional advantage of incorporating non-negativity constraints in MVA methods is the capability to obtain sparse solutions and, indirectly, automatic feature selection. This preference for sparsity has motivated many methods in recent years to incorporate ℓ1 and ℓ2 regularizations in their formulations; see, among others, sparse versions of PCA, CCA, and OPLS. However, unlike the methods we consider in this paper, neither ℓ1 nor ℓ2 regularization alone enforces non-negative solutions.
As we have already mentioned, in this paper we will focus on the design of banks of filters that provide interpretable features for supervised problems. Figure 1 illustrates the whole process when dealing with images, which consists of three well-differentiated blocks: 1) a pre-processing step which converts raw data into a better-fitted data representation in the frequency domain (see Sections III and IV); 2) a feature extraction step where the signal is passed through a bank of filters and, as a result, a feature vector is obtained, with each of its components being the energy of the image in a certain frequency range; and 3) a classification stage where the feature vector is used to discriminate the class associated with the image.
In most previous works that rely on systems similar to the one depicted in Figure 1, it is the classifier block that is designed in a supervised manner, whereas the filter bank is typically built without any available label information. Instead, a sufficiently rich battery of general-purpose filters is used (e.g., Gabor filters [30, 31]) or expert knowledge is exploited. Unlike these schemes, in this paper we present a non-negative constrained OPLS method that exploits the available labels to design a bank of filters particularly fitted to each supervised task. As we will see, this results in more discriminative filters that can achieve better recognition rates with a smaller number of features. In order to show the versatility of our methods, we consider two completely different scenarios: texture classification as a visual application, and music genre classification as an audio application.
Among the many visual tasks, the classification of images by their textures is an interesting application. Surprisingly, the non-negative methods referred to above have not been applied here, perhaps due to the wide and successful use of purposely designed feature extraction procedures. One of the most widely adopted techniques is Gabor Filtering (GF), which was proposed for texture discrimination in [30, 31] and is still used and even improved for efficient feature extraction [32, 33], and for rotation- and scale-invariant texture classification. Local Binary Patterns (LBP) are also a successful technique for texture classification, but they do not provide any sort of interpretation of the solution.
Regarding audio-based music classification applications, and in particular the music information retrieval (MIR) field, music genre classification has been an active research area in recent years. Despite the great variety of approaches to this problem, feature extraction is a usual stage in these solutions, and the use of sparse representations has recently been suggested as a way to boost performance. However, sparse features do not provide any sort of interpretability, which is a desirable property for understanding music structure. In order to provide this capability, non-negative features can be extracted, as proposed in previous works.
This paper proposes a general framework to obtain non-negative solutions for the design of filter banks, providing solutions which not only perform well in classification tasks but also help to interpret how the machine works. For this purpose, starting from a non-negative OPLS formulation, we study four different approaches to solve the proposed framework. The performance of these approaches is compared with classical filter design methods for image and music classification (whose parameters need to be adjusted by an expert user) and, additionally, with some current state-of-the-art approaches based on deep learning.
The rest of the paper is organized as follows. In the next section, we propose different ways of designing filter banks in a supervised way as a feature extraction stage for audio or visual applications. In Sections III and IV, we review the most frequently used methods in the visual and audio fields, focusing in particular on the texture and music genre classification problems. In order to prove the general applicability of our methods to applications in which energy or spectral data are involved, we analyze the discriminative power of the extracted filter banks (i.e., non-negative features) on several texture and music genre databases in Section V. Finally, Section VI presents the main conclusions of our work.
II OPLS-based Methods for Supervised Design of Filter Banks
In this section, we formulate several methods to design filter banks in a supervised learning scenario, where the goal is to learn relevant features from input data using a set of training data {x_k, y_k}, for k = 1, ..., N, where x_k ∈ R^n and y_k ∈ R^m are the input and target data vectors, respectively. Therefore, n and m denote the dimensions of the input and target spaces. In classification problems, y_k will be used to denote the class membership of the k-th pattern, e.g., using 1-of-m encoding. Furthermore, in this paper, we assume that all entries of x_k are non-negative, which is the case when dealing with applications where the input space consists of spectral features. For notational convenience, we define the input and target data matrices X and Y, whose rows are centered versions of the vectors x_k and y_k, i.e., the vectors after subtracting x̄ and ȳ, the means of the input and target data. Sample estimates of the input and target data covariance matrices, as well as of their cross-covariance matrix, can then be calculated as C_XX = X^T X, C_YY = Y^T Y, and C_XY = X^T Y, where we have neglected the scaling factor 1/N, and superscript T denotes vector or matrix transposition.
We denote the desired filter bank as U, an n × R matrix, where each of its R columns u_r can be seen as the response of one of the filters in the bank, and z_k = U^T x_k represents the extracted features corresponding to pattern k.
When the input data are (positive) spectral features, we can consider U as a frequency filter bank whose coefficients are going to be forced to be non-negative, and z_k can be interpreted as the non-negative output of each of the filters in the bank. However, when matrix X is centered, the extracted features can also be seen as projections of the centered input data, but their entries are no longer guaranteed to be non-negative. Nevertheless, data centering does not affect the design of the filter bank and is recommended for learning purposes if regression procedures are involved. In fact, it is quite straightforward to illustrate the irrelevance of the centering operation with respect to the extracted features, since U^T (x_k − x̄) = z_k − z̄, where z̄ = U^T x̄ is the mean of the filtered data. Therefore, the interpretation of the filter bank remains valid when working with centered data, and the optimization problems become numerically more stable.
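The irrelevance of centering can be checked numerically. The sketch below (plain NumPy, with randomly generated data standing in for spectral features) verifies that filtering centered data only shifts the extracted features by the mean of the filtered data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 16))            # non-negative "spectral" inputs, one row per pattern
U = rng.random((16, 4))              # non-negative filter bank with 4 filters

Z = X @ U                            # features from raw data
Xc = X - X.mean(axis=0)              # centered inputs
Zc = Xc @ U                          # features from centered data

# Centering the inputs only shifts the features by the mean of the filtered data:
assert np.allclose(Zc, Z - Z.mean(axis=0))
```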
We use OPLS as a starting point to design the filter bank. OPLS is optimal in the Mean Square Error (MSE) sense, i.e., it finds the projection vectors so that the projected data best approximate the target data, minimizing ||Y − X U W^T||_F^2, where subscript F refers to the Frobenius norm of a matrix and W is the regression matrix. For this reason, OPLS usually outperforms the other traditional MVA algorithms in both regression and classification tasks. However, to enforce non-negative filter coefficients and allow more interpretable solutions, we add a non-negativity constraint to the OPLS objective function. Hence, the minimization problem that we propose for the design of the filter bank is:

min_{U,W} ||Y − X U W^T||_F^2   s.t.   U ≥ 0,     (1)

where U ≥ 0 indicates that all elements of matrix U have to be non-negative.
In this paper we propose four different algorithms to solve (1):
Non-negative OPLS (NOPLS): an iterative algorithm alternating a constrained least squares step and an eigenvalue step.
NOPLS with Procrustes approach (P-NOPLS): a variant of NOPLS whose second step solves an orthogonal Procrustes problem.
Deflated NOPLS (defNOPLS): the sequential implementation of NOPLS using deflation.
NMF-like OPLS (NMF-OPLS): an algorithm that can be considered as a supervised version of the NMF problem.
Note that although all algorithms try to solve the same optimization problem, they will in general converge to different solutions. In the rest of this section we introduce these algorithms; their results are compared in the experimental section. For completeness, we also consider a fifth method known as positive constrained OPLS (POPLS). This method was proposed in earlier work and relies on Quadratic Programming (QP) for solving (1).
Ii-a Non-negative OPLS (NOPLS)
The algorithm proposed in this subsection is based on the sparse OPLS (SOPLS) method. In that work, an efficient solution of OPLS, also known as the reduced-rank regression problem in the statistics literature, was used to propose a feature extraction framework which easily allows the inclusion of constraints on the projection matrix. (Note that U is not a projection operator in a rigorous mathematical sense, since it maps data between spaces of different dimension and therefore does not satisfy the idempotent property of projection operators. However, the columns of U span the subspace where the data are projected, and it is in this sense that we refer to U and u_r as projection matrix and vector, respectively, and to z_k as projected data. This nomenclature has been widely used in the machine learning field, particularly in works dealing with feature extraction methods.) The algorithm consists of the iterative application of two coupled steps: a constrained least squares regression problem, and a standard eigenvalue problem whose complexity is given by the target dimensionality. Note that in audio or visual applications the target dimensionality is usually much smaller than the input dimensionality (i.e., m ≪ n) and, thus, this latter step is normally quite efficient.
In this subsection, we propose to replace the sparsity constraint that led to SOPLS with the desired non-negativity constraint. Therefore, following the same arguments, we propose the following iterative procedure to solve the objective function (1):
W step: For fixed U, minimize (1) with respect to W, subject to an orthonormality constraint on W.
The solution of this problem is given by the eigenvalue decomposition in (2),
where W contains the eigenvectors and Λ is a diagonal matrix with all the eigenvalues sorted in descending order on its diagonal. Note that the dimension of the matrix that needs to be analyzed is m × m, which makes this a computationally efficient step in the common case of m ≪ n.
U step: For fixed W, minimize (1) with respect to U only.
We refer the reader to the literature for good summaries of optimization methods that can be used to solve this non-negative least squares (NNLS) problem. In the experiments section, we will use the MATLAB implementation of the block principal pivoting algorithm (code available at http://www.cc.gatech.edu/~hpark/software/nmf_bpas.zip). In case an ℓ1-regularization term is also added to (1), the Monotone Incremental Forward Stagewise Regression (MIFSR) algorithm, with some later modifications, can be used.
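As an illustration of the U step, the sketch below solves the non-negative least squares subproblem column by column with SciPy's `nnls` routine (a simple stand-in for the block principal pivoting implementation cited above). It assumes the preceding W step returned a matrix with orthonormal columns, in which case minimizing (1) over U reduces to fitting X U ≈ Y W; the function name `u_step` is ours:

```python
import numpy as np
from scipy.optimize import nnls

def u_step(X, Y, W):
    """U step sketch: for fixed W with orthonormal columns, solve
    min_U ||YW - XU||_F^2 s.t. U >= 0, which decouples into one NNLS
    problem per column of U."""
    T = Y @ W                                       # projected targets, one column per filter
    return np.column_stack([nnls(X, T[:, r])[0] for r in range(T.shape[1])])

rng = np.random.default_rng(1)
X = rng.random((50, 8))
Y = rng.random((50, 3))
W = np.linalg.qr(rng.standard_normal((3, 2)))[0]   # toy orthonormal W
U = u_step(X, Y, W)
assert (U >= 0).all()                              # NNLS guarantees non-negative entries
```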
The NOPLS method consists of the iterative application of the W and U steps until some convergence criterion is reached. Preliminary experiments showed that the initialization of the algorithm is not critical, and we simply initialize W for the first iteration as a matrix whose elements are defined by the Kronecker delta. As a stopping mechanism, we monitor the eigenvalues of the W step across iterations. In plain words, the algorithm stops when the difference between the eigenvalues of the W step in two consecutive iterations is smaller than a prefixed small constant.
II-B NOPLS with Procrustes approach (P-NOPLS)
Our second proposed method consists of modifying the W step of NOPLS by applying the solution to the orthogonal Procrustes problem. This approach has been used to derive sparse solutions of PCA and OPLS. As we explained above, when fixing the projection matrix U, the W step of the algorithm becomes a problem known in the literature as an "Orthogonal Procrustes problem", whose solution is defined as W = V1 V2^T, given the singular value decomposition Y^T X U = V1 D V2^T, where D is a diagonal matrix containing the singular values, and V1 and V2 contain the left and right singular vectors, respectively.
Since the solution of (2) yields an orthonormal W, we can see that P-NOPLS results just in a rotated version of this matrix during the W step. However, we note that:
The rotation process affects the relevance and ordering of the extracted features. For NOPLS, we can affirm that the features/filter banks are sorted according to their relevance, i.e., the first filter captures the maximum information that can be captured with just one filter with respect to criterion (1), and so on. The rotation process prevents us from making this same claim for P-NOPLS.
In previous work, we showed that the Procrustes solution is very sensitive to initialization and that, for some initializations, the algorithm may fail to converge.
The two arguments above justify our preference for NOPLS over the P-NOPLS solution. Nevertheless, P-NOPLS has also been included in the experiments for the sake of comparison and completeness.
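A minimal sketch of the Procrustes W step (variable names are ours): given the SVD of Y^T X U, the orthonormal maximizer of tr(W^T Y^T X U) is the product of the left and right singular vector matrices.

```python
import numpy as np

def procrustes_w_step(X, Y, U):
    """Procrustes W step sketch: W = V1 @ V2t, where Y^T X U = V1 diag(s) V2t."""
    M = Y.T @ X @ U
    V1, s, V2t = np.linalg.svd(M, full_matrices=False)
    return V1 @ V2t

rng = np.random.default_rng(2)
X, Y, U = rng.random((40, 6)), rng.random((40, 3)), rng.random((6, 2))
W = procrustes_w_step(X, Y, U)
assert np.allclose(W.T @ W, np.eye(2))          # columns of W are orthonormal
```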
II-C Sequential implementation of NOPLS using deflation (defNOPLS)
In this section, we describe a sequential algorithm that implements the non-negative OPLS scheme introduced in Subsection II-A. This sequential algorithm consists of the following two steps: 1) extraction of the projection vector u_r, which represents the frequency response of the next filter to be included in the bank, and 2) application of a deflation procedure to eliminate the influence of this eigenvector by annihilating the associated eigenvalue. Note that the deflation process yields a new data space where the resulting matrix does not depend on the space defined by the eigenvector, so that we can again apply the first step and calculate the next most important eigenvector. These steps are repeated for r = 1, 2, ... until the desired number of filters or features is reached.
The design of the next filter consists of the extraction of a pair of vectors (u_r, w_r) which are optimal with respect to (1). This can be done by iterating the W and U steps we have described for the NOPLS algorithm. Since in this case we are solving a unidimensional problem at each step, the solution to the W step simplifies to

w_r = C_XY^T u_r / ||C_XY^T u_r||.     (5)
Throughout this paper, we will deflate the cross-covariance matrix by means of the Schur complement deflation, one of the deflation methods proposed in the sparse eigendecomposition literature, which can be applied to any of a number of constrained MVA methods or eigendecomposition-based problems, including this non-negative constrained OPLS solution. This deflation procedure avoids the reappearance of u_r as a component of future "pseudo-eigenvectors" (a term used to differentiate the true eigenvector obtained from the original, unconstrained objective function from the solution obtained when constraints are applied). Since, in our case, we have to deal with a supervised problem, the Schur complement deflation can be written as

C_XY ← C_XY − (C_XX u_r u_r^T C_XY) / (u_r^T C_XX u_r),     (6)

where the left-hand side represents the updated cross-covariance matrix; the deflation renders C_XY left orthogonal to u_r only, since after deflation it holds that u_r^T C_XY = 0.
Table I provides the pseudocode for the sequential algorithm that we have just described. Note that, in the table, subscript r is used to index the projection vectors (i.e., u_r), whereas the superscript indexes the iterative application of the W and U steps that are needed to converge to each projection vector. Different convergence criteria can be used at step 2.2.3 of the algorithm. In the experimental section, we will monitor the cosine distance between consecutive estimates of the projection vector
and stop when it falls below a small tolerance parameter. Other possibilities would consist in monitoring the cosine distance between the regression coefficient vectors, or the eigenvalue of the W step.
|1.- Inputs: centered matrices X and Y, number of filters R.|
|2.- For r = 1, ..., R:|
|  2.1.- Initialize u_r^(0).|
|  2.2.- For i = 1, 2, ...:|
|    2.2.1.- Update w_r^(i) using (5).|
|    2.2.2.- Update u_r^(i) by solving the unidimensional version|
|    of the NNLS problem (1) subject to u_r ≥ 0.|
|    2.2.3.- If convergence criterion is reached, provide output|
|    values u_r = u_r^(i), w_r = w_r^(i); otherwise, go back to 2.2.|
|  2.3.- Deflate cross-covariance matrix using (6).|
|3.- Output: U = [u_1, ..., u_R].|
|Note: the initialization vector is defined as a vector with one|
|component equal to 1, and all other components equal to 0.|
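The sequential scheme of Table I can be sketched in a few lines of Python. The helper name `def_nopls` is ours; the unit-norm w update and the deflation formula are our reconstruction, chosen to be consistent with the left-orthogonality property stated in the text, and a simple all-ones initialization replaces the delta vector of the table:

```python
import numpy as np
from scipy.optimize import nnls

def def_nopls(X, Y, R, n_iter=50):
    """Sequential NOPLS sketch. The deflation below leaves the deflated
    cross-covariance matrix left-orthogonal to the extracted vector u."""
    Cxx, Cxy = X.T @ X, X.T @ Y
    filters = []
    for r in range(R):
        u = np.ones(X.shape[1])                 # simple non-negative initialization
        for _ in range(n_iter):
            w = Cxy.T @ u                       # unidimensional W step
            norm = np.linalg.norm(w)
            if norm == 0:                       # degenerate direction: stop iterating
                break
            w /= norm
            u = nnls(X, Y @ w)[0]               # unidimensional NNLS U step
        energy = u @ Cxx @ u
        if energy > 0:                          # Schur-complement-style deflation
            Cxy = Cxy - (Cxx @ np.outer(u, u) @ Cxy) / energy
        filters.append(u)
    return np.column_stack(filters)

rng = np.random.default_rng(6)
X = rng.random((60, 10))
Y = np.eye(3)[rng.integers(0, 3, size=60)]      # 1-of-3 coded targets
U = def_nopls(X, Y, R=3)
assert (U >= 0).all()                           # every filter coefficient is non-negative
```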
II-D NMF-like OPLS (NMF-OPLS)
This method builds on the multiplicative update (MU) rule, which is maybe the most popular NMF algorithm. Moreover, the loss function of the Projected-NMF algorithm and some known relationships among several NMF versions can be useful to see the similarities between NMF and the supervised version we propose here.
Unlike the previous methods, NMF methods require non-negative values both for U and W (i.e., U ≥ 0 and W ≥ 0) and, thus, the additional constraint W ≥ 0 needs to be considered. Since centered data take negative values due to the centering operation, we will use the original (uncentered) data sets X and Y, which are non-negative.
The loss function to be minimized is also given by (1), although in this case the constraint W ≥ 0 will be added.
In order to ease the derivation of the current proposal, we rewrite the objective function of (1) in terms of the trace operator (tr(·)):

||Y − X U W^T||_F^2 = tr(Y^T Y) − 2 tr(W U^T X^T Y) + tr(W U^T X^T X U W^T).     (9)
As a summary of the MU rule, let us suppose that the gradient of (9) with respect to U or W can be decomposed as ∇f = [∇f]⁺ − [∇f]⁻, where [∇f]⁺ ≥ 0 and [∇f]⁻ ≥ 0. Then, the element-wise updating rule follows as:

Θ ← Θ ⊙ ([∇f]⁻ ⊘ [∇f]⁺),     (10)

where ⊙ denotes the Hadamard (element-wise) product, ⊘ represents the element-wise division (applied entry by entry), and Θ is the matrix that needs to be updated. Note that this update keeps the non-negativity of the solution at every step.
To apply the MU rule in our case, we first obtain the derivative of (9) with respect to U,

∇_U = 2 X^T X U W^T W − 2 X^T Y W,

which, considering that all involved matrices are non-negative, allows us to identify [∇_U]⁺ = 2 X^T X U W^T W and [∇_U]⁻ = 2 X^T Y W. Similarly, the derivative of (9) with respect to W is

∇_W = 2 W U^T X^T X U − 2 Y^T X U,

so that we can identify [∇_W]⁺ = 2 W U^T X^T X U and [∇_W]⁻ = 2 Y^T X U. Therefore, following equation (10), the MU rules for updating U and W, which constitute the core of the NMF-OPLS method, are given by

U ← U ⊙ (X^T Y W) ⊘ (X^T X U W^T W),
W ← W ⊙ (Y^T X U) ⊘ (W U^T X^T X U).
As is expected for NMF algorithms, preliminary experiments showed us that the initialization of the algorithm is critical. The non-negative decomposition approximation method (NNDSVD) provides a good starting point for NMF algorithms, so we use it over the matrix X^T Y, since we deal with a supervised scheme. (We have used the NNDSVDa version of the algorithm, as well as the implementation provided by its authors.) Therefore, we initialize the U and W matrices as the left and right decomposition matrices, respectively, obtained by the NNDSVD approach.
Non-negativity constraints usually force a large number of zeros in the solution matrix, often causing numerical problems that make the MU update get stuck earlier than desired. It has been shown that a slight improvement is achieved by substituting zeros with a small constant ε. Thus, the improved MU rules are given by

U ← max(ε, U ⊙ (X^T Y W) ⊘ (X^T X U W^T W)),     (11)
W ← max(ε, W ⊙ (Y^T X U) ⊘ (W U^T X^T X U)).     (12)
Furthermore, a normalization step can be included in each MU iteration of NMF algorithms in order to facilitate convergence. In our case, this step is also applied, and the U and W matrices are normalized by their respective Frobenius norms.
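A compact sketch of the core MU iteration, built from the standard split of the gradient of ||Y − X U W^T||_F^2 into positive and negative parts. Random initialization and the small constant added to the denominators are simplifications of ours (the text instead uses NNDSVD initialization and a zero-substitution fix):

```python
import numpy as np

def nmf_opls(X, Y, R, n_iter=200, eps=1e-9):
    """MU-rule sketch for min ||Y - X U W^T||_F^2 s.t. U >= 0, W >= 0.
    Each factor is multiplied by the ratio of the negative to the positive
    part of its gradient; eps guards the divisions."""
    rng = np.random.default_rng(0)
    U = rng.random((X.shape[1], R))              # random non-negative init (simplification)
    W = rng.random((Y.shape[1], R))
    for _ in range(n_iter):
        U *= (X.T @ Y @ W) / (X.T @ X @ U @ (W.T @ W) + eps)
        W *= (Y.T @ X @ U) / (W @ (U.T @ X.T @ X @ U) + eps)
    return U, W

rng = np.random.default_rng(5)
X = rng.random((80, 12))
Y = np.eye(3)[rng.integers(0, 3, size=80)]       # 1-of-3 coded targets
U, W = nmf_opls(X, Y, R=2)
assert (U >= 0).all() and (W >= 0).all()         # MU preserves non-negativity
```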
Table II provides the pseudocode for the NMF-OPLS algorithm that we have just described. Different convergence criteria can be used at step 2.2.4 of the algorithm. In the experimental section, we compare the solutions achieved in two consecutive iterations, and the algorithm is stopped when they differ by less than a small threshold.
|1.- Inputs: positive matrices X and Y.|
|2.1.- Initialize U and W with the NNDSVD algorithm.|
|2.2.- Repeat:|
|  2.2.1.- Update U using (11).|
|  2.2.2.- Update W using (12).|
|  2.2.3.- Normalize U and W.|
|  2.2.4.- If convergence criterion is reached, go to 3.|
|3.- Outputs: U, W.|
A few considerations are in order with respect to the algorithm we have just described. First, the main advantage of the MU rule is its simplicity and ease of implementation; however, slow convergence is frequently observed. Second, the application of NMF-OPLS requires the number of filters in the bank (R) to be selected a priori, and sequential implementations are suboptimal, since the positivity constraint prevents the deflation procedure from completely removing the contribution of the previous projections. Finally, unlike NOPLS, the NMF-based implementation does not guarantee that the filters of the bank (i.e., the columns of U) are sorted according to their relevance.
II-E Positive Constrained OPLS (POPLS)
For completeness, this subsection describes an algorithm proposed in earlier work to solve (1). In this case, matrix W is not explicitly calculated: the trick here is to express W in terms of U and to introduce it into the functional in (1), obtaining an optimization problem involving U only.
Thus, we arrive at the following optimization problem
where the last constraint is added in order to select one of the infinitely many solutions of the objective function in (13). Note that this constraint is different from that of the OPLS problem. However, this constraint was preferred because it can be directly incorporated into the hyperspherical representation of the projection vectors, where each u_r is represented by a radius and a set of angles. This way, the optimization can be solved with respect to the angles, and constraints on them guarantee the non-negativity of the solution. This approach was followed to solve the convergence problems of MATLAB's fmincon function under the original representation used in (13).
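The hyperspherical trick can be sketched as follows: with all angles restricted to [0, π/2], every coordinate is a product of non-negative sines and cosines, so the resulting vector is automatically non-negative (the function name and variable names are ours):

```python
import numpy as np

def hyperspherical(radius, angles):
    """Map a radius and n-1 angles in [0, pi/2] to a non-negative vector in R^n.

    Coordinates follow the standard hyperspherical parametrization:
    u_1 = r cos t1, u_2 = r sin t1 cos t2, ..., u_n = r sin t1 ... sin t_{n-1}.
    """
    sines = np.concatenate(([1.0], np.cumprod(np.sin(angles))))   # 1, sin t1, sin t1 sin t2, ...
    cosines = np.concatenate((np.cos(angles), [1.0]))             # cos t1, ..., cos t_{n-1}, 1
    return radius * sines * cosines

u = hyperspherical(2.0, np.array([0.3, 1.0, 0.7]))
assert (u >= 0).all()                            # angles in [0, pi/2] force non-negativity
assert np.isclose(np.linalg.norm(u), 2.0)        # the radius equals the Euclidean norm
```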
An inconvenience of this method is that the feature-ordering property of OPLS is not satisfied, implying that the filters are not arranged according to their discriminative power. To correct this, in this paper we apply a sequential implementation using the Schur complement deflation. The resulting sequential POPLS algorithm is summarized in Table III.
III Texture Classification Application
To illustrate the advantages of OPLS methods for supervised filter design, we will consider texture recognition and music genre classification applications. For this reason, this section provides a brief summary of the necessary background on image processing required to understand the application of OPLS methods for texture recognition and the alternatives that are currently available in the literature. Similarly, Section IV will deal with the music genre recognition problem.
Figure 1 describes the stages that are typically encountered in texture recognition. Following these stages, the image is first pre-processed and then transformed to the frequency domain to facilitate the extraction of relevant features through filtering. Finally, a classifier is used to discriminate among the different textures.
A typical simple pre-processing step is to apply a two-dimensional Fast Fourier Transform to each image, which is usually first converted to gray scale if the raw image is in color. This allows the next stage to extract features z in the frequency domain by means of a filter bank U as
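Stages 1 and 2 can be sketched for a single grayscale image with plain NumPy; the filter bank U below is random and merely stands in for a learned one. The image is transformed with a 2-D FFT, its energy spectrum is vectorized, and the filter bank maps it to one energy feature per filter:

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.random((32, 32))                      # toy grayscale image
spectrum = np.abs(np.fft.fft2(img)) ** 2        # non-negative energy spectrum
f = spectrum.ravel()                            # vectorized frequency representation
U = rng.random((f.size, 5))                     # non-negative bank of 5 filters (random stand-in)
z = U.T @ f                                     # one energy feature per filter
assert (z >= 0).all()                           # non-negative bank on non-negative energies
```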
One of the most popular feature extraction techniques for texture classification is GF. However, GF shows a strong dependence on several parameters whose values may significantly affect the discriminative performance of the subsequent classifier. The effect of the GF parameters on texture classification has been comprehensively evaluated in the literature.
One remarkable conclusion of that study is that the smoothing parameters are important, whereas the number of frequencies and orientations has, in general, little effect on texture classification. This result contradicts the widely accepted belief that the parameters with the highest influence on texture classification performance are those related to the number of orientations, the number of frequencies, and the largest frequency of all filters.
As illustrated by that study, the design of a GF bank can be very costly as a result of the validation process required to adjust the free parameters. Furthermore, the general shape of GFs is predefined a priori and, apart from their widespread use in texture classification, there is no guarantee that GFs are the most adequate selection for a particular task. In contrast, our proposed methods use the available labels to build the bank of filters and do not assume any predefined shape for the frequency response of the filters. For this reason, we expect them to be able to extract more discriminative features for each particular supervised task.
To conclude this subsection, there are some practical considerations that need to be taken into account and that imply differences between our schemes and the direct application of GF:
GF provides two features per filtered image: the mean and the standard deviation of the filtered image. In contrast, our methods produce only one feature per filtered image.
In order to ease the operation of our algorithms, we decimate each frequency image by averaging the energy over neighboring pixels of the image. This results in a lower-resolution representation and, thus, in a dimensionality reduction of the vectorized frequency vector. This pre-processing step is illustrated in Figure 2.
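The decimation step can be implemented as a block average of the energy spectrum; the sketch below averages over non-overlapping d × d neighborhoods via a reshape (dimensions are assumed divisible by d):

```python
import numpy as np

def decimate_energy(spectrum, d):
    """Average the energy over non-overlapping d x d blocks."""
    h, w = spectrum.shape
    return spectrum.reshape(h // d, d, w // d, d).mean(axis=(1, 3))

s = np.arange(16, dtype=float).reshape(4, 4)    # toy 4x4 energy spectrum
low = decimate_energy(s, 2)
assert low.shape == (2, 2)
assert low[0, 0] == (0 + 1 + 4 + 5) / 4         # mean of the top-left 2x2 block
```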
IV Music Genre Classification
In this section we review the applicability of the OPLS-based schemes presented in this paper for music recognition applications. Although we consider here the particular case of music genre classification, the approach could be straightforwardly extended to other music information retrieval tasks. As before, the goal of the automatic design of the filter bank is to obtain good recognition rates, while at the same time extracting interpretable features.
The whole music recognition application can be summarized into three well-differentiated blocks (see Figure 3), namely, the audio pre-processing step, a filter bank, and the classifier.
The audio pre-processing step, which transforms raw audio signals into relevant information for the subsequent step, can be further subdivided into two stages (see Figure 4): short-time feature extraction, which consists of features extracted over periods ranging from 5 to 100 ms, where music signals are typically stationary; and temporal feature integration, which is the process of combining all the feature vectors of a time frame into a single feature vector in order to capture the relevant temporal information of the frame.
In the following, we detail these two stages:
Short-time features: Mel Frequency Cepstral Coefficients (MFCC) have been selected as the short-time feature representation because of their widespread use and great success in several fields of MIR (see, e.g., [39, 56]). The MFCCs are ranked in decreasing order of the richness with which they represent the spectral envelope; thus, the lower MFCCs contain information about the slow variations in the spectral envelope. In our experiments, we use only the first 6 MFCCs and, in order to minimize aliasing in the MFCCs, we apply a frame size of 30 ms and a hop size of 7.5 ms.
Temporal feature integration: In order to capture the relevant temporal information in the frame, we first estimate the power spectrum of each MFCC using a periodogram, as suggested in the literature. Then, we concatenate these six energy features into a new single feature vector. There exist many other temporal feature integration methods; see the literature for a good review of these techniques.
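The integration stage can be sketched with SciPy, using a random matrix to stand in for the 6 MFCC trajectories of one frame (the feature rate below is derived from the 7.5 ms hop size mentioned above): each short-time feature series is summarized by its periodogram, and the periodograms are concatenated into a single vector.

```python
import numpy as np
from scipy.signal import periodogram

rng = np.random.default_rng(4)
fs = 1.0 / 0.0075                               # 7.5 ms hop size -> ~133 Hz feature rate
mfcc = rng.standard_normal((6, 160))            # 6 MFCC trajectories over a ~1.2 s frame
pxx = np.vstack([periodogram(row, fs=fs)[1] for row in mfcc])
feature = pxx.ravel()                           # concatenated periodograms of the 6 MFCCs
assert (feature >= 0).all()                     # power estimates are non-negative
```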
Once the raw data are converted into a non-negative representation (i.e., the periodograms of the MFCCs), the following step relies on applying a filter bank U in order to extract the desired non-negative features, which can be seen as the energy contained in certain frequency bands of each MFCC periodogram. Note that this filter bank is inherently non-negative as well, since it is applied directly on the estimated power spectrum (periodogram) of each MFCC; the corresponding feature vector has as many components as the number of filters in the bank. These feature vectors are then fed to the subsequent classifier.
In order to design the filter bank U, there are two different alternatives: using expert knowledge, which is the most typical approach; and the supervised approaches that we propose in this paper, which exploit the label information to learn a filter bank for each recognition task. An example of the first alternative is the predefined Philips filter bank used in , where the authors suggest summarizing the power components in four frequency bands: 1) 0 Hz (DC component); 2) 0–2 Hz (beat rates); 3) 3–15 Hz (modulation energy, e.g., vibrato); 4) higher modulation frequencies (associated with the perceptual roughness). Therefore, for this particular filter bank, U is a matrix with four rows and as many columns as points of the periodogram, which in turn depends on the length of the MFCC series used to calculate the periodogram (measured in number of samples). In Subsection V-B, we compare our solutions with this fixed filter bank.
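Such a band-summation matrix can be sketched as an indicator matrix over the periodogram bins. In the sketch below, the upper edge of the fourth band and the bin-to-frequency mapping are assumptions of ours, not values from the paper; we also merge the gap between 2 and 3 Hz into the second band so that the bands partition the spectrum:

```python
import numpy as np

def philips_like_bank(D=129):
    """Indicator-matrix filter bank summarizing a D-point periodogram in
    four bands, in the spirit of the Philips bank (band edges partly
    assumed)."""
    fs_mod = 1000 / 7.5                    # MFCC rate implied by a 7.5 ms hop
    freqs = np.linspace(0, fs_mod / 2, D)  # modulation frequency of each bin
    U = np.zeros((4, D))
    U[0, freqs == 0] = 1                   # 1) DC component
    U[1, (freqs > 0) & (freqs <= 2)] = 1   # 2) beat rates (0-2 Hz)
    U[2, (freqs > 2) & (freqs <= 15)] = 1  # 3) modulation energy, e.g. vibrato
    U[3, freqs > 15] = 1                   # 4) higher bands (roughness; assumed edge)
    return U
```

By construction every periodogram bin contributes to exactly one band, so the four features sum to the total energy of the periodogram.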
In this section, we analyze the performance of the proposed set of supervised filter banks in two classification tasks: image texture recognition and music genre classification. In order to evaluate the advantage of our proposals, we analyze their discriminative power and their interpretability in comparison to the well-studied Gabor and Philips Filter Banks, which are specifically designed for the considered applications. Moreover, we compare the proposed methods against deep learning approaches, which are the state of the art in these problems, especially in visual applications such as texture classification. In this way, we can evaluate whether the interpretable solutions obtained with our methods are competitive in terms of performance.
In order to evaluate the interpretability of the solutions, we need to define an objective criterion. Focusing on these multimedia applications, we consider that the interpretability of a solution is higher when 1) it uses a small number of filters (n_f), and 2) the number of non-zero energy bands in each feature is small. We denote as NZ the rate of non-zero coefficients of the filters, i.e., the number of non-zero coefficients divided by the total number of filter coefficients. However, given that these two interpretability criteria are of different nature and are subject to a trade-off (a similar performance can be obtained with fewer but less sparse features), it is difficult to combine them into a single measure of interpretability. Moreover, the preference for either criterion will depend on the user or the application. For these reasons, in this paper we report the two terms separately, but we also propose and use a combined interpretability measure:

IM = (1 − NZ) + log₁₀(R / n_f),    (14)

where R is a reference number of filters against which a method is compared. As an example, if an algorithm uses R filters, the second term of (14) is equal to zero. Therefore, with NZ = 1 (i.e., all coefficients are non-zero), a number of filters smaller than R yields a positive IM (good interpretability), while using more filters than R produces a negative IM (poor interpretability). In this paper, we use the number of classes to be classified, C, as the reference (R = C). Note that IM grows linearly with the sparsity of the filters (1 − NZ) and logarithmically as the number of filters decreases. Figure 5 displays the IM curve in terms of n_f and NZ with R = C.
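For concreteness, a small Python version of this measure. We read (14) as IM = (1 − NZ) + log10(R / n_f); treat this exact form as our reconstruction, although it is consistent with the IM values reported for the DNN baselines in Subsection V-C (e.g., 613,120 filters with 11 classes gives IM ≈ -4.7):

```python
import numpy as np

def interpretability_measure(nz, n_f, ref):
    """Combined interpretability measure IM = (1 - NZ) + log10(ref / n_f).

    nz: rate of non-zero filter coefficients in [0, 1];
    n_f: number of filters;
    ref: reference number of filters (the number of classes in the paper).
    """
    return (1.0 - nz) + np.log10(ref / n_f)
```

With all coefficients non-zero (NZ = 1) and n_f equal to the reference, IM is exactly zero; fewer filters push it positive, more filters push it negative.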
V-A Experiment 1: Texture Classification
This subsection considers two different image texture classification tasks: classification based on a predefined set of categories, which is a more realistic scenario for texture classification, and an image classification task, which is also typical in the literature.
The first task considers a real scenario for texture classification, where each image belongs to a specific class of textures among 10 different categories: bark, earth, gravel, plywood, snow, brick, grass, ivy, sky, and water. (Textures were downloaded from http://www.cgtextures.com/ in 2009, and the dataset created and used in this paper can be downloaded from http://www.tsc.uc3m.es/~smunoz/CGTextures.zip; due to the origin of the textures, we will refer to this dataset as CGTextures.) In order to provide more patterns to the database, each image is divided into a set of 16 sub-images. The second task considers the Brodatz dataset , which has been widely used in the texture classification literature. In this experiment, each image is also divided into a set of 16 sub-images, and the goal of the classification task consists of assigning each sub-image to its original image. Table IV summarizes the main characteristics of these datasets. Figure 6 displays an excerpt of 5 images per class of the CGTextures dataset; each class consists of very different pictures, making this a difficult texture classification task.
For the following experiments, we have divided each image into 16 sub-images and, for our methods, we have converted each sub-image into a frequency image, decimating the original frequency image by a factor of 10.
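The sub-image extraction and frequency-domain conversion can be sketched as follows (a numpy illustration assuming square grayscale images; the function names are ours, and the magnitude spectrum is one reasonable choice of frequency image):

```python
import numpy as np

def split_into_subimages(img, grid=4):
    """Split a square image into grid x grid non-overlapping sub-images."""
    s = img.shape[0] // grid         # side length of each sub-image
    img = img[:s * grid, :s * grid]  # drop any remainder rows/cols
    # reshape to (grid, s, grid, s) and reorder to (grid*grid, s, s)
    return img.reshape(grid, s, grid, s).swapaxes(1, 2).reshape(-1, s, s)

def frequency_image(sub, decimate=10):
    """2-D magnitude spectrum of a sub-image, decimated by the given factor."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(sub)))
    return spec[::decimate, ::decimate]
```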
In the case of GF, we also cross-validated (CV) its parameters, using the same values for both datasets. The rest of the parameters have been fixed according to . Furthermore, the number of filters in the bank has been cross-validated for each method under study.
[Table IV columns: No. images (train/test) | Size | No. classes]
Thereupon, we will study the discriminative power and interpretability of the proposed supervised filter designs in comparison to the specifically designed GF banks. After designing each filter bank, we will train a linear Support Vector Machine (C-SVM) using the projected input data in order to evaluate the overall accuracy (OA) of every method; the optimal value of the parameter C, which sets a trade-off between the classifier margin and the number of errors, has been cross-validated for each method under study. Since the aim of this paper is to obtain a subset of interpretable features useful for these classification tasks, interpretability will be analyzed by measuring the number of frequencies used by each filter bank and by displaying the resulting projected data.
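This classification stage can be sketched with scikit-learn (an illustration, not the authors' exact setup; the C grid and function name are assumptions of ours):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_classifier(Z_train, y_train):
    """Train a linear C-SVM on the projected features Z = U @ X,
    cross-validating the trade-off parameter C."""
    grid = GridSearchCV(LinearSVC(max_iter=10000),
                        {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
    grid.fit(Z_train, y_train)
    return grid.best_estimator_
```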
Texture Classification in the CGTextures dataset
Table V and Figure 7 compare the performances obtained by the proposed methods and by the GF bank over the CGTextures dataset. In particular, Figure 7 displays the evolution of the overall accuracy (OA) with the number of filters in the bank, and Table V shows the OA of each method when the number of filters is selected using CV. To carry out a fair analysis, we have included results for the GF approach with the filters sorted according to the MSE on the training set, i.e., for each number of filters we selected the subset achieving the best recognition capabilities.
As expected, the set of proposed supervised filter designs, and mainly the NOPLS algorithms, achieve higher accuracy than the GF approaches; note that NOPLS outperforms the rest of the algorithms, including OPLS. Moreover, the number of filters used by the supervised filter banks is less than half the number of filters selected for the GF bank. Furthermore, it is important to remark that, although all methods use a similar number of filters, the number of frequency bands selected by our methods is significantly smaller than that of GF, as can be seen from the rate of non-zero coefficients of the filters (NZ) in Table V.
Besides, as we explained in Section III, our methods extract only one feature per filter (see Table V), while GF extracts two features per filter: the mean of the filtered image and its standard deviation. In order to compare the performance of GF with one or two features per filter against our best method, Figure 7 (b) displays the evolution of the OA with the number of considered filters.
In summary, the proposed methods are more discriminative, more selective, and more sparse than GF. In order to analyze the interpretability of each method in a qualitative way, Figure 8 shows the first 10 filters of the bank provided by each method, as well as an example of the filtered images obtained from an image of the class grass. As can be seen, the supervised filters are more accurate and selective than those of the GF bank, consisting of a mixture of band-pass filters in horizontal, vertical, and oblique directions. It is interesting to note the similarity among the filters in the banks of NOPLS, defNOPLS, and even the first filters of POPLS, which is indicative of NOPLS doing better than P-NOPLS. With respect to the GF bank, the lower accuracy of the classification system relying on these features indicates that this set of filters failed to extract discriminative features for the task at hand, making clear the convenience of designing the filter bank in a supervised manner.
Texture Classification in the Brodatz dataset
In this subsection, we evaluate the different methods under study on the Brodatz texture classification scenario. In this case, each sub-image has to be correctly assigned to its original image and, thus, the number of classes to be labeled is the same as the original number of images available in the database. As in the previous subsection, we compare our proposals with the GF bank, although in this case we use the GF design proposed in , where the GF bank is specifically tailored to this particular task.
Following the same experimental procedure as in the previous experiment, Table VI and Figure 9 include a comparison of the different methods under study. Again, to obtain a fairer comparison, the filters in the GF bank have been sorted according to the MSE measured over the training dataset. To further discuss the differences between the proposed methods, in this subsection we also analyze the training times required to build the filter banks. All simulations were run using Matlab 8 on a MacBook Pro with 8 GB of RAM and a 2.9 GHz dual-core Intel Core i7 CPU.
As we can see, all supervised methods are more discriminative than the GF bank, even when there are few filters in the banks. Although P-NOPLS is slightly more discriminative than NOPLS, it takes much longer to train, and the number of filters is also larger. It is important to remark that NOPLS is the fastest algorithm (2.34 sec.), excluding the OPLS algorithm since it does not include any constraint in its formulation, and requires half the features of GF, whereas, in this case, the defNOPLS solution is the most discriminative and the second fastest (14.84 sec.). P-NOPLS and NMF-OPLS need around 20 sec., and POPLS is dramatically slower, taking 12 h. and 12 min. Unlike in the previous problem, GF here uses fewer filters than the supervised approaches; however, the number of coefficients is similar (except for P-NOPLS), and the number of frequency bands of the images needed by our algorithms is drastically smaller than for GF (see NZ in Table VI). Comparing with the OPLS results, we can see that the standard OPLS solution obtains the worst performance with any subset of filters. This fact indicates that the non-negativity constraints not only provide interpretable solutions but also, in some cases, increased performance.
Note that, as we explained in Section II, P-NOPLS and NMF-OPLS do not sort the filters of the bank in terms of their importance. One consequence of this is that they require more filters than the other supervised methods; for example, P-NOPLS needs twice as many filters as the rest of the methods.
V-B Experiment 2: Music Genre Classification
This second block of experiments aims at classifying the music genre of a song from the periodograms of the first 6 MFCCs extracted from each song. The dataset used here has been previously investigated in [26, 57, 59], and those results revealed that it is very difficult to successfully classify each song according to its musical genre (see [26, 59]). Moreover, the human evaluation study of  found that the human definition of genre for the audios in this dataset presents low consistency, making this a difficult dataset for genre classification. Notwithstanding the above, it is interesting to study how supervised filter bank designs work in this setup.
The dataset consists of 1317 music snippets of 30 s each, distributed evenly among the following 11 music genres: Alternative, Country, Easy Listening, Electronica, Jazz, Latin, PopDance, RapHiphop, RB and Soul, Reggae, and Rock; the only exception is the Latin category, which contains just 117 samples. The snippets are MP3-encoded music with a bitrate of 128 kbps or higher, downsampled by a factor of two to 22050 Hz. Note that this dataset has on average 1.83 songs per artist, which is another reason why genre classification is so hard on it.
For comparison purposes, we consider the Philips Filter Bank proposed in  for a music genre classification task. As we explained at the end of Section IV, we will use periodograms of length D = 129, so that the size of the matrix characterizing the Philips Filter Bank, as well as that of the supervised banks designed with any of our methods, will be n_f × 129, with n_f = 4 for the Philips Filter Bank.
Due to the lack of a specific test subset, we apply a 10-fold cross-validation procedure in order to measure the classification accuracy of every method. In each fold, we design the optimal filters with nine partitions of the data, as described in Section IV, and subsequently evaluate the performance on the remaining partition. Note that samples of the same song are never divided into different partitions, i.e., partitions are defined in terms of songs instead of individual samples of the dataset.
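A song-level partitioning of this kind can be obtained with scikit-learn's GroupKFold (an illustrative sketch; the function and variable names are ours):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def song_level_folds(n_samples, song_ids, n_folds=10):
    """Yield (train, test) index arrays such that all samples of the
    same song fall in the same partition."""
    gkf = GroupKFold(n_splits=n_folds)
    X_dummy = np.zeros((n_samples, 1))  # features are irrelevant for splitting
    for train_idx, test_idx in gkf.split(X_dummy, groups=song_ids):
        yield train_idx, test_idx
```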
A comparison among the supervised filter approaches and the Philips Filter Bank is displayed in Table VII. This table shows the OA (averaged over the 10 folds) when the first 4 and the first 10 filters in the bank are considered. In the case of the Philips Filter Bank, results are only analyzed with 4 filters, since this is its maximum number of available filters. Table VII also includes the rate of non-zero coefficients of the filters (NZ), as well as the time required to design the different filter banks. To complete this analysis, Figure 10 displays the averaged OA as a function of the number of filters for all methods under study.
As we explained in Section II, two of our proposed methods (P-NOPLS and NMF-OPLS) lack the ability to sort the filters in the bank with respect to the importance of each filter. As a consequence of this shortcoming, when only a few filters are used their performance may be adversely affected, as is the case here, where these methods are even outperformed by the Philips Filter Bank. Regarding the remaining supervised filters, it is not so clear which one presents the best performance: although POPLS has the best accuracy, NOPLS obtains a similar performance with a lower percentage of non-zero coefficients. Moreover, the accuracies of both defNOPLS and NOPLS are the best ones when few filters are used (see the left subfigure of Figure 10), significantly improving on the performance of the Philips Filter Bank. One of the advantages of defNOPLS with respect to NOPLS is that the sequential implementation of the algorithm makes it possible to automatically select the number of filters and stop the training process.
To conclude the section, Figure 11 shows the first 4 filters obtained on a single fold for the first MFCC, so that we can analyze the information provided by each filter bank. (We have checked that the differences between the filters obtained for each data fold are not very significant, so the presented conclusions can be easily generalized to the remaining folds.) Note that, similarly to the Philips Filter Bank, NOPLS, defNOPLS, and POPLS pay attention to three well-differentiated regions of the spectra, even though not in the same order: the lower modulation frequencies, which include components at the beat scale; the higher modulation frequencies, which are related to the perceptual roughness; and the modulation frequencies of instruments, which are the most important frequencies of the MFCC periodograms. However, the supervised approaches are more flexible in the definition of the filters, and can adjust the cut-off frequencies and even shape the filter waveform to obtain the best possible performance in the genre classification task. The superior performance of the supervised techniques leads us to conclude that it is convenient to use the available target data not only for training the final classifier, but also in the design of the filters used in the feature extraction stage.
[Table VII excerpt: Philips F. B. | 34.15 | - | 0.038 | 1.9 | -]
V-C Experiment 3: Comparison with deep learning solutions
To complete this experimental analysis, in this section we compare the performance of the proposed methods with deep learning approaches, since they are considered the state of the art in image recognition. For this purpose, we have selected the following Deep Neural Networks (DNNs):
T-CNN-3 , which is a 3-layer convolutional neural network (CNN), specifically designed for texture recognition problems,
AlexNet , which is probably one of the most popular state-of-the-art DNN architectures,
FV-CNN , which combines a CNN architecture with an internal representation based on the so-called Fisher vectors, and is the current state-of-the-art solution in texture classification,
and we will compare their performance with our proposed methods on two texture classification databases: kth-tips-2b (https://github.com/v-andrearczyk/caffe-TCNN) and DTD (https://www.robots.ox.ac.uk/~vgg/data/dtd/), with 11 and 47 classes, respectively. In order to compare all the algorithms fairly, the same training and test data partitions have been used for all the methods, as well as the same experimental configuration as .
The results for all DNN methods have been taken directly from . As we can see in Table VIII, our methods outperform these state-of-the-art techniques. Regarding the interpretability of the different solutions, we have reported the objective measure IM for our approaches. In spite of their good performance, the main criticism of DNNs is precisely that they can be seen as black boxes implementing a very large number of connections among computation nodes. Consequently, the number of filters is very large, which results in small IM values. To be more precise, and using the information available in , AlexNet uses 613,120 filters for a final IM of -4.7 and -4.1 in kth-tips-2b and DTD, respectively. For the T-CNN-3 method, the number of units reported in  is 1,824, and thus it obtains a final IM of -2.2 and -1.6 in kth-tips-2b and DTD, respectively. The IM of these DNN approaches is significantly smaller than that of our methods. In summary, our methods show competitive performance in these problems, while providing more interpretable solutions.
For completeness, we should also mention that pretrained DNNs have also been used in [59, 61], where the widely used ImageNet database serves as a pre-training step to adjust the intermediate units of the networks. Although this approach is very useful from a practical point of view,
these results cannot be directly compared to our methods, since a fair comparison would require specific procedures that allow our feature extractors to learn from additional databases such as ImageNet, which is out of the scope of this paper. Nevertheless, our methods still remain better than the results provided for the DNN approaches on the kth-tips-2b dataset, which are 73.2%, 71.5% and 81.5% for T-CNN-3, AlexNet and FV-CNN, respectively. However, using ImageNet together with the DTD problem allows deep learning networks to significantly improve their performance, achieving up to 75% for the current state-of-the-art FV-CNN method, which is much better than any of the results in Table VIII for that database. This suggests that, independently of the selected method, certain databases (such as DTD) probably do not contain enough diversity to make the design of a sufficiently good set of filters possible, and using external databases can be rather beneficial.
Regarding interpretability, Figure 12 shows that just 10 sparse filters can be very useful for image processing experts to understand the filter design. In the case of the DNN approaches, we have not shown their filters due to their high number, and because they are not ordered by any criterion of relevance to the classification.
In this paper, we have presented different methods for designing very versatile and interpretable filter banks for particular visual or audio classification tasks. All proposed methods are based on a supervised design with a common objective function, and differ in the way they try to solve this non-convex problem. As an alternative to the POPLS algorithm proposed in , in this paper we propose several far less time-consuming methods that obtain similar or even better performance than POPLS. Moreover, our proposals outperform the purposely designed and well-studied filter banks used in the state of the art of visual and audio applications.
We have illustrated the versatility of our methods in the experiments section, where we have tackled two very different classification tasks: texture and music genre classification. The advantages of our approaches over other feature extraction methods are: 1) they provide elegant physical interpretations of the extracted features; 2) they are more discriminative with a smaller number of filters; 3) they provide more interpretable and sparse solutions; and 4) they fit their filter banks to each particular task, unlike generic filter banks. Based on our findings, we can conclude that the block and deflated NOPLS algorithms obtain the best results in terms of accuracy, sparsity and CPU requirements and, therefore, should be preferred over the other methods.
-  K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate analysis. Academic press, 1980.
-  J. Arenas-García, K. Petersen, G. Camps-Valls, and L. K. Hansen, “Kernel multivariate analysis framework for supervised subspace learning: A tutorial on linear and kernel multivariate methods,” IEEE Signal Process. Mag., vol. 30, no. 4, pp. 16–29, 2013.
-  M. A. J. van Gerven, Z. C. Chao, and T. Heskes, “On the decoding of intracranial data using sparse orthonormalized partial least squares,” Journal of Neural Engineering, vol. 9, no. 2, pp. 26 017–26 027, 2012.
-  L. K. Hansen, “Multivariate strategies in functional magnetic resonance imaging,” Brain and Language, vol. 102, no. 2, pp. 186–191, 2007.
-  J. Arenas-García and G. Camps-Valls, “Efficient kernel orthonormalized PLS for remote sensing applications,” IEEE Trans. Geosci. Remote Sens., vol. 44, pp. 2872–2881, 2008.
-  J. Arenas-García and K. B. Petersen, “Kernel multivariate analysis in remote sensing feature extraction,” in Kernel Methods for Remote Sensing Data Analysis, G. Camps-Valls and L. Bruzzone, Eds. Wiley, 2009.
-  M. Barker and W. Rayens, “Partial least squares for discrimination,” Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
-  K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Magazine, vol. 2, no. 6, pp. 559–572, 1901.
-  H. Wold, “Non-linear estimation by iterative least squares procedures,” in Research Papers in Statistics. Wiley, 1966, pp. 411–444.
-  H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–377, 1936.
-  K. Worsley, J. Poline, K. Friston, and A. Evans., “Characterizing the response of pet and fMRI data using multivariate linear models (MLM),” Neuroimage, vol. 6, pp. 305–319, 1998.
-  M. Borga, T. Landelius, and H. Knutsson, “A unified approach to PCA, PLS, MLR and CCA,” Linköping University, SE-581 83 Linköping, Sweden, Report LiTH-ISY-R-1992, November 1997.
-  G. C. Reinsel and R. P. Velu, Multivariate reduced-rank regression: theory and applications. Springer New York, 1998.
-  S. Roweis and C. Brody, “Linear heteroencoders,” Gatsby Computational Neuroscience Unit, Tech. Rep. 1999-002, 1999.
-  J. Arenas-García, K. B. Petersen, and L. K. Hansen, “Sparse kernel orthonormalized PLS for feature extraction in large data sets,” Advances in Neural Information Processing Systems, vol. 19, pp. 33–40, 2007.
-  C. Dhanjal, S. R. Gunn, and J. Shawe-Taylor, “Efficient sparse kernel feature extraction based on partial least squares,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 31, no. 8, pp. 1347–1361, 2009.
-  L. Sun, S. Ji, S. Yu, and J. Ye, “On the equivalence between canonical correlation analysis and orthonormalized partial least squares,” in Proc. 21st Intl. Joint Conf. on Artificial Intelligence (IJCAI-09), Pasadena, California, USA, July 2009, pp. 1230–1235.
-  F. De la Torre, “A least-squares framework for component analysis,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 34, no. 6, pp. 1041–1055, 2012.
-  D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
-  D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud, “A variational EM algorithm for the separation of time-varying convolutive audio mixtures,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 8, pp. 1408–1423, 2016.
-  P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription,” in 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, NY: IEEE, October 2003, pp. 177–180.
-  V. P. Pauca, J. Piper, and R. J. Plemmons, “Nonnegative matrix factorization for spectral data analysis,” Linear algebra and its applications, vol. 416, no. 1, pp. 29–47, 2006.
-  E. Oja and M. Plumbley, “Blind separation of positive sources using non-negative PCA,” in Proc. 4th International Symposium on Independent Component Analysis and Blind Signal Separation, Nara, Japan, April 2003.
-  G. I. Allen, C. Peterson, M. Vannucci, and M. Maletić-Savatić, “Regularized partial least squares with an application to NMR spectroscopy,” Statistical Analysis and Data Mining, vol. 6, no. 4, pp. 302–314, 2013.
-  C. Sigg, B. Fischer, B. Ommer, V. Roth, and J. Buhmann, “Nonnegative CCA for audiovisual source separation,” in Proc. IEEE Intl. Workshop on Machine Learning for Signal Processing, Thessaloniki, Greece, August 2007, pp. 253–258.
-  J. Arenas-García, J. Larsen, L. K. Hansen, and A. Meng, “Optimal filtering of dynamics in short-time features for music organization,” in Proc. 7th Intl. Conf. on Music Information Retrieval (ISMIR), Victoria, Canada, October 2006, pp. 290–295.
-  H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
-  D. Hardoon and J. Shawe-Taylor, “Sparse canonical correlation analysis,” Machine Learning, vol. 83, no. 3, pp. 331–353, 2011.
-  S. Muñoz Romero, J. Arenas-García, and V. Gómez-Verdejo, “Iterative orthonormalized partial least squares with sparsity constraints,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Process. (ICASSP). Vancouver, BC, Canada: IEEE, May 2013, pp. 3387–3391.
-  M. R. Turner, “Texture discrimination by Gabor functions,” Biological Cybernetics, vol. 55, no. 2-3, pp. 71–82, 1986.
-  I. Fogel and D. Sagi, “Gabor filters as texture discriminator,” Biological Cybernetics, vol. 61, no. 2, pp. 103–113, 1989.
-  F. Bianconi and A. Fernández, “Evaluation of the effects of Gabor filter parameters on texture classification,” Pattern Recognition, vol. 40, no. 12, pp. 3325–3335, 2007.
-  W. Li, K. Mao, H. Zhang, and T. Chai, “Designing compact Gabor filter banks for efficient texture feature extraction,” in Proc. 11th Intl. Conf. on Control Automation Robotics & Vision (ICARCV), Singapore, December 2010, pp. 1193–1197.
-  J. Han and K.-K. Ma, “Rotation-invariant and scale-invariant Gabor features for texture image retrieval,” Image and Vision Computing, vol. 25, no. 9, pp. 1474–1481, 2007.
-  T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 24, no. 7, pp. 971–987, 2002.
-  Z. Fu, G. Lu, K. M. Ting, and D. Zhang, “A survey of audio-based music classification and annotation,” IEEE Trans. Multimedia, vol. 13, no. 2, pp. 303–319, 2011.
-  N. Scaringella, G. Zoia, and D. Mlynek, “Automatic genre classification of music content: a survey,” IEEE Signal Process. Mag., vol. 23, no. 2, pp. 133–141, 2006.
-  B. L. Sturm, “On music genre classification via compressive sampling,” in Proc. IEEE Intl. Conf. on Multimedia and Expo (ICME 2013), San Jose, USA, July 2013.
-  M. F. McKinney and J. Breebaart, “Features for audio and music classification,” in Proc. Intl. Symposium on Music Information Retrieval (ISMIR), vol. 3, Baltimore, Maryland, USA, October 2003, pp. 151–158.
-  C. Bishop, Neural Networks for Pattern Recognition. New York (NY): Oxford University Press, 1995.
-  J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis. Cambridge University Press, 2004.
-  S. Muñoz Romero, J. Arenas-García, and V. Gómez-Verdejo, “Sparse and kernel opls feature extraction based on eigenvalue problem solving,” Pattern Recognition, vol. 48, no. 5, pp. 1797 – 1811, 2015.
-  S. Muñoz Romero, V. Gómez-Verdejo, and J. Arenas-García, “Regularized multivariate analysis framework for interpretable high-dimensional variable selection,” IEEE Computational Intelligence Magazine, vol. 11, no. 4, pp. 24–35, Nov 2016.
-  P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
-  M. H. Van Benthem and M. R. Keenan, “Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems,” Journal of chemometrics, vol. 18, no. 10, pp. 441–450, 2004.
-  J. Kim and H. Park, “Toward faster nonnegative matrix factorization: A new algorithm and comparisons,” in Proc. 8th IEEE Intl. Conf. on Data Mining (ICDM’08). Pisa, Italy: IEEE, December 2008, pp. 353–362.
-  T. Hastie, J. Taylor, R. Tibshirani, and G. Walther, “Forward stagewise regression and the monotone lasso,” Electronic Journal of Statistics, vol. 1, pp. 1–29, 2007.
-  L. W. Mackey, “Deflation methods for sparse PCA,” in Advances in neural information processing systems, vol. 21. The MIT Press, 2008, pp. 1017–1024.
-  D. Seung and L. Lee, “Algorithms for non-negative matrix factorization,” Advances in neural information processing systems, vol. 13, pp. 556–562, 2001.
-  Z. Yuan and E. Oja, “Projective nonnegative matrix factorization for image compression and feature extraction,” in Proc. 14th Scandinavian Conf. Image Analysis (SCIA 2005), Joensuu, Finland, June 2005, pp. 333–342.
-  S. Choi, “Algorithms for orthogonal nonnegative matrix factorization,” in Proc. IEEE Intl. Joint Conf. on Neural Networks, IJCNN 2008, Hong Kong, China, June 2008, pp. 1828–1832.
-  C. Boutsidis and E. Gallopoulos, “SVD based initialization: A head start for nonnegative matrix factorization,” Journal of Pattern Recognition, vol. 41, no. 4, pp. 1350–1362, 2008.
-  N. Gillis and F. Glineur, “Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization,” Neural computation, vol. 24, no. 4, pp. 1085–1105, 2012.
-  C. J. Lin, “On the convergence of multiplicative update algorithms for nonnegative matrix factorization,” IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1589–1596, Nov 2007.
-  J.-J. Aucouturier, F. Pachet, and M. Sandler, “‘The way it sounds’: timbre models for analysis and retrieval of music signals,” IEEE Trans. Multimedia, vol. 7, no. 6, pp. 1028–1035, December 2005.
-  M. I. Mandel, G. E. Poliner, and D. P. Ellis, “Support vector machine active learning for music retrieval,” Multimedia systems, vol. 12, no. 1, pp. 3–13, 2006.
-  A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, “Temporal feature integration for music genre classification,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 15, no. 5, pp. 1654–1664, 2007.
-  P. Brodatz, Textures: a photographic album for artists and designers. Dover New York, 1966, vol. 66.
-  A. Meng and J. Shawe-Taylor, “An investigation of feature models for music genre classification using the support vector classifier,” in Proc. 6th Intl. Conf. on Music Information Retrieval (ISMIR), London, UK, September 2005, pp. 604–609.