1 Introduction
Unsupervised learning is a branch of machine learning tasked with inferring functions that describe hidden structure in unlabeled data. In terms of neural networks, these functions are a set of affine transformations subjected to nonlinearities. Each function is described as a layer, and stacking layers leads to deep neural networks. These deep networks can be used as end-to-end systems utilizing all the underlying layers as one entity. A canonical example of such a network is the autoencoder. The authors in [1] describe autoencoders as neural networks that are trained to copy their inputs to their outputs. By placing multiple constraints on the network, it learns to extract hidden characteristics from data. Autoencoders can extract overcomplete basis functions, which can be used for denoising. Stacking layers of denoising autoencoders, which are trained layer-wise to locally denoise corrupted versions of their inputs, led to the development of Stacked Denoising Autoencoders [2].
In this article, we show that the individual layers can be repurposed for tasks different from the ones they were trained for. In other words, we use neural networks as tools to generate affine functions. Mathematically, these functions act as filters that process inputs to span a response set in which we can perform various image processing tasks. To test the applicability of the generated filters, we focus on image quality assessment and texture retrieval. These applications are a challenge for traditional deep learning networks due to the lack of the large labeled datasets that are essential for training. To overcome the requirements for labels and application-specific data, we learn filters from natural images in an unsupervised fashion. In practice, we need to train filters that are insensitive to the lower-order statistics of natural images, so that the trained model can work on disparately distributed images during testing. We achieve this by proposing a simple extension to the ZCA whitening procedure during preprocessing in Section 2.1, and describe the learning architecture that generates the filter sets in Section 2.2. We integrate the preprocessing and learning blocks to obtain the Unsupervised Learning Framework, which has two tuning parameters. These parameters can be adjusted to generate multiple filter sets capturing multi-order statistics, as discussed in Section 3. We utilize the generated filters to perform image quality assessment in Section 4 and texture retrieval in Section 5. Finally, we conclude our work in Section 6.
2 Unsupervised Learning Framework
In this section, we describe the Unsupervised Learning Framework, whose block diagram is shown in Fig. 1. The framework takes in an input $X$ and generates weights $W_{k,n}$ and bias $b_{k,n}$, which span our filter sets. The subscripts $k$ and $n$ correspond to the tuning parameters that determine filter characteristics: the number of whitening iterations $k$ (Section 2.1) and the number of hidden-layer neurons $n$ (Section 2.2).
2.1 Preprocessing
As shown in Fig. 1, the preprocessing stage converts raw images from the input database into patch-based vectors $x_i$ and feeds them into the Zero-Phase Component Analysis (ZCA) whitening algorithm [3]. ZCA whitening is a decorrelating procedure that removes the first- and second-order statistics of the input data and forces the ensuing learning stage to capture higher-order statistics. Essentially, this specifies the covariance matrix of the output $P$ to be the identity matrix $I$. Therefore, the sample covariance matrix of $P$ is given by

$$\frac{1}{N} \sum_{i=1}^{N} p_i p_i^T = I, \qquad (1)$$

where $N$ is the number of input vectors $x_i$, and the $p_i$ are the columns of $P$. For a large value of $N$, the sample covariance estimate used in Eq. 1, which is the maximum likelihood estimate (MLE), would be close to the actual covariance matrix. ZCA whitening attempts to find a set of linear, symmetric filters $W_{ZCA}$ that transform $X$ to $P$. The output of the preprocessing block is given by

$$P = W_{ZCA} X, \qquad (2)$$

where $X$, the matrix whose columns are the $x_i$, is assumed to be zero-centered. Following the approach described by the authors in [3] and [4], we calculate $W_{ZCA}$ as

$$W_{ZCA} = \left( \frac{1}{N} X X^T \right)^{-\frac{1}{2}}. \qquad (3)$$
The outer product $X X^T$ is symmetric and orthogonally diagonalizable. Hence, applying eigenvalue decomposition to $\frac{1}{N} X X^T$, we obtain

$$\frac{1}{N} X X^T = U D U^T, \qquad (4)$$

where the columns of $U$ are the eigenvectors and the diagonal entries of $D$ are the eigenvalues of $\frac{1}{N} X X^T$. Raising both sides of Eq. 4 to the power $-\frac{1}{2}$, we get

$$\left( \frac{1}{N} X X^T \right)^{-\frac{1}{2}} = U D^{-\frac{1}{2}} U^T. \qquad (5)$$

Combining Eq. 5 and Eq. 3, we obtain an expression for $W_{ZCA}$ as

$$W_{ZCA} = U D^{-\frac{1}{2}} U^T. \qquad (6)$$

The matrix is, however, ill-conditioned. Therefore, a regularization term $\epsilon$ is added to the eigenvalues, so that the inverses of the smaller eigenvalues do not arbitrarily scale the result. Hence, the expression for the filters of the standard ZCA whitening algorithm is given by

$$W_{ZCA} = U (D + \epsilon I)^{-\frac{1}{2}} U^T. \qquad (7)$$
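As a concrete illustration, the regularized whitening transform of Eq. 7 can be sketched in a few lines of NumPy. The value of the regularizer $\epsilon$ and the toy data below are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def zca_whitening_matrix(X, eps=1e-2):
    """Regularized ZCA whitening matrix (Eq. 7).

    X:   (d, N) matrix of zero-centered patch vectors, one per column.
    eps: regularization added to the eigenvalues (epsilon in Eq. 7);
         the value here is an illustrative assumption.
    """
    N = X.shape[1]
    C = (X @ X.T) / N                          # sample covariance (Eq. 1/3)
    evals, U = np.linalg.eigh(C)               # eigendecomposition (Eq. 4)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(evals + eps))
    return U @ D_inv_sqrt @ U.T                # symmetric W_ZCA (Eq. 7)

# Toy usage: whiten correlated 2-D data.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10000))
X[1] += 0.8 * X[0]                             # introduce pixel-wise correlation
X -= X.mean(axis=1, keepdims=True)             # zero-center
W = zca_whitening_matrix(X)
P = W @ X                                      # whitened patches (Eq. 2)
```

After whitening, the sample covariance of `P` is close to, but not exactly, the identity; the residual is the approximation caused by $\epsilon$ that motivates the iterative extension below.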
Applying $W_{ZCA}$ to $X$ produces whitened patches whose covariance matrix is ideally $I$. In practice, this is not always the case. Fig. 2(a) shows the covariance matrix of the raw patches, while Fig. 2(b) shows the covariance matrix of patches whitened using Eq. 2 and Eq. 3. The nine distinct blocks in Fig. 2(a) are present because of the three separate channels (RGB), whose distributions are different but not entirely uncorrelated. These blocks are still present in Fig. 2(b) because of the approximation caused by the addition of $\epsilon$. Such an approximation is tolerable as long as the testing data distribution is similar to the training data distribution. Since we propose utilizing texture images for testing, which do not adhere to natural scene statistics, an approximate identity covariance matrix is not acceptable. Hence, we propose a simple extension to the standard ZCA algorithm in which we apply Eq. 2 iteratively on the initial input $X$, $k$ times, where $k$ is specified as an input to the algorithm. Using this extension, the intermediate whitened vector $P_j$ at iteration $j$ is given by

$$P_j = W_j P_{j-1}, \qquad P_0 = X, \qquad (8)$$

where $W_j$ is the intermediate whitening matrix at iteration $j$, computed from the covariance of $P_{j-1}$. As $P_j$ can be completely described with the knowledge of $P_{j-1}$, the sequence of matrices satisfies the Markov property. Hence, after calculating all the matrices using Eq. 8, we express the final whitened patches $P$ as

$$P = W_k W_{k-1} \cdots W_1 X, \qquad (9)$$

where $W_j$ and $P_{j-1}$ are the whitening filters and input vectors at iteration $j$. Setting $k = 1$ gives the standard ZCA algorithm. Taking $k$ larger, we obtain the covariance matrix shown in Fig. 2(c). Essentially, by performing whitening multiple times, the effect of $\epsilon$ is reduced. This ensures that Eq. 1 holds and the lower-order statistics of natural images are eliminated.
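The iterative extension of Eqs. 8 and 9 can be sketched as below. The regularizer value and the toy data are assumptions for illustration only.

```python
import numpy as np

def iterative_zca(X, k, eps=1e-2):
    """Extended ZCA: apply the whitening of Eq. 8 k times.

    k = 1 recovers the standard ZCA algorithm; eps is an assumed
    regularizer value (epsilon in Eq. 7)."""
    P = X - X.mean(axis=1, keepdims=True)
    for _ in range(k):
        C = P @ P.T / P.shape[1]                            # sample covariance
        evals, U = np.linalg.eigh(C)
        W = U @ np.diag(1.0 / np.sqrt(evals + eps)) @ U.T   # W_j (Eq. 7)
        P = W @ P                                           # P_j = W_j P_{j-1} (Eq. 8)
    return P

# Toy comparison: repeated whitening pushes the covariance closer to I.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10000))
X[1] += 0.8 * X[0]                                          # correlated toy data
P1 = iterative_zca(X, 1)                                    # standard ZCA
P5 = iterative_zca(X, 5)                                    # extended ZCA
```

Each pass shrinks the bias that $\epsilon$ introduces into the output covariance, which is the effect visible when moving from Fig. 2(b) to Fig. 2(c).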
In common practice, the whitening filters calculated during training are used to transform the test data as well, because there is not enough test data to estimate the sample covariance matrix in Eq. 1. However, $W_{ZCA}$ is trained on natural images and is not suitable for transforming texture or distorted images. Hence, we calculate $W_{ZCA}$ during testing as well. Even though the amount of test data is insufficient to estimate the covariance matrix precisely, the estimate is not singular, and for the present we overlook this approximation.
2.2 Linear Decoder
The linear decoder is a variation of the standard autoencoder in which the sigmoid nonlinearity at the final reconstructed output is replaced with a linear activation function. The primary task of this network is to learn representative structure in its input data while attempting to reconstruct it. Adding a sparsity criterion forces the network to learn unique statistical features from the data rather than an identity function in the hidden layer [1]. The hidden layer responses $H$, with a sigmoid nonlinearity $\sigma(\cdot)$ as the activation, are obtained as

$$H = \sigma(W_1 X + b_1), \qquad (10)$$

where $W_1$ and $b_1$ are the forward weights and bias, and $X$ is the input patch matrix. Eq. 10 is an affine function followed by a nonlinearity. The rows of $W_1$ represent a set of $n$ filters of dimension $d$ that linearly transform $X$. If $n$ is greater than $d$, we obtain an overcomplete basis which, coupled with sparsity, learns localized features [1]. If $n$ is lower than $d$, we have an undercomplete basis set in which the network is forced to learn the most salient features. In this work, we use $n$ as a tuning parameter of the learning framework. The responses $H$ are used to reconstruct the patches $\hat{X}$, using a set of backward weights $W_2$ and bias $b_2$, as

$$\hat{X} = W_2 H + b_2. \qquad (11)$$
Backpropagation is used to train the weights and biases by setting up the objective function as

$$J = \frac{1}{2N} \sum_{i=1}^{N} \lVert \hat{x}_i - x_i \rVert^2 + \beta \sum_{j=1}^{n} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) + \frac{\lambda}{2} \left( \lVert W_1 \rVert_F^2 + \lVert W_2 \rVert_F^2 \right), \qquad (12)$$

where the first term is the reconstruction error, the second term is the sparsity constraint, and the third term is the weight decay term used for regularization [5]. The KL divergence, $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j}$, is used as a proxy for achieving sparsity because of its differentiability, which ensures that Eq. 12 is viable for L-BFGS minimization. $\rho$ is the desired sparse average activation, while $\hat{\rho}_j$ is the actual average activation of hidden unit $j$. The values for $\rho$, $\beta$, and $\lambda$ are set as suggested by the authors in [6].
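A minimal NumPy sketch of the objective in Eq. 12 is given below, assuming illustrative values for $\rho$, $\beta$, and $\lambda$ (the values used in our experiments follow [6]).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_decoder_objective(X, W1, b1, W2, b2, rho=0.05, beta=3.0, lam=3e-3):
    """Objective of Eq. 12: reconstruction error + KL sparsity + weight decay.

    rho, beta, lam are illustrative defaults, not the values of [6]."""
    N = X.shape[1]
    H = sigmoid(W1 @ X + b1[:, None])          # hidden responses (Eq. 10)
    X_hat = W2 @ H + b2[:, None]               # linear reconstruction (Eq. 11)
    recon = 0.5 * np.sum((X_hat - X) ** 2) / N
    rho_hat = H.mean(axis=1)                   # average activation per neuron
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return recon + beta * kl + decay

# Toy usage on random data (d-dim patches, n hidden neurons, N patches).
rng = np.random.default_rng(1)
d, n, N = 8, 12, 50
X = rng.standard_normal((d, N))
W1, b1 = 0.1 * rng.standard_normal((n, d)), np.zeros(n)
W2, b2 = 0.1 * rng.standard_normal((d, n)), np.zeros(d)
J = linear_decoder_objective(X, W1, b1, W2, b2)
```

In practice this scalar objective (and its gradient) would be handed to an L-BFGS optimizer; only the cost itself is sketched here.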
3 Filter Set Generation
We use the proposed framework to generate filter sets that are trained with natural images from the ImageNet database [7]. Random patches are sampled from a set of randomly chosen images to form the input $X$. These patches are preprocessed with different values of $k$ to train the weights, which are visualized in Fig. 2(d)-(f). The number of neurons $n$ in the hidden layer represents the number of filters in that set. The covariance matrices of the patches that generate these filters are visualized in Fig. 2(a)-(c). Without preprocessing ($k = 0$), the filters primarily learn the first-order statistics. Edges are predominant after standard whitening ($k = 1$); these are the independent components of natural images [3]. For larger $k$, natural scenes lose their pixel-wise correlations, and hence there are no discernible simple structures, including edges.
4 Image Quality Assessment
Perceptual image quality assessment (IQA) is a challenging field whose objective is to analyze an image and estimate its quality as perceived by humans. Database generation in this domain is a challenge, since it requires gathering subjective scores for distorted images. Learning-based approaches are commonly employed in the literature to estimate quality [8]-[11]. However, all of these methods are trained with distortion-specific images that may not be available or sufficient. To overcome this limitation of insufficient data, we proposed an unsupervised approach to image quality assessment called UNIQUE [12], and its extension MS-UNIQUE [13], that use only generic natural images during training. In this section we recast both of these estimators using the learning framework described in Section 2.
4.1 UNIQUE: Unsupervised Image Quality Estimation
The block diagram in Fig. 3 shows UNIQUE trained with the proposed framework. The training uses the standard ZCA procedure ($k = 1$) and an overcomplete architecture with $n$ greater than the patch dimension [6]. The feature vectors are obtained by filtering non-overlapping preprocessed patches of both the reference and distorted images. The two vectors are compared using Spearman correlation to estimate quality. A complete description of the algorithm is provided in [12].
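The comparison step can be sketched as below. The sigmoid responses and the rank-based Spearman computation are simplifying assumptions; the response thresholding and other details of the full algorithm in [12] are omitted.

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation via ranks (no tie handling; adequate for a sketch)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def unique_score(ref_patches, dist_patches, W, b):
    """Sketch of the UNIQUE comparison: filter whitened patches of the
    reference and distorted images with learned weights W, b, then
    correlate the flattened response vectors."""
    def responses(P):
        return 1.0 / (1.0 + np.exp(-(W @ P + b[:, None])))
    return spearman(responses(ref_patches).ravel(),
                    responses(dist_patches).ravel())

# Toy usage with random "learned" filters and 3x3 whitened patches.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((16, 9)), np.zeros(16)
ref = rng.standard_normal((9, 40))
dist = ref + 0.5 * rng.standard_normal(ref.shape)     # simulated distortion
score_same = unique_score(ref, ref, W, b)             # identical images
score_dist = unique_score(ref, dist, W, b)            # distorted image
```

An undistorted image yields a correlation of 1 with itself, and the score drops as distortion perturbs the response ranks.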
4.2 MS-UNIQUE: Multi-Model and Sharpness-Weighted Unsupervised Image Quality Estimation
MS-UNIQUE was designed to have a wide configuration, with independent networks modeling the same input using different numbers of neurons. Such a task is directly achieved by varying $n$ in the proposed framework, hence casting MS-UNIQUE as an instance of the framework. We use multiple instances of $n$, spanning both undercomplete and overcomplete states. In MS-UNIQUE, we also classify the filters based on sharpness, as being activated by either edge or color content. A detailed description of the algorithm is provided in [13].
4.3 Validation
A detailed validation of UNIQUE and MS-UNIQUE is provided in [12] and [13]. Table 1 lists the results of state-of-the-art estimators on the LIVE [14] and TID2013 [15] databases. The estimated scores from each method are validated against the subjective scores. Validation is performed utilizing metrics that analyze accuracy (RMSE), consistency (outlier ratio), linearity (Pearson correlation), and monotonic behavior (Spearman correlation). It is evident that both UNIQUE and MS-UNIQUE are among the top performing estimators.
Methods                            PSNR-HMA  MS-SSIM  SR-SIM  FSIMc  UNIQUE  MS-UNIQUE
                                   [16]      [17]     [18]    [19]   [12]    [13]
Outlier Ratio
  TID13                            0.670     0.697    0.632   0.727  0.640   0.611
Root Mean Square Error
  LIVE                             6.58      7.43     7.54    7.20   6.76    6.61
  TID13                            0.69      0.68     0.61    0.68   0.60    0.57
Pearson Correlation Coefficient
  LIVE                             0.958     0.946    0.945   0.950  0.956   0.958
  TID13                            0.827     0.832    0.866   0.832  0.870   0.884
Spearman Correlation Coefficient
  LIVE                             0.944     0.951    0.955   0.959  0.952   0.949
  TID13                            0.817     0.785    0.807   0.851  0.860   0.870
5 Texture Retrieval
The aim of texture retrieval is to detect images that are sampled from the same repeating texture as a given query image. Algorithms designed for recognizing and retrieving textures need to filter the entire contents of an image rather than non-overlapping patches of it. Hence, to perform a global filtering operation, we design a pyramidal structure of filter banks, which can be mimicked by stacking multiple hidden layers during training.
5.1 Training Filter Sets
The block diagram in Fig. 5 gives an overview of the filter set generation module. We divide the filtering process into two phases based on the data characteristics we wish to capture: color-based filtering and structure-based filtering. For color-based filtering, we train filters on unwhitened data ($k = 0$); the network is thus limited to learning only the first-order statistics. These filters are visualized in Fig. 2(d). For capturing structure, we process the entire image by building multiple layers of filter sets. Each layer is a self-contained filter set, acting on a specific sub-region. Following the illustration provided on the right side of Fig. 5, we train filters to act on small sub-regions. The responses of nine adjoining patches are concatenated to process larger sub-regions and are passed through a max pooling layer to retain the maximum values from each sub-region. Pooling is required to limit the input dimensions, since it is challenging for an autoencoder to extract features from high-dimensional data. Moreover, pooling provides translation invariance. The pooled feature responses are concatenated to obtain a response vector that includes values from every part of the image. A final filter set is trained to transform these responses into the final feature vector.
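The pooling-and-concatenation step can be sketched as follows. The grouping of nine adjoining patches per block follows the description above, while the exact array layout is an assumption made for illustration.

```python
import numpy as np

def pool_and_concat(responses, pool=9):
    """Max-pool filter responses over blocks of `pool` adjoining patches and
    concatenate the pooled maxima into one feature vector.

    responses: (n_filters, n_patches) array of filter outputs, ordered so
    that each consecutive block of `pool` columns covers nine adjoining
    patches of one sub-region."""
    n_f, n_p = responses.shape
    if n_p % pool != 0:
        raise ValueError("n_patches must be a multiple of the pool size")
    pooled = responses.reshape(n_f, n_p // pool, pool).max(axis=2)
    return pooled.ravel()                     # concatenated feature vector

# Toy usage: 2 filters over 9 patches -> one max per filter.
resp = np.arange(18.0).reshape(2, 9)
feat = pool_and_concat(resp, pool=9)
```

Keeping only the per-block maxima reduces the dimension handed to the next filter set and makes the representation tolerant to small translations of the texture.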
5.2 Retrieval
The texture images in the database are first resized to low-resolution patches, as shown on the top right of Fig. 5. These low-resolution images essentially represent the mean color. Without any preprocessing, all the images are filtered using the generated color filter sets. The images with colors relatively similar to the query image are selected and examined for structure. Each of these texture images is resized to a fixed dimension. The block diagram in Fig. 5 shows the progression of spatial filtering to obtain a final feature vector that encompasses the entire image. Once all feature vectors are obtained, the Spearman correlation is calculated between all features in the database. Based on the similarity coefficients, the texture images most similar to the query image are retrieved.
5.3 Validation
We test the retrieval process on the CUReT database [20]. Non-overlapping patches were extracted from all images at a fixed viewing condition, as detailed by the authors in [21]. The database contains multiple texture classes, each with several samples. To quantify the results of the proposed method, we use three standard retrieval metrics: precision at one (P@1), mean reciprocal rank (MRR), and mean average precision (MAP). All of these metrics produce results in the range [0, 1], with 1 being the ideal score. The results are presented in Table 2. The proposed method is measured against commonly utilized and state-of-the-art techniques. It can be observed that the proposed method is always among the top two performers, second only to STSIM-1 [22], which is a handcrafted technique designed to measure texture similarity.
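The three retrieval metrics can be sketched as below, assuming each query yields a binary relevance list ordered by similarity (1 marks a retrieved image from the query's texture class).

```python
import numpy as np

def retrieval_metrics(ranked_relevance):
    """Compute P@1, MRR, and MAP.

    ranked_relevance: iterable of 0/1 sequences, one per query, ordered
    from most to least similar to the query."""
    p1, rr, ap = [], [], []
    for rel in ranked_relevance:
        rel = np.asarray(rel, dtype=float)
        p1.append(rel[0])                      # precision at rank one
        hits = np.flatnonzero(rel)
        if hits.size:
            rr.append(1.0 / (hits[0] + 1))     # reciprocal rank of first hit
            prec = np.cumsum(rel) / (np.arange(rel.size) + 1)
            ap.append(prec[hits].mean())       # average precision
        else:
            rr.append(0.0)
            ap.append(0.0)
    return float(np.mean(p1)), float(np.mean(rr)), float(np.mean(ap))

# Toy usage: two queries with hand-made relevance lists.
ranked = [[1, 0, 1], [0, 1, 0]]
p1, mrr, m_ap = retrieval_metrics(ranked)
```

For the toy lists above, the first query is relevant at rank one while the second is not, so P@1 = 0.5, MRR = 0.75, and MAP = 2/3.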
5.4 Robustness
We examine the proposed method's robustness against noise. Texture retrieval is performed after adding random noise to the CUReT database, with standard deviation $\sigma$ ranging from 5 to 100. The MAP results of the top metrics are presented in Table 3.

σ     CW-SSIM  STSIM-1  STSIM-2  STSIM-M  Proposed
5     0.1939   0.9017   0.8433   0.8579   0.8931
25    0.1700   0.8576   0.7478   0.8033   0.8600
50    0.1185   0.7450   0.5839   0.6990   0.8535
75    0.1204   0.6825   0.4460   0.6218   0.8351
100   0.0876   0.5465   0.3487   0.4826   0.8219
The results clearly indicate that the drop-off in MAP values for the proposed method is lower than that of the compared techniques. The robustness of the proposed method is rooted in the expectation that a higher-level representation should be stable under corruptions of the input. The extension to the standard ZCA algorithm decorrelates adjoining pixel values in natural images, hence disrupting local structure in the data, a behavior equivalent to adding noise. The proposed method has been trained on decorrelated data and has learned to consider only the underlying principal components that constitute an image, ignoring the lower-order statistics.
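The corruption used in this experiment can be sketched as follows, assuming 8-bit images and zero-mean Gaussian noise at the standard deviations of Table 3.

```python
import numpy as np

def add_noise(image, sigma, rng=None):
    """Corrupt an 8-bit image with zero-mean Gaussian noise of standard
    deviation sigma, clipping back to the valid [0, 255] range."""
    rng = rng if rng is not None else np.random.default_rng()
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Toy usage: a flat gray image at two noise levels.
img = np.full((8, 8), 128, dtype=np.uint8)
clean = add_noise(img, 0, np.random.default_rng(0))    # sigma = 0: unchanged
noisy = add_noise(img, 25, np.random.default_rng(0))   # sigma = 25 (Table 3 row)
```

Retrieval would then be re-run on the corrupted copies to trace the MAP drop-off reported in Table 3.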
6 Conclusion
In this paper, we proposed the Unsupervised Learning Framework (ULF), a data-driven approach to challenges that lack sufficient domain-specific data and annotations. We proposed an extension to the classical ZCA algorithm that, in practice, orthogonalizes the input vectors, thereby eliminating the lower-order natural scene statistics. Orthogonalization allows the linear decoder to learn higher-order characterizations of the data. We modified the number of neurons in the linear decoder to obtain multiple filter sets modeling the same input and used them in parallel. We demonstrated the use of these filter sets for image quality assessment (IQA) and texture retrieval. In IQA, we showed that the previously proposed estimators UNIQUE and MS-UNIQUE are instances of the framework, and that the filter sets span a response space in which perceptual dissimilarity can be measured. We also showed that similarity is quantifiable in the same response space by retrieving texture images, and we established the robustness of the proposed method to noisy input data. In conclusion, we illustrated how unsupervised learning can generate robust and adaptive filter sets for use in various image processing applications.
References
 [1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
 [2] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
 [3] A. J. Bell and T. J. Sejnowski, “The “independent components” of natural scenes are edge filters,” Vision research, vol. 37, no. 23, pp. 3327–3338, 1997.
 [4] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
 [5] A. Ng, “Sparse autoencoder,” CS294A Lecture notes, vol. 72, no. 2011, pp. 1–19, 2011.
 [6] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, and C. Suen, “Ufldl tutorial,” 2012.

 [7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [8] C. Charrier, O. Lézoray, and G. Lebrun, “Machine learning to design full-reference image quality assessment algorithm,” Signal Processing: Image Communication, vol. 27, no. 3, pp. 209–219, 2012.

 [9] H. Tang, N. Joshi, and A. Kapoor, “Blind image quality assessment using semi-supervised rectifier networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2877–2884.
 [10] P. Ye, J. Kumar, L. Kang, and D. Doermann, “Real-time no-reference image quality assessment based on filter learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 987–994.
 [11] H. Chang and M. Wang, “Sparse correlation coefficient for objective image quality assessment,” Signal processing: Image communication, vol. 26, no. 10, pp. 577–588, 2011.
 [12] D. Temel, M. Prabhushankar, and G. AlRegib, “Unique: Unsupervised image quality estimation,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1414–1418, 2016.
 [13] M. Prabhushankar, D. Temel, and G. AlRegib, “Msunique: Multimodel and sharpnessweighted unsupervised image quality estimation,” Electronic Imaging, vol. 2017, no. 12, 2017.
 [14] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on image processing, vol. 15, no. 11, pp. 3440–3451, 2006.
 [15] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, et al., “Image database tid2013: Peculiarities, results and perspectives,” Signal Processing: Image Communication, vol. 30, pp. 57–77, 2015.
 [16] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, and M. Carli, “Modified image visual quality metrics for contrast change and mean shift accounting,” in CAD Systems in Microelectronics (CADSM), 2011 11th International Conference The Experience of Designing and Application of. IEEE, 2011, pp. 305–311.
 [17] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
 [18] L. Zhang and H. Li, “Srsim: A fast and high performance iqa index based on spectral residual,” in Image Processing (ICIP), 2012 19th IEEE International Conference on. IEEE, 2012, pp. 1473–1476.
 [19] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “Fsim: A feature similarity index for image quality assessment,” IEEE transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011.
 [20] K. J. Dana, B. van Ginneken, S. Nayar, and J. Koenderink, “Curet: Columbiautrecht reflectance and texture database,” 1999.
 [21] M. Alfarraj, Y. Alaudah, and G. AlRegib, “Contentadaptive nonparametric texture similarity measure,” in Multimedia Signal Processing (MMSP), 2016 IEEE 18th International Workshop on. IEEE, 2016, pp. 1–6.
 [22] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution grayscale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on pattern analysis and machine intelligence, vol. 24, no. 7, pp. 971–987, 2002.

 [23] M. P. Sampat, Z. Wang, S. Gupta, A. C. Bovik, and M. K. Markey, “Complex wavelet structural similarity: A new image similarity index,” IEEE Transactions on Image Processing, vol. 18, no. 11, pp. 2385–2401, 2009.