Semantically Interpretable and Controllable Filter Sets

02/17/2019 ∙ by Mohit Prabhushankar, et al. ∙ Georgia Institute of Technology

In this paper, we generate and control semantically interpretable filters that are directly learned from natural images in an unsupervised fashion. Each semantic filter learns a visually interpretable local structure in conjunction with other filters. The significance of learning these interpretable filter sets is demonstrated on two contrasting applications. The first application is image recognition under progressive decolorization, in which recognition algorithms should be color-insensitive to achieve robust performance. The second application is image quality assessment, where objective methods should be sensitive to color degradations. In the proposed work, the sensitivity and lack thereof are controlled by weighting the semantic filters based on the local structures they represent. To validate the proposed approach, we utilize the CURE-TSR dataset for image recognition and the TID 2013 dataset for image quality assessment. We show that the proposed semantic filter set achieves state-of-the-art performance on both datasets while maintaining its robustness across progressive distortions.


1 Introduction

Visual understanding is a research field that aims to provide semantically meaningful interpretation of the visual cues in data [1]. The objective of visual understanding algorithms is to compute mappings that lead to a representation space in which non-informative and informative representations are distinguishable. Traditional visual understanding algorithms are based on handcrafted approaches in which mappings between input pixel spaces and distinguishable representation spaces are tractable based on data-dependent characteristics [2]. Because of the tractable nature of the mapping functions, the representation space spanned by the handcrafted mappings is interpretable. The interpretability and tractability of the representation spaces, as well as their mappings, have enabled handcrafted visual understanding algorithms to achieve application generalizability. Hence, based on the same underlying representation space, multiple tasks on various applications can be performed using handcrafted feature mappings. For instance, mappings that rely on keypoint detection and feature extraction, such as SIFT, are used for numerous applications including image retrieval [3] and image quality assessment [4].

Recently, the availability of big data and advancements in computational resources have enabled the development of powerful data-driven algorithms for various vision tasks [5, 6]. In data-driven approaches, representation spaces are directly learned from data and labels. A commonly used supervised data-driven method is the convolutional neural network (CNN), which learns a set of discriminative features between classes [6]. In this method, the discriminative features are extracted from the input pixel space using a set of learned convolutional filters and mapped to a latent representation space. To interpret the data-driven representation space, the latent representation is visualized in different ways. The authors in [7] proposed a visualization technique that projects a feature map of interest back to the input pixel space. In [8], the authors visualized image patches that maximally activate hidden units and quantified the interpretability of individual filters. These techniques help to interpret the representation space. However, we cannot directly leverage the information obtained from the interpretation because data-driven representations normally contain a mixture of abstract features [9]. Therefore, filters learned in a task-dependent manner do not easily generalize across applications, and such algorithms require additional training or fine-tuning for different tasks.

In this paper, we generate controllable semantic filter sets in an unsupervised fashion. The authors in [10] describe semantic filters as tools to extract subjectively meaningful structures from natural images. We use an autoencoder, an unsupervised neural network, to generate these filters. We demonstrate that, by using such filters, the performance gains of a data-driven approach and the application-generalizability of interpretable models can be combined. The contributions of this paper are threefold:


  • We analyze various methods to control the training phase of an autoencoder. The filters learned from the considered methods are visualized and validated based on their structural interpretability.

  • We group interpretable filter sets into semantically meaningful visual concepts that are based on color and edge characteristics.

  • We demonstrate the feasibility of semantic filter sets on two contrasting applications including image recognition and image quality assessment. Specifically, we test the robustness of these filters under mild to severe color degradation.

In [11] and [12], we proposed objective image quality estimators based on the representation space spanned by autoencoder filter sets, which were weighted with perceptually inspired formulations. In this paper, we develop further insight into the representation space by delving into the generation of the filter sets to embed semantic meaning within them. In particular, we investigate different regularization techniques for semantically meaningful filter sets and generalize them to applications including recognition under challenging domain-shifted conditions. In addition, we analyze the robust performance of image quality estimators under varying challenge levels. Such results provide further insight into understanding the correlation between objective predictors and subjective quality opinions.

Figure 1: Unsupervised training of an autoencoder.

2 Background

In this section, we describe a vanilla autoencoder network. An autoencoder is an unsupervised learning network that is trained to copy its inputs to its outputs [13]. The network consists of encoder and decoder functions as shown in Fig. 1. The encoder maps a matrix $X \in \mathbb{R}^{n \times m}$, consisting of $m$ input feature vectors each of size $n$, to a hidden layer with $K$ neurons. The encoder can also be considered as a set of $K$ forward filters, each of size $n$, whose responses are passed through a non-linearity to obtain $h$. Mathematically, $h$ can be represented as

$$h = \sigma(W X + b), \qquad (1)$$

where $h$ is the set of non-linear hidden layer responses to an affine filter set parameterized by weights $W$ and bias $b$. The sigmoidal non-linearity $\sigma(\cdot)$ is used throughout the rest of this work. The responses $h$ are mapped back to the input data space to obtain $\hat{X}$ using a decoder. The decoder is a set of backward affine filters parameterized by weights $W'$ and bias $b'$. The reconstructed output $\hat{X}$ is obtained as

$$\hat{X} = W' h + b'. \qquad (2)$$

Note that we use a linear decoder in our experiments. The forward and backward filters are simultaneously trained using backpropagation by constructing and minimizing a cost function $J$ between $X$ and $\hat{X}$, given by

$$J = \|X - \hat{X}\|_2^2 + \lambda R(W), \qquad (3)$$

where the first term is the Mean Square Error (MSE) between the input and the reconstructed output, and $R(W)$ is a regularization term weighted by a parameter $\lambda$. Regularization is a modification of an optimization objective function to reduce the generalization error without impacting the training error [13].

Practically, adding regularization to an autoencoder means that the network is restricted to copying only an approximation of the input as its output. This forces the autoencoder to prioritize the necessary aspects of the input, thereby learning a dictionary of useful properties that characterize the input data. The need for regularization is well recognized within the machine learning community [14], and a number of regularization techniques have been proposed. The commonly used techniques for autoencoders include weight penalties, derivative penalties, and training tricks such as early stopping [15]. In this work, we concentrate on analyzing weight penalties. This is in keeping with the overall theme of generating semantic filters that are parameterized by weights. While these techniques have already been proposed, our focus is to control the semantic meaning and visual concepts learned by the weights.
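To make Eqs. 1-3 concrete, the sketch below implements the forward pass and the regularized reconstruction cost of such an autoencoder in NumPy. The layer sizes, initialization, and the generic penalty passed in as `reg_fn` are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import numpy as np

def sigmoid(z):
    # Sigmoidal non-linearity used by the encoder (Eq. 1).
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_cost(X, W, b, W_dec, b_dec, reg_fn, lam=1e-3):
    """Forward pass and regularized cost of a vanilla autoencoder (Eqs. 1-3).

    X            : (n, m) matrix of m input feature vectors of size n
    W, b         : encoder filters (K, n) and bias (K, 1)
    W_dec, b_dec : linear decoder weights (n, K) and bias (n, 1)
    reg_fn       : regularization term R(W), e.g. an l1, l2, or elastic net penalty
    """
    h = sigmoid(W @ X + b)            # Eq. 1: non-linear hidden layer responses
    X_hat = W_dec @ h + b_dec         # Eq. 2: reconstruction with a linear decoder
    mse = np.mean((X - X_hat) ** 2)   # first term of Eq. 3
    return mse + lam * reg_fn(W), h, X_hat

# Toy usage with random data and an l2 penalty as a placeholder regularizer.
rng = np.random.default_rng(0)
n, m, K = 192, 1000, 64               # hypothetical patch size, sample count, filter count
X = rng.standard_normal((n, m))
W, b = 0.01 * rng.standard_normal((K, n)), np.zeros((K, 1))
W_dec, b_dec = 0.01 * rng.standard_normal((n, K)), np.zeros((n, 1))
cost, h, X_hat = autoencoder_cost(X, W, b, W_dec, b_dec, lambda W: np.sum(W ** 2))
print(f"cost = {cost:.4f}, hidden responses shape = {h.shape}")
```

Training would minimize this cost over both the forward and backward filters with backpropagation, as described above.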

Figure 2: (a), (b), (c): Reconstructed images from ℓ1, ℓ2, and elastic net regularization, respectively; (d), (e), (f): Weights of the autoencoder from ℓ1, ℓ2, and elastic net regularization, respectively.

3 Regularization Analysis

We train an autoencoder on the ImageNet database [5] to obtain encoder filters. 100 patches of fixed spatial dimension are sampled from each image and preprocessed using Zero-Phase Component Analysis (ZCA) whitening [16]. These whitened patches are fed into multiple autoencoders, each with a different weight penalty $R(W)$ from Eq. 3. The considered penalties are the ℓ1 norm or LASSO [18], the ℓ2 norm or ridge regression [17], and the elastic net [19], which is a weighted combination of the ℓ1 and ℓ2 penalties. While all of these penalties have been explored in regularization theory, we provide a qualitative analysis based on the visual concepts that each of the regularization techniques learns.
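The following is a minimal sketch of the ZCA whitening step, assuming the sampled patches have been flattened into columns of a data matrix; the epsilon and array shapes are placeholder choices, not our exact settings.

```python
import numpy as np

def zca_whiten(patches, eps=1e-5):
    """ZCA whitening of flattened image patches.

    patches : (n, m) matrix with one flattened patch per column.
    Returns the whitened patches and the whitening matrix.
    """
    # Zero-center every dimension across the patch set.
    X = patches - patches.mean(axis=1, keepdims=True)
    # Eigen-decompose the data covariance matrix.
    cov = X @ X.T / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate into the eigenbasis, rescale to unit variance, and rotate back;
    # the final rotation back is what distinguishes ZCA from PCA whitening.
    W_zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return W_zca @ X, W_zca

# Toy usage: 5000 random "patches" of length 192 (e.g., flattened 8x8x3 patches).
rng = np.random.default_rng(0)
patches = rng.standard_normal((192, 5000))
X_white, W_zca = zca_whiten(patches)
print(X_white.shape, W_zca.shape)
```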

3.1 ℓ1 penalty

The ℓ1 constraint promotes sparsity within the hidden layer responses $h$. The training cost function for ℓ1 regularization is given by

$$J = \|X - \hat{X}\|_2^2 + \lambda \|W\|_1, \qquad (4)$$

where $\lambda$ is a positive regularization parameter. The trained filters are visualized in Fig. 2(d). An example peppers image is passed through the autoencoder and its reconstructed image is visualized in Fig. 2(a). The fidelity of the reconstructed image is dB.

Theoretically, ℓ1 regularization can help with interpretability since, for every train and test case, only a few filters are activated. However, in practice the ℓ1 penalty suffers when there are correlated weights or filters. If there is a group of filters among which the pairwise correlations are very high, then the ℓ1 penalty selects only one of those correlated filters [19]. This filter selection issue is not suitable for applications where the representation space is used as features for further analysis. Filters learned from images have large correlations between them because of the inherent spatial correlation within the training data. Also, as shown in Fig. 2(d), the interpretability of the mapping functions is not obvious.

3.2 ℓ2 penalty

The ℓ2 penalty in the cost function is also called ridge regression, weight decay, or Tikhonov regularization [17]. This is a well-studied regularization technique that promotes shrinkage between filters. The cost function with the ℓ2 penalty on the weights is given by

$$J = \|X - \hat{X}\|_2^2 + \lambda \|W\|_2^2. \qquad (5)$$

By penalizing the filters in this manner, the input data produces dense responses. However, a mixture of large and small responses causes instability while learning [19]. Practically, this means that the representation space changes rapidly even with a slight domain shift between train and test data. For application-generalizability, the ℓ2 regularization may not be suitable. The reconstructed peppers image and the filter sets learned from this technique are provided in Fig. 2(b) and 2(e), respectively. Again, the filters are able to reconstruct the image with a PSNR score of dB. However, there is no semantic interpretability that can be associated with either the filters or the activations.

3.3 Elastic Net penalty

The elastic net was proposed in [19] to correct for the filter selection challenge faced by the ℓ1 penalty. Instead of one filter being activated from a group of correlated filters, all the correlated filters are activated at the same time. This is achieved by adding a weighted ℓ2 norm penalty to obtain within-group dense responses. A thorough explanation and analysis is provided in [19]. The minimization cost function for this regularization technique is given by

$$J = \|X - \hat{X}\|_2^2 + \lambda_1 \|W\|_1 + \lambda_2 \|W\|_2^2. \qquad (6)$$

The values of $\lambda_1$ and $\lambda_2$ are set as suggested by the authors in [20]. The reconstructed peppers image and the weights learned using the elastic net regularization are visualized in Fig. 2(c) and 2(f), respectively. The lower fidelity of the reconstructed image is expected because of the higher regularization penalties. However, it is obvious that each filter in Fig. 2(f) has a visual concept associated with it. These concepts include different colors, color gradients, and edges with multiple orientations. These weights are structurally meaningful and semantic in nature. While other works in sparse coding [16] and CNNs have shown that edges dominate the first layer with just the ℓ1 penalty, the complete demarcation between color and edge filters using the elastic net penalty is very interesting and warrants further investigation.
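For completeness, the sketch below writes the three weight penalties of Eqs. 4-6 as functions that can be plugged into the reconstruction cost of Eq. 3; the regularization weights shown are placeholders, not the values used in our experiments.

```python
import numpy as np

def l1_penalty(W, lam=1e-3):
    # Eq. 4 regularizer: LASSO-style sparsity on the encoder filter weights.
    return lam * np.sum(np.abs(W))

def l2_penalty(W, lam=1e-3):
    # Eq. 5 regularizer: ridge / weight-decay shrinkage on the filter weights.
    return lam * np.sum(W ** 2)

def elastic_net_penalty(W, lam1=1e-3, lam2=1e-3):
    # Eq. 6 regularizer: weighted combination of the l1 and l2 penalties,
    # which tends to activate groups of correlated filters together.
    return lam1 * np.sum(np.abs(W)) + lam2 * np.sum(W ** 2)

# Evaluate each penalty on a random set of 64 filters of size 192.
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((64, 192))
for name, fn in [("l1", l1_penalty), ("l2", l2_penalty), ("elastic net", elastic_net_penalty)]:
    print(f"{name}: {fn(W):.6f}")
```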

4 Semantic Visual Concepts

To leverage these semantic filters for different tasks, we further group them based on individual visual concepts. The grouping of semantic filter sets is performed based on the kurtosis measure of each filter. Kurtosis, $\kappa$, is defined as

$$\kappa = \frac{\mathbb{E}\left[(w - \mu)^4\right]}{\sigma^4}, \qquad (7)$$

where $w$ denotes the vectorized, zero-centered, and normalized encoder filter values, and $\mu$ and $\sigma$ are respectively the mean and the standard deviation of $w$. For every $\kappa$ greater than an empirically determined threshold, the filter is classified as an edge filter; otherwise, the filter is classified as a color filter. The threshold is chosen to ensure a complete demarcation between filters that represent color and edges. Edge filters have a higher kurtosis value because their values are concentrated in a localized area. Therefore, most of the filter values are located away from the mean of the distribution, leading to higher kurtosis. While the authors in [21] and [22] use elastic net regularization for learning, they do not go further to group filters by their representative visual concepts. We note that the entire process from obtaining interpretable filters to grouping semantic filters is conducted in an unsupervised fashion.
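A minimal sketch of this grouping step is given below; the threshold value, filter shapes, and synthetic filters are illustrative assumptions rather than the empirically determined settings.

```python
import numpy as np

def kurtosis(w):
    """Kurtosis of Eq. 7 for one vectorized encoder filter."""
    w = np.asarray(w, dtype=float).ravel()
    mu, sigma = w.mean(), w.std()
    return np.mean((w - mu) ** 4) / sigma ** 4

def group_filters(W, threshold):
    """Split encoder filters (one filter per row) into edge and color concepts.

    Filters with kurtosis above the threshold are treated as localized edge
    filters; the remaining filters are treated as color filters.
    """
    kappas = np.array([kurtosis(w) for w in W])
    edge_idx = np.flatnonzero(kappas > threshold)
    color_idx = np.flatnonzero(kappas <= threshold)
    return edge_idx, color_idx

# Toy usage: localized (high-kurtosis) vs. flat (low-kurtosis) synthetic filters.
rng = np.random.default_rng(0)
edge_like = np.zeros((32, 192))
edge_like[:, :8] = rng.standard_normal((32, 8))            # energy in a localized area
color_like = 1.0 + 0.01 * rng.standard_normal((32, 192))   # near-uniform response
W = np.vstack([edge_like, color_like])
edge_idx, color_idx = group_filters(W, threshold=5.0)
print(len(edge_idx), "edge filters,", len(color_idx), "color filters")
```

Because kurtosis is computed on zero-centered, normalized values, it separates filters whose energy is concentrated in a small region (edges) from those with near-uniform responses (colors).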

5 Experiments and Results

In this section, we demonstrate the advantages of learning semantic filters and categorizing them based on visual concepts with two contrasting applications:


  • Recognition under progressive decolorization challenge where the objective is to recognize decolorized images when trained on color images of the same class.

  • Image Quality Assessment (IQA) under color distortion where the goal is to objectively estimate the subjective scores of images affected by color distortion.

In these two applications, we show that the proposed semantic filter-based methods outperform state-of-the-art algorithms while maintaining robust performance under different levels of color distortion. For validation, we use the Challenging Unreal and Real Environments for Traffic Sign Recognition (CURE-TSR) dataset [23] and the TID 2013 dataset [24] for recognition and IQA, respectively. They contain images with decolorization and color distortion challenges across progressive levels. After the semantic filters are grouped based on their visual concepts, we prune the filters in an application-dependent manner. Pruning is carried out by applying application-dependent weights to the color and edge filter sets. We refer to this model as a semantic autoencoder (Sem-AE), and a general framework is illustrated in Fig. 3. The input X refers to the whitened patches extracted from images. The filter set is grouped into visual concepts and the corresponding responses are processed in an application-dependent manner.

Figure 3: General framework of the semantic autoencoder.
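The sketch below illustrates how the Sem-AE block of Fig. 3 could apply application-dependent weights to the responses of the two filter groups; the function and variable names, shapes, and weight values are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sem_ae_responses(X, W, b, edge_idx, color_idx, w_edge=1.0, w_color=1.0):
    """Encoder responses with application-dependent weights per filter group.

    X         : (n, m) whitened patches, one per column
    W, b      : encoder filters (K, n) and bias (K, 1)
    edge_idx  : indices of edge-concept filters
    color_idx : indices of color-concept filters
    w_edge, w_color : group weights (e.g., w_color = 0 prunes the color filters
                      for recognition; w_edge > w_color emphasizes edges for IQA)
    """
    h = sigmoid(W @ X + b)
    h[edge_idx, :] *= w_edge
    h[color_idx, :] *= w_color
    return h

# Toy usage in the recognition setting: the color filter group is pruned.
rng = np.random.default_rng(0)
X = rng.standard_normal((192, 10))
W, b = 0.01 * rng.standard_normal((64, 192)), np.zeros((64, 1))
edge_idx, color_idx = np.arange(0, 40), np.arange(40, 64)
h = sem_ae_responses(X, W, b, edge_idx, color_idx, w_edge=1.0, w_color=0.0)
print(h.shape, "| color-filter responses zeroed:", bool(np.all(h[color_idx] == 0)))
```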

5.1 Recognition

Figure 4: Visualization of filters that are maximally activated by each patch in the yield traffic sign image. (1st row: raw images, 2nd row: the maximally activated filters from both edge and color filters, 3rd row: the maximally activated filters from edge filters.)

The CURE-TSR dataset provides challenge-free traffic sign images spanning multiple classes for training. Testing is performed on images from the distortion-free category and from each of five different levels of the decolorization category. Examples of a distortion-free image and of two decolorization levels are shown in Fig. 4(a). In Fig. 4(b), we visualize the filters that show maximum activations for each patch in the distortion-free and decolorized images, among all filters including color and edge concepts. In Fig. 4(c), only the edge-concept filters are used for visualization. It is apparent that the representation obtained only from edge filters is not adversely affected by decolorization, while the representation obtained from all the filters changes significantly. Hence, we set the weight of the color filter set to zero so that only the edge filters are used to obtain features. A softmax classifier is trained on these edge features as part of the application-dependent post-processing block in Fig. 3. Performance accuracies of the Sem-AE-based recognition algorithm across decolorization levels are plotted in Fig. 5. We compare the proposed algorithm with the baseline algorithms detailed in [23]. In addition, we show the performance of softmax classifiers trained on ℓ1 (AE(L1)), ℓ2 (AE(L2)), and elastic net (AE) regularized filters. The performance of the two intensity-based methods (I-Softmax, I-SVM) is not affected by decolorization since they do not use color information. However, by discarding color information, these methods achieve lower performance in the earlier levels than all the other methods. It can be seen that Sem-AE shows high and steady accuracy across decolorization levels, while the accuracy of all other RGB-based methods degrades as decolorization becomes severe. The steady performance indicates that the representation space spanned by the Sem-AE filters is robust to color distortions, because the edge-based filter responses are invariant to color degradations.

Figure 5: Accuracy of traffic sign recognition for different levels of decolorization.
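As a sketch of the application-dependent post-processing for recognition described above, the snippet below trains a plain softmax classifier with batch gradient descent on feature rows that stand in for the edge-filter features; the synthetic data, learning rate, and epoch count are illustrative assumptions, not our training setup.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def add_bias(F):
    return np.hstack([F, np.ones((F.shape[0], 1))])

def train_softmax(F, y, n_classes, lr=0.5, epochs=500):
    """Batch gradient descent for a softmax classifier on feature rows F."""
    Fb = add_bias(F)
    W = np.zeros((Fb.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                      # one-hot labels
    for _ in range(epochs):
        P = softmax(Fb @ W)
        W -= lr * Fb.T @ (P - Y) / Fb.shape[0]    # cross-entropy gradient step
    return W

# Toy usage: two separable clusters standing in for edge-filter feature vectors.
rng = np.random.default_rng(0)
F = np.vstack([rng.normal(0.0, 1.0, (50, 16)), rng.normal(3.0, 1.0, (50, 16))])
y = np.array([0] * 50 + [1] * 50)
W = train_softmax(F, y, n_classes=2)
pred = softmax(add_bias(F) @ W).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```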

5.2 Image Quality Assessment

For the task of Image Quality Assessment (IQA), we use the TID 2013 database to evaluate the performance of Sem-AE. In particular, we analyze the categories of color saturation, color quantization with dither, and chromatic aberrations. We choose these categories to specifically illustrate the worth of learning color and edge concepts for IQA. Each category has five levels of progressive color-based distortions. Previous studies [25, 26, 27, 28, 29] in this field, including our own [11, 12], concentrated on and reported results over all the distortion levels together. In this section, we report results individually for each level. Ideally, a good quality estimator is correlated with subjective scores across varying distortion levels.

The Sem-AE scores are obtained for both reference and distorted images, with a higher weight on the edge filter responses than on the color filter responses. The higher weighting is given to edges because the human visual system is more sensitive to the sharpness of images, and distortion in the edge components of images causes higher visual discomfort [30]. Spearman correlation is calculated between the weighted responses of the distorted and reference images to obtain the objective scores. In Table 1, we compare Sem-AE with other commonly compared metrics. Pearson correlation coefficients (PCCs) and Spearman correlation coefficients (SCCs), which measure the linearity and the monotonic behavior between subjective and objective image quality scores, are used to validate the objective metrics. We observe that Sem-AE follows the subjective scores more closely than the other metrics. Not only does it exhibit the highest correlation in levels 1 through 4 in both PCC and SCC, it also maintains a steady performance across levels. This is in contrast to the other methods, which show low correlations in the earlier levels. Note that both FSIMc and PerSIM take color characteristics into consideration. The tested quality estimators exhibit uneven correlations across distortion levels, which should be addressed to obtain robust methods.

Metric Pearson Correlation Coefficient
Lv. 1 Lv. 2 Lv. 3 Lv. 4 Lv. 5
PSNR-HMA 0.643 0.626 0.280 0.046 0.486
MS-SSIM 0.248 0.143 0.302 0.525 0.744
SR-SIM 0.370 0.260 0.301 0.497 0.732
FSIMc 0.391 0.253 0.303 0.553 0.778
PerSIM 0.126 0.085 0.304 0.554 0.804
AE 0.716 0.725 0.765 0.775 0.577
AE(L1) 0.557 0.406 0.542 0.682 0.619
AE(L2) 0.079 0.004 0.275 0.454 0.568
Sem-AE 0.772 0.795 0.801 0.816 0.730
Metric Spearman Correlation Coefficient
Lv. 1 Lv. 2 Lv. 3 Lv. 4 Lv. 5
PSNR-HMA 0.505 0.475 0.140 0.229 0.732
MS-SSIM 0.471 0.345 0.111 0.224 0.691
SR-SIM 0.505 0.401 0.098 0.234 0.732
FSIMc 0.432 0.347 0.013 0.395 0.793
PerSIM 0.306 0.160 0.143 0.479 0.825
AE 0.648 0.764 0.795 0.786 0.389
AE(L1) 0.451 0.378 0.541 0.660 0.480
AE(L2) 0.084 0.120 0.188 0.381 0.543
Sem-AE 0.725 0.815 0.802 0.797 0.615
Table 1: Pearson correlation coefficients and Spearman correlation coefficients for different color distortion levels.
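The snippet below is one possible sketch of the scoring procedure described in Section 5.2: the Sem-AE responses of the reference and distorted images are weighted per filter group and compared with a Spearman correlation. The group weights, shapes, and helper names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def sem_ae_quality_score(h_ref, h_dist, edge_idx, color_idx, w_edge=1.0, w_color=0.5):
    """Objective quality score as the Spearman correlation between the weighted
    Sem-AE responses of the reference and the distorted image.

    h_ref, h_dist : (K, m) hidden responses of reference / distorted patches.
    """
    w = np.ones(h_ref.shape[0])
    w[edge_idx], w[color_idx] = w_edge, w_color   # edges weighted more heavily
    rho, _ = spearmanr((w[:, None] * h_ref).ravel(),
                       (w[:, None] * h_dist).ravel())
    return rho

# Toy usage: a mildly perturbed copy of the reference responses should score
# higher than a heavily perturbed one.
rng = np.random.default_rng(0)
h_ref = rng.random((64, 200))
edge_idx, color_idx = np.arange(0, 40), np.arange(40, 64)
mild = h_ref + 0.05 * rng.standard_normal(h_ref.shape)
heavy = h_ref + 1.00 * rng.standard_normal(h_ref.shape)
print(sem_ae_quality_score(h_ref, mild, edge_idx, color_idx),
      sem_ae_quality_score(h_ref, heavy, edge_idx, color_idx))
```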

6 Conclusion

In this paper, we analyzed existing regularization techniques to obtain semantic filter sets using an unsupervised learning technique. The semantic filters which represent characteristic visual concepts were learned jointly from natural images. While the learned semantic filters succeeded in their primary task of reconstructing an input image, their worth was illustrated in two applications that employed their color and structural groupings. This work provides a promising step towards defining perceptual visual concepts that can be used to learn, interpret, and leverage deep learning models.

References

  • [1] F. Porikli, S. Shan, C. Snoek, R. Sukthankar, and X. Wang, “Deep learning for visual understanding: Part 2 [from the guest editors],” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 17–19, Jan 2018.
  • [2] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, W. H. Freeman and Company, San Francisco, 1982.
  • [3] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999, vol. 2, pp. 1150–1157.
  • [4] D. Temel and G. AlRegib, “ReSIFT: Reliability-weighted sift-based image quality assessment,” in Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, 2016, pp. 2047–2051.
  • [5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [7] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014, pp. 818–833.
  • [8] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [9] C. Olah, A. Mordvintsev, and L. Schubert, “Feature visualization,” Distill, 2017, https://distill.pub/2017/feature-visualization.
  • [10] Q. Yang, “Semantic filtering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4517–4526.
  • [11] D. Temel, M. Prabhushankar, and G. AlRegib, “UNIQUE: Unsupervised image quality estimation,” IEEE signal processing letters, vol. 23, no. 10, pp. 1414–1418, 2016.
  • [12] M. Prabhushankar, D. Temel, and G. AlRegib, “Ms-unique: Multi-model and sharpness-weighted unsupervised image quality estimation,” Electronic Imaging, vol. 2017, no. 12, pp. 30–35, 2017.
  • [13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
  • [14] F. Bauer, S. Pereverzev, and L. Rosasco, “On regularization algorithms in learning theory,” Journal of complexity, vol. 23, no. 1, pp. 52–72, 2007.
  • [15] Y. Yao, L. Rosasco, and A. Caponnetto, “On early stopping in gradient descent learning,” Constructive Approximation, vol. 26, no. 2, pp. 289–315, 2007.
  • [16] A. J. Bell and T. J. Sejnowski, “The “independent components” of natural scenes are edge filters,” Vision Research, vol. 37, no. 23, pp. 3327 – 3338, 1997.
  • [17] A. N. Tikhonov, A. Goncharsky, V. V. Stepanov, and A. G Yagola, Numerical methods for the solution of ill-posed problems, vol. 328, Springer Science & Business Media, 2013.
  • [18] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
  • [19] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
  • [20] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, and Caroline S., “Ufldl tutorial,” 2012.
  • [21] A. Majumdar and R. K. Ward, “Classification via group sparsity promoting regularization,” in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on. IEEE, 2009, pp. 861–864.
  • [22] A. Ng, “Sparse autoencoder,” CS294A Lecture notes, vol. 72, no. 2011, pp. 1–19, 2011.
  • [23] D. Temel, G. Kwon*, M. Prabhushankar*, and G. AlRegib, “CURE-TSR: Challenging unreal and real environments for traffic sign recognition,” in Neural Information Processing Systems (NIPS) Workshop on Machine Learning for Intelligent Transportation Systems, Long Beach, U.S., December 2017, (*: equal contribution).
  • [24] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, et al., “Image database tid2013: Peculiarities, results and perspectives,” Signal Processing: Image Communication, vol. 30, pp. 57–77, 2015.
  • [25] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, and M. Carli, “Modified image visual quality metrics for contrast change and mean shift accounting,” in 2011 11th International Conference The Experience of Designing and Application of CAD Systems in Microelectronics (CADSM), Feb 2011, pp. 305–311.
  • [26] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Nov 2003, vol. 2, pp. 1398–1402.
  • [27] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “Fsim: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, Aug 2011.
  • [28] L. Zhang and H. Li, “Sr-sim: A fast and high performance iqa index based on spectral residual,” in 2012 19th IEEE International Conference on Image Processing, Sept 2012, pp. 1473–1476.
  • [29] D. Temel and G. AlRegib, “PerSIM: Multi-resolution image quality assessment in the perceptually uniform color domain,” in 2015 IEEE International Conference on Image Processing (ICIP), Sept 2015, pp. 1682–1686.
  • [30] R. Hassen, Z. Wang, and M. Salama, “No-reference image sharpness assessment based on local phase coherence measurement,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010, pp. 2434–2437.