Deep Image Structure and Texture Similarity (DISTS) Metric
Objective measures of image quality generally operate by making local comparisons of pixels of a "degraded" image to those of the original. Relative to human observers, these measures are overly sensitive to resampling of texture regions (e.g., replacing one patch of grass with another). Here we develop the first full-reference image quality model with explicit tolerance to texture resampling. Using a convolutional neural network, we construct an injective and differentiable function that transforms images to a multi-scale overcomplete representation. We empirically show that the spatial averages of the feature maps in this representation capture texture appearance, in that they provide a set of sufficient statistical constraints to synthesize a wide variety of texture patterns. We then describe an image quality method that combines correlation of these spatial averages ("texture similarity") with correlation of the feature maps ("structure similarity"). The parameters of the proposed measure are jointly optimized to match human ratings of image quality, while minimizing the reported distances between subimages cropped from the same texture images. Experiments show that the optimized method explains human perceptual scores, both on conventional image quality databases and on texture databases. The measure also offers competitive performance on related tasks such as texture classification and retrieval. Finally, we show that our method is relatively insensitive to geometric transformations (e.g., translation and dilation), without use of any specialized training or data augmentation. Code is available at https://github.com/dingkeyan93/DISTS.
Pioneering work on full-reference IQA dates back to the 1970s, when Mannos and Sakrison investigated a class of visual fidelity measures in the context of rate-distortion optimization. A number of alternative measures were subsequently proposed [19, 20], trying to mimic certain functionalities of the HVS and penalize the errors between the reference and distorted images "perceptually". However, the HVS is a complex and highly nonlinear system, and most IQA measures within the error-visibility framework rely on strong assumptions and simplifications (e.g., linear or quasi-linear models of early vision characterized by restricted visual stimuli), leading to a number of problems regarding the definition of visual quality, the quantification of suprathreshold distortions, and generalization to natural images. The SSIM index introduced the concept of comparing structural similarity (instead of measuring error visibility), opening the door to a new class of full-reference IQA measures [16, 23, 24, 25]. Other design methodologies for knowledge-driven IQA include information-theoretic criteria and perception-based pooling. Recently, there has been a surge of interest in leveraging advances in large-scale optimization to develop data-driven IQA measures [17, 6, 27, 7]. However, databases of human quality scores are often insufficiently rich to constrain the large number of model parameters. As a result, the learned methods are at risk of over-fitting.
Nearly all knowledge-driven full-reference IQA models base their quality measurements on point-by-point comparisons between pixels or convolution responses (e.g., wavelets). As such, they are not capable of handling "visual textures", which are loosely defined as spatially homogeneous regions with repeated elements, often subject to some randomization in their location, size, color, and orientation. Images of the same texture can look nearly the same to the human eye, while differing substantially at the level of pixel intensities. Research on visual texture has a long history, and can be partitioned into four problems: texture classification, texture segmentation, texture synthesis, and shape from texture. At the core of texture analysis is an efficient description (i.e., representation) that matches human perception of visual texture. In this paper, we aim to measure the perceptual similarity of texture, a goal first elucidated and explored in [29, 30].
The response amplitudes and variances of computational texture features (e.g., Gabor basis functions, local binary patterns) have achieved good performance in texture classification, but do not correlate well with human perception when used as texture similarity measures [29, 30]. Texture representations that incorporate more sophisticated statistical features, such as correlations of complex wavelet coefficients, have shown significantly more power for texture synthesis, suggesting that they may provide a good substrate for similarity measures. In recent years, the use of such statistics within CNN-based representations [14, 33, 34] has led to even more powerful texture representations.
Our goal is to develop a new full-reference IQA model that combines sensitivity to structural distortions (e.g., artifacts due to noise, blur, or compression) with tolerance to texture resampling (exchanging a texture region with a new sample that differs substantially in pixel values but looks essentially identical). As is common in many IQA methods, we first transform the reference and distorted images to a new representation, using a CNN. Within this representation, we develop a set of measurements that are sufficient to capture the appearance of a variety of different visual textures, while exhibiting a high degree of tolerance to resampling. Finally, we combine these texture parameters with global structural measures to form an IQA measure.
Our model is built on an initial transformation, $f$, that maps the reference image $x$ and the distorted image $y$ to "perceptual" representations $\tilde{x}$ and $\tilde{y}$, respectively. The primary motivation is that perceptual distances are non-uniform in the pixel space [35, 36], and this is the main reason that MSE is inadequate as a perceptual IQA model. The function $f$ should endeavor to map the pixel space to another space that is more perceptually uniform. Previous IQA methods have used filter banks for local frequency representation to capture the frequency-dependence of error visibility [19, 4]. Others have used transformations that mimic the early visual system [20, 37, 38, 39]. More recently, deep CNNs have shown surprising power in representing perceptual image distortions [6, 27, 7]. In particular, Zhang et al. have demonstrated that pre-trained deep features from VGG have "reasonable" effectiveness in measuring perceptual quality. Our transformation is likewise based on VGG, pre-trained for object recognition on the ImageNet database. The VGG transformation is constructed from a feedforward cascade of layers, each including spatial convolution, halfwave rectification, and downsampling. All operations are continuous and differentiable, both advantageous for an IQA method that is to be used in optimizing image processing systems.

We modified the VGG architecture to achieve two additional desired mathematical properties. First, in order to provide a good substrate for the invariances needed for texture resampling, we wanted the initial transformation to be translation-invariant. The "max pooling" operation of the original VGG architecture has been shown to disrupt translation-invariance, and leads to visible aliasing artifacts when used to interpolate between images with geodesic sequences. To avoid aliasing when subsampling by a factor of two, the Nyquist theorem requires blurring with a filter whose cutoff frequency is below $\pi/2$ radians/sample. Following this principle, we replace all max pooling layers in VGG with weighted $\ell_2$ pooling:

$$P(x) = \sqrt{g * (x \odot x)},$$

where $\odot$ denotes pointwise product, and the blurring kernel $g$ is implemented by a Hanning window that approximately enforces the Nyquist criterion. As additional motivation, we note that $\ell_2$ pooling has been used to describe the behavior of complex cells in primary visual cortex, and is also closely related to the complex modulus used in the scattering transform.
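As a rough numpy sketch of this replacement (the kernel size, stride, and normalization here are illustrative assumptions, not the exact choices of the released implementation):

```python
import numpy as np

def hanning_kernel(size=3):
    """2-D blurring kernel from an outer product of 1-D Hanning windows,
    normalized to sum to one."""
    w = np.hanning(size + 2)[1:-1]          # drop the zero endpoints
    k = np.outer(w, w)
    return k / k.sum()

def l2_pool(x, stride=2, ksize=3):
    """Weighted l2 pooling: blur the squared signal with the Hanning
    kernel, take the square root, then subsample by `stride`."""
    g = hanning_kernel(ksize)
    sq = x ** 2
    pad = ksize // 2
    padded = np.pad(sq, pad, mode="reflect")
    out = np.zeros(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + ksize, j:j + ksize] * g)
    return np.sqrt(out)[::stride, ::stride]
```

Because the blur precedes subsampling, a constant signal passes through unchanged, and small translations of the input produce only small changes in the output, unlike max pooling.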
A second desired property of our transformation is that it should be injective: distinct inputs should map to distinct outputs. This is necessary to ensure that the final quality measure can be converted into a proper metric (in the mathematical sense): if the representation of an image is non-unique, then equality of the output representations will not imply equality of the input images. This property has proven useful in perceptual optimization. Earlier IQA measures such as MSE and SSIM relied on an injective transformation (in fact, the identity mapping), but many more recent methods do not. For example, the mapping function in GMSD extracts image gradients, discarding local luminance information that is essential to human perception of image quality. Similarly, GTI-CNN uses a surjective CNN to construct the transformation, in an attempt to achieve invariance to mild geometric transformations.
Considerable effort has been made in developing invertible CNN-based transformations in the context of density modeling [47, 48, 49, 50]. These methods place strict constraints on either network architectures [47, 49] or network parameters, which limit their expressiveness in learning quality-relevant representations (as empirically verified in our experiments). Ma et al. proved injectivity for CNNs with random Gaussian weights, provided the network is sufficiently expansive (i.e., the output dimension of each layer should increase by at least a logarithmic factor). Although mathematically appealing, this result does not constrain parameter settings of CNNs of more than two layers. In addition, a Gaussian-weighted CNN is less likely to be perceptually relevant [14, 17].
Like most CNNs, VGG discards information at each stage of transformation. Given the difficulty of constraining the parameters of VGG to ensure an injective mapping, we use a far simpler modification, incorporating the input image as an additional feature map (the "zeroth" layer of the network). The representation consists of the reference image itself, concatenated with the convolution responses of five VGG stages (conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3):

$$f(x) = \left\{\tilde{x}^{(i)}_j;\; i = 0, \ldots, m;\; j = 1, \ldots, n_i\right\},$$

where $m$ denotes the number of convolution stages chosen to construct $f$, $n_i$ is the number of feature maps in the $i$-th convolution stage, and $n_0 = 1$. Similarly, we compute the representation of the distorted image: $f(y) = \{\tilde{y}^{(i)}_j;\; i = 0, \ldots, m;\; j = 1, \ldots, n_i\}$.
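To illustrate the construction, here is a minimal sketch in which tiny random filters stand in for the pre-trained VGG stages (the stage widths and filter values are arbitrary, not the VGG weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_stage(maps, n_out, ksize=3):
    """One toy 'convolution stage': random filters plus halfwave
    rectification (a stand-in for a pre-trained VGG stage)."""
    h, w = maps[0].shape
    pad = ksize // 2
    out = []
    for _ in range(n_out):
        acc = np.zeros((h, w))
        for m in maps:
            f = rng.standard_normal((ksize, ksize))
            p = np.pad(m, pad, mode="reflect")
            for i in range(h):
                for j in range(w):
                    acc[i, j] += np.sum(p[i:i + ksize, j:j + ksize] * f)
        out.append(np.maximum(acc, 0))      # ReLU / halfwave rectification
    return out

def representation(x, stage_sizes=(2, 3)):
    """f(x): the input image itself (zeroth stage) concatenated with the
    feature maps of successive stages; keeping the input guarantees the
    overall mapping is injective."""
    stages = [[x]]
    for n in stage_sizes:
        stages.append(conv_stage(stages[-1], n))
    return stages
```

The key point is structural: whatever the later stages discard, the zeroth stage retains the image exactly, so two distinct inputs can never share a representation.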
Fig. 2 demonstrates the injective property of the resulting transformation, in comparison to GMSD and GTI-CNN. For each IQA measure $D$, we attempt to recover an original image $x$ by solving the optimization problem

$$\hat{x} = \arg\min_{y} D(x, y)$$

with gradient descent. Whether initialized from white noise or from a noise-corrupted copy of the original image, both GMSD and GTI-CNN fail at this simple task.
The visual appearance of texture is often characterized in terms of sets of local statistics that are presumably measured by the HVS. Models consisting of various sets of features have been tested using synthesis [13, 52, 11, 14]: one generates an image with statistics that match those of a texture photograph. If the set of statistical measurements is a complete description of the appearance of the texture, then the synthesized image should be perceptually indistinguishable from the original, at least based on preattentive judgements.
Portilla & Simoncelli found that the local correlations (and other pairwise statistics) of complex wavelet responses were sufficient to generate reasonable facsimiles of a wide variety of visual textures. Gatys et al. used correlations across channels of several layers in a VGG network, and were able to synthesize consistently better textures, albeit with a much larger set of statistics. Since this number is typically larger than the number of pixels in the input image, it is likely that the input image is unique in matching these statistics, and any diversity in the synthesis results may reflect local optima of the optimization procedure. Ustyuzhaninov et al. provide direct evidence of this hypothesis: if the number of statistical measurements is sufficiently large (on the order of millions), a single-layer CNN with random filters can produce textures that are visually indiscernible from the original. Subsequent results suggest that matching only the mean and variance of CNN channels is sufficient for texture classification or style transfer [55, 56, 57].
In our experiments, we found that measuring only the spatial means of the feature maps provides an effective parametric model for visual texture. Specifically, we used this model to synthesize textures by solving

$$\hat{y} = \arg\min_{y} \sum_{i,j} \left(\mu_{\tilde{x}^{(i)}_j} - \mu_{\tilde{y}^{(i)}_j}\right)^2,$$

where $x$ is the target texture image, $\hat{y}$ is the synthesized texture image, obtained by gradient descent optimization from a random initialization, and $\mu_{\tilde{x}^{(i)}_j}$ and $\mu_{\tilde{y}^{(i)}_j}$ are the global means of channels $\tilde{x}^{(i)}_j$ and $\tilde{y}^{(i)}_j$, respectively. Fig. 3 shows the synthesis results of our texture model using statistical constraints from individual and combined convolution stages of the pre-trained VGG. We find that measurements from early stages appear to capture basic intensity and color information, and those from later stages summarize shape and structure information. By matching statistics across the full set of stages, the synthesized texture appears visually similar to the reference.
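The same synthesis-by-matching idea can be demonstrated with a toy statistic set standing in for the VGG channel means (the statistics, step size, and finite-difference gradients below are illustrative choices only):

```python
import numpy as np

def texture_stats(img):
    """Toy stand-in for the spatial means of CNN feature maps: mean,
    mean energy, and mean horizontal/vertical gradient energy."""
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return np.array([img.mean(), (img ** 2).mean(),
                     (dx ** 2).mean(), (dy ** 2).mean()])

def synthesize(target, steps=400, lr=0.5, seed=1):
    """Gradient descent on the squared difference of the statistics,
    starting from random noise (finite differences for clarity)."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(size=target.shape)
    t = texture_stats(target)
    eps = 1e-4
    for _ in range(steps):
        base = np.sum((texture_stats(y) - t) ** 2)
        grad = np.zeros_like(y)
        for idx in np.ndindex(y.shape):
            y[idx] += eps
            grad[idx] = (np.sum((texture_stats(y) - t) ** 2) - base) / eps
            y[idx] -= eps
        y -= lr * grad
    return y
```

Because the constraint set is far smaller than the number of pixels, many distinct images satisfy it; that underdetermination is exactly what gives the model its tolerance to resampling.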
Next, we need to specify the quality measurements based on $\tilde{x}$ and $\tilde{y}$. Fig. 5 visualizes some feature maps of the six stages of the reference image "Buildings". As can be seen, spatial structures are present at all stages, indicating strong statistical dependencies between neighbouring coefficients. Therefore, use of an $\ell_p$-norm, which assumes statistical independence of errors at different locations, is not appropriate. Inspired by the form of SSIM, we define separate quality measurements for the texture (using the global means) and the structure (using the global correlation) of each pair of corresponding feature maps:

$$l(\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j) = \frac{2 \mu_{\tilde{x}^{(i)}_j} \mu_{\tilde{y}^{(i)}_j} + c_1}{\left(\mu_{\tilde{x}^{(i)}_j}\right)^2 + \left(\mu_{\tilde{y}^{(i)}_j}\right)^2 + c_1}, \quad (5)$$

$$s(\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j) = \frac{2 \sigma_{\tilde{x}^{(i)}_j \tilde{y}^{(i)}_j} + c_2}{\left(\sigma_{\tilde{x}^{(i)}_j}\right)^2 + \left(\sigma_{\tilde{y}^{(i)}_j}\right)^2 + c_2}, \quad (6)$$

where $\mu_{\tilde{x}^{(i)}_j}$, $\mu_{\tilde{y}^{(i)}_j}$, $(\sigma_{\tilde{x}^{(i)}_j})^2$, and $(\sigma_{\tilde{y}^{(i)}_j})^2$ represent the global means and variances of $\tilde{x}^{(i)}_j$ and $\tilde{y}^{(i)}_j$, and $\sigma_{\tilde{x}^{(i)}_j \tilde{y}^{(i)}_j}$ the global covariance between them. Two small positive constants, $c_1$ and $c_2$, are included to avoid numerical instability when the denominators are close to zero. The normalization mechanisms in Eq. (5) and Eq. (6) serve to equalize the magnitudes of feature maps at different stages.
Finally, the proposed DISTS model is a weighted sum of the global quality measurements at different convolution stages:

$$D(x, y; \alpha, \beta) = 1 - \sum_{i=0}^{m} \sum_{j=1}^{n_i} \left( \alpha_{ij}\, l(\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j) + \beta_{ij}\, s(\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j) \right), \quad (7)$$

where $\{\alpha_{ij}, \beta_{ij}\}$ are positive learnable weights, satisfying $\sum_{i,j} (\alpha_{ij} + \beta_{ij}) = 1$. Note that the convolution kernels are fixed throughout the development of the method. Fig. 6 shows the full computation diagram of our quality assessment system.
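A numpy sketch of these texture/structure measurements and their weighted combination (the constants and the uniform weights below are placeholders, not the learned values):

```python
import numpy as np

C1, C2 = 1e-6, 1e-6   # small stabilizing constants (placeholder values)

def texture_sim(a, b):
    """Mean (texture) comparison of two feature maps, in the style of Eq. (5)."""
    mu_a, mu_b = a.mean(), b.mean()
    return (2 * mu_a * mu_b + C1) / (mu_a ** 2 + mu_b ** 2 + C1)

def structure_sim(a, b):
    """Normalized covariance (structure) comparison, in the style of Eq. (6)."""
    var_a, var_b = a.var(), b.var()
    cov = ((a - a.mean()) * (b - b.mean())).mean()
    return (2 * cov + C2) / (var_a + var_b + C2)

def dists(feats_x, feats_y, alpha, beta):
    """One minus the weighted sum over all feature-map pairs, as in Eq. (7);
    the weights are assumed normalized to sum to one."""
    score = 0.0
    for a, b, al, be in zip(feats_x, feats_y, alpha, beta):
        score += al * texture_sim(a, b) + be * structure_sim(a, b)
    return 1.0 - score
```

With normalized weights, identical representations yield both similarity terms equal to one, so the distance is exactly zero.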
For $\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j \geq 0$ (as is the case for responses after a ReLU nonlinearity), it can be shown that $\sqrt{D(x, y)}$ is a valid metric, satisfying

triangle inequality: $\sqrt{D(x, z)} \leq \sqrt{D(x, y)} + \sqrt{D(y, z)}$;

identity of indiscernibles (i.e., unique minimum): $\sqrt{D(x, y)} = 0 \Leftrightarrow x = y$.

The non-negativity and symmetry properties are immediately apparent. The identity of indiscernibles is guaranteed by the injective mapping function and the use of SSIM-motivated quality measurements. It remains only to verify that $\sqrt{D}$ satisfies the triangle inequality. We first rewrite $D$ as a weighted sum of the dissimilarity terms $1 - l(\cdot, \cdot)$ and $1 - s(\cdot, \cdot)$. Brunet et al. have proved that $\sqrt{1 - l(\cdot, \cdot)}$ and $\sqrt{1 - s(\cdot, \cdot)}$ are metrics for nonnegative inputs. The triangle inequality for $\sqrt{D}$ then follows from the Cauchy–Schwarz inequality. ∎
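The triangle inequality for the mean (texture) term can also be spot-checked numerically, a sanity check rather than a substitute for the proof (the constant and map sizes are arbitrary):

```python
import numpy as np

C = 1e-6   # stabilizing constant (placeholder value)

def d_sqrt(a, b):
    """sqrt(1 - l) for the mean term on nonnegative maps, the quantity
    the proof shows to be a metric."""
    mu_a, mu_b = a.mean(), b.mean()
    l = (2 * mu_a * mu_b + C) / (mu_a ** 2 + mu_b ** 2 + C)
    return np.sqrt(max(1.0 - l, 0.0))

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    # nonnegative maps, mimicking post-ReLU responses
    x, y, z = (rng.uniform(size=(4, 4)) for _ in range(3))
    ok &= d_sqrt(x, z) <= d_sqrt(x, y) + d_sqrt(y, z) + 1e-12
```

No violation is found over random nonnegative inputs, consistent with the metric property.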
The perceptual weights $\{\alpha, \beta\}$ in Eq. (7) are jointly optimized for human perception of image quality and texture invariance. Specifically, for image quality, we minimize the absolute error between model predictions and human ratings:

$$\mathcal{L}_q(x, y) = \left| D(x, y; \alpha, \beta) - q(y) \right|,$$

where $q(y)$ denotes the normalized ground-truth quality score of $y$ collected from psychophysical experiments. We choose the large-scale IQA dataset KADID-10k as the training set, which contains reference images, each distorted by a broad range of distortion types at multiple levels. In addition, we explicitly enforce the model to be invariant to texture substitution in a data-driven fashion. We minimize the distance (measured by Eq. (7)) between two patches $y_1$ and $y_2$ sampled from the same texture image:

$$\mathcal{L}_t(y_1, y_2) = D(y_1, y_2; \alpha, \beta).$$

We select texture images from the describable textures dataset (DTD), which is organized into describable categories with an equal number of images per category. In practice, we randomly sample two minibatches, one from KADID-10k and one from DTD, and use a variant of stochastic gradient descent to adjust the parameters by minimizing

$$\ell = \mathcal{L}_q + \lambda \mathcal{L}_t,$$

where $\lambda$ governs the trade-off between the two terms.
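The combined objective can be sketched as follows (the distance values are stand-ins for Eq. (7) outputs, and the trade-off value is an assumption):

```python
LAMBDA = 1.0   # trade-off between the two terms (assumed value)

def quality_loss(d_pred, mos):
    """Absolute error between the model's distance and the normalized
    human quality score."""
    return abs(d_pred - mos)

def invariance_loss(d_texture_pair):
    """The distance reported between two crops of the same texture,
    which the optimization drives toward zero."""
    return d_texture_pair

def joint_loss(d_pred, mos, d_texture_pair, lam=LAMBDA):
    """Combined objective in the style of Eq. (16)."""
    return quality_loss(d_pred, mos) + lam * invariance_loss(d_texture_pair)
```

Only the weights $\{\alpha, \beta\}$ receive gradients from this loss; the VGG kernels stay frozen.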
[Table I: Quality prediction performance of each method on the LIVE, CSIQ, and TID2013 databases]
The proposed DISTS model has a close relationship to a number of existing IQA methods.
SSIM and its variants [2, 63, 23]: The multi-scale extension of SSIM incorporates variations in viewing conditions into IQA, and calibrates the cross-scale parameters via subjective testing on artificially synthesized images. Our model follows a similar approach, building on a multi-scale hierarchical representation and directly calibrating the cross-scale parameters (i.e., $\{\alpha_{ij}, \beta_{ij}\}$) using subject-rated natural images with various distortions. The extension of SSIM to the complex wavelet domain gains invariance to small geometric transformations by measuring relative phase patterns of the wavelet coefficients. As will become clear in Section 3.5, by optimizing for texture invariance, our method inherits insensitivity to mild geometric transformations. Nevertheless, DISTS does not offer a 2D map that indicates local quality variations across spatial locations, as the SSIM family does.
The adaptive linear system framework decomposes the distortion between two images into a linear combination of components adapted to local image structures, separating structural and non-structural distortions. It generalizes many IQA models, including MSE, space/frequency weighting [18, 65], transform domain masking, and the tangent distance. DISTS can be seen as an adaptive nonlinear system, where the structure comparison captures structural distortions, the mean intensity comparison measures non-structural distortions, and the basis functions are adapted to global image content.
Style and content losses based on the pre-trained VGG network have reignited the field of style transfer. Specifically, the style loss is built upon the correlations between convolution responses at the same stages (the Gram matrix), while the content loss is defined by the MSE between the two representations. The combined loss does not have the desired property of a unique minimum. By incorporating the input image as the zeroth-stage feature representation of VGG and making SSIM-inspired quality measurements, the square root of DISTS is a valid metric.
Image restoration losses in the era of deep learning are typically defined as a weighted sum of $\ell_p$-norm distances computed on the raw pixels and several stages of VGG feature maps, where the weights are manually tuned for the task at hand. Later stages of the VGG representation are often preferred so as to incorporate image semantics into low-level vision, encouraging perceptually meaningful details that are not necessarily aligned with the underlying image. This type of loss does not achieve the level of texture invariance we are looking for. Moreover, the weights of DISTS are jointly optimized for image quality and texture invariance, and can be used across multiple low-level vision tasks.
In this section, we first present the implementation details of the proposed DISTS. We then compare our method with a wide range of image similarity models in terms of quality prediction, texture similarity, texture classification/retrieval, and invariance to geometric transformations.
We fix the filter kernels of the pre-trained VGG, and learn the perceptual weights $\{\alpha, \beta\}$. The training is carried out by optimizing the objective function in Eq. (16) using Adam with a batch size of 32, reducing the learning rate after every 1K iterations. We train DISTS for 5K iterations, which takes approximately one hour on an NVIDIA GTX 2080 GPU. To ensure a unique minimum of our model, we project the weights of the zeroth stage onto a strictly positive interval after each gradient step. We choose a Hanning window to anti-alias the VGG representation. Both $c_1$ in Eq. (5) and $c_2$ in Eq. (6) are set to small positive constants. During training and testing, we rescale the input images such that the smaller dimension has a fixed size in pixels.
Trained on the entire KADID dataset, DISTS is tested on three other standard IQA databases, LIVE, CSIQ, and TID2013, to verify model generalizability. We use the Spearman rank correlation coefficient (SRCC), the Kendall rank correlation coefficient (KRCC), the Pearson linear correlation coefficient (PLCC), and the root mean square error (RMSE) as the evaluation criteria. Before computing PLCC and RMSE, we fit a monotonic four-parameter function to compensate for the prediction nonlinearity:

$$\hat{q}(d) = \frac{\eta_1 - \eta_2}{1 + \exp\left(-(d - \eta_3)/|\eta_4|\right)} + \eta_2,$$

where $\{\eta_k\}_{k=1}^{4}$ are parameters to be fitted. We compare DISTS against a set of full-reference IQA methods, including nine knowledge-driven models and three data-driven CNN-based models. The implementations of all methods are obtained from the respective authors, except for DeepIQA, which is retrained on KADID for a fair comparison. As LPIPS has different configurations, we choose the default one, LPIPS-VGG-lin.
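A common form of such a monotonic four-parameter logistic is easy to write down (the exact parameterization varies between papers; this one is an assumption):

```python
import numpy as np

def logistic4(d, eta1, eta2, eta3, eta4):
    """Monotonic four-parameter logistic mapping objective scores onto the
    subjective scale before computing PLCC/RMSE. With eta1 > eta2 the
    mapping is strictly increasing in d."""
    return (eta1 - eta2) / (1 + np.exp(-(d - eta3) / np.abs(eta4))) + eta2

# illustrative parameter values, not fitted to any database
d = np.linspace(0, 1, 50)
q = logistic4(d, 1.0, 0.0, 0.5, 0.1)
```

In practice the four parameters are fitted per method and per database by nonlinear least squares; only the fitted curve, not the raw scores, enters PLCC and RMSE.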
Results, reported in Table I, demonstrate that DISTS performs favorably in comparison to both classic methods (e.g., PSNR and SSIM) and CNN-based models (DeepIQA, PieAPP, and LPIPS). Overall, the best performances across the three databases and the comparison metrics are obtained by MAD, FSIM, and GMSD. It is worth noting that these three databases have been re-used for many years throughout algorithm design processes, and recent full-reference IQA methods tend to adapt themselves to these databases, deliberately or unintentionally, via extensive computational module selection, raising the risk of overfitting (see Fig. 2). Fig. 7 shows scatter plots of the predictions of representative IQA methods versus the raw (i.e., before the nonlinear mapping of Eq. (17)) subjective mean opinion scores (MOSs) on the TID2013 database. From the fitted curves, one can observe that DISTS is nearly linear in MOS.
We also tested DISTS on BAPPS, a large-scale and highly varied patch similarity dataset. BAPPS contains 1) traditional synthetic distortions, such as geometric and photometric manipulation, noise contamination, blurring, and compression, 2) CNN-based distortions, such as those from denoising autoencoders and image restoration tasks, and 3) distortions generated by real-world image processing systems. The human similarity judgments are obtained from a two-alternative forced choice (2AFC) experiment. From Table II, we find that DISTS (which was not trained on BAPPS, or any similar database) achieves performance comparable to LPIPS, which was trained on BAPPS. We conclude that DISTS predicts image quality well, and generalizes to challenging unseen distortions, such as those caused by real-world algorithms.
[Table II: 2AFC scores on BAPPS, for synthetic distortions, distortions produced by real-world algorithms, and overall]
We also tested the performance of DISTS on texture quality assessment. Since most knowledge-driven full-reference IQA models are not good at measuring texture similarity (see Fig. 1), we only include SSIM and FSIM for reference. We add CW-SSIM and three computational models specifically designed for texture similarity: STSIM, NPTSM, and IGSTQA. STSIM has several configurations, and we choose local STSIM-2, which is publicly available at https://github.com/andreydung/Steerable-filter.
We used the SynTEX synthesized texture quality assessment database, consisting of reference textures together with synthesized versions generated by five texture synthesis algorithms. Table III shows the SRCC and KRCC results, where we can see that texture similarity models generally perform better than IQA models. Among the texture similarity models, IGSTQA achieves relatively high performance, but is still inferior to DISTS. This indicates that the VGG-based global measurements of DISTS capture the essential features and attributes of visual textures.
To further investigate DISTS, moving from texture similarity to texture quality, we construct a texture quality database (TQD), which contains 10 texture images selected from Pixabay (https://pixabay.com/images/search/texture). For each texture image, we first add seven traditional synthetic distortions: additive white Gaussian noise, Gaussian blur, JPEG compression, JPEG2000 compression, pink noise, chromatic aberration, and image color quantization. For each distortion type, we randomly select one distortion level from a total of three levels, and apply it to each texture image. We then create four copies of each texture using different texture synthesis algorithms, including two classical ones (a parametric model and a non-parametric model), and two CNN-based ones [14, 73]. Last, to produce "high-quality" images, we randomly crop four subimages from the original texture. In total, TQD has 150 images (15 versions of each of the 10 textures). We gather human data from a group of subjects, who have general knowledge of image processing but are unaware of the detailed purpose of the study. The viewing distance is fixed. Each subject is shown all ten sets of images, one set at a time, and is asked to rank the images according to their perceptual similarity to the reference texture. Instead of simply averaging the human opinions, we use reciprocal rank fusion to obtain the final ranking:

$$r(y) = \sum_{j} \frac{1}{k + r_j(y)},$$

where $r_j(y)$ is the rank of image $y$ given by the $j$-th subject and $k$ is a constant to mitigate the impact of high rankings given by outlier subjects. Table III lists the SRCC and KRCC results, where we compute the correlations within each texture pattern and average them across textures. We find that nearly all existing models perform poorly on the new database, including those tailored to texture similarity. In contrast, DISTS outperforms these methods by a large margin. Fig. 8 shows a set of texture examples, where we notice that DISTS gives high rankings to resampled images and low rankings to images suffering from visible distortions. This verifies that our model is in close agreement with human perception of texture quality, and has great potential for use in other texture analysis problems, such as high-quality texture retrieval.
[Table III: SRCC and KRCC results on SynTEX and the proposed TQD]
We also applied DISTS to texture classification and retrieval. We used the grayscale and color Brodatz texture databases (denoted by GBT and CBT, respectively), which contain the same set of texture patterns in grayscale and in color. We extracted nine non-overlapping patches from each texture pattern. Fig. 9 shows representative texture samples from CBT.
The texture classification problem consists of assigning an unknown sample image to one of the known texture classes. For each texture, we randomly choose five patches for training, two for validation, and the remaining two for testing. A simple k-nearest neighbors (k-NN) classification algorithm is implemented, which allows us to incorporate and compare different similarity models as distance measures. The predicted label of a test image is determined by a majority vote over its k nearest neighbors in the training set, where the value of k is chosen using the validation set. We implement a baseline model, the bag-of-words of SIFT features with k-NN. The classification accuracy results are listed in Table IV, where we see that the baseline model beats most image similarity-based k-NN classifiers, except LPIPS (on CBT) and DISTS. This shows that our model is effective at discriminating textures that are visually different to the human eye.
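The k-NN scheme with a pluggable distance can be sketched as follows (Euclidean distance stands in here for a perceptual measure such as DISTS):

```python
import numpy as np
from collections import Counter

def knn_predict(query, train_feats, train_labels, dist_fn, k=3):
    """k-NN with a pluggable distance function: compute the distance from
    the query to every training sample, then majority-vote over the k
    nearest labels."""
    d = [dist_fn(query, t) for t in train_feats]
    nearest = np.argsort(d)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# stand-in distance; a perceptual measure would be dropped in instead
euclid = lambda a, b: float(np.linalg.norm(a - b))
```

Because the classifier only consumes pairwise distances, swapping in any full-reference similarity model requires no other change.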
The content-based texture retrieval problem consists of searching a large database for images that are visually indistinguishable from a query. In our experiment, for each texture, we set three patches as queries and aim to retrieve the remaining six patches. Specifically, the distances between each query and the remaining images in the dataset are computed and ranked, so as to retrieve the images with minimal distances. To evaluate the retrieval performance, we use mean average precision (mAP), defined by

$$\text{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{N_q} \sum_{k=1}^{N} P_q(k)\, \mathrm{rel}_q(k),$$

where $Q$ is the number of queries, $N_q$ is the number of similar images in the database for query $q$, $P_q(k)$ is the precision at cut-off $k$ in the ranked list, and $\mathrm{rel}_q(k)$ is an indicator function equal to one if the item at rank $k$ is a similar image and zero otherwise. As seen in Table IV, DISTS achieves the best performance on both CBT and GBT. The classification/retrieval errors are primarily due to textures with noticeable inhomogeneities (e.g., the middle patch in Fig. 9 (c)). In addition, the performance on GBT is slightly reduced compared with that on CBT, indicating the importance of color information in these tasks.
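The mAP computation can be sketched directly from the definition (queries are represented by their relevance-at-rank lists):

```python
def average_precision(ranked_relevant, n_relevant):
    """AP for one query: precision-at-k averaged over the ranks where a
    relevant item appears; `ranked_relevant` is the 0/1 relevance list in
    ranked order."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / n_relevant

def mean_average_precision(per_query):
    """mAP: average precision averaged over all Q queries."""
    return sum(average_precision(r, sum(r)) for r in per_query) / len(per_query)
```

For example, a query whose two relevant items land at ranks 2 and 4 contributes an AP of (1/2 + 2/4)/2 = 0.5.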
Classification and retrieval of texture patches resampled from the same images are relatively easy tasks. We therefore also tested DISTS on a more challenging large-scale texture database, the Amsterdam Library of Textures (ALOT), containing photographs of textured surfaces captured under different viewing angles and illumination conditions. Again, we adopt a naïve k-NN method using our model as the measure of distance, and test it on a random subset of the samples in the database. Without training on ALOT, DISTS achieves a reasonable classification accuracy, albeit lower than that achieved by a knowledge-driven method with hand-crafted features and support vector machines, or by a data-driven CNN-based method. The primary cause of errors when using DISTS in this task is that images of the same textured surface can appear quite different under different lighting or viewpoints, as seen in the example in Fig. 10. DISTS, which is designed to capture visual appearance only, could likely be improved for this task by fine-tuning the perceptual weights (along with the VGG network parameters) on a small subset of human-labeled ALOT images.
[Table IV: Texture classification accuracy and retrieval mAP on CBT and GBT]
Apart from texture resampling, most full-reference IQA measures fail dramatically when the original and distorted images are misregistered, either globally or locally. The underlying reason is again the reliance on the assumption of pixel alignment. Although pre-registration alleviates this issue in certain cases, it comes with substantial computational complexity, and does not work well in the presence of severe distortions. Here we investigate the degree of invariance of DISTS to geometric transformations that are imperceptible to our visual system.
As there are no subject-rated IQA databases designed for this specific purpose, we augment the LIVE database (LIVEAug) with geometric transformations. In real-world scenarios, an image would first undergo geometric transformations (e.g., camera movement) and then distortions (e.g., JPEG compression). We implement an equivalent but much simpler approach, directly applying the transformations to the original image. Specifically, we generate four augmented reference images using geometric transformations: 1) a small horizontal shift, 2) a slight clockwise rotation, 3) a mild dilation, and 4) their combination. This yields an augmented set of reference-distortion pairs in the LIVE database. Since the transformations are modest, the quality scores of the distorted images with respect to the modified reference images are assumed to be the same as with respect to the original reference image.
The SRCC results on the augmented LIVE database are shown in Table V. We find that data-driven CNN-based methods outperform traditional ones by a large margin. Note that even the simplest geometric transformation, translation, may hurt the performance of CNN-based methods, which indicates that this type of invariance does not come for free if CNNs ignore the Nyquist theorem when downsampling. Trained on data augmented with geometric transformations, GTI-CNN achieves desirable invariance at the cost of discarding perceptually important features (see Fig. 2). DISTS is seen to perform extremely well across all distortions and to exhibit a high degree of robustness to geometric transformations, which we believe arises from 1) replacing max pooling with $\ell_2$ pooling, 2) using global quality measurements, and 3) optimizing for invariance to texture resampling (see also Fig. 11).
[Table V: SRCC results on the augmented LIVE database. Columns: method; distortion type (JPEG2000, JPEG, Gaussian noise, Gaussian blur, fast fading); geometric transformation (translation, rotation, dilation, mixed); total.]
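SRCC (Spearman rank correlation coefficient) measures monotonic agreement between a metric's predictions and human ratings. A minimal sketch with made-up scores (not data from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical metric predictions and mean opinion scores (MOS) for 5 images.
predicted = np.array([0.12, 0.45, 0.30, 0.80, 0.55])
mos = np.array([85.0, 60.0, 70.0, 20.0, 50.0])

# DISTS is a distance, so lower values should correspond to higher quality;
# negating the predictions makes a good metric yield SRCC near +1 vs. MOS.
srcc, _ = spearmanr(-predicted, mos)
```

Because SRCC depends only on rank order, it requires no nonlinear mapping between metric values and MOS, which is why it is the standard summary statistic in Table V.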
We have presented a new full-reference IQA method, DISTS, which is the first of its kind with built-in invariance to texture resampling. Our model unifies structure and texture similarity, is robust to mild geometric distortions, and performs well in texture classification and retrieval.
DISTS is based on the pre-trained VGG network for object recognition. By computing the global means of convolution responses at each stage, we establish a universal parametric texture model similar to that of Portilla & Simoncelli. Despite this empirical success, it is important to open this “black box” and to understand 1) which texture features and attributes are captured by the pre-trained network, and how; and 2) the importance of cascaded convolution and subsampled pooling in summarizing useful texture information. It is also of interest to extend the current model to measure distortions locally, as is done in SSIM. In that case, the distance measure could be reformulated to select between the structure and texture measures as appropriate, rather than simply combining them linearly.
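The texture descriptor described above, the vector of global spatial means of the feature maps, can be sketched as follows (with random arrays standing in for VGG responses):

```python
import numpy as np

def texture_statistics(feature_maps):
    """Global spatial means of convolutional feature maps.

    feature_maps: list of arrays of shape (channels, height, width),
    one per network stage. The concatenated means play the role of the
    parametric texture descriptor; here the inputs are stand-ins for
    actual VGG responses.
    """
    return np.concatenate([f.mean(axis=(1, 2)) for f in feature_maps])

# Two crops of the same homogeneous texture should yield similar statistics,
# even though their pixels (and feature maps) differ point by point.
rng = np.random.default_rng(0)
texture = rng.normal(size=(3, 128, 128))
crop_a = texture[:, :64, :64]
crop_b = texture[:, 64:, 64:]
stats_a = texture_statistics([crop_a])
stats_b = texture_statistics([crop_b])
```

This is exactly the tolerance the paper exploits: resampled patches of one texture agree in their spatial averages, so a measure built on these statistics reports them as similar.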
The most direct use of IQA measures is performance assessment and comparison of image processing systems. But perhaps more importantly, they may be used to optimize image processing methods so as to improve the visual quality of their results. In this context, most existing IQA measures present major obstacles, because they lack mathematical properties that aid optimization (e.g., injectivity, differentiability, and convexity). In many cases they rely on surjective mappings, so minima are non-unique (see Fig. 2). Although DISTS enjoys several advantageous mathematical properties, it is still highly non-convex (with abundant saddle points and plateaus), and recovering an image from random noise using stochastic gradient descent (see Fig. 2) requires many more iterations than for SSIM. In practice, the larger the weight of the structure term at the zeroth stage (Eq. (6)), the faster the optimization converges; however, to reach a reasonable level of texture invariance, this learned weight must be kept relatively small, hindering optimization. We are currently analyzing DISTS in the context of perceptual optimization, with the aim of learning a more suitable set of perceptual weights by adding optimizability constraints. Initial results indicate that DISTS-based optimization of image processing applications, including denoising, deblurring, super-resolution, and compression, can lead to noticeable improvements in visual quality.
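For concreteness, the SSIM-like texture and structure terms that Eq. (6) combines can be sketched per feature map as follows. The scalar weights and stability constants here are illustrative stand-ins for the learned per-channel weights:

```python
import numpy as np

def dists_like_distance(fx, fy, alpha=0.5, beta=0.5, c1=1e-6, c2=1e-6):
    """SSIM-like combination of texture and structure similarity over one
    pair of feature maps of shape (channels, H, W).

    alpha/beta are scalars here; in the paper they are learned per channel
    and per stage, and summed across the multi-scale representation.
    """
    mu_x = fx.mean(axis=(1, 2))   # global means -> texture statistics
    mu_y = fy.mean(axis=(1, 2))
    var_x = fx.var(axis=(1, 2))
    var_y = fy.var(axis=(1, 2))
    cov = ((fx - mu_x[:, None, None]) *
           (fy - mu_y[:, None, None])).mean(axis=(1, 2))
    texture = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return 1 - np.mean(alpha * texture + beta * structure)
```

Both terms are bounded and differentiable in the feature maps, which is what makes gradient-based perceptual optimization possible; the trade-off discussed above is precisely how much mass alpha versus beta receives at each stage.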
IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
K. Popat and R. W. Picard, “Cluster-based probability model and its application to image and texture processing,” IEEE Transactions on Image Processing, vol. 6, no. 2, pp. 268–284, 1997.
Journal of Machine Learning Research, vol. 20, no. 184, pp. 1–25, 2019.
B. Vintch, J. A. Movshon, and E. P. Simoncelli, “A convolutional subunit model for neuronal responses in macaque V1,” Journal of Neuroscience, vol. 35, no. 44, pp. 14829–14841, 2015.
L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” in International Conference on Learning Representations, 2015, pp. 1–13.
International Joint Conferences on Artificial Intelligence, 2017, pp. 2230–2236.