Deep Image Structure and Texture Similarity (DISTS) Metric
Objective measures of image quality generally operate by making local comparisons of the pixels of a "degraded" image to those of the original. Relative to human observers, these measures are overly sensitive to resampling of texture regions (e.g., replacing one patch of grass with another). Here we develop the first full-reference image quality model with explicit tolerance to texture resampling. Using a convolutional neural network, we construct an injective and differentiable function that transforms images to a multi-scale overcomplete representation. We empirically show that the spatial averages of the feature maps in this representation capture texture appearance, in that they provide a set of sufficient statistical constraints to synthesize a wide variety of texture patterns. We then describe an image quality method that combines correlation of these spatial averages ("texture similarity") with correlation of the feature maps ("structure similarity"). The parameters of the proposed measure are jointly optimized to match human ratings of image quality, while minimizing the reported distances between sub-images cropped from the same texture images. Experiments show that the optimized method explains human perceptual scores on both conventional image quality databases and texture databases. The measure also offers competitive performance on related tasks such as texture classification and retrieval. Finally, we show that our method is relatively insensitive to geometric transformations (e.g., translation and dilation), without use of any specialized training or data augmentation. Code is available at https://github.com/dingkeyan93/DISTS.
Pioneering work on full-reference IQA dates back to the 1970s, when Mannos and Sakrison [18] investigated a class of visual fidelity measures in the context of rate-distortion optimization. A number of alternative measures were subsequently proposed [19, 20], trying to mimic certain functionalities of the HVS and penalize the errors between the reference and distorted images "perceptually". However, the HVS is a complex and highly nonlinear system [21], and most IQA measures within the error visibility framework rely on strong assumptions and simplifications (e.g., linear or quasi-linear models for early vision characterized by restricted visual stimuli), leading to a number of problems regarding the definition of visual quality, quantification of suprathreshold distortions, and generalization to natural images [22]. The SSIM index [2] introduced the concept of comparing structure similarity (instead of measuring error visibility), opening the door to a new class of full-reference IQA measures [16, 23, 24, 25]. Other design methodologies for knowledge-driven IQA include information-theoretic criteria [3] and perception-based pooling [26]. Recently, there has been a surge of interest in leveraging advances in large-scale optimization to develop data-driven IQA measures [17, 6, 27, 7]. However, databases of human quality scores are often insufficiently rich to constrain the large number of model parameters. As a result, the learned methods are at risk of overfitting [28].
Nearly all knowledge-driven full-reference IQA models base their quality measurements on point-by-point comparisons between pixels or convolution responses (e.g., wavelets). As such, they are not capable of handling "visual textures", which are loosely defined as spatially homogeneous regions with repeated elements, often subject to some randomization in their location, size, color and orientation [11]. Images of the same texture can look nearly the same to the human eye, while differing substantially at the level of pixel intensities. Research on visual texture has a long history, and can be partitioned into four problems: texture classification, texture segmentation, texture synthesis, and shape from texture. At the core of texture analysis is an efficient description (i.e., representation) that matches human perception of visual texture. In this paper, we aim to measure the perceptual similarity of texture, a goal first elucidated and explored in [29, 30].
The response amplitudes and variances of computational texture features (e.g., Gabor basis functions [31], local binary patterns [32]) have achieved good performance for texture classification, but do not correlate well with human perceptual observations as texture similarity measures [29, 30]. Texture representations that incorporate more sophisticated statistical features, such as correlations of complex wavelet coefficients [11], have shown significantly more power for texture synthesis, suggesting that they may provide a good substrate for similarity measures. In recent years, the use of such statistics within CNN-based representations [14, 33, 34] has led to even more powerful texture representations.

Our goal is to develop a new full-reference IQA model that combines sensitivity to structural distortions (e.g., artifacts due to noise, blur, or compression) with tolerance of texture resampling (exchanging a texture region with a new sample that differs substantially in pixel values but looks essentially identical). As is common in many IQA methods, we first transform the reference and distorted images to a new representation, using a CNN. Within this representation, we develop a set of measurements that are sufficient to capture the appearance of a variety of different visual textures, while exhibiting a high degree of tolerance to resampling. Finally, we combine these texture parameters with global structural measures to form an IQA measure.
Our model is built on an initial transformation, f, that maps the reference image x and the distorted image y to "perceptual" representations x̃ and ỹ, respectively. The primary motivation is that perceptual distances are non-uniform in the pixel space [35, 36], and this is the main reason that MSE is inadequate as a perceptual IQA model. The function f should endeavor to map the pixel space to another space that is more perceptually uniform. Previous IQA methods have used filter banks for local frequency representation to capture the frequency-dependence of error visibility [19, 4]. Others have used transformations that mimic the early visual system [20, 37, 38, 39]. More recently, deep CNNs have shown surprising power in representing perceptual image distortions [6, 27, 7]. In particular, Zhang et al. [6] have demonstrated that pre-trained deep features from VGG have "reasonable" effectiveness in measuring perceptual quality. As such, we chose to base our model on the VGG16 CNN [10], pre-trained for object recognition [40] on the ImageNet database [41]. The VGG transformation is constructed from a feedforward cascade of layers, each including spatial convolution, halfwave rectification, and downsampling. All operations are continuous and differentiable, both advantageous for an IQA method that is to be used in optimizing image processing systems.

We modified the VGG architecture to achieve two additional desired mathematical properties. First, in order to provide a good substrate for the invariances needed for texture resampling, we wanted the initial transformation to be translation-invariant. The "max pooling" operation of the original VGG architecture has been shown to disrupt translation-invariance [42], and leads to visible aliasing artifacts when used to interpolate between images with geodesic sequences [43]. To avoid aliasing when subsampling by a factor of two, the Nyquist theorem requires blurring with a filter whose cutoff frequency is below π/2 radians/sample [44]. Following this principle, we replace all max pooling layers in VGG with weighted ℓ2 pooling [43]:

P(x) = √( g ∗ (x ⊙ x) ),   (1)

where ⊙ denotes pointwise product, and the blurring kernel g is implemented by a Hanning window that approximately enforces the Nyquist criterion. As additional motivation, we note that ℓ2 pooling has been used to describe the behavior of complex cells in primary visual cortex [45], and is also closely related to the complex modulus used in the scattering transform [46].
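To make the pooling concrete, here is a minimal single-channel NumPy sketch of weighted ℓ2 pooling. The kernel size, reflect padding, and kernel normalization are our illustrative assumptions, not settings taken from the paper:

```python
import numpy as np

def hanning_kernel(size=5):
    """2D blurring kernel: L1-normalized outer product of 1D Hanning windows."""
    w = np.hanning(size + 2)[1:-1]          # drop the zero endpoints
    k = np.outer(w, w)
    return k / k.sum()

def l2_pool(x, kernel_size=5, stride=2):
    """Weighted l2 pooling of Eq. (1): sqrt(g * (x ⊙ x)) at stride-2 positions.

    x: a single 2D feature map (H, W); a toy stand-in for the pooling that
    replaces max pooling in the modified VGG.
    """
    g = hanning_kernel(kernel_size)
    h, w = x.shape
    pad = kernel_size // 2
    xp = np.pad(x * x, pad, mode="reflect")  # blur the squared map
    out_h, out_w = (h + stride - 1) // stride, (w + stride - 1) // stride
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[i * stride:i * stride + kernel_size,
                       j * stride:j * stride + kernel_size]
            out[i, j] = np.sqrt(np.sum(patch * g))
    return out
```

Because the kernel is normalized and the squared map is blurred before the square root, a constant input passes through unchanged, which is a quick sanity check on the implementation.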
A second desired property of our transformation is that it should be injective: distinct inputs should map to distinct outputs. This is necessary to ensure that the final quality measure can be turned into a proper metric (in the mathematical sense): if the representation of an image is non-unique, then equality of the output representations will not imply equality of the input images. This property has proven useful in perceptual optimization. Earlier IQA measures such as MSE and SSIM relied on an injective transformation (in fact, the identity mapping), but many more recent methods do not. For example, the mapping function in GMSD [25] extracts image gradients, discarding local luminance information that is essential to human perception of image quality. Similarly, GTI-CNN [17] uses a surjective CNN to construct the transformation, in an attempt to achieve invariance to mild geometric transformations.
Considerable effort has been made in developing invertible CNN-based transformations in the context of density modeling [47, 48, 49, 50]. These methods place strict constraints on either network architectures [47, 49] or network parameters [50], which limit the expressiveness in learning quality-relevant representations (as empirically verified in our experiments). Ma et al. [51] proved that under Gaussian-distributed random weights and ReLU nonlinearity, a two-layer CNN is injective provided that it is sufficiently expansive (i.e., the output dimension of each layer should increase by at least a logarithmic factor). Although mathematically appealing, this result does not constrain parameter settings of CNNs of more than two layers. In addition, a Gaussian-weighted CNN is less likely to be perceptually relevant [14, 17].

Like most CNNs, VGG discards information at each stage of transformation. Given the difficulty of constraining the parameters of VGG to ensure an injective mapping, we use a far simpler modification, incorporating the input image as an additional feature map (the "zeroth" layer of the network). The representation consists of the reference image x, concatenated with the convolution responses of five VGG layers (labelled conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3):

x̃ = { x̃_j^(i) ; i = 0, …, m ; j = 1, …, n_i },   (2)

where m denotes the number of convolution layers chosen to construct the representation, n_i is the number of feature maps in the i-th convolution layer, and x̃^(0) = x. Similarly, we also compute the representation of the distorted image:

ỹ = { ỹ_j^(i) ; i = 0, …, m ; j = 1, …, n_i }.   (3)
Fig. 2 demonstrates the injective property of the resulting transformation, in comparison to GMSD and GTI-CNN. For each IQA method D, we attempt to recover an original image x by solving the optimization problem

y* = arg min_y D(y, x)

with gradient descent. For initialization from white noise, or from a noise-corrupted copy of the original image, both GMSD and GTI-CNN fail on this simple task.
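The role of injectivity in this recovery experiment can be illustrated with a toy "network": an expansive random linear map is injective with probability one, so the input can be recovered from its representation by plain gradient descent. This is a hypothetical stand-in for the experiment of Fig. 2, not the actual VGG-based transform:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 64                      # expansive: output dimension > input dimension
W = rng.normal(size=(m, n))        # random weights; injective with probability one

def f(x):
    """Toy injective 'representation'."""
    return W @ x

x_orig = rng.normal(size=n)        # the "original image"
z = f(x_orig)                      # its representation

x = rng.normal(size=n)             # initialize from "white noise"
for _ in range(2000):
    x -= 1e-3 * 2 * W.T @ (f(x) - z)   # gradient step on ||f(x) - z||^2
```

A surjective map (e.g., one that discards coordinates) would admit many minimizers, which is the failure mode attributed above to GMSD and GTI-CNN.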
The visual appearance of texture is often characterized in terms of sets of local statistics [12] that are presumably measured by the HVS. Models consisting of various sets of features have been tested using synthesis [13, 52, 11, 14]: one generates an image with statistics that match those of a texture photograph. If the set of statistical measurements is a complete description of the appearance of the texture, then the synthesized image should be perceptually indistinguishable from the original [12], at least based on pre-attentive judgements [53].
Portilla & Simoncelli [11] found that the local correlations (and other pairwise statistics) of complex wavelet responses were sufficient to generate reasonable facsimiles of a wide variety of visual textures. Gatys et al. [14] used correlations across channels of several layers of a VGG network, and were able to synthesize consistently better textures, albeit with a much larger set of statistics. Since this set is typically larger than the number of pixels in the input image, it is likely that the input image is unique in matching these statistics, and any diversity in the synthesis results may reflect local optima of the optimization procedure. Ustyuzhaninov et al. [54] provide direct evidence of this hypothesis: if the number of statistical measurements is sufficiently large (on the order of millions), a single-layer CNN with random filters can produce textures that are visually indiscernible to the human eye. Subsequent results suggest that matching only the mean and variance of CNN channels is sufficient for texture classification or style transfer [55, 56, 57].
In our experiments, we found that measuring only the spatial means of the feature maps (one statistic per feature map) provides an effective parametric model for visual texture. Specifically, we used this model to synthesize textures [11] by solving

ŷ = arg min_y Σ_{i,j} ( μ_{x̃_j^(i)} − μ_{ỹ_j^(i)} )²,   (4)

where x is the target texture image, ŷ is the synthesized texture image, obtained by gradient descent optimization from a random initialization, and μ_{x̃_j^(i)} and μ_{ỹ_j^(i)} are the global means of feature maps x̃_j^(i) and ỹ_j^(i), respectively. Fig. 3 shows the synthesis results of our texture model using statistical constraints from individual and combined convolution layers of the pre-trained VGG. We find that measurements from early layers appear to capture basic intensity and color information, and those from later layers summarize the shape and structure information. By matching statistics up to the last layer (conv5_3), the synthesized texture appears visually similar to the reference.
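The mean-matching synthesis of Eq. (4) can be mimicked end to end with a tiny random filter bank standing in for VGG (circular convolutions, ReLU, and analytically computed gradients). Everything here, from the filter count to the step size, is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 32                                   # toy "image" size
K = 8                                        # number of random 3x3 filters
filters = rng.normal(size=(K, 3, 3))

padded = np.zeros((K, H, W))
padded[:, :3, :3] = filters
FK = np.fft.rfft2(padded)                    # filter bank in the Fourier domain

def features(img):
    """Spatial mean of each rectified circular-convolution response."""
    C = np.fft.irfft2(np.fft.rfft2(img)[None] * FK, s=(H, W))
    return np.maximum(C, 0.0).mean(axis=(1, 2)), C

target = rng.normal(size=(H, W))             # stand-in "texture photograph"
mu_t, _ = features(target)

y = rng.normal(size=(H, W))                  # random initialization
loss0 = np.sum((features(y)[0] - mu_t) ** 2)
for _ in range(500):
    mu_y, C = features(y)
    mask = (C > 0).astype(float)             # ReLU derivative
    # d mean(relu(conv(y, f_k))) / dy = correlate(mask_k, f_k) / (H * W)
    G = np.fft.irfft2(np.fft.rfft2(mask) * np.conj(FK), s=(H, W))
    y -= 0.5 * (2.0 / (H * W)) * ((mu_y - mu_t)[:, None, None] * G).sum(axis=0)
loss1 = np.sum((features(y)[0] - mu_t) ** 2)
```

Gradient descent drives the channel means of the synthesized image toward those of the target, which is exactly the constraint set Eq. (4) imposes.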
Next, we need to specify the quality measurements based on x̃ and ỹ. Fig. 5 visualizes some feature maps of the six stages of the reference image "Buildings". As can be seen, spatial structures are present at all stages, indicating strong statistical dependencies between neighbouring coefficients. Therefore, use of an ℓp norm, which assumes statistical independence of errors at different locations, is not appropriate. Inspired by the form of SSIM [2], we define separate quality measurements for the texture (using the global means) and the structure (using the global correlation) of each pair of corresponding feature maps:

l(x̃_j^(i), ỹ_j^(i)) = ( 2 μ_{x̃_j^(i)} μ_{ỹ_j^(i)} + c1 ) / ( (μ_{x̃_j^(i)})² + (μ_{ỹ_j^(i)})² + c1 ),   (5)

s(x̃_j^(i), ỹ_j^(i)) = ( 2 σ_{x̃_j^(i) ỹ_j^(i)} + c2 ) / ( (σ_{x̃_j^(i)})² + (σ_{ỹ_j^(i)})² + c2 ),   (6)

where μ_{x̃_j^(i)}, μ_{ỹ_j^(i)}, (σ_{x̃_j^(i)})², and (σ_{ỹ_j^(i)})² represent the global means and variances of x̃_j^(i) and ỹ_j^(i), respectively, and σ_{x̃_j^(i) ỹ_j^(i)} is the global covariance between x̃_j^(i) and ỹ_j^(i). Two small positive constants, c1 and c2, are included to avoid numerical instability when the denominators are close to zero. The normalization mechanisms in Eq. (5) and Eq. (6) serve to equalize the magnitudes of feature maps at different stages.
Finally, the proposed DISTS model is a weighted sum of the global quality measurements at different convolution layers:

DISTS(x, y; α, β) = 1 − Σ_{i=0}^{m} Σ_{j=1}^{n_i} ( α_{ij} l(x̃_j^(i), ỹ_j^(i)) + β_{ij} s(x̃_j^(i), ỹ_j^(i)) ),   (7)

where {α_{ij}, β_{ij}} are positive learnable weights, satisfying Σ_{i,j} (α_{ij} + β_{ij}) = 1. Note that the convolution kernels are fixed throughout the development of the method. Fig. 6 shows the full computation diagram of our quality assessment system.
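Eqs. (5)-(7) can be assembled directly. The sketch below treats each stage's feature maps as 2D arrays and uses illustrative values for the stabilizing constants, which are not specified above:

```python
import numpy as np

C1 = C2 = 1e-6  # illustrative stabilizers, not the paper's values

def texture_sim(a, b, c1=C1):
    """Eq. (5): similarity of the global means of two feature maps."""
    mu_a, mu_b = a.mean(), b.mean()
    return (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)

def structure_sim(a, b, c2=C2):
    """Eq. (6): similarity of the global (co)variances of two feature maps."""
    cov = ((a - a.mean()) * (b - b.mean())).mean()
    return (2 * cov + c2) / (a.var() + b.var() + c2)

def dists(feats_x, feats_y, alpha, beta):
    """Eq. (7): 1 minus the weighted sum of texture/structure similarities.

    feats_x, feats_y: flat lists of 2D feature maps (entry 0 being the image
    itself); alpha, beta: matching lists of nonnegative weights, total sum 1.
    """
    assert abs(sum(alpha) + sum(beta) - 1.0) < 1e-8
    return 1.0 - sum(
        a * texture_sim(fx, fy) + b * structure_sim(fx, fy)
        for fx, fy, a, b in zip(feats_x, feats_y, alpha, beta)
    )
```

With identical inputs both similarity terms equal one, so the score is zero; any mismatch in means or covariance raises it above zero.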
For α_{ij} = β_{ij} and nonnegative feature means (as is the case for responses after the ReLU nonlinearity), it can be shown that

D(x, y) = ( Σ_{i,j} α_{ij} [ (1 − l(x̃_j^(i), ỹ_j^(i))) + (1 − s(x̃_j^(i), ỹ_j^(i))) ] )^{1/2}   (8)

is a valid metric, satisfying

non-negativity: D(x, y) ≥ 0;
symmetry: D(x, y) = D(y, x);
triangle inequality: D(x, z) ≤ D(x, y) + D(y, z);
identity of indiscernibles (i.e., unique minimum): D(x, y) = 0 ⇔ x = y.

The non-negativity and symmetry properties are immediately apparent. The identity of indiscernibles is guaranteed by the injective mapping function and the use of the SSIM-motivated quality measurements. It remains only to verify that D satisfies the triangle inequality. We first rewrite D as

D(x, y) = ( Σ_{i,j} α_{ij} d_{ij}(x, y)² )^{1/2},   (9)

where

d_{ij}(x, y) = ( (1 − l(x̃_j^(i), ỹ_j^(i))) + (1 − s(x̃_j^(i), ỹ_j^(i))) )^{1/2}.   (10)

Brunet et al. [58] have proved that √(1 − l(·,·)) and √(1 − s(·,·)) are valid metrics on nonnegative signals; d_{ij}, the root of the sum of their squares, is therefore also a metric. Then,

D(x, z) = ( Σ_{i,j} α_{ij} d_{ij}(x, z)² )^{1/2}
        ≤ ( Σ_{i,j} α_{ij} ( d_{ij}(x, y) + d_{ij}(y, z) )² )^{1/2}   (11)
        ≤ ( Σ_{i,j} α_{ij} d_{ij}(x, y)² )^{1/2} + ( Σ_{i,j} α_{ij} d_{ij}(y, z)² )^{1/2}   (12)
        = D(x, y) + D(y, z),   (13)

where Eq. (12) follows from the Cauchy–Schwarz inequality. ∎
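The metric property can also be spot-checked numerically. The snippet below evaluates the per-map distance of Eq. (10) on random nonnegative vectors (standing in for ReLU feature maps) and verifies the triangle inequality empirically; the constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def d_ij(a, b, c1=1e-6, c2=1e-6):
    """Per-feature-map distance of Eq. (10): sqrt((1 - l) + (1 - s))."""
    l = (2 * a.mean() * b.mean() + c1) / (a.mean() ** 2 + b.mean() ** 2 + c1)
    cov = ((a - a.mean()) * (b - b.mean())).mean()
    s = (2 * cov + c2) / (a.var() + b.var() + c2)
    return np.sqrt(max((1 - l) + (1 - s), 0.0))

# empirical check of d(x, z) <= d(x, y) + d(y, z) on ReLU-like inputs
for _ in range(1000):
    x, y, z = (np.maximum(rng.normal(size=64), 0.0) for _ in range(3))
    assert d_ij(x, z) <= d_ij(x, y) + d_ij(y, z) + 1e-9
```

Such a randomized check cannot replace the proof, but it is a cheap regression test for any reimplementation of Eqs. (5)-(10).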
The perceptual weights {α, β} in Eq. (7) are jointly optimized for human perception of image quality and texture invariance. Specifically, for image quality, we minimize the absolute error between model predictions and human ratings:

ℓ_q(x, y; α, β) = | DISTS(x, y; α, β) − q(y) |,   (14)

where q(y) denotes the normalized ground-truth quality score of y collected from psychophysical experiments. We choose the large-scale IQA dataset KADID-10k [59] as the training set, which contains 81 reference images, each of which is distorted by 25 distortion types at 5 distortion levels. In addition, we explicitly enforce the model to be invariant to texture substitution in a data-driven fashion. We minimize the distance (measured by Eq. (7)) between two patches z1 and z2 sampled from the same texture image z:

ℓ_t(z1, z2; α, β) = DISTS(z1, z2; α, β).   (15)

We select texture images from the describable textures dataset (DTD) [60], consisting of 5,640 images (47 categories, with 120 images in each category). In practice, we randomly sample two mini-batches, one from KADID-10k and one from DTD, and use a variant of stochastic gradient descent to adjust the parameters {α, β}:

min_{α,β}  E_{x,y}[ ℓ_q(x, y; α, β) ] + λ E_z[ ℓ_t(z1, z2; α, β) ],   (16)

where λ governs the trade-off between the two terms.
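The combined objective of Eq. (16) reduces to a simple function of the model's outputs. This sketch takes precomputed DISTS values as inputs; since λ's value is not given above, it is left as a parameter:

```python
import numpy as np

def joint_objective(d_iqa, q_human, d_texture, lam=1.0):
    """Eq. (16): expected quality error (Eq. 14) plus lam times the expected
    distance between same-texture patch pairs (Eq. 15).

    d_iqa:     DISTS values for (reference, distorted) pairs from the IQA set
    q_human:   corresponding normalized human quality scores
    d_texture: DISTS values between patch pairs cropped from the same texture
    """
    quality_term = np.mean(np.abs(np.asarray(d_iqa) - np.asarray(q_human)))
    invariance_term = np.mean(np.asarray(d_texture))
    return quality_term + lam * invariance_term
```

The second term is minimized when resampled patches of the same texture are assigned zero distance, which is exactly the invariance being trained for.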
Method  LIVE [61]  CSIQ [4]  TID2013 [62]
        PLCC  SRCC  KRCC  RMSE  PLCC  SRCC  KRCC  RMSE  PLCC  SRCC  KRCC  RMSE
PSNR  0.865  0.873  0.680  13.716  0.819  0.810  0.601  0.154  0.677  0.687  0.496  0.912 
SSIM [2]  0.937  0.948  0.796  9.575  0.852  0.865  0.680  0.138  0.677  0.687  0.496  0.912 
MS-SSIM [63]  0.940  0.951  0.805  9.308  0.889  0.906  0.730  0.120  0.830  0.786  0.605  0.692 
VSI [64]  0.948  0.952  0.806  8.682  0.928  0.942  0.786  0.098  0.900  0.897  0.718  0.540 
MAD [4]  0.968  0.967  0.842  6.907  0.950  0.947  0.797  0.082  0.827  0.781  0.604  0.698 
VIF [3]  0.960  0.964  0.828  7.679  0.913  0.911  0.743  0.107  0.771  0.677  0.518  0.789 
FSIM [24]  0.961  0.965  0.836  7.530  0.919  0.931  0.769  0.103  0.877  0.851  0.667  0.596 
NLPD [39]  0.932  0.937  0.778  9.901  0.923  0.932  0.769  0.101  0.839  0.800  0.625  0.674 
GMSD [25]  0.957  0.960  0.827  7.948  0.945  0.950  0.804  0.086  0.855  0.804  0.634  0.642 
DeepIQA [27]  0.940  0.947  0.791  9.305  0.901  0.909  0.732  0.114  0.834  0.831  0.631  0.684 
PieAPP [7]  0.909  0.918  0.749  11.417  0.873  0.890  0.705  0.128  0.829  0.844  0.657  0.694 
LPIPS [6]  0.934  0.932  0.765  9.735  0.896  0.876  0.689  0.117  0.749  0.670  0.497  0.823 
DISTS (ours)  0.954  0.954  0.811  8.214  0.928  0.929  0.767  0.098  0.855  0.830  0.639  0.643 
The proposed DISTS model has a close relationship to a number of existing IQA methods.
SSIM and its variants [2, 63, 23]: The multi-scale extension of SSIM [63] incorporates the variations of viewing conditions in IQA, and calibrates the cross-scale parameters via subjective testing on artificially synthesized images. Our model follows a similar approach, building on a multi-scale hierarchical representation and directly calibrating the cross-scale parameters (i.e., {α, β}) using subject-rated natural images with various distortions. The extension of SSIM to the complex wavelet domain [23] gains invariance to small geometric transformations by measuring relative phase patterns of the wavelet coefficients. As will be clear in Section 3.5, by optimizing for texture invariance, our method inherits insensitivity to mild geometric transformations. Nevertheless, DISTS does not offer a 2D map that indicates local quality variations across spatial locations, as the SSIM family does.
The adaptive linear system framework [16] decomposes the distortion between two images into a linear combination of components adapted to local image structures, separating structural and non-structural distortions. It generalizes many IQA models, including MSE, space/frequency weighting [18, 65], transform domain masking [20], and the tangent distance [66]. DISTS can be seen as an adaptive nonlinear system, where structure comparison captures structural distortions, and mean intensity comparison measures non-structural distortions, with basis functions adapted to global image content.
Style and content losses [55]: Losses based on the pre-trained VGG network have reignited the field of style transfer. Specifically, the style loss is built upon the correlations between convolution responses at the same stages (the Gram matrix), while the content loss is defined by the MSE between the two representations. The combined loss does not have the desired property of a unique minimum that we seek. By incorporating the input image as the zeroth-stage feature representation of VGG and making SSIM-inspired quality measurements, the square root of DISTS is a valid metric.
Image restoration losses [67]: Image restoration losses in the era of deep learning are typically defined as a weighted sum of ℓp-norm distances computed on the raw pixels and on several stages of VGG feature maps, where the weights are manually tuned for the task at hand. Later stages of the VGG representation are often preferred, so as to incorporate image semantics into low-level vision, encouraging perceptually meaningful details that are not necessarily aligned with the underlying image. This type of loss does not achieve the level of texture invariance we are looking for. Moreover, the weights of DISTS are jointly optimized for image quality and texture invariance, and can be used across multiple low-level vision tasks.

In this section, we first present the implementation details of the proposed DISTS. We then compare our method with a wide range of image similarity models in terms of quality prediction, texture similarity, texture classification/retrieval, and invariance to geometric transformations.
We fix the filter kernels of the pre-trained VGG, and learn the perceptual weights {α, β}. The training is carried out by optimizing the objective function in Eq. (16) for a fixed value of λ, using Adam [68] with a batch size of 32. After every 1K iterations, we reduce the learning rate by a constant factor. We train DISTS for 5K iterations, which takes approximately one hour on an NVIDIA GTX 2080 GPU. To ensure a unique minimum of our model, we project the weights of the zeroth stage onto a fixed positive interval after each gradient step. We choose a Hanning window to anti-alias the VGG representation. Both c1 in Eq. (5) and c2 in Eq. (6) are set to a small positive constant. During training and testing, we follow the suggestions in [2], and rescale the input images such that the smaller dimension has a fixed size.
Trained on the entire KADID-10k [59] dataset, DISTS is tested on three other standard IQA databases, LIVE [61], CSIQ [4], and TID2013 [62], to verify model generalizability. We use the Spearman rank correlation coefficient (SRCC), the Kendall rank correlation coefficient (KRCC), the Pearson linear correlation coefficient (PLCC), and the root mean square error (RMSE) as the evaluation criteria. Before computing PLCC and RMSE, we fit a four-parameter monotonic function to compensate for the nonlinearity:

q̂(D) = (ξ1 − ξ2) / ( 1 + exp(−(D − ξ3)/|ξ4|) ) + ξ2,   (17)

where {ξi ; i = 1, …, 4} are parameters to be fitted. We compare DISTS against a set of full-reference IQA methods, including nine knowledge-driven models and three data-driven CNN-based models. The implementations of all methods are obtained from the respective authors, except for DeepIQA [27], which is retrained on KADID-10k for a fair comparison. As LPIPS [6] has different configurations, we choose the default one, LPIPS-VGG-lin.
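The evaluation protocol (rank correlations on raw predictions, PLCC and RMSE after the monotonic fit) can be sketched with SciPy; the initialization heuristic for the fit is our assumption:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import kendalltau, spearmanr

def logistic4(d, xi1, xi2, xi3, xi4):
    """Four-parameter monotonic mapping of Eq. (17)."""
    return (xi1 - xi2) / (1 + np.exp(-(d - xi3) / np.abs(xi4))) + xi2

def evaluate(d, mos):
    """Return (SRCC, KRCC, PLCC, RMSE) for model predictions d vs. MOS."""
    d, mos = np.asarray(d, float), np.asarray(mos, float)
    srcc = spearmanr(d, mos).correlation
    krcc = kendalltau(d, mos).correlation
    p0 = [mos.max(), mos.min(), d.mean(), d.std() + 1e-6]  # heuristic init
    params, _ = curve_fit(logistic4, d, mos, p0=p0, maxfev=10000)
    fitted = logistic4(d, *params)
    plcc = np.corrcoef(fitted, mos)[0, 1]
    rmse = float(np.sqrt(np.mean((fitted - mos) ** 2)))
    return srcc, krcc, plcc, rmse
```

SRCC and KRCC are invariant to the monotonic fit and so are computed on the raw predictions, while PLCC and RMSE are computed after it, matching the protocol described above.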
Results, reported in Table I, demonstrate that DISTS performs favorably in comparison to both classic methods (e.g., PSNR and SSIM [2]) and CNN-based models (DeepIQA, PieAPP, and LPIPS). Overall, the best performances across all three databases and all comparison metrics are obtained with MAD [4], FSIM [24], and GMSD [25]. It is worth noting that the three databases have been reused for many years throughout algorithm design processes, and recent full-reference IQA methods tend to adapt themselves to these databases, deliberately or unintentionally, via extensive computational module selection, raising the risk of overfitting (see Fig. 2). Fig. 7 shows scatter plots of the predictions of representative IQA methods versus the raw (i.e., before the nonlinear mapping of Eq. (17)) subjective mean opinion scores (MOSs) on the TID2013 database. From the fitted curves, one can observe that DISTS is nearly linear in MOS.
We also tested DISTS on BAPPS [6], a large-scale and highly varied patch similarity dataset. BAPPS contains 1) traditional synthetic distortions, such as geometric and photometric manipulation, noise contamination, blurring, and compression, 2) CNN-based distortions, such as those from denoising autoencoders and image restoration tasks, and 3) distortions generated by real-world image processing systems. The human similarity judgments were obtained from a two-alternative forced choice (2AFC) experiment. From Table II, we find that DISTS (which was not trained on BAPPS, or any similar database) achieves performance comparable to LPIPS [6], which was trained on BAPPS. We conclude that DISTS predicts image quality well, and generalizes to challenging unseen distortions, such as those caused by real-world algorithms.

Method  Synthetic distortions  Distortions by real-world algorithms  All
        Traditional  CNN-based  All  Super-resolution  Video deblurring  Colorization  Frame interpolation  All
Human  0.808  0.844  0.826  0.734  0.671  0.688  0.686  0.695  0.739  
PSNR  0.573  0.801  0.687  0.642  0.590  0.624  0.543  0.614  0.633  
SSIM [2]  0.605  0.806  0.705  0.647  0.589  0.624  0.573  0.617  0.640  
MS-SSIM [63]  0.585  0.768  0.676  0.638  0.589  0.524  0.572  0.596  0.617  
VSI [64]  0.630  0.818  0.724  0.668  0.592  0.597  0.568  0.622  0.648  
MAD [4]  0.598  0.770  0.684  0.655  0.593  0.490  0.581  0.599  0.621  
VIF [3]  0.556  0.744  0.650  0.651  0.594  0.515  0.597  0.603  0.615  
FSIM [24]  0.627  0.794  0.710  0.660  0.590  0.573  0.581  0.615  0.640  
NLPD [39]  0.550  0.764  0.657  0.655  0.584  0.528  0.552  0.600  0.615  
GMSD [25]  0.609  0.772  0.690  0.677  0.594  0.517  0.575  0.613  0.633  
DeepIQA [27]  0.703  0.794  0.748  0.660  0.582  0.585  0.598  0.615  0.650  
PieAPP [7]  0.725  0.769  0.747  0.685  0.582  0.594  0.598  0.626  0.658  
LPIPS [6]  0.760  0.828  0.794  0.705  0.605  0.625  0.630  0.641  0.692  
DISTS (ours)  0.772  0.822  0.797  0.710  0.600  0.627  0.625  0.651  0.689 
We also tested the performance of DISTS on texture quality assessment. Since most knowledge-driven full-reference IQA models are not good at measuring texture similarity (see Fig. 1), we only include SSIM [2] and FSIM [24] for reference. We add CW-SSIM [23] and three computational models specifically designed for texture similarity: STSIM [30], NPTSM [69], and IGSTQA [70]. STSIM has several configurations, and we choose the local STSIM-2, which is publicly available (https://github.com/andreydung/Steerablefilter).

We used a synthesized texture quality assessment database, SynTEX [71], consisting of reference textures together with synthesized versions generated by five texture synthesis algorithms. Table III shows the SRCC and KRCC results, where we can see that texture similarity models generally perform better than IQA models. Among the texture similarity models, IGSTQA [70] achieves relatively high performance, but is still inferior to DISTS. This indicates that the VGG-based global measurements of DISTS capture the essential features and attributes of visual textures.
To further investigate DISTS as a measure of texture quality, rather than mere texture similarity, we constructed a texture quality database (TQD), which contains 10 texture images selected from Pixabay (https://pixabay.com/images/search/texture). For each texture image, we first add seven traditional synthetic distortions: additive white Gaussian noise, Gaussian blur, JPEG compression, JPEG2000 compression, pink noise, chromatic aberration, and image color quantization. For each distortion type, we randomly select one distortion level from a total of three levels, and apply it to each texture image. We then create four copies of each texture using different texture synthesis algorithms, including two classical ones (a parametric model [11] and a non-parametric model [72]) and two CNN-based ones [14, 73]. Last, to produce "high-quality" images, we randomly crop four sub-images from the original texture. In total, TQD has 150 images. We gathered human data from a group of subjects, who have general knowledge of image processing but are unaware of the detailed purpose of the study. The viewing distance was fixed for all subjects. Each subject is shown all ten sets of images, one set at a time, and is asked to rank the images according to their perceptual similarity to the reference texture. Instead of simply averaging the human opinions, we use reciprocal rank fusion [74] to obtain the final ranking:

r̄(y) = Σ_s 1 / ( k + r_s(y) ),   (18)

where r_s(y) is the rank of y given by the s-th subject and k is a constant to mitigate the impact of high rankings by outlier systems [74]. Table III lists the SRCC and KRCC results, where we compute the correlations within each texture pattern and average them across textures. We find that nearly all existing models perform poorly on the new database, including those tailored to texture similarity. In contrast, DISTS outperforms these methods by a large margin. Fig. 8 shows a set of texture examples, where we notice that DISTS gives high rankings to resampled images and low rankings to images suffering from visible distortions. This verifies that our model is in close agreement with human perception of texture quality, and has great potential for use in other texture analysis problems, such as high-quality texture retrieval.

Method  SynTEX [71]  TQD (proposed)
        SRCC  KRCC   SRCC  KRCC
SSIM [2]  0.620  0.446  0.307  0.185 
CW-SSIM [23]  0.497  0.335  0.325  0.238 
DeepIQA [27]  0.512  0.354  0.444  0.323 
PieAPP [7]  0.709  0.530  0.713  0.554 
LPIPS [6]  0.663  0.478  0.392  0.301 
STSIM [30]  0.643  0.469  0.408  0.315 
NPTSM [69]  0.496  0.361  0.679  0.547 
IGSTQA [70]  0.820  0.621  0.802  0.651 
DISTS (Ours)  0.923  0.759  0.910  0.785 
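The reciprocal rank fusion of Eq. (18) is only a few lines of code; k = 60 here is a conventional default from the RRF literature, not a value stated above:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Eq. (18): fuse per-subject rankings into one ordering (best first).

    rankings: one dict per subject, mapping item -> rank (1 = best); every
    subject is assumed to rank every item, as in the TQD experiment.
    """
    items = set().union(*rankings)
    score = {it: sum(1.0 / (k + r[it]) for r in rankings) for it in items}
    return sorted(items, key=score.get, reverse=True)
```

For example, if two of three subjects rank item "a" first, the fused ordering places "a" ahead of the others even when one subject disagrees, which is the outlier-damping behavior the constant k controls.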
We also applied DISTS to texture classification and retrieval. We used the grayscale and color Brodatz texture databases [75] (denoted by GBT and CBT, respectively), each of which contains 112 different texture images. We resampled nine non-overlapping patches from each texture pattern. Fig. 9 shows representative texture samples extracted from CBT.

The texture classification problem consists of assigning an unknown sample image to one of the known texture classes. For each texture, we randomly choose five patches for training, two for validation, and the remaining two for testing. A simple k-nearest-neighbors (k-NN) classification algorithm is implemented, which allows us to incorporate and compare different similarity models as distance measures. The predicted label of a test image is determined by a majority vote over the labels of its k nearest neighbors in the training set, where the value of k is chosen using the validation set. We implement a baseline model, the bag-of-words of SIFT features [76] with k-NN. The classification accuracy results are listed in Table IV, where we see that the baseline model beats most image-similarity-based k-NN classifiers, except LPIPS (on CBT) and DISTS. This shows that our model is effective at discriminating textures that are visually different to the human eye.
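The classification experiment plugs an arbitrary similarity model into k-NN. A minimal sketch with a plug-in distance function follows, using Euclidean distance on toy 1D "textures" as a stand-in for an IQA model:

```python
from collections import Counter

def knn_classify(query, train_items, train_labels, distance, k=3):
    """Majority vote over the k nearest training items under `distance`."""
    order = sorted(range(len(train_items)),
                   key=lambda i: distance(query, train_items[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# toy usage: two well-separated 1D "texture" clusters
euclid = lambda a, b: abs(a - b)
items = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
labels = ["grass", "grass", "grass", "brick", "brick", "brick"]
```

Substituting `euclid` with any pairwise image-similarity model (SSIM, LPIPS, DISTS, ...) reproduces the structure of the comparison in Table IV.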
The content-based texture retrieval problem consists of searching a large database for images that are visually indistinguishable from a query. In our experiment, for each texture, we set three patches as the queries and aim to retrieve the remaining six patches. Specifically, the distances between each query and the remaining images in the dataset are computed and ranked so as to retrieve the images with minimal distances. To evaluate the retrieval performance, we use mean average precision (mAP), which is defined by

mAP = (1/Q) Σ_{q=1}^{Q} (1/R) Σ_{n=1}^{N} P_q(n) · rel_q(n),   (19)

where Q is the number of queries, R is the number of similar images in the database, P_q(n) is the precision at cutoff n in the ranked list of the q-th query, and rel_q(n) is an indicator function equal to one if the item at rank n is a similar image and zero otherwise. As seen in Table IV, DISTS achieves the best performance on both CBT and GBT. The classification/retrieval errors are primarily due to textures with noticeable inhomogeneities (e.g., the middle patch in Fig. 9 (c)). In addition, the performance on GBT is slightly reduced compared with that on CBT, indicating the importance of color information in these tasks.
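Eq. (19) in code, taking for each query the list of 0/1 relevance flags down its ranked list:

```python
def average_precision(flags, num_relevant):
    """Inner sum of Eq. (19) for one query: mean precision at relevant hits."""
    hits, ap = 0, 0.0
    for n, rel in enumerate(flags, start=1):
        if rel:
            hits += 1
            ap += hits / n          # precision at cutoff n
    return ap / num_relevant

def mean_average_precision(all_flags, num_relevant):
    """Eq. (19): average of the per-query APs."""
    return sum(average_precision(f, num_relevant) for f in all_flags) / len(all_flags)
```

In the retrieval experiment above, each query would have six relevant items (the remaining patches of the same texture), so `num_relevant` would be 6.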
Classification and retrieval of texture patches resampled from the same images are relatively easy tasks. We also tested DISTS on a more challenging large-scale texture database, the Amsterdam Library of Textures (ALOT) [77], containing photographs of textured surfaces captured from different viewing angles and under different illumination conditions. Again, we adopt a naïve k-NN method using our model as the measure of distance, and test it on a random subset of the samples in the database. Without training on ALOT, DISTS achieves a reasonable classification accuracy, albeit lower than those of a knowledge-driven method [78] with handcrafted features and support vector machines, and of a data-driven CNN-based method [79]. The primary cause of errors when using DISTS in this task is that images of the same textured surface can appear quite different under different lighting or viewpoints, as seen in the example in Fig. 10. DISTS, which is designed to capture visual appearance only, could likely be improved for this task by fine-tuning the perceptual weights (along with the VGG network parameters) on a small subset of human-labeled ALOT images.

TABLE IV: Texture classification accuracy and retrieval mAP on the CBT and GBT databases

Method        | Classification acc. | Retrieval mAP
              | CBT    | GBT        | CBT    | GBT
SSIM [2]      | 0.397  | 0.210      | 0.371  | 0.145
CW-SSIM [23]  | --     | 0.424      | --     | 0.351
DeepIQA [27]  | 0.388  | 0.308      | 0.389  | 0.293
PieAPP [7]    | 0.178  | 0.117      | 0.260  | 0.157
LPIPS [6]     | 0.960  | 0.861      | 0.951  | 0.839
STSIM [30]    | --     | 0.708      | --     | 0.632
NPTSM [69]    | --     | 0.895      | --     | 0.837
IGSTQA [70]   | --     | 0.862      | --     | 0.798
SIFT [76]     | 0.924  | 0.928      | 0.859  | 0.865
DISTS (ours)  | 0.995  | 0.968      | 0.988  | 0.951
Apart from texture resampling, most full-reference IQA measures fail dramatically when the original and distorted images are misregistered, either globally or locally. The underlying reason is again reliance on the assumption of pixel alignment. Although pre-registration alleviates this issue in certain cases, it comes with substantial computational complexity, and does not work well in the presence of severe distortions [17]. Here we investigate the degree of invariance of DISTS to geometric transformations that are imperceptible to the human visual system.
As there are no subject-rated IQA databases designed for this specific purpose, we augment the LIVE database [61] (LIVE-Aug) with geometric transformations. In real-world scenarios, an image first undergoes geometric transformations (e.g., due to camera movement) and then distortions (e.g., JPEG compression). We follow the suggestion in [17], and implement an equivalent but much simpler approach: directly applying the transformations to the original image. Specifically, we generate four augmented reference images using geometric transformations: 1) a horizontal shift, 2) a clockwise rotation, 3) a dilation, and 4) their combination. This yields an augmented set of reference-distortion pairs. Since the transformations are modest, the quality scores of distorted images with respect to the modified reference images are assumed to be the same as with respect to the original reference image.
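The augmentation step can be sketched as follows. The excerpt does not state the transformation amounts, so the shift, angle, and scale below are placeholder values, and `scipy.ndimage` is our choice of tooling rather than the authors':

```python
import numpy as np
from scipy import ndimage

def augment_reference(img, shift_px=5, angle_deg=2.0, scale=1.05):
    """Produce the four geometrically transformed reference images
    (horizontal shift, clockwise rotation, dilation, and their
    combination) for an H x W grayscale array `img`.
    All amounts are illustrative placeholders."""
    shifted = ndimage.shift(img, (0, shift_px), mode="nearest")
    # Negative angle = clockwise rotation; reshape=False keeps the size.
    rotated = ndimage.rotate(img, -angle_deg, reshape=False, mode="nearest")
    # Dilation: zoom in, then center-crop back to the original shape.
    zoomed = ndimage.zoom(img, scale, mode="nearest")
    dy = (zoomed.shape[0] - img.shape[0]) // 2
    dx = (zoomed.shape[1] - img.shape[1]) // 2
    dilated = zoomed[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    combined = ndimage.shift(
        ndimage.rotate(dilated, -angle_deg, reshape=False, mode="nearest"),
        (0, shift_px), mode="nearest")
    return [shifted, rotated, dilated, combined]
```

Each distorted image in LIVE is then scored against these four modified references in addition to the original one.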
The SRCC results on the augmented LIVE database are shown in Table V. We find that data-driven methods based on CNNs outperform traditional ones by a large margin. Note that even the simplest geometric transformation, translation, may hurt the performance of CNN-based methods, which indicates that this type of invariance does not come for free when CNNs ignore the sampling (Nyquist) theorem during downsampling. Trained on data augmented with geometric transformations, GTI-CNN [17] achieves the desired invariance at the cost of discarding perceptually important features (see Fig. 2). DISTS is seen to perform extremely well across all distortions and to exhibit a high degree of robustness to geometric transformations, which we believe arises from 1) replacing max pooling with ℓ2 pooling, 2) using global quality measurements, and 3) optimizing for invariance to texture resampling (see also Fig. 11).
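For intuition, ℓ2 pooling replaces the hard max with the square root of a lowpass-filtered squared response, anti-aliasing before subsampling and thereby improving shift robustness. A minimal 2-D sketch, where the separable [1, 2, 1]/4 filter is an illustrative choice rather than the exact window used in DISTS:

```python
import numpy as np

def l2_pool(x, stride=2):
    """l2 pooling of a 2-D feature map: square, blur with a separable
    [1, 2, 1]/4 lowpass filter (edge padding), subsample by `stride`,
    then take the square root."""
    sq = x.astype(float) ** 2
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    # Vertical pass of the separable blur.
    pad = np.pad(sq, 1, mode="edge")
    blurred = sum(k[i] * pad[i:i + sq.shape[0], 1:-1] for i in range(3))
    # Horizontal pass.
    pad2 = np.pad(blurred, ((0, 0), (1, 1)), mode="edge")
    blurred = sum(k[j] * pad2[:, j:j + sq.shape[1]] for j in range(3))
    return np.sqrt(blurred[::stride, ::stride])
```

Because the blur spreads each squared response across neighboring positions before subsampling, a one-pixel shift of the input perturbs the pooled output far less than it would under max pooling.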
TABLE V: SRCC results on the augmented LIVE database (LIVE-Aug)

Method        | Distortion type                                               | Geometric transformation                 | Total
              | JPEG2000 | JPEG  | Gauss. noise | Gauss. blur | Fast fading  | Translation | Rotation | Dilation | Mixed |
PSNR          | 0.077    | 0.106 | 0.781        | 0.112       | 0.003        | 0.159       | 0.153    | 0.152    | 0.146 | 0.195
SSIM [2]      | 0.104    | 0.107 | 0.679        | 0.133       | 0.080        | 0.171       | 0.168    | 0.177    | 0.166 | 0.190
MS-SSIM [63]  | 0.091    | 0.126 | 0.595        | 0.107       | 0.066        | 0.165       | 0.174    | 0.198    | 0.174 | 0.177
CW-SSIM [23]  | 0.062    | 0.182 | 0.579        | 0.065       | 0.054        | 0.207       | 0.312    | 0.364    | 0.219 | 0.194
VSI [64]      | 0.083    | 0.362 | 0.710        | 0.034       | 0.217        | 0.282       | 0.360    | 0.372    | 0.297 | 0.309
MAD [4]       | 0.195    | 0.418 | 0.542        | 0.149       | 0.274        | 0.354       | 0.630    | 0.587    | 0.453 | 0.327
VIF [3]       | 0.277    | 0.262 | 0.366        | 0.194       | 0.391        | 0.296       | 0.433    | 0.522    | 0.387 | 0.294
FSIM [24]     | 0.104    | 0.432 | 0.634        | 0.106       | 0.283        | 0.380       | 0.396    | 0.408    | 0.365 | 0.339
NLPD [39]     | 0.060    | 0.069 | 0.501        | 0.166       | 0.047        | 0.062       | 0.074    | 0.083    | 0.066 | 0.112
GMSD [25]     | 0.048    | 0.470 | 0.477        | 0.106       | 0.235        | 0.252       | 0.299    | 0.303    | 0.247 | 0.288
DeepIQA [27]  | 0.813    | 0.873 | 0.948        | 0.827       | 0.813        | 0.822       | 0.919    | 0.918    | 0.881 | 0.859
PieAPP [7]    | 0.875    | 0.884 | 0.952        | 0.912       | 0.908        | 0.848       | 0.901    | 0.903    | 0.876 | 0.872
LPIPS [6]     | 0.730    | 0.872 | 0.919        | 0.592       | 0.743        | 0.811       | 0.908    | 0.893    | 0.861 | 0.779
GTI-CNN [17]  | 0.879    | 0.910 | 0.910        | 0.765       | 0.837        | 0.864       | 0.906    | 0.904    | 0.890 | 0.875
DISTS (ours)  | 0.944    | 0.948 | 0.957        | 0.921       | 0.894        | 0.948       | 0.939    | 0.946    | 0.937 | 0.928
We have presented a new full-reference IQA method, DISTS, which is the first of its kind with built-in invariance to texture resampling. Our model unifies structure and texture similarity, is robust to mild geometric distortions, and performs well in texture classification and retrieval.
DISTS is based on the pre-trained VGG network for object recognition. By computing the global means of the convolution responses at each stage, we established a universal parametric texture model similar to that of Portilla & Simoncelli [11]. Despite this empirical success, it is imperative to open the "black box" and to understand 1) which texture features and attributes are captured by the pre-trained network, and how, and 2) the importance of cascaded convolution and subsampled pooling in summarizing useful texture information. It is also of interest to extend the current model to measure distortions locally, as is done in SSIM. In this case, the distance measure could be reformulated to select between structure and texture measures as appropriate, instead of simply combining them linearly.
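As a concrete illustration of the two kinds of measurements, the per-channel texture and structure terms take the following SSIM-like form: texture compares the global spatial means of a pair of feature maps, while structure compares their normalized covariance. The stabilizing constants `C1`, `C2` below are placeholders, not the values used in the model:

```python
import numpy as np

C1 = C2 = 1e-6  # small stabilizing constants (placeholder values)

def dists_terms(x, y):
    """Per-channel texture and structure similarity terms for a pair of
    2-D feature maps from the same stage/channel: texture compares the
    global means, structure compares the covariance of the maps."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    texture = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)
    structure = (2 * cxy + C2) / (vx + vy + C2)
    return texture, structure
```

Note that adding a constant offset to one map changes the texture term but leaves the structure term at one, which is exactly the separation of concerns the model exploits.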
The most direct use of IQA measures is for performance assessment and comparison of image processing systems. But perhaps more importantly, they may be used to optimize image processing methods, so as to improve the visual quality of their results. In this context, most existing IQA measures present major obstacles, because they lack desirable mathematical properties that aid optimization (e.g., injectivity, differentiability, and convexity). In many cases, they rely on surjective mappings, and minima are non-unique (see Fig. 2). Although DISTS enjoys several advantageous mathematical properties, it is still highly non-convex (with abundant saddle points and plateaus), and recovery from random noise using stochastic gradient descent (see Fig. 2) requires many more iterations than for SSIM. In practice, the larger the weight on the structure term at the zeroth stage (Eq. (6)), the faster the optimization converges. However, reaching a reasonable level of texture invariance constrains this weight, hindering optimization. We are currently analyzing DISTS in the context of perceptual optimization, with the intention of learning a more suitable set of perceptual weights by adding optimizability constraints. Initial results indicate that DISTS-based optimization of image processing applications, including denoising, deblurring, super-resolution, and compression, can lead to noticeable improvements in visual quality.
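The perceptual-optimization setting discussed above can be sketched generically: start from random noise and descend the gradient of a differentiable full-reference distance toward the reference. A plain MSE gradient is used here as a stand-in for the (far less convex) DISTS gradient; all names are our own:

```python
import numpy as np

def recover(reference, distance_grad, steps=500, lr=0.5, seed=0):
    """Recover an image from random noise by gradient descent on a
    differentiable full-reference distance. `distance_grad(x, ref)`
    returns the gradient of the distance with respect to x."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(reference.shape)
    for _ in range(steps):
        x -= lr * distance_grad(x, reference)
    return x

# MSE stand-in for a differentiable perceptual distance gradient.
mse_grad = lambda x, ref: 2 * (x - ref) / x.size
```

With MSE the descent converges to the reference directly; with a highly non-convex measure such as DISTS, the same loop traverses saddle points and plateaus, which is why it needs many more iterations than SSIM-based recovery.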
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
K. Popat and R. W. Picard, "Cluster-based probability model and its application to image and texture processing," IEEE Transactions on Image Processing, vol. 6, no. 2, pp. 268–284, 1997.
…, Journal of Machine Learning Research, vol. 20, no. 184, pp. 1–25, 2019.
B. Vintch, J. A. Movshon, and E. P. Simoncelli, "A convolutional subunit model for neuronal responses in macaque V1," Journal of Neuroscience, vol. 35, no. 44, pp. 14829–14841, 2015.
L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," in International Conference on Learning Representations, 2015, pp. 1–13.
…, in International Joint Conferences on Artificial Intelligence, 2017, pp. 2230–2236.