The Fourier Loss Function

02/05/2021
by Gennaro Auricchio, et al.
University of Pavia

This paper introduces a new loss function induced by the Fourier-based Metric. This metric is equivalent to the Wasserstein distance but is computed very efficiently using the Fast Fourier Transform algorithm. We prove that the Fourier loss function is twice differentiable, and we provide the explicit formula for both its gradient and its Hessian matrix. More importantly, we show that minimising the Fourier loss function is equivalent to maximising the likelihood of the data under a Gaussian noise in the space of frequencies. We apply our loss function to a multi-class classification task using MNIST, Fashion-MNIST, and CIFAR10 datasets. The computational results show that, while its accuracy is competitive with other state-of-the-art loss functions, the Fourier loss function is significantly more robust to noisy data.

1 Introduction

The main purpose of Machine Learning is to build algorithms capable of performing specific tasks by extracting information from data, thus achieving good generalisation skills [1]. A popular task in Machine Learning is classification, where the aim is to learn from a training set how to assign a class to any new input. For instance, in image recognition, a set of correctly labelled images is given, a model is trained to extract the right features from the data, and previously unobserved images are then classified [2, 3, 4].

A very popular class of algorithms for solving classification problems is that of neural networks, which achieve very high performance in terms of accuracy [5, 6, 7, 8, 9, 10, 11]. In classification, the output of a neural network is usually a discrete probability measure supported on the set of all possible classes: the higher the probability assigned to a class, the more confident the model is in assigning that class to the input. During training, these probability measures are compared with the correct label through a loss function. Choosing different loss functions can lead to different learning performances, which is why the choice of the loss function plays a crucial role. However, as pointed out in [12], the most common choice in the deep learning community is the Cross Entropy loss function.

Recently, the Wasserstein distance [13, 14, 15] has been applied in many different fields to compare probability measures. Unlike the Total Variation distance and the Kullback-Leibler divergence, the Wasserstein distance induces a weaker notion of convergence. From a mathematical point of view, this results in a richer set of minimising sequences, making the Wasserstein distance appealing as a loss function in both supervised and unsupervised tasks [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26].

However, the Wasserstein distance presents several drawbacks. The main issue is that its exact computation is very time-consuming. Furthermore, when used as a loss function, it makes gradient descent methods difficult to apply, since the computation of its gradient is itself a challenging task [16]. To overcome these limitations, many regularisations of the Wasserstein distance have been proposed [27, 26, 28, 29]. To use these regularisations, however, a trade-off between the quality of the approximation and the computational cost has to be found.

In this paper, we advocate for the use of a new loss function, based on the Fourier metrics recently introduced in [30], which generalises the results of [31, 32, 33]. These metrics, which are equivalent to the Wasserstein distance, have the advantage of being faster to compute thanks to the Fast Fourier Transform (FFT) algorithm.

In this work, among the Fourier metrics, we choose a distance that is equivalent to the Wasserstein distance. We show that the loss function induced by this metric is twice differentiable and that its gradient and Hessian matrix admit explicit formulae. Moreover, we show that the Hessian is positive definite and provide the expression for its eigenvalues. We are also able to justify the use of this loss function from a probabilistic point of view: if we consider a Gaussian noise on the space of frequencies, the minimisation of the Fourier loss function is equivalent to the maximisation of the likelihood of the observations.

Finally, we validate the Fourier loss function for multi-class classification tasks. We compare the accuracy and the robustness of different neural networks, trained with the Fourier, the Kullback-Leibler, and the Total Variation loss functions. While the accuracies are comparable, the neural networks trained with the Fourier loss function are more robust to random noise on the input data, showing a better generalisation ability.

This paper is organised as follows. In Section 2 we set our notation and recall the state-of-the-art loss functions for classification. In Section 3 we recall the definition of the Discrete Fourier Transform (DFT) and the Fourier-based metric. Afterwards, we introduce the Fourier loss function and present the main theoretical results. In Section 4 we report the results obtained in multi-class classification tasks on MNIST, Fashion-MNIST, and CIFAR10 datasets. Final remarks and future works are discussed in Section 5.

2 Loss Functions for Probability Measures

In this section, we set our notation and recall the main distances and divergences used as loss functions to compare probability measures in Machine Learning. Throughout, $X$ denotes a generic metric space and $\mathcal{P}(X)$ denotes the set of all probability measures over $X$. For the sake of simplicity, we always assume the measures to be discrete.

Definition 1.

Given a target measure $\nu \in \mathcal{P}(X)$ and a function $D : \mathcal{P}(X) \times \mathcal{P}(X) \to [0, +\infty]$, we define the loss of $\mu$ (with respect to $\nu$) as

    $\mathcal{L}_D(\mu) := D(\mu, \nu).$    (1)

The role of the loss function is to quantify the difference between the measure $\mu$ and the target $\nu$; therefore, the function $D$ is usually a distance.

In the following, we review the commonly used distances between probability measures.

  • The Total Variation distance ($TV$) [34] is defined as

        $TV(\mu, \nu) := \frac{1}{2} \sum_{x \in X} |\mu(x) - \nu(x)|.$

  • The Kullback-Leibler divergence ($KL$) [35] is defined as

        $KL(\nu \,\|\, \mu) := \sum_{x \in X} \nu(x) \log \frac{\nu(x)}{\mu(x)},$    (2)

    if $\mu(x) > 0$ for every $x$ such that $\nu(x) > 0$, and $KL(\nu \,\|\, \mu) := +\infty$ otherwise. We follow the convention $0 \log 0 = 0$.

    In classification tasks the label $\nu$ is fixed. Therefore we can simplify equation (2) by removing the constant term $\sum_{x \in X} \nu(x) \log \nu(x)$. As a result, we obtain the Cross Entropy ($CE$), which is defined as

        $CE(\mu, \nu) := -\sum_{x \in X} \nu(x) \log \mu(x).$    (3)

    Hence, in classification tasks, minimising the Cross Entropy is equivalent to minimising the Kullback-Leibler divergence [4, 36].

  • The Wasserstein Distance ($W_p$) [13, 14] is defined as

        $W_p(\mu, \nu) := \Bigl( \min_{\pi \in \Pi(\mu, \nu)} \sum_{x, y \in X} d(x, y)^p \, \pi(x, y) \Bigr)^{1/p},$    (4)

    where

        $\Pi(\mu, \nu) := \bigl\{ \pi \ge 0 : \sum_{y \in X} \pi(x, y) = \mu(x), \ \sum_{x \in X} \pi(x, y) = \nu(y) \bigr\}.$

    Intuitively, $\pi(x, y)$ denotes the amount of mass that is moved from the point $x$ to the point $y$ in order to reshape the configuration $\mu$ into the configuration $\nu$. The cost of moving a unit of mass from $x$ to $y$ is given by $d(x, y)^p$. The Wasserstein distance is then the minimum cost of performing the total reshape. When $X$ is a subset of a Euclidean space and $d$ is the Euclidean distance, we simply speak of the $p$-Wasserstein distance $W_p$. A code sketch of these definitions is given after this list.
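As a concrete reference for these definitions, the following minimal sketch computes the Total Variation distance, the Kullback-Leibler divergence, and the Cross Entropy between two discrete measures in PyTorch, the framework used for the experiments in Section 4. The function names and the small numerical example are ours, not taken from the paper; a Wasserstein computation is sketched after Example 1 below.

    import torch

    def total_variation(mu: torch.Tensor, nu: torch.Tensor) -> torch.Tensor:
        # TV = one half of the l1 distance between the two probability vectors
        return 0.5 * (mu - nu).abs().sum()

    def kl_divergence(mu: torch.Tensor, nu: torch.Tensor) -> torch.Tensor:
        # KL of the prediction mu from the fixed target nu, with the convention 0 log 0 = 0
        mask = nu > 0
        return (nu[mask] * (nu[mask] / mu[mask]).log()).sum()

    def cross_entropy(mu: torch.Tensor, nu: torch.Tensor) -> torch.Tensor:
        # CE = KL plus the (constant) entropy of the target nu
        mask = nu > 0
        return -(nu[mask] * mu[mask].log()).sum()

    mu = torch.tensor([0.1, 0.2, 0.3, 0.4])   # prediction
    nu = torch.tensor([0.0, 0.0, 1.0, 0.0])   # one-hot target
    print(total_variation(mu, nu), kl_divergence(mu, nu), cross_entropy(mu, nu))

For a one-hot target the entropy term vanishes, so the last two values coincide, matching the equivalence stated above.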

All these functions can be used to compare measures, but their behaviour concerning convergent sequences is quite different, as shown in the next example [19].

Example 1.

Let us consider the family of probability measures $\mu_\theta = \delta_\theta$, together with the target $\nu = \delta_0$, with $\theta \to 0$. We expect that the distance between $\mu_\theta$ and $\nu$ gets smaller as $\theta$ gets closer to $0$. However, only the Wasserstein distance detects this convergence. In fact, for every $\theta \neq 0$ we have

    $TV(\mu_\theta, \nu) = 1$ and $KL(\nu \,\|\, \mu_\theta) = +\infty,$

while

    $W_p(\mu_\theta, \nu) = |\theta| \to 0.$

Figure 1: Distance between $\mu_\theta$ and $\nu$ as a function of $\theta$. We have scaled the Wasserstein distance for visual convenience.
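A minimal numerical check of this example (our own illustration): we place two Dirac deltas on a uniform grid over $[0, 1]$ and compare the distances as the deltas approach each other. On a one-dimensional grid, $W_1$ can be computed as the $\ell_1$ distance between the cumulative sums of the two measures, scaled by the grid spacing.

    import torch

    n = 100
    h = 1.0 / n                      # grid spacing on [0, 1]

    def dirac(i: int) -> torch.Tensor:
        m = torch.zeros(n)
        m[i] = 1.0
        return m

    target = dirac(0)
    for i in (50, 10, 1):            # the delta moves towards the target
        mu = dirac(i)
        tv = 0.5 * (mu - target).abs().sum()                                    # stays equal to 1
        w1 = h * (torch.cumsum(mu, 0) - torch.cumsum(target, 0)).abs().sum()    # shrinks to 0
        print(i, tv.item(), w1.item())

The Total Variation stays equal to 1 for every positive shift, while $W_1$ shrinks linearly to 0, which is the behaviour sketched in Figure 1.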

Unlike the Total Variation and the Kullback-Leibler divergence, the Wasserstein distance allows us to perform gradient descent in order to learn a Dirac delta distribution. However, the computation of the Wasserstein distance and of its gradient is time-consuming [27, 37, 16]. For this reason, in computer vision and machine learning, the true Wasserstein distance is usually approximated [19, 38, 39, 40, 25, 41].

In the next section, we introduce the Fourier loss function, which has the same topological properties as the Wasserstein distance but is easier to compute. Moreover, we provide explicit formulae for the gradient and the Hessian of the Fourier loss function.

3 Fourier Metric as Loss Function

In what follows, we fix $X = X_n := \{x_0, x_1, \dots, x_{n-1}\}$, a set of $n$ points, for any given $n \in \mathbb{N}$. A discrete measure $\mu$ on $X_n$ is then defined as

    $\mu := \sum_{j=0}^{n-1} \mu_j \, \delta_{x_j},$    (5)

where the values $\mu_j$ are non-negative real numbers such that $\sum_{j=0}^{n-1} \mu_j = 1$. Since any discrete measure supported on $X_n$ is fully characterised by the $n$-tuple of values $(\mu_0, \dots, \mu_{n-1})$, we refer to discrete measures and vectors interchangeably.

3.1 The Fourier-based Metric

In this subsection, we review the main notions on the Fourier Transform of discrete measures; for a complete discussion, we refer to [42].

Definition 2.

The Discrete Fourier Transform (DFT) of $\mu = (\mu_0, \dots, \mu_{n-1})$ is the $n$-dimensional complex vector $\hat\mu = (\hat\mu_0, \dots, \hat\mu_{n-1})$ defined as

    $\hat\mu_k := \sum_{j=0}^{n-1} \mu_j \, e^{-\frac{2\pi i}{n} jk}, \qquad k = 0, \dots, n-1.$    (6)
Remark 1.

Since the complex exponential $e^{-\frac{2\pi i}{n} jk}$ is a periodic function of $k$ with period $n$ for any integer $j$, we set

    $\hat\mu_k := \hat\mu_{k \bmod n}$

for any $k \in \mathbb{Z}$, where $\bmod$ is the modulo operation. In particular, $\hat\mu_{-k} = \hat\mu_{n-k}$ for any $k$.

Remark 2.

The DFT of a discrete measure can be expressed as a linear map:

    $\hat\mu = \mathcal{F} \mu,$    (7)

where $\mathcal{F}$ is the $n \times n$ matrix defined as

    $\mathcal{F}_{k,j} := e^{-\frac{2\pi i}{n} jk}, \qquad j, k = 0, \dots, n-1,$    (8)

and $\mu = (\mu_0, \dots, \mu_{n-1})$. The matrix $\mathcal{F}$ is invertible, and therefore the DFT is a bijective function.

Proposition 1.

Let $\mu$ be a discrete measure; then

    $\hat\mu_{n-k} = \overline{\hat\mu_k}, \qquad k = 1, \dots, n-1,$    (9)

where $\overline{\hat\mu_k}$ denotes the complex conjugate of $\hat\mu_k$. In particular, we have

    $|\hat\mu_{n-k}| = |\hat\mu_k|, \qquad k = 1, \dots, n-1.$    (10)
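A quick numerical check of these properties, using the torch.fft module on which the experiments of Section 4 rely (our own illustration):

    import torch

    n = 7
    mu = torch.rand(n)
    mu = mu / mu.sum()            # a discrete probability measure, seen as a vector

    mu_hat = torch.fft.fft(mu)    # DFT as in Definition 2

    print(mu_hat[0])              # the zero frequency equals the total mass, i.e. 1
    for k in range(1, n):
        # conjugate symmetry of the DFT of a real vector, as in Proposition 1
        assert (mu_hat[n - k] - mu_hat[k].conj()).abs() < 1e-6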

We now introduce the Periodic Fourier-based Metric [30].

Definition 3.

Let $\mu$ and $\nu$ be two discrete measures over $X_n$. The Periodic Fourier-based Metric between $\mu$ and $\nu$ is defined as

(11)

In [30], it is proved that the integral in (11) converges for any pair of probability measures $\mu$ and $\nu$, and that the resulting metric is equivalent to the Wasserstein distance.

As we can see in Figure 2, the equivalence between the Fourier-based metric and the Wasserstein distance means that the Fourier-based metric is capable of detecting convergence in situations similar to those of Example 1.

Figure 2: Distance between $\mu_\theta$ and $\nu$ as a function of $\theta$ (cf. Example 1). We have scaled the distances for visual convenience.

By interpolating with piecewise constant functions, we can approximate the integral in (11) through the finite sum

(12)

where, from Remark 1, the DFT is evaluated at the discrete frequencies $k = 0, \dots, n-1$.

Remark 3.

From Proposition 1, we know that the frequencies $n-k$ carry the same information as the frequencies $k$. For this reason, we consider only the first frequencies $k = 0, \dots, K$, where $K$ denotes the greatest integer such that $K \le n/2$. Moreover, the $k = 0$ term in equation (12) is a constant, since $\hat\mu_0 = \hat\nu_0 = 1$ for probability measures, and thus we omit it from the definition of our loss function.

3.2 The Fourier Loss Function

We now introduce the Fourier loss function.

Definition 4.

Given a discrete probability measure $\nu$ (the target), we define the Fourier loss function as

(13)

Notice that the weighting term in (13) penalises the differences between the higher frequencies.
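Since the explicit weights of equation (13) are not reproduced here, the following PyTorch sketch only mirrors the structure of the definition: it compares the DFTs of $\mu$ and $\nu$ on the frequencies $k = 1, \dots, \lfloor n/2 \rfloor$ (via torch.fft.rfft, dropping the constant $k = 0$ term as in Remark 3) through a weighted sum of squared moduli. The weighting vector is a placeholder to be filled with the term of (13); the function name is ours.

    import torch

    def fourier_loss(mu: torch.Tensor, nu: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # DFT of real inputs, restricted to the frequencies k = 0, ..., floor(n/2)
        mu_hat = torch.fft.rfft(mu, dim=-1)
        nu_hat = torch.fft.rfft(nu, dim=-1)
        diff = (mu_hat - nu_hat)[..., 1:]          # drop the constant k = 0 term (Remark 3)
        # weighted squared moduli of the frequency differences; `weights` is a placeholder
        return (weights * diff.abs() ** 2).sum(dim=-1).mean()

Because the torch.fft operations are differentiable, the gradient of such a loss is obtained directly by autograd, consistently with the implementation notes in Section 4.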

Proposition 2.

The function defined in (13) is a distance on $\mathcal{P}(X_n)$. Moreover, it is equivalent to the Wasserstein distance.

The proof is reported in the additional material.

Proposition 3.

For any given probability measure $\nu$, the Fourier loss function can be differentiated twice with respect to $\mu$. Moreover, its gradient and Hessian matrix are expressed through the explicit formulae

(14)

and

(15)

where the vector appearing in (14) is the Fourier Transform of an auxiliary vector, whose definition depends on whether $n$ is odd or even (see the additional material).

The proof is reported in the additional material.

We observe that the Hessian matrix depends on neither $\mu$ nor $\nu$. Moreover, it is a symmetric circulant matrix. The eigenvalues of this class of matrices can be explicitly computed [43]. The following holds.

Proposition 4.

The eigenvalues of the Hessian matrix can be computed explicitly and are all strictly positive. Therefore, the Hessian is positive definite.

For the complete proof, see the additional material.
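As a side illustration of the circulant-matrix fact invoked above (a generic example, not the Hessian of the Fourier loss, whose entries are not reproduced here): the eigenvalues of a circulant matrix are the DFT of its first row [43].

    import numpy as np

    c = np.array([2.0, -1.0, 0.0, -1.0])                   # an arbitrary symmetric first row
    C = np.stack([np.roll(c, k) for k in range(len(c))])   # the circulant matrix built from c
    print(np.sort(np.linalg.eigvals(C).real))              # [0. 2. 2. 4.]
    print(np.sort(np.fft.fft(c).real))                     # the same values, via the DFT of c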

Example 2.

In multi-class classification problems, each input has to be correctly labelled with one of the possible classes. If for any given input there is only a single correct label, the measure we want to learn is a Dirac delta whose mass is concentrated on the correct class. For the sake of simplicity, we fix the number of classes and the position of the correct label. It is reasonable to assume that, before the learning process, we do not have any bias about how to associate an input with a label. Therefore, we regard each label as equally probable, and hence our starting measure is the uniform measure.

Figure 3 reports a plot of the components of the gradient. We observe positive values around the correct class, where the maximum is located; the values get smaller as we move further from the correct class. Therefore, the gradient descent method leads to a measure concentrated around the correct label.

Figure 4 shows the behaviour of the gradient when we have two and three equally probable correct classes. We observe that the gradient assumes its local maxima on the correct labels. In particular, when we have three correct labels, the central maximum is global: indeed, the central correct label benefits from being closer to the other two correct labels.

Figure 3: Visualisation of the gradient of the Fourier loss with a single correct class.
Figure 4: Visualisation of the gradient with two and three equally probable correct classes.
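A plot of this kind can be reproduced with the fourier_loss sketch of Section 3.2 and PyTorch autograd (our own illustration; the number of classes, the position of the correct label, and the uniform placeholder weights are arbitrary choices):

    import torch

    n = 10
    weights = torch.ones(n // 2)                          # placeholder for the weighting term of (13)

    mu = torch.full((n,), 1.0 / n, requires_grad=True)    # uniform starting measure
    nu = torch.zeros(n)
    nu[4] = 1.0                                           # Dirac delta on the correct class

    loss = fourier_loss(mu, nu, weights)                  # the sketch from Section 3.2
    loss.backward()                                       # gradient computed by autograd
    print(mu.grad)                                        # the components plotted in Figure 3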

To conclude this section, we show how the minimisation of the Fourier loss is related to the maximum likelihood estimator in classification models with random noise.

3.3 Probabilistic interpretation

In Machine Learning, we often assume the existence of an underlying probabilistic model that generates the data [4, 36]. This model is typically expressed as

    $y_i = f_\theta(x_i) + \varepsilon_i,$    (16)

where the $y_i$ are the data, the $\varepsilon_i$ are i.i.d. random noises, $f_\theta$ is a function that specifies the model structure, and $\theta$ is the parameter that has to be optimised.

Let us suppose that, for every $i$, the DFT of the noise $\varepsilon_i$ follows a circularly-symmetric complex normal distribution with zero mean and a diagonal covariance matrix, so that each frequency has its own variance. For a complete discussion on complex normal distributions, we refer to [44].

The likelihood of the observations is then given by

(17)

provided that the mass of $f_\theta(x_i)$ is the same as the mass of $y_i$, for every $i$; the multiplicative constant appearing in (17) is positive and does not depend on the data.

By taking the logarithm in (17), we obtain the following result.

Theorem 1.

Let us consider a model of the form (16), where the $y_i$ are the data and the noises are distributed as described above. Then, the value of $\theta$ maximising the likelihood of the data is the one minimising the Fourier loss.

Notice that the structure of the covariance matrix measures how the error on each frequency is weighted. As the variance of a frequency grows, we are more willing to accept discrepancies between the real value of that frequency and the predicted one. In particular, on the null frequency we have a null-variance Gaussian (i.e. a Dirac delta); therefore the model does not admit any error on the null frequency: $y_i$ and $f_\theta(x_i)$ must have the same mass.
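For intuition, here is a sketch of the computation behind Theorem 1, under the assumptions stated above (diagonal covariance with per-frequency variances $\sigma_k^2$; our notation). Since the density of a circularly-symmetric complex Gaussian is proportional to $\exp(-|z|^2 / \sigma^2)$, the negative log-likelihood of the observations is, up to an additive constant,

    $-\log \mathcal{L}(\theta) \;=\; C \,+\, \sum_i \sum_{k \ge 1} \frac{\bigl|\, \hat{y}_{i,k} - \widehat{f_\theta(x_i)}_k \,\bigr|^2}{\sigma_k^2},$

where the $k = 0$ frequency is fixed by the equal-mass constraint. Maximising the likelihood therefore amounts to minimising a per-frequency weighted sum of squared DFT differences, which is precisely the structure of the Fourier loss.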

Figure 5: Accuracy on the CIFAR10 dataset as a function of the training epochs. (a) ResNet34; (b) ResNet50.

4 Numerical Results

In this section, we present the numerical results obtained implementing the Fourier loss, introduced in Definition 4. The aim of our tests is to validate the Fourier loss as a reliable tool in machine learning. We compare our loss function with the ones induced by the Kullback-Leibler and the Total Variation, introduced in Section 2. We do not use the Wasserstein distance since the computation of its gradient is a challenging task [16].

We perform a multi-class classification on MNIST [45], Fashion-MNIST [46] and CIFAR10 [47] using neural networks. In particular, we test the accuracy and the robustness to random noise.

For MNIST and Fashion-MNIST, we follow an approach similar to [12]. For the first dataset we use a fully connected neural network, with 4 hidden layers of 512 neurons each and the ReLU activation function. For the second one we use a neural network with 2 convolutional layers followed by 3 fully connected layers, with the hyperbolic tangent activation function. For these two experiments we train our neural networks for 20 epochs using Adam [48] with a fixed learning rate.

For CIFAR10, we use ResNet [6], a standard architecture for classification tasks in computer vision. We implement the neural network in two different sizes: ResNet34 and ResNet50. We train the former for 20 epochs and the latter for 50 epochs, using SGD with momentum and weight decay. In order to make a fair comparison, we choose different learning rates: one for the Fourier loss and one for the Total Variation and the Kullback-Leibler losses.

For all the datasets, during training, we use a fixed batch size. No preprocessing of the data is performed. All the experiments are run on a MacBook Pro (13-inch, 2017) with a 2.5 GHz dual-core Intel Core i7 and 16 GB of RAM. All of our code is implemented in Python (v3.8.5) using PyTorch (v1.7.1) [49]. The Fourier loss function does not need any ad-hoc implementation: for the backpropagation, we rely on PyTorch's built-in autograd tool. The runtime for training the neural networks is the same for all the loss functions.
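To make the setup concrete, here is a minimal training-step sketch under our assumptions (the model, optimizer, and weights objects are hypothetical placeholders; fourier_loss is the sketch from Section 3.2, and the one-hot label plays the role of the target Dirac delta):

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, images, labels, weights, num_classes=10):
        optimizer.zero_grad()
        mu = F.softmax(model(images), dim=-1)            # predicted measure over the classes
        nu = F.one_hot(labels, num_classes).float()      # target Dirac delta on the correct class
        loss = fourier_loss(mu, nu, weights)             # Fourier loss sketch from Section 3.2
        loss.backward()                                  # gradients via PyTorch autograd
        optimizer.step()
        return loss.item()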

We next present the results of our tests.

Accuracy.

Figure 5 shows the accuracy of ResNet34 and ResNet50 on CIFAR10 as a function of the epochs. In both cases, the neural networks trained with the Fourier and the Kullback-Leibler loss functions show a similar behaviour, while the networks trained with the Total Variation loss function show a slower rate of learning. As pointed out in [12], this slower rate of learning is due to the fact that the gradient of the Total Variation returns little information.

Figure 6: MNIST and Fashion-MNIST test images with Gaussian perturbation. (a) MNIST; (b) Fashion-MNIST.
Figure 7: CIFAR10 test images with Gaussian perturbation.
Figure 8: Accuracy on noise-perturbed images. (a) MNIST; (b) Fashion-MNIST; (c) CIFAR10, ResNet34; (d) CIFAR10, ResNet50.

Robustness.

For each dataset, we test the robustness of the neural networks, trained with the Fourier, the Kullback-Leibler, and the Total Variation loss functions, against random noise. After training the neural networks, we add independent Gaussian noise to each pixel of the images in the test set, with the standard deviation of the noise ranging from 0 to 0.30 (cf. Table 1), as shown in Figures 6 and 7.
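A minimal sketch of this perturbation (our own illustration; clipping to $[0, 1]$ is an assumption about the pixel range, not stated in the paper):

    import torch

    def perturb(images: torch.Tensor, sigma: float) -> torch.Tensor:
        # add i.i.d. Gaussian noise with standard deviation sigma to every pixel
        noisy = images + sigma * torch.randn_like(images)
        return noisy.clamp(0.0, 1.0)   # assumption: pixel intensities lie in [0, 1]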

Figures 8.(a) and 8.(b) show the accuracy of the networks on MNIST and Fashion-MNIST as a function of the noise level. Figures 8.(c) and 8.(d) show the accuracy of ResNet34 and ResNet50 on CIFAR10 as a function of the noise level. For all the datasets, we observe that, when trained with the Fourier loss function, the networks achieve a higher accuracy on the perturbed data.

The results of our experiments on CIFAR10 are also detailed in Tables 1.(a) and 1.(b). For each noise level, we compute the ratio between the accuracy obtained with each loss function and the best accuracy, and we report the ratios in brackets. We observe that, as the noise grows, the Fourier loss function shows an increasing advantage in accuracy with respect to the other loss functions. This means that the network trained with the Fourier loss generalises better and is likely less prone to overfitting.

We believe that this difference in performance is due to the different topology induced by the Fourier-based metric. The topology has a major role when we look for minimising sequences, as noticed in Section 2.


(a) ResNet34
σ      KL            TV            Fourier
0.00   0.839 (1.00)  0.803 (0.95)  0.820 (0.97)
0.05   0.738 (1.00)  0.729 (0.98)  0.718 (0.97)
0.10   0.516 (0.93)  0.507 (0.91)  0.554 (1.00)
0.15   0.329 (0.77)  0.338 (0.79)  0.422 (1.00)
0.20   0.215 (0.64)  0.248 (0.73)  0.338 (1.00)
0.25   0.165 (0.58)  0.207 (0.72)  0.284 (1.00)
0.30   0.136 (0.56)  0.185 (0.77)  0.242 (1.00)

(b) ResNet50
σ      KL            TV            Fourier
0.00   0.824 (1.00)  0.809 (0.98)  0.820 (0.99)
0.05   0.676 (0.90)  0.710 (0.95)  0.747 (1.00)
0.10   0.366 (0.62)  0.451 (0.77)  0.583 (1.00)
0.15   0.189 (0.44)  0.263 (0.61)  0.429 (1.00)
0.20   0.128 (0.39)  0.189 (0.58)  0.322 (1.00)
0.25   0.110 (0.44)  0.156 (0.62)  0.247 (1.00)
0.30   0.106 (0.52)  0.138 (0.67)  0.204 (1.00)

Table 1: Accuracy on CIFAR10 as a function of the noise standard deviation σ. In brackets, we report the ratio to the best accuracy in each row (a ratio of 1.00 marks the best loss function).

5 Conclusions and Future Works

We have introduced the Fourier loss function, which has a strong theoretical motivation. It can be computed efficiently using the Fast Fourier Transform and its gradient has an explicit formula. Moreover, we have provided a justification of this loss function in terms of the maximisation of the likelihood under a Gaussian noise in the space of frequencies.

The numerical results confirm the validity of the Fourier loss function when applied to multi-class classification problems. In particular, the neural networks trained with our loss function show better robustness to random noise and thus a greater generalisation ability.

We believe that several future directions of research are possible. From the theoretical point of view, we want to generalise our loss function in order to make it equivalent to the Wasserstein distance induced by any ground metric. Moreover, we plan to extend the Fourier loss to unnormalised positive measures. Finally, the gradient of the Fourier loss function, shown in Figure 4, suggests the development of specific models for multi-class classification tasks that drive misclassifications towards classes that are close to the true one.

References

  • [1] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
  • [2] Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering, 160(1):3–24, 2007.
  • [3] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
  • [4] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning. MIT press, 2017.
  • [5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [7] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114, 2019.
  • [8] Michael Vogt. An overview of deep learning and its applications. Fahrerassistenzsysteme 2018, pages 178–202, 2019.
  • [9] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [10] Dan Cireşan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. A committee of neural networks for traffic sign classification. In The 2011 international joint conference on neural networks, pages 1918–1921. IEEE, 2011.
  • [11] Dan Cireşan, Alessandro Giusti, Luca Gambardella, and Jürgen Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. Advances in neural information processing systems, 25:2843–2851, 2012.
  • [12] Katarzyna Janocha and Wojciech Marian Czarnecki. On Loss Functions for Deep Neural Networks in Classification. Schedae Informaticae, 25:49–59, 2016.
  • [13] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • [14] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015.
  • [15] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [16] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, and Tomaso Poggio. Learning with a wasserstein loss. In Proceedings of the 28th International Conference on Neural Information Processing Systems, volume 2, pages 2053–2061, 2015.
  • [17] Yuzhuo Han, Xiaofeng Liu, Zhenfei Sheng, Yutao Ren, Xu Han, Jane You, Risheng Liu, and Zhongxuan Luo. Wasserstein loss-based deep object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 998–999, 2020.
  • [18] Xiaofeng Liu, Yang Zou, Tong Che, Peng Ding, Ping Jia, Jane You, and BVK Kumar. Conservative Wasserstein training for pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8262–8272, 2019.
  • [19] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 214–223, 2017.
  • [20] Jonas Adler and Sebastian Lunz. Banach Wasserstein GAN. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 6755–6764, 2018.
  • [21] I Tolstikhin, O Bousquet, S Gelly, and B Schölkopf. Wasserstein auto-encoders. In 6th International Conference on Learning Representations, 2018.
  • [22] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5769–5779, 2017.
  • [23] Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3483–3491, 2018.
  • [24] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [25] Federico Bassetti, Stefano Gualandi, and Marco Veneroni. On the computation of Kantorovich-Wasserstein distances between two-dimensional histograms by uncapacitated minimum cost flows. SIAM Journal on Optimization, 30(3):2441–2469, 2020.
  • [26] Julien Rabin and Gabriel Peyré. Wasserstein regularization of imaging problem. In 18th IEEE International Conference on Image Processing, pages 1541–1544, 2011.
  • [27] Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 2292–2300, 2013.
  • [28] Laurent Risser, Quentin Vincenot, Nicolas Couellan, and Jean-Michel Loubes. Using Wasserstein-2 regularization to ensure fair decisions with neural-network classifiers. arXiv preprint arXiv:1908.05783, 2019.
  • [29] Henning Petzka, Asja Fischer, and Denis Lukovnikov. On the regularization of Wasserstein GANs. In International Conference on Learning Representations, 2018.
  • [30] Gennaro Auricchio, Andrea Codegoni, Stefano Gualandi, Giuseppe Toscani, and Marco Veneroni. The equivalence of Fourier-based and Wasserstein metrics on imaging problems. Rendiconti Lincei - Matematica e Applicazioni, 31:627–649, 2020.
  • [31] G. Gabetta, G. Toscani, and B. Wennberg. Metrics for probability distributions and the trend to equilibrium for solutions of the Boltzmann equation. Journal of Statistical Physics, 81(5):901–934, 1995.
  • [32] J.A. Carrillo and G. Toscani. Contractive probability metrics and asymptotic behavior of dissipative kinetic equations. Rivista Matematica Università di Parma, 7(6):75–198, 2007.
  • [33] Ludwig Baringhaus and Rudolf Grübel. On a class of characterization problems for random convex combinations. Annals of the Institute of Statistical Mathematics, 49(3):555–567, 1997.
  • [34] Erhan Çınlar. Probability and stochastics. Springer Science & Business Media, 2011.
  • [35] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • [36] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
  • [37] Gennaro Auricchio, Stefano Gualandi, Marco Veneroni, and Federico Bassetti. Computing Kantorovich-Wasserstein distances on d-dimensional histograms using (d+1)-partite graphs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5798–5808, 2018.
  • [38] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617, 2018.
  • [39] O. Pele and M. Werman. Fast and robust Earth Mover’s Distances. In IEEE 12th International Conference on Computer Vision, pages 460–467, 2009.
  • [40] L.J. Guibas, Y. Rubner, and C. Tomasi. The Earth Mover's Distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
  • [41] N. Bonneel and D. Coeurjolly. SPOT: Sliced Partial Optimal Transport. ACM Transactions on Graphics, 38(4):1–13, 2019.
  • [42] Kamisetty Ramam Rao and Patrick C Yip. The transform and data compression handbook. CRC press, 2018.
  • [43] Philip J Davis. Circulant matrices. Wiley, 1979.
  • [44] Nathaniel R Goodman. Statistical analysis based on a certain multivariate complex Gaussian distribution (an introduction). The Annals of Mathematical Statistics, 34(1):152–177, 1963.
  • [45] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • [46] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [47] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
  • [48] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (Poster), 2015.
  • [49] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.

Additional Material

Proof of Proposition 1.

By definition of the DFT, we have

    $\hat\mu_{n-k} = \sum_{j=0}^{n-1} \mu_j \, e^{-\frac{2\pi i}{n} j(n-k)} = \sum_{j=0}^{n-1} \mu_j \, e^{\frac{2\pi i}{n} jk} = \overline{\hat\mu_k},$

since $e^{-2\pi i j} = 1$ for each $j$. ∎

Proof of Proposition 2.

Clearly the function is symmetric and satisfies the triangle inequality.

Then, it suffices to prove that it vanishes if and only if $\mu = \nu$.

If $\mu = \nu$, then $\hat\mu = \hat\nu$ and, therefore, the loss is zero.

Let us now suppose that the loss is zero. Since it is defined as a sum of non-negative terms, each term of the sum vanishes, hence $\hat\mu_k = \hat\nu_k$ for every frequency appearing in the sum.

From Proposition 1, we deduce that the remaining frequencies coincide as well, hence $\hat\mu_k = \hat\nu_k$ for every $k \ne 0$. Moreover, since $\mu$ and $\nu$ are probability measures, we have $\hat\mu_0 = \hat\nu_0 = 1$, therefore $\hat\mu = \hat\nu$, and, from Remark 2, we deduce $\mu = \nu$.

For the proof of the equivalence between the Fourier loss and the Wasserstein distance, we refer to [30]. ∎

Proof of Proposition 3.

From a direct computation, we find

Therefore

(18)

where the last inequality holds since $\mu$ and $\nu$ are probability measures and hence $\hat\mu_0 = \hat\nu_0 = 1$. For $n$ odd, let us consider

the DFT of is then

hence

and, therefore

We recall that for any .

When $n$ is even, we define the vector

and similarly to the previous case, we have

Finally, from a further derivative, we find

which concludes the proof.

Proof of Proposition 4.

Let $\mathcal{F}$ be defined as in Remark 2. This matrix is symmetric and satisfies the identity

    $\mathcal{F} \, \overline{\mathcal{F}} = n I,$

where $\overline{\mathcal{F}}$ is the conjugate of $\mathcal{F}$ and $I$ is the identity matrix [42]. We set

where $e_j$ denotes the $j$-th vector of the canonical Euclidean basis of $\mathbb{R}^n$. We have that

where is defined as

Since is symmetric and , we have that

therefore . Then

Therefore, each such vector is an eigenvector of the Hessian, and its eigenvalue is strictly positive. Since this holds for every eigenvector, the Hessian is positive definite. ∎