# Designing GANs: A Likelihood Ratio Approach

We are interested in the design of generative adversarial networks. The training of these mathematical structures requires the definition of proper min-max optimization problems. We propose a simple methodology for constructing such problems assuring, at the same time, that they provide the correct answer. We give characteristic examples developed by our method, some of which can be recognized from other applications and some introduced for the first time. We compare various possibilities by applying them to well known datasets using neural networks of different configurations and sizes.

## 1 Introduction

The problem we are interested in can be summarized as follows: We are given two collections of training data $\{Z_i\}$ and $\{X_i\}$. In the first set the samples follow the origin probability density $h$ and in the second the target density $f$. The target density $f$ is considered unknown, while $h$ can either be known, with the possibility to produce samples $Z$ every time it is necessary, or unknown, in which case we have a second fixed training set $\{Z_i\}$. Our goal is to design a deterministic transformation $G(Z)$ so that the data $Y_i=G(Z_i)$ produced by applying the transformation onto $\{Z_i\}$ follow the target density $f$.

Of course one may wonder whether the proposed problem admits any solution, namely, whether there indeed exists a transformation $G(Z)$ capable of transforming $Z$ into $X$, with the former following the origin density $h$ and the latter the target density $f$. The problem of transforming random vectors has been analyzed by (Box & Cox, 1964) where existence is shown under general conditions. Computing the actual transformation, however, is a completely different challenge, with one of the possible solutions relying on adversarial approaches applied to neural networks.

The best-known use of this result is, clearly, the possibility to generate synthetic data that follow the unknown target density $f$. In this case $h$ is selected to be simple (e.g. i.i.d. standard Gaussian or i.i.d. uniform) so that generating realizations from $h$ is straightforward. As mentioned, the adversarial approach can be applied even if the origin density is unknown, provided that we have a dataset $\{Z_i\}$ with data following the origin density. When, however, $h$ is known and we can generate any number of realizations $Z$, the adversarial approach is expected to identify the transformation with higher accuracy, due to the possibility of generating more training data.

It was (Goodfellow et al., 2016) that first introduced the idea of adversarial (min-max) optimization and demonstrated that it results in the determination of the desired transformation $G(Z)$. Alternative adversarial approaches by (Arjovsky et al., 2017; Binkowski et al., 2018) were subsequently suggested and shown to also deliver the correct transformation $G(Z)$. Finally, we must mention the work by (Nowozin et al., 2016), which is closely related to our results and for which, at the end of Section 2, we give details, emphasizing differences and similarities with our method. As in (Nowozin et al., 2016), we will show that our method provides an abundance of adversarial problems that are capable of identifying the appropriate transformation $G(Z)$. Furthermore, we will also provide a simple recipe for successfully constructing such problems.

Arguing along the same lines as the existing min-max formulations, we would like to optimally specify a vector transformation $G(Z)$, the generator, and a scalar function $D(X)$, the discriminator. To achieve this, for each combination $\{G(Z),D(X)\}$ we define the cost function

$$J(G,D)=\mathbb{E}_f[\phi(D(X))]+\mathbb{E}_h[\psi(D(G(Z)))] \tag{1}$$

where $\phi(z),\psi(z)$ are two scalar functions of the scalar $z$ and $\mathbb{E}_f,\mathbb{E}_h$ denote expectation with respect to the densities $f,h$ respectively. The optimum generator/discriminator combination is then identified by solving the following min-max problem

$$\min_{G}\max_{D}J(G,D). \tag{2}$$

We must point out that our goal is not to solve (2), but rather to find a class of functions $\phi(z),\psi(z)$ so that the transformation $G(Z)$ that comes out of the solution of (2) is such that $Y=G(Z)$ follows the target density $f$ when $Z$ follows the origin density $h$.

If $Z$ is random following $h$ then $Y=G(Z)$ is also random and we denote with $g$ its corresponding probability density. Clearly, there exists a correspondence between transformations $G(Z)$ and densities $g$ when the density $h$ of $Z$ is fixed. Since we can write

$$\mathbb{E}_h[\psi(D(G(Z)))]=\mathbb{E}_g[\psi(D(Y))],$$

this allows us to argue that the min-max problem in (2) is equivalent to

$$\min_{g}\max_{D}\big\{\mathbb{E}_f[\phi(D(X))]+\mathbb{E}_g[\psi(D(Y))]\big\}. \tag{3}$$

It is now possible to combine the two expectations by applying a change of measure and a change of variables and equivalently write (3) as follows

$$\min_{g}\max_{D}\mathbb{E}_f[\phi(D(X))+r(X)\psi(D(X))], \tag{4}$$

where $r(X)=g(X)/f(X)$ denotes the corresponding likelihood ratio. Since $f$ is also fixed, there is again a correspondence between $g$ and $r$, hence the previous min-max problem becomes equivalent to

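$$\min_{r(X)\in\mathcal{R}_f}\max_{D(X)}\mathbb{E}_f[\phi(D(X))+r(X)\psi(D(X))]. \tag{5}$$

Here $\mathcal{R}_f$ denotes the class of all likelihood ratios with respect to the density $f$, namely, all the functions $r(X)$ that satisfy

$$\mathcal{R}_f=\Big\{r(X):\ r(X)\ge 0,\ \int r(X)f(X)\,dX=1\Big\}. \tag{6}$$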
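The change of measure that merges the two expectations, and the normalization appearing in (6), can be written out explicitly. The following is a short sketch that assumes the generator output density $g$ vanishes wherever $f$ does (an assumption not stated in the text), so that the ratio $r=g/f$ is well defined:

$$\mathbb{E}_g[\psi(D(Y))]=\int\psi(D(x))\,g(x)\,dx=\int\psi(D(x))\,\frac{g(x)}{f(x)}\,f(x)\,dx=\mathbb{E}_f[r(X)\psi(D(X))],$$

$$\int r(x)f(x)\,dx=\int g(x)\,dx=1.$$

The first line is what allows the generator term to be expressed under the target density, and the second shows why every density $g$ maps to a member of the class of likelihood ratios.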

Using these definitions, let us define the cost

$$J(r,D)=\mathbb{E}_f[\phi(D(X))+r(X)\psi(D(X))] \tag{7}$$

and, according to (5), we are interested in the following min-max problem

$$\min_{r(X)\in\mathcal{R}_f}\max_{D(X)}J(r,D). \tag{8}$$

As mentioned, our actual goal is not to solve the adversarial problem. Instead, we would like to properly identify pairs of functions $\phi(z),\psi(z)$ so that (8) accepts as solution the function $r(X)=1$. Indeed, if $r(X)=1$ is the solution to (8), this means that $g=f$ is the solution to (3) and, finally, that the optimum $G(Z)$ obtained from (2) is such that $Y=G(Z)$ follows $f$ which, of course, is our original objective. Even though the min-max problem in (2) is what we attempt to solve, it is through (8) that we understand what its solution entails. In the next section we focus on (7), (8) and propose a simple design method (recipe) for the two functions $\phi(z),\psi(z)$ that assures that the solution of (8) is indeed $r(X)=1$.

## 2 A Class of Functions $\phi(z),\psi(z)$

Suppose that $\omega(r)$ is a strictly increasing and (left and right) differentiable scalar function of the nonnegative scalar $r$, i.e. $r\ge 0$. Denote with $\mathcal{J}_\omega$ the range of values of $\omega(r)$ and let $\omega^{-1}(z)$ be the inverse function of $\omega(r)$, which is defined for $z\in\mathcal{J}_\omega$. Let $\rho(z)$ be a positive scalar function also defined for $z\in\mathcal{J}_\omega$; then, using $\omega^{-1}(z)$ and $\rho(z)$, we propose the following pair

$$\phi'(z)=-\omega^{-1}(z)\rho(z),\qquad \psi'(z)=\rho(z), \tag{9}$$

where “$'$” denotes derivative. Since $\omega(r)$ and $\rho(z)$ are arbitrary (provided they satisfy the strict increase and positivity constraints respectively), the class of pairs defined by (9) is very rich, allowing for a multitude of choices. We show next that any such pair gives rise to a min-max problem, as in (8), that accepts $r(X)=1$ as its unique solution. We prove this claim in two steps. The first involves a theorem where we consider a simplified version of the min-max problem.

###### Theorem 1

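Let $\phi(z)$ and $\psi(z)$ be defined as above with the additional constraint $\psi(\omega(1))=0$. Fix $r\ge 0$ and consider $\phi(D)+r\psi(D)$ as a function of the scalar $D$. Then, for any $D\in\mathcal{J}_\omega$, we have that

$$\phi(D)+r\psi(D)\le\phi(\omega(r))+r\psi(\omega(r)), \tag{10}$$

with equality if and only if $D=\omega(r)$.

Consider next the minimization with respect to $r$ of the maximal value in (10). It is true that

$$\min_{r\ge 0}\{\phi(\omega(r))+r\psi(\omega(r))\}=\phi(\omega(1)), \tag{11}$$

with the minimum attained if and only if $r=1$.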

Proof  We note that the constraint $\psi(\omega(1))=0$ does not affect the generality of our class of functions since from (9) we have that $\psi(z)$, after integration, is defined up to an arbitrary additive constant. We can always select this constant so that the constraint is satisfied. We would also like to emphasize that this constraint is needed only for the proof of this theorem and it is not necessary for the corresponding min-max problem defined in (8).

For fixed $r$, to find the maximum of $\phi(D)+r\psi(D)$ we consider the derivative with respect to $D$ which, using (9), takes the form

$$\phi'(D)+r\psi'(D)=\big(r-\omega^{-1}(D)\big)\rho(D).$$

The strict increase of $\omega(r)$ is inherited by its inverse function $\omega^{-1}(z)$ which, combined with the positivity of $\rho(z)$, implies that the previous expression has the same sign as $r-\omega^{-1}(D)$ or $\omega(r)-D$. Consequently $D=\omega(r)$ is the only critical point of $\phi(D)+r\psi(D)$, and it is a global maximum. Of course there are possibilities for extrema at the two end points of $\mathcal{J}_\omega$ but they can only be (local) minima.

Let us now focus on the resulting function $\phi(\omega(r))+r\psi(\omega(r))$. Taking its derivative with respect to $r$ yields

$$\frac{d}{dr}\{\phi(\omega(r))+r\psi(\omega(r))\}=\{\phi'(\omega(r))+r\psi'(\omega(r))\}\omega'(r)+\psi(\omega(r))=\psi(\omega(r)),$$

where the last equality is due to the specific definition of the two functions in (9). Since $\psi'(z)=\rho(z)>0$, this implies that $\psi(z)$ is strictly increasing; being also the integral of $\rho(z)$, it is continuous in $z$. If we combine this property with the strict increase and continuity (as a result of left and right differentiability) of $\omega(r)$ we conclude that $\psi(\omega(r))$ is also strictly increasing and continuous in $r$. We recall that $\psi(z)$ is selected to satisfy $\psi(\omega(1))=0$, consequently for $r=1$ the function $\phi(\omega(r))+r\psi(\omega(r))$ has a unique minimum which is global and no other critical points. Of course it can still exhibit extrema at $r=0$ and/or $r\to\infty$ but they can only be (local) maxima.
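The two claims of Theorem 1 can be checked numerically for a concrete pair. The sketch below is only an illustration under assumed choices that do not come from the text: $\omega(r)=\log r$ and $\rho(z)=e^{-z/2}$, for which (9) integrates to $\phi(z)=-2e^{z/2}$ and $\psi(z)=2-2e^{-z/2}$ (the constant in $\psi$ is picked so that $\psi(\omega(1))=0$). A grid search stands in for the exact maximization and minimization.

```python
import numpy as np

# Illustrative (assumed) choice: omega(r) = log r and rho(z) = exp(-z/2).
# Rule (9) then gives phi'(z) = -exp(z) * rho(z) = -exp(z/2), psi'(z) = exp(-z/2).
# Integrating, with psi's constant fixed so that psi(omega(1)) = psi(0) = 0:

def omega(r):
    return np.log(r)

def phi(z):
    return -2.0 * np.exp(z / 2.0)

def psi(z):
    return 2.0 - 2.0 * np.exp(-z / 2.0)

def inner(D, r):
    """phi(D) + r * psi(D), the quantity maximized over D in (10)."""
    return phi(D) + r * psi(D)

def argmax_D(r, D_grid):
    """Grid search for the maximizing D at a fixed r."""
    return D_grid[np.argmax(inner(D_grid, r))]

def outer(r):
    """Maximal value phi(omega(r)) + r * psi(omega(r)), minimized over r in (11)."""
    return inner(omega(r), r)

D_grid = np.linspace(-5.0, 5.0, 100001)
r_grid = np.linspace(0.01, 5.0, 500)

for r in (0.2, 1.0, 3.0):
    print(f"r={r}: grid argmax D={argmax_D(r, D_grid):.4f}, omega(r)={omega(r):.4f}")

r_star = r_grid[np.argmin(outer(r_grid))]
print(f"grid argmin r={r_star:.3f}, min value={outer(r_grid).min():.4f}, "
      f"phi(omega(1))={phi(0.0):.4f}")
```

For this particular pair the maximal value has the closed form $2r-4\sqrt{r}$, whose derivative $2-2/\sqrt{r}$ coincides with $\psi(\omega(r))$, in agreement with the derivative computed in the proof.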

A consequence of Theorem 1 is the next corollary, which constitutes the second and final step in proving that the adversarial problem defined in (8) has as unique solution the function $r(X)=1$.

###### Corollary 1

If the functions $\phi(z),\psi(z)$ satisfy (9) and $\omega(r)$ is strictly increasing and left and right differentiable, then in the adversarial problem defined in (8) the maximizer is $D(X)=\omega(r(X))$ and the minimizer is $r(X)=1$, while the resulting min-max value is equal to

$$\min_{r(X)\in\mathcal{R}_f}\max_{D(X)}\mathbb{E}_f[\phi(D(X))+r(X)\psi(D(X))]=\phi(\omega(1))+\psi(\omega(1)). \tag{12}$$

Proof  The proof is simple. First we observe that

$$\mathbb{E}_f[\phi(D(X))+r(X)\psi(D(X))]=\mathbb{E}_f[\phi(D(X))+r(X)\tilde\psi(D(X))]+\psi(\omega(1)) \tag{13}$$

with the equality being true since $\mathbb{E}_f[r(X)]=1$ and where $\tilde\psi(z)=\psi(z)-\psi(\omega(1))$. We start with the maximization problem. Since $D(X)$ is a function of $X$ we have

$$\max_{D(X)}\mathbb{E}_f[\phi(D(X))+r(X)\tilde\psi(D(X))]=\mathbb{E}_f\Big[\max_{D(X)}\{\phi(D(X))+r(X)\tilde\psi(D(X))\}\Big]. \tag{14}$$

The maximization under the expectation can be performed for each fixed $X$. However, when we fix $X$ then $r(X)$ becomes a constant $r$ and the result of the maximization depends only on the actual value of $r$. This suggests that we can limit ourselves to functions of the form $D(r(X))$. After this observation we can drop the dependence on $X$ and perform, equivalently, the maximization

$$\max_{D}\{\phi(D(r))+r\tilde\psi(D(r))\}$$

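for each fixed $r$. The pair $\phi(z),\tilde\psi(z)$ satisfies the assumptions of Theorem 1, therefore maximization is achieved for $D(r)=\omega(r)$. This implies that

$$\max_{D(X)}\mathbb{E}_f[\phi(D(X))+r(X)\tilde\psi(D(X))]=\mathbb{E}_f[\phi(\omega(r(X)))+r(X)\tilde\psi(\omega(r(X)))].$$

We can now continue in a similar way for the minimization problem. Specifically

$$\min_{r(X)\in\mathcal{R}_f}\mathbb{E}_f[\phi(\omega(r(X)))+r(X)\tilde\psi(\omega(r(X)))]\ \ge\ \mathbb{E}_f\Big[\min_{r\ge 0}\{\phi(\omega(r))+r\tilde\psi(\omega(r))\}\Big]=\phi(\omega(1)),$$

with the inequality being true since the minimization under the expectation is unconstrained and the last equality being a consequence of Theorem 1. The final lower bound is clearly attained by $r(X)=1$, which is also a legitimate solution of the constrained minimization, since $r(X)=1$ belongs to the class $\mathcal{R}_f$ of likelihood ratios. Consequently $r(X)=1$ is the solution to the min-max problem. Returning to the original min-max setup with $\psi(z)$ replacing $\tilde\psi(z)$, we can clearly see that it satisfies (12). This completes the proof.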

###### Remark 1

The adversarial problem is defined with the help of the two functions $\phi(z),\psi(z)$ which, according to (9), can be obtained by integrating the corresponding derivatives. However, this integration might not always be possible analytically. As we will have the chance to confirm in Section 4, in an actual optimization algorithm (e.g. of gradient type) that solves (2), the exact form of $\phi(z),\psi(z)$ is not necessary. Instead, what is required is their derivatives, which are analytically available from (9).

We must emphasize that there already exists the significant work by (Nowozin et al., 2016) that addresses a problem similar to that of our current work, namely the definition of a class of min-max optimizations that can be used to design the generator/discriminator pair. The class in (Nowozin et al., 2016) is defined in terms of a convex function which can be shown to correspond to the outcome of our maximization, namely the function $\phi(\omega(r))+r\psi(\omega(r))$. This establishes a one-to-one correspondence between the two methods under the ideal (non data-driven) setup. However, we believe that our approach enjoys certain significant advantages:

First, the definition of the two functions $\phi(z),\psi(z)$ in Eq. (9) is straightforward, while in (Nowozin et al., 2016) it requires the solution of an optimization problem.

Second, in our case we have complete control over the result of the maximization problem that defines the discriminator. In other words we can decide what transformation $\omega(r)$ of the likelihood ratio $r$ the discriminator must estimate. In (Nowozin et al., 2016) such flexibility does not exist.

Controlling the function we estimate with the discriminator plays a significant role in the implementation of our method. Indeed when we use a neural network to approximate the optimum discriminator, this affects the overall quality of the resulting generator/discriminator pair. We should also note that there are important applications in Statistics where one is interested in estimating only the transformation $\omega(r)$ of the likelihood ratio, with the most common cases being the likelihood ratio itself, its logarithm (log-likelihood ratio), or the ratio $\frac{r}{r+1}$, which plays the role of the posterior probability between two densities. In other words, there are applications where one is interested only in the “max” part of the min-max problem. In fact, in the next section we give examples of various choices of $\omega(r)$ and mention problems where the discriminator function becomes the actual target and not the generator.

## 3 Examples

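Let us now present characteristic cases for the function $\omega(r)$ and give pairs $\phi(z),\psi(z)$ that satisfy (9). As we proved, this implies that the corresponding adversarial problem in (8) accepts the desired solution $r(X)=1$.

### 3.1 Case $\omega(r)=r^{\alpha}$

For $\alpha>0$ we have that $\mathcal{J}_\omega=[0,\infty)$ and $\omega^{-1}(z)=z^{\frac{1}{\alpha}}$. According to (9), for $z\ge 0$ we must define

$$\phi'(z)=-z^{\frac{1}{\alpha}}\rho(z),\qquad \psi'(z)=\rho(z). \tag{15}$$

The following examples can be shown to satisfy (15).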

A1) If we select , with , this yields and . For , , , . For , , , .

A2) If we select , then, and .

A3) If we select , , this yields and . This example corresponds to functions that are not both available in closed form. However they can still be used to define an optimization problem whose solution is numerically computable.

For the particular selection $\omega(r)=r$ (corresponding to $\alpha=1$) we can show that the resulting cost is equivalent to the Bregman cost (Bregman, 1967). In fact there is a one-to-one correspondence between our function $\rho(z)$ and the function that defines the Bregman cost. This correspondence however is lost once we switch to a different $\alpha$ or a different $\omega(r)$ function, suggesting that the proposed class of pairs $\phi(z),\psi(z)$ is far richer than the class induced by the Bregman cost.

We should mention that in A1) the selection is known as the mean square error criterion and if we apply only the maximization problem then this corresponds to a likelihood ratio estimation technique proposed in the literature (Sugiyama et al., 2010, 2013). Under the adversarial approach the cost takes the following interesting form

$$J(r,D)=\mathbb{E}_f\big[-0.5D^2(X)+r(X)D(X)\big]=\frac{1}{2}\mathbb{E}_f\big[-(D(X)-r(X))^2+(r(X)-1)^2\big]+\frac{1}{2}$$

where the equality is a consequence of $r(X)$ being a likelihood ratio with respect to $f$, that is, $\mathbb{E}_f[r(X)]=1$. As we can see, the maximization problem indeed yields $D(X)=r(X)$ while the minimization that must follow captures the desired solution $r(X)=1$.
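The intermediate steps behind this identity are not spelled out in the text. A short sketch, dropping the argument $X$ for brevity and using only completion of the square together with $\mathbb{E}_f[r]=1$, is:

$$-\tfrac{1}{2}D^2+rD=-\tfrac{1}{2}(D-r)^2+\tfrac{1}{2}r^2,$$

$$\mathbb{E}_f[(r-1)^2]=\mathbb{E}_f[r^2]-2\,\mathbb{E}_f[r]+1=\mathbb{E}_f[r^2]-1
\quad\Longrightarrow\quad
\tfrac{1}{2}\mathbb{E}_f[r^2]=\tfrac{1}{2}\mathbb{E}_f[(r-1)^2]+\tfrac{1}{2}.$$

Taking the expectation of the first line and substituting the second gives the stated form of the cost, in which the discriminator only affects the term $-(D-r)^2$ and the likelihood ratio is then penalized through $(r-1)^2$.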

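### 3.2 Case $\omega(r)=\alpha^{-1}\log r$

For $\alpha>0$ we have $\mathcal{J}_\omega=\mathbb{R}$ and $\omega^{-1}(z)=e^{\alpha z}$. As before $\rho(z)$ must be strictly positive and, according to (9), for all real $z$ we must define

$$\phi'(z)=-e^{\alpha z}\rho(z),\qquad \psi'(z)=\rho(z). \tag{16}$$

We have the following examples that satisfy these equations.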

B1) If with , this produces If then , , . If then , and .

B2) If , then, and .

B3) If , this yields and . The two functions can be written in terms of the Exponential integral or with the help of a power series expansion, but they do not enjoy any closed form expressions. On the other hand, their derivatives are simple and can be clearly used in a gradient type algorithm to numerically compute the solution of the corresponding optimization.

We would like to point out that the previous examples are presented for the first time and can be used either under a min-max setting for the determination of the generator/discriminator pair or under a purely maximization setting for the direct estimation of the log-likelihood ratio function.

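### 3.3 Case $\omega(r)=\frac{r}{r+1}$

When $\omega(r)=\frac{r}{r+1}$ we have $\mathcal{J}_\omega=[0,1)$ and $\omega^{-1}(z)=\frac{z}{1-z}$. For $z\in[0,1)$ we must define the functions according to (9)

$$\phi'(z)=-\frac{z}{1-z}\rho(z),\qquad \psi'(z)=\rho(z). \tag{17}$$

The next set of examples can be seen to satisfy (17).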

C1) If we select , this yields and .

C2) Selecting , with , yields and . For , we have and , , while for we have and , .

In C1) we recognize the functions used in the original article by (Goodfellow et al., 2016). C2) appears for the first time.
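As an illustration of how a candidate pair can be validated against (17), the sketch below takes the logarithmic pair $\phi(z)=\log(1-z)$, $\psi(z)=\log z$, which follows from (17) under the assumed choice $\rho(z)=1/z$ (this choice is our assumption for the example; the text above does not display it). The code checks the derivative rule by finite differences and confirms by grid search that the maximizer of $\phi(D)+r\psi(D)$ is $r/(r+1)$. The helper names are hypothetical.

```python
import numpy as np

# Assumed choice for illustration: rho(z) = 1/z on 0 < z < 1.
# Rule (17) then reads phi'(z) = -(z/(1-z)) * (1/z) = -1/(1-z) and psi'(z) = 1/z,
# which integrate (up to constants) to the logarithmic pair below.

def phi(z):
    return np.log(1.0 - z)

def psi(z):
    return np.log(z)

def rho(z):
    return 1.0 / z

def central_diff(fn, z, eps=1e-6):
    """Central finite-difference approximation of fn'(z)."""
    return (fn(z + eps) - fn(z - eps)) / (2.0 * eps)

def check_rule_17(z):
    """Largest deviation of the numerical derivatives from rule (17)."""
    err_phi = abs(central_diff(phi, z) + z / (1.0 - z) * rho(z))
    err_psi = abs(central_diff(psi, z) - rho(z))
    return max(err_phi, err_psi)

def argmax_D(r, grid):
    """Grid search for the D in (0, 1) maximizing phi(D) + r * psi(D)."""
    return grid[np.argmax(phi(grid) + r * psi(grid))]

grid = np.linspace(1e-4, 1.0 - 1e-4, 99999)
for r in (0.5, 1.0, 4.0):
    print(f"r={r}: grid argmax D={argmax_D(r, grid):.4f}, r/(r+1)={r / (r + 1.0):.4f}")
```

Under the convention of this article, where $\phi$ is applied to the real data and $r$ is the ratio of the generator density to the target density, the optimum discriminator output $r/(r+1)$ is the posterior probability that a sample was synthesized.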

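### 3.4 Case $\omega(r)=\mathrm{sign}(\log r)$

This is a special case of $\omega(r)$ with the corresponding function not being strictly increasing. It turns out that we can still come up with optimization problems, two of which are known and used in practice, by considering $\omega(r)$ as a limit of a sequence of strictly increasing functions.

Monotone Loss: As a first approximation we propose $\mathrm{sign}(z)\approx\tanh\big(\frac{c}{2}z\big)$ where $c>0$ is a parameter. We note that $\lim_{c\to\infty}\tanh\big(\frac{c}{2}z\big)=\mathrm{sign}(z)$. Using this approximation we can write

$$\mathrm{sign}(\log r)\approx\tanh\Big(\frac{c}{2}\log r\Big)=\frac{r^c-1}{r^c+1}=\omega(r). \tag{18}$$

As we mentioned, we have exact equality for $c\to\infty$. Let us perform our analysis by assuming that $c$ is finite. We note that $\mathcal{J}_\omega=[-1,1)$ and $\omega^{-1}(z)=\big(\frac{1+z}{1-z}\big)^{\frac{1}{c}}$. Consequently, for $z\in[-1,1)$, we must define

$$\phi'(z)=-\Big(\frac{1+z}{1-z}\Big)^{\frac{1}{c}}\rho(z),\qquad \psi'(z)=\rho(z). \tag{19}$$

D1) In (19) if we let $c\to\infty$ in order to converge to the desired sign function, this yields $\phi'(z)=-\rho(z)$ and $\psi'(z)=\rho(z)$. This suggests that $\phi(z)$ is decreasing and $\psi(z)$ is increasing. In fact any strictly increasing function $\psi(z)$ can be adopted provided we select $\phi(z)=-\psi(z)$.

There is a popular combination that falls under Case D1). In particular, the selection $\psi(z)=z$, $\phi(z)=-z$, known as Wasserstein GAN, is proposed in (Arjovsky et al., 2017). We recall that in this case the discriminator takes values in $[-1,1]$.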

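Hinge Loss: As a second approximation we use the expression $\mathrm{sign}(z)|z|^{\frac{1}{c}}$, which is strictly increasing, continuous and converges to $\mathrm{sign}(z)$ as $c\to\infty$. This suggests that

$$\mathrm{sign}(\log r)\approx\mathrm{sign}(\log r)\,|\log r|^{\frac{1}{c}}=\omega(r), \tag{20}$$

and $\omega^{-1}(z)=e^{\mathrm{sign}(z)|z|^{c}}$, which reduces to $e^{z^{c}}$ when $c$ is an odd integer. Since $\log r$ can assume any real value we conclude that $\mathcal{J}_\omega=\mathbb{R}$ which, clearly, differs from the previous approximation where we had $\mathcal{J}_\omega=[-1,1)$. If $z\in\mathbb{R}$ then, according to (9), we must define

$$\phi'(z)=-e^{z^{c}}\rho(z),\qquad \psi'(z)=\rho(z). \tag{21}$$

We present the following case that leads to a very well known pair from a completely different application.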

D2) Following (21), if we select then . If we now let , we obtain the limiting form for the derivatives which become and . By integrating we arrive at and . The cost based on this particular pair is called the hinge loss (Tang, 2013) and it is very popular in binary classification where one is interested only in the maximization problem. The corresponding method is known to exhibit an overall performance which in practice is considered among the best (Rosasco et al., 2004; Janocha & Czarnecki, 2017). Here, as in (Zhao et al., 2017), we propose the hinge loss as a means to perform adversarial optimization for the design of the generator .

This completes our presentation of examples. However, we must emphasize that these are only a few illustrations of possible pairs one can construct. Indeed, combining, as dictated by (9), any strictly increasing function $\omega(r)$ with any positive function $\rho(z)$ generates a legitimate pair $\phi(z),\psi(z)$ and a corresponding min-max problem (8) that enjoys the desired solution $r(X)=1$.

## 4 Data-Driven Setup and Neural Networks

Let us now consider the data-driven version of the problem. As mentioned, the target density $f$ is unknown. Instead we are given a collection of realizations $\{X_i\}$ that follow $f$ and a second collection $\{Z_i\}$ that follows the origin density $h$. These data constitute our training set. Regarding the second set, it can either become available “on the fly” when $h$ is known, by generating realizations every time they are needed, or it can be considered fixed from the start exactly as $\{X_i\}$, if $h$ is also unknown.

As we pointed out in Section 1, we are interested in designing a generator $G(Z)$ so that when we apply it onto the data $\{Z_i\}$, that is, $Y_i=G(Z_i)$, the resulting $\{Y_i\}$ will follow a density that matches the target density $f$.

Since we are now considering the data-driven version of the problem, we are going to limit $G(Z),D(X)$ to be the outputs of corresponding neural networks. Therefore the generator is replaced by $G(Z,\theta)$ and the discriminator by $D(X,\vartheta)$, where $\theta,\vartheta$ summarize the parameters of the two neural networks. Of course instead of neural networks one could use any other parametric family, such as SVMs, capable of efficiently approximating any nonlinear function.

Once we have selected our favorite $\omega(r)$ and $\rho(z)$ functions we can compute from (9) the functions $\phi(z),\psi(z)$ that enter into the min-max problem defined in (2). This problem, after limiting the generator and discriminator to neural networks, can be rewritten as follows

$$\min_{\theta}\max_{\vartheta}J(\theta,\vartheta)=\min_{\theta}\max_{\vartheta}\big\{\mathbb{E}_f[\phi(D(X,\vartheta))]+\mathbb{E}_h[\psi(D(G(Z,\theta),\vartheta))]\big\}. \tag{22}$$

If $\theta_o,\vartheta_o$ are the corresponding optimum parameter values, and the structure of the two networks is sufficiently rich, we expect that $G(Z,\theta_o),D(X,\vartheta_o)$ will approximate the optimum functions of the ideal problem in (2) respectively. In particular, the generator $G(Z,\theta_o)$, whenever applied onto any $Z$ that follows $h$, will result in a $Y=G(Z,\theta_o)$ that follows a density which is expected to be close to the target density $f$.

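A simple stochastic gradient algorithm that can solve the min-max optimization in (22) is the following

$$\vartheta_t=\vartheta_{t-1}+\mu\big\{\phi'(D(X_t,\vartheta_{t-1}))\nabla_\vartheta D(X_t,\vartheta_{t-1})+\psi'(D(Y_t,\vartheta_{t-1}))\nabla_\vartheta D(Y_t,\vartheta_{t-1})\big\}, \tag{23}$$

corresponding to the maximization problem and

$$\theta_t=\theta_{t-1}-\mu\,\psi'(D(Y_t,\vartheta_t))\,\nabla_\theta G(Z_t,\theta_{t-1})\,\nabla_Y D(Y_t,\vartheta_t) \tag{24}$$

for the minimization. Here $\nabla_\theta G(Z,\theta)$ denotes the Jacobian of $G(Z,\theta)$ with respect to $\theta$, and $\mu$ is the learning rate of the two updates. With $X_t$ we denote a training sample from the collection $\{X_i\}$ while with $Z_t$ either a sample from $\{Z_i\}$ when $h$ is unknown or a new realization following $h$ if the latter is known; in both updates $Y_t=G(Z_t,\theta_{t-1})$. If the collection of training data is exhausted after applying the iterations several times, then we simply reuse them.

With (23), (24) we confirm Remark 1, namely that in an optimization algorithm we do not necessarily need the functions $\phi(z),\psi(z)$ explicitly, but only their derivatives.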

It has also been observed (Goodfellow et al., 2016; Arjovsky et al., 2017) that in order for the optimization algorithm to converge properly, for each iteration of the minimization problem we must perform several iterations of the maximization problem (common practice suggests at least five iterations of the maximization problem for each iteration of the minimization).
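The alternating updates can be sketched on a one-dimensional toy problem. Everything specific in the code below is an assumption made for illustration and not the setup of the experiments: the pair $\phi'(z)=-e^{z/2}$, $\psi'(z)=e^{-z/2}$ (the log-likelihood-ratio family with $\alpha=1$ and $\rho(z)=e^{-z/2}$), a linear discriminator, a shift-only generator, Gaussian data, and the function names `disc_step`, `gen_step` and `run_toy`. Only the derivatives of the two functions appear, and several discriminator updates are performed per generator update.

```python
import numpy as np

# Assumed pair (log-likelihood-ratio family, alpha = 1, rho(z) = exp(-z/2)):
# phi'(z) = -exp(z/2), psi'(z) = exp(-z/2). Only derivatives are needed.
def dphi(z):
    return -np.exp(z / 2.0)

def dpsi(z):
    return np.exp(-z / 2.0)

# Hypothetical toy models: D(x, v) = v[0] + v[1] * x and G(z, theta) = z + theta.
def D(x, v):
    return v[0] + v[1] * x

def grad_v_D(x, v):
    return np.array([1.0, x])

def G(z, theta):
    return z + theta

def disc_step(v, theta, x, z, mu):
    """Stochastic ascent update of the discriminator parameters."""
    y = G(z, theta)
    return v + mu * (dphi(D(x, v)) * grad_v_D(x, v) + dpsi(D(y, v)) * grad_v_D(y, v))

def gen_step(v, theta, z, mu):
    """Stochastic descent update of the generator: chain rule through D and G."""
    y = G(z, theta)
    dG_dtheta = 1.0      # Jacobian of G with respect to theta
    dD_dy = v[1]         # gradient of D with respect to its input
    return theta - mu * dpsi(D(y, v)) * dG_dtheta * dD_dy

def run_toy(steps=2000, n_disc=5, mu=0.01, target_mean=2.0, seed=0):
    """Alternate n_disc discriminator updates with one generator update."""
    rng = np.random.default_rng(seed)
    v, theta = np.zeros(2), 0.0
    for _ in range(steps):
        for _ in range(n_disc):
            x = target_mean + rng.standard_normal()   # X_t ~ f = N(target_mean, 1)
            z = rng.standard_normal()                 # Z_t ~ h = N(0, 1)
            v = disc_step(v, theta, x, z, mu)
        theta = gen_step(v, theta, rng.standard_normal(), mu)
    return v, theta

if __name__ == "__main__":
    v, theta = run_toy()
    print("discriminator parameters:", v, "generator shift:", theta)
```

For unit-variance Gaussians the log-likelihood ratio is linear in the sample, which is why a linear discriminator is sufficient in this toy case; with real data both models would be neural networks and the gradients would come from automatic differentiation.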

###### Remark 2

When replacing $G(Z),D(X)$ with neural networks we must take special care of the corresponding outputs. Basically, we must guarantee that they are of the correct form. This is particularly important in the case of the scalar output of the discriminator. We recall that the optimum discriminator is $D(X)=\omega(r(X))$. This implies that we need to assure that $D(X,\vartheta)$ takes values in $\mathcal{J}_\omega$ (the range of $\omega(r)$). Consequently, we must apply the proper nonlinearity in the output of the discriminator that will guarantee this fact.
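A minimal sketch of this output-matching step is given below. The dictionary keys and the function `discriminator_output` are hypothetical names; the sigmoid for the ratio case and the absence of a nonlinearity for the logarithmic case follow the choices reported for the experiments, whereas the softplus for the power case is merely one assumed way of producing nonnegative values.

```python
import numpy as np

def identity(u):
    return u

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softplus(u):
    # log(1 + exp(u)), computed in a numerically stable way
    return np.logaddexp(0.0, u)

# Map from the chosen omega(r) to an output activation whose image lies in its range.
OUTPUT_ACTIVATION = {
    "power": softplus,   # omega(r) = r**alpha, nonnegative range: assumed choice
    "log": identity,     # omega(r) = log(r)/alpha, all reals: no nonlinearity
    "ratio": sigmoid,    # omega(r) = r/(r+1), values between 0 and 1: sigmoid
}

def discriminator_output(pre_activation, omega_name):
    """Apply the range-matching nonlinearity to the last linear layer's output."""
    return OUTPUT_ACTIVATION[omega_name](pre_activation)

u = np.linspace(-10.0, 10.0, 101)
print(discriminator_output(u, "ratio").min(), discriminator_output(u, "ratio").max())
```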

## 5 Experiments

We implemented most of the examples mentioned in Section 3 using the datasets MNIST, CelebA and CIFAR-10. Before presenting our results, we would like to give details about the following pairs that exhibited the best overall performance in our experiments:

Exponential: From Example B1), , yields , , while . No nonlinearity is needed in the discriminator output.

B1b: From Example B1), for we obtain and with . No nonlinearity is needed in the discriminator output.

B2: Example B2), with and and . No nonlinearity is needed in the discriminator output.

Cross entropy: This is the classical method proposed in (Goodfellow et al., 2016) corresponding to Example C1) with , and

. To the discriminator output we apply the sigmoid function.

Wasserstein: We are in Example D1) with and . To limit the output of the discriminator we use the function .

Hinge: From Example D2), , and . No nonlinearity is needed in the discriminator output.

For each dataset we present the best five methods in terms of convergence rate and quality of synthetic results produced by the generator.

We recall that GANs are notorious for their nonrobust behavior (Bengio, 2012; Creswell et al., 2018; Mescheder et al., 2017). For the stabilization of the training process, we used the gradient-penalty methodology described in (Gulrajani et al., 2017) which was generalized to a class of Lipschitz GANs in (Zhou et al., 2019).

For the generator, we used a four-layer neural network where the first layer is linear and the remaining are deconvolutional, with ReLU activation functions between the layers except for the final layer, where we used a sigmoid function since the output is an image with pixel values in the range $[0,1]$. The generator input is a standard i.i.d. normal vector with dimension 64 for MNIST and 128 for CelebA and CIFAR-10.

For the discriminator, we used a four-layer neural network with three convolutional layers followed by a linear layer. We applied Leaky ReLUs between the layers except for the final layer where we adopted proper functions based on the range . For the training of the two neural networks we applied the Adam algorithm (Kingma et al., 2015) with , , learning rate and batch size 50 for MNIST and 64 for CelebA and CIFAR-10. For all datasets the training lasted iterations.

The first set of experiments involves training with MNIST. Table 1 summarizes the top five attained Fréchet Inception Distances (FID) (Heusel et al., 2017) and Kernel Inception Distances (KID) (Binkowski et al., 2018) by the various methods.

In Figure 1 we present examples of generated synthetic numerals by the corresponding methods. We observe that for this particular dataset the designed GANs have comparable performance with the Wasserstein and Cross Entropy exhibiting the smallest and B1b and Exponential the highest scores.

The second set of experiments involves the CelebA dataset. Table 2 reports the best observed FID, KID scores of the competing methods, while Figures 2, 3 depict the evolution of the corresponding scores with the number of iterations during training. From these figures we observe that both scores of the Hinge method exhibit a high variability while B2, Cross entropy, Exponential and Wasserstein have comparable behavior. In Figure 4 we have examples of synthetic faces generated by each method.

The third and last set of experiments involves the far more challenging CIFAR-10 dataset. Table 3 summarizes the best FID, KID scores, while Figures 5, 6 capture their evolution during training; finally, Figure 7 hosts examples of corresponding synthetic images. Interestingly, from Figure 6, we distinguish two performance groups for this dataset. We can see that B1b and Wasserstein have a visibly better performance than the second group, which includes the Cross Entropy, Exponential and B2.

## Acknowledgement

This work was supported by the US National Science Foundation, Grant CIF 1513373, through Rutgers University.

## References

• Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. Proceedings of Machine Learning Research (PMLR 2017), pp. 214–223, 2017.
• Bengio (2012) Bengio, Y. Practical recommendations for gradient-based training of deep architectures. arXiv:1206.5533, 2012.
• Binkowski et al. (2018) Binkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. Proceedings International Conference on Learning Representations, 2018.
• Bregman (1967) Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.
• Box & Cox (1964) Box, G. E. P., and Cox, D. R. An analysis of transformations. Journal of the Royal Statistical Society. Series B, 26(2):211–252, 1964.
• Creswell et al. (2018) Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., and Bharath, A. A. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, January 2018.
• Goodfellow et al. (2016) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. arXiv: 1406.2661, 2014.
• Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. arXiv: 1704.00028, 2017.
• Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., and Nessler, B. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Proceedings Advances Neural Information Processing Systems Conference, 2017.
• Janocha & Czarnecki (2017) Janocha, K., and Czarnecki, W. M. On loss functions for deep neural networks in classification. arXiv:1702.05659, 2017.
• Mescheder et al. (2017) Mescheder, L. M., Nowozin, S. and Geiger A. The numerics of GANs. Proceedings Advances Neural Information Processing Systems Conference, 2017.
• Moulin & Veeravalli (2019) Moulin, P., and Veeravalli, V. V. Statistical Inference for Engineers and Data Scientists. Cambridge, New York, 2019.
• Kingma et al. (2015) Kingma, D. P., and Ba J. L. Adam: A method for Stochastic Optimization. Proceedings International Conference on Learning Representations, 2015.
• Nowozin et al. (2016) Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 271–279, 2016.
• Rosasco et al. (2004) Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., and Verri, A. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.
• Sugiyama et al. (2010) Sugiyama, M., Suzuki, T., and Kanamori, T. Density ratio estimation: A comprehensive review. RIMS Kokyuroku, 1703: 10–31, 2010.
• Sugiyama et al. (2013) Sugiyama, M., Suzuki, T., and Kanamori, T. Density Ratio Estimation in Machine Learning. Cambridge, New York, 2013.
• Tang (2013) Tang, Y. Deep learning using linear support vector machines. arXiv:1306.0239, 2013.
• Zhao et al. (2017) Zhao, J., Mathieu, M., and LeCun, Y. Energy-Based Generative Adversarial Networks. Proceedings International Conference on Learning Representations, 2017.
• Zhou et al. (2019) Zhou, Z., Liang, J., Song, Y., Yu, L., Wang, H., Zhang, W., Yu, Y., and Zhang, Z. Lipschitz generative adversarial nets. Proceedings of the 36th International Conference on Machine Learning, pp. 7584–7593, 2019.