# Adversarial Regression. Generative Adversarial Networks for Non-Linear Regression: Theory and Assessment

Adversarial Regression is a proposal to perform high-dimensional non-linear regression with uncertainty estimation. We use a Conditional Generative Adversarial Network (CGAN) to obtain an estimate of the full predictive distribution for a new observation. Generative Adversarial Networks (GANs) are implicit generative models which produce samples from a distribution approximating the distribution of the data. The conditional version (CGAN) takes the following expression: min_G max_D V(D,G) = E_{x∼p_r(x)}[log(D(x,y))] + E_{z∼p_z(z)}[log(1−D(G(z,y)))]. An approximate solution can be found by simultaneously training two neural networks to model D and G and feeding G with a random noise vector z. After training, we have that G(z,y) ∼̇ p_data(x,y). By fixing y, we have G(z|y) ∼̇ p_data(x|y). By sampling z, we can therefore obtain samples following approximately p(x|y), which is the predictive distribution of x for a new y. We ran experiments to test various loss functions, data distributions, sample sizes, sizes of the noise vector, etc. Even if we observed differences, no configuration consistently outperformed the others. The quality of CGANs for regression relies on fine-tuning a range of hyperparameters. In a broader view, the results show that CGANs are a very promising method for uncertainty estimation in high-dimensional non-linear regression.


### 1.1 Classical frameworks for Regressions

First, we review the most classical form of prediction in statistics: least squares linear regression. The idea is to express y as a linear combination of X plus an error term, which is assumed to be normally distributed. Thus, the general linear regression model can be written as follows:

 y=Xβ+ε, (1.1)

where y is a vector in ℝ^n, X is a full rank matrix in ℝ^(n×p), β is a vector in ℝ^p and ε ∼ N(0, σ²ε I_n), I_n being an identity matrix.

This can be read as a list of restrictions and additional assumptions in comparison to the general model y = f(X) + ε: f is restricted to a linear combination of X and β, the error term has no relation with any other term and is normally distributed, and X has to be full rank.

Since X is given, the remaining task consists of estimating β and σ²ε. In this framework, the estimation can be done by minimizing the least square error. The solution has a closed form given by the formulas:

 β̂=(XᵀX)⁻¹Xᵀy (1.2)

and

 σ̂²ε=eᵀe/(n−p), (1.3)

where e = y − Xβ̂ is the vector of residuals. Note that β̂ and σ̂²ε are random since they depend on y. With those two estimates, it is possible to compute a prediction for a set of new explanatory variables X⋆. We have:

 ŷ⋆=X⋆β̂. (1.4)

The expected prediction error is equal to zero, E(y⋆ − ŷ⋆) = 0, and for a single new observation x⋆ the error variance can be estimated as σ̂²ε(1 + x⋆ᵀ(XᵀX)⁻¹x⋆). It is thus possible to build a confidence interval for the predictions.

All the demonstrations appear in the lecture notes of the Advanced Regression Methods course given by Prof. Yves Tillé (Tillé, 2017). Since they are not central to our purpose, we do not reproduce them here.

As such, linear regression offers predictions and confidence intervals for them. The model has one major benefit: its simplicity. Unfortunately, this simplicity has a cost:

1. The model requires assuming the normality of the error term, which is a strong assumption that does not hold in many cases.

2. The model has to be a linear combination of the input variables. Again, this is a very strong restriction on the possible functions, even if the issue can be partially avoided by working with transformed variables (i.e. with φ(X), where φ is an arbitrarily complex function).

3. High-dimensional inputs can cause problems concerning the required condition for the design matrix to be full rank.

4. It is possible to build a confidence interval for new predictions. However, the model does not provide an estimate of the probability density for the new prediction.

Of course, the so-called Generalized Linear Model (GLM) proposes some (partial) solutions, mainly concerning the first of the above issues. While we will not discuss GLM in detail here, we will present the basic concepts of logistic regression, since they are strongly connected with some techniques we are going to use.

#### 1.1.1 Logistic Regressions

Logistic regression is well known for dealing with categorical response variables, and it is also for that purpose that we are interested in it. Classifier neural networks can be interpreted as non-linear logistic regressions: indeed, the last layer of such classifiers is a logistic regression. Even if we are mainly interested in regression rather than classification, some generative adversarial networks include a classifier. Hence, it is of interest to review logistic regression to understand this kind of network.

##### Binary logistic regression

First, we have a look at binary logistic regression. The response variable can only take two values: 0 or 1. Let Z be defined as the result of a linear combination of the matrix X, so that Z = θᵀX. Then, the sigmoid function σ is a function from ℝ to (0,1) defined as follows:

 σ(Z)=exp(Z)/(1+exp(Z))=1/(1+exp(−Z))=ŷ, (1.5)

where ŷ is interpreted as the predicted probability of being of class 1. It follows that we have:

 Z=θᵀX=log(σ(Z)/(1−σ(Z))). (1.6)

Since the quantity σ(Z) gives the probability of being of class 1 and the model follows a Bernoulli distribution, the likelihood is given by:

 p(y|θ)=∏_{i=1}^{n} σ(Z_i)^{y_i}(1−σ(Z_i))^{1−y_i}. (1.7)

The usual error function for the likelihood is the negative log-likelihood, known as the cross-entropy loss function:

 L(y,θ)=−log(p(y|θ))=−∑_{i=1}^{n}[y_i log(σ(Z_i))+(1−y_i)log(1−σ(Z_i))]. (1.8)

This result is very important for neural networks. It is used in many binary classification tasks.
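As an illustration, the sigmoid of equation (1.5) and the cross-entropy loss of equation (1.8) can be computed directly; the scores and classes below are made up for the example:

```python
import numpy as np

def sigmoid(z):
    # Equation (1.5): maps a real score onto a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat):
    # Equation (1.8): negative log-likelihood of the Bernoulli model
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])    # observed classes
z = np.array([2.0, -1.0, 0.5])   # made-up linear scores theta^T x
y_hat = sigmoid(z)               # predicted probabilities
loss = binary_cross_entropy(y, y_hat)
```

The loss decreases as the predicted probabilities get closer to the observed classes, which is exactly what gradient-based training exploits later on.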

#### 1.1.2 Multinomial logistic regression

The other main classification task is quite similar but with more than 2 classes. Let J denote the number of output classes. Z is now a vector of size J, Z_i denoting its i-th component. The softmax function is defined as:

 σ(Z_i)=e^{Z_i}/∑_{j=1}^{J}e^{Z_j}. (1.9)

Following the same principle as binary logistic regression, σ(Z_i) represents the probability of being of class i. It is easy to see that the binary case is just a particular case of a multinomial regression with J = 2. The general cross-entropy loss function takes this form:

 L(y_i)=−∑_{j=1}^{J}1{y_i=j}log(ŷ_{ij}). (1.10)
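A sketch of the softmax of equation (1.9) and the loss of equation (1.10), with made-up scores for three classes:

```python
import numpy as np

def softmax(z):
    # Equation (1.9); subtracting max(z) avoids numerical overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(true_class, probs):
    # Equation (1.10): only the predicted probability of the true class contributes
    return -np.log(probs[true_class])

z = np.array([1.0, 2.0, 0.1])   # made-up scores for J = 3 classes
p = softmax(z)                  # probabilities summing to one
loss = cross_entropy(1, p)      # the true class here is class 1
```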

### 1.2 The Bayesian approach

We introduce briefly the Bayesian framework. The Bayesian version of linear regression offers the possibility to compute a predictive probability density for a new observation, which is an example of what we are looking for.

The Bayesian approach aims to model knowledge about some events in terms of probability. (This section is based on references about Bayesian statistics, Albert (2009) and Chevalier (2018), and on Bayesian methods applied to machine learning, Murphy (2012), Barber (2012) and Bishop (2006).) Consequently, the model parameters are no longer fixed unknown values but random variables. This change of perspective makes it possible to compute the probability density of the parameters given the data and, consequently, the probability density for a new prediction. Typically, we are interested in the probability density function of the parameters given the data: p(θ|Y). Thanks to Bayes' theorem, we have the following relation:

 p(θ|Y)=p(Y|θ)p(θ)/p(Y). (1.11)

On the right-hand side of the equation, we have three terms. The first one, p(Y|θ), is the likelihood. The second term, p(θ), is the prior and represents the probability distribution over the parameters from prior knowledge, before having the data. The third one, p(Y), is a constant ensuring that the probabilities integrate to one:

 p(Y)=∫p(Y|θ)p(θ)dθ. (1.12)

However, an expression proportional to the posterior is often sufficient, so that only the numerator is computed. In that case, we write:

 p(θ|Y)∝p(Y|θ)p(θ). (1.13)

The Bayesian approach offers a theoretical answer to the uncertainty issue about predictions. The full distribution for a new prediction is definitely a satisfying answer to uncertainty. Unfortunately, the posterior distribution over the parameters is rarely tractable in practice. In high-dimensional non-linear spaces, it almost never is.

#### 1.2.1 Bayesian linear regression

Since being Bayesian offers a convincing framework to deal with uncertainty, it is of interest to go through the same simple linear regression model and see how the Bayesian approach deals with it.

The likelihood remains the same: it is normally distributed with mean Xβ and variance σ²I_n. We can express it as

 p(y|X,β,σ²)=N(y|Xβ,σ²I_n)=(2πσ²)^{−n/2} exp(−(1/(2σ²))(y−Xβ)ᵀ(y−Xβ)), (1.14)

where I_n is an identity matrix of size n. We want to compute the posterior distribution over the parameters β and σ². This requires defining a prior on the parameters β and σ². Of course, all kinds of priors are allowed and possible. However, most of them are intractable and require numerical approximation. Hence, it is very common to choose a so-called conjugate prior which allows an analytical result. Actually, instead of choosing a prior for the joint distribution p(β,σ²), it is very common to separate the joint prior into two distributions, since we have:

 p(β,σ²)=p(β|σ²)p(σ²). (1.15)

The conjugate prior is often called NIG (standing for Normal Inverse-Gamma), since the prior for β is a Normal distribution and the one for σ² an Inverse-Gamma. The posterior is obtained as follows:

 p(β,σ²|X,y)∝N(y|Xβ,σ²I_n)∗N(β|σ²;β₀,Σ₀)IG(σ²;a₀,b₀), (1.16)

where β₀, Σ₀, a₀, b₀ are hyperparameters for the prior distributions. Eventually, the posterior can be expressed as a NIG:

 p(β,σ²|X,y)=NIG(β,σ²|β_n,Σ_n,a_n,b_n), (1.17)

where Σ_n=(Σ₀⁻¹+XᵀX)⁻¹, β_n=Σ_n(Σ₀⁻¹β₀+Xᵀy), a_n=a₀+n/2 and b_n=b₀+½(β₀ᵀΣ₀⁻¹β₀+yᵀy−β_nᵀΣ_n⁻¹β_n).

That is the very general form. Some prior parameters are particularly popular. For instance, a non-informative prior can be obtained by setting β₀=0, Σ₀⁻¹=0, a₀=−p/2 and b₀=0, which gives β_n=β̂=(XᵀX)⁻¹Xᵀy, p being the number of columns of X. Another example arises from keeping the uninformative prior for σ² and choosing a prior of the form N(0,τ²I) for β. It gives a Maximum A Posteriori (MAP) estimate equivalent to the one obtained with classic ridge regression, with λ=σ²/τ².
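As a sketch, the standard conjugate NIG update can be computed in a few lines; the data, the true coefficients and the prior hyperparameters below are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])          # made-up coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# NIG prior hyperparameters (weakly informative, chosen for the example)
beta0, Sigma0 = np.zeros(p), 10.0 * np.eye(p)
a0, b0 = 1.0, 1.0

# Conjugate update: the posterior is again Normal-Inverse-Gamma
Sigma0_inv = np.linalg.inv(Sigma0)
Sigma_n = np.linalg.inv(Sigma0_inv + X.T @ X)
beta_n = Sigma_n @ (Sigma0_inv @ beta0 + X.T @ y)
a_n = a0 + n / 2
b_n = b0 + 0.5 * (beta0 @ Sigma0_inv @ beta0 + y @ y
                  - beta_n @ np.linalg.inv(Sigma_n) @ beta_n)
```

With a weak prior and enough data, the posterior mean beta_n lands close to the least squares estimate, illustrating the MAP equivalence discussed above.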

Moreover, it can be shown that the MAP for β with an uninformative prior turns out to yield results equivalent to the frequentist approach. Not only are the parameters the same: the frequentist confidence interval also coincides with the credible interval from the Bayesian perspective.

The interest of being Bayesian is not obvious in this case. It provides similar results to classical linear regression. The restrictions on the model family and the required additional assumptions are the same as the ones made in the frequentist framework. Nonetheless, the method changes the perspective. Instead of having a point estimate and a confidence interval, we now have a probability distribution for the new prediction, which is what we are looking for.

### 1.3 Neural networks

Neural networks for regression are important for two reasons. First, they considerably relax the restrictions and assumptions of the previous methods. Second, generative adversarial networks (GANs) are neural network based.

Neural networks for regression tasks can be seen as a proposition to address the restrictions and additional assumptions of the linear models. First of all, neural networks drastically enlarge the family of functions taken into consideration. They are chains of nested functions, including non-linear transformations, able to approximate almost any kind of real-valued function. The only restriction they have about the size of the design matrix is the computational power at disposal. They make no assumption about the error term. However, basic regression neural networks have two important limitations. First, they often overfit the data, which calls for regularization techniques and hyperparameters to prevent it. Second, they offer only point estimates and provide no information about the prediction uncertainty.

In addition, neural networks are also important for this study: the adversarial regression that we are going to test is based on neural networks. So, we can introduce neural networks at the same time for their ability to perform regression and as a method we are going to use.

In their simplest form, neural networks can be thought of as a non-linear multi-regression model. Indeed, a neural network consists of an input layer containing the data, one or several hidden layers and an output layer, each layer containing several neurons, also sometimes called nodes or units. The number of units in each layer and the total number of layers can be seen as hyperparameters. We will come back to this later.

#### 1.3.1 Feeding forward the neural network

Each unit is a place where two operations happen. First, each unit of each layer computes a linear combination of the outputs of the previous layer. The result for unit j in layer l is noted z_j^[l]. Therefore:

 z_j^[l]=(w_j^[l])ᵀa^[l−1]+b_j^[l−1]. (1.18)

Second, the linear combination z_j^[l] is then transformed by a non-linear activation function g. Consequently, the output of the unit is obtained as follows:

 a_j^[l]=g(z_j^[l])=g[(w_j^[l])ᵀa^[l−1]+b_j^[l−1]] (1.19)

Figure 1.1 is a graphical representation of a single unit.

Once we have understood how a single unit works, understanding the whole network is straightforward, since the neural network is simply a succession of layers, each one containing a fixed number of units. The computation of layer l is given as follows:

 A^[l]=g(Z^[l])=g[(W^[l])ᵀA^[l−1]+b^[l−1]]. (1.20)

We will come back later to the issue of the initialization of the weights, but let us assume for now that we have set some weights. The forward pass is then very easy: the pseudo-code holds in 4 lines.

So, the model is only a succession of linear combinations, each followed by a non-linear activation function.
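The forward pass described above can be sketched in a few lines (a minimal illustration with made-up layer sizes and random weights, not the author's pseudo-code; the ReLU activation used here is presented in the next section):

```python
import numpy as np

def g(z):
    return np.maximum(0, z)   # ReLU activation

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 made-up observations, 4 features
layers = [4, 8, 8, 1]         # sizes of the successive layers
W = [rng.normal(size=(m, k)) for m, k in zip(layers[:-1], layers[1:])]
b = [np.zeros(k) for k in layers[1:]]

# Forward pass: each layer applies a linear combination, then an activation
A = X
for l in range(len(W)):
    Z = A @ W[l] + b[l]                  # equation (1.18), vectorized over units
    A = g(Z) if l < len(W) - 1 else Z    # linear output layer for regression
```

The loop body is exactly equation (1.20): the output of one layer becomes the input of the next.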

#### 1.3.2 The activation functions

Why do we need an activation function? A linear combination of a linear combination is still a linear combination. Therefore, without a non-linear activation function, a neural network would be equivalent to a linear combination of the input, which makes the activation function essential. There are plenty of activation functions. We present here the most popular ones.

The sigmoid function has already been seen in binary logistic regression. In the machine learning context, the logistic function for multinomial logistic regression is called the softmax function. The rectified linear unit (ReLU) function is probably the most famous and the most used in the hidden layers. Its expression is very simple:

 g(Z)=max(0,Z). (1.21)

The drawback of the ReLU function is that its gradient is zero for all negative Z. To avoid this, a slightly modified version, called leaky ReLU, has been developed. Its expression is

 g(Z)=max(aZ,Z), (1.22)

where a is usually a small constant (e.g. 0.01).

Thanks to the forward pass, we now have the output matrix A^[L].

#### 1.3.3 The loss functions

Neural networks can be interpreted as a non-linear regression. As we would do in linear regression, the aim is to find the parameters, or "weights", minimizing the cost function. Because of the complexity of the function to minimize, there is no closed form for these minima. Nonetheless, the strategy remains the same: compute an error and minimize it. Therefore, we need a mathematical expression for the cost. Loss and cost functions refer to all the functions able to compute this error. We present here two of the most famous loss functions.

The mean squared error is the sum of the squared differences between the predicted values and the true values, divided by the number of observations:

 L(ŷ,y)=(1/n)∑_{i=1}^{n}(y_i−ŷ_i)². (1.23)

For classifiers, the most common function is the cross-entropy loss:

 L(ŷ^{(i)},y^{(i)})=−∑_{j=1}^{J}1{y_i=j}log(ŷ_{ij}), (1.24)

where J is the number of classes.

Of course, the loss function can take different expressions depending on the aim. We are going to see different loss functions in the chapter about GANs.

#### 1.3.4 Backpropagation and gradient descent

Neural networks aim to minimize the cost function. There are several optimization methods, but all of them are improvements of the same general idea: gradient descent. Let us present gradient descent first and then some of its most popular improvements. The general idea of gradient descent consists of gradually updating the parameters to find a minimum, i.e. some values of the parameters where all partial derivatives with respect to the parameters are equal to zero. Once we have the values of all the weights and the resulting cost, we can use them to optimize the weights. Gradient descent works like this:

• Take the current values of the parameters as a starting point.

• Compute the gradient of the cost function for these values i.e. its first partial derivative with respect to the parameters.

• Update the parameters by taking a small step in the direction opposite to the gradient.

• Repeat this process until the gradient converges around zero.

What a ”small step” means remains unclear for now; let us call its size the learning rate. We will come back to it later. So, the general formula for updating a parameter is as follows:

 θ_i=θ_{i−1}−α ∂J(Θ_{i−1})/∂θ_{i−1}, (1.25)

where α is the learning rate, J(Θ_{i−1}) is the cost computed with the former parameters, θ_{i−1} represents the parameter before the update and θ_i the same parameter updated. If the learning rate is small enough, then the cost will always decrease until it reaches a value close to a local minimum.
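As an illustration of the update rule (1.25), gradient descent on a simple least squares cost (with made-up data and a learning rate chosen for the example) can be written as:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)   # starting point
alpha = 0.1           # learning rate
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ theta - y)  # dJ/dtheta for the MSE cost
    theta = theta - alpha * grad                 # update rule (1.25)
```

After enough iterations, theta converges to the least squares solution of equation (1.2).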

#### 1.3.5 The chain rule

Gradient descent implies computing the partial derivatives of the cost function with respect to the parameters. It may look very difficult, since we have non-linear functions of many parameters, but thanks to the chain rule it is easier than it may seem at first sight.

The chain rule of derivation is a way to compute the derivative of variables inside nested functions. Let f, g, h be some functions and x a variable. The function f is the outer function and h the inner one, as follows: f(g(h(x))). Then, the derivative of f with respect to x is the product of the derivative of f with respect to g, of the derivative of g with respect to h and of the derivative of h with respect to x. We can write:

 df/dx=df/dg∗dg/dh∗dh/dx. (1.26)

Since a neural network can be seen as a range of functions nested within each other, the chain rule does apply. So, if we want to compute the derivative of the cost function with respect to the weights of the last layer W^[L], we have:

 ∂J/∂W^[L]=∂J/∂A^[L]∗∂A^[L]/∂W^[L]. (1.27)

We can apply the chain rule even further to compute the derivatives for the previous layers. Following the same method, we can compute the gradients of all the weights back to the first layer. A general formula of the previous example can be written like this:

 ∂J/∂W^[l]=∂J/∂A^[L]∗∂A^[L]/∂Z^[L]∗∂Z^[L]/∂A^[L−1]∗∂A^[L−1]/∂Z^[L−1]∗...∗∂Z^[l+1]/∂A^[l]∗∂A^[l]/∂Z^[l]∗∂Z^[l]/∂W^[l] (1.28)

The chain rule allows a relatively easy computation of the gradients of all the parameters, so that eventually they can be updated. Gradient descent does not guarantee reaching the global minimum, only a local one. In practice, however, it usually produces good results anyway.
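To make the chain rule concrete, here is a sketch with one hidden layer and made-up data, where the analytical gradient obtained through equation (1.28) is checked against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))        # 8 made-up observations, 3 inputs
y = rng.normal(size=(8, 1))
W1 = rng.normal(size=(3, 4))       # hidden layer weights
W2 = rng.normal(size=(4, 1))       # output layer weights

def cost(W1, W2):
    A1 = np.maximum(0, X @ W1)     # hidden layer with ReLU
    return np.mean((A1 @ W2 - y) ** 2)

# Backpropagation: the chain rule (1.28) applied layer by layer
A1 = np.maximum(0, X @ W1)
d_out = 2 * (A1 @ W2 - y) / y.size   # dJ/d(output) for the MSE cost
grad_W2 = A1.T @ d_out               # through the output layer
d_A1 = d_out @ W2.T                  # back to the hidden activations
d_Z1 = d_A1 * (X @ W1 > 0)           # through the ReLU
grad_W1 = X.T @ d_Z1                 # through the hidden linear step

# Numerical check of one coordinate against a finite difference
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numerical = (cost(W1_shift, W2) - cost(W1, W2)) / eps
```

The two estimates agree to several decimal places, which is a classic sanity check for a hand-written backward pass.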

#### 1.3.6 Improvement of Gradient Descent

Gradient descent is the core principle of the optimization. However, some improvements make the process faster. In this section, we present the important improvements that we are going to use: stochastic and mini-batch gradient descent, momentum, RMSprop and Adam.

##### Stochastic and Mini-batch Gradient Descent

Stochastic gradient descent consists of drawing the observations in a random order and updating the parameters after each observation, instead of computing the gradient on the whole dataset. The idea is that it is much faster to compute the gradient for one observation than for the whole dataset. Even if a single update yields a loose approximation of the gradient, this is compensated by the fact that it is possible to compute many more iterations.

An alternative to strict stochastic gradient descent is mini-batch gradient descent. Instead of taking only one observation, the idea is to take a small batch of data. Again, the data are first randomly sorted into mini-batches. The parameters are updated on each mini-batch.

We will call an epoch one full training cycle through the whole dataset, so that an epoch of stochastic gradient descent consists of n iterations and an epoch with mini-batches of size m consists of n/m iterations.

##### Momentum

Gradient descent with momentum is not a recent method: it was proposed in 1964 by Polyak (1964). It has been popularized again in the machine learning community by (Sutskever et al., 2013). The idea of momentum is to update the parameters not only with respect to the last computation of the gradient, but taking into account an exponentially weighted (moving) average of the last iterations.

At iteration i, we compute V_dθ^(i) as a linear combination of the previous V_dθ^(i−1) and the new gradient dθ, so that

 V_dθ^(i)=βV_dθ^(i−1)+(1−β)dθ, (1.29)

where β is a parameter controlling the weight of the past iterations relative to the weight of the new gradient. A β close to one gives much weight to the past; a β close to zero gives a bigger weight to the new gradient. The parameters are not updated with dθ but with V_dθ^(i), as follows:

 θ_i=θ_{i−1}−αV_dθ^(i). (1.30)

That way, the update of the gradient tends to continue in the direction given by the exponentially weighted average of the last iterations.

Momentum is particularly useful with stochastic or mini-batch gradient descent. Since one observation, or even one mini-batch, yields fluctuating gradient values because of the randomness of the process, momentum compensates for these fluctuations.

##### RMSprop

RMSprop is another technique for speeding up the optimization. It has an unconventional background, since it was first proposed in an online course and spread from there. The class is not available anymore but the slide notes are (Hinton et al., Unknown). Instead of computing a moving average of the gradient, we compute a moving average of its element-wise square. This average tends to be small in the dimensions along which the parameters have been moving consistently during the last iterations, and large in the dimensions where the gradient oscillates. Just like with momentum, but with the squared gradient, we compute an exponentially weighted average:

 S_dθ^(i)=βS_dθ^(i−1)+(1−β)dθ². (1.31)

The parameters are then updated with the derivative weighted by the inverse of the square root of S_dθ^(i). Formally, the update is written:

 θ_i=θ_{i−1}−α dθ/√(S_dθ^(i)). (1.32)

That way, the update is bigger in the dimensions parallel to the direction in which the last updates have been made.

The Adam optimizer, proposed in (Kingma and Ba, 2014), is actually the combination of gradient descent with momentum and RMSprop. V_dθ and S_dθ are computed in the same way, and their parameters β are renamed respectively β₁ and β₂. Adam also uses corrected versions of V_dθ and S_dθ, reducing their weight at the beginning of the training. The corrected versions are as follows:

 V_dθ^{corr,(i)}=V_dθ^(i)/(1−β₁ᵗ) (1.33)
 S_dθ^{corr,(i)}=S_dθ^(i)/(1−β₂ᵗ), (1.34)

where t is the number of iterations. Eventually, the parameters are updated as follows:

 θ_i=θ_{i−1}−αV_dθ^{corr,(i)}/√(S_dθ^{corr,(i)}). (1.35)

Adam has been considered state-of-the-art and is used in a lot of neural networks. However, as we are going to see, the algorithm is sometimes used with β₁=0 or β₂=0. In those cases, Adam is equivalent respectively to RMSprop or to gradient descent with momentum.
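The update rules (1.29) to (1.35) can be sketched on a toy quadratic cost; the cost, learning rate and number of steps are made up for the example, and a small ε is added to the denominator for numerical stability, which the equations above omit:

```python
import numpy as np

def grad(theta):
    # Gradient of the made-up cost J(theta) = theta_0^2 + 10 * theta_1^2
    return np.array([2.0, 20.0]) * theta

theta = np.array([1.0, 1.0])
V = np.zeros(2)   # momentum moving average, equation (1.29)
S = np.zeros(2)   # squared-gradient moving average, equation (1.31)
alpha, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 401):
    g = grad(theta)
    V = beta1 * V + (1 - beta1) * g
    S = beta2 * S + (1 - beta2) * g ** 2
    V_corr = V / (1 - beta1 ** t)   # bias correction, equation (1.33)
    S_corr = S / (1 - beta2 ** t)   # bias correction, equation (1.34)
    theta = theta - alpha * V_corr / (np.sqrt(S_corr) + eps)  # update (1.35)
```

Setting beta1 = 0 removes the momentum part and setting beta2 = 0 removes the RMSprop part, which is the equivalence mentioned above.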

#### 1.3.7 A powerful algorithm

Putting it all together, the general algorithm for a standard basic neural network combines a forward pass, a backward pass computing the gradients through the chain rule, and a parameter update, repeated over many epochs.
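Such a training loop can be sketched as follows (one hidden layer, made-up regression data; an illustration under our own choices of sizes, initialization and learning rate, not the author's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -1.0, 0.5]))[:, None]   # made-up regression target

W1 = rng.normal(size=(3, 16)) * np.sqrt(2 / 3)  # He-style initialization
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)) * 0.1
b2 = np.zeros(1)
alpha = 0.01

for epoch in range(2000):
    # 1. Forward pass (equation 1.20)
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)               # ReLU
    out = A1 @ W2 + b2
    # 2. Backward pass: chain rule (equation 1.28)
    d_out = 2 * (out - y) / len(y)       # gradient of the mean squared error
    dW2, db2 = A1.T @ d_out, d_out.sum(axis=0)
    dZ1 = (d_out @ W2.T) * (Z1 > 0)
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    # 3. Gradient descent update (equation 1.25)
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2

mse = np.mean((np.maximum(0, X @ W1 + b1) @ W2 + b2 - y) ** 2)
```

The final mean squared error is far below the variance of the target, showing that even this minimal network learns the underlying function.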

Neural networks are very good at building complex models. Even a very simple architecture can lead to astonishing results: with just a few layers, a neural network can recognize (predict) handwritten digits from a picture with very good accuracy. However, they have some important drawbacks too. First, they are computationally expensive: neural networks have many parameters to train, i.e. to estimate through gradient descent, and gradient descent itself is an iterative process that is computationally expensive by principle. Second, some neural networks tend to overfit the data. Fortunately, there exist several regularization methods; however, all of them involve additional hyperparameter(s) that require tuning. Third, neural networks demand a lot of data, which is another consequence of the high number of parameters: efficient training of neural networks implies many training examples. Finally, basic neural networks only provide point estimates. Since the prediction depends on parameters that undergo non-linear transformations and combinations with other parameters, it is impossible to retrieve the uncertainty either from the data or from the parameters.

### 1.4 Chapter conclusion

The ideal regression method would combine the best of the two: the power of neural networks and the ability to quantify uncertainty for new predictions. Generative adversarial networks are potentially able to put these qualities together. The aim of this work is to test how well they can do it.

### 2.1 Theoretical background of GANs

We present in this section the main ideas of (Goodfellow et al., 2014). We have an unknown density function p_r and a dataset containing realizations of it. We want to use this sample to be able to produce new realizations from an approximation of p_r. The idea of GANs is as follows:

1. We generate some data from a simple known distribution, typically the standard normal.

2. We train a Generator G to simulate realizations of p_r as well as possible, so that we have G(z) ∼̇ p_r.

3. We train a Discriminator D to distinguish between the samples coming from p_r and the ones coming from G.

Actually, D and G are trained simultaneously through an optimization algorithm, each of them making one step of optimization after the other. The process is a game between two players (in fact, it is related to game theory and the Nash equilibrium; for details, see (Oliehoek et al., 2018)). The first player, the Generator, tries to fool the Discriminator by generating samples as close as possible to the real distribution. The Discriminator tries not to be fooled. The better one player becomes, the better the other one has to be to succeed in its task. Figure 2.1 illustrates the architecture of a GAN (depending on the loss function, the Discriminator is renamed Critic).

#### 2.1.1 The Discriminator

Let us start with the discriminating part. It is a classification task. As shown previously, the loss function for such a case is given by the cross-entropy loss function. Plugging in the GAN notation, the loss function can be written:

 max_D V(D,G)=E_{x∼p_r(x)}[log(D(x))]+E_{z∼p_z(z)}[log(1−D(G(z)))]. (2.1)

Here, we can apply a change of variable that is not obvious if we do not want to assume that G is invertible (which is generally not the case with neural networks). As emphasized by (Rome, 2017), and thanks to the Radon–Nikodym theorem, it is possible to write:

 max_D V(D,G)=E_{x∼p_r(x)}[log(D(x))]+E_{x∼p_g(x)}[log(1−D(x))]. (2.2)

For a fixed G, we can then find the D⋆ that maximizes the loss function:

 D⋆_G=argmax_D ∫ p_r(x)log(D(x))+p_g(x)log(1−D(x))dx. (2.3)

Since this is the same as maximizing the integrand for every x, the maximum is obtained where the first derivative with respect to D(x) is null, the second derivative being negative for all x's:

 p_r(x)/D(x)−p_g(x)/(1−D(x))=0 ⟹ D⋆_G=p_r(x)/(p_r(x)+p_g(x)). (2.4)

By plugging this result into the objective function, we have:

 V(D⋆,G)=∫ p_r(x)log(p_r(x)/(p_r(x)+p_g(x)))+p_g(x)log(p_g(x)/(p_r(x)+p_g(x)))dx. (2.5)

The general aim of the process is to have a Generator mimicking perfectly the true distribution. Let us set p_g=p_r. It implies that D⋆_G=1/2. The objective function becomes:

 V(D⋆,G)=∫ p_r(x)log(1/2)+p_g(x)log(1/2)dx. (2.6)

It is easy to see that V(D⋆,G)=−2log(2) in that case, but we would like to do more. The aim is to prove that this minimum can be reached for only one G.

 V(D⋆,G)=∫ p_r(x)log(p_r(x)/(p_r(x)+p_g(x)))+p_g(x)log(p_g(x)/(p_r(x)+p_g(x)))dx
 =∫ (log(2)−log(2))p_r(x)+p_r(x)log(p_r(x)/(p_r(x)+p_g(x)))+(log(2)−log(2))p_g(x)+p_g(x)log(p_g(x)/(p_r(x)+p_g(x)))dx
 =−log(2)∫ p_r(x)+p_g(x)dx+∫ p_r(x)[log(2)+log(p_r(x)/(p_r(x)+p_g(x)))]+p_g(x)[log(2)+log(p_g(x)/(p_r(x)+p_g(x)))]dx
 =−2log(2)+∫ p_r(x)log(p_r(x)/((p_r(x)+p_g(x))/2))dx+∫ p_g(x)log(p_g(x)/((p_r(x)+p_g(x))/2))dx (2.7)

By definition of the Kullback-Leibler divergence (we will come back later to the Kullback-Leibler and Jensen-Shannon divergences), we have:

 V(D⋆,G)=−2log(2)+D_KL(p_r(x) ‖ (p_r(x)+p_g(x))/2)+D_KL(p_g(x) ‖ (p_r(x)+p_g(x))/2). (2.8)

The Kullback-Leibler divergence is non-negative, so we confirm that −2log(2) is a global minimum.

And by definition of the Jensen-Shannon divergence, we obtain:

 V(D⋆,G)=−2log(2)+2·D_JS(p_r(x) ‖ p_g(x)). (2.9)

The Jensen-Shannon divergence is 0 if, and only if, p_r=p_g.

However, as mentioned in the original paper, this loss function may not provide a gradient large enough for G to learn. Indeed, 'Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1−D(G(z))) saturates.' (Goodfellow et al., 2014, p. 3) Since it may not be obvious and it is not shown in the original paper, the proof is as follows:

 ∇_θL_G=∇_θE_{z∼p_z(z)}[log(1−D(G_θ(z)))]=−E_{z∼p_z(z)}[(1/(1−D(G_θ(z)))) ∂D(G_θ(z))/∂G_θ(z) ∂G_θ(z)/∂θ]=E_{x∼p_g(x)}[∇_xD(x)∇_θG_θ(z)/(D(x)−1)]. (2.10)

If D is a perfect Discriminator, then D(x)=0 for x∼p_g, so that:

 lim_{D→D⋆}D(x)=0 (2.11)
 lim_{D→D⋆}∇_θD(G_θ(z))=0. (2.12)

To prevent this issue, the authors propose to maximize log(D(G(z))) instead of minimizing log(1−D(G(z))).

#### 2.1.3 The Generator

The second part of the proposition shows that p_g, the distribution implicitly defined by G, converges to p_r. As V(G,D⋆) is convex in p_g with a unique global optimum, with sufficiently small updates, p_g converges to p_r.

However, in practice, G represents only a limited family of functions, the set of functions G_θ parametrized by the weights θ. Consequently, we are optimizing a limited family of functions through θ. Moreover, we are not optimizing p_g itself but θ, and the objective is not a convex function of θ, so we have no guarantee of reaching the global minimum. These limitations contribute to making GANs hard to train and unstable.

To give a clearer idea of how GANs work in practice, we present a pseudo-code version of the algorithm.
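As a deliberately tiny sketch of the training algorithm, here is a one-dimensional toy problem with a linear Generator, a logistic Discriminator and hand-derived gradients (all settings are made up; in a real GAN both players would be neural networks trained by backpropagation). The Generator uses the non-saturating loss log D(G(z)) discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real data ~ N(3, 1); Generator G(z) = mu + s*z; Discriminator D(x) = sigmoid(a*x + b)
mu, s = 0.0, 1.0     # Generator parameters (start far from the data)
a, b = 0.1, 0.0      # Discriminator parameters
lr, m = 0.05, 64     # learning rate and mini-batch size

for step in range(2000):
    x = rng.normal(3.0, 1.0, m)   # mini-batch of real samples
    z = rng.normal(0.0, 1.0, m)   # noise fed to the Generator
    g = mu + s * z                # generated samples
    # One ascent step for D on log D(x) + log(1 - D(G(z)))
    dx, dg = sigmoid(a * x + b), sigmoid(a * g + b)
    a += lr * (np.mean((1 - dx) * x) - np.mean(dg * g))
    b += lr * (np.mean(1 - dx) - np.mean(dg))
    # One ascent step for G on the non-saturating objective log D(G(z))
    dg = sigmoid(a * (mu + s * z) + b)
    mu += lr * np.mean((1 - dg) * a)
    s += lr * np.mean((1 - dg) * a * z)
```

After training, the Generator's mean has drifted toward the mean of the real data, while the Discriminator ends up unable to separate the two distributions.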

#### 2.1.4 Instability and Mode collapse

The main problem with GANs is that they are known to be hard to train. The problems appear when one of the networks is too strong for the other. If the Discriminator becomes too strong, meaning its classifying correctly with high confidence all the observations, the Generator would not be able to learn anything. Indeed, if is close to one, saturates and its gradient vanish. As seen above, the authors of the original paper propose to prevent it by maximizing log(D(G(z))) instead of minimizing . By changing the sign of the new expression the objective becomes again a minimization problem. The new loss function is then:

 $\max_G \min_D V(D,G)=\mathbb{E}_{x\sim p_r(x)}\big[\log D(x)\big]-\mathbb{E}_{z\sim p_z(z)}\big[\log D(G(z))\big]$ (2.15)

The strength of the Generator is also a problem, for the reason known as mode collapse. If G learns too fast, it learns to produce data where the Discriminator is the most likely to be fooled, for instance, the mode of the objective function for a fixed D. The Discriminator then progressively learns to classify this mode as fake. When D is good enough at predicting fake for this mode, the Generator finds another weakness in the Discriminator and collapses to this new mode, until the Discriminator adjusts and predicts it as fake. The Generator comes back to the initial state and the same process occurs again. Even if a full mode collapse is rare, partial ones, happening only on a part of the distribution, are very frequent.

### 2.2 GANs variations

Addressing this issue has driven the production of many propositions to fix it. The field is new and fast-growing. In addition to the Standard GAN (SGAN), which we just presented, there exists a long list of alternative loss functions, each of them bringing a new name for the GAN. It is therefore hard to be exhaustive because there are so many of them. We present here only the versions we have worked with: the Wasserstein GAN (WGAN) and its variation with Gradient Penalty (WGAN-GP), the Relativistic Standard GAN (RSGAN), and its variation, the Relativistic average Standard GAN (RaSGAN).

#### 2.2.1 Wasserstein GAN (WGAN)

The Wasserstein GAN (WGAN) is often seen as the state of the art. It was presented by Arjovsky et al. (2017). They propose to change the loss function. For them, the problem with the standard GAN’s loss function is that the Generator gradient has a high variance when the Discriminator gets close to optimality. The proposed loss function relies on the Wasserstein distance, also called Earth-Mover distance. The idea is that the distance between two distributions is the minimal amount of density (thought of as earth) to move so that the two distributions match. Formally, the Wasserstein distance is written:

 $W(p,q)=\inf_{\gamma\in\Pi(p,q)}\iint \lVert x-y\rVert\,\gamma(x,y)\,dx\,dy=\inf_{\gamma\in\Pi(p,q)}\mathbb{E}_{(x,y)\sim\gamma}\big[\lVert x-y\rVert\big]$

where Π(p, q) denotes the set of all joint distributions γ whose marginals are respectively p and q. The above formulation is intractable. The authors of the original paper therefore propose a transformation using the Kantorovich-Rubinstein duality. The detail of the transformation is beyond the scope of this work; for details about it, Vincent Herrmann wrote a convincing explanation (Herrmann, 2017). The result is that the former formulation can also be expressed as follows:

 $W(p_r,p_g)=\frac{1}{K}\sup_{\lVert f\rVert_L\le K}\mathbb{E}_{x\sim p_r}\big[f(x)\big]-\mathbb{E}_{x\sim p_g}\big[f(x)\big]$

The new formulation makes the condition ∥f∥_L ≤ K appear, meaning that f should be K-Lipschitz continuous. A real function f is said to be K-Lipschitz if there is a real constant K ≥ 0 such that

 $|f(x_1)-f(x_2)|\le K\,|x_1-x_2| \quad \forall x_1,x_2\in\mathbb{R}.$ (2.16)


In other words, for differentiable functions, the slope should not be larger than K.
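As a quick numerical illustration of the definition (a small helper of our own, not part of any GAN library), the smallest such K can be estimated from the slopes over sampled pairs of points:

```python
import numpy as np

def lipschitz_estimate(f, xs):
    """Estimate the smallest K with |f(x1)-f(x2)| <= K|x1-x2|
    over all pairs of the sample points xs."""
    xs = np.asarray(xs, dtype=float)
    ys = f(xs)
    dx = np.abs(xs[:, None] - xs[None, :])
    dy = np.abs(ys[:, None] - ys[None, :])
    mask = dx > 0
    return float(np.max(dy[mask] / dx[mask]))

xs = np.linspace(-2.0, 2.0, 401)
k_linear = lipschitz_estimate(lambda x: 2.0 * x + 1.0, xs)  # 2x + 1 is 2-Lipschitz
k_tanh = lipschitz_estimate(np.tanh, xs)                    # tanh is 1-Lipschitz
print(k_linear, k_tanh)
```

The estimate recovers K = 2 exactly for the linear function and a value just below 1 for tanh, whose maximal slope is 1 at the origin.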

The idea is to constrain the Discriminator to belong to the family of K-Lipschitz continuous functions. It is then possible to rewrite the loss function of the Discriminator as a Wasserstein distance. Assuming that f_θ is K-Lipschitz, the loss function becomes:

 $L(p_r,p_g)=W(p_r,p_g)=\max_\theta \mathbb{E}_{x\sim p_r}\big[f_\theta(x)\big]-\mathbb{E}_{z\sim p_z}\big[f_\theta(G(z))\big]$ (2.17)

So the Discriminator is not a classifier anymore. It is rather a function required to compute the Wasserstein distance. To emphasize this change, the Discriminator is renamed the Critic. The issue is now to keep f_θ a K-Lipschitz continuous function so that the above developments hold. The trick used is quite simple. It consists of clamping the weights after each update so that the parameter space is compact. Thus, f_θ is bounded in such a way that the K-Lipschitz continuity is preserved. The weights are clamped to an interval [−c, c], where c is a small value of the order of 10⁻². According to the authors themselves, ’weight clipping is a terrible way to enforce a Lipschitz constraint. (…) However, we do leave the topic of enforcing Lipschitz constraints in a neural network setting for further investigation, and we actively encourage interested researchers to improve on this method.’ (Arjovsky et al., 2017, p. 7).
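In code, the clipping trick is a one-liner applied to every Critic parameter after each optimizer step. A minimal sketch, with NumPy arrays standing in for the weight tensors and c = 0.01 as an illustrative clipping bound:

```python
import numpy as np

def clip_weights(params, c=0.01):
    """WGAN-style weight clipping: clamp every Critic parameter to [-c, c]
    after each update so that the parameter space stays compact."""
    return [np.clip(w, -c, c) for w in params]

# Hypothetical Critic parameters (a weight matrix and a bias vector).
params = [np.array([[0.5, -0.003], [0.02, -0.9]]), np.array([0.001, 0.2])]
params = clip_weights(params, c=0.01)
print(params[0])  # every entry now lies in [-0.01, 0.01]
```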

The authors of (Gulrajani et al., 2017) followed this recommendation and proposed a penalization of the gradient to ensure that the Lipschitz constraint holds. Instead of clipping the weights, they penalize the norm of the gradient when it takes values far from one. To do that, they define x̂ as a point uniformly sampled along straight lines between pairs of points sampled from the real distribution p_r and the generated distribution p_g, which implicitly defines a distribution p_x̂ such that x̂ ∼ p_x̂. The new loss function with gradient penalty becomes:

 $L=\mathbb{E}_{x\sim p_r}\big[f_\theta(x)\big]-\mathbb{E}_{z\sim p_z}\big[f_\theta(G(z))\big]+\lambda\,\mathbb{E}_{\hat{x}\sim p_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x})\rVert_2-1)^2\big]$ (2.18)

where λ is a new hyperparameter controlling the strength of the constraint. The authors of the original paper, dealing with high-dimensional inputs, propose to use λ = 10. Our experiments showed that this value has to be tuned for smaller inputs, and we tuned it accordingly. (This is typically the kind of step that seems of little importance but is in practice very time-consuming, because the algorithm does not work without a good value; in particular, the first time we implemented a WGAN-GP, it was hard to say where the problem came from.)
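The penalty term of (2.18) can be sketched as follows. To keep the example free of automatic differentiation, we use a toy linear Critic whose input gradient is known in closed form; in a real implementation the gradient at x̂ comes from the framework’s autograd. All names here are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear Critic f(x) = w . x, whose gradient w.r.t. its input is simply w.
w = np.array([0.6, 0.8])
critic_grad = lambda x: np.broadcast_to(w, x.shape)

def gradient_penalty(x_real, x_fake, lam=10.0):
    # x_hat: uniform samples along the straight lines between real/fake pairs.
    eps = rng.uniform(size=(len(x_real), 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    grad_norm = np.linalg.norm(critic_grad(x_hat), axis=1)
    return lam * np.mean((grad_norm - 1.0) ** 2)

x_real = rng.normal(0.0, 1.0, size=(64, 2))
x_fake = rng.normal(2.0, 1.0, size=(64, 2))
gp = gradient_penalty(x_real, x_fake)
print(gp)  # ~0 here: ||w|| = 1, so the gradient norm is already 1 everywhere
```

Because this toy Critic has unit gradient norm everywhere, the penalty vanishes; any departure of ‖∇D‖ from 1 would be charged quadratically, scaled by λ.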

WGANs solve the problem of vanishing or high-variance gradients. They also offer a convincing way of computing distances between distributions for the kind of issue they deal with. The theory is convincing. The problem is that it is not clear that they perform better than the classic GAN. In November 2017, a team from Google Brain published a paper called ’Are GANs created equal?’, comparing 8 different GANs, including the 2 original versions (with and without the non-saturating transformation) and the two versions of the WGAN presented here. They conclude: ’We did not find evidence that any of the tested algorithms consistently outperform the non-saturating GAN introduced in [the original paper]’.

#### 2.2.2 RSGAN and RaSGAN

Jolicoeur-Martineau (2018) proposed a new class of GANs, called relativistic GANs. The idea comes from the argument ’that the key missing property of SGAN is that the probability of real data being real should decrease as the probability of fake data being real increase’ (p. 3). This argument is based on the fact that the Discriminator can use the a priori knowledge that half of the examples are real and half are fake.

Just like for the WGAN, we define the non-transformed output of the Discriminator as C(x). The standard Discriminator is then defined as D(x) = σ(C(x)). We then define (x_r, x_f) as a pair of a real and a fake data point. A relativistic Discriminator can be defined as D(x_r, x_f) = σ(C(x_r) − C(x_f)), where σ is the sigmoid function. Following Jolicoeur-Martineau (2018, p. 5), ’we can interpret this modification in the following way: the Discriminator estimates the probability that the given real data is more realistic than a randomly sampled fake data’. The opposite relation does not need to be included in the loss function since σ(C(x_f) − C(x_r)) = 1 − σ(C(x_r) − C(x_f)). The loss functions of a relativistic GAN can be written as follows:

 $L_D=-\mathbb{E}_{(x_r,x_f)\sim(P,Q)}\big[\log\big(\sigma(C(x_r)-C(x_f))\big)\big]$ (2.19)
 $L_G=-\mathbb{E}_{(x_r,x_f)\sim(P,Q)}\big[\log\big(\sigma(C(x_f)-C(x_r))\big)\big]$ (2.20)

The principle of the relativistic GAN can be applied to a broad family of GANs by replacing the sigmoid with any other appropriate function.

The Relativistic average Standard GAN (RaSGAN) is very similar to the RSGAN, except that the non-transformed output of the Discriminator is compared not to a single data point of the other type but to the batch average. We can interpret it as ’the probability that the input data is more realistic than a randomly sampled data of the opposing type (fake if the input is real or real if the input is fake)’ (Jolicoeur-Martineau, 2018, p. 6). The loss functions for the Discriminator and the Generator have the same expression. It is a bit longer than that of the RSGAN, since the two parts of the expression have to be included:

 $L=-\mathbb{E}_{x_r\sim P}\big[\log\big(\sigma(C(x_r)-\mathbb{E}_{x_f\sim Q}C(x_f))\big)\big]-\mathbb{E}_{x_f\sim Q}\big[\log\big(1-\sigma(C(x_f)-\mathbb{E}_{x_r\sim P}C(x_r))\big)\big]$ (2.21)
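As a sanity check, (2.21) is easy to compute directly from the raw Critic outputs of a batch (the function name and inputs below are ours). When the Critic cannot tell the two batches apart, every sigmoid equals 1/2 and the loss is 2 ln 2:

```python
import numpy as np

def sigma(v):
    return 1.0 / (1.0 + np.exp(-v))

def rasgan_loss(c_real, c_fake):
    """Eq. (2.21): each Critic output is compared to the batch
    average of the opposing type."""
    term_real = -np.mean(np.log(sigma(c_real - np.mean(c_fake))))
    term_fake = -np.mean(np.log(1.0 - sigma(c_fake - np.mean(c_real))))
    return term_real + term_fake

# Identical Critic outputs on both batches: every sigmoid is 0.5.
c_real = np.array([0.3, 0.3, 0.3])
c_fake = np.array([0.3, 0.3, 0.3])
loss = rasgan_loss(c_real, c_fake)
print(loss)  # 1.3862943611198906 = 2 ln 2
```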

There exists a broad class of GANs with different objective functions. We presented here only some of the most popular and ignored others. The choice depends on popularity and presumed efficiency, but we have to admit the selection is also somewhat arbitrary. Moreover, as the number of papers published about GANs is high, it is hard to stay up to date. We will come back later to the specific architectures and hyperparameters used in the experiments section.

### 2.3 Conditional GAN

Recall that a GAN aims to ”learn” the distribution p(x) of a probabilistic model from a sample, and consequently to sample from this approximation. It is also possible to build a conditional version of it. Instead of ”learning” p(x), the generative model will learn p(x|y).

The idea was introduced shortly after the original GAN by Mirza and Osindero (Mirza and Osindero, 2014). The task even looks easier, in the sense that both the Generator and the Discriminator have more information. In practice, the change is straightforward. Instead of giving just noise, we give the pair (z, y) to the Generator and the pair (x, y) to the Discriminator. The y’s come from the dataset and are randomly shuffled to form batches. The Generator ”learns” to produce the joint distribution and the Discriminator discriminates real from fake knowing the couple (x, y). In other words, the Discriminator also distinguishes real and fake using the joint distribution.

In detail, the Discriminator learns to classify samples from the joint distribution of (x, y), so that the classification between generated x’s and real x’s depends on y. In return, the Generator has to produce a realistic joint distribution of (x, y).

Formally, the transformation is straightforward:

 $\min_G \max_D V(D,G)=\mathbb{E}_{x\sim p_r(x)}\big[\log D(x,y)\big]+\mathbb{E}_{z\sim p_z(z)}\big[\log\big(1-D(G(z,y))\big)\big]$ (2.22)

The demonstration is exactly equivalent to the case without condition.

The Generator takes z and y as inputs and produces samples simulating p(x|y). For more clarity, we propose graphical representations of the Generator and the Discriminator we are going to use, respectively in Figure 2.2 and Figure 2.3.

#### 2.3.1 Conditional GAN for regression

As introduced in the first chapter, this conditional distribution corresponds to the probability distribution of the response for a new observation in a regression. Therefore, it can be used as the predictive distribution of a regression. However, the explicit probability distribution remains unknown.

Once the Conditional GAN has converged, it produces samples from the conditional distribution such that G(z|y) ∼̇ p(x|y). In other words, the Generator acts like an implicit distribution function: it generates samples from the approximate distribution. So, the empirical distribution can be used to estimate properties of the true conditional distribution.

Practically, we will use the trained Generator as follows. It takes the conditional value (the new observation), say y_new, for which we would like an estimate, and random noise z drawn from the normal distribution. By sampling a large number of z’s and keeping the condition fixed, we obtain a large sample of data following the approximated conditional distribution.

From this sample, we can theoretically estimate any kind of statistic about p(x|y), including moments of arbitrarily large order and all the quantiles. Indeed, GANs can be seen as a class of Monte Carlo samplers. Empirically, GANs are related to Markov processes, but the mathematical relation between them is beyond the scope of this work.
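A minimal sketch of this procedure, with a hypothetical closed-form ”trained” Generator standing in for the real network (here G(z, y) = 2y + 0.1z, so the implied predictive distribution is N(2y, 0.01); everything in this snippet is ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for a trained conditional Generator.
def generator(z, y):
    return 2.0 * y + 0.1 * z

y_new = 0.4                        # the new observation we condition on
z = rng.normal(0.0, 1.0, 100_000)  # sample the noise, keep y fixed
x_samples = generator(z, y_new)

# Any statistic of the predictive distribution can now be estimated.
print(x_samples.mean())                     # approx 2 * 0.4 = 0.8
print(x_samples.std())                      # approx 0.1
print(np.quantile(x_samples, [0.05, 0.95]))
```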

It is interesting to notice that GANs can produce conditional distributions in both directions, i.e. p(y|x) as well as p(x|y). In other words, conditional GANs can produce predictions for classical or inverse regression. Moreover, the dimensions of x and y are restricted in neither direction. In principle, the use of conditional GANs for regression tasks is very flexible.

We call this use of conditional GANs for regression purposes adversarial regression. The next part of this work consists of testing how well adversarial regression performs in approximating statistics of the true density for a new observation.

### 3.1 The experiments

The aim of this work is to test different kinds of GANs and parameters on regression tasks. We define three simple ‘true’ models to this end. We sample from these models a number n of events (n is part of the factors tested) and we run the adversarial regression on the samples. We approximate the distribution for a new observation, conditioning on a given value, thanks to the conditional GAN, and we compare it to the true distribution. We chose three bivariate scenarios, which are much easier to analyze. In particular, the predictive distribution depends on only one variable. Of course, this is a strong limitation of these experiments. Further experiments in higher dimensions have to be conducted to confirm the results we have found.

Let us have a look first at the three models. The first one corresponds to the classical linear model with normal errors:

 $Y=aX+b+\epsilon,\quad \epsilon\sim N(0,\,0.0025).$ (3.1)

The second model is still linear but with heteroscedasticity:

 $Y=aX+b+\epsilon,\quad \epsilon\sim N(0,\,0.01\,x^2).$ (3.2)

The third model is a mix of a linear trend, a sinusoidal oscillation and a homoscedastic error term. It is defined as follows:

 $Y=aX+c\,\sin(dX)+\epsilon,\quad \epsilon\sim N(0,\,0.0025).$ (3.3)

X is considered the independent variable. We generated it uniformly on the interval [0, 1]. For these experiments, the constants a, b, c and d were held fixed. Figures 3.1 to 3.3 show 10,000 realizations of these models.
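The three models are straightforward to simulate. In the sketch below the constants a, b, c, d are placeholders of our own, not the values used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(42)
a, b, c, d = 0.5, 0.1, 0.1, 20.0   # placeholder constants
n = 10_000
X = rng.uniform(0.0, 1.0, n)

# Model (3.1): linear, homoscedastic noise, sd = sqrt(0.0025) = 0.05.
Y1 = a * X + b + rng.normal(0.0, 0.05, n)
# Model (3.2): linear, heteroscedastic noise, sd = 0.1 * x.
Y2 = a * X + b + rng.normal(0.0, 0.1 * X)
# Model (3.3): linear trend plus sinusoidal oscillation, homoscedastic noise.
Y3 = a * X + c * np.sin(d * X) + rng.normal(0.0, 0.05, n)
```

The heteroscedastic model is visible in the simulation: the residual spread of Y2 grows with X, while Y1 and Y3 have constant noise.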

On each of these models, we can perform two kinds of experiments. First, the classical regression version, consisting of predicting y for a given x. But we can also consider the inverse problem and look for the distribution of x given y, which is known as inverse regression or the calibration problem. We will present results for both cases. However, most of the experiments were run for inverse regression, because the inverse regression problem offers more varied situations, with a broader range of complexity. Indeed, the first experiment is exactly identical in both directions, the normal distribution being symmetric around zero. In the two other cases, however, x given y has a different and often unknown distribution, so that it cannot be found analytically. To circumvent this issue, the true distribution is estimated by sampling and slicing. Practically, we sampled a large number of events from the true distribution and estimated the true conditional distribution with all the events contained in a thin slice around the conditioning value. We also stored the number of events from the true distribution in order to generate the same number of ’fake’ data.

We arbitrarily selected four values to condition on: 0.1, 0.4, 0.7 and 1. They are the same in all experiments, whether we condition on x or y. These values ensure measures over a broad range of the distribution. In particular, they include various variances in the case of heteroscedasticity and produce diverse shapes for the inverse regression of the third experiment. Because the extreme values lie on the edge of the distribution, we note that they produce particular situations. When predicting y for the largest x, the prediction is harder because there are no larger x’s. In the inverse case, the true distribution falls suddenly to zero, except in the third experiment. For instance, Figures 3.8 and 3.9 show the true conditional distributions of x for experiments 2 and 3. As explained, the true densities plotted are estimated by sampling and slicing around the selected y’s.
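The sampling-and-slicing estimate can be sketched as follows, here for model (3.1) with placeholder coefficients of our own. For this model the sliced x’s concentrate around (y0 − b)/a:

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 0.5, 0.1                    # placeholder coefficients for model (3.1)
n, delta, y0 = 1_000_000, 0.005, 0.3

X = rng.uniform(0.0, 1.0, n)
Y = a * X + b + rng.normal(0.0, 0.05, n)

# Keep the x's whose y falls in a thin slice around y0: they follow
# (approximately) the inverse-regression distribution p(x | y = y0).
x_slice = X[np.abs(Y - y0) < delta]
print(len(x_slice))        # also the number of 'fake' points to generate
print(x_slice.mean())      # approx (y0 - b) / a = 0.4
```

The slice half-width delta trades bias (too wide) against Monte Carlo noise (too narrow); the value above is illustrative.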

### 3.2 The metrics

Measuring the distance between two samples can be complicated in a high-dimensional space. Fortunately, there are plenty of different measures in the uni-dimensional case. The distance between the real and the generated means is a rough and obvious measure. As we are interested in the whole distribution, it is not enough. The distances between the next moments, variance, skewness and kurtosis, refine the comparison. We are going to use these distances since they are common and frequently used. They indicate whether the approximation is roughly correct. However, they are quite complicated to interpret. For instance, it is awkward to give an interpretation when the higher moments are well estimated but the lower ones are not. We would prefer a more global measure. Fortunately, there are also good metrics giving a general distance or pseudo-distance between samples. We have considered three of them: the Kolmogorov-Smirnov distance, the KL divergence and the JS divergence. Let us have a look at these three measures in detail.

#### 3.2.1 The KS distance

The Kolmogorov-Smirnov statistic, or distance, measures the supremum of the absolute difference between the empirical cumulative distribution functions. Formally, we write:

 $D^{KS}_{n,m}=\sup_x\big|F_{n,P}(x)-F_{m,Q}(x)\big|.$ (3.4)

The Kolmogorov-Smirnov statistic is non-parametric, known above all for its use in the Kolmogorov-Smirnov test, which is a goodness-of-fit test.
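The two-sample version used here can be computed by evaluating both empirical CDFs on the pooled sample (a small self-contained helper of our own; `scipy.stats.ks_2samp` computes the same statistic):

```python
import numpy as np

def ks_distance(sample_p, sample_q):
    """Empirical KS statistic: sup |F_n(x) - F_m(x)|, evaluated on the
    pooled sample points (where the supremum is always attained)."""
    sample_p, sample_q = np.sort(sample_p), np.sort(sample_q)
    grid = np.concatenate([sample_p, sample_q])
    f_p = np.searchsorted(sample_p, grid, side="right") / len(sample_p)
    f_q = np.searchsorted(sample_q, grid, side="right") / len(sample_q)
    return float(np.max(np.abs(f_p - f_q)))

d_same = ks_distance([1, 2, 3], [1, 2, 3])
d_disjoint = ks_distance([0, 0, 0], [5, 5, 5])
print(d_same, d_disjoint)  # 0.0 for identical samples, 1.0 for disjoint supports
```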

#### 3.2.2 The KL divergence

The Kullback-Leibler divergence is a measure of how close a distribution is to another. It is defined as follows:

 $D_{KL}(P\,\|\,Q)=\int_{-\infty}^{\infty} p(x)\ln\!\Big(\frac{p(x)}{q(x)}\Big)\,dx.$ (3.5)

However, in our case, we have a closed expression for neither p nor q. To circumvent this issue, we use the discretized version of the divergence to approximate the true divergence. In practice, we divide the output space into 200 bins over a fixed interval (the distribution outside this interval is almost zero), approximate the continuous distribution through the probability mass function of those 200 bins, and compute the discrete Kullback-Leibler divergence, defined as follows:

 $D_{KL}(P\,\|\,Q)=\sum_{x\in\mathcal{X}} P(x)\ln\!\Big(\frac{P(x)}{Q(x)}\Big).$ (3.6)

We remark that the Kullback-Leibler divergence takes values from zero to infinity. We note also that it is not properly a distance, since it is asymmetric: $D_{KL}(P\,\|\,Q)\neq D_{KL}(Q\,\|\,P)$.
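A sketch of the binned estimator; the interval, the number of bins and the small smoothing constant (which avoids division by empty bins) are implementation choices of ours:

```python
import numpy as np

def discrete_kl(sample_p, sample_q, bins=200, lo=-1.0, hi=2.0, eps=1e-12):
    """Approximate D_KL(P||Q) from two samples by binning both
    on a common grid and applying Eq. (3.6)."""
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(sample_p, bins=edges)
    q, _ = np.histogram(sample_q, bins=edges)
    p = p / p.sum() + eps   # smoothing keeps the log finite on empty bins
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
s = rng.normal(0.5, 0.1, 50_000)
t = rng.normal(0.5, 0.1, 50_000)
kl_same = discrete_kl(s, s)
kl_close = discrete_kl(s, t)
print(kl_same, kl_close)  # 0.0, then a small positive value
```

By the log-sum inequality the estimate is always non-negative, and it is exactly zero for identical samples.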

#### 3.2.3 The JS divergence

The Jensen-Shannon divergence can be interpreted as the mean of the Kullback-Leibler divergences between each distribution and the mixture of the two. The definition is:

 $D_{JS}=\frac{1}{2}D_{KL}(P\,\|\,M)+\frac{1}{2}D_{KL}(Q\,\|\,M),$ (3.7)

where $M=\frac{1}{2}(P+Q)$ is the mixture of the two distributions.
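Assuming the standard mixture M = (P + Q)/2, the JS divergence follows directly from the discrete KL on binned probability vectors. It is symmetric, and bounded above by ln 2:

```python
import numpy as np

def kl(p, q):
    # Discrete KL on probability vectors; 0 * log(0/...) is taken as 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)              # the mixture M = (P + Q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
js_same = js(p, p)
js_pq = js(p, q)
print(js_same, js_pq)  # 0.0, then 0.5 * ln 2 = 0.3465...
```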