## I Introduction

Generative adversarial networks (GANs) are an example of generative models. Specifically, the model takes a training set, consisting of samples drawn from a probability distribution, and learns how to represent an estimate of that distribution. GANs focus primarily on sample generation, but it is also possible to design GANs that can estimate the probability distribution explicitly [1]. The subject has recently received considerable attention, especially because it has many practical applications in various fields. For instance, GANs can be used for medical purposes, e.g., to improve the diagnostic performance of low-dose computed tomography [2], or for polishing images taken in unfavourable weather conditions (such as rain or snow) [3]. Other applications range from speech and language recognition to playing chess and vision computing [4].

The idea behind GANs is to train the generative model via an adversarial process, in which the opponent is simultaneously trained as well. Therefore, there are two neural network classes: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data rather than from the generator. The generative model can be thought of as a team of counterfeiters, trying to produce fake currency, while the discriminative model, i.e., the police, tries to detect the counterfeit money. The competition drives both teams to improve their methods until the counterfeit currency is indistinguishable from the original. To succeed in this game, the counterfeiter must learn to make money that is indistinguishable from original currency, and the generator network must learn to create samples that are drawn from the same distribution as the training data [5].

Since each agent's payoff depends on the variables of the other agent, this problem can be described as a game. Therefore, these networks are called adversarial. However, GANs can also be thought of as a game with cooperative players since they share information with each other [1]. Since there are only the generator and the discriminator, the problem is an instance of a two-player game. Moreover, depending on the cost functions, it can also be considered a zero-sum game. From a mathematical perspective, the class of games that suits the GAN problem is that of stochastic Nash equilibrium problems (SNEPs), where each agent aims at minimizing its expected-value cost function, which is approximated via a number of samples of the random variable.

Given their connection with robust optimization and game theory, GANs have received theoretical attention as well, both for modelling as Nash equilibrium problems [6, 7] and for designing algorithms that improve the training process [8, 7].

From a game-theoretic perspective, an elegant approach to compute a stochastic Nash equilibrium (SNE) is to cast the problem as a stochastic variational inequality (SVI) [9] and to use an iterative algorithm to find a solution. The two most used methods for SVIs studied in the literature for GANs [8] are the gradient method [10], known in monotone operator theory as the forward–backward (FB) algorithm [11], and the extragradient (EG) method [12, 13]. The iterates of the FB algorithm involve an evaluation of the pseudogradient and a projection step. These iterates are known to converge if the pseudogradient mapping is cocoercive or strongly monotone [14, 15]. However, such technical assumptions are quite strong if we consider that in GANs the mapping is rarely monotone. In contrast, the EG algorithm converges for merely monotone operators but requires two projections onto the local constraint set per iteration, thus making the algorithm slow and computationally expensive. Other algorithms for VIs that can be applied to GANs can be found in [8].
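To make the difference between the two methods concrete, the following minimal sketch runs both on a toy unconstrained bilinear zero-sum game, in which one player minimizes $xy$ over $x$ and the other maximizes it over $y$. The game, the step size and the function names are illustrative assumptions and are not taken from the paper. The pseudogradient of this game is monotone but neither strongly monotone nor cocoercive, which is the regime where FB is expected to fail.

```python
import math

def pseudogradient(x, y):
    # Toy game (assumption): player 1 minimizes x*y over x, player 2 minimizes -x*y over y.
    # Stacking the partial gradients gives F(x, y) = (y, -x).
    return y, -x

def forward_backward(x, y, step, iters):
    # One pseudogradient evaluation per iteration (no projection: unconstrained toy game).
    for _ in range(iters):
        fx, fy = pseudogradient(x, y)
        x, y = x - step * fx, y - step * fy
    return x, y

def extragradient(x, y, step, iters):
    # Two pseudogradient evaluations per iteration: extrapolate, then update.
    for _ in range(iters):
        fx, fy = pseudogradient(x, y)
        xh, yh = x - step * fx, y - step * fy
        fx, fy = pseudogradient(xh, yh)
        x, y = x - step * fx, y - step * fy
    return x, y

fb = forward_backward(1.0, 1.0, step=0.3, iters=200)
eg = extragradient(1.0, 1.0, step=0.3, iters=200)
print("FB distance from the equilibrium (0, 0):", math.hypot(*fb))  # grows
print("EG distance from the equilibrium (0, 0):", math.hypot(*eg))  # shrinks
```

On this example each FB step multiplies the distance from the equilibrium by $\sqrt{1+\lambda^2}$, while each EG step multiplies it by $\sqrt{1-\lambda^2+\lambda^4}<1$ for a step size $\lambda<1$, at the price of a second pseudogradient evaluation.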

In this paper we propose a stochastic relaxed FB (SRFB) algorithm, inspired by [16], for GANs. A first analysis of the algorithm for stochastic (generalized) NEPs is currently under review [17]. The SRFB requires a single projection and a single evaluation of the pseudogradient mapping per iteration. The advantage of our proposed algorithm is that it is less computationally demanding than the EG algorithm even though it converges under the same assumptions. Indeed, we prove its convergence under mere monotonicity of the pseudogradient mapping when a very large number of samples is available. Alternatively, if only a finite number of samples is accessible, we prove that averaging can be used to converge to a neighbourhood of the solution.

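Notation. Let $\mathbb{R}$ indicate the set of real numbers and let $\bar{\mathbb{R}}=\mathbb{R}\cup\{+\infty\}$. $\langle\cdot,\cdot\rangle:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ denotes the standard inner product and $\|\cdot\|$ represents the associated Euclidean norm. Given $N$ vectors $x_1,\dots,x_N\in\mathbb{R}^n$, $\boldsymbol{x}:=\operatorname{col}(x_1,\dots,x_N)=[x_1^\top,\dots,x_N^\top]^\top$. For a closed set $C\subseteq\mathbb{R}^n$, the mapping $\operatorname{proj}_C:\mathbb{R}^n\to C$ denotes the projection onto $C$, i.e., $\operatorname{proj}_C(x)=\operatorname{argmin}_{y\in C}\|y-x\|$.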

## II Generative Adversarial Networks

The basic idea of generative adversarial networks (GANs) is to set up a game between two players: the generator and the discriminator. The generator creates samples that are intended to come from the same distribution as the training data. The discriminator examines the samples to determine whether they are real or fake. The generator is therefore trained to fool the discriminator. Typically, a deep neural network is used to represent the generator and the discriminator. Accordingly, the two players are denoted by two functions, each of which is differentiable both with respect to its inputs and with respect to its parameters.

The generator is represented by a differentiable function $g$, that is, a neural network class with parameter vector $x_g\in\Omega_g\subseteq\mathbb{R}^{n_g}$. The (fake) output of the generator is denoted by $g(z,x_g)$, where the input $z$ is a random noise drawn from the model prior distribution, $z\sim p_z$, that the generator uses to create the fake output [6]. The actual strategies of the generator are the parameters $x_g$ that allow $g$ to produce the fake output.

The discriminator is a neural network class as well, with parameter vector $x_d\in\Omega_d\subseteq\mathbb{R}^{n_d}$ and a single output $d(v,x_d)\in[0,1]$ that indicates the accuracy of the input $v$. We interpret the output as the probability that the discriminator $d$ assigns to an element $v$ being real. Similarly to the generator $g$, the strategies of the discriminator are the parameters $x_d$.

The problem can be cast as a two-player game or, depending on the cost functions, as a zero-sum game. Specifically, in the latter case the mappings $J_g$ and $J_d$ should satisfy the following relation

$$J_g(x_g,x_d)=-J_d(x_g,x_d). \tag{1}$$

In most cases [8, 5], the payoff of the discriminator is given by

$$J_d(x_g,x_d)=\mathbb{E}[\phi(d(\cdot,x_d))]-\mathbb{E}[\phi(d(g(\cdot,x_g),x_d))], \tag{2}$$

where $\phi:[0,1]\to\mathbb{R}$ is a measuring function (typically a logarithm [5]). The mapping in (2) can be interpreted as the distance between the real value and the fake one.

In the context of zero-sum games, the problem can be rewritten as a minmax problem

$$\min_{x_g}\max_{x_d}\ J_d(x_g,x_d). \tag{3}$$

In words, (3) means that the generator aims at minimizing the distance from the real value while the discriminator wants to maximize it, i.e., to recognize the fake data.
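As a rough illustration of the payoff described above, the sketch below estimates it by sample averages, assuming it has the form of a difference between the expected measure of the discriminator output on real data and on generated data, with a logarithm as measuring function. The logistic discriminator, the shifting generator and all names are hypothetical choices made only for this example.

```python
import math

def discriminator(v, x_d):
    # Hypothetical scalar logistic discriminator: probability that v is real.
    return 1.0 / (1.0 + math.exp(-x_d * v))

def generator(z, x_g):
    # Hypothetical generator: shifts the noise sample z by the parameter x_g.
    return z + x_g

def discriminator_payoff(real, noise, x_g, x_d, phi=math.log):
    # Sample-average estimate of E[phi(d(real))] - E[phi(d(g(z)))] (assumed form).
    real_term = sum(phi(discriminator(v, x_d)) for v in real) / len(real)
    fake_term = sum(phi(discriminator(generator(z, x_g), x_d)) for z in noise) / len(noise)
    return real_term - fake_term

real = [1.8, 2.0, 2.2]
noise = [-0.2, 0.0, 0.2]
# Poor generator: fakes are far from the real samples, so the payoff is positive.
print(discriminator_payoff(real, noise, x_g=0.0, x_d=1.0))
# Generator that reproduces the real samples: the payoff is (numerically) zero.
print(discriminator_payoff(real, noise, x_g=2.0, x_d=1.0))
```

In this toy setting the generator lowers the discriminator payoff by moving its outputs towards the real samples, which is the minimizing role it plays in the minmax formulation.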

## III Stochastic Nash equilibrium problems

In this section we formalize the two-player game in a more general form that will support our analysis. Specifically, we consider the problem as a general stochastic Nash equilibrium problem since our analysis is independent of the choice of the cost functions.

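We consider a set of two agents $\mathcal{I}=\{g,d\}$ that represents the two neural network classes. The local cost function of agent $i\in\mathcal{I}$ is defined as

$$\mathbb{J}_i(x_i,x_j)=\mathbb{E}_\xi[J_i(x_i,x_j,\xi(\omega))] \tag{5}$$

for some measurable function $J_i:\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}$, where $n=n_g+n_d$. The cost function $\mathbb{J}_i$ of agent $i\in\mathcal{I}$ depends on the local variable $x_i\in\Omega_i\subseteq\mathbb{R}^{n_i}$, the decisions of the other player $x_j$, $j\neq i$, and the random variable $\xi:\Xi\to\mathbb{R}^d$ that expresses the uncertainty. Such uncertainty arises in practice when it is not possible to have access to the exact mapping, i.e., when only a finite number of estimates are available.
$\mathbb{E}_\xi$ represents the mathematical expectation with respect to the distribution of the random variable $\xi(\omega)$ in the probability space $(\Xi,\mathcal{F},\mathbb{P})$. (From now on, we use $\xi$ instead of $\xi(\omega)$ and $\mathbb{E}$ instead of $\mathbb{E}_\xi$.) We assume that $\mathbb{E}[J_i(\boldsymbol{x},\xi)]$ is well defined for all the feasible $\boldsymbol{x}=\operatorname{col}(x_g,x_d)\in\boldsymbol{\Omega}=\Omega_g\times\Omega_d$ [18].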
For our theoretical analysis, we postulate the following assumptions on the cost function and on the feasible set which are standard in game theory [19, 18].

###### Assumption 1

For each $i,j\in\mathcal{I}$, $i\neq j$, the function $\mathbb{J}_i(\cdot,x_j)$ is convex and continuously differentiable.

###### Assumption 2

For each $i\in\mathcal{I}$, the set $\Omega_i$ is nonempty, compact and convex.

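Given the decision variables of the other agent, each player $i\in\{g,d\}$ aims at choosing a strategy $x_i$ in its feasible set $\Omega_i$ that solves its local optimization problem, i.e.,

$$\forall i\in\{g,d\}:\quad \min_{x_i\in\Omega_i}\ \mathbb{J}_i(x_i,x_j), \tag{6}$$

where $\mathbb{J}_i$ is the expected-value cost function of agent $i$ and $x_j$ is the decision of the other agent.

Given the coupled optimization problems in (6), the solution concept that we are seeking is that of stochastic Nash equilibrium (SNE) [18].

###### Definition 1

A stochastic Nash equilibrium is a collective strategy $\boldsymbol{x}^*=\operatorname{col}(x_g^*,x_d^*)\in\Omega_g\times\Omega_d$ such that, for all $i\in\{g,d\}$ and $j\neq i$,

$$\mathbb{J}_i(x_i^*,x_j^*)\le\inf\{\mathbb{J}_i(y,x_j^*)\mid y\in\Omega_i\}.$$

Thus, an SNE is a set of strategies where no agent can decrease its cost function by unilaterally deviating from its decision.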

To guarantee that a SNE exists, we make further assumptions on the cost functions [18, Ass. 1].

###### Assumption 3

For each $i\in\{g,d\}$ and for each $\xi\in\Xi$, the function $J_i(\cdot,x_j,\xi)$ is convex, Lipschitz continuous, and continuously differentiable. The function $J_i(x_i,x_j,\cdot)$ is measurable and, for each $x_j$, its Lipschitz constant $\ell_i(x_j,\xi)$ is integrable in $\xi$.

Existence of a SNE of the game in (6) is guaranteed, under Assumptions 1-3, by [18, Section 3.1] while uniqueness does not hold in general [18, Section 3.2].

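To seek a Nash equilibrium, we rewrite the problem as a stochastic variational inequality. To this aim, let us denote the pseudogradient mapping as

$$\mathbb{F}(\boldsymbol{x})=\begin{bmatrix}\mathbb{E}[\nabla_{x_g}J_g(x_g,x_d,\xi)]\\ \mathbb{E}[\nabla_{x_d}J_d(x_d,x_g,\xi)]\end{bmatrix}, \tag{7}$$

where $\boldsymbol{x}=\operatorname{col}(x_g,x_d)$ and the possibility to exchange the expected value and the pseudogradient in (7) is assured by Assumption 3. Then, the associated stochastic variational inequality (SVI) reads as

$$\langle\mathbb{F}(\boldsymbol{x}^*),\boldsymbol{x}-\boldsymbol{x}^*\rangle\ge0\quad\text{for all }\boldsymbol{x}\in\boldsymbol{\Omega}, \tag{8}$$

where $\boldsymbol{\Omega}=\Omega_g\times\Omega_d$ is the collective feasible set.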
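The sketch below illustrates the SVI viewpoint on a hypothetical toy game. The costs $J_g=a\,x_gx_d$ and $J_d=-a\,x_gx_d$, with a scalar random coefficient $a$ of mean 1, are an assumption made for this example only. The pseudogradient is estimated by averaging the stacked partial gradients over the available samples, and a candidate point is tested against the variational inequality on a grid of a box-shaped feasible set.

```python
import itertools

def pseudogradient(x, samples):
    # Sample average of the stacked partial gradients for the toy costs
    # J_g = a * x_g * x_d and J_d = -a * x_g * x_d, with `a` the random variable.
    x_g, x_d = x
    a_mean = sum(samples) / len(samples)
    return (a_mean * x_d, -a_mean * x_g)

def solves_vi(candidate, samples, radius=1.0, grid=11, tol=1e-12):
    # Check <F(x*), x - x*> >= 0 for every x on a grid of the box [-radius, radius]^2.
    f = pseudogradient(candidate, samples)
    ticks = [-radius + 2 * radius * i / (grid - 1) for i in range(grid)]
    return all(
        f[0] * (x_g - candidate[0]) + f[1] * (x_d - candidate[1]) >= -tol
        for x_g, x_d in itertools.product(ticks, ticks)
    )

samples = [0.8, 1.1, 0.9, 1.2]  # hypothetical realizations of `a`
print(solves_vi((0.0, 0.0), samples))  # True: the origin is an equilibrium
print(solves_vi((1.0, 0.0), samples))  # False: the second player can still improve
```

A grid check of this kind is only a sanity test for low-dimensional examples; it is not a substitute for the iterative algorithms discussed in the paper.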

## IV Stochastic relaxed forward–backward with averaging

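The first algorithm that we propose is inspired by [16, 17] and it is a stochastic relaxed forward–backward algorithm with averaging (aSRFB). The iterations read as in Algorithm 1.

Initialization: $x_i^0\in\Omega_i$

Iteration $k$: Agent $i$ receives $x_j^k$, $j\neq i$, then updates:

$$\bar{x}_i^k=(1-\delta)x_i^k+\delta\bar{x}_i^{k-1} \tag{9a}$$

$$x_i^{k+1}=\operatorname{proj}_{\Omega_i}\big[\bar{x}_i^k-\lambda_kF_i^{\mathrm{SA}}(x_i^k,x_j^k,\xi_i^k)\big] \tag{9b}$$

Iteration $K$: $X_i^K=\dfrac{\sum_{k=1}^{K}\lambda_kx_i^k}{\sum_{k=1}^{K}\lambda_k}$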

We note that the averaging step

$$X^K=\frac{\sum_{k=1}^{K}\lambda_kx^k}{S_K}, \tag{10}$$

where $S_K=\sum_{k=1}^{K}\lambda_k$, was first proposed for VIs in [10], and it can be implemented in an online fashion as

$$X^k=(1-\tilde\lambda_k)X^{k-1}+\tilde\lambda_kx^k, \tag{11}$$

where $\tilde\lambda_k=\lambda_k/S_k\in[0,1]$. Even if they look similar, (11) is different from (9a). Indeed, in Algorithm 1, (9a) is a convex combination of the two previous iterates $x^k$ and $\bar{x}^{k-1}$, with a fixed parameter $\delta$, while the averaging in (11) is a weighted cumulative sum over all the decision variables $x^k$ for all $k\in\{1,\dots,K\}$ with time-varying weights $\tilde\lambda_k$. The parameter $\tilde\lambda_k$ can be tuned to obtain uniform, geometric or exponential averaging [8]. The relaxation parameter $\delta$ instead should satisfy the following assumption.
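The equivalence between the averaging over all iterates and its online form can be checked numerically. The sketch below assumes that the average in (10) is the weighted mean of the iterates and that the online update (11) uses, at step $k$, the ratio between the current weight and the cumulative sum of the weights; the function names and the scalar iterates are illustrative.

```python
def batch_average(iterates, weights):
    # Weighted average over all iterates: sum_k w_k x_k / sum_k w_k.
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, iterates)) / total

def online_average(iterates, weights):
    # Running form: X_k = (1 - t_k) X_{k-1} + t_k x_k, with t_k = w_k / S_k
    # and S_k the cumulative sum of the weights up to step k.
    avg, cumulative = 0.0, 0.0
    for w, x in zip(weights, iterates):
        cumulative += w
        t = w / cumulative
        avg = (1.0 - t) * avg + t * x
    return avg

iterates = [1.0, 2.0, 4.0, 8.0]
weights = [0.5, 0.25, 0.125, 0.125]
print(batch_average(iterates, weights), online_average(iterates, weights))
```

The online form needs only the previous average and the cumulative weight, so the past iterates do not have to be stored. Equal weights give uniform averaging.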

###### Assumption 4

In Algorithm 1, $\delta\in[0,1]$.
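A minimal sketch of a relaxed forward–backward iteration with averaging is given below. It assumes the structure described in the text: a relaxation step that forms a convex combination of the current iterate and the previous relaxed point with a fixed parameter, followed by a single pseudogradient evaluation and a single projection. The box constraint, the toy bilinear oracle, the constant step size and all names are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def project_box(v, radius):
    # Euclidean projection onto the box [-radius, radius]^n.
    return [max(-radius, min(radius, c)) for c in v]

def srfb_averaged(oracle, x0, step, delta, iters, radius):
    # oracle(x) returns a (possibly noisy) pseudogradient evaluated at x.
    x, x_bar, avg = list(x0), list(x0), [0.0] * len(x0)
    for k in range(1, iters + 1):
        # Relaxation: convex combination of the current iterate and the previous x_bar.
        x_bar = [(1 - delta) * xi + delta * bi for xi, bi in zip(x, x_bar)]
        # Forward-backward step: one oracle call and one projection per iteration.
        f = oracle(x)
        x = project_box([bi - step * fi for bi, fi in zip(x_bar, f)], radius)
        # Online uniform averaging (constant step size, so all weights are equal).
        avg = [a + (xi - a) / k for a, xi in zip(avg, x)]
    return x, avg

def make_noisy_bilinear(std, seed=0):
    # Toy oracle (assumption): F(x_g, x_d) = (a*x_d, -a*x_g) with one sample a ~ N(1, std^2).
    rng = random.Random(seed)
    def oracle(x):
        a = rng.gauss(1.0, std)
        return [a * x[1], -a * x[0]]
    return oracle

last, averaged = srfb_averaged(make_noisy_bilinear(std=0.1), [1.0, 1.0],
                               step=0.3, delta=0.618, iters=2000, radius=10.0)
print("last iterate norm:", math.hypot(*last))
print("averaged iterate norm:", math.hypot(*averaged))
```

The relaxation weight 0.618 is an illustrative value inside the unit interval. With a noise-free oracle the last iterate of this sketch approaches the equilibrium at the origin of the toy game.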

To continue our analysis, we postulate the following monotonicity assumption on the pseudogradient mapping which is standard for VI problems [13, 17], also when applied to GANs [8].

###### Assumption 5

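The pseudogradient mapping $\mathbb{F}$ as in (7) is monotone, i.e., $\langle\mathbb{F}(\boldsymbol{x})-\mathbb{F}(\boldsymbol{y}),\boldsymbol{x}-\boldsymbol{y}\rangle\ge0$ for all feasible $\boldsymbol{x},\boldsymbol{y}$.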

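Next, let us define the stochastic approximation of the pseudogradient [11] as

$$F^{\mathrm{SA}}(\boldsymbol{x},\xi)=\begin{bmatrix}\nabla_{x_g}J_g(x_g,x_d,\xi)\\ \nabla_{x_d}J_d(x_d,x_g,\xi)\end{bmatrix}. \tag{12}$$

$F^{\mathrm{SA}}$ uses one or a finite number, called mini-batch, of realizations of the random variable. Given the approximation, we postulate the following assumption which is quite strong yet reasonable in our game theoretic framework [8]. Let us first define the filtration $\mathcal{F}=\{\mathcal{F}_k\}$, that is, a family of $\sigma$-algebras such that $\mathcal{F}_0=\sigma(X_0)$ and $\mathcal{F}_k=\sigma(X_0,\xi_1,\dots,\xi_k)$ for all $k\ge1$, such that $\mathcal{F}_k\subseteq\mathcal{F}_{k+1}$ for all $k\ge0$.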

###### Assumption 6

in (12) is bounded, i.e., there exists such that for ,

For the sake of our analysis, we make an explicit bound on the feasible set.

###### Assumption 7

The local constraint set is such that , for some .

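For all $k\ge0$ we define the stochastic error as

$$\epsilon^k=F^{\mathrm{SA}}(\boldsymbol{x}^k,\xi^k)-\mathbb{F}(\boldsymbol{x}^k), \tag{13}$$

that is, the distance between the approximation and the exact expected value. Then, we postulate that the stochastic error satisfies the following assumption.

###### Assumption 8

The stochastic error in (13) is such that, for all $k\ge0$, $\mathbb{E}[\epsilon^k\mid\mathcal{F}_k]=0$ and $\mathbb{E}[\|\epsilon^k\|^2\mid\mathcal{F}_k]\le\sigma^2$ a.s., for some $\sigma>0$.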

Essentially, Assumption 8 states that the error has zero mean and bounded variance, as usual in SVI [8, 13, 17].

As a measure of the quality of the solution, we define the following error

(14) |

which is known as the gap function and is equal to 0 if and only if its argument is a solution of the (S)VI in (8) [9, Eq. 1.5.2]. Another measure function specific to the zero-sum game and other possible measures can be found in [8].

We are now ready to state our first result.

###### Theorem 1

See Appendix B.

## V Sample average approximation

If a huge number of samples is available or it is possible to compute the exact expected value, one can consider using a different approximation scheme or a deterministic algorithm. We discuss these two situations in this section.

In the SVI framework, using a finite, fixed number of samples is called stochastic approximation (SA). It is widely used in the literature but it often requires conditions on the step sizes to control the stochastic error. Usually, the step size sequence should be diminishing, with the result that the iterations slow down considerably. The approach that is instead used to keep a fixed step size is the sample average approximation (SAA) scheme. In this case, an increasing number of samples is taken at each iteration, which helps to obtain a diminishing error.

With the SAA scheme, it is possible to prove convergence to the exact solution without using the averaging step. We show this result in Theorem 2 but first we provide more details on the approximation scheme and state some assumptions. The algorithm that we are proposing is presented in Algorithm 2. The differences with respect to Algorithm 1 are the absence of the averaging step and the approximation of the pseudogradient.

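Initialization: $x_i^0\in\Omega_i$

Iteration $k$: Agent $i$ receives $x_j^k$ for $j\neq i$, then updates:

$$\bar{x}_i^k=(1-\delta)x_i^k+\delta\bar{x}_i^{k-1} \tag{15a}$$

$$x_i^{k+1}=\operatorname{proj}_{\Omega_i}\big[\bar{x}_i^k-\lambda F_i^{\mathrm{SAA}}(x_i^k,x_j^k,\xi_i^k)\big] \tag{15b}$$

Formally, the approximation that we use is given by

$$F^{\mathrm{SAA}}(\boldsymbol{x},\xi^k)=\begin{bmatrix}\frac{1}{N_k}\sum_{s=1}^{N_k}\nabla_{x_g}J_g(x_g,x_d,\xi_g^{(s)})\\ \frac{1}{N_k}\sum_{s=1}^{N_k}\nabla_{x_d}J_d(x_d,x_g,\xi_d^{(s)})\end{bmatrix}, \tag{16}$$

where $N_k$ is the batch size that should be increasing [13].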

###### Assumption 9

The batch size sequence $(N_k)_{k\ge1}$ is such that $N_k\ge c(k+k_0)^{a+1}$ for some $c,k_0,a>0$.

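With a little abuse of notation, let us denote the stochastic error also in this case as

$$\epsilon^k=F^{\mathrm{SAA}}(\boldsymbol{x}^k,\xi^k)-\mathbb{F}(\boldsymbol{x}^k),$$

where $F^{\mathrm{SAA}}$ is the sample average approximation in (16) and $\mathbb{F}$ is the pseudogradient mapping in (7).

###### Remark 3

Using the SAA scheme, it is possible to prove that, for some $C>0$, $\mathbb{E}[\|\epsilon^k\|^2\mid\mathcal{F}_k]\le C\sigma^2/N_k$, where $N_k$ is the batch size at iteration $k$, i.e., the error diminishes as the size of the batch increases. Details on how to obtain this result can be found in [13].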
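The effect of the batch size on the approximation error can be observed empirically. In the sketch below, the polynomial batch schedule, its constants and the Gaussian random variable are hypothetical choices used only for illustration; the quantity estimated is the mean squared error of a sample average, which is expected to decay roughly as the inverse of the batch size.

```python
import math
import random

def batch_size(k, c=1.0, k0=1, a=0.5):
    # Hypothetical polynomially increasing schedule N_k = ceil(c * (k + k0)^(a + 1)).
    return max(1, math.ceil(c * (k + k0) ** (a + 1)))

def mean_squared_error(n_samples, trials=200, std=1.0, seed=0):
    # Empirical E[(sample mean - true mean)^2] for a random variable a ~ N(1, std^2).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        mean = sum(rng.gauss(1.0, std) for _ in range(n_samples)) / n_samples
        total += (mean - 1.0) ** 2
    return total / trials

for k in (1, 10, 100):
    n_k = batch_size(k)
    print(f"k={k:3d}  N_k={n_k:5d}  empirical MSE={mean_squared_error(n_k):.5f}")
```

A faster-growing schedule reduces the error more quickly but makes each iteration more expensive, which is the trade-off behind the choice between the SA and SAA schemes.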

###### Assumption 10

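The pseudogradient mapping $\mathbb{F}$ as in (7) is $\ell$-Lipschitz continuous for $\ell>0$, i.e., $\|\mathbb{F}(\boldsymbol{x})-\mathbb{F}(\boldsymbol{y})\|\le\ell\|\boldsymbol{x}-\boldsymbol{y}\|$ for all feasible $\boldsymbol{x},\boldsymbol{y}$.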

The relaxation parameter should not be too small.

###### Assumption 11

In Algorithm 2, $\delta\in[\frac{1}{\varphi},1]$, where $\varphi=\frac{1+\sqrt{5}}{2}$ is the golden ratio.

Conveniently, with the SAA scheme we can take a constant step size, as long as it is small enough.

###### Assumption 12

The step size is such that where is the Lipschitz constant of as in Assumption 10.
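The following sketch combines a relaxed forward–backward iteration with a sample average approximation that uses a growing batch and a constant step size, and no averaging of the iterates. The toy bilinear game with a Gaussian coefficient of mean 1, the linear batch schedule, the parameter values and all names are assumptions for illustration; they do not reproduce the step-size bound of Assumption 12 or the authors' experiments.

```python
import math
import random

def saa_pseudogradient(x, n_samples, rng, std):
    # Average n_samples realizations of F(x, a) = (a*x_d, -a*x_g), with a ~ N(1, std^2).
    a_mean = sum(rng.gauss(1.0, std) for _ in range(n_samples)) / n_samples
    return [a_mean * x[1], -a_mean * x[0]]

def srfb_saa(x0, step, delta, iters, radius, std=0.5, seed=0):
    rng = random.Random(seed)
    x, x_bar = list(x0), list(x0)
    for k in range(iters):
        n_k = k + 10  # hypothetical increasing batch size
        # Relaxation step with a fixed parameter delta.
        x_bar = [(1 - delta) * xi + delta * bi for xi, bi in zip(x, x_bar)]
        # Single SAA pseudogradient evaluation and single projection onto a box.
        f = saa_pseudogradient(x, n_k, rng, std)
        x = [max(-radius, min(radius, bi - step * fi)) for bi, fi in zip(x_bar, f)]
    return x

x_final = srfb_saa([1.0, 1.0], step=0.3, delta=0.618, iters=800, radius=10.0)
print("distance from the equilibrium (0, 0):", math.hypot(*x_final))
```

Because the batch grows with the iteration index, the later iterations are the most expensive ones; in practice the schedule would be tuned against the available sampling budget.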

We can finally state our convergence result.

###### Theorem 2

See Appendix C.

If one is able to compute the exact expected value, the problem is equivalent to the deterministic case. Convergence follows under the same assumptions made for the SAA scheme with the exception of those on the stochastic error.

###### Corollary 1

## VI Numerical simulations

In this section, we present some numerical experiments to validate the analysis. We propose two theoretical comparisons among the most used algorithms for GANs [8]. In both examples, we simulate our SRFB algorithm, the SpFB algorithm [15], the EG algorithm [13], the EG algorithm with extrapolation from the past (PastEG) [8] and Adam, a typical algorithm for GANs [20].

All the simulations are performed in Matlab R2019b on a 2.3 GHz Intel Core i5 with 8 GB of LPDDR3 RAM.

### VI-A Illustrative example

In order to make a comparison, we consider the following zero-sum game which is a problematic example, for instance, for the FB algorithm [8, Prop. 1].

We suppose that the two players aim at solving the minmax problem in (3) with cost function

where and . The matrix is the stochastic part that we approximate with the SAA scheme. is an antidiagonal matrix, i.e., the entry if and only if , and the entries are sampled from a normal distribution with mean 1 and finite variance. The mapping is monotone and and . The problem is constrained so that and the optimal solution is . The step sizes are taken to be the highest possible.

We plot the distance from the solution, the distance of the average from the solution, and the computational cost in Figure 0(a), 0(b) and 0(c), respectively.

As one can see from Fig. 0(a), the SFB does not converge. From Fig. 0(c) instead, we note that the SRFB algorithm is the least computationally expensive. Interestingly, averaging tends to smooth the convergence to a solution.

### VI-B Classic GAN zero-sum game

A classic cost function for the zero-sum game [1] proposed for GANs reads as

This cost function is hard to optimize because it is concave-concave [8]. Here we take thus the equilibrium is In Figure 1(a), 1(b) and 1(c), we show the distance from the solution, the distance of the average from the solution, and the computational cost respectively. Interestingly, all the considered algorithms converge even if there are no theoretical guarantees.

## VII Conclusion

The stochastic relaxed forward–backward algorithm can be applied to Generative Adversarial Networks. Given a fixed mini-batch, under monotonicity of the pseudogradient, averaging can be considered to reach a neighbourhood of the solution. On the other hand, if a huge number of samples is available, under the same assumptions, convergence to the exact solution holds.

## Appendix A Preliminary results

We here recall some facts about norms, some properties of the projection operator and a preliminary result.

We start with the norms. We use the cosine rule

(17) |

and the following two properties of the norm [21, Corollary 2.15], ,

(18) |

(19) |

Concerning the projection operator, by [21, Proposition 12.26], it satisfies the following inequality: let be a nonempty closed convex set, then, for all

(20) |

The projection is also firmly nonexpansive [21, Prop. 4.16] and, consequently, firmly quasinonexpansive [21, Def. 4.1].

The Robbins-Siegmund Lemma is widely used in the literature to prove a.s. convergence of sequences of random variables.

###### Lemma 1 (Robbins-Siegmund Lemma, [22])

Let be a filtration. Let , , and be non negative sequences such that , and let

Then and converges a.s. to a non negative random variable.

The next lemma collects some properties that follow from the definition of the SRFB algorithm.

###### Lemma 2

Given Algorithm 1, the following hold.

Straightforward from Algorithm 1.

## Appendix B Proof of Theorem 1

[Proof of Theorem 1] We start by using the fact that the projection is firmly quasinonexpansive.

Now we apply Lemma 2.2 and Lemma 2.3 to :

(21) | ||||

Then, we can rewrite the inequality as

(22) | ||||

Applying Young's inequality, we obtain

Then (22) becomes

(23) | ||||

Reordering, adding and subtracting and using Lemma 2, we obtain

(24) | ||||

Then, by the definition of , reordering leads to

(25) | ||||

Summing over all the iterations, (25) becomes

(26) | ||||

Using Assumption 5 and resolving the sums, we obtain

(27) | ||||

Now we notice that We define and , thus

(28) | ||||

Therefore, Including this in (27) and doing the sum, we obtain

(29) | ||||

By definition , then, taking the expected value in (29) and using Assumption 8

(30) | ||||

Let us define , . Then,

(31) | ||||

Noticing that if is constant and