# Analysis of Nonautonomous Adversarial Systems

Generative adversarial networks are widely used to generate images, but their convergence properties are still not well understood. A few studies have investigated the stability properties of GANs as dynamical systems, and this short note can be seen in that direction. Among the methods proposed for stabilizing the training of GANs, β-GAN was the first to propose a complete annealing strategy that changes high-level conditions of the GAN objective. In this note, we show by a simple example how the annealing strategy works in GANs. The theoretical analysis is supported by simple simulations.



## 1 Introduction

Generative adversarial nets (goodfellow2014generative, ) are trained by optimizing an objective function over two sets of parameters. For ease of presentation, the framework is described as a competition between two functions, the generator and the discriminator, which want to minimize/maximize a mutual objective. The GAN objective in its general shape can be written as

$$\arg\min_\theta \max_\psi \; L(\theta,\psi) = \mathbb{E}_{p(z)}\left[f(D_\psi(G_\theta(z)))\right] + \mathbb{E}_{p_D(x)}\left[f(-D_\psi(x))\right] \tag{1}$$

where $\psi$ parameterizes the discriminator and $\theta$ parameterizes the generator. Different choices of $f$ give various GAN objectives, e.g. Jensen-Shannon (goodfellow2014generative, ), Wasserstein-GAN (arjovsky2017wasserstein, ), f-GAN (nowozin2016f, ), etc. In accordance with these works, we assume $f$ is differentiable. The ultimate goal is to converge to a saddle point where neither discriminator nor generator can achieve a better objective when the other one is kept fixed. Let's call this point in the $(\theta,\psi)$ space the favorite equilibrium. The interesting property of this point is that the distribution induced by the generator matches the real data distribution. Currently, people use stochastic gradient descent (SGD) updates to alternately perturb $\theta$ and $\psi$ in the hope of converging to the favorite equilibrium in the end. Even though the results look visually promising, the dynamical behavior of this system needs more investigation.

In this note, we restrict ourselves to a minimal example and try to get some insight into a method called Annealed Generative Adversarial Networks (a.k.a. β-GAN), which was proposed recently and proved effective in stabilizing the optimization of GANs in practice.

## 2 Nonautonomous GAN

Continuous dynamical system— We see the GAN as a continuous dynamical system. This assumption is valid when the learning rate of stochastic gradient descent (SGD) tends to zero in optimization.

Autonomous GAN —In conventional GAN training, the dynamical system

$$\begin{cases}\dot\theta = -\nabla_\theta L(\theta,\psi)\\ \dot\psi = \nabla_\psi L(\theta,\psi)\end{cases}\tag{2}$$

is an approximation of the training pattern for a tiny learning rate. We call these dynamical systems autonomous because the right-hand side function is not an explicit function of time (khalil1996noninear, ). Given the Lipschitz continuity of the right-hand side of Eq. 2, a solution for this system exists and is unique.

Nonautonomous GAN— The overall idea is to introduce a new state $\alpha$ into the GAN objective function of Eq. 1. This state controls the data distribution. More precisely, the objective function becomes

$$L(\theta,\psi,\alpha) = \mathbb{E}_{p(z)}\left[f(D_\psi(G_\theta(z)))\right] + \mathbb{E}_{p_D(x;\alpha)}\left[f(-D_\psi(x))\right] \tag{3}$$

To study the effect of this new state, we introduce a minimalistic framework called tiny-GAN to emphasize our points:

tiny-GAN —To have a minimal tractable GAN framework, we set $p_D(x) = \delta(x - \alpha_r)$ and $G_\theta(z) = \theta$, meaning that real data is concentrated on a single point at $\alpha_r$ and the generator is only capable of generating one point, at location $\theta$. The discriminator is assumed linear, i.e. $D_\psi(x) = \psi x$. In contrast to  (mescheder2018convergence, ), we do not tie the data to the origin and release it to occupy any location on the real axis. After these simplifications, the objective function of Eq. 1 becomes:

$$L(\theta,\psi,\alpha) = f(\psi\theta) + f(-\psi\alpha) \tag{4}$$

and the dynamical system of training GAN in Eq. 2 is written as

$$\begin{cases}\dot\theta = -\psi f'(\psi\theta)\\ \dot\psi = \theta f'(\psi\theta) - \alpha_r f'(-\psi\alpha_r)\end{cases}\tag{5}$$

In this formulation, $\alpha_r$ is fixed and represents the real data distribution.
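For $f(x) = x$ (the W-GAN choice used later in Section 3), the system of Eq. 5 reduces to $\dot\theta = -\psi$, $\dot\psi = \theta - \alpha_r$. The following minimal sketch (ours, not from the original note; the initial point, $\alpha_r$, step size, and the classic Runge-Kutta integrator are our own illustrative choices) integrates these dynamics and shows that trajectories circle the equilibrium $(\alpha_r, 0)$ with an amplitude fixed entirely by the initial state:

```python
# tiny-GAN dynamics of Eq. 5 with f(x) = x (W-GAN case):
#   theta_dot = -psi
#   psi_dot   = theta - alpha_r
# Integrated with classic 4th-order Runge-Kutta (our choice of integrator).

def rhs(theta, psi, alpha_r):
    return -psi, theta - alpha_r

def simulate(theta, psi, alpha_r, dt=1e-3, steps=20_000):
    traj = [(theta, psi)]
    for _ in range(steps):
        k1 = rhs(theta, psi, alpha_r)
        k2 = rhs(theta + 0.5 * dt * k1[0], psi + 0.5 * dt * k1[1], alpha_r)
        k3 = rhs(theta + 0.5 * dt * k2[0], psi + 0.5 * dt * k2[1], alpha_r)
        k4 = rhs(theta + dt * k3[0], psi + dt * k3[1], alpha_r)
        theta += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        psi   += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        traj.append((theta, psi))
    return traj

alpha_r = 1.0                       # assumed location of the real data
traj = simulate(theta=0.0, psi=0.5, alpha_r=alpha_r)

# Exact solutions are circles around (alpha_r, 0): the squared distance
# (theta - alpha_r)^2 + psi^2 is conserved, so the oscillation never decays
# and its amplitude is set entirely by the initial state.
radius2 = [(th - alpha_r) ** 2 + ps ** 2 for th, ps in traj]
print(min(radius2), max(radius2))
```

The numerically conserved radius is exactly the undesirable dependence of the oscillation amplitude on the initial state that Section 3 discusses.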

Many formulations of GAN can be characterized by the dynamical system of Eq. 2, which contains only two states: the parameters of the generator ($\theta$) and the parameters of the discriminator ($\psi$). Here we augment the state-space equation with a new state $\alpha$ which characterizes the properties of the data distribution $p_D(x;\alpha)$. In harmony with the minimalistic nature of tiny-GAN, the entire data distribution is characterized by $\alpha$ here. Notice that the real data distribution is not dynamic. Indeed, the real data distribution is the target point of the dynamics of $\alpha$, and we represent it by $\alpha_r$, i.e. $p_D(x;\alpha) \to p_D(x)$ as $\alpha \to \alpha_r$. Optimizing Eq. 3 when the dynamics of $\alpha$ is governed only by the GAN objective results in trivial answers, since there is no guarantee that $\theta$ and $\psi$ arrive at the favorite equilibrium where $\alpha = \alpha_r$. To cure this issue, β-GAN suggested a full annealing strategy over $\alpha$. This idea turns the dynamical system of Eq. 2 into a time-varying (nonautonomous) system.

At this point, two branches can be thought of. In the first branch, which is also the method devised by β-GAN, $\alpha$ has partially decoupled dynamics from the other states of the system. By partially decoupled, we mean that the dynamics of $\alpha$ is not affected by the dynamics of the other states of the system; however, the dynamics of the other states may depend on the dynamics of $\alpha$. In the second branch (let's call it λ-GAN), $\alpha$ undergoes two dynamics: one is imposed by the GAN objective, which acts through SGD updates, and the other is the annealing dynamics. The first term couples the dynamics of $\alpha$ with the other states of the system. As proposed in β-GAN, annealing steps must act on a slower timescale than the SGD iterations of the optimization. The slow, partially decoupled dynamics of $\alpha$ is characterized by

$$\alpha(t) = (\alpha_0 - \alpha_r)\,e^{-t/T} + \alpha_r \tag{6}$$

where $T$ is a time constant that makes this dynamic term slower than the SGD dynamics. In addition, $\alpha_0$ is the initial value of $\alpha$ that characterizes the initial distribution of the data when the annealing process starts, and $\alpha_r$ is the target value of $\alpha$ for which $p_D(x;\alpha)$ becomes the real data distribution $p_D(x)$. Therefore the state-space equation is written as follows:

$$\begin{cases}\dot\theta = -\psi f'(\psi\theta)\\ \dot\psi = \theta f'(\psi\theta) - \alpha f'(-\psi\alpha)\\ \dot\alpha = \lambda\left[-\psi f'(-\psi\alpha)\right] + \frac{1}{T}(\alpha_r - \alpha_0)\,e^{-t/T}\end{cases}\tag{7}$$

The hyper-parameter $\lambda$ is a switch with an important meaning that differentiates between β-GAN and λ-GAN. When $\lambda = 0$ (β-GAN), the variable $\alpha$ is not perturbed by short-timescale SGD updates. This means that $\alpha$ has partially decoupled dynamics from the dynamics of the states $(\theta,\psi)$. On the other hand, when $\lambda = 1$, the dynamics of $\alpha$ is governed by both a short-timescale term and a long-timescale term: the former is the SGD updates and the latter is the same annealing term as in β-GAN. Furthermore, β-GAN suggests starting from a uniform distribution, meaning that $p_D(x;\alpha_0)$ is constant over a specified area and zero elsewhere. The generator must be pre-trained to capture this uniform distribution for a certain data dimension. This means that the generator at time $t = 0$ is able to generate a simple uniform distribution which matches the initial distribution $p_D(x;\alpha_0)$. In our minimalistic setting of tiny-GAN and the dynamical system of Eq. 7, this translates to $\theta(0) = \alpha(0) = \alpha_0$.
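Putting Eq. 6 and Eq. 7 together for $\lambda = 0$ and $f(x) = x$ gives $\dot\theta = -\psi$, $\dot\psi = \theta - \alpha(t)$, with $\alpha(t)$ following the closed-form schedule. The sketch below is ours, not the note's own code; $\alpha_0 = 0$, $\alpha_r = 1$, $T = 5$, and the RK4 integrator are assumed illustrative choices. It starts from the β-GAN initial condition $\theta(0) = \alpha(0) = \alpha_0$, $\psi(0) = 0$ and exhibits a persistent but small late-time oscillation:

```python
import math

alpha_0, alpha_r, T = 0.0, 1.0, 5.0      # assumed annealing parameters

def alpha(t):
    # Partially decoupled annealing dynamics (Eq. 6).
    return (alpha_0 - alpha_r) * math.exp(-t / T) + alpha_r

def rhs(t, theta, psi):
    # beta-GAN (lambda = 0) with f(x) = x: Eq. 7 reduces to
    #   theta_dot = -psi,  psi_dot = theta - alpha(t)
    return -psi, theta - alpha(t)

def simulate(t_end=50.0, dt=1e-3):
    theta, psi, t = alpha_0, 0.0, 0.0    # beta-GAN initial condition
    hist = []
    while t < t_end:
        k1 = rhs(t, theta, psi)
        k2 = rhs(t + dt/2, theta + dt/2 * k1[0], psi + dt/2 * k1[1])
        k3 = rhs(t + dt/2, theta + dt/2 * k2[0], psi + dt/2 * k2[1])
        k4 = rhs(t + dt, theta + dt * k3[0], psi + dt * k3[1])
        theta += dt/6 * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0])
        psi   += dt/6 * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1])
        t += dt
        hist.append((t, theta, psi))
    return hist

hist = simulate()
late = [(th, ps) for (t, th, ps) in hist if t > 30.0]
amp_psi = max(abs(ps) for _, ps in late)
center_theta = sum(th for th, _ in late) / len(late)
print(amp_psi, center_theta)
```

For these assumed values the late oscillation of $\psi$ has amplitude around $0.2$, far below the order-one amplitude an autonomous run started from the same initial distance would exhibit; larger $T$ shrinks it further, in line with the simulations and analysis below.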

## 3 Simulations

To show the effect of the annealing strategy in GANs, simple simulations are presented here for the autonomous GAN, β-GAN and λ-GAN. Note that the objective function of Eq. 1 becomes that of W-GAN when $f(x) = x$. We compare the normal (autonomous) GAN with two nonautonomous GANs (β-GAN and λ-GAN). Remember that in β-GAN, the data distribution does not change on the short timescale and has its own partially decoupled dynamics due to annealing, while in λ-GAN, the data distribution is altered by both the fast dynamics of SGD and the slow dynamics of annealing. In all simulated experiments, the real data distribution is located at $\alpha_r$, which is the static value of the data location for the autonomous GAN but the target value of $\alpha$ for the nonautonomous GANs. Fig. 1(a) shows the solution of the dynamical system of Eq. 5 when $f(x) = x$, as in Wasserstein GAN. As can be seen, the states oscillate around $(\theta,\psi) = (\alpha_r, 0)$, which is the equilibrium point of this system. For the linear $f$ and the tiny-GAN framework used in this note, this result is global. It can be shown that for nonlinear choices of $f$, the same oscillation is observable, but locally around the equilibrium point. Notice that this is a so-called unsustained oscillation, which is different from a stable limit cycle: here, the amplitude of the oscillation depends on the initial state, which is an undesirable effect. Fig. 1(b) depicts the behavior of β-GAN and shows the solution to the dynamical system of Eq. 7 when $\lambda = 0$. Still the target value for $\alpha$ is $\alpha_r$. As can be seen, the dynamical system is still oscillating, but the amplitude of the oscillation is reduced. The downside here is that the system oscillates around a wrong point, different from the favorite equilibrium $(\alpha_r, 0)$.

Fig. 1(c) simulates the behavior of λ-GAN by running the dynamical system of Eq. 7 with $\lambda = 1$. Again the target data is located at $\alpha_r$. As can be seen, the system oscillates as in Fig. 1(b), but this time around the correct point $(\alpha_r, 0)$. The amplitude of the oscillation is lower than that of the autonomous GAN of Fig. 1(a) and decreases further as $T$ increases. Increasing $T$ means it takes longer for $\alpha$ to move from $\alpha_0$ to $\alpha_r$, which is equivalent to slower annealing dynamics, or finer annealing steps in the discrete setting. This is shown in Fig. 1(d), where the entire setting is as in the previous case, but a larger $T$ results in a slower approach to $\alpha_r$ and a reduced oscillation amplitude around the correct equilibrium point.

## 4 Theoretical Analysis

The simulations of Section 3 show that the amplitude of oscillation decreases as $T$ increases. Here, a more formal analysis is provided to explain this observation. The dynamical system of Eq. 7 for $\lambda = 0$ and $f(x) = x$ is written as follows:

$$\begin{cases}\dot\theta = -\psi\\ \dot\psi = \theta - \alpha\\ \dot\alpha = \frac{1}{T}(\alpha_r - \alpha_0)\,e^{-t/T}.\end{cases}\tag{8}$$

Let $K = (\alpha_r - \alpha_0)/T$ and $a = 1/T$. We take the Laplace transform of both sides of the three equations above:

$$\begin{cases}s\theta(s) - \theta(0) = -\psi(s)\\ s\psi(s) - \psi(0) = \theta(s) - \alpha(s)\\ s\alpha(s) - \alpha(0) = \frac{K}{s+a}.\end{cases}\tag{9}$$

Taking the derivative of both sides of the second line of Eq. 8 amounts to multiplying both sides of the second line of Eq. 9 by the Laplace differentiation operator $s$, which results in

$$s^2\psi(s) = s\theta(s) - s\alpha(s) = \theta(0) - \psi(s) - \alpha(0) - \frac{K}{s+a} \tag{10}$$

where $\theta(0)$ and $\alpha(0)$ cancel each other due to the assumption of β-GAN that the generator starts from the simple initial distribution characterized by $\alpha_0$, i.e. $\theta(0) = \alpha(0) = \alpha_0$. This assumption consequently ensures $\psi(0) = 0$, because it is assumed that the equilibrium is initially found for both generator and discriminator for the data distribution $p_D(x;\alpha_0)$. Solving for $\psi(s)$ gives us

$$\psi(s) = \frac{-K}{(1+s^2)(s+a)}. \tag{11}$$

We then expand the right-hand side as a sum of partial fractions:

$$\psi(s) = \frac{K}{1+a^2}\,\frac{s}{1+s^2} - \frac{Ka}{1+a^2}\,\frac{1}{1+s^2} - \frac{K}{1+a^2}\,\frac{1}{s+a}. \tag{12}$$
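As a quick check (ours, not part of the original derivation), recombining the three fractions
$\frac{K}{1+a^2}\frac{s}{1+s^2} - \frac{Ka}{1+a^2}\frac{1}{1+s^2} - \frac{K}{1+a^2}\frac{1}{s+a}$
over the common denominator recovers Eq. 11:

```latex
\frac{K}{1+a^2}\left[\frac{s(s+a) - a(s+a) - (1+s^2)}{(1+s^2)(s+a)}\right]
= \frac{K}{1+a^2}\cdot\frac{-(1+a^2)}{(1+s^2)(s+a)}
= \frac{-K}{(1+s^2)(s+a)}.
```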

Computing the inverse Laplace transform of $\psi(s)$ gives

$$\mathcal{L}^{-1}\{\psi(s)\} = \psi(t) = \underbrace{A\cos(t) + B\sin(t)}_{\psi_1(t)} + C\,e^{-at} \tag{13}$$

where $A = \frac{K}{1+a^2}$, $B = -\frac{Ka}{1+a^2}$, and $C = -\frac{K}{1+a^2}$. The last term vanishes in the steady-state solution when $t \to \infty$. We are mainly interested in the first two terms, which are responsible for the persistent oscillation. Adding two harmonics results in a new harmonic with scaled amplitude $\bar{A}$ and phase shift $\phi$:

$$\begin{cases}\psi_1(t) = \bar{A}\sin(t + \phi)\\ \bar{A} = \sqrt{A^2 + B^2 + 2AB\cos(\pi/2)} = \sqrt{A^2 + B^2}\\ \phi = \tan^{-1}(A, B)\end{cases}\tag{14}$$

where $\tan^{-1}(\cdot,\cdot)$ is the quadrant-aware arctangent. By substituting $A$ and $B$ in $\bar{A}$, we can compute the amplitude of the persistent oscillation as

$$\bar{A} = \sqrt{\left(\frac{K}{1+a^2}\right)^2 + \left(\frac{Ka}{1+a^2}\right)^2} = \frac{K}{1+a^2}\sqrt{1+a^2} = \frac{K}{\sqrt{1+a^2}}. \tag{15}$$

The term $\sqrt{1+a^2} \to 1$ as $T \to \infty$. The important term is $K = (\alpha_r - \alpha_0)/T$, which goes to zero as $T \to \infty$ and proves our point that the amplitude decreases as the annealing time increases. Now that the analytic form of $\psi(t)$ is known, we can move on and obtain the analytic form of $\theta(t)$. According to Eq. 9, we can write $\theta(s)$ in terms of $\psi(s)$ as follows:

$$\theta(s) = \frac{1}{s}\left[\theta(0) - \psi(s)\right] \tag{16}$$

where $\frac{1}{s}$ acts as an integrator. Therefore, we can take the inverse Laplace transform and compute the following definite integral to obtain $\theta(t)$:

$$\begin{aligned}\theta(t) &= \theta(0)\,\mathcal{L}^{-1}\!\left\{\tfrac{1}{s}\right\} - \int_{0}^{t}\psi(\tau)\,d\tau\\ &= \theta(0) + \frac{K}{1+a^2}\int_{0}^{t}e^{-a\tau}\,d\tau - \int_{0}^{t}\psi_1(\tau)\,d\tau\\ &\xrightarrow{\;t\to\infty\;} \theta(0) + \frac{K}{1+a^2}\,\frac{1}{a} + \Psi_1(t)\\ &= \theta(0) + \frac{\alpha_r - \alpha_0}{1+a^2} + \Psi_1(t), \qquad \Psi_1(t) \triangleq -\int_{0}^{t}\psi_1(\tau)\,d\tau.\end{aligned}\tag{17}$$

Notice that $\Psi_1(t)$ is the integral of a sinusoid, which is itself a sinusoid. As the annealing time increases, $T \to \infty$ (i.e. $a \to 0$), the term $\frac{\alpha_r - \alpha_0}{1+a^2} \to \alpha_r - \alpha_0$, and since $\theta(0) = \alpha_0$ we eventually have the steady-state solution of $\theta(t)$ as follows:

$$\lim_{T\to\infty}\theta(t) = \alpha_r + \Psi_1(t) \tag{18}$$

which shows the oscillation around the desired equilibrium point $\alpha_r$, i.e. the real data distribution. ∎
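The derivation above can be cross-checked numerically. The script below is our own sanity check, with $\alpha_0 = 0$, $\alpha_r = 1$, $T = 5$ as assumed values: it compares the closed form $\psi(t) = A\cos t + B\sin t + Ce^{-at}$ of Eq. 13 pointwise against a direct RK4 integration of Eq. 8, and the measured late-time amplitude against $K/\sqrt{1+a^2}$ from Eq. 15:

```python
import math

alpha_0, alpha_r, T = 0.0, 1.0, 5.0    # assumed annealing parameters
K, a = (alpha_r - alpha_0) / T, 1.0 / T
A, B, C = K / (1 + a**2), -K * a / (1 + a**2), -K / (1 + a**2)

def psi_closed(t):
    # Inverse Laplace transform of Eq. 12 (coefficients of Eq. 13).
    return A * math.cos(t) + B * math.sin(t) + C * math.exp(-a * t)

def alpha(t):
    # Closed-form solution of the third line of Eq. 8 (the schedule of Eq. 6).
    return (alpha_0 - alpha_r) * math.exp(-t / T) + alpha_r

def rhs(t, theta, psi):
    # Eq. 8: theta_dot = -psi, psi_dot = theta - alpha(t)
    return -psi, theta - alpha(t)

# RK4 integration from the beta-GAN initial condition
# theta(0) = alpha(0) = alpha_0, psi(0) = 0.
theta, psi, t, dt = alpha_0, 0.0, 0.0, 1e-3
max_err, amp = 0.0, 0.0
while t < 50.0:
    k1 = rhs(t, theta, psi)
    k2 = rhs(t + dt/2, theta + dt/2 * k1[0], psi + dt/2 * k1[1])
    k3 = rhs(t + dt/2, theta + dt/2 * k2[0], psi + dt/2 * k2[1])
    k4 = rhs(t + dt, theta + dt * k3[0], psi + dt * k3[1])
    theta += dt/6 * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0])
    psi   += dt/6 * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1])
    t += dt
    max_err = max(max_err, abs(psi - psi_closed(t)))
    if t > 30.0:                       # decaying term is negligible here
        amp = max(amp, abs(psi))

predicted = K / math.sqrt(1 + a**2)    # amplitude predicted by Eq. 15
print(max_err, amp, predicted)
```

The pointwise error stays at integrator precision and the measured amplitude agrees with Eq. 15, supporting the Laplace-domain solution.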

## 5 Conclusion

This note suggests annealing as a promising approach in GANs. Practical results are already provided in the β-GAN paper (mehrjou2017annealed, ). In this note, a minimalistic nonautonomous adversarial system is proposed to mimic the behavior of a GAN in a tractable way when its data distribution is changing. The optimization updates and the dynamics of the annealing strategy are approximated by a continuous dynamical system. Simulations and theoretical analysis are performed to give insight into the dynamics of GANs under annealing. We believe viewing adversarial strategies as dynamical systems is interesting not only in unsupervised learning, but also in control theory, where compelling systems may arise when states act in an adversarial way.