Using Skill Rating as Fitness on the Evolution of GANs

by   Victor Costa, et al.
University of Coimbra

Generative Adversarial Networks (GANs) are an adversarial model that achieved impressive results on generative tasks. In spite of the relevant results, GANs present some challenges regarding stability, making the training usually a hit-and-miss process. To overcome these challenges, several improvements were proposed to better handle the internal characteristics of the model, such as alternative loss functions or architectural changes on the neural networks used by the generator and the discriminator. Recent works proposed the use of evolutionary algorithms on GAN training, aiming to solve these challenges and to provide an automatic way to find good models. In this context, COEGAN proposes the use of coevolution and neuroevolution to orchestrate the training of GANs. However, previous experiments detected that some of the fitness functions used to guide the evolution are not ideal. In this work we propose the evaluation of a game-based fitness function to be used within the COEGAN method. Skill rating is a metric to quantify the skill of players in a game and has already been used to evaluate GANs. We extend this idea using the skill rating in an evolutionary algorithm to train GANs. The results show that skill rating can be used as fitness to guide the evolution in COEGAN without the dependence of an external evaluator.


Exploring the Evolution of GANs through Quality Diversity

Generative adversarial networks (GANs) achieved relevant advances in the...

Skill Rating for Generative Models

We explore a new way to evaluate generative models using insights from e...

Demonstrating the Evolution of GANs through t-SNE

Generative Adversarial Networks (GANs) are powerful generative models th...

Evolutionary Multi-Objective Optimization Driven by Generative Adversarial Networks (GANs)

Recently, more and more works have proposed to drive evolutionary algori...

Towards co-evolution of fitness predictors and Deep Neural Networks

Deep neural networks proved to be a very useful and powerful tool with m...

Host-Pathongen Co-evolution Inspired Algorithm Enables Robust GAN Training

Generative adversarial networks (GANs) are pairs of artificial neural ne...

A survey on GANs for computer vision: Recent research, analysis and taxonomy

In the last few years, there have been several revolutions in the field ...

1 Introduction

Generative models have gained a lot of interest in the past years. The recent advances in the field contributed with impressive results, mainly in the context of images. Generative Adversarial Networks (GANs) [9] presented a relevant advance in this context, producing realistic results in several domains. In the original GAN model, two neural networks, a generator and a discriminator, are competing in a unified training process. The generator fabricates samples and the discriminator detects if these samples are fake or from an input distribution.

Despite the high-quality results, GANs are hard to train and a trial-and-error strategy is frequently followed to get the expected results. The challenges with GAN training are commonly related to the balance between the discriminator and the generator. In this context, the vanishing gradient and the mode collapse are two common problems affecting GANs. The vanishing gradient leads to stagnation of the training, caused by an imbalance between the forces of the generator and the discriminator. The mode collapse problem is characterized by the lack of representation of the target distribution used in training.

In order to solve these issues and to achieve better results, different strategies were proposed. A relevant effort was spent on the design of alternative loss functions to use in the GAN training, originating the proposal of alternative models such as WGAN [3], LSGAN [16], and RGAN [12]. Other proposals target the improvement of the architecture used in GANs, defining new modules like in SAGAN [34] or a set of recommendations as in DCGAN [21]. However, problems like the mode collapse and the vanishing gradient are still present in the training.

The use of evolutionary algorithms to train GANs was recently proposed by some researchers [1, 5, 6, 7, 29, 32]. Techniques such as neuroevolution, coevolution, and Pareto set approximations were used in their models. The application of evolutionary algorithms in GANs takes advantage of the evolutionary pressure to guide individuals toward convergence, often discarding problematic individuals.

Coevolutionary GAN (COEGAN) proposes the use of neuroevolution and coevolution to orchestrate the training of GANs. Despite the advances in the training stability, there is still room for improvement in the model. The experimental evaluation identified that the fitness function can be enhanced to better guide the evolution of the components, mainly regarding the discriminator. Currently, the discriminator uses the loss function of the respective GAN component. However, this function displayed a high variability behavior, disrupting the evolution of the population. The generator uses the Fréchet Inception Distance (FID) score, which introduces an external evaluator represented by a trained Inception Network [27, 28]. Although the good results introduced by the FID score as fitness, the drawbacks are the execution cost and the dependence of an external evaluator.

The FID score is currently the most used metric, but several other metrics were proposed to evaluate the performance of GANs [4, 33]. Metrics such as skill rating was successfully used to evaluate GANs in some contexts [20]. Skill rating uses a game rating system to assess the skill of generators and discriminators. Each generator and discriminator is considered as a player in a game and the pairing between them is designed as a match. The outcome of the matches is used as input to calculate the skill of each player.

We took inspiration from the use of skill rating to quantify the performance of generators and discriminators in GANs to design a fitness function to be used within COEGAN. Therefore, we replace the regular fitness used in COEGAN with the skill rating, i.e., the discriminator and the generator use the skill rating metric instead of the loss function and the FID score. We present an experimental study on the use of this metric, comparing the results with the previous approach used in COEGAN, a random search approach, and with a non-evolutionary model based on DCGAN. The results evidenced that skill rating provides useful information to guide the evolution of GANs when used in combination with the COEGAN model. The skill rating is more efficient with respect to execution time and does not compromise the quality of the final results.

The remainder of this paper is organized as follows: Section 2 introduces the concepts of GANs and evolutionary algorithms, presenting state-of-the-art works using these concepts; Section 3 presents COEGAN and our approach to use skill rating as fitness; Section 4 displays the experimental results using this approach; finally, Section 5 presents our conclusions and future work.

2 Background and Related Works

Generative Adversarial Networks (GANs) [9]

are an adversarial model that have became relevant for presenting high-quality results in generative tasks, mainly on the image domain. In summary, a GAN is composed of a generator and a discriminator, trained as adversaries by a unified algorithm. Each component is represented by a neural network and has a role guided by its specific loss function. The generator has to produce synthetic samples that should be classified as real by the discriminator. The discriminator should distinguish between fake samples and samples originated from an input distribution. For this, the discriminator receives a real input distribution for training, such as an image dataset. The generator is fed with a latent distribution, usually with a lower dimension than the real input distribution, and never directly looks into the real distribution.

In the original GAN model, the loss function of the discriminator is defined as follows:


For the generator, the non-saturating version of the loss function is defined by:


In Eq. 1, is the real data used as input to the discriminator. In Eq. 1 and Eq. 2, is the latent space used to feed the generator, is the latent distribution, is the generator, and represents the discriminator.

Despite the quality of the results, GANs are hard to train and the presence of stability issues on the training process is frequent. The vanishing gradient and the mode collapse are two of the most common problems that affect the training of GANs. The vanishing gradient issue is characterized by a disequilibrium between the forces of the GAN components. For example, the discriminator becomes too powerful and does not make mistakes when detecting fake samples produced by the generator. In this case, the progress on the training stagnates. The mode collapse problem occurs when the generator only partially captures the input distribution used on the discriminator training. This issue affects the variability and the quality of the created samples.

Several approaches were used to minimize these issues and leverage the quality of the results. In this context, alternative loss functions were proposed to replace the functions used in the classical GAN model, such as WGAN [3], LSGAN [16], and RGAN [12]. Another strategy is to propose architectural changes to the GAN model. DCGAN [21] proposed a reference architecture for the discriminator and the generator in GANs, describing a set of constraints and rules to achieve better results. On the other hand, a predefined strategy to progressively grow a GAN during the training procedure was proposed in [13]. SAGAN [34] proposed the use of a self-attention module in order to capture the relationship between spatial regions of the input sample. Although these approaches tried to minimize the problems and produce better results, issues still affect the training of GANs [3, 10, 23]

. Besides, the discovery of efficient models and hyperparameters for the models is not a trivial task, requiring recurrent empirical validation.

Recently, research was conducted to propose the use of evolutionary algorithms to train and evolve GANs [1, 5, 6, 7, 29, 32]. Evolutionary algorithms take inspiration on the mechanism found in nature to evolve a population of potential solutions on the production of better outcomes for a given problem [24]. E-GAN [32] uses an evolutionary algorithm to combine three different types of loss functions in the training. An approach based on the Pareto set approximations was used in [7] to model the GAN problem. Lipizzaner [1] proposes the use of spatial coevolution to match generators and discriminators in the training process. Mustangs [29] unifies the concepts of E-GAN and Lipizzaner in a single model, using different loss functions and spatial coevolution in the solution.

COEGAN uses neuroevolution and coevolution on the training and evolution of GANs. Despite the advances identified by the experiments, the results also showed that the fitness functions used in the model can be improved. COEGAN uses the loss function (Eq. 1) as the fitness for discriminators and the FID score for generators. The use of better fitness can be helpful for the creation of better models and also avoid the common stability issues when training GANs. Furthermore, as specified in the FID score, COEGAN uses an external evaluator to quantify the fitness for generators.

Several strategies were proposed to quantify the performance of GANs [4, 33]. Although the FID score is the most used metric to evaluate and compare GANs, alternative approaches can be successfully applied, such as skill rating [20]. The skill rating metric for GANs uses the Glicko-2 [8] rating system to calculate the performance. Glicko-2 was also used as comparison criteria between different evolutionary algorithms [30, 31].

3 Our Approach

We present in this section our approach to applying skill rating as fitness in an evolutionary algorithm. For this, we make use of the previously introduced method called COEGAN [5, 6], adapting the model for our proposal in this paper. Thus, we firstly introduce in this section the COEGAN algorithm. After that, we describe the skill rating method and its application in COEGAN.

3.1 Coegan

COEGAN [5, 6] proposes the use of neuroevolution and coevolution to train and evolve GANs. The motivations of COEGAN are to solve the stability issues frequently found when training GANs and also to automatically discover efficient models for different applications.

COEGAN is inspired by DeepNEAT [17] to design the model, also using coevolution techniques presented in NEAT applied to competitive coevolution [26]. The genome of COEGAN is represented by a sequential array of genes. This genome is transformed into a neural network, where each gene directly represents a layer in this network. The evolution occurs on the architecture and the internal parameters of each layer. Therefore, the mutation operators were used to add a layer, remove an existing layer, and mutate the internal parameters of a layer. For the sake of simplicity, in this work we only use convolutional layers in the addition operator. As in the original COEGAN proposal, crossover was not used in the final model because it introduced instability in the system.

Two separated populations are used in COEGAN: a population of discriminators and a population of generators. Thus, competitive coevolution was used to design the environment. In the evaluation phase, individuals are matched following an all vs. all strategy, i.e., each generator will be matched against each discriminator . Other strategies can be used, such as all vs. best. However, the all vs. all approach achieved the best results, despite the high execution cost with the application.

The selection phase uses a strategy based on NEAT [25]. Therefore, a speciation mechanism is used to promote innovation when evolving the populations. Fitness sharing adjusts the fitness of the individuals, making the selection proportional to the average fitness of each species. The species are grouped following the similarity on the genome of the individuals.

The fitness for the discriminator is the respective loss function of the classical GAN model, given by Eq. 1. The fitness of the generator is represented by the Fréchet Inception Distance (FID) [11], given by:


where , , , and

represent the mean and covariance estimated for the real dataset

and fake samples , respectively. The FID score uses the Inception Network [27, 28]

, usually trained with the ImageNet dataset 

[22], to transform images into a feature space, which is interpreted as a continuous multivariate Gaussian. The mean and covariance of the two resulting Gaussians for the transformation of real and fake images are applied in Eq. 3,

3.2 Skill Rating

In games like chess, it is common to use a rating system to quantify the skill of players. In this context, the Glicko-2 [8] rating system can be used to measure the performance of players given a history of matches. The Glicko-2 system associates to each player three variables: the rating , the deviation , and the volatility . The rating indicates the actual skill of player after a sequence of matches with other players in a game. The volatility represents the expected variability on the rating of a player. The deviation represents the confidence in the player’s rating. A system constant is also used to control the rate of change on the volatility . Different from , , and , this parameter is associated with the whole rating system.

All players are initialized with the recommended values of for the rating , for the deviation and for the volatility . These values can be tuned according to the characteristics of the application. At a fixed time period, the results of all matches between players are stored and used to update the rating , deviation , and volatility . It is recommended to use a time span large enough to contain at least to games for each player.

The Glicko-2 rating system was previously used on the comparison of evolutionary algorithms [30, 31]. In this case, different algorithms are executed on a given problem and the solutions found by them are matched to produce the outcome used as input to the Glicko-2 system. Thus, the algorithms are ranked according to the rating score.

Another application of the Glicko-2 system was to evaluate the performance of GANs [20]

. In this case, the rating was applied between discriminators and generators of different epochs to calculate the progressive skills of them. The authors found that skill rating provides a useful metric to relatively compare GANs.

We took inspiration on these use cases of Glicko-2 to apply the system in COEGAN. The fitness function for discriminators and generators in the COEGAN algorithm was changed to use the skill rating metric computed using Glicko-2. Therefore, each generator and discriminator have an associated skill rating, represented by , , and .

At the evaluation phase of the evolutionary algorithm, discriminators and generators are matched to be trained with the GAN algorithm and also to be evaluated for selection and reproduction. We modeled each evaluation step between a generator and a discriminator as a game to be quantified and applied to the skill rating calculation, composing a tournament of generators against discriminators. Therefore, as we use the all vs. all pairing strategy, each outcome of the match between is stored and used to update the skill rating at the end of each generation. Inspired by the approach in [20], we use the following equations to calculate the outcome of a match for the discriminator:


where is the win rate of the discriminator with respect to the real data, is the rate related to the fake data, is the overall win rate of the discriminator , is a threshold function that outputs when the threshold is met and otherwise,

outputs the probability of the sample to be real,

is the generator, is the input distribution, is a sample drawn from the input distribution, is the latent distribution, is the latent input for the generator, is the number of real samples, and is the number of fake samples. In summary, the win rate for the discriminator is based on the number of mistakes made by it with the real input batch (Eq. 4) and fake data produced by the generator (Eq. 5).

For the generator, the result is calculated as:


where is the discriminator win rate given by Eq. 6.

The win rates of each generator and discriminator are used as input to update the skill rate of the individuals. Each individual and has a set of outcomes , containing the win rate of each match and the skill of the adversarial. Thus, a generator has a set containing each pair for a generation. A discriminator has a set containing each pair . The sets and are used to calculate the new skill rating at the end of the generation, represented by and , respectively. It is important to note that the update of the skill rating of a player depends on the skill of the adversary, i.e., win a game from a strong player is more rewarding than to win from a weak player.

We propose in this work the use of skill rating as fitness in COEGAN, represented by the use of instead of Eq. 1 for discriminators and instead of Eq. 3 for generators. Therefore, the fitness functions for discriminators and generators are defined as:


where and are the rating for discriminators and generators, respectively. At each generation, individuals update the skill rating following these rules. In the breeding process, the offspring carry the skill rating of their parent. In this way, we keep track of the progress of individuals through generations, even when mutations occur to change their genome.

Besides the matches between each pair , individuals in the current generation can also be matched against individuals from previous generations. The algorithm can keep track of the best individuals from the last generations to match them against the current individuals in order to ensure the progression of them. This is also a strategy to avoid the intransitivity problem that occurs in competitive coevolution algorithms. The intransitivity problem means that a solution is better than other solution and is better than , but it is not guaranteed that is better than , leading to cycling between solutions during the evolutionary process and harming the progress toward optimal outcomes [2, 18]. However, this work does not use previous generations in the skill rating calculation. We leave the evaluation of this strategy for future work.

4 Experiments

To evaluate the use of skill rating with COEGAN, we conducted an experimental study using the Street View House Numbers (SVHN) dataset [19]. The SVHN dataset is composed of digits from to extracted from real house numbers. Therefore, it is a dataset with a structure similar to the MNIST dataset [15] used in previous COEGAN experiments, but with more complexity introduced by the use of real images, presenting digits with a variety of backgrounds. The experiments compare the results of the original COEGAN approach (with the FID score and the loss function as fitness for generators and discriminators), COEGAN with skill rating applied as fitness, a random search approach, and a DCGAN-based architecture. We also present a comparison between the FID score and the skill rating metric in experiments with the MNIST dataset.

4.1 Experimental Setup

Evolutionary Parameters Value
Number of generations 50
Population size (generators and discriminators) 10
Add Layer rate 20%
Remove Layer rate 10%
Change Layer rate 10%
Output channels range [32, 256]
Tournament 2
FID samples 2048
Genome Limit 4
Species 3
Skill Rating Parameters Value
, , 1500, 350, 0.06
constant 1.0
GAN Parameters Value
Batch size 64
Batches per generation 20
Optimizer Adam
Learning rate 0.001
Betas 0.5, 0.999
Table 1: Experimental parameters.

Table 1 lists the parameters used in our experiments. These parameters were chosen based on preliminary experiments and the results presented in our previous works [5, 6]. All experiments are executed for generations. The number of individuals in the populations of generators and discriminators is . This number of individuals is enough to achieve the recommended matches to feed the Glicko-2 rating system. For the variation operators, we use the rates , , and for the add layer rate, remove layer rate, and change layer rate, respectively. The number of output channels is sampled using the interval . A tournament with is applied inside each species to select the individuals for reproduction and the algorithm self-adjust to contains species for the population of generators and discriminators. The number of samples used to calculate the FID score is . To make the experiments comparable, each individual has a genome limited to genes, the same number of layers used in the DCGAN-based experiments. Besides, as the DCGAN-based model does not use an evolutionary algorithm, these evolutionary parameters described above are not applied to it.

The initial skill rating parameters used in the experiments are the same suggested by the Glicko-2 system [8], i.e., the rating , deviation , and the volatility are initialized with , , and , respectively. The system constant was set to . We conduct previous experiments to choose the best for our context. We found no relevant changes with respect to this parameter. Nevertheless, experiments focused on the tuning of should be executed to evaluate its effect on our proposal.

All experiments used the original GAN model, i.e., the neural networks are trained with the classical loss functions defined by Eq. 1 and Eq. 2. The GAN parameters were chosen based on preliminary experiments and the setup commonly used on the evaluation of GANs [10, 13, 21]. The batch size used in the training is . The Adam optimizer [14] is used with the learning rate of , beta 1 of , and beta 2 of . Each pairing between generators and discriminators is trained by batches per generation. As the all vs. all is used, each generator and discriminator will be trained for a total of batches. For the DCGAN-based experiments, we have a single generator and discriminator. Therefore, we train them for batches to keep the results comparable with the COEGAN experiments.

The results are evaluated using the FID score and the skill rating. For the SVHN dataset, the FID score is based on the Inception Network trained with the SVHN dataset instead of the ImageNet dataset, the same strategy used in the experiments of [20]

. For the MNIST results, we use the Inception Network trained with the ImageNet dataset. All results presented in this work are obtained by the average of five executions, with a confidence interval of


4.2 Results

Figure 1 presents the results of the best FID score per generation for the experiments with the SVHN dataset. We can see that the results for the original COEGAN proposal, i.e., COEGAN guided by the FID and the loss as fitness functions, are still better than the results for COEGAN with the skill rating metric. However, COEGAN guided by skill rating presented better FID scores than the random search approach. Thus, this evidences that skill rating provides useful information to the system, presenting evolutionary pressure to the individuals in the search of efficient models. Moreover, COEGAN with the FID score as fitness outperforms the DCGAN-based approach, illustrating the advantages of COEGAN.

Figure 1: Best FID score for generators with a 95% confidence interval

We found in the experiments that skill rating sometimes overestimates the score for bad individuals, affecting the final results of the training. A dataset with the complexity of SVHN may require more training epochs to achieve better outcomes, and the variability introduced by the all vs. all pairing may be too much for complex datasets. Therefore, another approach such as spatial coevolution used in [1, 29] will be considered in further experiments. Furthermore, the calculation of the match outcome, given by Eq. 4-7, can be improved to overcome this problem.

Algorithm FID Score
COEGAN + Skill
Random search
Table 2: FID score of the algorithms used in the experiments with SVHN.

Table 2

shows the average FID of the best scores at the last generation for each experiment with the SVHN dataset. We can see the difference between the FID of the solutions experimented in this work. As expected, the results for the random search approach is unstable and worse than the others, presenting a high standard deviation. However, the difference is not big due to the limitations we impose on the experimental parameters. Experiments adding the possibility of larger networks for COEGAN should be performed to assess the capacity to outperform both the random search and DCGAN approaches by a larger margin.

Despite the inferior results when compared to COEGAN with FID as fitness, the advantage with the skill rating is that we can avoid the use of an external evaluator as in the FID calculation, represented by the Inception Network. The execution cost of the skill rating metric is also lower than the FID score. The FID score requires a high number of samples to have a good representation of the data. In our experiments, we use against on the skill rating calculation ( represents the batch size used in Eq. 6). Furthermore, the Inception Network has a complex architecture and the FID score uses slow procedures in the calculation. Skill rating uses the own neural network of individuals in the experiments, and the Glicko-2 system is fast to execute.

(a) COEGAN + Skill, Pearson: -0.8, Spearman: -0.73
(b) COEGAN + FID, Pearson: -0.54, Spearman: 0.18
(c) DCGAN-based, Pearson: 0.91, Spearman: 0.89
(d) Random search, Pearson: -0.16, Spearman: 0.02
Figure 2: Comparison between the best FID score and the respective skill rating of generators trained with the SVHN dataset.

Figure 2 shows the progression of the skill rating through generations compared with the best FID scores. We can see in COEGAN guided by skill rating a clear improvement of the rating, as this is the same function used to provide evolutionary pressure in the individuals. In the experiments of COEGAN with FID, the progress also exists but is less relevant. The random approach presented an erratic behavior of the skill rating, showing that the individuals do not improve in this approach. In the DCGAN-based experiments, the skill rating behaves differently, showing a decreasing pattern. As there is only a single discriminator and generator, the number of matches per generation is only one. Therefore, we do not meet the recommendations of the Glicko-2 system of having at least ten matches per time period and the rating is not useful for this case.

Except for the DCGAN experiments, we can also see in Figure 2 some level of correlation between the best FID score and the respective skill rating among the generators in the populations. The results demonstrated that skill rating follows the tendency of the FID score, evidencing that it can be used to guide the evolution of GANs. We computed the Pearson correlation and the Spearman rank correlation between FID and skill rating to support this analysis. We found a relevant negative correlation for the experiments with COEGAN guided by skill rating, achieving a Pearson correlation coefficient of and a Spearman rank correlation of . As FID is a distance measurement (lower is better) and skill rating is a score (high is better), the negative correlation is expected.

(a) Best FID score and the respective skill rating for COEGAN + Skill. Pearson: , Spearman:
(b) Best FID score for all solutions
Figure 3: Results for the experiments with the MNIST dataset.

We experienced high variability on the FID score in the experiments with the SVHN dataset, both for the Inception Network trained with the ImageNet and SVHN datasets. Therefore, we conduct a study using the MNIST dataset to enhance the relationship between the FID score and skill rating. We followed the same parameters presented in Table 1, but limiting the number of generations to . Figure 2(a) shows a smoother progression of skill rating and FID, illustrating a more clear relation between them, which is evidenced by the Pearson’s correlation coefficient of and the Spearman’s rank correlation of . We also show in Figure 2(b) that COEGAN guided by skill rating achieves performance similar to COEGAN guided by FID, outperforming the random search approach.

(a) Number of parameters for generators
(b) Number of parameters for discriminators
Figure 4: Average number of parameters in the neural networks of generators and discriminators at each generation. Note that the number of parameters for the DCGAN-based experiments is constant, as there is not an evolutionary algorithm applied to this case.

Figure 4 presents the average number of parameters in generators and discriminators from the experiments with the SVHN dataset. As there is no evolutionary algorithm applied to DCGAN, the number of parameters is constant. It is important to note that the average number of parameters on the individuals in the COEGAN experiments is much lower than the parameters in DCGAN. Despite this, the results of COEGAN are still better than DCGAN. Therefore, the experiments evidenced that the evolutionary algorithm applied in COEGAN was able to find more efficient models. We limited in the experimental setup the complexity and the number of layers in the genome. Experiments with an expanded setup should be conducted to assess the possibility of even better results.

(a) COEGAN with skill rating as fitness
(b) COEGAN with the FID score and loss function as fitness
Figure 5: Samples produced by the best generator after the COEGAN training.

Figure 4(b) shows samples produced by the generator after the COEGAN training with FID and skill rating as fitness. In order to achieve better quality, we trained the algorithms using batches at each generation (instead of ). We can see that the quality of the samples is similar, with both strategies presenting variability on the samples.

5 Conclusions

Generative Adversarial Networks (GANs) represented a relevant advance in generative models, producing impressive results in contexts such as the image domain. In spite of this, the training of a GAN is challenging and often requires a trial-and-error approach to achieve the desired outcome. Several strategies were used in order to improve training stability and produce better results. Proposals modified the original GAN model to introduce alternative loss functions and architectural changes. On the other hand, the use of evolutionary algorithms in the context of GANs was recently proposed. COEGAN combines neuroevolution and coevolution on the training and evolution of GANs. However, experiments identified that the fitness used in COEGAN can be improved to better guide the evolution of discriminators and generators in the populations.

We propose the use of a game rating system, based on the application of Glicko-2 introduced in [20], to design a new fitness strategy for COEGAN. Thus, we changed the fitness functions used by discriminators and generators to use the skill rating metric instead of the loss function and the FID score. We conducted experiments to evaluate this proposal and compare the results with the previous COEGAN fitness proposal, a DCGAN-based approach, and a random search model.

The results evidenced that, although the FID score as fitness provides better results, the skill rating method also contribute with useful information in the evolution of GANs. The use of COEGAN with skill rating outperforms the random search approach, demonstrating the effectiveness of this fitness function. When compared to the FID score, the advantages when using skill rating is the lower execution cost and the self-contained solution, i.e., skill rating does not need to use an external component such as in the FID score. The calculation of the FID requires a trained Inception Network, making the score highly dependent on the context where it was trained and applied. Therefore, skill rating has the potential to be used in more domains. Besides, the skill rating does not require a neural network to interpret images produced by generators. Instead, the output of the discriminator is used in the calculation, resulting in a lower execution cost when compared to the FID score. We also show that there is a correlation between the FID score and the skill rating metric when using the latter as fitness with COEGAN. However, skill rating worked better with the MNIST dataset, making this correlation more evident. The SVHN dataset is more complex and sometimes lead to disagreement between FID and skill rating. The strategy to obtain the results of matches between generators and discriminators can be improved to better represent the player’s skill.

As future work, we aim to expand the strategies evaluated in this paper regarding the use of skill rating as fitness. We will evaluate changes in the skill tournament to take into account individuals from previous generations. Besides, different strategies to calculate the outcome of matches can be used to improve the results. We will investigate the use of strategies that bring information about the variability of the samples produced by generators, in order to approximate the information provided by the FID score.


This article is based upon work from COST Action CA15140: ImAppNIO, supported by COST (European Cooperation in Science and Technology).


  • [1] A. Al-Dujaili, T. Schmiedlechner, E. Hemberg, and U. O’Reilly (2018) Towards distributed coevolutionary GANs. In AAAI 2018 Fall Symposium, Cited by: §1, §2, §4.2.
  • [2] L. M. Antonio and C. A. C. Coello (2018) Coevolutionary multiobjective evolutionary algorithms: survey of the state-of-the-art.

    IEEE Transactions on Evolutionary Computation

    22 (6), pp. 851–865.
    Cited by: §3.2.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In

    International Conference on Machine Learning

    pp. 214–223. Cited by: §1, §2.
  • [4] A. Borji (2019) Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding 179, pp. 41–65. Cited by: §1, §2.
  • [5] V. Costa, N. Lourenço, J. Correia, and P. Machado (2019) COEGAN: evaluating the coevolution effect in generative adversarial networks. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 374–382. Cited by: §1, §2, §3.1, §3, §4.1.
  • [6] V. Costa, N. Lourenço, and P. Machado (2019) Coevolution of generative adversarial networks. In International Conference on the Applications of Evolutionary Computation (Part of EvoStar), pp. 473–487. Cited by: §1, §2, §3.1, §3, §4.1.
  • [7] U. Garciarena, R. Santana, and A. Mendiburu (2018) Evolved GANs for generating pareto set approximations. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18, New York, NY, USA, pp. 434–441. Cited by: §1, §2.
  • [8] M. E. Glickman (2013) Example of the glicko-2 system. Boston University, pp. 1–6. External Links: Link Cited by: §2, §3.2, §4.1.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §1, §2.
  • [10] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769–5779. Cited by: §2, §4.1.
  • [11] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6629–6640. Cited by: §3.1.
  • [12] A. Jolicoeur-Martineau (2019) The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations, Cited by: §1, §2.
  • [13] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, Cited by: §2, §4.1.
  • [14] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [15] Y. LeCun (1998)

    The mnist database of handwritten digits

    http://yann. lecun. com/exdb/mnist/. Cited by: §4.
  • [16] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley (2017) Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2813–2821. Cited by: §1, §2.
  • [17] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, A. Navruzyan, N. Duffy, and B. Hodjat (2017) Evolving deep neural networks. arXiv preprint arXiv:1703.00548. Cited by: §3.1.
  • [18] M. Mitchell (2006) Coevolutionary learning with spatially distributed populations. Computational intelligence: principles and practice. Cited by: §3.2.
  • [19] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.
  • [20] C. Olsson, S. Bhupatiraju, T. Brown, A. Odena, and I. Goodfellow (2018) Skill rating for generative models. arXiv preprint arXiv:1808.04888. Cited by: §1, §2, §3.2, §3.2, §4.1, §5.
  • [21] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1, §2, §4.1.
  • [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §3.1.
  • [23] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242. Cited by: §2.
  • [24] K. Sims (1994) Evolving 3d morphology and behavior by competition. Artificial life 1 (4), pp. 353–372. Cited by: §2.
  • [25] K. O. Stanley and R. Miikkulainen (2002) Evolving neural networks through augmenting topologies. Evolutionary computation 10 (2), pp. 99–127. Cited by: §3.1.
  • [26] K. O. Stanley and R. Miikkulainen (2004) Competitive coevolution through evolutionary complexification.

    Journal of Artificial Intelligence Research

    21, pp. 63–100.
    Cited by: §3.1.
  • [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    pp. 1–9. Cited by: §1, §3.1.
  • [28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. Cited by: §1, §3.1.
  • [29] J. Toutouh, E. Hemberg, and U. O’Reilly (2019) Spatial evolutionary generative adversarial networks. arXiv preprint arXiv:1905.12702. Cited by: §1, §2, §4.2.
  • [30] N. Veček, M. Črepinšek, M. Mernik, and D. Hrnčič (2014) A comparison between different chess rating systems for ranking evolutionary algorithms. In 2014 Federated Conference on Computer Science and Information Systems, pp. 511–518. Cited by: §2, §3.2.
  • [31] N. Veček, M. Mernik, and M. Črepinšek (2014) A chess rating system for evolutionary algorithms: a new method for the comparison and ranking of evolutionary algorithms. Information Sciences 277, pp. 656–679. Cited by: §2, §3.2.
  • [32] C. Wang, C. Xu, X. Yao, and D. Tao (2018) Evolutionary generative adversarial networks. arXiv preprint arXiv:1803.00657. Cited by: §1, §2.
  • [33] Q. Xu, G. Huang, Y. Yuan, C. Guo, Y. Sun, F. Wu, and K. Weinberger (2018)

    An empirical study on evaluation metrics of generative adversarial networks

    arXiv preprint arXiv:1806.07755. Cited by: §1, §2.
  • [34] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §1, §2.