Investigating the reasons behind the accelerated expansion of the universe is one of the main challenges in astronomy and modern cosmology. Future space missions, such as Euclid, will provide images of billions of galaxies in order to investigate the so-called dark matter and probe the geometry of the universe through the gravitational lensing effect. Due to the very large volumes of data provided by such missions, automated algorithms are needed for measurement and detection purposes. The training and calibration of such algorithms require simulated, or synthetic, images of galaxies that mimic real observations and exhibit realistic morphologies.
In the case of weak lensing, for instance, the accuracy of shape measurement algorithms is very sensitive to any statistical bias induced by the Point Spread Function (PSF). Therefore, simulated images of galaxies with known ground-truth lensing are required to calibrate the algorithms and detect any potential bias in the ensemble averages. Moreover, the training of automated strong-lensing detectors, such as deep learning architectures (lensFinding2019), requires simulated images in order to mitigate class imbalance and reduce false-positive errors on the current datasets.
2 Model-Driven vs. Data-Driven Galaxy Image Simulation
Current approaches to simulating galaxy images in the cosmology literature are mostly model-driven, or rule-based. These might involve the fitting of parametric analytic profiles (size, ellipticity, brightness, etc.) to the observed galaxies. This approach is usually unable to reproduce all the complex morphologies observed in real galaxies. An alternative, more expensive and often infeasible, model-driven approach is to start with high-quality galaxy images as the input of the simulation pipeline, followed by a model that reproduces all the data acquisition effects (galsim2015).
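To make the parametric-profile approach concrete, the following is a minimal sketch of rendering a galaxy from an analytic Sérsic profile, a standard choice for such fits. All parameter values (effective radius, Sérsic index, ellipticity, angle) are illustrative, and the common approximation b_n ≈ 2n − 1/3 is used:

```python
import numpy as np

def sersic_profile(size=64, amplitude=1.0, r_eff=10.0, n=4.0, ellip=0.3, theta=0.5):
    """Render a Sersic surface-brightness profile on a square pixel grid.

    I(r) = amplitude * exp(-b_n * ((r / r_eff)**(1/n) - 1)),
    with the widely used approximation b_n ~ 2n - 1/3.
    """
    y, x = np.mgrid[:size, :size] - (size - 1) / 2.0
    # Rotate and stretch the coordinates to produce elliptical isophotes.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    r = np.sqrt(x_r**2 + (y_r / (1.0 - ellip))**2)
    b_n = 2.0 * n - 1.0 / 3.0
    return amplitude * np.exp(-b_n * ((r / r_eff)**(1.0 / n) - 1.0))

img = sersic_profile()
```

Such a smooth, symmetric profile illustrates the limitation noted above: it cannot capture spiral arms, clumps, or irregular morphologies.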
Recently, several data-driven approaches have been investigated in order to generate synthetic images of galaxies via generative models used in machine learning (celeste2015; EnablingDE2016), mainly the variational autoencoder (VAE) (Kingma2013AutoEncodingVB) and the generative adversarial network (GAN) (GAN2014). Such approaches have shown some promising preliminary results in generating galaxy images. Following this data-driven approach, and motivated by the success and recent impressive improvements in GANs, we have further investigated the use of such architectures in generating galaxy images.
3 Generative Adversarial Network
Unlike most of the generative models used in machine learning, GAN represents a novel approach that learns how to sample from the data distribution without explicitly tracking the parameters of the probability distribution function via traditional maximum likelihood estimation. The GAN architecture consists of two neural networks that compete against each other in a two-player minimax game. The first network is the "generator", responsible for generating the data, while the second network is the "discriminator", which represents the adversarial loss function. Despite its elegant mathematical formulation and the theoretical guarantees provided by a non-parametric analysis, the initial GAN architecture suffered from some practical implementation problems.
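The two-player minimax game of the original formulation (GAN2014) can be written as

```latex
\min_{G}\,\max_{D}\; V(D, G) \;=\;
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
\;+\;
\mathbb{E}_{z \sim p_{z}(z)}\!\left[\log\!\big(1 - D(G(z))\big)\right],
```

where the discriminator $D$ maximizes its ability to distinguish real samples $x$ from generated samples $G(z)$, while the generator $G$ minimizes the same objective.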
After the invention of GAN in 2014, a great deal of work has been done to improve the training (in terms of convergence and stability) and to obtain more realistic generated data (in terms of quality and diversity). Most of this effort went into improving the cost function and stabilizing the training methodology, which has recently led to unprecedented results in generating synthetic images. Based on these recent advances, we have investigated variants of GAN that use the Wasserstein distance (Wasserstein2017) and progressive training (karras2018progressive) on galaxy images provided by the Galaxy-Zoo dataset (galaxyZoo).
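The Wasserstein variant with gradient penalty (Wasserstein2017) replaces the original objective by the critic loss

```latex
L \;=\;
\mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\!\left[D(\tilde{x})\right]
\;-\;
\mathbb{E}_{x \sim \mathbb{P}_r}\!\left[D(x)\right]
\;+\;
\lambda\,
\mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\!\left[
\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2
\right],
```

where $\mathbb{P}_r$ and $\mathbb{P}_g$ are the real and generated distributions, and $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples; the penalty term enforces the 1-Lipschitz constraint required by the Wasserstein formulation.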
4 Proposed Architecture
Following (karras2018progressive), we employ blocks of convolutional layers to progressively build the generator and the discriminator as mirror images of each other (see Table 1). Intuitively speaking, training a small network to generate low-resolution images that capture the large-scale structure of the galaxies is an easier task than directly training a full network to generate high-resolution images with fine details. Hence, we start by training the network to generate low-resolution images; we then progressively increase the resolution, in steps up to the full output resolution, by smoothly and synchronously adding blocks of convolutional layers to both the generator and discriminator. For the generator, each progress block is preceded by an up-sampling operation, while a down-sampling operation follows each progress block in the discriminator (one can also use fractionally-strided and strided convolutions, respectively). Such a methodology leads to more stable and faster training.
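The "smooth" addition of a new block in (karras2018progressive) is done by fading it in: the new block's high-resolution output is linearly blended with an upsampled copy of the previous stage's output, with a blending weight that ramps from 0 to 1 during training. A minimal NumPy sketch of this fade-in step (function names and the nearest-neighbour upsampling stand-in are illustrative):

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling, standing in for the generator's
    up-sampling operation."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def fade_in(prev_stage_out, new_block_out, alpha):
    """Blend the upsampled output of the already-trained stage with the
    newly added block's output; alpha ramps from 0 to 1 while the new
    layers fade in."""
    return (1.0 - alpha) * upsample2x(prev_stage_out) + alpha * new_block_out

prev_out = np.random.rand(4, 4)   # output of the small, already-trained stage
new_out = np.random.rand(8, 8)    # output of the freshly added block
blended = fade_in(prev_out, new_out, alpha=0.3)
```

At alpha = 0 the network behaves exactly like the previous, lower-resolution stage; at alpha = 1 the new block has fully taken over.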
Moreover, the Wasserstein distance with gradient penalty (Wasserstein2017) is used as the cost function to mitigate gradient problems. Furthermore, various normalization techniques are used to avoid unhealthy competition between the generator and the discriminator; in particular, we use "weight scaling" and "pixelwise feature normalization" as done in (He2015; AlexNet2012). In addition, the "mini-batch standard deviation" (karras2018progressive) is computed and appended as an extra feature map in the discriminator in order to favor diversity in the synthetic data.
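The mini-batch standard-deviation feature can be sketched as follows: the standard deviation of every feature is computed over the batch, averaged into a single scalar, and tiled into one extra channel that the discriminator can use to detect a collapse in diversity. A NumPy sketch (array layout and function name are illustrative):

```python
import numpy as np

def minibatch_stddev_feature(batch):
    """Append the minibatch standard-deviation map of (karras2018progressive).

    batch: array of shape (N, C, H, W).
    Returns an array of shape (N, C + 1, H, W).
    """
    std_per_feature = batch.std(axis=0)           # (C, H, W): spread over the batch
    avg_std = std_per_feature.mean()              # one scalar summarising diversity
    n, _, h, w = batch.shape
    extra_map = np.full((n, 1, h, w), avg_std)    # tiled as one new channel
    return np.concatenate([batch, extra_map], axis=1)

x = np.random.rand(8, 3, 4, 4)
y = minibatch_stddev_feature(x)
```

If the generator collapses to near-identical samples, the extra channel goes to zero, which the discriminator can easily penalize.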
Our architecture is implemented in Python using the PyTorch library and trained on a GPU system. The dataset consists of 6157 RGB images of galaxies. The images were centered at a fixed resolution, normalized, and augmented using standard data-augmentation techniques. Training used mini-batches loaded with several data-loading workers.
The training was performed over a fixed number of epochs and completed in a matter of hours. During the first epochs of training, the generator and discriminator competed to reach the minimax equilibrium and their loss functions fluctuated. The performance stabilized after that, while the image quality continued to improve. After training, the discriminator, which plays the role of an adaptive loss function, is detached from the architecture and dismissed. The generator is then able to generate galaxy images starting from a latent space of i.i.d. standard Gaussian random variables.
By changing the latent vector, we were able to obtain very diverse, high-quality images of galaxies showing complex structures and morphologies (e.g. arm and disk features). Furthermore, the simulated images exhibited realistic effects (e.g. companion stars), as shown in Figure 1.
5 Future Work
We are planning to investigate the latent space of our GAN model in order to gain insight into the effect of each latent variable on galaxy morphology. This will provide us with more control over the generation task and will permit interpolation between variables and latent-space arithmetic. Furthermore, we are planning to incorporate the labels of the galaxies, when available, in a supervised or semi-supervised approach using variants of "Conditional GAN" architectures (Odena2016ConditionalIS), in order to improve the quality of the generated images and guide the generator.
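The latent-space interpolation mentioned above can be sketched as traversing a path between two latent vectors and decoding each intermediate point; spherical interpolation is often preferred over linear interpolation for Gaussian latent spaces, since intermediate points keep a typical norm. The latent dimension and function names below are illustrative:

```python
import numpy as np

def lerp(z0, z1, t):
    """Linear interpolation between two latent vectors."""
    return (1.0 - t) * z0 + t * z1

def slerp(z0, z1, t):
    """Spherical interpolation along the great circle between z0 and z1."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(z0, z1, t)  # vectors are (anti)parallel: fall back to lerp
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z_a = rng.standard_normal(512)   # latent dimension chosen for illustration
z_b = rng.standard_normal(512)
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 8)]
```

Feeding each vector of `path` to the trained generator would produce a smooth morphing between two galaxy morphologies.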
M. Dia and E. Savary would like to acknowledge funding from the SNSF (grant number 173716).