Forging new worlds: high-resolution synthetic galaxies with chained generative adversarial networks

11/07/2018 ∙ by Levi Fussell, et al. ∙ 6

Astronomy of the 21st century finds itself with extreme quantities of data, with most of it filtered out during capture to save on memory storage. This growth is ripe for modern technologies such as deep learning, as deep image processing techniques have the potential to allow astronomers to automatically identify, classify, segment and deblend various astronomical objects, and to aid in the calibration of shape measurements for weak lensing in cosmology through large datasets augmented with synthetic images. Since galaxies are a prime contender for such applications, we explore the use of generative adversarial networks (GANs), a class of generative models, to produce physically realistic galaxy images. By measuring the distributions of multiple physical properties, we show that images generated with our approach closely follow the distributions of real galaxies, further establishing state-of-the-art GAN architectures as a valuable tool for modern-day astronomy.




1 Introduction

Interest in using machine learning for tasks such as galaxy processing, classification, segmentation, and deblending has grown with the availability of larger galaxy datasets (Banerji et al., 2010; Huertas-Company et al., 2018; Khalifa et al., 2018; Reiman & Göhre, 2018). As these approaches become more complex and attempt to automate galaxy pre-processing, the data demands grow accordingly. Galaxy datasets such as the Galaxy Zoo data releases, as described by Lintott et al. (2011) and Willett et al. (2013), and the EFIGI catalogue by Baillard et al. (2011) are examples of large, pre-processed datasets used to train these types of deep models (Barchi et al., 2017; Domínguez Sánchez et al., 2018).

Currently, much research is invested in how to use GANs for purposes other than subjectively beautiful and realistic visuals. Strong generative models that have generalised well to a training dataset will have learned useful latent properties about the true data distribution. In the context of galaxies, this can involve latent features such as ellipticity, brightness, size, shape, and colour. Galaxy datasets can then be indefinitely enlarged by the generative model. Deep object segmentation and image classification, which require datasets on the order of 4-5 magnitudes, are examples of applications that benefit from the training with such synthetic datasets. These improved models can then be used to more accurately segment and identify galaxies from telescope data. Datasets augmented in this way also offer a solution to the need for large quantities of high-quality datasets of galaxy images in 21st-century cosmology. In the quest to differentiate between different dark energy models, reducing biases in galaxy shape measurements is of importance in weak gravitational lensing. Simulations required to calibrate shape measurement algorithms are necessary to correctly extract the lensing signal, and real galaxy images of high-enough quality are costly to obtain, which is relevant to upcoming cosmological probes such as LSST and Euclid (LSST Science Collaboration et al., 2009; Laureijs et al., 2011).

While research on the generative modeling of galaxies with machine learning is sparse in the related literature, recent efforts are starting to pave the way. Ravanbakhsh et al. (2017) train both a conditional variational autoencoder (C-VAE) and a conditional generative adversarial network (C-GAN) to create galaxy images with a resolution of

pixels based on conditioning features from the training data, for example brightness and size of galaxies. The aim of their analysis is to provide datasets for shape measurement algorithms in the cosmological probes mentioned above. The authors report that by conditioning the models on the features of a galaxy from the real dataset, they are able to successfully reproduce rich galaxy images that share similar structures with real data. To quantify the similarity of generated images to real data, the ellipticity and size distributions of the real and generated galaxies are measured. The attempt to train a C-GAN on continuous conditional variables, however, is reported to fail, and Ravanbakhsh et al. replace the discriminator network with a novel predictor network producing desirable convergence properties for the generator. In related research, but targeting another type of image relevant to cosmology, Rodriguez et al. (2018) use GAN models to generate synthetic cosmic web examples, with similar statistical evaluations of the results.

Schawinski et al. (2017) make use of a different capability of GANs to recover high-quality images with recovered features from artificially degraded galaxies. The applications for such an approach to astronomy are clear, as galaxy images from telescopes suffer from various noise sources such as background noise, atmospheric noise, and instrumental noise, which convolve and degrade the image. A generative model that is able to automatically filter this noise and reproduce a rich galaxy image can help to streamline the telescope galaxy-imaging process. The authors use a relatively small dataset of 4,105 galaxy images for training, and hence encounter difficulties with anomalous galaxies, for example a tidally warped edge-on disc galaxy, due to a lack of generalisation in their model.

In a third take on using GANs, Reiman & Göhre (2018) propose a branched GAN to deblend overlapping galaxies in compound images in the , and

bands, with each branch generating one of the two separated galaxies. Using a model based on the super-resolution GAN (SRGAN) previously introduced by Ledig et al. (2017), the authors exploit the ability of GANs to fill in missing pixels occluded by the superposition of blended galaxies. One of the advantages of GANs and similarly pre-trained architectures is the speed of the fully automated task, which offers a way to avoid the discarding of blended galaxy images due to severe blends.

The rest of this paper is structured as follows: Section 2 provides a brief outline of the general GAN model, as well as an introduction to the variations used in our experiments: the deep convolutional GAN (DCGAN) and the stacked GAN (StackGAN). Descriptions of the model architectures, training schedule, dataset, and hardware are included for reproducibility. Next, Section 3 outlines the experiments performed with the DCGAN and StackGAN architectures to qualitatively evaluate design decisions of the model architectures. In Section 4, the generated galaxies of the best-performing models, for a DCGAN producing 64x64 images and a chained DCGAN/StackGAN producing higher-resolution 128x128 images, are quantitatively evaluated. This evaluation involves measuring the ellipticity, angle from the horizon, total flux, and size as measured by the semi-major axis, for both real and generated galaxy images, thus showing that the generated image distributions closely follow the real image distributions. Finally, we provide a discussion of the results and remarks on future work in Section 5, and final conclusions in Section 6.

2 Generative adversarial networks

We provide a brief outline of the basic GAN architecture, colloquially referred to as vanilla GANs from here on, in Section 2.1. For a more extensive introduction to GANs, we refer the interested reader to Goodfellow et al. (2014) and Arjovsky & Bottou (2017). We extend this introduction by covering the specifics of deep convolutional GANs in Section 2.2, and introduce the StackGAN architecture for later experiments in Section 2.3.

2.1 Basics of GAN design

The vanilla GAN architecture consists of two neural networks, the generator and the discriminator, which have adversarial objectives. The generator’s objective is to ’trick’ the discriminator by generating fake data that is close to the real data distribution, while the discriminator’s objective is to determine if the data it is presented with is drawn from the real or fake data distribution. This can be represented by the following two-player minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}(\mathbf{x})}\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\log\left(1 - D(G(\mathbf{z}))\right)\right] \qquad (1)$$

Here, $D$ is the discriminator function, which takes as its input a data sample $\mathbf{x}$ or $G(\mathbf{z})$ and outputs the probability $D(\cdot)$ of the data sample belonging to the real data distribution. $G$ is the generator function, which takes as its input randomly sampled multivariate Gaussian noise $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{I})$, with $\boldsymbol{\mu} = \mathbf{0}$ and $\mathbf{I}$ as the identity matrix with diagonal values $1$, and outputs a data sample that is as close to the true data distribution as possible. During training, it is common to maximise the alternative objective function $\log D(G(\mathbf{z}))$ for the generator, as this approach leads to a more stable convergence, while the original objective function is still maximised for the discriminator.

Training GANs requires alternating between training the generator and training the discriminator. This allows each network to improve incrementally such that both networks seek an optimal equilibrium. Although convergence can occur in theory, in practice, GANs suffer from imbalanced player strengths, mode collapse, and oscillations, to name a few common issues.
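To make the two generator objectives more concrete, the following is a minimal sketch in plain Python. It assumes the standard formulation by Goodfellow et al. (2014); the function names and the example probabilities are purely illustrative and do not come from the experiments in this paper. The inputs are discriminator outputs, i.e. probabilities that a sample is real.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative discriminator objective: -(mean log D(x) + mean log(1 - D(G(z))))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss_minimax(d_fake):
    """Original generator objective to minimise: mean log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

def generator_loss_alternative(d_fake):
    """Alternative objective written as a loss: -mean log D(G(z))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical outputs of a confident discriminator early in training:
# real samples receive high probabilities, generated samples low ones.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.05, 0.1, 0.02]

print(discriminator_loss(d_real, d_fake))
print(generator_loss_minimax(d_fake))      # magnitude close to zero
print(generator_loss_alternative(d_fake))  # much larger magnitude
```

When the discriminator confidently rejects generated samples, the original generator term is nearly flat, whereas the alternative term stays large, which is one common explanation for its more stable behaviour.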

2.2 Deep convolutional GAN

Figure 1: DCGAN architecture employed in this paper. The top (blue) CNN is the generator that is tasked with creating realistic synthetic galaxy images from Gaussian noise, while the bottom (red) CNN is the discriminator, the purpose of which is to learn to differentiate between generated and real images, forcing the generator to learn how to produce more realistic galaxy images.

Because vanilla GANs work well for image generation, a natural extension is to use convolution layers, as first introduced by LeCun et al. (1990), which decrease the number of parameters per layer via weight sharing of convolving templates. In other words, for a hidden neuron at location $(j, k)$, an activation function $\sigma$, a shared bias $b$, an input activation $a_{x,y}$, and $N \times N$ as the number of shared weights $w_{l,m}$ for the local receptive field of size $N$, the neuron’s output is given by:

$$\sigma\left(b + \sum_{l=0}^{N-1} \sum_{m=0}^{N-1} w_{l,m}\, a_{j+l,\,k+m}\right)$$

Conveniently, these reduced-parameter templates are also location-invariant when convolved across the input, which is an essential requirement for effective object recognition. Each layer is represented by a tensor whose dimensions are given by the number of convolution templates, or channels, and the size of the convolution template, which is also referred to as a kernel. Radford et al. (2015) present the first working DCGAN. The fundamental insights in their work include the removal of all pooling layers, which are usually used in traditional convolutional neural networks (CNNs), and their replacement with more convolution layers, as well as the addition of batch normalisation layers, the use of the ReLU activation function introduced by Nair & Hinton (2010) for the generator, and the use of the LeakyReLU activation function for the discriminator (Maas et al., 2013).
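The weight-sharing computation for a single hidden neuron can be sketched as follows. This is an illustrative plain-Python version with a stride of one and no padding; the names `conv_neuron_output` and `conv_layer`, as well as the sigmoid default, are assumptions made for the example and not part of the architecture used in this paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv_neuron_output(activations, weights, bias, j, k, activation=sigmoid):
    """Output of the hidden neuron at (j, k): the same weights and bias are
    applied to the local receptive field whose top-left corner is (j, k)."""
    n = len(weights)
    total = bias
    for l in range(n):
        for m in range(n):
            total += weights[l][m] * activations[j + l][k + m]
    return activation(total)

def conv_layer(activations, weights, bias, activation=sigmoid):
    """Slide the shared template over every valid position (stride 1, no padding)."""
    n = len(weights)
    rows = len(activations) - n + 1
    cols = len(activations[0]) - n + 1
    return [
        [conv_neuron_output(activations, weights, bias, j, k, activation)
         for k in range(cols)]
        for j in range(rows)
    ]

image = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
template = [[1.0, 1.0], [1.0, 1.0]]
# Identity activation makes the weighted sums directly visible.
print(conv_layer(image, template, 0.5, activation=lambda x: x))
```

Regardless of the input size, the layer only stores the template weights and one bias, which is the parameter saving referred to above.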

64x64 DCGAN generator
  100-dimensional multivariate Gaussian input
  deconvolution layers:
    Channel size: with
    Padding:
    Stride:
    Batch size: 32
    Kernel size:
    Batch normalisation after layers
    ReLU activation function in layers
    Tanh activation function in layer

64x64 DCGAN discriminator
  convolution layers:
    Channel size: with
    Padding:
    Stride:
    Batch size: 32
    Kernel size:
    Batch normalisation after layers
    LeakyReLU activation function in layers
    Sigmoid activation function in layer

The DCGAN architecture used in this paper is similar to the original architecture outlined in Radford et al. (2015). We provide a description of the model in the framed part of this section, and a schematic representation in Figure 1.

2.3 StackGAN

The StackGAN architecture is introduced by Zhang et al. (2016), but has since been developed in further versions, for example an attention-based model by Xu et al. (2017) and StackGAN++ by Zhang et al. (2018). Because basic GAN architectures do not scale well to larger image sizes, the StackGAN architecture employs two GANs: one to generate low-resolution synthetic images as the DCGAN does, and another one to transform the synthetic images into high-resolution versions. Research on generating high-quality realistic images using GANs has recently become popular with a variety of proposed architectures (Ledig et al., 2017; Karras et al., 2017; Wang et al., 2018). This subdomain is called super resolution and differs from the StackGAN model in that realism to the human eye is targeted, without necessarily requiring data that is similar to the true data distribution.

In this paper, we use an architecture similar to StackGAN, which is defined by two GANs: the Stage-I GAN and the Stage-II GAN. The Stage-I GAN generates low-resolution images, while the Stage-II GAN converts them into higher-resolution images. Both models are trained independently. The original StackGAN architecture uses nearest-neighbours upsampling layers coupled with kernel convolutions in the Stage-I generator, and conditions the GAN on an embedded text describing the input image. In this work, the Stage-I generator is replaced by the DCGAN generator architecture as described in Section 2.2, which is used to generate lower-resolution images. For the Stage-II GAN, we use an architecture similar to the StackGAN by Zhang et al. (2016), but also incorporate elements from the architecture in Ledig et al. (2017) inspired by StackGAN. The novelties of the Stage-II generator are downsampling layers for feature extraction, residual connections to preserve low-level information of pixels, and nearest-neighbours upsampling with 3x3 convolutions to encourage the resolution growth of the image as it passes through the generator.

To encourage the generator to produce an image similar to the upscaled target real image, we also introduce a pixel-loss term into the generator objective function. We refer to this new generator objective function as the dual-objective function, which is used in a super-resolution GAN model by Ledig et al. (2017). The latter authors define one of the terms as the ’content loss’, which computes an error metric between the resolution-enhanced image and the real high-resolution image, and the second term as the ’generative loss’, which is the loss based on the discriminator’s output. Without the use of the dual-objective function, we find that, during experiments, the generator focusses on producing galaxy images relating to the galaxy types most common in the distribution. By enforcing pixel-to-pixel similarity on the upscaled image, the generator produces high-resolution images retaining rarer characteristics of the galaxies, for example spiral arms. The dual-objective function we use for the Stage-II generator is, therefore, given by:


is the usual generative loss term based on the discriminator output given the generated fake image (see equation 1), is the pixel-loss term given the real image and its generated resolution-enhanced counterpart , and . The pixel-loss term is defined as:


Here, is the sum of the red, green, and blue channel values at pixel position for an image . Following Ledig et al. (2017), we set and . Additionally, to ensure that the generator learns latent features of the input images, the first layer of the generator reduces the image size from to to enforce an information bottleneck. The architecture for the Stage-II generator is outlined in the framed part of this section.
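As a rough illustration of how a generator objective with a generative term and a pixel-loss term can be combined, consider the sketch below. Several details are assumptions made for the example: the pixel loss is taken to be the mean squared difference of the summed RGB values, the generative loss is written as the negative log of the discriminator output, and the parameters `weight_gen` and `weight_pixel` are hypothetical placeholders rather than the weights used in this paper.

```python
import math

def pixel_loss(real, generated):
    """Mean squared difference of summed RGB values per pixel.
    Images are nested lists: image[y][x] is an (r, g, b) tuple."""
    height, width = len(real), len(real[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            diff = sum(real[y][x]) - sum(generated[y][x])
            total += diff * diff
    return total / (height * width)

def generative_loss(d_fake):
    """Loss from the discriminator output for one generated image (assumed form)."""
    return -math.log(d_fake)

def dual_objective(real, generated, d_fake, weight_gen, weight_pixel):
    """Weighted sum of the generative term and the pixel-loss term."""
    return (weight_gen * generative_loss(d_fake)
            + weight_pixel * pixel_loss(real, generated))

real = [[(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]]
fake = [[(0.5, 0.5, 0.5), (0.0, 0.0, 0.0)]]
print(pixel_loss(real, fake))  # ((3.0 - 1.5)**2 + 0.0) / 2 = 1.125
print(dual_objective(real, fake, d_fake=0.5, weight_gen=1.0, weight_pixel=1.0))
```

The pixel term penalises any deviation from the real high-resolution target, so rare structures present in the target contribute to the loss even when the discriminator alone would not reward them.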

The architecture for the discriminator is identical to the DCGAN discriminator described in Section 2.2, with the exception of a training batch size of for both the generator and the discriminator, and with an additional layer in the discriminator to transform images instead of images into a single output.

Stage-II generator image input convolution layers: Kernel size: for layers , otherwise Channel size: for layer                             for layers                             for layer                             for layer Padding: for all layers Stride: for layers , otherwise Batch normalisation after layers ReLU activation function in layers Tanh activation function in layer Upsampling () before layers Residual connections: Add layer output to layer output Add layer input         to layer output

3 Experiments

In Section 3.1, we introduce the setup of our models and the construction of the dataset, followed by descriptions of a variety of DCGAN experiments in Section 3.2. These experiments cover alterations to architecture parameters, which include the kernel size and the number of convolution channels, as well as the use of batch normalisation, label smoothing, and dropout. We also present a closest-match analysis to ensure that the model does not memorise and reproduce the training data. Qualitatively, the architecture produces suitable outputs even for slight variations in most of its parameters, the exception being that batch normalisation layers are essential for the desired level of performance. The generation of higher-resolution images is presented in Section 3.3, followed by the description of a chained approach using StackGANs in Section 3.4.

3.1 Setup and dataset

The models are trained on a dataset , where is the size of the dataset and is the resolution of an image. The images are full-colour RGB galaxy images from the Galaxy Zoo 2 data release (Willett et al., 2013). This dataset is a set of images that have been centred and cropped to resolution such that a single galaxy is found at its centre. The presence of foreground stars, background noise, and extraneous galaxies in the images provides a further test for the ability of our approach. Unlike Ravanbakhsh et al. (2017), we do not crop the images by to reduce the effects of this noise in order to allow our work to be used as a more challenging measure of the capabilities of GANs in this context.

The models are trained on a single NVIDIA 1060 GTX 6GB GPU. During training, the galaxy images are flipped horizontally and vertically with probability . Mini-batches of size

are randomly sampled from the dataset, and an entire epoch is complete when all batches of the dataset have been sampled. Both networks are trained in an alternating manner with the Adam optimiser by Kingma & Ba (2015), where each network is trained for one step of gradient descent before switching to training the other network. The learning parameters and weight initialisation method are outlined in Table 1.

learning rate
weight initialisation
Table 1: Learning parameters and weight initialisation method for DCGAN and StackGAN training. For the StackGAN, the learning rate is scheduled to halve every epochs, where represents the current epoch.
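A learning rate that halves at fixed epoch intervals, as described in the caption of Table 1, can be sketched as below. The function name and the numerical values in the usage example are placeholders chosen for illustration; they are not the settings from Table 1.

```python
def halving_learning_rate(initial_rate, epoch, halve_every):
    """Step schedule: the rate is halved once per completed block of
    `halve_every` epochs, with `epoch` counted from zero."""
    return initial_rate * 0.5 ** (epoch // halve_every)

# Placeholder values, only to show the shape of the schedule.
for epoch in (0, 9, 10, 25):
    print(epoch, halving_learning_rate(0.1, epoch, halve_every=10))
```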

Since the DCGAN performance is known to drop for larger images, as shown by Salimans et al. (2016), initial experiments focus on generating 64x64 galaxy images, whereas results for generating 128x128 images are described later. The 64x64 images for real data are created by downscaling the original images using nearest-neighbour downsampling. In a subsequent step, we enhance the generated 64x64 images to a resolution of 128x128 using a StackGAN.
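Nearest-neighbour downsampling simply picks one source pixel per target pixel without any interpolation. The sketch below shows one common index mapping; the exact mapping used for the dataset in this paper is not specified, so the function should be read as an assumption-laden example rather than the actual pre-processing code.

```python
def downsample_nearest(image, new_height, new_width):
    """Nearest-neighbour resampling of a 2D image given as nested lists.
    Each target pixel copies the source pixel at the scaled-down index."""
    height, width = len(image), len(image[0])
    return [
        [image[(y * height) // new_height][(x * width) // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]

# 4x4 toy image with pixel values 0..15, reduced to 2x2.
toy = [[4 * row + col for col in range(4)] for row in range(4)]
print(downsample_nearest(toy, 2, 2))  # [[0, 2], [8, 10]]
```

Because pixels are copied rather than averaged, the method keeps original pixel values but can alias fine structures.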

3.2 DCGAN experiments

We first perform architectural experiments to explore how adjustments to the DCGAN affect the generative performance. These experiments include changing the kernel size of the convolutions, changing the number of convolutional channels, removing batch normalisation layers, adding label smoothing to the objective function, and including dropout layers. The separate experiments are described in further detail below, and the results are presented. Each experiment starts with the previously described DCGAN architecture, and adjusts the parameter that is the focus of the experiment. Evaluation of the results during these experiments is purely qualitative, and the best results during this stage are chosen for quantitative evaluation in Section 4.

3.2.1 Kernel size

Each channel of a convolutional layer represents a kernel with , which performs a weighted sum of the pixels within the kernel region as it convolves the image. More formally, if represents a kernel function applied to a pixel at location in the image, the convolution can be summarised as:


The kernel moves along the image at a per-pixel rate called the stride. A large kernel with a small stride will have more overlap with its previous position; likewise, a small kernel with a large stride will have no overlap and will skip pixels. A constant stride value of is used while exploring the use of a larger kernel of size , which means that there is more overlap in the convolution. The results are compared to a kernel model with the same stride. Both models are run for epochs, and the results are shown in Figure 2. The kernel model produces asymmetrical galaxies with evident pixelation, which is likely due to the overlapping of the kernel as it convolves the image.

Figure 2: Comparison of the effect of different kernel sizes on the resulting images. The first row (red) shows generated images for a kernel size of , the second row (blue) shows generated images for a kernel size of , and the third row (green) shows images from the Galaxy Zoo dataset. Images are randomly sampled.

3.2.2 More convolution channels

Models provided with more convolutional channels are able to learn a richer set of templates for producing images. In the following experiments, we explore channel scales of . The DCGAN architecture progressively halves the number of channels in each generator layer, and doubles the number in each discriminator layer. Therefore, the channel sizes we explore are , where with is the index of the layer, and is the number of layers in the network. Each model is trained for epochs with a batch size of . In addition, a model with a batch size of is explored. The results are shown in Figure 3.

The model with and is capable of generating spiral arms in galaxies, whereas previous models struggle with generating such structures. In summary, increasing the number of channels improves the quality of the generated galaxies, as the representational power of models with more channels is larger.

Figure 3: Comparison of the effect of different numbers of convolution channels and batch sizes on the resulting images. The first row (red) shows generated images for a channel scale of , the second row (blue) shows generated images for a channel scale of , and the third row (green) shows generated images for a channel scale of . In addition, the fourth row (purple) shows generated images for a channel scale of with a batch size of , while the fifth row (orange) shows images from the Galaxy Zoo dataset. Images are randomly sampled.

3.2.3 Batch normalisation

Batch normalisation, introduced by Ioffe & Szegedy (2015), has been shown to improve the generalisation of the generator and prevent mode collapse (Radford et al., 2015). In addition, adding batch normalisation to the discriminator helps with gradient flow. Batch normalisation is placed after the layer weights, but before the layer activation in order to normalise the input batch according to:

$$\hat{x}_i = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2}}$$

Here, $\mu_i$ represents the mean of the $i$-th component of $\mathbf{x}$, $B$ denotes the mini-batch of size $m$, and $\sigma_i^2 = \frac{1}{m} \sum_{\mathbf{x} \in B} (x_i - \mu_i)^2$. The layer then learns parameters $\gamma$ and $\beta$ via gradient descent to control the normalisation factor for the entire mini-batch via $y_i = \gamma \hat{x}_i + \beta$.
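The normalisation step can be written out directly, as in the following sketch of a training-time forward pass. It follows the formulation by Ioffe & Szegedy (2015); the small constant `eps` for numerical stability and all names are choices made for this example.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalise each feature over the mini-batch, then scale and shift.
    `batch` is a list of samples; each sample is a list of features."""
    m = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(m)]
    for i in range(dims):
        mean = sum(sample[i] for sample in batch) / m
        var = sum((sample[i] - mean) ** 2 for sample in batch) / m
        for n, sample in enumerate(batch):
            x_hat = (sample[i] - mean) / math.sqrt(var + eps)
            out[n][i] = gamma[i] * x_hat + beta[i]
    return out

# Two features on very different scales end up on the same scale.
batch = [[1.0, 10.0], [3.0, 30.0]]
print(batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0]))
```

With `gamma` set to one and `beta` to zero, each feature has approximately zero mean and unit variance over the mini-batch; the learned parameters allow the layer to undo or rescale this if that benefits training.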

We exclude the discriminator output and generator input from the addition of batch normalisation layers to avoid oscillations (Radford et al., 2015). Our results support the claim that batch normalisation improves GAN performance, as shown in Figure 4. In contrast, the model without batch normalisation between the layers lacks both colour and diversity in the range of generated images.

Figure 4: Comparison of the effect of batch normalisation on the resulting images. The first row (red) shows generated images with batch normalisation layers, the second row (blue) shows generated images without batch normalisation layers, and the third row (green) shows images from the Galaxy Zoo dataset. Images are randomly sampled.

3.2.4 Label smoothing

Label smoothing perturbs the labels of the real images fed into the discriminator by , with a common choice of . Instead of setting the label of real images to one, a label is sampled uniformly from the range . Previous research shows that the addition of label smoothing leads to a noticeable gain in GAN performance (Salimans et al., 2016). The results are shown in Figure 5.

Label smoothing appears to remove clutter in the image and reduce the effect of noise on the generator. It also seems to decrease the diversity of colour and shape, tending more of the galaxies towards less elliptical shapes. The advantage of label smoothing is, therefore, not clear.
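For readers unfamiliar with the technique, label smoothing for the real images can be sketched as below. The one-sided sampling range from 1 - epsilon to 1 and the value of `epsilon` in the usage example are assumptions for illustration; they are not necessarily the range and value used in these experiments.

```python
import random

def smoothed_real_labels(batch_size, epsilon, rng):
    """Draw one target label per real image uniformly from [1 - epsilon, 1]
    instead of using a hard label of 1 (assumed one-sided smoothing)."""
    return [rng.uniform(1.0 - epsilon, 1.0) for _ in range(batch_size)]

rng = random.Random(0)  # fixed seed so that the example is repeatable
labels = smoothed_real_labels(batch_size=4, epsilon=0.1, rng=rng)
print(labels)
```

The discriminator is then trained against these softened targets for real images, which discourages overly confident outputs.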

Figure 5: Comparison of the effect of label smoothing on the resulting images. The first row (red) shows generated images with added label smoothing, the second row (blue) shows generated images without added label smoothing, and the third row (green) shows images from the Galaxy Zoo dataset. Images are randomly sampled.

3.2.5 Dropout

A dropout layer is represented by a Bernoulli distribution $r \sim \mathrm{Bernoulli}(p)$ for a neuron in the network. With probability $p$, a neuron in the previous layer will output $0$ during training; otherwise, its output remains unchanged (Srivastava et al., 2014). During testing, the output of a neuron with unchanged output $a$ is then an expectation under the Bernoulli distribution such that:

$$\mathbb{E}_{r}\left[(1 - r)\, a\right] = (1 - p)\, a$$

A value of is used for all experiments, and a variation of dropout called spatial dropout is employed, as our networks are fully convolutional (Tompson et al., 2015). Dropout helps the generator or discriminator to generalise better and not overfit the data. The placement of dropout layers is explored by adding dropout layers between all hidden layers of the generator, all hidden layers of the discriminator, and all hidden layers of both the generator and discriminator. The results are shown in Figure 6.
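The difference between training-time and test-time behaviour, and between standard and spatial dropout, can be sketched as follows. The convention used here treats `p` as the probability of dropping a unit, matching the description above; the function names and the data layout (feature maps as nested lists) are assumptions made for the example.

```python
import random

def dropout_train(outputs, p, rng):
    """Training: each neuron output is set to 0 with probability p."""
    return [0.0 if rng.random() < p else value for value in outputs]

def dropout_test(outputs, p):
    """Testing: use the expected output under the Bernoulli mask."""
    return [(1.0 - p) * value for value in outputs]

def spatial_dropout_train(feature_maps, p, rng):
    """Spatial dropout: whole feature maps (channels) are dropped together,
    rather than individual activations."""
    dropped = []
    for feature_map in feature_maps:
        if rng.random() < p:
            dropped.append([[0.0] * len(row) for row in feature_map])
        else:
            dropped.append([list(row) for row in feature_map])
    return dropped

rng = random.Random(0)
print(dropout_train([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng))
print(dropout_test([1.0, 2.0, 3.0, 4.0], p=0.5))
```

Dropping entire channels is generally preferred in convolutional networks because neighbouring activations within a feature map are strongly correlated, so dropping them individually removes little information.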

Figure 6: Comparison of the effect of dropout on the resulting images. The first row (red) shows generated images with dropout layers in the generator, the second row (blue) shows generated images with dropout layers in the discriminator, and the third row (green) shows generated images with dropout layers in both the generator and discriminator. Images are randomly sampled.

Adding dropout layers in the generator causes poor performance, leading to visually unrealistic images. With dropout in the discriminator, the generator is still able to produce realistic galaxies, but the discriminator’s strength is decreased, meaning that it is easier for the generator to overcome the discriminator, thus making generated images less diverse. Dropout in both the generator and discriminator still results in poor performance for the generator. In general, dropout does not appear to benefit the model. Despite this result, we find that a single, carefully-managed dropout layer on the discriminator can help for larger image generation, which is further discussed in Section 3.3.

3.2.6 Closest-match analysis

To show that the model does not overfit to the data, random images are sampled from the generator, after which the closest image in the real dataset is found using the distance. For two images , the latter distance, denoted as , can be expressed as follows:


The images are cropped by from the centre before computing the distance to remove the effects of background noise. Through the difference of generated images from the closest-matching real images, we demonstrate that the model learns to create new images from a latent representation, instead of memorizing and reproducing the dataset. Figure 7 presents the results of this analysis.
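A closest-match search of this kind can be sketched in a few lines. The example assumes the Euclidean distance over pixel values, single-channel images stored as nested lists, and a hypothetical `crop_size` parameter for the central crop; none of these details should be read as the exact procedure used for Figure 7.

```python
import math

def centre_crop(image, size):
    """Return the central size x size region of a 2D image."""
    height, width = len(image), len(image[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def euclidean_distance(a, b):
    """Square root of the summed squared pixel differences."""
    return math.sqrt(sum(
        (pa - pb) ** 2
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    ))

def closest_match(generated, dataset, crop_size):
    """Index of, and distance to, the nearest real image after cropping."""
    query = centre_crop(generated, crop_size)
    distances = [
        euclidean_distance(query, centre_crop(real, crop_size))
        for real in dataset
    ]
    best = min(range(len(dataset)), key=lambda i: distances[i])
    return best, distances[best]

dataset = [[[value] * 4 for _ in range(4)] for value in (0.0, 1.0, 5.0)]
generated = [[0.9] * 4 for _ in range(4)]
print(closest_match(generated, dataset, crop_size=2))
```

A distance that stays clearly above zero for the nearest neighbour is what indicates that a generated image is not a copy of a training image.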

By comparing the brightness at the centre of the difference visualisation for pairs of images in the third column of Figure 7, it is apparent that the galaxies have not been memorised, with discrepancies in colour, shape, and brightness being present between generated images and their closest-matching real counterpart.

Figure 7: Results for a closest-match analysis to check that the generator does not simply memorise the dataset. The first row (red) shows randomly sampled generated galaxy images, the second row (blue) shows the closest matching real galaxy images from the Galaxy Zoo dataset as measured by the distance, and the third row (green) shows the absolute difference between the two images.

3.3 Generating 128x128 images

The DCGAN architecture was not originally designed to handle 128x128 images and, as mentioned previously, larger images pose a challenge for simple GAN models. Therefore, scaling up the image architecture to work with a resolution of 128x128 is not a straightforward task, and we observe that adding a sixth convolutional layer results in mode collapse due to the discriminator being too powerful. During these tests, we use two techniques to try to weaken the discriminator: dropout before the final discriminator layer and decreasing the number of channels in the discriminator.

128x128 DCGAN generator
  100-dimensional multivariate Gaussian input
  deconvolution layers:
    Channel size: with
    Padding:
    Stride:
    Batch size: 32
    Kernel size:
    Batch normalisation after layers
    ReLU activation function in layers
    Tanh activation function in layer

128x128 DCGAN discriminator
  convolution layers:
    Channel size: with
    Padding:
    Stride:
    Batch size: 32
    Kernel size:
    Batch normalisation after layers
    LeakyReLU activation function in layers
    Sigmoid activation function in layer
    Dropout in layer for epochs

The best model requires some unconventional training methods. In the framed part of this section, we describe the final DCGAN model used for generating images, which differs slightly from the DCGAN architecture described in Section 2.2.

Training with a dropout layer diminishes the discriminator’s performance early on, eventually causing mode collapse. Alternatively, training without dropout creates a discriminator that proves too powerful after epochs of training. By introducing a dropout layer at the epoch, however, the discriminator is provided with enough time to learn a good function at first, but is then weakened to allow the generator to develop more generative power. All other architectures we experiment with either mode-collapse, produce images that are too bright, or show visually obvious kernel templates throughout the model image. For the best model, the generator produces visually desirable galaxy images after epochs of training, but the background still shows clear kernel template artifacts.

As a recommendation for future research, trying an ’annealed’ dropout in which the probability of dropout starts low and gradually increments to one is an interesting alternative pathway, and a generalisation of the training method described above. The results of the final model are shown in Figure 8.
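The two dropout schedules discussed in this section, switching dropout on at a given epoch and annealing the dropout probability, can be contrasted with a small sketch. The linear ramp, the function names, and all numerical values are assumptions for illustration; an annealed schedule was not tested in this work.

```python
def delayed_dropout_probability(epoch, switch_epoch, p):
    """No dropout before `switch_epoch`, a fixed probability p afterwards."""
    return p if epoch >= switch_epoch else 0.0

def annealed_dropout_probability(epoch, total_epochs, start=0.0, end=1.0):
    """Linear ramp from `start` to `end` over `total_epochs` (assumed shape),
    clamped to the end value once training runs past `total_epochs`."""
    fraction = min(max(epoch / total_epochs, 0.0), 1.0)
    return start + (end - start) * fraction

# Placeholder epochs and probabilities, only to show the two shapes.
for epoch in (0, 50, 100, 150):
    print(epoch,
          delayed_dropout_probability(epoch, switch_epoch=100, p=0.5),
          annealed_dropout_probability(epoch, total_epochs=200))
```

The delayed schedule is a special case of a step function, whereas the annealed version weakens the discriminator gradually, which may avoid an abrupt change in the balance between the two networks.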

Figure 8: Results for the final architecture to create images with a resolution of 128x128. The first column (red) shows generated images with dropout and a 32-channel discriminator, the second column (blue) shows generated images with dropout removed after 200 epochs, and the third column (green) shows images from the Galaxy Zoo dataset. Images are randomly sampled.

Despite various trials, obtaining the same generation quality with 128x128 images as with 64x64 images proves to be an obstacle. Although the model produces realistic galaxies, its largest failure is its inability to generate realistic background representations, as shown by the lighter colouration of the backgrounds in the generated images.

3.4 StackGAN experiments

Instead of trying to use a single DCGAN to produce high-quality images, we use a second GAN, the StackGAN by Zhang et al. (2016), to enhance the resolution of realistic images obtained through the DCGAN. Two datasets are created from the original Galaxy Zoo dataset by scaling down the images to a resolution of and , respectively, using a nearest-neighbours method. We flip identical pairs of and images vertically and horizontally with a probability of for each transformation. The model is then trained for epochs and with a batch size of . While the original StackGAN paper recommends training for epochs, preliminary experiments show that training for more than epochs eventually results in mode collapse and a lack of diversity in the generated images. Input images are scaled to be between , whereas output images are scaled to be between as per Ledig et al. (2017).

The training parameters are specified in Table 1, and Figure 9 shows resolution-enhanced results for a sample of generated images from a random sample of real Galaxy Zoo images, compared to upsampled and downsampled original Galaxy Zoo images.

Figure 9: Comparison of resolution-enhanced images with a StackGAN and real images from the Galaxy Zoo dataset. The first column (red) and the second column (blue) show images from the original Galaxy Zoo dataset that are downsampled to a resolution of and , respectively. The third column (green) shows images from the Galaxy Zoo dataset with a resolution of that are upsampled from versions. The fourth column (orange) shows generated images with a resolution of conditioned on images of the same resolution from the Galaxy Zoo dataset. Images are randomly sampled.

The results demonstrate the ability of the StackGAN model to generate diverse high-resolution images from lower-resolution synthetic images, and to solve the problem of visible kernel templates described in Section 3.3. The architecture presented above, a chained combination of DCGAN and StackGAN, is therefore the final model for the generation of images. Figure 10 shows a selection of images generated with this chained model.

Figure 10: Examples of galaxy images with a resolution of created with the chained DCGAN/StackGAN model. The images are selected to highlight the model’s ability to create features such as spiral arms, as well as a variety of ways galaxies present themselves, like edge-on disc galaxies, featureless elliptical galaxies, and multiple galaxies per image.

4 Evaluation

While the generation of visually appealing synthetic galaxy images provides a reasonable proof of concept, the use of generated images in applications within astronomy requires such images to also be physically realistic. Therefore, we assess the quality of generated images by performing statistical tests on both the real and generated data. If the generated statistics closely follow the statistics derived from real data, the generated data is viable for supplementing real datasets in the domain of the measured statistical features.

In doing so, we explore four properties of the galaxy images: ellipticity, angle of elevation from the horizontal, total flux, and size as measured by the semi-major axis. In related work, Ravanbakhsh et al. (2017) test for two of these properties, ellipticity and size, using a C-VAE conditioned on the size parameter, and report that their implementation of a C-GAN produces less consistent results. This raises the question of whether our trained model can produce consistent evaluation results, and offers an opportunity to compare against state-of-the-art results. The ellipticity is defined as:


Here, and are the semi-major and semi-minor axis of the ellipse, respectively. We make use of the photutils package, which is part of the Astropy collection of astronomy-related Python packages, to fit an ellipse via isophotes of equal intensity from a predefined elliptical centre (Astropy Collaboration et al., 2013, 2018). Specifically, photutils implements methodology initially introduced by Jedrzejewski (1987) to fit measurements around trial ellipses via weighted least-squares:


Here, , , and are coefficients, and denotes the eccentric anomaly. Correction factors for the coefficients described above are computed for small errors in ellipse parameters, with major axis position , minor axis position , ellipticity , and position angle :


denotes the intensity derivative along the direction of the major axis evaluated at a semi-major axis length of . The ellipse parameter for the largest-amplitude harmonic is changed by the amount calculated in equations 11, followed by iterative sampling of the galaxy until a sufficient and constant fit to the intensity is reached. The total flux of the galaxy is then computed as the sum of the pixel values within the outermost ellipse.
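To illustrate the harmonic fit at the core of this procedure, the following sketch fits a first- and second-order expansion in the eccentric anomaly to intensities sampled along a trial ellipse, using weighted least squares. The form of the expansion follows the description in the photutils documentation. The function name `fit_harmonics`, the coefficient names, and the synthetic samples are assumptions, and the sketch omits the sampling and iterative correction steps.

```python
import numpy as np


def fit_harmonics(angles, intensities, weights=None):
    """Weighted least-squares fit of
    I(E) = I0 + A1 sin(E) + B1 cos(E) + A2 sin(2E) + B2 cos(2E).

    Returns the coefficients in the order (I0, A1, B1, A2, B2).
    """
    angles = np.asarray(angles, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    if weights is None:
        weights = np.ones_like(intensities)
    # Design matrix with one column per term of the expansion.
    design = np.column_stack([
        np.ones_like(angles),
        np.sin(angles), np.cos(angles),
        np.sin(2 * angles), np.cos(2 * angles),
    ])
    # Scaling rows by sqrt(w) turns ordinary into weighted least squares.
    root_w = np.sqrt(np.asarray(weights, dtype=float))
    coeffs, *_ = np.linalg.lstsq(design * root_w[:, None],
                                 intensities * root_w, rcond=None)
    return coeffs


# Synthetic check: samples with known first- and second-order terms.
E = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
samples = 10.0 + 0.3 * np.sin(E) - 0.2 * np.cos(2 * E)
i0, a1, b1, a2, b2 = fit_harmonics(E, samples)
```

For a trial ellipse that matches an isophote well, the harmonic amplitudes are expected to be close to zero, so the largest amplitude indicates which ellipse parameter to adjust next.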

While the ellipticity represents a relationship between both axes of an ellipse, the size represents a measure of the semi-major axis. The angle measurement is defined as the angle of elevation relative to the horizontal of the galaxy’s semi-major axis in degrees. Due to the limitations of the fitting algorithm in Astropy, all images are upscaled to a resolution of using bicubic sampling, which allows for more accurate ellipse fits, but does not alter the underlying distribution of the data. The angle of elevation is given in pixels, as the Galaxy Zoo 2 dataset is a colour composite of resolution scaled to arcseconds per pixel, which means that each image corresponds to a different angle dependent on the galaxy size. Here, denotes the radius containing of the -band Petrosian aperture flux (Willett et al., 2013). For the same reason, the total fluxes are treated as relative fluxes due to the lack of a consistent conversion from flux per pixel to flux per angle.
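Given the parameters of a fitted outermost ellipse, the four statistics can be derived as in the following sketch. It is a simplified illustration and not the photutils implementation: the function name `ellipse_statistics`, its arguments, and the convention that ellipticity equals one minus the axis ratio (a common convention for isophote fitting) are assumptions made here.

```python
import numpy as np


def ellipse_statistics(image, x0, y0, sma, smb, angle_rad):
    """Return (ellipticity, angle in degrees, relative flux, size in pixels).

    `sma` and `smb` are the semi-major and semi-minor axes in pixels, and
    `angle_rad` is the orientation of the semi-major axis relative to the
    horizontal, expressed in array coordinates.
    """
    # Assumed convention: ellipticity = 1 - b / a.
    ellipticity = 1.0 - smb / sma
    rows, cols = np.indices(image.shape[:2])
    dx, dy = cols - x0, rows - y0
    # Rotate pixel offsets into the frame aligned with the ellipse axes.
    along = dx * np.cos(angle_rad) + dy * np.sin(angle_rad)
    across = -dx * np.sin(angle_rad) + dy * np.cos(angle_rad)
    inside = (along / sma) ** 2 + (across / smb) ** 2 <= 1.0
    # Relative flux: sum of pixel values inside the outermost ellipse.
    flux = float(image[inside].sum())
    return ellipticity, float(np.degrees(angle_rad)), flux, float(sma)


# Toy example: a uniform image, so the flux equals the number of pixels inside.
image = np.ones((21, 21))
eps, angle_deg, flux, size = ellipse_statistics(image, 10, 10, 8.0, 4.0, 0.0)
```

Because the flux is a plain sum of pixel values, it is only meaningful relative to other images processed in the same way, in line with the treatment of relative fluxes above.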

Figure 11: Histograms of the evaluation metrics for synthetic galaxy images with a resolution of created with a DCGAN. The distributions for generated images and real Galaxy Zoo dataset samples, both upsampled from to , are coloured in red and blue, respectively. The upper left plot shows the distributions for the ellipticities () of the images, the upper right plot shows the distributions for angles of elevation from the horizontal in degrees (), the lower left plot shows the distributions for relative fluxes (), and the lower right plot shows the distributions for the size measured by the semi-major axis in pixels (). All plots are created with random samples of 9,000 images from both the generated images and the Galaxy Zoo dataset.


The four statistics for each galaxy image are measured for a sample of galaxy images from both the true data set and the generated data. How well the generated data incorporates key galaxy features can be measured by comparing the distribution of the statistics over the real and generated data. Evaluations are performed on the best image generative model and the best image generative model. Figure 11 shows a comparison of generated and true distributions for each of the four statistics, with generated images for a resolution of from our DCGAN being evaluated. Similarly, Figure 12 depicts the same comparison plots for a resolution of , with the upscaled images obtained from the two-stage generation process using our chained combination of DCGAN and StackGAN.
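The comparison of a single statistic across real and generated samples can be sketched as follows. The histograms on shared bins correspond to the kind of plot shown in the figures, while the two-sample Kolmogorov-Smirnov statistic is an illustrative addition and not a metric reported in this paper. The function name `compare_distributions` and the stand-in beta-distributed samples are assumptions.

```python
import numpy as np


def compare_distributions(real, generated, bins=30):
    """Histogram two samples on shared bins and compute a KS statistic."""
    real = np.asarray(real, dtype=float)
    generated = np.asarray(generated, dtype=float)
    # Shared bin edges make the two histograms directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([real, generated]), bins=bins)
    real_hist, _ = np.histogram(real, bins=edges, density=True)
    gen_hist, _ = np.histogram(generated, bins=edges, density=True)
    # Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    # empirical cumulative distribution functions.
    grid = np.sort(np.concatenate([real, generated]))
    real_cdf = np.searchsorted(np.sort(real), grid, side="right") / real.size
    gen_cdf = np.searchsorted(np.sort(generated), grid, side="right") / generated.size
    ks_statistic = float(np.max(np.abs(real_cdf - gen_cdf)))
    return edges, real_hist, gen_hist, ks_statistic


rng = np.random.default_rng(42)
# Stand-in samples; in practice these would be measured ellipticities.
real_sample = rng.beta(2.0, 5.0, size=9000)
generated_sample = rng.beta(2.0, 5.0, size=9000)
edges, real_hist, gen_hist, ks = compare_distributions(real_sample, generated_sample)
```

A statistic near zero indicates closely matching empirical distributions, and a value of one indicates samples that do not overlap at all.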

Figure 12: Histograms of the evaluation metrics for synthetic galaxy images with a resolution of created with a two-stage pipeline using a DCGAN for images and then upscaling the images to a higher resolution of with a StackGAN. The distributions for generated images and real Galaxy Zoo dataset samples, both upsampled from to , are coloured in red and blue, respectively. The upper left plot shows the distributions for the ellipticities () of the images, the upper right plot shows the distributions for angles of elevation from the horizontal in degrees (), the lower left plot shows the distributions for relative fluxes (), and the lower right plot shows the distributions for the size measured by the semi-major axis in pixels (). All plots are created with random samples of 9,000 images from both the generated images and the Galaxy Zoo dataset.


Due to the limitations of the ellipse fitting via photutils, the software fails to fit suitable ellipses for some of the generated images, and the Astropy library states that: “A well defined negative radial intensity gradient across the region being fitted is paramount for the achievement of stable solutions”. Increasing the scale of the images helps reduce the percentage of failures to fit, and we observe that the similarity of the distributions decreases as the percentage of failed fits grows. For the final results of the resolution distributions, approximately of ellipse fits failed, which is likely to be the cause of the ‘dip’ in the generated angle distribution in Figure 11, shifting mass to the extremes of the distribution. While this is primarily a failure of the distribution generation for plotting, non-uniform orientations in generated datasets can be easily fixed by randomly rotating images. In contrast, failures to fit did not occur for the resolution images generated by the StackGAN, thus correcting the dip in the angle distribution.

As can be seen in Figure 11 and Figure 12, apart from slight incongruities in the angle plot for the DCGAN and the flux plot for the final chained DCGAN/StackGAN model, the distributions of the investigated properties for the generated data closely follow the distributions for the real data, demonstrating that the model has learned an effective latent representation of galaxy features. Given that our model is not conditioned on any galaxy features, this confirms the viability of our approach for the types of applications in astronomy discussed in Section 1.

5 Discussion

One interesting finding of the experiments and their accompanying results is that a comparably simple model produces the most realistic images as well as the best evaluation results. Despite experimenting with the use of dropout and label smoothing, the original DCGAN architecture outperforms all other models on the dataset with resolution , hence supporting the case for model simplicity. This outcome is in line with recent debates on whether neural networks memorise the data they are presented with, and there are results supporting both sides (Arpit et al., 2017; Zhang et al., 2017). Springenberg et al. (2015) show that techniques such as simplifying a model’s architecture can improve test results in such a way that smaller models are competitive with state-of-the-art models of higher complexity, with similar findings being reported about adjustments in the initialisation (Glorot & Bengio, 2010; Mishkin & Matas, 2016). As described before, the dataset with a resolution of requires a customised architecture to reach the model’s best performance, but those changes do not represent a significant deviation from the DCGAN model, which generates the initial galaxy images that are then resolution-enhanced.

Our statistical evaluation of physical properties in the generated images also shows that our models are able to learn realistic latent representations of data. This requires both the discriminator and the generator to extrapolate these underlying features defining the property distributions of galaxies in our universe, only through backpropagation. We find that models tasked with directly generating larger images with a resolution of struggle primarily with filling the background around the galaxies. One solution for this case is to crop the galaxy images by as done by Ravanbakhsh et al. (2017), which removes the need for the model to learn complex background-filling techniques. Building from this pre-processing, a potential enhancement is to encourage the discriminator to focus on different sub-regions of the generated and real images by gradually cropping random sub-sections of the input data. We do, however, find that an effective approach is to train a second generator to increase the resolution of the generated images, avoiding the need for pre-processing techniques.

As discussed in Section 4, the distributions for angles of elevation from the horizontal deviate slightly from the real dataset for the DCGAN model, which can be resolved by randomly flipping generated images along the axes. In contrast, the evaluation for the chained DCGAN/StackGAN model is shown to closely follow the distributions for the real dataset, the exception being the flux evaluation due to generated galaxies being slightly brighter on average. An interesting characteristic of the StackGAN model is that it does not just enhance the resolution quality, but is also able to sometimes correct defects in the original galaxy images as can be seen in the last row of Figure 9. This model behaviour can be ascribed to the dual-objective function (see equation 3). Specifically, the generative loss term corrects real galaxy images that have artifacts which do not match the galaxies’ distribution.

Generative models that are able to create physically realistic galaxy images have many practical uses. In this work, we use Galaxy Zoo 2 images, which are compound images with the bands fed into RGB channels. Our described architectures can, however, easily be used for different bands and numbers of channels. The Galaxy Zoo datasets are hand-classified via community crowd sourcing, and hence features such as shape, merging disturbance, and irregularities are determined through a hand-built decision tree. Using this manual classification approach on the generated data would provide a more rigorous evaluation metric for the model, for example through measurements of the distributions for different classes, as well as an assessment of the ease of classification, both in comparison to the real datasets.

Another obvious use of models such as the ones presented in this paper is the creation of large datasets of high-quality galaxy images that are representative of a specific survey. Shape measurement algorithms used to detect weak lensing signals are an important part of research targeting dark matter, for example in the context of upcoming surveys like LSST and Euclid (LSST Science Collaboration et al., 2009; Laureijs et al., 2011). The process of calibrating for measurement biases relies on image simulations with a known ground truth, which requires high-quality images as an input to the simulation. For this reason, the distributions of generated data used in place of expensive observations have to closely follow the distributions of real data for properties such as ellipticity, as demonstrated in the statistical evaluation in Section 4.

Finally, a natural extension of this work is to test the effect that augmenting datasets from efforts such as Galaxy Zoo or EFIGI has on, for example, galaxy classification models. Showing that generated galaxy images improve the generalisation and test accuracy of these models motivates further research into deep learning models for astronomy.

6 Conclusion

In this paper, we show how generative modelling with GAN architectures can be used for the augmentation of smaller datasets of galaxy images. Specifically, the original DCGAN architecture proves sufficient for the generative model to create physically realistic images that closely follow the property distributions of real galaxy images when faced with statistical evaluations.

In addition, we explore the applicability and limits of common ways to optimise such models, and show that the StackGAN architecture can be used as a second-stage model in a chained DCGAN/StackGAN approach to generate synthetic galaxies with higher resolutions, circumventing the difficulties that DCGAN models experience with such resolutions. While GANs quickly spread to a variety of application areas since their introduction in 2014, our work also adds to the evidence that chaining different GAN models is a workable approach.

By demonstrating that distributions of generated galaxies closely follow the real data distributions for a variety of physical properties, we posit that generated galaxy images can be used to indefinitely augment real galaxy datasets to enlarge the number of samples from surveys. The range of evaluation metrics used in this paper shows the viability of synthetic galaxies generated in this way for learning tasks such as galaxy classification and segmentation, deblending, and the calibration of shape measurement algorithms used to investigate dark energy through weak gravitational lensing. With the presented capability to provide a data source for deep learning models that require a large number of training samples, our work demonstrates the potential of GAN architectures as a valuable tool for modern-day astronomy.


Acknowledgements

We would like to express our gratitude to the Galaxy Zoo team for creating a high-quality dataset of galaxy images well-suited for machine learning applications, the anisotropic irregularities in the early universe for making them possible in the first place, and the University of Edinburgh’s School of Informatics for supplying the necessary GPU power. We also wish to thank Romeel Davé and Joe Zuntz for helpful conversations about evaluation metrics, and Nathan Bourne for coordinating the summer research projects at the University of Edinburgh’s Institute for Astronomy during the 2017/2018 period, which laid the foundation for this paper.


  • Arjovsky & Bottou (2017) Arjovsky M., Bottou L., 2017, preprint (arXiv:1701.04862)
  • Arpit et al. (2017) Arpit D., et al., 2017, in Proceedings of the 34th International Conference on Machine Learning (ICML 2017). pp 233–242
  • Astropy Collaboration et al. (2013) Astropy Collaboration et al., 2013, A&A, 558, A33
  • Astropy Collaboration et al. (2018) Astropy Collaboration et al., 2018, AJ, 156, 123
  • Baillard et al. (2011) Baillard A., et al., 2011, A&A, 532, A74
  • Banerji et al. (2010) Banerji M., et al., 2010, MNRAS, 406, 342
  • Barchi et al. (2017) Barchi P. H., da Costa F. G., Sautter R., Moura T. C., Stalder D. H., Rosa R. R., de Carvalho R. R., 2017, J. Comput. Interdiscip. Sci., 7, 114
  • Domínguez Sánchez et al. (2018) Domínguez Sánchez H., Huertas-Company M., Bernardi M., Tuccillo D., Fischer J. L., 2018, MNRAS, 476, 3661
  • Glorot & Bengio (2010) Glorot X., Bengio Y., 2010, in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010). pp 249–256
  • Goodfellow et al. (2014) Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., 2014, in Advances in Neural Information Processing Systems 27. pp 2672–2680
  • Huertas-Company et al. (2018) Huertas-Company M., et al., 2018, ApJ, 858, 114
  • Ioffe & Szegedy (2015) Ioffe S., Szegedy C., 2015, in Proceedings of the 32nd International Conference on Machine Learning (ICML 2015). pp 448–456
  • Jedrzejewski (1987) Jedrzejewski R. I., 1987, MNRAS, 226, 747
  • Karras et al. (2017) Karras T., Aila T., Laine S., Lehtinen J., 2017, preprint (arXiv:1710.10196)
  • Khalifa et al. (2018) Khalifa N. E., Hamed Tah M., Hassanien A. E., Selim I., 2018, in 2018 International Conference on Computing Sciences and Engineering (ICCSE). pp 1–6
  • Kingma & Ba (2015) Kingma D. P., Ba J., 2015, in Proceedings of the 3rd International Conference for Learning Representations (ICLR 2015).
  • LSST Science Collaboration et al. (2009) LSST Science Collaboration et al., 2009, preprint (arXiv:0912.0201)
  • Laureijs et al. (2011) Laureijs R., et al., 2011, preprint (arXiv:1110.3193)
  • LeCun et al. (1990) LeCun Y., Boser B. E., Denker J. S., Henderson D., Howard R. E., Hubbard W. E., Jackel L. D., 1990, in Advances in Neural Information Processing Systems 2. pp 396–404
  • Ledig et al. (2017) Ledig C., et al., 2017, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp 105–114
  • Lintott et al. (2011) Lintott C., et al., 2011, MNRAS, 410, 166
  • Maase et al. (2013) Maase A. L., Hannun A. L., Ng A. Y., 2013, in Proceedings of the 30th International Conference on Machine Learning (ICML 2013).
  • Mishkin & Matas (2016) Mishkin D., Matas J., 2016, in Proceedings of the 4th International Conference on Learning Representations (ICLR 2016).
  • Nair & Hinton (2010) Nair V., Hinton G., 2010, in Proceedings of the 27th International Conference on Machine Learning (ICML 2010). pp 807–814
  • Radford et al. (2015) Radford A., Metz L., Chintala S., 2015, preprint (arXiv:1511.06434)
  • Ravanbakhsh et al. (2017) Ravanbakhsh S., Lanusse F., Mandelbaum R., Schneider J., Poczos B., 2017, in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). pp 1488–1494
  • Reiman & Göhre (2018) Reiman D. M., Göhre B. E., 2018, preprint (arXiv:1810.10098)
  • Rodriguez et al. (2018) Rodriguez A. C., Kacprzak T., Lucchi A., Amara A., Sgier R., Fluri J., Hofmann T., Réfrégier A., 2018, preprint (arXiv:1801.09070)
  • Salimans et al. (2016) Salimans T., Goodfellow I., Zaremba W., Cheung V., Radford A., Chen X., 2016, in Advances in Neural Information Processing Systems 29. pp 2234–2242
  • Schawinski et al. (2017) Schawinski K., Zhang C., Zhang H., Fowler L., Santhanam G. K., 2017, MNRAS, 467, L110
  • Springenberg et al. (2015) Springenberg J. T., Dosovitskiy A., Brox T., Riedmiller M., 2015, in Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).
  • Srivastava et al. (2014) Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., 2014, J. Mach. Learn. Res., 15, 1929
  • Tompson et al. (2015) Tompson J., Goroshin R., Jain A., LeCun Y., Bregler C., 2015, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp 648–656
  • Wang et al. (2018) Wang X., et al., 2018, preprint (arXiv:1809.00219)
  • Willett et al. (2013) Willett K. W., et al., 2013, MNRAS, 435, 2835
  • Xu et al. (2017) Xu T., Zhang P., Huang Q., Zhang H., Gan Z., Huang X., He X., 2017, preprint (arXiv:1711.10485)
  • Zhang et al. (2016) Zhang H., Xu T., Li H., Zhang S., Wang X., Huang X., Metaxas D., 2016, in 2017 IEEE International Conference on Computer Vision (ICCV). pp 5908–5916
  • Zhang et al. (2017) Zhang C., Bengio S., Hardt M., Recht B., Vinyals O., 2017, in Proceedings of the 5th International Conference on Learning Representations (ICLR 2017).
  • Zhang et al. (2018) Zhang H., Xu T., Li H., Zhang S., Wang X., Huang X., Metaxas D. N., 2018, IEEE Trans. Pattern Anal. Mach. Intell.