Unsupervised learning and data clustering for the construction of Galaxy Catalogs in the Dark Energy Survey

12/05/2018 ∙ by Asad Khan, et al.

Large scale astronomical surveys continue to increase their depth and scale, providing new opportunities to observe large numbers of celestial objects with ever increasing precision. At the same time, the sheer scale of ongoing and future surveys poses formidable challenges for classifying astronomical objects. Pioneering efforts on this front include the citizen science approach adopted by the Sloan Digital Sky Survey (SDSS). These SDSS datasets have recently been used to train neural network models to classify galaxies in the Dark Energy Survey (DES) that overlap the footprint of both surveys. While this represents a significant step toward classifying unlabeled images of astrophysical objects in DES, the key issue still remains: the classification of unlabelled DES galaxies that have not been observed in previous surveys. To start addressing this timely and pressing matter, we demonstrate that knowledge from deep learning algorithms trained with real-object images can be transferred to classify elliptical and spiral galaxies that overlap both the SDSS and DES surveys, achieving state-of-the-art accuracy of 99.6%. To initiate the characterization of unlabelled DES galaxies that have not been observed in previous surveys, we demonstrate that our neural network model can also be used for unsupervised clustering, grouping unlabeled DES galaxies into spiral and elliptical types. We showcase the application of this novel approach by classifying over ten thousand unlabelled DES galaxies into spiral and elliptical classes. We conclude by showing that unsupervised clustering can be combined with recursive training to start creating large-scale DES galaxy catalogs in preparation for the Large Synoptic Survey Telescope era.







I Introduction

Large scale astronomical surveys provide key insights into the large scale structure of the Universe, its geometry and evolution in cosmic time. As the depth and scale of these surveys continue to increase in years to come, they will push back the frontiers of our understanding of dark matter and dark energy Riess et al. (1998); Perlmutter et al. (1999); Tonry et al. (2003); Knop et al. (2003).

One of the best observational probes to study the expansion of the universe, and thereby of dark energy, is through the observation of distant Type Ia supernovae. By observing thousands of these celestial objects, it may be possible to map the distance-redshift curve over a wide range of redshifts with unprecedented accuracy Drell et al. (2000).

In addition to this purely electromagnetic probe, the combination of gravitational wave observations with large scale galaxy catalogs has enabled the first gravitational wave standard-siren measurement of the Hubble constant Abbott et al. (2017). This approach, envisioned by Schutz Schutz (1986), has also been explored without assuming the existence of an electromagnetic counterpart Fishbach et al. (2018). This latter approach now opens up the way to using a large number of ground-based gravitational wave observations of binary black hole mergers, which may not have an electromagnetic counterpart, to enable precision gravitational wave cosmological measurements of the Hubble constant up to redshift . Similar observations at higher redshifts may be possible in the context of supermassive black hole mergers with space-borne gravitational wave missions Holz and Hughes (2005). Since the gravitational wave observation of binary black hole mergers has now become a common occurrence The LIGO Scientific Collaboration and the Virgo Collaboration (2018), the next frontier to enable this science is the construction of galaxy catalogs at higher redshifts.

To realize this science, we need to address an outstanding data science challenge regarding the sheer scale of ongoing and future surveys. For instance, the Sloan Digital Sky Survey (SDSS) Eisenstein et al. (2011) observed hundreds of thousands of galaxies, which were classified through a remarkably successful citizen science program. Follow-up analyses using deep learning to classify these celestial objects based on their morphology have also reported excellent results Domínguez Sánchez et al. (2018a).

In the context of the Dark Energy Survey (DES) Dark Energy Survey Collaboration et al. (2016), we now have a unique opportunity to innovate by creating new types of signal processing tools adequate for classifying hundreds of millions of unlabeled galaxies. For instance, in Domínguez Sánchez et al. (2018b) the authors designed a neural network model and trained it with SDSS galaxy images that overlap the footprint of both SDSS and DES, reporting high classification accuracies.

In this article, we present novel methods to further advance the goal of creating large-scale galaxy catalogs, using DES as a driver for these studies.

  • We demonstrate that knowledge from deep learning algorithms, trained for real-world object recognition, can be transferred to classify SDSS galaxies into spiral and elliptical classes with state-of-the-art accuracies.

  • We use the aforementioned deep-transfer-learning, SDSS-seeded neural network model to demonstrate that unlabelled DES galaxies that overlap the footprint of SDSS are correctly classified into spiral and elliptical classes with state-of-the-art accuracies.


  • We introduce the first application of unsupervised learning and data clustering to classify over ten thousand unlabeled DES galaxies that have not been observed in previous surveys into spiral and elliptical classes.

The method we introduce herein can be combined with recursive training to produce large-scale galaxy catalogs in DES: once unlabelled DES galaxies are clustered into spiral and elliptical classes, these newly labelled datasets can be used to retrain the original deep learning algorithm, boosting its accuracy and robustness for classifying unlabeled DES galaxies in bulk in new regions of parameter space, as we demonstrate in Section III.
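As a concrete illustration, the recursive-training loop described above might be organized as follows. This is a sketch under stated assumptions: `train_model` and `predict_proba` are hypothetical stand-ins for the actual training and inference pipeline, and using distance from the 0.5 decision boundary as the confidence criterion is one plausible choice, not necessarily the paper's.

```python
def recursive_training(labeled, unlabeled, train_model, predict_proba,
                       n_top=1000, n_rounds=1):
    """Sketch of the recursive-training scheme: classify unlabeled galaxies,
    promote the most confident predictions into the training set, retrain.

    labeled   : list of (image, label) pairs
    unlabeled : list of images
    """
    model = train_model(labeled)
    for _ in range(n_rounds):
        # Score every unlabeled galaxy; p is the (hypothetical) spiral probability.
        scored = [(img, predict_proba(model, img)) for img in unlabeled]
        # Confidence = distance from the 0.5 decision boundary (an assumption).
        scored.sort(key=lambda t: abs(t[1] - 0.5), reverse=True)
        # Promote the n_top most confident predictions with their predicted labels.
        promoted = [(img, int(p > 0.5)) for img, p in scored[:n_top]]
        labeled = labeled + promoted
        unlabeled = [img for img, _ in scored[n_top:]]
        model = train_model(labeled)  # retrain on the enlarged set
    return model, labeled
```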

To address the case of high-redshift galaxy classification, one may use data augmentation to prepare datasets that resemble galaxies at higher redshifts by contaminating them with noise anomalies. Thereafter, one can design and train a neural network that first denoises images, and upon identifying potential candidate galaxies, it clusters them according to their features George et al. (2018). This approach lays the foundations to exploit transfer learning, unsupervised clustering and recursive training to produce large-scale galaxy catalogs in the Large Synoptic Survey Telescope (LSST) LSST Dark Energy Science Collaboration (2012).

This paper is organized as follows. Section II presents the approach followed to curate the datasets and deep learning algorithms designed and trained for our analyses. In section III, we demonstrate the applicability of our methods to classify galaxies in SDSS, galaxies that overlap SDSS and DES, and finally, the applicability of our approach to correctly classify thousands of unlabelled DES galaxies through unsupervised clustering. Finally, section IV summarizes our findings and future directions of work.

II Methods

In this section we describe the SDSS and DES datasets used to train and test our deep learning algorithms. We also describe the design and construction of the neural network models used for unsupervised learning and data clustering to classify unlabelled DES galaxy images.

II.1 Data Curation for SDSS and DES

We use a subset of SDSS Data Release (DR) 7 images for which we have high-confidence classifications through the Galaxy Zoo project. We then divide these images into three orthogonal datasets for training, validation and testing. The validation set is used to monitor the accuracy and loss while training and fine-tuning our deep neural network. The test set is carefully constructed so that each image lies in both the SDSS and the DES footprints; the DES counterparts of these test images were obtained from the DES DR1 data release. We have labelled these image datasets High Probability (HP) Test Sets, and there are two versions, one for each survey, i.e., HP SDSS and HP DES.

Furthermore, we created a second test set consisting of almost all galaxies that lie in both SDSS and DES footprints, this time without imposing any threshold on the Galaxy Zoo classification confidence. These datasets are labelled Full Overlap (FO) Test Sets, and again there are two versions, i.e., FO SDSS and FO DES.

The properties of these datasets are summarized in Table 1, while their probability distributions are presented in Fig. 1. A sample of the training SDSS dataset and the HP Test Set images are presented in the top and bottom panels of Fig. 2, respectively. Notice that the probability cutoffs differ in order to obtain similar numbers of spiral and elliptical SDSS galaxies for training.

Dataset Spirals Ellipticals
Training set 18,352 18,268
HP SDSS Test Set 516 550
HP DES    Test Set 516 550
FO SDSS Test Set 6,677 5,904
FO DES    Test Set 6,677 5,904
Table 1: Summary of each dataset.

SDSS Dataset We used the de-biased probabilities for the elliptical and combined spiral classes described in Table 2 of Lintott et al. (2011) to create labels for the two classes of our training and test sets. After selecting the OBJIDs from Table 2 based on probability thresholds of 0.985 and 0.926 for spirals and ellipticals, respectively, we submit SQL queries to the SDSS Skyserver SDSS (2018) to obtain g, r and z-band images and metadata from the PhotoObj table. Thereafter, each galaxy is 'cut out' from the downloaded FITS files for each band.

Bearing in mind that the neural network model we use for transfer learning (the Xception model Chollet (2016), pre-trained on the 2014 ImageNet dataset Russakovsky et al. (2014)) was originally trained with fixed-size images, we resized all the galaxy sub-images accordingly using the scikit-image library van der Walt et al. (2014), and then stacked the three filters together to create a color image. Finally, these sub-images are mean-subtracted and normalized to convert the pixel values to the range -1 to 1, centered around 0, following best practices of neural network training Andrej Karpathy (2018). These curated datasets serve as the input tensor for our deep neural network model.
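The final normalization step can be sketched in plain Python. The exact convention used here (scaling by the peak absolute value after mean subtraction) is an assumption, one of several common ways to map pixel values into the [-1, 1] range:

```python
def normalize_image(pixels):
    """Mean-subtract a flat list of pixel values and rescale to [-1, 1],
    centered around 0, before feeding galaxy cut-outs to the network.

    Scaling by the peak absolute value is an illustrative choice; other
    conventions (e.g. dividing by the standard deviation) are also common.
    """
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    peak = max(abs(p) for p in centered) or 1.0  # guard against all-equal input
    return [p / peak for p in centered]
```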

We developed all of these data-download and preprocessing scripts as an open-source Python software stack. To facilitate and streamline these tasks at scale, we incorporated the Message Passing Interface (MPI) Gropp et al. (1999) to exploit multiple nodes on supercomputers for fast parallel computation. In our case, the data extraction and curation were done on the Blue Waters supercomputer Kramer et al. (2015).

DES Dataset The same steps are repeated to first select the DES DR1 metadata and images from the NCSA DESaccess web interface DES (2018), and then to cut out, preprocess and stack the filters together to create a Lupton RGB image using the Astropy branch of John Parejko (2018). Additionally, the Astropy method match_to_catalog_sky is used to crossmatch the DES and SDSS catalogues to within 1 arcsec. Finally, we pick a random sample of bright DES galaxies to quantify the clustering performance of our neural network model.
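The 1-arcsec crossmatch performed by `match_to_catalog_sky` can be illustrated with a pure-Python sketch of the underlying angular-separation calculation (the Vincenty formula, which is numerically stable at small separations). The greedy matcher below is a simplified stand-in for Astropy's nearest-neighbour search, for illustration only:

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation between two sky positions: degrees in, arcsec out."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    dra = ra2 - ra1
    num = math.hypot(
        math.cos(dec2) * math.sin(dra),
        math.cos(dec1) * math.sin(dec2)
        - math.sin(dec1) * math.cos(dec2) * math.cos(dra),
    )
    den = (math.sin(dec1) * math.sin(dec2)
           + math.cos(dec1) * math.cos(dec2) * math.cos(dra))
    return math.degrees(math.atan2(num, den)) * 3600.0

def crossmatch(cat1, cat2, tol_arcsec=1.0):
    """Match each (ra, dec) in cat1 to its nearest neighbour in cat2,
    keeping only pairs closer than tol_arcsec. Returns index pairs."""
    matches = []
    for i, (ra, dec) in enumerate(cat1):
        seps = [angular_sep_arcsec(ra, dec, r2, d2) for r2, d2 in cat2]
        j = min(range(len(cat2)), key=seps.__getitem__)
        if seps[j] <= tol_arcsec:
            matches.append((i, j))
    return matches
```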

Figure 1: Violin Plots of Galaxy Zoo Probability Distributions for galaxies in each dataset.

Figure 2: Top panels: labelled images of the SDSS training set. Bottom panels: sample of galaxies from SDSS-DR7 and the corresponding crossmatched galaxies from DES DR1.

II.2 Deep Learning: Model and Methodology

For the classification problem, we perform transfer learning starting from the Xception model Chollet (2016), which has been pre-trained on the ImageNet Russakovsky et al. (2014) dataset. We choose this neural network model because it outperforms other state-of-the-art neural network models, including Inception-v3 Szegedy et al. (2015), ResNet-152 He et al. (2015) and VGG16 Simonyan and Zisserman (2014). More importantly, we carried out several experiments using all of these architectures and found that Xception exhibits the best performance on our validation and testing galaxy datasets. The deep learning APIs used are Keras Keras (2018) and TensorFlow Abadi et al. (2016).

For training, we first extract the bottleneck features of our training set for one or two epochs and feed them into a few custom fully connected layers added at the end of the pre-trained model (see Figure 6 in Appendix A). Then we progressively unfreeze the earlier layers of the whole network and fine-tune their weights over a few epochs of training. The rationale behind this approach is that the earlier layers of a trained network are versatile filters able to pick up simple abstract features, such as lines and edges, relevant to any image detection or classification problem. Deeper into the network, however, the weights of the layers become less interpretable and more specific to the problem at hand. Hence, by training the last layers first and then progressively fine-tuning the earlier layers, we ensure that the useful weights learned on millions of ImageNet Deng et al. (2009) images are not destroyed while the neural network adapts to the galaxy classification problem.
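The progressive unfreezing described above can be sketched framework-agnostically; in Keras it corresponds to setting `layer.trainable = True` on the chosen layers and recompiling between stages. The equal-sized-block schedule below is an illustrative assumption, not the paper's exact recipe:

```python
def unfreeze_schedule(n_layers, stages):
    """Return one list of trainable flags per fine-tuning stage.

    Stage 0 keeps the whole backbone frozen (only the new head trains);
    each later stage unfreezes an additional block of layers, working
    backwards from the output, until the full network is trainable.
    """
    schedules = []
    for s in range(stages + 1):
        # Layers at index >= cutoff are trainable at stage s.
        cutoff = n_layers - round(s * n_layers / stages) if stages else n_layers
        schedules.append([i >= cutoff for i in range(n_layers)])
    return schedules
```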

We train the network using Tesla P100 GPUs on XSEDE (Bridges) XSEDE (2018). The training process for the dataset of 36,500 images completes within 3 hours. We use categorical cross-entropy as the loss function, together with the Adam optimizer Kingma and Ba (2014). To avoid over-fitting, we monitor both training and validation losses, add a dropout rate of 70% between our fully connected layers, and use early stopping, i.e., we stop training once the validation loss stops decreasing. Additionally, we use a learning-rate scheduler, reducing the learning rate when the training loss stops decreasing in order to perform a finer-grained search for the loss function's minima, and we apply data augmentation, using random flips, rotations, zooms and shifts (see Figure 7 in Appendix B). After training, all the weights are frozen and saved, and inference on about 10,000 test images completes within 10 minutes on a single Tesla P100 GPU.
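The early-stopping criterion mentioned above can be sketched as a small monitor class; the `patience` and `min_delta` values here are illustrative defaults, not the settings used for the actual training run:

```python
class EarlyStopping:
    """Stop training once validation loss stops improving, to guard
    against over-fitting. `patience` counts consecutive non-improving
    epochs; `min_delta` is the minimum improvement that counts."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0  # new best: reset counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```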

The last layer of the network has two softmax nodes, which output the probability that the input image belongs to each class. While these probabilities can be tested directly for the crossmatched DES sets by comparing against the SDSS Galaxy Zoo probabilities, for the rest of the unlabelled DES images this is not possible. Given that for large-scale galaxy catalogs it would be unfeasible to inspect individual images to determine their class, we propose to use the neural network as a feature extractor. In practice, we use the nodes of the second-to-last layer of the network to determine which combination of nodes is activated for each galaxy type. In this approach, the activation vectors of this layer form two distinct clusters, one for each galaxy type, in a 1024-D space.

In order to visualize these 1024-D clusters, we embed them into a 3-D parameter space using the sklearn implementation of t-Distributed Stochastic Neighbor Embedding (t-SNE) van der Maaten and Hinton (2008). For the HP SDSS and HP DES test sets, we label the points using the ground-truth label of each galaxy, and find that the points cluster neatly into two groups with high accuracies.

For the unlabelled DES sets, we find again that two distinct clusters are formed. Based on the accuracy of the test set, we heuristically expect these clusters to be highly accurate for the top half of the most confident predictions. One can then pick the high-confidence predictions from each cluster and assign them the corresponding galaxy label, thereby creating newly labelled DES galaxy datasets.
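The grouping of activation vectors into two clusters can be illustrated with a minimal k-means on 1-D toy data. This is a simplified stand-in: the actual analysis works with 1024-D activation vectors and uses t-SNE only to visualize them, and the paper does not specify k-means as its clustering method.

```python
def kmeans_1d(xs, k=2, iters=20):
    """Minimal k-means returning (centroids, labels). Illustrates how
    activation values separate into two groups (spiral vs elliptical)."""
    # Initialize centroids at the data extremes (an illustrative choice).
    centroids = [min(xs), max(xs)][:k]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(x - centroids[c]))
                  for x in xs]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels
```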

III Results

As shown in Table 2, our Xception deep-transfer-learning neural network model attains high accuracies for the HP SDSS and DES test sets. Sorting the FO test sets by the highest output probabilities, our neural network model reaches comparable accuracies on the most confident fraction of the dataset. The confusion matrices are shown in Figure 3.

Dataset Accuracy F1 score
Training set 99.81% 0.9998
HP SDSS Test Set 99.81% 0.9980
HP DES    Test Set 99.62% 0.9961
FO SDSS Test Set 96.76% 0.9675
FO DES    Test Set 96.32% 0.9685
Table 2: Classification accuracy for each test dataset.

Figure 3: Confusion matrices indicating the accuracy of our neural network model on classification tasks for the various test sets. Note that in all cases our deep transfer learning model reports high accuracies.

Having quantified the accuracy of our neural network model on a DES test set that overlaps the SDSS footprint, we now use our model as a feature extractor by feeding it bright, unlabelled DES galaxies that do not overlap the SDSS footprint. We use our neural network model to quantify the probability that these images represent either spiral or elliptical galaxies. A random sample of high-confidence predictions is shown in Figure 8 in Appendix C. We test the robustness of these predictions by clustering all of these unlabelled galaxies in a 1024-D parameter space, based on the morphological information extracted by the neural network from three different bands, and visualized in 3-D using t-SNE, as shown in Figure 4.


Figure 4: t-SNE visualization of the clustering of the HP SDSS and DES test sets, and of the unlabelled DES set.

The results presented in Figure 4 indicate that the neural network model has extracted the necessary information from the training dataset to enable t-SNE to clearly identify two distinct classes of galaxies. A scientific visualization of this clustering algorithm for the FO DES test set is presented in NCSA (2018).

Recursive training Having labeled about 10,000 DES galaxies through unsupervised clustering, we pick the top 1000 spiral and the top 1000 elliptical galaxies. We then add them to our original SDSS training dataset and use deep transfer learning again to re-train the neural network model. The top- and bottom-left panels in Figure 5 show the initial (0th recursion) accuracy of our classifier, and the accuracy attained once the newly labelled DES images are added to the SDSS training dataset (1st recursion). We notice that the classification accuracy for the FO SDSS and DES test sets improves. In particular, the classification accuracy for the FO DES test set is boosted when 50% of the dataset is considered. This is rather significant given that this newly labelled DES dataset represents only a small fraction of the original SDSS training dataset.

Figure 5: Top panels: SDSS datasets. Bottom panels: DES datasets. Accuracy (left panels) and f1 score (right panels) vs N high confidence predictions as a fraction of total full overlap test datasets (0th recursion). We also show the improvement in classification accuracy and f1 score after 2000 newly labelled DES images are added to the SDSS training dataset (1st recursion).

The f1 score shown in the top- and bottom-right panels of Figure 5 is a single-number statistical evaluation metric that measures the accuracy of binary classification by taking the harmonic mean of precision and recall. It varies between its worst value of 0 and its best value of 1, and is given by

F1 = 2 × (precision × recall) / (precision + recall)
For binary classification, precision is the number of true positives divided by the total number of predicted positives, i.e., true positives plus false positives. Similarly, recall is the number of true positives divided by the total number of actual positives, i.e., true positives plus false negatives. We noticed that the f1 score also improves when we add the new DES images to the SDSS training dataset.
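With precision and recall defined this way, the f1 score follows directly from the confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """F1 score from confusion-matrix counts: the harmonic mean of
    precision (tp / predicted positives) and recall (tp / actual positives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```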

This novel approach provides us with the means to enhance SDSS galaxy classification, as shown in the top left panel of Figure 5. More importantly, it provides a way forward to gradually replace the SDSS galaxy images in the training dataset we need to construct DES galaxy catalogs at scale. A DES-only image training dataset will better capture the nature of images observed by DES, and would also enable us to better use data augmentation to model the effects of noise, making our neural network model more resilient so that it can accurately classify galaxies at higher redshift, or galaxies contaminated by various sources of noise.

IV Conclusion

We have presented the first application of deep transfer learning for the classification of DES galaxies that overlap the footprint of the SDSS survey, achieving state-of-the-art accuracies. We have also introduced the use of unsupervised clustering for the classification of DES galaxies that had not been observed in previous surveys and had thereby remained unlabelled.

We have demonstrated that unsupervised clustering provides a meaningful classification of DES galaxies, using as raw information the morphological features abstracted from DES images in three different filters. To gain insight into the inner workings of our clustering algorithm, we have presented a scientific visualization of the clustering of the FO DES test set, available at NCSA (2018). Through this visualization we have found that seemingly incorrect labels produced by our neural network model are often actually correct; this issue appears to stem from inaccurate human classifications in our SDSS training dataset.

Finally, we have shown that newly labelled DES datasets can be used to do recursive training, providing the means to gradually replace SDSS images we have used in our training dataset. This method will enable the creation of DES-only images to train, validate and test neural network models for the creation of large-scale DES galaxy catalogs, which are needed for immediate gravitational wave standard-siren measurements of the Hubble constant, and will provide input data to create galaxy catalogs in the Large Synoptic Survey Telescope era. The scalability of this algorithm, and the minimum computational power required for these analyses, promote it as an ideal tool for future analyses of this nature.

This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the State of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. We acknowledge support from the NCSA. We thank the NCSA Gravity Group for useful feedback, and Vlad Kindratenko for granting us access to state-of-the-art GPUs and HPC resources at the Innovative Systems Lab at NCSA. We are grateful to NVIDIA for donating several Tesla P100 and V100 GPUs that we used for our analysis. This project used public archival data from the Dark Energy Survey (DES). Funding for the DES Projects has been provided by the U.S. Department of Energy, the U.S. National Science Foundation, the Ministry of Science and Education of Spain, the Science and Technology Facilities Council of the United Kingdom, the Higher Education Funding Council for England, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, the Kavli Institute of Cosmological Physics at the University of Chicago, the Center for Cosmology and Astro-Particle Physics at the Ohio State University, the Mitchell Institute for Fundamental Physics and Astronomy at Texas A&M University, Financiadora de Estudos e Projetos, Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro, Conselho Nacional de Desenvolvimento Científico e Tecnológico and the Ministério da Ciência, Tecnologia e Inovação, the Deutsche Forschungsgemeinschaft and the Collaborating Institutions in the Dark Energy Survey. 
The Collaborating Institutions are Argonne National Laboratory, the University of California at Santa Cruz, the University of Cambridge, Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas–Madrid, the University of Chicago, University College London, the DES-Brazil Consortium, the University of Edinburgh, the Eidgenössische Technische Hochschule (ETH) Zürich, Fermi National Accelerator Laboratory, the University of Illinois at Urbana-Champaign, the Institut de Ciències de l’Espai (IEEC/CSIC), the Institut de Física d’Altes Energies, Lawrence Berkeley National Laboratory, the Ludwig-Maximilians Universität München and the associated Excellence Cluster Universe, the University of Michigan, the National Optical Astronomy Observatory, the University of Nottingham, The Ohio State University, the OzDES Membership Consortium, the University of Pennsylvania, the University of Portsmouth, SLAC National Accelerator Laboratory, Stanford University, the University of Sussex, and Texas A&M University. Based in part on observations at Cerro Tololo Inter-American Observatory, National Optical Astronomy Observatory, which is operated by the Association of Universities for Research in Astronomy (AURA) under a cooperative agreement with the National Science Foundation. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.


Appendix A Neural network architecture

The left panel in Figure 6 presents the architecture of the pre-trained Xception model Chollet (2016) used in these studies. The right panel of Figure 6 presents the fully connected layers, and classifier added at the bottleneck of the pre-trained Xception model. These features enable us to use t-SNE to classify new DES images through unsupervised clustering.


Figure 6: Left panel: Xception model Chollet (2016). Right panel: fully connected layers, and classifier added at the bottleneck of the pre-trained Xception model.

Appendix B Data Augmentation

Figure 7: Data augmentations include random rotations of up to 45 degrees, random flips, height and width shifts, and zooms of up to a factor of 1.3.

To expose the neural network to a variety of potential scenarios for classification, we augment the original galaxy images with random rotations, random flips, height and width shifts, and zooms, as shown in Figure 7. This approach not only synthetically enlarges the training dataset, but also makes the neural network invariant to rotations, shifts, flips and combinations thereof, and introduces scale invariance.

Appendix C Classification predictions for unlabelled DES galaxies

Figure 8 presents high-confidence neural network predictions for unlabelled DES galaxies. The robustness of these predictions was tested with our unsupervised clustering algorithm, finding that these classifications, based on the morphological features extracted from the DES images in three filters, are meaningful, as shown in the t-SNE projections in Figure 4.



Figure 8: Sample of high confidence predictions for spiral (left panel) and elliptical galaxies (right panel) on an unlabelled DES set.