Estimating the distribution of Galaxy Morphologies on a continuous space

06/29/2014 ∙ by Giuseppe Vinci, et al. ∙ Carnegie Mellon University

The incredible variety of galaxy shapes cannot be summarized by human-defined discrete classes of shapes without causing a possibly large loss of information. Dictionary learning and sparse coding allow us to reduce the high-dimensional space of shapes to a manageable low-dimensional continuous vector space. Statistical inference can then be done in the reduced space via probability distribution estimation and manifold estimation.




1 Introduction

The evolution of the Universe has led to the formation of complex objects, apparently without any regular shape, which our mind would simply classify as irregular. Thus, the incredible variety of galaxy shapes cannot be summarized by human-defined discrete classes of shapes (e.g. the “Hubble sequence”) without causing a possibly large loss of information. Our human concept of shape could limit the complete understanding of the complex structure of galaxies. Estimating the distribution of galaxy morphologies is one means to test theories of the formation and evolution of the Universe. We estimate the distribution of morphologies on a continuous Euclidean space, such that a particular shape is viewed as a point in that space. This task must be performed in an unsupervised way, i.e. free from any human judgement. Galaxy images are intrinsically high-dimensional data, and we use dictionary learning and sparse coding [Mairal et al. (2010)] to reduce the high-dimensional space of shapes to a manageable low-dimensional one. Essentially, galaxy images are approximated by sparse linear combinations of basis pictures, which are learned from the data. Statistical inference on the reduced space can be performed via probability distribution estimation. We propose a testing procedure and analyse a dataset of galaxy images (the GOODS-South Early Release Science Field dataset, observed in the near-infrared regime by the Wide Field Camera 3 on board the Hubble Space Telescope; see [Windhorst et al. (2011), Freeman et al. (2013)]) to show some examples.

2 Dictionary Learning and Sparse Coding - Radon Transform

The general idea of dictionary learning and sparse coding is to approximate images by sparse linear combinations of a fixed number of basis images, which are not predefined, but are learned from the data. Let $x \in \mathbb{R}^{m}$ be a vectorized image, which has $m$ pixels. For $p \ll m$, we want to approximate $x$ as:

$$x \approx D\alpha \qquad (1)$$

where $\alpha \in \mathbb{R}^{p}$ is a sparse vector of coefficients, and $D \in \mathbb{R}^{m \times p}$ is a collection of basis images $d_1, \dots, d_p$. Notice that the basis images are not required to be orthogonal, so that the dictionary can easily adapt to the structure of the data [Mairal et al. (2010)]. Moreover, learning the bases from the data was shown to perform better in signal reconstruction than using predefined bases [Elad et al. (2006)].

2.1 Optimization problem

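From a dataset of galaxy images $x_1, \dots, x_n$, we can estimate the dictionary $D$ and the vectors of coefficients $\alpha_1, \dots, \alpha_n$ by solving the following optimization problem:

$$(\hat{D}, \hat{A}) = \arg\min_{D,\,A} \; \frac{1}{2} \lVert X - DA \rVert_F^2 + \lambda \sum_{i=1}^{n} \lVert \alpha_i \rVert_1 \qquad (2)$$

where $X = [x_1, \dots, x_n]$, $A = [\alpha_1, \dots, \alpha_n]$, $\lambda$ is a sparsity parameter and $\lVert \cdot \rVert_F$ is the Frobenius norm [[Mairal et al. (2010)]; R package “spams”]. We suggest choosing $p$ and $\lambda$ via cross-validation. See [Mairal et al. (2010)] for other configurations of problem (2).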
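As a rough illustration of how problem (2) can be attacked numerically, the sketch below alternates between a sparse coding step (ISTA, with the dictionary fixed) and a block coordinate descent update of the dictionary columns. It is not the “spams” implementation. The function names, the synthetic data, the values `p=5` and `lam=0.1`, and the constraint that each dictionary column has Euclidean norm at most one are all assumptions made for this example.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise shrinkage: proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X, D, lam, n_iter=100):
    """ISTA for min_A 0.5 * ||X - D A||_F^2 + lam * sum_i ||alpha_i||_1, D fixed."""
    step_inv = np.linalg.norm(D, 2) ** 2 + 1e-12  # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X)
        A = soft_threshold(A - grad / step_inv, lam / step_inv)
    return A

def update_dictionary(X, A, D):
    """One pass of block coordinate descent over the columns (atoms) of D."""
    G = A @ A.T  # p x p
    B = X @ A.T  # m x p
    D = D.copy()
    for j in range(D.shape[1]):
        if G[j, j] < 1e-10:
            continue  # atom not used by any code: leave it unchanged
        u = D[:, j] + (B[:, j] - D @ G[:, j]) / G[j, j]
        D[:, j] = u / max(np.linalg.norm(u), 1.0)  # assumed constraint: norm <= 1
    return D

def learn_dictionary(X, p, lam, n_outer=20, seed=0):
    """Alternate sparse coding and dictionary updates; X holds one image per column."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[1], size=p, replace=False)
    D = X[:, idx].astype(float)  # initialize atoms with randomly chosen images
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    for _ in range(n_outer):
        A = sparse_code(X, D, lam)
        D = update_dictionary(X, A, D)
    return D, sparse_code(X, D, lam)

def loss(X, D, A, lam):
    """Objective of problem (2) for given D and A."""
    return 0.5 * np.linalg.norm(X - D @ A, "fro") ** 2 + lam * np.abs(A).sum()

# Synthetic stand-in for vectorized images: 40 "images" of 8 x 8 = 64 pixels.
rng = np.random.default_rng(1)
X = rng.random((64, 40))
D_hat, A_hat = learn_dictionary(X, p=5, lam=0.1)
```

In practice one would repeat the fit over a grid of `p` and `lam` values and compare held-out reconstruction error, in the spirit of the cross-validation suggested above.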

2.2 Standardization of the images. Radon transform

Before solving problem (2), images must be standardized to eliminate any spurious dimensionality and improve the quality of the approximations (1). Standardization involves centring, resizing and rotation to a common orientation. While the first is easy to perform, the other two are not. Images can be rotated and resized by using the Radon Transform (RT) and the Inverse RT (IRT). The RT of a function $f(x, y)$ is $R(s, \theta) = \int\!\!\int f(x, y)\, \delta(s - x\cos\theta - y\sin\theta)\, dx\, dy$, where $\delta$ is the Dirac delta function. An image can be viewed as the discrete evaluation of a function. The orientation of the texture of an image can be estimated by $\hat{\theta} = \arg\min_{\theta} \frac{d^2 \sigma^2_{\theta}}{d\theta^2}$, where $\sigma^2_{\theta}$ is the variance of $R(\cdot, \theta)$ at angle $\theta$ [[Jafari-Khouzani et al. (2005), Arodź (2012)]; R package “PET”]. Rotating images by the angle $\hat{\theta}$ essentially makes all the pictures horizontally oriented. To rotate an image we need to: 1) evaluate its RT on a discrete grid of offsets and angles, giving a matrix $R$; 2) find $\hat{\theta}$ and move the first columns of $R$ as described in Figure 1 to get a shifted matrix (“rotation” in the Radon domain); 3) compute the IRT of the shifted matrix on a grid of the desired resolution (“resizing”). In Figure 2 we show some effects of image standardization.

Figure 1: Left: the first column vectors of the RT matrix are moved after the remaining column vectors, with their values flipped up and down. Right: starting from an original image, we compute its Radon transform on a discrete grid; then, by shifting the vectors of this matrix according to the estimated orientation $\hat{\theta}$, we obtain a standardized rotated version of the image as the IRT of the shifted RT.
Figure 2: Rotation standardization improves the fit. Left: an image approximated using a dictionary learned with rotation standardization (top) and without it (bottom). Spurious dimensionality negatively affects the dictionary at the bottom, while rotation standardization may lead to more refined approximations. Right: for different numbers of atoms ($p$), the minimum loss (2) is smaller when using standardized images. Images are from the GOODS-S dataset, H-band.
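The “rotation in the Radon domain” step can be sketched as a column shift on a sinogram matrix. The example below assumes a matrix whose rows are offsets on a grid symmetric about zero and whose columns are equally spaced angles covering half a turn, so that moving a column past the last angle corresponds to flipping it up and down. It also assumes the orientation is taken as the angle minimizing the second difference of the column variances. The function names and the random stand-in matrix are hypothetical, and neither the Radon transform itself nor the inverse transform is implemented here.

```python
import numpy as np

def estimate_orientation_index(R):
    """Column index minimizing the (periodic) second difference of column variances."""
    var = R.var(axis=0)
    second_diff = np.roll(var, -1) - 2.0 * var + np.roll(var, 1)
    return int(np.argmin(second_diff))

def rotate_sinogram(R, k):
    """Move the first k angle-columns after the others, flipped up and down."""
    return np.concatenate([R[:, k:], R[::-1, :k]], axis=1)

# Random stand-in for a sinogram: 9 offsets (rows) by 12 angles (columns).
rng = np.random.default_rng(0)
R = rng.random((9, 12))
k = estimate_orientation_index(R)
R_rot = rotate_sinogram(R, k)
```

Shifting by all columns amounts to a half-turn, which in this representation is the original matrix flipped up and down; this gives a quick sanity check on the shift.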

3 Statistical inference on the reduced space

In this section we propose a method to estimate the distribution of galaxy morphologies on a low-dimensional space, and we use the GOODS-S dataset to perform a simulation.

3.1 Probability distribution of galaxy morphologies.

For a dataset of images $x_1, \dots, x_n$, where each $x_i$ is a matrix of nonnegative light intensities:

  1. Standardize all the images as described in Section 2.2;

  2. Obtain the dictionary $\hat{D}$ and the vectors $\hat{\alpha}_1, \dots, \hat{\alpha}_n$ according to Section 2.1;

  3. Estimate the joint distribution of the vector $\alpha$. Call it $\hat{f}$.

Given the fitted dictionary $\hat{D}$, the estimate $\hat{f}$ can be viewed as an approximation of the distribution of galaxy morphologies.
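The text does not fix a particular estimator for the joint distribution of the coefficient vectors. Purely as an illustration, the sketch below uses a Gaussian kernel density estimate with a single bandwidth. The estimator choice, the function name `gaussian_kde`, the bandwidth value and the synthetic coefficients are all assumptions. Coefficient vectors are stored one per row here.

```python
import numpy as np

def gaussian_kde(A, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate.

    A: (n, p) array with one coefficient vector per row.
    """
    n, p = A.shape
    norm_const = n * (2.0 * np.pi * bandwidth ** 2) ** (p / 2.0)

    def density(points):
        points = np.atleast_2d(points)
        # Squared distances between every query point and every data point.
        sq = ((points[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-0.5 * sq / bandwidth ** 2).sum(axis=1) / norm_const

    return density

# Synthetic stand-in for fitted coefficient vectors (50 vectors, p = 2).
rng = np.random.default_rng(0)
codes = rng.normal(size=(50, 2))
f_hat = gaussian_kde(codes, bandwidth=0.5)
values = f_hat(np.zeros((1, 2)))  # estimated density at the origin
```

Sparse coefficient vectors contain many exact zeros, so a smooth kernel estimate like this one is only a crude first look; the bandwidth would need to be tuned on real data.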

3.2 Comparing populations of shapes

In this section we propose a method to compare the distributions of two collections of images. Let $X_1$ and $X_2$ be two collections of images, with distributions of morphologies $f_1$ and $f_2$. Suppose we want to test the hypothesis $H_0: f_1 = f_2$, i.e. perform a distribution test. We propose the following method:

  1. Pool $X_1$ and $X_2$ into a unique dataset $X$.

  2. From $X$, fit the dictionary $\hat{D}$ and the vectors of coefficients of all the images.

  3. Implement a distribution test comparing the coefficient vectors of the two collections.

For step 3, we suggest using the nonparametric test based on the Maximum Mean Discrepancy (MMD) statistic ([Gretton et al. (2012)]; R package “kernlab”). We call this testing procedure the “DSM test” (Dictionary Learning - Sparse Coding - MMD).
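The last step of the procedure can be sketched as a two-sample test on the coefficient vectors. The example below computes a biased estimate of the squared MMD with a Gaussian kernel and calibrates it by permutation. The permutation calibration is a substitute chosen for simplicity and is not claimed to match what “kernlab” does. The kernel width `sigma`, the number of permutations, the function names and the synthetic samples are all assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2_biased(X, Y, sigma):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())

def mmd_permutation_test(X, Y, sigma=1.0, n_perm=200, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(X, Y, sigma)
    Z = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        stat = mmd2_biased(Z[perm[:n]], Z[perm[n:]], sigma)
        count += int(stat >= observed)
    return observed, (count + 1) / (n_perm + 1)

# Synthetic stand-ins for the coefficient vectors of two collections (one per row).
rng = np.random.default_rng(0)
codes_1 = rng.normal(loc=0.0, size=(40, 3))
codes_2 = rng.normal(loc=3.0, size=(40, 3))
stat, p_value = mmd_permutation_test(codes_1, codes_2)
```

Because both collections are coded against the same pooled dictionary, their coefficient vectors live in the same space, which is what makes a direct two-sample comparison meaningful.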

3.2.1 Simulation

We selected two subsets of images of the GOODS-S dataset in the H-band (see Figure 3): one with 25 images of non-mergers, and one with 25 images of mergers. To generate a sample of non-merger images and a sample of merger images we: 1) randomly sample, with replacement, images from the non-merger subset and from the merger subset, respectively; 2) randomly rotate them by angles drawn i.i.d. from a uniform distribution; 3) add heteroscedastic Gaussian noise to each pixel, with a distribution that depends on the light intensity at that pixel's position in the image matrix. We repeat comparisons (via the DSM test) of samples of the same kind (Mer vs Mer, NMer vs NMer) and of different kinds (Mer vs NMer) to estimate the probability of Type I error and the power of the test as functions of the sample size (see Figure 3). We chose $p$ and $\lambda$ via 10-fold cross-validation.

Figure 3: Left: selected non-mergers and mergers from the GOODS-S dataset, H-band. Top right: procedure to simulate an image from one of the two subsets. An image is randomly selected from the subset, randomly rotated, and heteroscedastic Gaussian noise is added to each pixel. Bottom right: the DSM test helps to distinguish different shapes. The probability of Type I error of the DSM test is always smaller than the level of the test; the power of the test increases with the sample size. The shape of the power function depends on the original sets of non-mergers and mergers.

4 Conclusions and future work

An unsupervised analysis based on dictionary learning and sparse coding allows us to approximate the distribution of galaxy morphologies by a multivariate distribution defined on a subset of $\mathbb{R}^p$, where the dimension $p$ is much smaller than the dimension of a galaxy image. Hypothesis testing on the reduced space can help to distinguish the distributions of two sets of images. Current and future work includes: using dictionary learning and sparse coding to put constraints on the parameters of cosmological models; comparing the distribution of galaxy shapes at different redshift ranges; manifold estimation, where some clusters may correspond to human-defined shapes (e.g. spiral, elliptical) and filaments [see [Chen et al. (2013)]] may describe the transition from one shape to another; and analysing images of other astronomical objects and 3D images.


  • [Arodź (2012)] Arodź, T. 2012, Computing and Informatics, 24 no. 2 (2012): 183-199.
  • [Chen et al. (2013)] Chen, Y.-C., Genovese, C. R. & Wasserman, L. 2013, arXiv preprint arXiv:1312.2098 (2013).
  • [Elad et al. (2006)] Elad, M. & Aharon, M. 2006, Image Processing, IEEE Transactions on 15, no. 12 (2006): 3736-3745.
  • [Freeman et al. (2013)] Freeman, P. E., Izbicki, R., Lee, A. B., Newman, J. A. et al. 2013, MNRAS (2013): stt1016.
  • [Gretton et al. (2012)] Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K. & Sriperumbudur, B. K. 2012, Advances in Neural Information Processing Systems, pp. 1205-1213. 2012.
  • [Jafari-Khouzani et al. (2005)] Jafari-Khouzani, K. & Soltanian-Zadeh, H. 2005, Pattern Analysis and Machine Intelligence, IEEE Transactions on 27, no. 6 (2005): 1004-1008.
  • [Mairal et al. (2010)] Mairal, J., Bach, F., Ponce, J. & Sapiro, G. 2010, The Journal of Machine Learning Research 11 (2010): 19-60.
  • [Windhorst et al. (2011)] Windhorst, R. A., Cohen, S. H., Hathi, N. P., McCarthy, P. J. et al. 2011, ApJS 193, no. 2 (2011): 27.