The evolution of the Universe has led to the formation of complex objects without any apparent regular shape, which our mind would simply classify as irregular. The incredible variety of galaxy shapes therefore cannot be summarized by human-defined discrete classes (e.g. the “Hubble sequence”) without a possibly large loss of information, and our human concept of shape could limit a complete understanding of the complex structure of galaxies. Estimating the distribution of galaxy morphologies is one means to test theories of the formation and evolution of the Universe. We estimate the distribution of morphologies on a continuous Euclidean space, so that a particular shape is viewed as a point in that space. This task must be performed in an unsupervised way, i.e. free from any human judgement. Galaxy images are intrinsically high-dimensional data, and we use dictionary learning and sparse coding [[Mairal (2010)]] to reduce the high-dimensional space of shapes to a manageable low-dimensional one. Essentially, galaxy images are approximated by sparse linear combinations of basis images, which are learned from the data. Statistical inference on the reduced space can then be performed via probability distribution estimation. We propose a testing procedure and, to show some examples, analyse a dataset of galaxy images: the GOODS-South Early Release Science Field dataset, observed in the near-infrared regime by the Wide Field Camera 3 on board the Hubble Space Telescope [see [Windhorst (2011), Freeman (2013)]].
2 Dictionary Learning and Sparse Coding - Radon Transform
The general idea of dictionary learning and sparse coding is to approximate images by sparse linear combinations of a fixed number of basis images, which are not predefined but learned from the data. Let $x \in \mathbb{R}^m$ be a vectorized image with $m$ pixels. For $p \ll m$, we want to approximate $x$ as:
$$x \approx D\alpha = \sum_{j=1}^{p} \alpha_j d_j, \qquad (1)$$
where $\alpha \in \mathbb{R}^p$ is a sparse vector of coefficients, and $D = [d_1, \dots, d_p] \in \mathbb{R}^{m \times p}$ is a collection of basis images $d_j$ (the dictionary). Notice that the basis images are not constrained to be orthogonal, so that the dictionary can easily adapt to the structure of the data [[Mairal (2010)]]. Moreover, learning the bases from the data has been shown to perform better in signal reconstruction than using predefined bases [[Elad (2006)]].
2.1 Optimization problem
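From a dataset of galaxy images $X = [x_1, \dots, x_n] \in \mathbb{R}^{m \times n}$, we can estimate the dictionary $D \in \mathbb{R}^{m \times p}$ and the vectors of coefficients $A = [\alpha_1, \dots, \alpha_n] \in \mathbb{R}^{p \times n}$ by solving the following optimization problem:
$$\min_{D \in \mathcal{C},\, A} \; \frac{1}{2}\, \| X - DA \|_F^2 + \lambda \| A \|_{1,1}, \qquad (2)$$
where $\lambda > 0$ is a sparsity parameter, $\|\cdot\|_F$ is the Frobenius norm, $\|A\|_{1,1} = \sum_{i,j} |A_{ij}|$, and $\mathcal{C}$ is the set of dictionaries whose columns have Euclidean norm at most one [[Mairal (2010)]; R package “spams”]. We suggest choosing the number of basis images $p$ and $\lambda$ via cross-validation. See [Mairal (2010)] for other configurations of problem (2).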
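To make the structure of problem (2) concrete, the following is a minimal NumPy sketch of one way to attack it by alternating minimization. It is not the algorithm used by the “spams” package. It assumes the objective has the form 0.5 * ||X - D A||_F^2 + lam * sum |A_ij| with dictionary columns constrained to norm at most one, and the function names (`learn_dictionary`, `soft_threshold`, `objective`) are hypothetical.

```python
import numpy as np

def soft_threshold(Z, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def objective(X, D, A, lam):
    """Assumed objective: 0.5 * ||X - D A||_F^2 + lam * sum_ij |A_ij|."""
    return 0.5 * np.linalg.norm(X - D @ A, "fro") ** 2 + lam * np.abs(A).sum()

def learn_dictionary(X, p, lam, n_iter=30, n_ista=20, seed=0):
    """Alternate sparse coding (ISTA) and a projected gradient step on D.

    X holds one vectorized image per column (shape m x n); p is the number
    of basis images. Returns D (m x p), A (p x n) and the objective history.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)            # unit-norm initial basis images
    A = np.zeros((p, n))
    history = [objective(X, D, A, lam)]
    for _ in range(n_iter):
        # Sparse coding: ISTA steps on A with D held fixed.
        L_A = np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant in A
        for _ in range(n_ista):
            A = soft_threshold(A - D.T @ (D @ A - X) / L_A, lam / L_A)
        # Dictionary update: one projected gradient step with A held fixed.
        L_D = np.linalg.norm(A @ A.T, 2) + 1e-12  # Lipschitz constant in D
        D = D - (D @ A - X) @ A.T / L_D
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # keep ||d_j|| <= 1
        history.append(objective(X, D, A, lam))
    return D, A, history
```

Both updates are descent steps, so the recorded objective should not increase from one iteration to the next. In practice the number of basis images and the sparsity parameter would be tuned by cross-validation, as suggested above.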
2.2 Standardization of the images. Radon transform
Before solving problem (2), images must be standardized to eliminate any spurious dimensionality and improve the quality of the approximations (1). Standardization involves centring, resizing and orientation. While the first is easy to perform, the other two are not. Images can be rotated and resized by using the Radon Transform (RT) and the Inverse RT (IRT). The RT of a function $f(x, y)$ is
$$R(\theta, s) = \iint f(x, y)\, \delta(s - x\cos\theta - y\sin\theta)\, dx\, dy,$$
where $\theta \in [0, \pi)$, $s \in \mathbb{R}$ and $\delta$ is the Dirac delta. An image can be viewed as the discrete evaluation of a function. The orientation of the texture of an image can be estimated by $\hat\theta = \arg\min_\theta\, d^2\sigma^2_\theta / d\theta^2$, where $\sigma^2_\theta$ is the variance of $R(\theta, \cdot)$ at angle $\theta$ [[Jafari-Khouzani (2005), Arodź (2012)]; R package “PET”]. Rotating images by the angle $\hat\theta$ essentially makes all the pictures horizontally oriented. To rotate an image we need to: 1) evaluate its RT on a discrete grid of angles and offsets, which gives a matrix $R$ with one column per angle; 2) find $\hat\theta$ and move the first columns of $R$, those for angles below $\hat\theta$, as described in Figure 1 to get $R^*$ (“rotation” in the Radon domain); 3) compute the IRT of $R^*$ on a grid of the desired resolution (“resizing”). Figure 2 shows some effects of image standardization.
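The two Radon-domain operations described above, estimating the orientation and moving columns, can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Figure 1 or of the “PET” package: the sinogram is taken as already computed, with rows indexing offsets on a grid symmetric about zero and columns indexing equally spaced angles on [0, π). Moved columns are flipped along the offset axis because the projection at angle θ + π is the mirror image of the projection at θ. The function names are hypothetical.

```python
import numpy as np

def orientation_index(sinogram):
    """Column index of the estimated orientation.

    Uses the variance of each projection (one per column) and picks the
    angle minimizing its second difference, taken circularly because the
    variance is periodic in the angle with period pi.
    """
    var = sinogram.var(axis=0)
    d2 = np.roll(var, -1) - 2.0 * var + np.roll(var, 1)
    return int(np.argmin(d2))

def rotate_sinogram(sinogram, k):
    """Rotate in the Radon domain by k angular steps.

    The first k columns are moved to the end and flipped along the offset
    axis, since R(theta + pi, s) = R(theta, -s).
    """
    moved = sinogram[::-1, :k]
    return np.concatenate([sinogram[:, k:], moved], axis=1)
```

After `rotate_sinogram(R, orientation_index(R))` the estimated orientation sits in the first column; applying an inverse Radon transform on a grid of the desired resolution would then give the rotated and resized image.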
3 Statistical inference on the reduced space
In this section we propose a method to estimate the distribution of galaxy morphologies on a low-dimensional space, and we use the GOODS-S dataset to perform a simulation.
3.1 Probability distribution of galaxy morphologies.
For a dataset of images $x_1, \dots, x_n$, where each $x_i$ is a matrix of nonnegative light intensities:
1. Standardize all the images as described in paragraph 2.2;
2. Obtain the dictionary $\hat{D}$ and the vectors $\hat\alpha_1, \dots, \hat\alpha_n$ according to paragraph 2.1;
3. Estimate the joint distribution of the coefficient vector $\alpha$. Call the estimate $\hat{f}$.
Given the fitted dictionary $\hat{D}$, the estimate $\hat{f}$ can be viewed as an approximation of the distribution of galaxy morphologies.
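The text does not specify which estimator is used for the joint distribution of the coefficient vectors. As one possible illustration only, the sketch below fits a Gaussian kernel density estimate with a single scalar bandwidth `h`. Both the choice of estimator and the name `fit_kde` are assumptions, and a kernel estimate may be a poor fit for coefficient vectors with many exact zeros.

```python
import numpy as np

def fit_kde(A, h):
    """Gaussian kernel density estimate on coefficient vectors.

    A has shape (n, p): one coefficient vector per row. h is a scalar
    bandwidth (an assumed tuning parameter). Returns a function that
    evaluates the density at points given as an (N, p) array.
    """
    A = np.atleast_2d(np.asarray(A, dtype=float))
    n, p = A.shape
    norm = n * (2.0 * np.pi * h ** 2) ** (p / 2.0)

    def density(points):
        P = np.atleast_2d(np.asarray(points, dtype=float))
        # Squared distances between every query point and every data point.
        sq = ((P[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * h ** 2)).sum(axis=1) / norm

    return density
```

Note that the coefficient matrix returned by a dictionary-learning routine usually stores one vector per column, so it would need to be transposed before being passed to `fit_kde`.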
3.2 Comparing populations of shapes
In this section we propose a method to compare the distributions of two collections of images. Let $X_1$ and $X_2$ be two collections of images, and suppose we want to test the hypothesis $H_0$ that both collections come from the same distribution of shapes, i.e. a distribution test. We propose the following method:
1. Pool $X_1$ and $X_2$ into a single dataset $X$.
2. From $X$, fit the dictionary $\hat{D}$ and the vectors of coefficients $\hat{A}_1$ and $\hat{A}_2$ of the two collections.
3. Implement a distribution test of $H_0$ on $\hat{A}_1$ versus $\hat{A}_2$.
For step 3, we suggest using the nonparametric test based on the Maximum Mean Discrepancy (MMD) statistic ([Gretton (2012)]; R package “kernlab”). We call this testing procedure the “DSM test” (Dictionary Learning - Sparse Coding - MMD).
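The final, MMD-based step of the procedure can be sketched as below. This is a simplified illustration, not the “kernlab” implementation: it assumes a Gaussian kernel with a fixed bandwidth, uses the biased estimate of the squared MMD, and calibrates the test by permuting group labels, which is only one of several ways to obtain a p-value. The function names are hypothetical.

```python
import numpy as np

def gaussian_gram(Z, bandwidth):
    """Gram matrix of the Gaussian kernel on the rows of Z."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_from_gram(K, idx_a, idx_b):
    """Biased estimate of the squared MMD between groups idx_a and idx_b."""
    k_aa = K[np.ix_(idx_a, idx_a)].mean()
    k_bb = K[np.ix_(idx_b, idx_b)].mean()
    k_ab = K[np.ix_(idx_a, idx_b)].mean()
    return k_aa + k_bb - 2.0 * k_ab

def mmd_permutation_test(A1, A2, bandwidth=1.0, n_perm=500, seed=0):
    """Two-sample test on coefficient vectors (one vector per row).

    Returns the observed squared MMD and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    Z = np.vstack([A1, A2])
    n1, n = len(A1), len(A1) + len(A2)
    K = gaussian_gram(Z, bandwidth)           # computed once, reused below
    idx = np.arange(n)
    observed = mmd2_from_gram(K, idx[:n1], idx[n1:])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)             # random relabelling of the pool
        if mmd2_from_gram(K, perm[:n1], perm[n1:]) >= observed:
            count += 1
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value
```

The null hypothesis would be rejected when the p-value falls below the chosen significance level. The bandwidth strongly affects power and would need to be chosen with care.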
We selected two subsets of images of the GOODS-S dataset in the H-band (see Figure 3): one with 25 images of non-mergers and one with 25 images of mergers. To generate samples of non-merger images and samples of merger images we: 1) randomly sample images with replacement from the non-merger subset and from the merger subset, respectively; 2) randomly rotate them by i.i.d. angles drawn from a uniform distribution; 3) add heteroscedastic noise, whose level depends on the light intensity at each pixel position of the image matrix. We repeat comparisons (via the DSM test) of samples of the same kind (Mer vs Mer, NMer vs NMer) and of different kinds (Mer vs NMer) to estimate the probability of Type I error and the power of the test as functions of the sample size (see Figure 3). We chose the number of basis images and the sparsity parameter via 10-fold cross-validation.
4 Conclusions and future work
An unsupervised analysis based on dictionary learning and sparse coding allows us to approximate the distribution of galaxy morphologies by a multivariate distribution defined on a subset of $\mathbb{R}^p$, where the dimension $p$ is much smaller than the dimension of a galaxy image. Hypothesis testing on the reduced space can help to distinguish the distributions of two sets of images. Current and future work includes: using dictionary learning and sparse coding to put constraints on the parameters of cosmological models; comparing the distributions of galaxy shapes in different redshift ranges; manifold estimation, since some clusters may correspond to human-defined shapes (e.g. spiral, elliptical) and filaments [see [Chen (2013)]] may describe the transition from one shape to another; and analysing images of other astronomical objects and 3D images.
- [Arodź (2012)] Arodź, T. 2012, Computing and Informatics, 24 no. 2 (2012): 183-199.
- [Chen (2013)] Chen, Y.-C., Genovese, C. R. & Wasserman, L. 2013, arXiv preprint arXiv:1312.2098 (2013).
- [Elad (2006)] Elad, M. & Aharon, M. 2006, Image Processing, IEEE Transactions on 15, no. 12 (2006): 3736-3745.
- [Freeman (2013)] Freeman, P. E., R. Izbicki, A. B. Lee, J. A. Newman et al. 2013, MNRAS (2013): stt1016.
- [Gretton (2012)] Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., & Sriperumbudur, B. K. 2012, Advances in neural information processing systems, pp. 1205-1213. 2012.
- [Jafari-Khouzani (2005)] Jafari-Khouzani, K. & Soltanian-Zadeh, H. 2005, Pattern Analysis and Machine Intelligence, IEEE Transactions on 27, no. 6 (2005): 1004-1008.
- [Mairal (2010)] Mairal, J., Bach, F., Ponce, J. & Sapiro, G. 2010, The Journal of Machine Learning Research 11 (2010): 19-60.
- [Windhorst (2011)] Windhorst, R. A., Cohen, S.H., Hathi, N. P., McCarthy, P. J. et al. 2011, ApJS 193, no. 2 (2011): 27.