1 Introduction
The standard ΛCDM cosmological model – based on a cosmological constant as the source of the observed accelerated cosmic expansion (Riess et al., 1998; Perlmutter et al., 1999; Schmidt et al., 1998) and on cold dark matter particles as the bulk of the clustering mass in the universe (White & Rees, 1978; White, 1993, 1996; Springel et al., 2005) – has survived the past two decades of cosmological observations targeting a wide range of independent probes. These include the statistical properties of Cosmic Microwave Background (CMB) anisotropies (Bennett et al., 2013; Planck Collaboration et al., 2018), the large-scale distribution and dynamics of visible galaxies (Parkinson et al., 2012; Alam et al., 2017; Pezzotta et al., 2017), weak gravitational lensing signals (Fu et al., 2008; Joudaki et al., 2017; Hildebrandt et al., 2017; Troxel et al., 2018; Hikage et al., 2018), and the abundance of galaxy clusters as well as its time evolution (Vikhlinin et al., 2009; Planck Collaboration et al., 2016b).
Despite this astonishing success, the fundamental nature of the two main ingredients of the ΛCDM model – summing up to about 95 per cent of the total energy density of the universe – remains unknown. On one side, the energy scale associated with the cosmological constant does not find any reasonable explanation in the context of fundamental physics, with predictions based on the standard model of particle physics failing by tens of orders of magnitude. On the other hand, no clear detection – direct or indirect – of any new fundamental particle that may be associated with cold dark matter has been made, despite a longstanding search through astrophysical observations (Aartsen et al., 2013; Ackermann et al., 2017; Albert et al., 2017) and laboratory experiments (see e.g. Bernabei et al., 2018; ATLAS Collaboration, 2013; CMS Collaboration, 2016).
This leaves the next generation of cosmological observations with the arduous challenge of clarifying the fundamental nature of the dark sector by systematically scrutinising the huge wealth of high-quality data that will be made available in the near future by several wide-field surveys (such as Laureijs et al., 2011; Ivezic et al., 2008; Spergel et al., 2015; Benitez et al., 2014). As a matter of fact, any possible insights from future datasets must come in the form of very small deviations from the expectations of the ΛCDM model, otherwise past observations would have already detected them. This suggests either that the fundamental physics behind dark energy and dark matter is indeed extremely close to that of General Relativity (GR) with a cosmological constant and heavy fundamental particles with negligible thermal velocities, respectively, or that a more radical departure from this standard paradigm is hidden and masked by other effects, such as an observational degeneracy with some not yet fully constrained cosmological parameter. The latter scenario may result in a severe limitation of the discriminating power of observations, thereby providing a particularly challenging testbed for the next generation of cosmological surveys.
A typical example of such a possible intriguing situation is given by the well-known degeneracy between some Modified Gravity (MG, see e.g. Amendola et al., 2018, for a recent review on a wide range of MG scenarios) theories and the yet unknown value of the neutrino mass. It is now generally accepted (Baldi et al., 2014; He, 2013; Motohashi et al., 2013; Wright et al., 2017) that MG theories such as f(R) gravity (Buchdahl, 1970) are strongly observationally degenerate with the effects of massive neutrinos on structure formation (see Baldi et al., 2014). Some commonly adopted statistics such as the matter auto-power spectrum (Giocoli et al., 2018a), the lensing convergence power spectrum (Peel et al., 2018), and the halo mass function (Hagstotz et al., 2018) may hardly distinguish standard expectations from some specific combinations of the f(R) gravity parameters and the total neutrino mass.
As such degeneracies extend down to the non-linear regime of structure formation, the use of full numerical simulations currently represents the only viable method to explore these scenarios, even though alternative approaches based on approximate methods (see e.g. Wright et al., 2017) have been developed in recent years and are being tested and calibrated against simulations. In the present work, we will explore the prospects of using machine learning techniques applied to numerical simulations of combined MG and massive neutrino cosmologies that are highly observationally degenerate with respect to standard observational statistics.
Several variants of higher-order statistics have been applied in the past to characterise cosmological data sensitive to the late-time evolution of structure in the Universe. Recent analyses of the weak lensing (WL) (Bartelmann & Schneider, 2001) data from CFHTLenS (Heymans et al., 2012) used either higher-order (>2) moments of the convergence field (Van Waerbeke et al., 2013), or Minkowski functionals (Petri et al., 2015), to draw cosmological inference from a data description that goes beyond two-point statistics. Martinet et al. (2018) and Shan et al. (2018) applied peak count statistics (Dietrich & Hartlap, 2010; Kratochvil et al., 2010) to shear and convergence fields from KiDS (Hildebrandt et al., 2017), and Gruen et al. (2018) used counts-in-cells (Friedrich et al., 2018) to extract information from the DES (Abbott et al., 2018) catalogues. A new set of techniques based on deep learning
(LeCun et al., 2015) has recently gained momentum in many scientific fields, including astrophysics. The extremely complex models which can be constructed through a modular building-block concept (e.g. Chollet, 2017) have been very successful for tasks like language translation (e.g. Wu et al., 2016; Johnson et al., 2016), text and handwriting recognition (e.g. Graves, 2013), as well as for the classification of images (starting with the seminal work of Krizhevsky et al., 2012). In cosmology, deep learning is used for the extraction of information from N-body simulations (Ravanbakhsh et al., 2017), to learn the connection between initial conditions and the final shape of structure (Lucie-Smith et al., 2018), for the characterisation of point spread functions (Herbel et al., 2018) or the measurement of shear for WL (Springer et al., 2018), the characterisation of non-Gaussian structure in mass maps (Gupta et al., 2018), the determination of galaxy cluster X-ray masses (Ntampaka et al., 2018) and the fast creation of simulated data using generative adversarial networks (Rodriguez et al., 2018). In this work, we will use such techniques to break the degeneracies between models of modified gravity in the presence of massive neutrinos.
The text is organised as follows: Section 2 gives an overview of the numerical simulations and the creation of the mass maps that constitute our main data set. In section 3 we introduce the different characterisation and classification techniques that we apply to the mass map data and show the results that they produce in section 4. We present our conclusions in section 5. Two appendices provide more details on certain technical aspects of the computer vision (appendix A) and deep neural network (appendix B) methods we are using.
2 Numerical simulations
We perform our analysis on a set of WL maps extracted from a suite of cosmological dark matter-only simulations called the DUSTGRAIN-pathfinder runs. These simulations represent a preliminary calibration sample for the DUSTGRAIN (Dark Universe Simulations to Test GRAvity In the presence of Neutrinos) project, an ongoing numerical effort aimed at investigating cosmological models characterised by a modification of the laws of gravity from their standard GR form and by a non-negligible fraction of the cosmic matter density being made of standard massive neutrinos.
2.1 Dustgrain-pathfinder
The modification of gravity considered in the DUSTGRAIN project consists in an f(R) model defined by the action (Buchdahl, 1970)
$$S = \int \mathrm{d}^{4}x \, \sqrt{-g}\left[\frac{R + f(R)}{16\pi G} + \mathcal{L}_{\rm m}\right] \qquad (1)$$
We assume a specific analytical form for the function f(R) (Hu & Sawicki, 2007)
$$f(R) = -m^{2}\,\frac{c_{1}\left(R/m^{2}\right)^{n}}{c_{2}\left(R/m^{2}\right)^{n} + 1} \qquad (2)$$
where $R$ is the Ricci scalar curvature, $m$ is a mass scale, while $c_{1}$, $c_{2}$ and $n$ are free parameters of the model. Such a form is particularly popular and widely investigated as it allows one to recover with arbitrary precision a ΛCDM background expansion history by choosing $c_{1}/c_{2} = 6\,\Omega_{\Lambda}/\Omega_{\rm M}$. Here $\Omega_{\Lambda}$ and $\Omega_{\rm M}$ are the vacuum and matter energy density, respectively. Under the condition $c_{2}\left(R/m^{2}\right)^{n} \gg 1$, the scalar field $f_{R} \equiv \mathrm{d}f/\mathrm{d}R$ takes the approximate form
$$f_{R} \approx -n\,\frac{c_{1}}{c_{2}^{2}}\left(\frac{m^{2}}{R}\right)^{n+1} \qquad (3)$$
By restricting to the case $n = 1$, the only remaining free parameter of the model can be written as
$$f_{R0} = -\frac{c_{1}}{c_{2}^{2}}\left[3\left(1 + 4\,\frac{\Omega_{\Lambda}}{\Omega_{\rm M}}\right)\right]^{-2} \qquad (4)$$
and its absolute value will quantify how much the model departs from GR.
The DUSTGRAIN-pathfinder simulations have been devised to sample the joint $f_{R0}$–$m_{\nu}$ parameter space and to identify highly degenerate combinations of these parameters. Some analyses of the corresponding WL signal have been presented by Giocoli et al. (2018a) and Peel et al. (2018), while Hagstotz et al. (2018) have used the simulations to calibrate a theoretical modelling of the halo mass function in f(R) gravity with and without the contribution of massive neutrinos. In this paper, we will use machine learning techniques to tackle the issue of observational degeneracy in these combined models, based on the WL reconstruction described in Giocoli et al. (2018a). A similar approach, focused on a subset of particularly degenerate models, is presented in Peel et al. (2018b, PRL submitted).
From a technical point of view, the DUSTGRAIN-pathfinder runs are cosmological collisionless simulations including dark matter particles (whose mass depends on the assumed neutrino mass) and an equal number of neutrino particles (for the runs with $m_{\nu} > 0$) in a cosmological volume with periodic boundary conditions, evolving under the effect of a gravitational interaction defined by equation 1. The simulations have been performed with the MG-GADGET code (see Puchwein et al., 2013), a modified version of the GADGET code (Springel, 2005) that implements all the modifications that characterise f(R) gravity (see Puchwein et al., 2013, for more details on the algorithm). MG-GADGET has been extensively tested (see e.g. the Modified Gravity code comparison project described in Winther et al., 2015) and employed recently for a wide variety of applications (Baldi & Villaescusa-Navarro, 2018; Arnold et al., 2018; Arnold et al., 2014, 2015; Roncarelli et al., 2018; Arnold et al., 2016; Naik et al., 2018). For the DUSTGRAIN-pathfinder simulations, as was already done in Baldi et al. (2014), we have combined the MG-GADGET solver with the particle-based implementation of massive neutrinos developed by Viel et al. (2010). This allowed us to include massive neutrinos in the simulations as an independent family of particles with its own initial transfer function and velocity distribution. Initial conditions have been generated following the approach of e.g. Zennaro et al. (2017) and Villaescusa-Navarro et al. (2017) at the starting redshift of the simulation, with thermal neutrino velocities added on top of the gravitational velocities by randomly sampling the neutrino momentum distribution at the initial redshift.
Standard cosmological parameters are set to be consistent with the Planck constraints (Planck Collaboration et al., 2016a). Concerning non-standard parameters, the DUSTGRAIN-pathfinder simulations spanned the range $-1\times 10^{-4} \leq f_{R0} \leq -1\times 10^{-6}$ for the scalar amplitude and $0 \leq m_{\nu} \leq 0.3$ eV for the neutrino mass. In the present work, we will consider a subset of the full DUSTGRAIN-pathfinder suite consisting of nine simulations whose specifications are summarised in table 1.
Simulation Name | Gravity type | $f_{R0}$ | $m_{\nu}$ [eV] | $\Omega_{\rm CDM}$ | $\Omega_{\nu}$ | $m^{\rm p}_{\rm CDM}$ [M$_{\odot}$] | $m^{\rm p}_{\nu}$ [M$_{\odot}$]
---|---|---|---|---|---|---|---
ΛCDM | GR | – | 0 | 0.31345 | 0 | | 0
fR4 | $f(R)$ | $-1\times 10^{-4}$ | 0 | 0.31345 | 0 | | 0
fR5 | $f(R)$ | $-1\times 10^{-5}$ | 0 | 0.31345 | 0 | | 0
fR6 | $f(R)$ | $-1\times 10^{-6}$ | 0 | 0.31345 | 0 | | 0
fR4_0.3eV | $f(R)$ | $-1\times 10^{-4}$ | 0.3 | 0.30630 | 0.00715 | |
fR5_0.15eV | $f(R)$ | $-1\times 10^{-5}$ | 0.15 | 0.30987 | 0.00358 | |
fR5_0.1eV | $f(R)$ | $-1\times 10^{-5}$ | 0.1 | 0.31107 | 0.00238 | |
fR6_0.1eV | $f(R)$ | $-1\times 10^{-6}$ | 0.1 | 0.31107 | 0.00238 | |
fR6_0.06eV | $f(R)$ | $-1\times 10^{-6}$ | 0.06 | 0.31202 | 0.00143 | |
2.2 Lensing light-cones
For all simulations we stored snapshots at different redshifts that allow us to construct lensing light-cones up to the highest source redshift considered in this work without gaps. Different methods have been developed to produce lensing light-cones from large cosmological N-body simulations. Recent works have employed post-processing reconstructions based on the slicing of a set of comoving particle snapshots (as e.g. in Hilbert et al., 2008, 2009; Giocoli et al., 2016; Shirasaki et al., 2017), as well as on-the-fly algorithms capable of storing only the projected matter density on a given field-of-view without resorting to the flat-sky approximation (see e.g. Barreira et al., 2016; Arnold et al., 2018). In this work we use the MapSim routine (Giocoli et al., 2014; Tessore et al., 2015; Castro et al., 2018), which is based on the former strategy. We use the particles stored in the different snapshots to construct our continuous past-light-cones, building lens planes of the projected matter density distribution and considering a square sky coverage of five degrees on a side. For each cosmological model we construct 256 different light-cone realisations by randomising the various comoving cosmological boxes (Giocoli et al., 2018a; Peel et al., 2018).
2.3 Convergence maps
The MapSim pipeline is composed of two algorithms. The first one – called i-MapSim – constructs lensing planes from the different simulation snapshots, saving for each plane $l$ and on each pixel with coordinate indices $(i,j)$ the particle surface mass density
$$\Sigma_{l}(i,j) = \frac{\sum_{k} m_{k}}{A_{l}}, \qquad (5)$$
where $A_{l}$ represents the comoving pixel area of the lens plane and $\sum_{k} m_{k}$ the sum over all particle masses associated with the given pixel. The second algorithm, named ray-MapSim, projects the matter density distribution along the line-of-sight by weighing the lens planes with the lensing kernel in the Born approximation regime (Bartelmann & Schneider, 2001; Schäfer et al., 2012; Petri, 2016; Giocoli et al., 2016; Petri et al., 2017; Giocoli et al., 2017; Giocoli et al., 2018b; Castro et al., 2018). From $\Sigma_{l}$ we can derive the convergence $\kappa$ at a given source redshift $z_{\rm s}$ as
$$\kappa(i,j) = \sum_{l} \frac{\Sigma_{l}(i,j)}{\Sigma_{{\rm crit},l}(z_{l}, z_{\rm s})}, \qquad (6)$$
where $l$ varies over the different lens planes with lens redshift $z_{l}$ smaller than $z_{\rm s}$, and $\Sigma_{{\rm crit},l}$ represents the critical surface density at the lens plane $l$ for sources at redshift $z_{\rm s}$
$$\Sigma_{{\rm crit},l} = \frac{c^{2}}{4\pi G}\,\frac{D_{\rm s}}{D_{l}\,D_{l{\rm s}}}. \qquad (7)$$
Here $c$ is the speed of light, $G$ is Newton's constant and $D_{l}$, $D_{\rm s}$ and $D_{l{\rm s}}$ are the angular diameter distances between observer-lens, observer-source and lens-source, respectively. The final maps cover the 25 square degree field-of-view with $2048\times2048$ pixels, resulting in a map resolution of about 8.8 arc-seconds.
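To make the stacking of equations (5)-(7) concrete, the following is a minimal sketch of the Born-approximation projection, not the actual MapSim implementation; it assumes the lens planes and the relevant angular diameter distances have already been computed and are passed in consistent (here SI) units:

```python
import numpy as np

C_LIGHT = 299792458.0    # speed of light [m/s]
G_NEWTON = 6.674e-11     # Newton's constant [m^3 kg^-1 s^-2]

def sigma_crit(D_l, D_s, D_ls):
    """Critical surface density of equation (7) for one lens plane [kg/m^2]."""
    return C_LIGHT**2 / (4.0 * np.pi * G_NEWTON) * D_s / (D_l * D_ls)

def convergence_map(sigma_planes, D_l, D_ls, D_s):
    """
    Born-approximation convergence of equation (6).

    sigma_planes : (n_planes, n_pix, n_pix) surface mass density of each
                   lens plane, i.e. equation (5), in [kg/m^2]
    D_l, D_ls    : (n_planes,) observer-lens and lens-source angular
                   diameter distances [m]
    D_s          : scalar observer-source angular diameter distance [m]
    """
    kappa = np.zeros_like(sigma_planes[0])
    for sigma_l, dl, dls in zip(sigma_planes, D_l, D_ls):
        kappa += sigma_l / sigma_crit(dl, D_s, dls)
    return kappa
```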
3 Methodology
A variety of machine learning techniques is applied to the DUSTGRAIN-pathfinder convergence maps. It was shown by Peel et al. (2018) that summary statistics up to second-order do not reliably separate such mass maps. Higher-order statistics, especially peak counts (e.g. Peel et al., 2017; Shan et al., 2018; Martinet et al., 2018; Lin & Kilbinger, 2018), do a better job but still leave room for improvement when distinguishing between a large number of models and in the presence of noise. Most commonly used methods to characterise observational data are naturally based on physical models. In the following we present an agnostic approach, which also applies techniques and algorithms found in the fields of computer science and specifically digital image processing.
We distinguish two subsequent steps in the process of mass map classification. The first is to find a feature extraction function $\Gamma$, which takes a high-dimensional data vector $\boldsymbol{d}$ as input and finds a general, dimensionally reduced representation of it in the form of a feature vector $\boldsymbol{f}$:
$$\boldsymbol{f} = \Gamma\left(\boldsymbol{d};\,\boldsymbol{w}_{\Gamma}\right) \qquad (8)$$
The feature extraction function can have several parameters, which are stored in the feature weight vector $\boldsymbol{w}_{\Gamma}$. In order to arrange the data vector in a meaningful way, we introduce an index notation $d_{ijk}$. The first two indices reflect a spatial ordering of the 2D data along the coordinate axes. This means that all elements with $i=1$ are located in the first row of the pixelised image and all elements with e.g. $j=10$ are located in the tenth column of the image. This notation also includes 1D data, ordered or not, by setting $i=j=1$. The third index – commonly dubbed a channel – allows us to collect multiple aspects of the same entity represented by $\boldsymbol{d}$. For the example of an RGB image, $d_{ij1}$ would be the red channel of the image, $d_{ij2}$ the green and $d_{ij3}$ the blue channel. Finally, we define the shape of a data vector with a bracket notation. The shape of our input convergence maps is $\mathrm{shape}(\boldsymbol{d}) = (2048, 2048, 6)$, since we have $2048\times2048$ pixel maps with convergence values at six different source redshift channels, and where in the above we have introduced the shape operator $\mathrm{shape}(\cdot)$, which returns the shape of a data vector.
The second step classifies $\boldsymbol{f}$ into a set of target classes. The classification function $C$, which can again depend on a set of parameters $\boldsymbol{w}_{C}$, should not only output a single class prediction, but rather a prediction vector $\boldsymbol{p}$ of shape $(1, 1, N_{\rm c})$ with probabilities to belong to each of the $N_{\rm c}$ target classes:
$$\boldsymbol{p} = C\left(\boldsymbol{f};\,\boldsymbol{w}_{C}\right) \qquad (9)$$
It must hold that $\sum_{i=1}^{N_{\rm c}} p_{i} = 1$, and in our case $N_{\rm c} = 9$.
In the following we explore different choices for the feature extraction and classification functions and find ways to optimise their parameters to achieve an optimal classification. We do so with the help of training sets, which are data vector–label pairs $(\boldsymbol{d}, \boldsymbol{y})$, meaning mass maps for which we know the underlying cosmological model a priori. Specifically, the label $\boldsymbol{y}$ is an indicator function for the correct class $c$, with elements $y_{i}$ for which
$$y_{i} = \begin{cases} 1 & \text{if } i = c \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$
3.1 Definition of data sets
Our full data set consists of 256 convergence maps of shape (2048,2048,6) for each of the nine cosmological models. We split each map further into 64 smaller patches to define our main data vectors of shape (256,256,6). 75% of those maps (12289) are used as a training set in order to optimise the parameters of our models. We use 15% of the maps (2457) as a validation set, where the correct labels are known to us but not to the optimisation algorithm. Performing a classification on the validation data serves as quality control and helps us to decide if an optimisation is successful and when to stop it. Another 10% of the maps (1638) are used as a test set, where the labels are not known to us a priori and to which our trained and validated algorithms are applied blindly. The success rates on those test sets will be the main result of this work.
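To make the bookkeeping explicit, a minimal sketch of the patch extraction and of the 75/15/10 split could look as follows (the array layout and the random seed are illustrative assumptions, not the actual pipeline):

```python
import numpy as np

def make_patches(full_map, patch_size=256):
    """Cut one (2048, 2048, 6) convergence map into 64 patches of shape (256, 256, 6)."""
    n = full_map.shape[0] // patch_size
    patches = [full_map[i*patch_size:(i+1)*patch_size, j*patch_size:(j+1)*patch_size, :]
               for i in range(n) for j in range(n)]
    return np.stack(patches)   # shape (64, 256, 256, 6)

def split_indices(n_samples, rng=np.random.default_rng(42)):
    """Random split of sample indices into 75% training, 15% validation and 10% test."""
    idx = rng.permutation(n_samples)
    n_train = int(0.75 * n_samples)
    n_val = int(0.15 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```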
3.2 Mass map feature extraction
Two important subclasses for $\Gamma$ are possible. In the first, the parameters $\boldsymbol{w}_{\Gamma}$ are free and can be optimised during a training phase. In the second, they are fixed. We want to highlight that we do not perform any initial transformations of the data, even though such transformations have proven to be useful for the analysis of lensing mass maps. It was shown in e.g. Peel et al. (2018) that an aperture mass transformation (Schneider, 1996; Schneider et al., 1998) can largely improve the discrimination power of certain statistics, but we want to stay as general and agnostic as possible at this stage and use the raw pixel data of the convergence maps as the initial data vector.
3.2.1 Standard mass map descriptors
Examples of fixed feature extraction are the mass map descriptors which are commonly used in the cosmological community to describe convergence or shear catalogues. For the purposes of this work, such descriptors serve as the reference for the other techniques that we apply. We combine a number of mass map features, which we extract with the LensTools package (https://github.com/apetri/LensTools) by Petri (2016), into a feature vector of shape (1,1,99). The first four entries in this vector are the mean, variance, skewness and kurtosis of the convergence maps. This is followed by eleven percentiles between the 0th percentile (the minimum) and the 100th percentile (the maximum) in steps of ten per cent. The normalised histogram (PDF) of the convergence values in each map is sampled with 14 bins and the value for each bin is appended to the feature vector. Next, we calculate the power spectrum in 14 logarithmically spaced multipole bins which cover the angular size and resolution of our mass maps. Finally, we use the standard deviation of each map to define 14 signal-to-noise bins between -2 and 5. For each such bin we calculate the peak counts, as well as the first three Minkowski functionals (e.g. Kratochvil et al., 2012; Petri et al., 2015, and references therein), which concludes our collection of 99 features.
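As an illustration of what such fixed descriptors look like in practice, the following sketch computes a small subset of them (moments, percentiles, PDF and peak counts) with plain numpy/scipy; the actual pipeline uses LensTools for the full set of 99 features, including the power spectrum and the Minkowski functionals:

```python
import numpy as np
from scipy import stats

def simple_map_features(kappa):
    """Illustrative subset of the standard descriptors for one convergence map."""
    feats = [kappa.mean(), kappa.var(),
             stats.skew(kappa, axis=None), stats.kurtosis(kappa, axis=None)]
    # eleven percentiles from the minimum to the maximum in steps of ten per cent
    feats += list(np.percentile(kappa, np.arange(0, 101, 10)))
    # normalised PDF of the convergence values in 14 bins
    pdf, _ = np.histogram(kappa, bins=14, density=True)
    feats += list(pdf)
    # peak counts in 14 signal-to-noise bins between -2 and 5
    nu = (kappa - kappa.mean()) / kappa.std()
    inner = nu[1:-1, 1:-1]
    is_peak = np.ones_like(inner, dtype=bool)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx or dy:   # compare each pixel with its eight neighbours
                is_peak &= inner > nu[1+dx:nu.shape[0]-1+dx, 1+dy:nu.shape[1]-1+dy]
    counts, _ = np.histogram(inner[is_peak], bins=np.linspace(-2, 5, 15))
    feats += list(counts)
    return np.array(feats)
```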
3.2.2 Classical computer vision
We know from Peel et al. (2018) that at least some of the standard descriptors above are not optimally suited for the task at hand, and it is, at this point, not entirely obvious how to define better ones. This is why we now aim to derive as many fixed features as possible. The publicly available wnd-charm algorithm (Shamir et al., 2008; Orlov et al., 2008; Shamir et al., 2010) was designed for the classification of microscopy images and derives a particularly large feature vector of shape (1,1,2919). This includes most of the common statistics and descriptors known to digital image processing. Many features are thereby not only calculated from the raw image, but from some of its alternative representations, like the Fourier, Wavelet, Chebyshev or Edge transformation. Moreover, some features are also extracted from transformations of transformations. While we did state earlier that we do not want to pre-process our data with transformations, we want to point out that the listed transformations are by no means inspired by the mechanisms of lensing or structure formation. We refer the interested reader to Orlov et al. (2008) for the full description of the algorithm and of the full feature vector, but we provide a short summary in appendix A and a compact overview in table 10.
3.2.3 Convolutional neural networks
As the class of feature extraction functions which are able to adapt during the training process, we chose multi-layered neural networks (LeCun et al., 2015; Goodfellow et al., 2016). The input data vector $\boldsymbol{d}$ is manipulated and eventually reduced in dimension by a long – deep – chain of simple layers $L_{i}$, each of which implements a specific mathematical operation. The output of one layer becomes the input of the following layer, and each layer contains its own set of parameters $\boldsymbol{w}_{i}$. The set of all layer parameters becomes the feature parameter vector $\boldsymbol{w}_{\Gamma}$:
$$\boldsymbol{f}_{1} = L_{1}\left(\boldsymbol{d};\,\boldsymbol{w}_{1}\right) \qquad (11)$$
$$\boldsymbol{f}_{i} = L_{i}\left(\boldsymbol{f}_{i-1};\,\boldsymbol{w}_{i}\right) \qquad (12)$$
$$\boldsymbol{f} = L_{n}\left(\boldsymbol{f}_{n-1};\,\boldsymbol{w}_{n}\right) \qquad (13)$$
Deep neural networks source their performance from the sheer number of layers they are comprised of and have gained much popularity in recent years. This is mainly due to advancements in numerical performance by e.g. exploiting many-core architectures (General Purpose Graphics Processing Units, GPGPUs, are a popular example of a many-core architecture). This allows for the construction of particularly deep and complex networks with hundreds of millions of parameters. The functional forms of the layers that are used in a deep neural network depend on the problem at hand. For image classification, convolutional neural networks (CNN) have proven to be particularly useful (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2014; He et al., 2015; Lin et al., 2017) and hence we chose this class of models for our purposes. The main functionality of a CNN is provided by a convolutional layer Conv($k_1$,$k_2$,$s_1$,$s_2$,$p$,$n$), which applies a number $n$ of convolutions with kernel size $k_1 \times k_2$ to a 2D input vector. The stride parameters $s_1$ and $s_2$ allow one to implement dimensional reduction and the parameter $p$ controls whether the input data is padded ($p =$ s) or unaltered ($p =$ v). We provide a thorough mathematical definition of all deep neural network layers used in this work, including the convolutional one, in appendix B.1.
Convolution layers are often followed by pooling layers for dimensional reduction. We implement average pooling layers which average entries of the 2D data vector within a window of size $k_1 \times k_2$, apply a stride defined by $s_1$ and $s_2$ and follow the same padding scheme that was introduced earlier. Maximum pooling layers work in a similar manner, but instead of the average they return the maximum within a given window. Both pooling layers also exist as global versions, indicated by GlobalMaxPool and GlobalAvgPool, where all entries per channel are considered for either the maximum or averaging operation.
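For orientation, the layer notation above maps directly onto standard keras layers; the following is a minimal, purely illustrative sketch (the kernel sizes, strides and channel counts are arbitrary examples and not the network used later):

```python
from tensorflow.keras import layers

# Conv(3,3,2,2,v,32): 32 kernels of size 3x3, stride (2,2), no padding ('valid')
conv = layers.Conv2D(32, kernel_size=(3, 3), strides=(2, 2), padding='valid')

# An average pooling layer with a 2x2 window, stride (2,2) and 'same' padding
avg_pool = layers.AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')

# Global versions reduce every channel to a single number
global_avg = layers.GlobalAveragePooling2D()
global_max = layers.GlobalMaxPooling2D()
```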
Up to this point, we only allowed layers to be placed strictly sequentially. In order to implement a horizontal layout, we connect several layers to the same input and combine their results with the help of a concatenation layer. This concept of performing not only one operation at a given depth of the network but several has proven very successful for image classification, as e.g. shown by Szegedy et al. (2014), who dubbed such horizontal layers Inception modules.
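A minimal sketch of such a horizontal, Inception-style module in keras could look as follows (the branch widths and kernel sizes are illustrative assumptions, not the modules of appendix B.3):

```python
from tensorflow.keras import layers

def tiny_inception_module(x):
    """Toy Inception-style block: three parallel branches concatenated along the channel axis."""
    branch1 = layers.Conv2D(32, (1, 1), padding='same', activation='relu')(x)
    branch3 = layers.Conv2D(32, (3, 3), padding='same', activation='relu')(x)
    pooled = layers.AveragePooling2D((3, 3), strides=(1, 1), padding='same')(x)
    pooled = layers.Conv2D(32, (1, 1), padding='same', activation='relu')(pooled)
    return layers.Concatenate(axis=-1)([branch1, branch3, pooled])
```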
The output of a layer can be followed by a non-linear activation function. For convolution and pooling layers we mainly deploy rectified linear units (ReLU), and we give the full detail about the activation functions used in this work in appendix B.2. To prevent the network from overfitting, so-called dropout layers are introduced as a regularisation. In these, a given percentage of the elements of an input vector are chosen at random and are subsequently discarded from the output (Srivastava et al., 2014). Finally, to compensate for fluctuations in the amplitudes of input vectors at different network depths, Ioffe & Szegedy (2015) introduced the concept of batch normalisation, which we also use after each convolutional layer. The output of the last layer in the CNN, the feature vector $\boldsymbol{f}$, is used for classification in a final section of the network which is commonly referred to as the top. The concrete architecture of the CNN that we use in this work is provided in section 4.3 and appendix B.3.
3.3 Feature-based classification
We now turn our attention to the classification function $C$. We investigate two different approaches to classification. The first one is a nearest-neighbour-search scheme based on distances in feature space. The other approach, based again on a class of neural networks, uses regression through a training set to find the optimal mapping between features and labels.
3.3.1 Feature space distances
In the following we denote with $T$ all those feature vectors that belong to a sample from the training set, and with $T_{c}$ the subset which belongs only to class $c$ of the training set. We calculate a Fisher discriminant (e.g. Bishop, 2006) to find suitable classification weights $w_{f}$ for each individual feature $f$:
$$w_{f} = \frac{\sum_{c=1}^{N_{\rm c}} \left(\langle f \rangle_{c} - \langle f \rangle\right)^{2}}{\sum_{c=1}^{N_{\rm c}} \sigma^{2}_{f,c}} \qquad (14)$$
Here $N_{\rm c}$ is the total number of classes, $\langle f \rangle$ and $\langle f \rangle_{c}$ are the means of the feature over the full training set and within class $c$, respectively, and $\sigma^{2}_{f,c}$ is the variance of the feature within class $c$.
Once we have found the weights, we can define a weighted nearest-neighbour distance (WNN) of any feature vector $\boldsymbol{f}$ to each class $c$ in the training set,
$$d_{\rm WNN}\left(\boldsymbol{f}, c\right) = \min_{\boldsymbol{t} \in T_{c}} \sum_{i=1}^{n_{f}} w_{i}\left(f_{i} - t_{i}\right)^{2}, \qquad (15)$$
where $n_{f}$ is the length of the feature vector. The problem with this WNN distance is the fact that it is based only on a single element in the training set, the one that minimises the sum in equation 15. To remedy this, Orlov et al. (2006) introduced a weighted neighbour distance (WND), which takes into account the distance to all elements in the training set, but largely penalises large distances through the free exponent $p$:
$$d_{\rm WND}\left(\boldsymbol{f}, c\right) = \frac{\sum_{\boldsymbol{t} \in T_{c}} \left[\sum_{i=1}^{n_{f}} w_{i}\left(f_{i} - t_{i}\right)^{2}\right]^{p}}{\left|T_{c}\right|} \qquad (16)$$
Orlov et al. (2006) found that the results do not depend strongly on the exact value of $p$ and that $p = -5$ is a generally good, numerically stable choice. The final step in order to make predictions is to define a similarity by taking a distance of choice, e.g. WNN or WND, and normalising appropriately:
$$p_{c} = \frac{d\left(\boldsymbol{f}, c\right)^{-1}}{\sum_{c'=1}^{N_{\rm c}} d\left(\boldsymbol{f}, c'\right)^{-1}} \qquad (17)$$
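A compact sketch of this distance-based classification, written here with the WNN distance of equation (15) for brevity (the WND variant of equation 16 replaces the minimum by the penalised sum), could look as follows under the notational assumptions above; it is an illustration, not the wnd-charm implementation:

```python
import numpy as np

def fisher_weights(train_feats, train_labels):
    """Per-feature Fisher discriminant weights, cf. equation (14)."""
    classes = np.unique(train_labels)
    overall_mean = train_feats.mean(axis=0)
    between = np.zeros(train_feats.shape[1])
    within = np.zeros(train_feats.shape[1])
    for c in classes:
        fc = train_feats[train_labels == c]
        between += (fc.mean(axis=0) - overall_mean) ** 2
        within += fc.var(axis=0)
    return between / np.maximum(within, 1e-12)   # guard against zero-variance features

def wnn_predict(f, train_feats, train_labels, weights):
    """Normalised class similarities from the weighted nearest-neighbour distance (eqs. 15 and 17)."""
    classes = np.unique(train_labels)
    dists = np.array([
        np.min(np.sum(weights * (train_feats[train_labels == c] - f) ** 2, axis=1))
        for c in classes
    ])
    inv = 1.0 / np.maximum(dists, 1e-12)
    return classes, inv / inv.sum()
```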
3.3.2 Fully connected neural networks
A different approach to the classification task is another form of neural network (equation 11). The main layer in such a neural network is a fully connected – sometimes called affine – layer FC($n$), which implements a linear mapping between the input vector of length $m$ and the output vector of length $n$ using a matrix of free parameters and an additional bias parameter (see appendix B.1). Such layers are again chained together and the last layer produces an output vector of the same length as the number of classes $N_{\rm c}$. As before, in between those layers one may use dropout, activation and normalisation layers. The top of the network is followed by a specific activation function called a softmax (see appendix B.2), which provides the desired predictions $\boldsymbol{p}$.
Since the optimal weights $\boldsymbol{w}_{C}$ are found by a regression, we define a loss function, which in the case of this classification problem is a categorical cross entropy
$$\mathcal{L} = -\sum_{j} \sum_{i=1}^{N_{\rm c}} y_{j,i} \log p_{j,i}, \qquad (18)$$
where $y_{j,i}$ are the labels for the elements $j$ in the training data and $p_{j,i}$ their class predictions given a current set of parameters. In order to minimise the loss, while continuously looping over the training data, we use a specific implementation of stochastic gradient descent called ADAM (Kingma & Ba, 2014). The gradients of our model are thereby calculated via a back-propagation algorithm (Rumelhart et al., 1986). We end this description of our methodology by noting that for a full feature extraction and classification chain, with a CNN as $\Gamma$ and a neural network as classifier $C$, the classification and feature extraction weights can be optimised at the same time.
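To illustrate, such a fully connected classifier with a softmax top, trained with ADAM and a categorical cross-entropy loss, could be set up in keras roughly as follows (a minimal sketch under the notation above; the concrete classifier used in this work is specified in section 4.2):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_feature_classifier(n_features=99, n_classes=9):
    """Small fully connected classifier operating on a fixed feature vector."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = layers.Dense(32)(inputs)             # FC(32)
    x = layers.LeakyReLU(alpha=0.03)(x)      # leaky ReLU activation
    x = layers.Dense(n_classes)(x)           # FC(9)
    outputs = layers.Softmax()(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```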
3.4 Numerical setup
As mentioned earlier, we use the Python package LensTools (https://github.com/apetri/LensTools; Petri, 2016) for the extraction of the standard map descriptors from section 3.2.1. For the computer vision fixed features from section 3.2.2 we slightly adapted the publicly available version of wnd-charm (https://github.com/wnd-charm/wnd-charm). We altered the C++ version of the feature extraction algorithm to accept FITS files (Hanisch et al., 2001) as an input image container with pixel values as double precision floating-point numbers. We then use the feature output files of wnd-charm as an input for our own distance-based classification pipeline written in Python. We make these routines publicly available in a dedicated repository (https://bitbucket.org/jmerten82/mydnn). All deep learning elements of our analysis stack are built on the widely used tensorflow framework (https://www.tensorflow.org/), which uses NVIDIA's cuDNN library (Chetlur et al., 2014) to carry out tensor operations on GPUs. We pair the tensorflow backend with the high-level deep learning Python interface keras (https://keras.io/) as a frontend. The network training was carried out on two NVIDIA Titan Xp GPUs. All convergence maps and Jupyter (http://jupyter.org/) notebooks used to produce the results in this work are either linked to or publicly available in the aforementioned repository. In there, we refer the reader to the 'reproducible_science' folder.
4 Results
Section 3 introduced a number of methods to perform mass map characterisation and classification. We now present the results obtained by applying those methods and provide details on their training process with the help of the validation sets. For the most successful method we investigate the dependence of the results on the convergence map source redshift, and we end this section with a closer look at the most relevant features which are extracted by the different methods. If not stated otherwise, the results in this section are based on training, validation and test set maps at a source redshift $z_{\rm s} = 1$.
4.1 Classification based on feature distance
For the case of the distance-based classifier from section 3.3.1, the training process is just the derivation of the Fisher weights shown in equation 14. We calculate them using the training set and present the 20 top-ranked features for the classical descriptors in table 2 and for the wnd-charm features in table 3. For the first case, we see quite a mix of features at the top, with the power spectrum and peak counts being the most important ones. For the wnd-charm features, however, the ranking is completely dominated by Zernike coefficients on transformations of the image, with a few contributions of Haralick textures. One should keep in mind, though, that we extract a total of 2919 features, out of which 51 carry the largest weights, 868 carry smaller but non-negligible weights and only 193 features have a vanishing weight. It is the combination of all the non-zero weights which will lead to the distance-based classification later on.
Rank | Name | Index | Weight |
---|---|---|---|
1 | Power spectrum | 11 | 0.106 |
2 | " | 10 | 0.104 |
3 | " | 9 | 0.092 |
4 | " | 12 | 0.084 |
5 | Peak counts | 13 | 0.083 |
6 | Power spectrum | 8 | 0.078 |
7 | Peak counts | 12 | 0.078 |
8 | Power spectrum | 7 | 0.065 |
9 | Peak counts | 5 | 0.064 |
10 | Skewness | – | 0.059 |
11 | Peak counts | 14 | 0.055 |
12 | Power spectrum | 6 | 0.051 |
13 | Percentile | 100 | 0.051 |
14 | Minkowski functional 1 | 14 | 0.049 |
15 | Percentile | 0 | 0.049 |
16 | Power spectrum | 13 | 0.046 |
17 | Minkowski functional 2 | 14 | 0.045 |
18 | Peak counts | 11 | 0.044 |
19 | Power spectrum | 5 | 0.042 |
20 | Kurtosis | – | 0.042 |
Rank | Transform | Name | Index | Weight |
---|---|---|---|---|
1 | F | Zernike coefficients | 20 | 0.285 |
2 | F | " | 42 | 0.270 |
3 | F(W) | " | 50 | 0.255 |
4 | F(E) | " | 52 | 0.242 |
5 | F(W) | " | 21 | 0.236 |
6 | F(E) | " | 39 | 0.214 |
7 | F | " | 12 | 0.205 |
8 | F(W) | " | 22 | 0.204 |
9 | F(E) | " | 37 | 0.196 |
10 | F(W) | " | 56 | 0.183 |
11 | F(E) | " | 5 | 0.174 |
12 | F(E) | " | 28 | 0.170 |
13 | W | Haralick textures | 5 | 0.169 |
14 | F | Zernike coefficients | 17 | 0.166 |
15 | F(E) | " | 34 | 0.166 |
16 | F(E) | Haralick textures | 0 | 0.164 |
17 | F(E) | " | 14 | 0.161 |
18 | F(E) | Zernike coefficients | 24 | 0.159 |
19 | F | " | 60 | 0.154 |
20 | – | Edge features | 0 | 0.152 |
For the 99 standard features we find a total classification success rate of 22%, meaning that out of 14742 samples in the test set, only 3243 were classified correctly. For some specific classes the classification success rate is barely above the success rate of a random guess (11%); one important class, for example, shows a success rate of only 13%. The picture improves marginally when using the 2919 wnd-charm features instead: the total classification success over all classes rises mildly to 25%. While several models still show success rates around or even below 11%, at least some classes are now significantly above the 20% level. We do not show more details on the distance-based classification (a full success rate analysis is provided in the repository associated with this article, https://bitbucket.org/jmerten82/mydnn), since it is clear from those results already that this classification method does not qualify for a successful discrimination of our models.
4.2 Classification based on neural network
We now use the same set of fixed features but feed them into a fully-connected neural network for classification. For the case of the 99 standard features we show the very simple topology of the classification network in table 4. The same network is used to classify the 2919 wnd-charm features, but due to the larger input vector the number of free parameters is larger, which we indicate by a square bracket notation in the same table. The regression to find the optimal parameters of the main fully-connected layers is based on the training set. In total we train with 110601 feature vectors of shape (1,1,99) or (1,1,2919), where one iteration over all those elements during the regression is commonly called an epoch. Gradient evaluations and corresponding changes to the network parameters are made after a subset of an epoch, usually called a batch. The batch size in this case was set to 128. After each epoch, we evaluate the current performance of the network with the 22113 (2457 per class) feature vectors in the validation set. Figure 1 shows, for both feature sets, the evolution of the loss function for the training and validation data as a function of training epoch. For the larger feature vector, the validation loss starts to saturate around epoch 70 while the training loss keeps declining. This indicates that the network starts to overfit, meaning that it learns training-set specific features which are of no use to characterise the validation set or any data unknown to the model. This is where we stop the training and save the model parameters which produced the smallest validation loss.
Index | Layer | free parameters | Output shape |
---|---|---|---|
1 | Input | 0 | (1,1,99[2919]) |
2 | FC(32) | 3200 [93440] | (1,1,32) |
3 | leakyReLU(0.03) | 0 | (1,1,32) |
4 | FC(9) | +297 | (1,1,9) |
5 | Softmax | 0 | (1,1,9) |
Output | =3497 [93737] | 9
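In keras, this training procedure (ADAM, batches of 128 and stopping at the smallest validation loss) maps onto standard callbacks; the following minimal sketch assumes a compiled classifier `model` like the one sketched in section 3.3.2, and the file name, epoch count and patience value are illustrative choices:

```python
from tensorflow.keras import callbacks

# Keep the weights that achieved the smallest validation loss and stop once it stops improving
checkpoint = callbacks.ModelCheckpoint('best_classifier.h5',
                                       monitor='val_loss', save_best_only=True)
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                     restore_best_weights=True)

# x_* hold feature vectors, y_* the corresponding one-hot labels
history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=200,
                    validation_data=(x_val, y_val),
                    callbacks=[checkpoint, early_stop])
```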
The neural network classification yields significantly better results compared to the classification based on feature-space distances. In the case of the 99 standard features the total classification rate rises to 39%, and to 35% in the case of the wnd-charm features. Most interestingly, the smaller vector of 99 classical features produces better results than the much larger feature vector provided by wnd-charm. Some of the most discriminative features from the computer vision method shown in table 3 are certainly describing the data well and should be used in future analyses; however, once the information from all standard descriptors such as the binned power spectrum, peak counts and Minkowski functionals is combined in an optimal way by a neural network, there is no advantage in using features that are inspired by computer vision only. Table 5 shows the classification success matrix for the standard features, where each row refers to a subset of the test data comprising only maps from the true class labelled by the first column. The first number in each block of four shows how many times the 1638 members of this subset have been sorted into the respective predicted class, which is indicated by the label in the very first row. The second number is the percentage of predictions with respect to the total number of maps in the class. The third and fourth numbers are the mean and its standard error of the prediction probability for all maps in the subset given by the row and for the class predictions indicated by the label of the column. For an optimal classification, only the diagonal of this matrix (those fields typeset in boldface) would show non-zero values.
CDM | |||||||||
958 | 80 | 157 | 103 | 97 | 137 | 26 | 24 | 56 | |
58% | 5% | 10% | 6% | 6% | 8% | 2% | 1% | 3% | |
70 | 1135 | 1 | 72 | 10 | 18 | 31 | 116 | 185 | |
4% | 69% | 0% | 4% | 1% | 1% | 2% | 7% | 11% | |
232 | 6 | 857 | 158 | 275 | 84 | 10 | 5 | 11 | |
14% | 0% | 52% | 10% | 17% | 5% | 1% | 0% | 1% | |
149 | 90 | 173 | 651 | 254 | 120 | 38 | 52 | 111 | |
9% | 5% | 11% | 40% | 16% | 7% | 2% | 3% | 7% | |
223 | 22 | 485 | 333 | 413 | 98 | 18 | 14 | 32 | |
14% | 1% | 30% | 20% | 25% | 6% | 1% | 1% | 2% | |
193 | 103 | 44 | 181 | 65 | 512 | 126 | 171 | 243 | |
12% | 6% | 3% | 11% | 4% | 31% | 8% | 10% | 15% | |
128 | 194 | 30 | 163 | 39 | 377 | 147 | 257 | 303 | |
8% | 12% | 2% | 10% | 2% | 23% | 9% | 16% | 18% | |
71 | 262 | 18 | 164 | 39 | 279 | 109 | 345 | 351 | |
4% | 16% | 1% | 10% | 2% | 17% | 7% | 21% | 21% | |
CDM | 69 | 265 | 5 | 135 | 9 | 158 | 60 | 166 | 771 |
4% | 16% | 0% | 8% | 1% | 10% | 4% | 10% | 47% | |
While table 5 gives a good indication of what to expect from a classification of single maps, only the mean and its standard error on the class predictions give an idea of how well the full ensemble of test set maps is classified. We therefore further evaluate the statistics of the prediction vectors for each true test set class. Figure 2 shows nine panels of box plots, each of which represents the statistics for one such subset. The black box in each panel represents the correct predictions, equivalent to the bold diagonal of table 5. The horizontal line spanning each panel is the median for all the true class predictions and the error band shows the scatter of medians derived from 1000 bootstrap samples (this number is of course arbitrary, but it is close to the sample size and we checked that the bootstrap-derived error does not depend significantly on the number of bootstraps). The upper and lower ends of each box show the 75th and 25th percentiles, respectively, and the whiskers show the outlier-cleaned minimum and maximum values of the class predictions. Whenever a box which is not the true label is shown in green, it means that the median and its errors, indicated by the notches of each box, are lower than those of the correct prediction box and do not overlap with its horizontal error band. If those criteria are not met, the respective box is shown in red.

When looking at the results in table 5 and figure 2, the following observations catch the eye. Although the overall classification success rate is only 39%, none of the classes is classified incorrectly as an ensemble. In the case of the two fR4 models we see a clear separation of the correct predictions from the other classes. This is confirmed by the classification matrix, which shows no substantial overlap with any other model. This however changes for the three fR5 and the three fR6 models. Although the median for the correct predictions is the highest for all of these models, the degeneracies within the same family of f(R) gravity are strong in those cases, as one can see from the basically equal heights of the centres of the boxes in figure 2 and from the classification matrix, which lists a large number of misclassifications, up to 30% in the worst case. A lot more severe is the case of the three fR6 models. For them we find substantial overlap of up to 21% with ΛCDM. Even the predictions for ΛCDM itself are not completely separate from the other models, with an overlap of up to 16%.
4.3 Convolutional neural network
The convolutional neural network (CNN) extracts the characterising features directly from the pixel data of the training mass maps. We have experimented with a number of architectures, including classic topologies which implement a large number of convolutions inspired by VGG-net (Simonyan & Zisserman, 2014), as well as architectures presented in Ravanbakhsh et al. (2017) and Gupta et al. (2018). The model that worked best for our purposes is almost exclusively based on the Inception layers first presented in Szegedy et al. (2014). Here we adopt one of its latest iterations, version 4 introduced in Szegedy et al. (2016). The global linear structure of our CNN is shown in table 6 and we describe in detail the different elements of this network and their purpose in appendix B.3.
Layer | free parameters | Output shape | |
---|---|---|---|
1 | Input | 0 | (256,256,1) |
2 | Conv(3,3,2,2,v,32) | 288 | (127,127,32) |
3 | lReLU(0.03) | 0 | (127,127,32) |
4 | Conv(3,3,1,1,v,32) | +9216 | (125,125,32) |
5 | lReLU(0.03) | 0 | (125,125,32) |
6 | Conv(3,3,1,1,s,64) | +9216 | (125,125,32) |
7 | StemInception (Fig. 9) | +555008 | (29,29,384) |
8 | InceptionA (Fig. 10) | +316416 | (29,29,384) |
9 | ReductionA (Fig. 11) | +2304000 | (14,14,1024) |
10 | InceptionB (Fig. 12) | +2931712 | (14,14,1024) |
11 | ReductionB (Fig. 13) | +2744320 | (6,6,1536) |
12 | InceptionC (Fig. 14) | +4546560 | (6,6,1536) |
13 | GlobalAvgPool | 0 | (1,1,1536) |
14 | Dropout(0.33) | 0 | (1,1,1536) |
15 | FC(9) | +13833 | (1,1,9) |
16 | Softmax | 0 | (1,1,9) |
17 | Output | =13469865 | 9
We visualise the evolution of the network’s loss during training in figure 3.

The total classification success rate of the CNN is 52% and its classification success matrix is shown in table 7.
CDM | |||||||||
1307 | 116 | 36 | 21 | 31 | 88 | 12 | 5 | 22 | |
80% | 7% | 2% | 1% | 2% | 5% | 1% | 0% | 1% | |
51 | 1298 | 0 | 11 | 0 | 7 | 8 | 27 | 236 | |
3% | 79% | 0% | 1% | 0% | 0% | 0% | 2% | 14% | |
105 | 1 | 1065 | 90 | 320 | 48 | 2 | 1 | 6 | |
6% | 0% | 65% | 5% | 20% | 3% | 0% | 0% | 0% | |
103 | 37 | 161 | 721 | 347 | 78 | 14 | 31 | 146 | |
6% | 2% | 10% | 44% | 21% | 5% | 1% | 2% | 9% | |
122 | 5 | 624 | 271 | 514 | 70 | 5 | 7 | 20 | |
7% | 0% | 38% | 17% | 31% | 4% | 0% | 0% | 1% | |
51 | 27 | 11 | 42 | 44 | 968 | 74 | 307 | 114 | |
3% | 2% | 1% | 3% | 3% | 59% | 5% | 19% | 7% | |
36 | 46 | 11 | 40 | 35 | 713 | 95 | 458 | 204 | |
2% | 3% | 1% | 2% | 2% | 44% | 6% | 28% | 12% | |
20 | 79 | 5 | 35 | 17 | 558 | 64 | 565 | 295 | |
1% | 5% | 0% | 2% | 1% | 34% | 4% | 34% | 18% | |
CDM | 41 | 179 | 0 | 43 | 3 | 99 | 43 | 144 | 1086 |
3% | 11% | 0% | 3% | 0% | 6% | 3% | 9% | 66% | |
Compared to the fixed feature results in table 5, we find much larger true prediction values for many models. Exceptions are the two fR6 models with massive neutrinos. Figure 4 shows the statistics of the predictions for all classes in the test set and reveals that these two models, even as an ensemble, cannot be classified correctly by the CNN, since the error bars on the medians of the predictions in their samples overlap with other models.

However, the degeneracy with ΛCDM is now broken for all models and the CNN robustly discriminates most of the nine models from each other.
4.4 Dependence on redshift
A source redshift of $z_{\rm s} = 1$ is realistic for future space- and ground-based surveys, but it is certainly optimistic for current ground-based surveys. On the other hand, it also does not test the full potential of our classification methods, since one would expect a better classification accuracy for larger source redshifts. We therefore repeat training and classification for one lower ($z_{\rm s} = 0.5$) and one higher ($z_{\rm s} = 2$) source redshift. For simplicity we restrict this analysis to the CNN, which delivered the best results.
For a source redshift of $z_{\rm s} = 0.5$ the overall accuracy drops significantly from 52% to 44%. When comparing the prediction statistics of the full set at this redshift in figure 5 with the reference at $z_{\rm s} = 1$ in figure 4, one can see that the decrease in the overall accuracy mainly stems from a weaker separation of two further models. The known issue of degeneracies between the different neutrino masses within the fR5 and fR6 families is already present and more prominent. The issue of model misclassification within f(R) gravity gets worse, with now two misclassifications.

The improvements when going from $z_{\rm s} = 1$ to $z_{\rm s} = 2$ are highlighted by figure 6. For $z_{\rm s} = 2$ the network's ability to distinguish between the base models increases and the overall classification accuracy is now 59%. The discrimination accuracy for massive neutrinos within each gravity model also increases, and only the two fR6 models with massive neutrinos show significant overlap. Those models are also the only ones that show residual, but insignificant, overlap with ΛCDM. Given the fact that one ensemble of maps also gets misidentified, it is clear that the discrimination within the fR6 family remains an issue even at a larger source redshift.

As a last analysis using the CNN we perform a tomographic classification. For each line-of-sight realisation we are not using a single mass map at a specific source redshift, but we feed data vectors of shape (256,256,4) into the CNN, where the four channels refer to the source redshifts $z_{\rm s} = 0.5$, $1.0$, $1.5$ and $2.0$, respectively. The classification success matrix for this analysis is shown in table 8 and figure 7 shows the familiar box-plot representation of the prediction-vector statistics. The overall classification success rate rises to 76% and all models besides the two fR6 models with massive neutrinos now show correct classification rates of 74% or clearly above. The probabilities of correctly classifying a single map in those two models are only 38% and 50%, respectively; however, a look at figure 7 reveals that they are correctly classified as an ensemble and at high significance. Finally, it is worth noting that none of the models shows any degeneracy with ΛCDM which is larger than 4%, according to table 8.
CDM | |||||||||
1618 | 0 | 7 | 0 | 10 | 3 | 0 | 0 | 0 | |
99% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | |
0 | 1501 | 0 | 1 | 0 | 0 | 1 | 20 | 115 | |
0% | 92% | 0% | 0% | 0% | 0% | 0% | 1% | 7% | |
3 | 0 | 1257 | 2 | 375 | 1 | 0 | 0 | 0 | |
0% | 0% | 77% | 0% | 23% | 0% | 0% | 0% | 0% | |
0 | 0 | 1 | 1470 | 96 | 22 | 7 | 2 | 40 | |
0% | 0% | 0% | 90% | 6% | 1% | 0% | 0% | 2% | |
0 | 0 | 130 | 207 | 1289 | 12 | 0 | 0 | 0 | |
0% | 0% | 8% | 13% | 79% | 1% | 0% | 0% | 0% | |
0 | 0 | 0 | 15 | 1 | 1206 | 275 | 30 | 111 | |
0% | 0% | 0% | 1% | 0% | 74% | 17% | 2% | 7% | |
0 | 0 | 0 | 7 | 0 | 363 | 627 | 319 | 322 | |
0% | 0% | 0% | 0% | 0% | 22% | 38% | 19% | 20% | |
0 | 3 | 0 | 3 | 0 | 77 | 385 | 827 | 343 | |
0% | 0% | 0% | 0% | 0% | 5% | 24% | 50% | 21% | |
CDM | 0 | 1 | 0 | 6 | 0 | 57 | 69 | 37 | 1468 |
0% | 0% | 0% | 0% | 0% | 3% | 4% | 2% | 90% | |

4.5 Remarks on extracted features
After presenting the raw classification results for the different methods, we now briefly investigate what insight can be gathered into the actual meaning and importance of the specific features that drive the classification success of the different methods. To do so, we take a closer look at the training process. The first important observation is strikingly highlighted in table 3, which shows that almost all of the most discriminating wnd-charm features are Zernike coefficients derived from the Fourier transform of the raw image or from the Fourier transform of the edge- or wavelet-processed image. This is interesting since Zernike polynomials were originally introduced to describe the effects of certain optical elements such as lenses or reflecting surfaces in optical imaging (Zernike, 1934). This suggests that a decomposition of mass maps into a function set that has a well-defined physical meaning does indeed lead to a good general representation of our data. In addition, all those features are derived from transformations of the raw mass map, which shows the power of filtering the input data, as e.g. shown by Peel et al. (2018). The ranking of the standard features shown in table 2 is less dominated by a single class, although the power spectrum and peak counts seem most relevant. The good results with a neural network as classifier show that the optimal combination of such classical features leads to a good classification even without the need for additional descriptors.
CNNs often deliver superior results compared to other methods for certain tasks, but it is often believed that they are harder to understand and interpret. We attempt to dispel this belief by applying visualisation techniques to the different filters linked together in a deep neural network (Girshick et al., 2013; Zeiler & Fergus, 2013; Szegedy et al., 2013; Springenberg et al., 2014), in order to reveal the inner workings of the complex model. We follow the approach of Simonyan et al. (2013) to extract our filter responses (see also https://github.com/keras-team/keras/blob/master/examples/conv_filter_visualization.py). Starting from an image of random numbers with the same shape as our mass maps, we retrieve the output of every convolutional layer in the network and perform a gradient ascent in order to maximise the response of those layers. While this is of course not a unique solution, the result of the final iteration of the ascent represents an example which triggered a strong response at a particular depth in the network. In figure 8 we show a few examples. The top row shows the four channels which had the strongest loss compared to the initial random image in CNN layer one, that is the convolution marked with index 2 in table 6. The second row shows the top four channel responses of the convolution with stride two just above the input layer in figure 9. The row marked with InceptionA shows the most responsive channels among all four convolutions just below the concatenation layer in figure 9, and equivalently for the figure rows marked InceptionB and C.
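A minimal sketch of this gradient-ascent visualisation in tensorflow/keras could look as follows (the layer name, step size and number of iterations are illustrative assumptions):

```python
import numpy as np
import tensorflow as tf

def visualise_filter(model, layer_name, channel, n_steps=100, step_size=1.0):
    """Gradient ascent on a random input image to maximise one channel of a convolutional layer."""
    layer = model.get_layer(layer_name)
    feature_extractor = tf.keras.Model(model.inputs, layer.output)

    # start from a random image with the same shape as a single-channel mass map patch
    image = tf.Variable(np.random.uniform(-0.1, 0.1, size=(1, 256, 256, 1)).astype('float32'))

    for _ in range(n_steps):
        with tf.GradientTape() as tape:
            activation = feature_extractor(image)
            loss = tf.reduce_mean(activation[..., channel])   # mean response of the chosen channel
        grads = tape.gradient(loss, image)
        grads = grads / (tf.norm(grads) + 1e-8)               # normalised gradient ascent step
        image.assign_add(step_size * grads)

    return image.numpy()[0, ..., 0]
```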

As is typical for CNNs (Zeiler & Fergus, 2013), the very first level extracts very regular horizontal and vertical stripe patterns from the image. The stripes turn into a grid pattern deeper into the network and, once arriving at the end of the InceptionA layer, we can identify patterns of peaks and troughs which are either grouped regularly or along larger structures. It is not surprising that the earlier layers of the network, up to InceptionA, perform a global filtering of the map that highlights structure, as long as the image still consists of a relatively large number of pixels. It is only from the finer InceptionB layers onwards that more specific structures, such as objects that look like individual clusters or voids, are picked up. It is such detailed analyses of the inner structure of trained CNNs that will lead to a deeper understanding of why those networks work so well. This can potentially lead to the development of more specific algorithms at lower numerical cost but with similar or better classification performance.
5 Conclusions
We studied the ability of different kinds of machine learning techniques to discriminate between highly degenerate cosmological models, which combine the effects of modified gravity and massive neutrinos on structure formation. For this purpose we used a subset of the DUSTGRAIN-pathfinder simulation suite, which consists of a ΛCDM model and eight $f(R)$ gravity models with scalar amplitudes in the range $-1\times 10^{-4} \leq f_{R0} \leq -1\times 10^{-6}$. The neutrino masses in the simulations span $0 \leq m_{\nu} \leq 0.3$ eV. Lensing convergence maps produced from these simulations provided the input for the different classification methods.
In order to characterise the mass maps we used three different approaches to feature extraction. Statistics commonly used in astrophysics – among others the power spectrum, peak counts and Minkowski functionals – were combined into a single feature vector. In order to probe features which are more common to the fields of computer vision and digital image processing, we used the publicly available wnd-charm algorithm, which produces a large feature vector that combines a variety of common and more exotic descriptors and statistics. As the most flexible method of feature extraction we used a convolutional neural network (CNN). For classification we tested a nearest-neighbour method in feature space and a fully-connected neural network.
We provide an overview of the classification results from section 4 in table 9 and our results can be summarised as follows:
features | classifier | total accuracy | degenerate classes | ΛCDM performance | Reference
---|---|---|---|---|---
classic | nearest neighbour | 22% | 8 | 14%/15% () | repository | |
wnd-charm | nearest neighbour | 25% | 7 | 24%/24%() | repository | |
classic | neural network | 39% | 3 | 47%/16%() | Tab. 5, Fig. 2 | |
wnd-charm | neural network | 36% | 4 | 42%/24%() | repository | |
CNN | neural network | 44% | 3 | 52%/15%() | Fig. 5 | |
CNN | neural network | 52% | 2 | 66%/11%() | Tab. 7, Fig. 4 | |
CNN | neural network | 59% | 1 | 53%/12%() | Fig. 6 | |
CNN | neural network | 76% | 0 | 90%/4%() | Tab. 8, Fig. 7 |
- Nearest-neighbour classifiers based on distances in feature space do not deliver robust results. No matter whether a small classical feature vector is used or a longer version based on computer vision, the total classification accuracy stays below 25%. Eight out of the nine tested models remain observationally degenerate (we declare a model as degenerate if the median and its error for the predictions of a true test set class overlap with the median and its error of the predictions for any other class).
- With the same classical or computer vision feature vectors, a neural network delivers a much more robust classification than the nearest-neighbour method. The total success rate for the classical feature vector is 39% and the number of degenerate models reduces to three.
- The longer feature vector containing 2919 features inspired by computer vision delivers a slightly worse classification of our models than the shorter vector with 99 classical descriptors. The total classification success rate is 3% lower and the method produces one additional degenerate model. Some of the computer vision features may very well be useful, but currently we see no advantage in using features inspired by digital image processing compared to features well-established in cosmology.
- A CNN delivers the best classification results, with 52% correct classifications at source redshift $z_{\rm s} = 1$. The number of degenerate models reduces to two, both of which are part of the same model of f(R) gravity.
- Classification success is clearly a function of mass map source redshift. When going from $z_{\rm s} = 1$ to $z_{\rm s} = 0.5$ the success rate of the CNN decreases by 8% and the number of degenerate models increases by one. When going from $z_{\rm s} = 1$ to $z_{\rm s} = 2$ the accuracy increases by 7% and the number of degenerate models reduces by one. This increase of success rate with increasing redshift is not surprising, since more information relevant to structure formation is picked up along a deeper line-of-sight.
- When using a CNN in a tomographic analysis of four different mass map source redshifts along the same line-of-sight, all observational degeneracies are fully broken. The total classification success rate increases to 76%.
A number of improvements to our methodology come to mind and we reserve them for future work. Firstly, the flexible features derived by a CNN can be combined with fixed features that are known to contribute to a successful classification of degenerate models. Secondly, instead of working on the raw image data, a clever transformation can be applied to the input data to enhance features that allow for the desired discrimination. We attempt such an approach in the context of machine learning in Peel et al. (2018b, PRL submitted). In fact, the CNN used in this work applies such transformations as we discussed in section 4.5. A careful analysis of the filtering process of a CNN at the early levels of its filter chain can provide useful insight into the most powerful image transformation for a given classification task. Furthermore, the careful analysis of the filters at a much deeper level of the network might actually lead to more insights on structure formation in different models, since it is at this deeper level where individual structure is characterised and isolated by the algorithm.
Much work is left to be done before this machine learning approach to the classification of mass maps in different cosmological models can be applied to real data. In this work we limited ourselves to optimal, noise-free maps in order to see how different methodologies compare under optimal conditions. The influence of pixel shot-noise, observational systematics and practical issues such as masking and image artefacts needs to be studied in detail. Furthermore, since the currently most successful methods use a supervised training process with labelled data based on numerical simulations, it needs to be carefully investigated how closely those simulated maps resemble a real observation. Without this important sanity check, even the best machine learning technique is useless, since it would be learning from the wrong data.
Acknowledgements
We would like to thank Ofer Springer for useful discussions about deep learning. JM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 664931. AP acknowledges support from an Enhanced Eurotalents Fellowship, a Marie Skłodowska-Curie Actions Programme co-funded by the European Commission and Commissariat à l’énergie atomique et aux énergies alternatives (CEA). CG and MB acknowledge support from the Italian Ministry for Education, University and Research (MIUR) through the SIR individual grant SIMCODE (project number RBSI14P4IH). CG and MM acknowledge support from the Italian Ministry of Foreign Affairs and International Cooperation, Directorate General for Country Promotion (Project "Crack the lens"). We also acknowledge the support from the grant MIUR PRIN 2015 "Cosmology and Fundamental Physics: illuminating the Dark Universe with Euclid" and the financial contribution from the agreement ASI n.I/023/12/0 "Attività relative alla fase B2/C per la missione Euclid". The DUSTGRAIN-pathfinder simulations analysed in this work have been performed on the Marconi supercomputing machine at Cineca thanks to the PRACE project SIMCODE1 (grant nr. 2016153604) and on the computing facilities of the Computational Center for Particle and Astrophysics (C2PAP) and of the Leibniz Supercomputer Center (LRZ) under the project ID pr94ji. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the two Titan Xp GPUs used for this research.
References
- ATLAS Collaboration (2013) ATLAS Collaboration 2013, preprint, (arXiv:1309.4017)
- Aartsen et al. (2013) Aartsen M. G., et al., 2013, Physical Review Letters, 110, 131302
- Abbott et al. (2018) Abbott T. M. C., et al., 2018, Phys. Rev. D, 98, 043526
- Abraham et al. (2003) Abraham R. G., van den Bergh S., Nair P., 2003, ApJ, 588, 218
- Ackermann et al. (2017) Ackermann M., et al., 2017, ApJ, 840, 43
- Alam et al. (2017) Alam S., et al., 2017, MNRAS, 470, 2617
- Albert et al. (2017) Albert A., et al., 2017, ApJ, 834, 110
- Amendola et al. (2018) Amendola L., et al., 2018, Living Reviews in Relativity, 21, 2
- Arnold et al. (2014) Arnold C., Puchwein E., Springel V., 2014, MNRAS, 440, 833
- Arnold et al. (2015) Arnold C., Puchwein E., Springel V., 2015, Mon. Not. Roy. Astron. Soc., 448, 2275
- Arnold et al. (2016) Arnold C., Springel V., Puchwein E., 2016, MNRAS, 462, 1530
- Arnold et al. (2018) Arnold C., Fosalba P., Springel V., Puchwein E., Blot L., 2018, ArXiv e-prints 1805.09824,
- Baldi & Villaescusa-Navarro (2018) Baldi M., Villaescusa-Navarro F., 2018, Mon. Not. Roy. Astron. Soc., 473, 3226
- Baldi et al. (2014) Baldi M., Villaescusa-Navarro F., Viel M., Puchwein E., Springel V., Moscardini L., 2014, Mon. Not. Roy. Astron. Soc., 440, 75
- Barreira et al. (2016) Barreira A., Llinares C., Bose S., Li B., 2016, J. Cosmology Astropart. Phys., 5, 001
- Bartelmann & Schneider (2001) Bartelmann M., Schneider P., 2001, Phys. Rep., 340, 291
- Benitez et al. (2014) Benitez N., et al., 2014, ArXiv e-prints 1403.5237,
- Bennett et al. (2013) Bennett C., et al., 2013, Astrophys.J.Suppl., 208, 20
- Bernabei et al. (2018) Bernabei R., et al., 2018, ArXiv e-prints 1805.10486,
- Bishop (2006) Bishop C. M., 2006, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA
- Buchdahl (1970) Buchdahl H. A., 1970, MNRAS, 150, 1
- CMS Collaboration (2016) CMS Collaboration 2016, preprint, (arXiv:1603.08914)
- Castro et al. (2018) Castro T., Quartin M., Giocoli C., Borgani S., Dolag K., 2018, MNRAS, 478, 1305
- Chetlur et al. (2014) Chetlur S., Woolley C., Vandermersch P., Cohen J., Tran J., Catanzaro B., Shelhamer E., 2014, CoRR, abs/1410.0759
- Chollet (2017) Chollet F., 2017, Deep Learning with Python, 1st edn. Manning Publications Co., Greenwich, CT, USA
- Dietrich & Hartlap (2010) Dietrich J. P., Hartlap J., 2010, MNRAS, 402, 1049
- Fogel & Sagi (1989) Fogel I., Sagi D., 1989, Biological Cybernetics, 61, 103
- Friedrich et al. (2018) Friedrich O., et al., 2018, Phys. Rev. D, 98, 023508
- Fu et al. (2008) Fu L., et al., 2008, Astron. Astrophys., 479, 9
- Giocoli et al. (2014) Giocoli C., Meneghetti M., Metcalf R. B., Ettori S., Moscardini L., 2014, MNRAS, 440, 1899
- Giocoli et al. (2016) Giocoli C., et al., 2016, MNRAS, 461, 209
- Giocoli et al. (2017) Giocoli C., et al., 2017, MNRAS, 470, 3574
- Giocoli et al. (2018a) Giocoli C., Baldi M., Moscardini L., 2018a, ArXiv e-prints: 1806.04681,
- Giocoli et al. (2018b) Giocoli C., Moscardini L., Baldi M., Meneghetti M., Metcalf R. B., 2018b, MNRAS,
- Girshick et al. (2013) Girshick R., Donahue J., Darrell T., Malik J., 2013, preprint, (arXiv:1311.2524)
- Goodfellow et al. (2016) Goodfellow I., Bengio Y., Courville A., 2016, Deep Learning. The MIT Press
- Graves (2013) Graves A., 2013, CoRR, abs/1308.0850
- Gruen et al. (2018) Gruen D., et al., 2018, Phys. Rev. D, 98, 023507
- Gupta et al. (2018) Gupta A., Matilla J. M. Z., Hsu D., Haiman Z., 2018, Phys. Rev. D, 97, 103515
- Hagstotz et al. (2018) Hagstotz S., Costanzi M., Baldi M., Weller J., 2018, ArXiv e-prints 1806.07400,
- Hanisch et al. (2001) Hanisch R. J., Farris A., Greisen E. W., Pence W. D., Schlesinger B. M., Teuben P. J., Thompson R. W., Warnock III A., 2001, A&A, 376, 359
- Haralick et al. (1973) Haralick R. M., Shanmugam K., Dinstein I., 1973, IEEE Transactions on Systems, Man, and Cybernetics, SMC-3, 610
- He (2013) He J.-h., 2013, arXiv:1307.4876,
- He et al. (2015) He K., Zhang X., Ren S., Sun J., 2015, preprint, (arXiv:1512.03385)
- Herbel et al. (2018) Herbel J., Kacprzak T., Amara A., Refregier A., Lucchi A., 2018, J. Cosmology Astropart. Phys., 7, 054
- Heymans et al. (2012) Heymans C., et al., 2012, MNRAS, 427, 146
- Hikage et al. (2018) Hikage C., et al., 2018, preprint, (arXiv:1809.09148)
- Hilbert et al. (2008) Hilbert S., White S. D. M., Hartlap J., Schneider P., 2008, MNRAS, 386, 1845
- Hilbert et al. (2009) Hilbert S., Hartlap J., White S. D. M., Schneider P., 2009, A&A, 499, 31
- Hildebrandt et al. (2017) Hildebrandt H., et al., 2017, MNRAS, 465, 1454
- Hu & Sawicki (2007) Hu W., Sawicki I., 2007, Phys. Rev., D76, 064004
- Ioffe & Szegedy (2015) Ioffe S., Szegedy C., 2015, CoRR, abs/1502.03167
- Ivezic et al. (2008) Ivezic Z., et al., 2008, arXiv:0805.2366
- Johnson et al. (2016) Johnson M., et al., 2016, CoRR, abs/1611.04558
- Joudaki et al. (2017) Joudaki S., et al., 2017, MNRAS, 465, 2033
- Kingma & Ba (2014) Kingma D. P., Ba J., 2014, CoRR, abs/1412.6980
- Kratochvil et al. (2010) Kratochvil J. M., Haiman Z., May M., 2010, Phys. Rev. D, 81, 043519
- Kratochvil et al. (2012) Kratochvil J. M., Lim E. A., Wang S., Haiman Z., May M., Huffenberger K., 2012, Phys. Rev. D, 85, 103513
- Krizhevsky et al. (2012) Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Pereira F., Burges C. J. C., Bottou L., Weinberger K. Q., eds, Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp 1097–1105
- Laureijs et al. (2011) Laureijs R., et al., 2011, eprint arXiv: 1110.3193,
- LeCun et al. (2015) LeCun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436
- Lin & Kilbinger (2018) Lin C.-A., Kilbinger M., 2018, A&A, 614, A36
- Lin et al. (2017) Lin T., Goyal P., Girshick R. B., He K., Dollár P., 2017, CoRR, abs/1708.02002
- Lucie-Smith et al. (2018) Lucie-Smith L., Peiris H. V., Pontzen A., Lochner M., 2018, MNRAS, 479, 3405
- Martinet et al. (2018) Martinet N., et al., 2018, MNRAS, 474, 712
- Motohashi et al. (2013) Motohashi H., Starobinsky A. A., Yokoyama J., 2013, Phys.Rev.Lett., 110, 121302
- Naik et al. (2018) Naik A. P., Puchwein E., Davis A.-C., Arnold C., 2018, ArXiv e-prints 1805.12221,
- Ntampaka et al. (2018) Ntampaka M., et al., 2018, preprint, (arXiv:1810.07703)
- Orlov et al. (2006) Orlov N., Johnston J., Macura T., Wolkow C., Goldberg I., 2006, in 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2006.. pp 1152–1155, doi:10.1109/ISBI.2006.1625127
- Orlov et al. (2008) Orlov N., Shamir L., Macura T., Johnston J., Eckley D. M., Goldberg I. G., 2008, Pattern recognition letters, 29, 1684
- Otsu (1979) Otsu N., 1979, IEEE Transactions on Systems, Man, and Cybernetics, 9, 62
- Parkinson et al. (2012) Parkinson D., et al., 2012, Phys. Rev. D, 86, 103518
- Peel et al. (2017) Peel A., Lin C.-A., Lanusse F., Leonard A., Starck J.-L., Kilbinger M., 2017, A&A, 599, A79
- Peel et al. (2018) Peel A., Pettorino V., Giocoli C., Starck J.-L., Baldi M., 2018, preprint, (arXiv:1805.05146)
- Perlmutter et al. (1999) Perlmutter S., et al., 1999, Astrophys. J., 517, 565
- Petri (2016) Petri A., 2016, Astronomy and Computing, 17, 73
- Petri et al. (2015) Petri A., Liu J., Haiman Z., May M., Hui L., Kratochvil J. M., 2015, Phys. Rev. D, 91, 103511
- Petri et al. (2017) Petri A., Haiman Z., May M., 2017, Phys. Rev. D, 95, 123503
- Pezzotta et al. (2017) Pezzotta A., et al., 2017, A&A, 604, A33
- Planck Collaboration et al. (2016a) Planck Collaboration et al., 2016a, A&A, 594, A13
- Planck Collaboration et al. (2016b) Planck Collaboration et al., 2016b, A&A, 594, A24
- Planck Collaboration et al. (2018) Planck Collaboration et al., 2018, ArXiv e-prints 1807.06209,
- Prewitt (1970) Prewitt J., 1970, Picture Processing and Psychopictorics. New York: Academic Press
- Puchwein et al. (2013) Puchwein E., Baldi M., Springel V., 2013, MNRAS, 436, 348
- Radon (1917) Radon J., 1917, in Berichte über die Verhandlungen der Königlich-Sächsischen Akademie der Wissenschaften zu Leipzig. Teubner, pp 262–277, doi:10.1109/TMI.1986.4307775
- Ravanbakhsh et al. (2017) Ravanbakhsh S., Oliva J., Fromenteau S., Price L. C., Ho S., Schneider J., Poczos B., 2017, preprint, (arXiv:1711.02033)
- Riess et al. (1998) Riess A. G., et al., 1998, Astron. J., 116, 1009
- Rodriguez et al. (2018) Rodriguez A. C., Kacprzak T., Lucchi A., Amara A., Sgier R., Fluri J., Hofmann T., Réfrégier A., 2018, preprint, (arXiv:1801.09070)
- Roncarelli et al. (2018) Roncarelli M., Baldi M., Villaescusa-Navarro F., 2018, MNRAS, 481, 2497
- Rumelhart et al. (1986) Rumelhart D. E., Hinton G. E., Williams R. J., 1986, Nature, 323, 533
- Schäfer et al. (2012) Schäfer B. M., Heisenberg L., Kalovidouris A. F., Bacon D. J., 2012, MNRAS, 420, 455
- Schmidt et al. (1998) Schmidt B. P., et al., 1998, Astrophys.J., 507, 46
- Schneider (1996) Schneider P., 1996, MNRAS, 283, 837
- Schneider et al. (1998) Schneider P., van Waerbeke L., Jain B., Kruse G., 1998, MNRAS, 296, 873
- Shamir et al. (2008) Shamir L., Orlov N., Eckley D. M., Macura T., Johnston J., Goldberg I., 2008, Source Code for Biology and Medicine, 3, 13
- Shamir et al. (2010) Shamir L., Delaney J., Orlov N., Eckley D., Goldberg I., 2010, PLoS Comput Biol 6(11): e1000974
- Shan et al. (2018) Shan H., et al., 2018, MNRAS, 474, 1116
- Shirasaki et al. (2017) Shirasaki M., Nishimichi T., Li B., Higuchi Y., 2017, MNRAS, 466, 2402
- Simonyan & Zisserman (2014) Simonyan K., Zisserman A., 2014, preprint, (arXiv:1409.1556)
- Simonyan et al. (2013) Simonyan K., Vedaldi A., Zisserman A., 2013, preprint, (arXiv:1312.6034)
- Spergel et al. (2015) Spergel D., et al., 2015, ArXiv e-prints 1503.03757,
- Springel (2005) Springel V., 2005, Mon. Not. Roy. Astron. Soc., 364, 1105
- Springel et al. (2005) Springel V., et al., 2005, Nature, 435, 629
- Springenberg et al. (2014) Springenberg J. T., Dosovitskiy A., Brox T., Riedmiller M., 2014, preprint, (arXiv:1412.6806)
- Springer et al. (2018) Springer O. M., Ofek E. O., Weiss Y., Merten J., 2018, preprint, (arXiv:1808.07491)
- Srivastava et al. (2014) Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., 2014, Journal of Machine Learning Research, 15, 1929
- Szegedy et al. (2013) Szegedy C., Zaremba W., Sutskever I., Bruna J., Erhan D., Goodfellow I., Fergus R., 2013, preprint, (arXiv:1312.6199)
- Szegedy et al. (2014) Szegedy C., et al., 2014, CoRR, abs/1409.4842
- Szegedy et al. (2016) Szegedy C., Ioffe S., Vanhoucke V., 2016, CoRR, abs/1602.07261
- Tamura et al. (1978) Tamura H., Mori S., Yamawaki T., 1978, IEEE Transactions on Systems, Man, and Cybernetics, 8, 460
- Teague (1980) Teague M. R., 1980, J. Opt. Soc. Am., 70, 920
- Tessore et al. (2015) Tessore N., Winther H. A., Metcalf R. B., Ferreira P. G., Giocoli C., 2015, J. Cosmology Astropart. Phys., 10, 036
- Troxel et al. (2018) Troxel M. A., et al., 2018, Phys. Rev. D, 98, 043528
- Van Waerbeke et al. (2013) Van Waerbeke L., et al., 2013, MNRAS, 433, 3373
- Viel et al. (2010) Viel M., Haehnelt M. G., Springel V., 2010, JCAP, 1006, 015
- Vikhlinin et al. (2009) Vikhlinin A., et al., 2009, ApJ, 692, 1060
- Villaescusa-Navarro et al. (2017) Villaescusa-Navarro F., Banerjee A., Dalal N., Castorina E., Scoccimarro R., Angulo R., Spergel D. N., 2017, preprint, (arXiv:1708.01154)
- White (1993) White S. D. M., 1993, in Gleiser R. J., Kozameh C. N., Moreschi O. M., eds, General Relativity and Gravitation 1992. pp 331–+
- White (1996) White S. D. M., 1996, in Schaeffer R., Silk J., Spiro M., Zinn-Justin J., eds, Cosmology and Large Scale Structure. p. 349
- White & Rees (1978) White S. D. M., Rees M. J., 1978, MNRAS, 183, 341
- Winther et al. (2015) Winther H. A., et al., 2015, Mon. Not. Roy. Astron. Soc., 454, 4208
- Wright et al. (2017) Wright B. S., Winther H. A., Koyama K., 2017, J. Cosmology Astropart. Phys., 10, 054
- Wu et al. (1992) Wu C.-M., Chen Y.-C., Hsieh K.-S., 1992, IEEE Transactions on Medical Imaging, 11, 141
- Wu et al. (2016) Wu Y., et al., 2016, CoRR, abs/1609.08144
- Zeiler & Fergus (2013) Zeiler M. D., Fergus R., 2013, preprint, (arXiv:1311.2901)
- Zennaro et al. (2017) Zennaro M., Bel J., Villaescusa-Navarro F., Carbone C., Sefusatti E., Guzzo L., 2017, MNRAS, 466, 3244
- Zernike (1934) Zernike v. F., 1934, Physica, 1, 689
Appendix A Wnd-charm features
The wnd-charm feature vector comprises 2919 descriptors, which can be divided into five families. We provide an overview of the features and their respective families in table 10. The algorithm works not only on the image itself (raw), but also on its Fourier (F), Wavelet (W), Chebyshev (C) or Edge (E) transform, as indicated by the 'Input' column of table 10. Transforms of transforms are denoted by the bracket notation. While the Fourier and Chebyshev transforms are implemented using common algorithms and methodologies, the Wavelet transform is performed with a one-level filter pass using a fifth-order symlet (Orlov et al., 2008), and the Edge transform is carried out with a Prewitt operator (Prewitt, 1970) to approximate the image gradient.
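As a rough illustration of these input transforms, the sketch below uses standard Python libraries; the exact conventions of the wnd-charm implementation (normalisation, boundary handling, the Chebyshev transform) may differ, so this is only an approximation.

```python
import numpy as np
import pywt
from scipy import ndimage

def transforms(img):
    """Return the raw image together with simple Fourier (F), Wavelet (W)
    and Edge (E) transforms analogous to the wnd-charm pre-processing."""
    fourier = np.abs(np.fft.fftshift(np.fft.fft2(img)))           # F: Fourier amplitude
    approx, (horiz, vert, diag) = pywt.dwt2(img, "sym5")           # W: one-level fifth-order symlet
    edge = np.hypot(ndimage.prewitt(img, axis=0),
                    ndimage.prewitt(img, axis=1))                  # E: Prewitt gradient magnitude
    return {"raw": img, "F": fourier, "W": approx, "E": edge}

maps = transforms(np.random.rand(128, 128))  # toy example on a random map
print({key: val.shape for key, val in maps.items()})
```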
The pixel-statistics family consists of four subclasses, the simplest being the intensity statistics: mean, median, standard deviation, minimum and maximum. The multi-scale histograms are calculated by binning the pixel amplitudes into three, five, seven or nine bins; the counts in those bins make up the 24 features of this subclass. The combined moments are mean, standard deviation, skewness and kurtosis, calculated in a horizontal stripe through the image centre with a width equal to half the total image width. The stripe is then rotated by 45, 90 and 135 degrees and the measurement is repeated. Those 16 numbers are sampled into three bins each, providing a total of 48 features. The Gini coefficient (Abraham et al., 2003) measures how equally the pixel intensities are distributed within the image.
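Two of these subclasses, the multi-scale histograms and the Gini coefficient, are simple enough to sketch directly; the rank-based Gini estimator below is a common choice and an assumption on our part, not necessarily the exact wnd-charm formula.

```python
import numpy as np

def multiscale_histograms(img):
    """Concatenate pixel-amplitude histograms with 3, 5, 7 and 9 bins (24 features)."""
    return np.concatenate([np.histogram(img, bins=b)[0] for b in (3, 5, 7, 9)])

def gini(img):
    """Gini coefficient of the pixel intensities (shifted so all values are non-negative)."""
    x = np.sort(img.ravel() - img.min())
    n = x.size
    ranks = np.arange(1, n + 1)
    return (2.0 * np.sum(ranks * x) / (n * np.sum(x))) - (n + 1.0) / n

img = np.random.rand(64, 64)                       # toy map
print(multiscale_histograms(img).size, gini(img))  # 24 features and one scalar
```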
The second feature family comprises polynomial decompositions. The coefficients of an order-20 Chebyshev and an order-23 Chebyshev–Fourier (Orlov et al., 2006) transform are sorted into 32-bin histograms. Radon transforms are carried out along lines with inclination angles of 0, 45, 90 and 135 degrees with respect to the image horizontal (Radon, 1917) and ordered in 3-bin histograms. The Zernike coefficients are derived from a 2D Zernike decomposition of the image (Teague, 1980), and the first 72 of those coefficients contribute to the feature vector.
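For example, the Radon coefficients at the four inclination angles can be approximated with scikit-image; compressing each projection into a 3-bin histogram follows the description above, while the remaining details are assumptions.

```python
import numpy as np
from skimage.transform import radon

def radon_features(img, angles=(0, 45, 90, 135), bins=3):
    """Radon projections at fixed angles, each compressed into a short histogram
    (4 angles x 3 bins = 12 features, matching the count in table 10)."""
    sino = radon(img, theta=np.asarray(angles, dtype=float), circle=False)  # one column per angle
    feats = [np.histogram(sino[:, i], bins=bins)[0] for i in range(sino.shape[1])]
    return np.concatenate(feats)

print(radon_features(np.random.rand(64, 64)).shape)  # (12,)
```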
Textures are commonly used in image processing to describe spatial correlations of intensity values. We extract seven Gabor filters (e.g. Fogel & Sagi, 1989) using Gaussian harmonic functions and define their image occupation area as a feature. Tamura textures are described in detail in Tamura et al. (1978); wnd-charm uses contrast, directionality, coarseness sum and coarseness binned into a three-sample histogram. The 28 Haralick textures are specific properties of the grey-level dependence matrix of the image and are described in Haralick et al. (1973). The fractal analysis is based on a Brownian-motion model of the image following Wu et al. (1992), and wnd-charm uses the first 20 parameters of this analysis as features.
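As an example of this family, a few Haralick-style co-occurrence statistics can be computed with scikit-image (version >= 0.19 for the graycomatrix spelling); the quantisation to 64 grey levels and the choice of offsets are illustrative assumptions rather than the wnd-charm settings.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def haralick_like(img, levels=64):
    """A few co-occurrence-matrix texture properties in the spirit of Haralick et al. (1973)."""
    # Quantise the map to integer grey levels (assumed 64 levels for illustration).
    q = np.digitize(img, np.linspace(img.min(), img.max(), levels)) - 1
    q = np.clip(q, 0, levels - 1).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    return {p: graycoprops(glcm, p).mean()
            for p in ("contrast", "homogeneity", "energy", "correlation")}

print(haralick_like(np.random.rand(64, 64)))
```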
Object statistics are derived from the raw image data only. The starting point is an edge transform using a Prewitt filter; the mean, median, variance and an 8-bin histogram of both the image gradient and its directionality add up to 22 features, which are supplemented by the total number of edge pixels, their genus and the differences between the directionality bins. Otsu features and their inverse are calculated after the application of an Otsu threshold (Otsu, 1979). Finally, the algorithm calculates minimum, maximum, mean, median, variance and a 10-bin histogram of the area and the image-centre distance of all Otsu objects in the image.
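The Otsu-based object statistics can be approximated with scikit-image; the summary statistics below (area and distance to the image centre) follow the description above, while the remaining details are assumptions for illustration.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

def otsu_object_stats(img):
    """Area and image-centre distance statistics of Otsu-thresholded objects."""
    mask = img > threshold_otsu(img)
    regions = regionprops(label(mask))
    centre = np.array(img.shape) / 2.0
    areas = np.array([r.area for r in regions])
    dists = np.array([np.linalg.norm(np.array(r.centroid) - centre) for r in regions])
    stats = {}
    for name, vals in (("area", areas), ("centre_distance", dists)):
        stats[name] = dict(min=vals.min(), max=vals.max(), mean=vals.mean(),
                           median=np.median(vals), var=vals.var())
    return stats

print(otsu_object_stats(np.random.rand(64, 64)))
```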
Family | Class | Features | Input | Reference
---|---|---|---|---
Pixel statistics | Combined moments | 48 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | –
Pixel statistics | Gini coefficient | 1 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Abraham et al. (2003)
Pixel statistics | Multiscale histograms | 24 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | –
Pixel statistics | Pixel intensity statistics | 5 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | –
Polynomial decomposition | Chebyshev coefficients | 32 | raw, F, W, C, F(W), E, F(E), W(E) | –
Polynomial decomposition | Chebyshev–Fourier coefficients | 32 | raw, F, W, C, F(W), E, F(E), W(E) | Orlov et al. (2006)
Polynomial decomposition | Radon coefficients | 12 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Radon (1917)
Polynomial decomposition | Zernike coefficients | 72 | raw, F, W, C, F(W), E, F(E), W(E) | Teague (1980)
Textures | Fractal analysis | 20 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Wu et al. (1992)
Textures | Gabor | 7 | raw | Fogel & Sagi (1989)
Textures | Haralick | 28 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Haralick et al. (1973)
Textures | Tamura | 6 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Tamura et al. (1978)
Objects | Edge features | 28 | raw | Prewitt (1970)
Objects | Otsu object features | 34 | raw | Otsu (1979)
Objects | Inverse Otsu object features | 34 | raw | Otsu (1979)
Appendix B Deep neural networks
In this appendix we collect some more detailed information about deep neural networks. The first section formally defines all network layers used in this work, and the second section deals with activation functions. The third section provides a thorough description of the architecture of the convolutional neural network (CNN) that we use in our analyses.
B.1 Layers
Given a 2D input, a convolution layer applies the operation defined in Eqs. (19)–(22) to produce an output feature map.
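The explicit expressions are not reproduced above; the following LaTeX block sketches the standard definitions of a 2D convolution layer and, for completeness, max pooling. It is an illustrative stand-in (an assumption about what Eqs. 19–22 define), not a verbatim reproduction.

```latex
% Requires amsmath. Sketch of a standard 2D convolution layer acting on an
% input x with C channels, using K filters of size (2a+1) x (2b+1), bias b_k
% and activation f, followed by max pooling with stride s.
\begin{align}
  y_{i,j,k} &= f\!\left( b_k + \sum_{c=1}^{C}\sum_{u=-a}^{a}\sum_{v=-b}^{b}
      w_{u,v,c,k}\, x_{i+u,\,j+v,\,c} \right), && \text{(convolution)} \\
  y_{i,j,k} &= \max_{0 \le u,v < s}\; x_{si+u,\, sj+v,\, k}, && \text{(max pooling)}
\end{align}
```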