1 Introduction
The standard CDM cosmological model – based on a cosmological constant as the source of the observed accelerated cosmic expansion (Riess et al., 1998; Perlmutter et al., 1999; Schmidt et al., 1998) and on cold dark matter particles as the bulk of the clustering mass in the universe (White & Rees, 1978; White, 1993, 1996; Springel et al., 2005) – has survived the past two decades of cosmological observations targeted to a wide range of independent probes. This includes the statistical properties of Cosmic Microwave Background (CMB) anisotropies (Bennett et al., 2013; Planck Collaboration et al., 2018), the largescale distribution and dynamics of visible galaxies (Parkinson et al., 2012; Alam et al., 2017; Pezzotta et al., 2017), weak gravitational lensing signals (Fu et al., 2008; Joudaki et al., 2017; Hildebrandt et al., 2017; Troxel et al., 2018; Hikage et al., 2018), the abundance of galaxy clusters, as well as its time evolution (Vikhlinin et al., 2009; Planck Collaboration et al., 2016b).
Despite this astonishing success, the fundamental nature of the two main ingredients of the CDM model – summing up to about of the total energy density of the universe – remains unknown. On one side, the energy scale associated with the cosmological constant does not find any reasonable explanation in the context of fundamental physics, with predictions based on the standard model of particle physics failing by tens of orders of magnitude. On the other hand, no clear detection – direct or indirect – of any new fundamental particle that may be associated with cold dark matter has been made despite a longstanding chase through astrophysical observations (Aartsen et al., 2013; Ackermann et al., 2017; Albert et al., 2017) and laboratory experiments (see e.g. Bernabei et al., 2018; ATLAS Collaboration, 2013; CMS Collaboration, 2016).
This leaves the next generation of cosmological observations with the arduous challenge of clarifying the fundamental nature of the dark sector by systematically scrutinising the huge wealth of highquality data that will be made available in the near future by several widefield surveys (such as Laureijs et al., 2011; Ivezic et al., 2008; Spergel et al., 2015; Benitez et al., 2014). As a matter of fact, any possible insights from future datasets must come in the form of very small deviations from the expectations of the CDM model, otherwise past observations would have already detected them. This suggests that either the fundamental physics behind dark energy and dark matter is indeed extremely close to that of General Relativity (GR) with a cosmological constant and heavy fundamental particles with negligible thermal velocities, respectively, or that a more radical shift from this standard paradigm is hidden and masked by other effects such as an observational degeneracy with some not yet fully constrained cosmological parameter. The latter scenario may result in a severe limitation of the discriminating power of observations, thereby providing a particularly challenging testbed for the next generation of cosmological surveys.
A typical example of such a possible intriguing situation is given by the wellknown degeneracy between some Modified Gravity (MG, see e.g. Amendola et al., 2018, for a recent review on a wide range of MG scenarios) theories and the yet unknown value of the neutrino mass. It is now generally accepted (Baldi et al., 2014; He, 2013; Motohashi et al., 2013; Wright et al., 2017) that MG theories such as gravity (Buchdahl, 1970) are strongly observationally degenerate with the effects of massive neutrinos on structure formation (see Baldi et al., 2014). Some commonly adopted statistics such as the matter autopower spectrum (Giocoli et al., 2018a), the lensing convergence power spectrum (Peel et al., 2018), and the halo mass function (Hagstotz et al., 2018) may hardly distinguish standard expectations from some specific combinations of the gravity parameters and the total neutrino mass.
As such degeneracies extend down to the nonlinear regime of structure formation, the use of full numerical simulations currently represents the only viable method to explore these scenarios, even though alternative approaches based on approximate methods (see e.g. Wright et al., 2017) have been developed in the last years and are being tested and calibrated against simulations. In the present work, we will explore the prospects of using machine learning techniques applied to numerical simulations of both MG and cosmologies that are highly observationally degenerate through standard observational statistics.
Several variants of higherorder statistics have been applied in the past to characterise cosmological data sensitive to the latetime evolution of structure in the Universe. Recent analyses of the weak lensing (WL) (Bartelmann & Schneider, 2001) data from CFHTLens (Heymans et al., 2012)
used either higherorder (>2) moments of the convergence field
(Van Waerbeke et al., 2013), or Minkowski functionals (Petri et al., 2015) to draw cosmological inference from a data description that goes beyond twopoint statistics. Martinet et al. (2018) and Shan et al. (2018) applied peak count statistics (Dietrich & Hartlap, 2010; Kratochvil et al., 2010) to shear and convergence fields from KiDS (Hildebrandt et al., 2017) and Gruen et al. (2018) used countsincells (Friedrich et al., 2018) to extract information from the DES (Abbott et al., 2018)catalogues. A new set of techniques based on deep learning
(LeCun et al., 2015) currently has gained momentum in many scientific fields, including astrophysics. The extremely complex models which can be constructed through a modular buildingblock concept (e.g. Chollet, 2017) have been very successful for tasks like language translation (e.g. Wu et al., 2016; Johnson et al., 2016), text and handwriting recognition (e.g. Graves, 2013), as well as for the classification of images (starting with the seminal work of Krizhevsky et al., 2012). In cosmology, deep learning is used for the extraction of information from Nbody simulations (Ravanbakhsh et al., 2017), to learn the connection between initial conditions and the final shape of structure (LucieSmith et al., 2018), for the characterisation of point spread functions (Herbel et al., 2018) or the measurement of shear for WL (Springer et al., 2018), the characterisation of nonGaussian structure in mass maps (Gupta et al., 2018), the determination of galaxy cluster Xray masses (Ntampaka et al., 2018) and the fast creation of simulated data using generative adversarial networks (Rodriguez et al., 2018). In this work, we will use such techniques to break the degeneracies between models of modified gravity in the presence of massive neutrinos.The text is organised as follows: Section 2 gives an overview of the numerical simulations and the creation of the mass maps that constitute our main data set. In section 3 we introduce the different characterisation and classification techniques that we apply to the mass map data and show the results that they produce in section 4. We present our conclusions in section 5
. Two appendices provide more details on certain technical aspects of the computer vision (appendix
A) and deep neural network (appendix B) methods we are using.2 Numerical simulations
We perform our analysis on a set of WL maps extracted from a suite of cosmological dark matteronly simulations called the DUSTGRAINpathfinder runs. These simulations represent a preliminary calibration sample for the DUSTGRAIN (Dark Universe Simulations to Test GRAvity In the presence of Neutrinos) project, an ongoing numerical effort aimed at investigating cosmological models characterised by a modification of the laws of gravity from their standard GR form and by a nonnegligible fraction of the cosmic matter density being made of standard massive neutrinos.
2.1 Dustgrainpathfinder
The modification of gravity considered in the DUSTGRAIN project consists in an model defined by the Action (Buchdahl, 1970)
(1) 
We assume a specific analytical form for the function (Hu & Sawicki, 2007)
(2) 
where is the Ricci scalar curvature, is a mass scale, while are free parameters of the model. Such a form is particularly popular and widely investigated as it allows one to recover with arbitrary precision a CDM background expansion history by choosing . Here and are the vacuum and matter energy density, respectively, under the condition , so that the scalar field takes the approximate form
(3) 
By restricting to the case the only remaining free parameter of the model can be written as
(4) 
and its absolute value will quantify how much the model departs from GR.
The DUSTGRAINpathfinder simulations have been devised to sample the parameter space and to identify highly degenerate combinations of parameters. Some analyses of the corresponding WL signal have been presented by Giocoli et al. (2018a) and Peel et al. (2018), while Hagstotz et al. (2018) have used the simulations to calibrate a theoretical modelling of the halo mass function in gravity with and without the contribution of massive neutrinos. In this further paper, we will use machine learning techniques to tackle the issue of observational degeneracy in these combined models based on the WL reconstruction described in Giocoli et al. (2018a). A similar approach, focused on a subset of particularly degenerate models is presented in Peel et al. (2018b, PRL submitted).
From a technical point of view, the DUSTGRAINpathfinder runs are cosmological collisionless simulations including Dark Matter particles of mass M (for the case of ) and as many neutrino particles (for the case of ) in a cosmological volume with periodic boundary conditions evolving under the effect of a gravitational interaction defined by equation 1. The simulations have been performed with the MGGADGET code (see Puchwein et al., 2013), a modified version of the GADGET code (Springel, 2005) that implements all the modifications that characterise gravity (see Puchwein et al., 2013, for more details on the algorithm). MGGADGET has been extensively tested (see e.g. the Modified Gravity code comparison project described in Winther et al., 2015) and employed recently for a wide variety of applications (Baldi & VillaescusaNavarro, 2018; Arnold et al., 2018; Arnold et al., 2014, 2015; Roncarelli et al., 2018; Arnold et al., 2016; Naik et al., 2018). For the DUSTGRAINpathfinder simulations, as was already done in Baldi et al. (2014), we have combined the MGGADGET solver with the particlebased implementation of massive neutrinos developed by Viel et al. (2010). This allowed us to include massive neutrinos in the simulations as an independent family of particles with its own initial transfer function and velocity distribution. Initial conditions have been generated following the approach of e.g. Zennaro et al. (2017) and VillaescusaNavarro et al. (2017) at the starting redshift of the simulation with thermal neutrino velocities added on top of the gravitational velocities by random sampling the neutrino momentum distribution at the initial redshift.
Standard cosmological parameters are set to be consistent with the Planck constraints (Planck Collaboration et al., 2016a). Concerning nonstandard parameters, the DUSTGRAINpathfinder simulations spanned the range for the scalar amplitude and for the neutrino mass, for a total of simulations. In the present work, we will consider a subset of the full DUSTGRAINpathfinder suite consisting of nine simulations whose specifications are summarised in table 1.
Simulation Name  Gravity type  [eV]  [M]  [M]  

CDM  GR  –  0  0.31345  0  0  
0  0.31345  0  0  
0  0.31345  0  0  
0  0.31345  0  0  
0.3  0.30630  0.00715  
0.15  0.30987  0.00358  
0.1  0.31107  0.00238  
0.1  0.31107  0.00238  
0.06  0.31202  0.00143 
2.2 Lensing lightcones
For all simulations we stored snapshots at different redshifts that allow us to construct lensing lightcones up to a source redshift without gaps. Different methods have been developed to produce lensing lightcones from large cosmological Nbody simulations. Recent works have employed postprocessing reconstructions based on the slicing of a set of comoving particle snapshots (as e.g. in Hilbert et al., 2008, 2009; Giocoli et al., 2016; Shirasaki et al., 2017), as well as onthefly algorithms capable of storing only the projected matter density on a given fieldofview without resorting on the flatsky approximation (see e.g. Barreira et al., 2016; Arnold et al., 2018). In this work we use the MapSim routine (Giocoli et al., 2014; Tessore et al., 2015; Castro et al., 2018) which is based on the former strategy. We use the particles stored in different snapshots to construct our continuous pastlightcones up to , building lens planes of the projected matter density distribution, considering a square sky coverage of five degrees on a side. For each cosmological model we construct different lightcone realisations by randomising the various comoving cosmological boxes (Giocoli et al., 2018a; Peel et al., 2018).
2.3 Convergence maps
The MapSim pipeline is composed of two algorithms. The first one – called iMapSim – constructs lensing planes from the different simulation snapshots, saving for each plane and on each pixel with coordinate indices the particle surface mass density
(5) 
represents the comoving pixel area of the lens plane and the sum over all particle masses associated with the given pixel. The second algorithm named rayMapSim projects the matter density distribution along the lineofsight by weighing the lens planes with the lensing kernel in the Born approximation regime (Bartelmann & Schneider, 2001; Schäfer et al., 2012; Petri, 2016; Giocoli et al., 2016; Petri et al., 2017; Giocoli et al., 2017; Giocoli et al., 2018b; Castro et al., 2018). From we can derive the convergence at a given source redshift as
(6) 
where varies over the different lens planes with the lens redshift smaller than and represents the critical surface density at the lens plane for sources at redshift
(7) 
Here is the speed of light, is Newton’s constant and , and are the angular diameter distances between observerlens, observersource and sourcelens, respectively. The final maps cover the 25 square degree fieldofview with pixels, resulting in a map resolution of arcseconds.
3 Methodology
A variety of machine learning techniques is applied to the DUSTGRAINpathfinder convergence maps. It was shown by Peel et al. (2018) that summary statistics up to secondorder do not reliably separate such mass maps. Higherorder statistics, especially peak counts (e.g. Peel et al., 2017; Shan et al., 2018; Martinet et al., 2018; Lin & Kilbinger, 2018), do a better job but still leave room for improvement when distinguishing between a large number of models and in the presence of noise. Most commonly used methods to characterise observational data are naturally based on physical models. In the following we present an agnostic approach, which also applies techniques and algorithms found in the fields of computer science and specifically digital image processing.
We distinguish two subsequent steps in the process of mass map classification. The first is to find a feature extraction function
, which takes a highdimensional data vector
as input and finds a general, dimensional reduced representation of it in the form of a feature vector(8) 
The feature extraction function can have several parameters which are stored in the feature weight vector . In order to arrange the data vector in a meaningful way, we introduce an index notation . The first two indices reflect a spatial ordering of the 2D data along the coordinate axes. This means that all elements with are located in the first row of the pixelised image and all elements with e.g. are located in the tenth column of the image. This notation also includes 1D data, ordered or not, by setting . The third index –commonly dubbed as a channel– allows us to collect multiple aspects of the same entity represented by . For the example of an RGBimage, would be the red channel of the image, the green and the blue channel. Finally, we define the shape of a data vector with a bracket notation. The shape of our input convergence maps is since we have pixel maps with convergence values at six different source redshift channels and where in the above we have introduced the shape operator which returns the shape of a data vector.
The second step classifies into a set of target classes. The classification function , which can again depend on a set of parameters , should not only output a single class prediction, but rather a prediction vector of shape
with probabilities to belong to each of
target classes(9) 
It must hold that , and in our case .
In the following we explore different choices for the feature extraction and classification functions and find ways to optimise their parameters to achieve an optimal classification. We do so with the help of training sets, which are data vector–label pairs , meaning mass maps for which we apriori know the underlying cosmological model. Specifically, the label is an indicator function for the class with elements for which
(10) 
3.1 Definition of data sets
Our full data set consists of 256 convergence maps of shape (2048,2048,6) for each of the nine cosmological models. We split each map further into 64 smaller patches to define our main data vectors with . 75% of those maps (12289) are used as a training set in order to optimise the parameters of our models. We use 15% of the maps (2457) as a validation set where the correct labels are known to us, but not to the optimisation algorithm. Performing a classification on the validation data serves as quality control and helps us to decide if an optimisation is successful and when to stop it. Another 10% of the maps (1638) are used as a test set, where the labels are not known to us apriori and to which our trained and validated algorithms are applied to blindly. The success rates on those test sets will be the main result of this work.
3.2 Mass map feature extraction
Two important subclasses for are possible. In the first, the parameters are free and can be optimised during a training phase. In the second, they are fixed. We want to highlight that we do not perform any initial transformations of the data, which have proven to be useful for the analysis of lensing mass maps. It was shown in e.g. Peel et al. (2018) that an aperture mass transformation (Schneider, 1996; Schneider et al., 1998) can largely improve the discrimination power of certain statistics, but we want to stay as general and agnostic as possible at this stage and use the raw pixel data of the convergence maps as the initial data vector.
3.2.1 Standard mass map descriptors
Examples of fixed feature extraction are the mass map descriptors which are commonly used in the cosmological community to describe convergence or shear catalogues. For the purposes of this work, such descriptors serve as the reference for other techniques that we apply. We combine a number of mass map features, which we extract with the LensTools package^{1}^{1}1https://github.com/apetri/LensTools by Petri (2016)
into a feature vector of shape (1,1,99). The first four entries in this vector are the mean, variance, skewness and kurtosis of the convergence maps. This is followed by eleven percentiles between the 0thpercentile (the minimum) and the 100thpercentile (the maximum) in steps of ten percent. The normalised histogram (PDF) of the convergence values in each map is sorted with 14 bins and the value for each bin is appended to the feature vector. Next, we calculate the power spectrum in 14 logarithmically spaced bins between
and, which cover the angular size and resolution of our mass maps. Finally, we use the standard deviation of each map to define 14 signaltonoise bins between 2 and 5. For each such bin we calculate the peak counts, as well as the first three Minkowski functionals
(e.g. Kratochvil et al., 2012; Petri et al., 2015, and references therein), which concludes our collection of 99 features.3.2.2 Classical computer vision
We know from Peel et al. (2018) that at least some of the standard descriptors above are not optimally suited for the task at hand and it is, at this point, not entirely obvious how to define better ones. This is why we now aim to derive as many fixed features as possible. The publicly available wndcharm algorithm (Shamir et al., 2008; Orlov et al., 2008; Shamir et al., 2010) was designed for the classification of microscopy images and derives a particularly large feature vector of shape (1,1,2919). This includes most of the common statistics and descriptors known to digital image processing. Many features are thereby not only calculated from the raw image, but from some of its alternative representations like the Fourier, Wavelet, Chebyshev or Edge transformation. Moreover, some features are also extracted from transformations of transformations. While we did state earlier that we do not want to vet our data with transformations, we want to point out that the listed transformations are by no means inspired by the mechanisms of lensing or structure formation. We refer the interested reader to Orlov et al. (2008) for the full description of the algorithm and the description of the full feature vector, but we do provide a short summary in appendix A and a compact overview in table 10.
3.2.3 Convolutional neural networks
As the class of feature extraction functions which are able to change their shape during the training process we chose multilayered neural networks (LeCun et al., 2015; Goodfellow et al., 2016). The input data vector is manipulated and eventually reduced in dimension by a long –deep– chain of simple layers , which implement a specific mathematical operation. The output of one layer, becomes the input of the following layer and contains its own set of parameters . The set of all layer parameters becomes the feature parameter vector .
(11)  
(12)  
(13) 
Deep neural networks source their performance from the sheer number of layers they are comprised of and have gained much popularity in recent years. This is mainly due to the advancements in numerical performance by e.g. exploiting manycore architectures^{2}^{2}2General Purpose Graphics Processing Units (GPGPU) are a popular example of a manycore architecture.. This allows for the construction of particularly deep and complex networks with hundreds of millions of parameters. The functional forms of the layers that are used in a deep neural network depend on the problem at hand. For image classification, convolutional neural networks (CNN) have proven to be particularly useful (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2014; He et al., 2015; Lin et al., 2017) and hence we chose this class of models for our purposes. The main functionality of a CNN is provided by a convolutional layer which applies a number of convolutions with kernel size to a 2D input vector with
. The stride parameters
and allow one to implement dimensional reduction and the parametercontrols if the input data is padded (
) or unaltered (). We provide a thorough mathematical definition of all deep neural network layers used in this work, including the convolutional one, in appendix B.1.Convolution layers are often followed by pooling layers for dimensional reduction. We implement average pooling layers which average entries of the 2D data vector within a window of size , apply a stride defined by and and follow the same padding scheme that was introduced earlier. Maximum pooling layers work in a similar manner but instead of the average they return the maximum within a given window. Both pooling layers exist also as global versions, indicated by GlobalMaxPool and GlobalAvgPool, where all entries per channel are considered for either the maximum or averaging operation.
Up to this point, we only allowed for layers to be placed strictly sequential. In order to implement a horizontal layout, we connect several layers to the same input and combine their results with the help of a concatenation layer . This concept of performing not only one operation at a given depth of the network but several has proven very successful for image classification as e.g. shown in Szegedy et al. (2014)
, who dubbed such horizontal layers as Inception modules.
The output of a layer can be followed by a nonlinear activation function. For convolution and pooling layers we mainly deploy rectangular linear units (ReLU) and we give the full detail about the activation functions used in this work in appendix
B.2. To avoid the network from overfitting, socalled dropout layers are introduced as a regularisation. In there, a given percentage of the elements of an input vector are chosen at random and are subsequently discarded from the output (Srivastava et al., 2014). Finally, to compensate for fluctuations in the amplitudes of input vectors at different network depth, Ioffe & Szegedy (2015) introduced the concept of batch normalisation which we also use after each convolutional layer. The output of the last layer in the CNN, the feature vector , is used for classification in a final section of the network which is commonly referred to as top. The concrete architecture of the CNN that we use in this work is provided in section 4.3 and appendix B.3.3.3 Featurebased classification
We now turn our attention to the classification function . We investigate two different approaches to classification. The first one is a nearestneighboursearch scheme based on distances in feature space. The other approach, based again on a class of neural networks, uses regression through a training set to find the optimal mapping between features and labels.
3.3.1 Feature space distances
In the following we denote with all those feature vectors that belong to a sample from the training set and with the subset which belongs only to class of the training set. We calculate a Fisher discriminant (e.g. Bishop, 2006) to find suited classification weights for each individual feature .
(14) 
Here is the total number of classes and is the variance of the feature within class .
Once we found the weights we can define a weighted nearestneighbour distance (WNN) of any feature vector to all the classes in the training set.
(15) 
where is the length of the feature vector. The problem with this WNN distance is the fact that it is based only on a single element in the training set, the one that minimises the sum in equation 15. To remedy this, Orlov et al. (2006) introduced a weighted neighbour distance (WND), which takes into account the distance to all elements in the training set, but largely penalises large distances through the free parameter
(16) 
Orlov et al. (2006) found that the results do not strongly depend on once and that is a generally good, numerically stable, choice. The final step in order to make predictions is to define a similarity using a distance of choice, e.g. WNN or WND, and by normalising appropriately
(17) 
3.3.2 Fully connected neural networks
A different approach to the classification task is another form of neural network (equation 11
). The main layer in such a neural network is a fully connected –sometimes called affine–layer FC(n), which implements a linear mapping between the input vector of length
and the output vector of length using a matrix of free parameters and an additional bias parameter (see appendix B.1).Such layers are again chained together and the last layer produces an output vector of the same length as the number of classes . As before, in between those layers one may use dropout, activation and normalisation layers. The top of the network is followed by a specific activation function called a softmax (see appendix B.2) which provides the desired predictions .
Since the optimal weights
are found by a regression, we define a loss function
, which in the case of this classification problem is a categorical cross entropy(18) 
are the labels for the elements in the training data and their class predictions given a current set of parameters
. In order to minimise the loss, while continuously looping over the training data, we use a specific implementation of stochastic gradient descent called
ADAM (Kingma & Ba, 2014). The gradients of our model are thereby calculated via a backpropagation algorithm (Rumelhart et al., 1986). We end this description of our methodology by noting that for a full feature extraction and classification chain , with a CNN as and a neural network as classifier , the classification and feature extraction weights can be optimised at the same time.3.4 Numerical setup
As mentioned earlier we use the Python package LensTools^{3}^{3}3https://github.com/apetri/LensTools (Petri, 2016) for the extraction of the standard map descriptors from section 3.2.1. For the computer vision fixed features from section 3.2.2 we slightly adapted the publicly available version of wndcharm^{4}^{4}4https://github.com/wndcharm/wndcharm. We altered the C++ version of the feature extraction algorithm to accept FITS files (Hanisch et al., 2001) as an input image container with pixel values as double precision floatingpoint numbers. We then use the feature output files of wndcharm as an input for our own distancebased classification pipeline written in Python. We make these routines publicly available in this repository^{5}^{5}5https://bitbucket.org/jmerten82/mydnn. All deep learning elements of our analysis stack use the widely used tensorflow^{6}^{6}6https://www.tensorflow.org/ framework, which uses NVIDIA’s cuDNN (Chetlur et al., 2014)
library to carry out tensor operations on GPUs. We pair a
tensorflow backend with the highlevel deep learning Python interface keras^{7}^{7}7https://keras.io/ as a frontend. The network training was carried out on two NVIDIA Titan Xp GPUs. All convergence maps and Jupyter^{8}^{8}8http://jupyter.org/ notebooks used to produce the results in this work are either linked to or publicly available in the aforementioned repository. In there, we refer the reader to the ’reproducible_science’ folder.4 Results
Section 3 introduced a number of methods to perform mass map characterisation and classification. We now present the results obtained by applying those methods and provide details on their training process with the help of the validation sets. For the most successful method we investigate the dependence of the results on the convergence map source redshift and we end this section with a closer look at the most relevant features which are extracted by the different methods. If not stated otherwise, the results in this section are based on training, validation and test set maps at a source redshift .
4.1 Classification based on feature distance
For the case of the distancebased classifier from section 3.3.1, the training process is just the derivation of the Fisher weights shown in equation 14. We calculate them using the training set and present the 20 topranked features for the classical descriptors in table 2 and for the wndcharm features in table 3. For the first case, we see quite a mix of features in the top, with the power spectrum and peak counts being the most important ones. For the wndcharm features however, the ranking is completely dominated by Zernike coefficients on transformations of the image, with a few contributions of Haralick textures. One should keep in mind though that we extract a total of 2919 features, out of which 51 have a weight , 868 have a weight and only 193 features have a vanishing weight. It is the combination of all the nonzero weights which will lead to the distancebased classification later on.
Rank  Name  Index  Weight 

1  Power spectrum  11  0.106 
2  "  10  0.104 
3  "  9  0.092 
4  "  12  0.084 
5  Peak counts  13  0.083 
6  Power spectrum  8  0.078 
7  Peak counts  12  0.078 
8  Power spectrum  7  0.065 
9  Peak counts  5  0.064 
10  Skewness  –  0.059 
11  Peak counts  14  0.055 
12  Power spectrum  6  0.051 
13  Percentile  100  0.051 
14  Minkowski functional 1  14  0.049 
15  Percentile  0  0.049 
16  Power spectrum  13  0.046 
17  Minkowski functional 2  14  0.045 
18  Peak counts  11  0.044 
19  Power spectrum  5  0.042 
20  Kurtosis  –  0.042 
Rank  Transform  Name  Index  Weight 

1  F  Zernike coefficients  20  0.285 
2  F  "  42  0.270 
3  F(W)  "  50  0.255 
4  F(E)  "  52  0.242 
5  F(W)  "  21  0.236 
6  F(E)  "  39  0.214 
7  F  "  12  0.205 
8  F(W)  "  22  0.204 
9  F(E)  "  37  0.196 
10  F(W)  "  56  0.183 
11  F(E)  "  5  0.174 
12  F(E)  "  28  0.170 
13  W  Haralick textures  5  0.169 
14  F  Zernike coefficients  17  0.166 
15  F(E)  "  34  0.166 
16  F(E)  Haralick textures  0  0.164 
17  F(E)  "  14  0.161 
18  F(E)  Zernike coefficients  24  0.159 
19  F  "  60  0.154 
20  –  Edge features  0  0.152 
For the 99 standard features we find a total classification success rate of 22%, meaning that out of 14742 samples in the test set, only 3243 were classified correctly. For some specific classes the classification success rate is barely above the success rate for a random guess (11%). The important class for example shows a success rate of 13%. The picture improves marginally when using the 2919 wndcharm features instead. The total classification success over all classes rises mildly to 25%. While especially the three models still show success rates around or even below 11%, at least some classes, including , are now significantly above the 20% level. We do not show more details^{9}^{9}9A full success rate analysis is provided in the repsository (https://bitbucket.org/jmerten82/mydnn) associated with this article. on the distancebased classification since it is clear from those results already that this classification method does not qualify for a successful discrimination of our models.
4.2 Classification based on neural network
We now use the same set of fixed features but feed them into a fullyconnected neural network for classification. For the case of the 99 standard features we show the very simple topology of the classification network in table 4. The same network is used to classify the 2919 wndcharm
features but due the larger input vector, the number of free parameters is larger, which we indicate by a square bracket notation in the same table. The regression to find the optimal parameters of the main fullyconnected layers is based on the training set. In total we train with 110601 feature vectors of shape (1,1,99) or (1,1,2919) and where one iteration over all those elements during the regression is commonly called an epoch. Gradient evaluations and corresponding changes to the network parameters are made after a subset of an epoch, usually called a batch. The batch size in this case was set to 128. After each epoch, we evaluate the current performance of the network with the 22113 (2457 per class) feature vectors in the validation set. Figure
1 shows for both feature sets the evolution of the loss function for the training and validation data as a function of training epoch. For the larger feature vector, the validation loss starts to saturate around epoch 70 while the training loss keeps declining. This indicates that the networks starts to overfit, meaning that it learns trainingset specific features which are of no use to characterise the validation set or any data unknown to the model. This is where we stop the training and save the model parameters which produced the smallest validation loss.Index  Layer  free parameters  Output shape 

1  Input  0  (1,1,99[2919]) 
2  FC(32)  3200 [93440]  (1,1,32) 
3  leakyReLU(0.03)  0  (1,1,32) 
4  FC(9)  +297  (1,1,9) 
5  Softmax  0  (1,1,9) 
Ouput  =3497 [93737]  9 
The neural network classification yields significantly better results compared to the classification based on featurespace distances. In the case of the 99 standard features the total classification rate rises to 39% and to 35% in the case of the wndcharm features. Most interestingly, the smaller vector of 99 classical features produces better results than the much larger feature vector provided by wndcharm. Some of the most discriminative features from the computer vision method shown in table 3 are certainly describing the data well and should be used in future analyses; however, once the information from all standard descriptors such as the binned power spectrum, peak counts and Minkowski functionals are combined in an optimal way by a neural network, there is no advantage in using features that are inspired by computer vision only. Table 5
shows the classification success matrix for the standard features, where each row refers to a subset of the test data comprising only maps from that true class labelled by the first column. The first number in each block of four shows how many times the 1638 members of this subset have been sorted into the respective predicted class which is indicated by the label in the very first row. The second number is the percentage of predictions with respect to the total number of maps in the class. The third and fourth numbers are the mean and its standard error on the prediction probability for all maps in the subset given by the row and for the class predictions indicated by the label of the column. For an optimal classification, only the diagonal of this matrix (those fields typeset in boldface) would show nonzero values.
CDM  
958  80  157  103  97  137  26  24  56  
58%  5%  10%  6%  6%  8%  2%  1%  3%  
70  1135  1  72  10  18  31  116  185  
4%  69%  0%  4%  1%  1%  2%  7%  11%  
232  6  857  158  275  84  10  5  11  
14%  0%  52%  10%  17%  5%  1%  0%  1%  
149  90  173  651  254  120  38  52  111  
9%  5%  11%  40%  16%  7%  2%  3%  7%  
223  22  485  333  413  98  18  14  32  
14%  1%  30%  20%  25%  6%  1%  1%  2%  
193  103  44  181  65  512  126  171  243  
12%  6%  3%  11%  4%  31%  8%  10%  15%  
128  194  30  163  39  377  147  257  303  
8%  12%  2%  10%  2%  23%  9%  16%  18%  
71  262  18  164  39  279  109  345  351  
4%  16%  1%  10%  2%  17%  7%  21%  21%  
CDM  69  265  5  135  9  158  60  166  771 
4%  16%  0%  8%  1%  10%  4%  10%  47%  
While table 5 gives a good indication of what to expect from a classification of single maps, only the mean and its standard error on the class predictions give an idea on how well the full ensemble of test set maps is classified. We therefore further evaluate the statistics of the prediction vectors for each true test set class. Figure 2 shows nine panels of box plots, each of which represents the statistics for one such subset. The black box in each panel represents the correct predictions, equivalent to the bold diagonal of table 5. The horizontal line spanning each panel is the median for all the true class predictions and the error band shows the scatter of medians derived from 1000^{10}^{10}10This number is of course arbitrary but is close to the sample size and we also checked that the bootstrapderived error does not depend significantly on the number of bootstraps.
bootstrap samples. The upper and lower end of each box show the 75th and 25th percentiles, respectively, and the whiskers show the outlier cleaned minimum and maximum value of the class predictions. Whenever a box which is not the true label is shown in green, it means that the median and its errors, indicated by the notches of each box, is lower than the one of the correct prediction box and does not overlap with its horizontal error band. If those criteria are not met, the respective box is shown in red.
When looking at the results in table 5 and figure 2, the following observations catch the eye. Although the overall classification success rate is only 39%, none of the classes is classified incorrectly as an ensemble. In the case of the two models we see a clear separation between the correct predictions from the other classes. This is confirmed by the classification matrix, which shows no substantial overlap () with any other model. This however changes for the three and the three models. Although the median for the correct predictions is the highest for all of the models^{11}^{11}11The mean is not in the case of , the degeneracies within the same model of gravity are strong in those cases as one can see from the basically equal heights of the centres of the boxes in figure 2 and from the classification matrix, which lists a large number of misclassifications up to 30% in the case of misclassified as . A lot more severe is the case of the three models. For them we find substantial overlap of up to 21% with . Even the predictions for itself are not completely separate from the three models and with an overlap of up to 16%.
4.3 Convolutional neural network
The convolutional neural network (CNN) extracts the characterising features directly from the pixel data of the training mass maps. We have experimented with a number of architectures, including classic topologies which implement a large number of convolutions inspired by VGGnet (Simonyan & Zisserman, 2014), as well as architectures presented in Ravanbakhsh et al. (2017) and Gupta et al. (2018). The model that worked best for our purposes is almost exclusively based on the Inception layers first presented in Szegedy et al. (2014). Here we adopt one of its latest iterations, version 4 introduced in Szegedy et al. (2016). The global linear structure of our CNN is shown in table 6 and we describe in detail the different elements of this network and their purpose in appendix B.3.
Layer  free parameters  Output shape  

1  Input  0  (256,256,1) 
2  Conv(3,3,2,2,v,32)  288  (127,127,32) 
3  lReLU(0.03)  0  (127,127,32) 
4  Conv(3,3,1,1,v,32)  +9216  (125,125,32) 
5  lReLU(0.03)  0  (125,125,32) 
6  Conv(3,3,1,1,s,64)  +9216  (125,125,32) 
7  StemInception (Fig. 9)  +555008  (29,29,384) 
8  InceptionA (Fig. 10)  +316416  (29,29,384) 
9  ReductionA (Fig. 11)  +2304000  (14,14,1024) 
10  InceptionB (Fig. 12)  +2931712  (14,14,1024) 
11  ReductionB (Fig. 13)  +2744320  (6,6,1536) 
12  InceptionC (Fig. 14)  +4546560  (6,6,1536) 
13  GlobalAvgPool  0  (1,1,1536) 
14  Dropout(0.33)  0  (1,1,1536) 
15  FC(9)  +13833  (1,1,9) 
16  Softmax  0  (1,1,9) 
17  Ouput  =13469865  9 
We visualise the evolution of the network’s loss during training in figure 3.
The total classification success rate of the CNN is 52% and its classification success matrix is shown in table 7.
CDM  
1307  116  36  21  31  88  12  5  22  
80%  7%  2%  1%  2%  5%  1%  0%  1%  
51  1298  0  11  0  7  8  27  236  
3%  79%  0%  1%  0%  0%  0%  2%  14%  
105  1  1065  90  320  48  2  1  6  
6%  0%  65%  5%  20%  3%  0%  0%  0%  
103  37  161  721  347  78  14  31  146  
6%  2%  10%  44%  21%  5%  1%  2%  9%  
122  5  624  271  514  70  5  7  20  
7%  0%  38%  17%  31%  4%  0%  0%  1%  
51  27  11  42  44  968  74  307  114  
3%  2%  1%  3%  3%  59%  5%  19%  7%  
36  46  11  40  35  713  95  458  204  
2%  3%  1%  2%  2%  44%  6%  28%  12%  
20  79  5  35  17  558  64  565  295  
1%  5%  0%  2%  1%  34%  4%  34%  18%  
CDM  41  179  0  43  3  99  43  144  1086 
3%  11%  0%  3%  0%  6%  3%  9%  66%  
Compared to the fixed feature results in table 5 we find much larger true prediction values for many models. Exceptions are and . Figure 4 shows the statistics of the predictions for all classes in the test set and reveals that the and models, even as an ensemble, cannot be classified correctly by the CNN since the error bars on the medians of the predictions in their samples overlap with other models.
However, the degeneracy with is now broken for all models and the CNN robustly discriminates most of the nine models from each other.
4.4 Dependence on redshift
A source redshift of is realistic for future space and groundbased surveys but it is certainly optimistic for current groundbased surveys. On the other hand, it also does not test the full potential of our classification methods since one would expect a better classification accuracy for larger source redshifts. We therefore repeat training and classification for one lower () and one higher () source redshift. For simplicity we restrict this analysis to the CNN which delivered the best results.
For a source redshift the overall accuracy drops significantly from 52% to 44%. When comparing the prediction statistics of the full set at this redshift in figure 5 with the reference at in figure 4, one can see that the decrease in the overall accuracy mainly stems from a weaker separation of the two models, and . The known issue of degeneracies between the three neutrino masses for and are already present and more prominent. The issue of model misclassification for gravity gets worse with now two misclassifications.
The improvements when going from to are highlighted by figure 6. For the network’s ability to distinguish between the base models increases and the overall classification accuracy is now 59%. The discrimination accuracy for massive neutrinos within each gravity model increases for the models and only the two models with massive neutrinos show significant overlap. Those models are also the only ones that show residual, but insignificant overlap with CDM. Given the fact that the ensemble of maps also gets misidentified as , it is clear that the discrimination within the models remains an issue even at a larger source redshift.
As a last analysis using the CNN we perform a tomographic classification. For each lineofsight realisation we are not using a single mass map at a specific source redshift but we feed data vectors of shape into the CNN where the four channels refer to , , and , respectively. The classification success matrix for this analysis is shown in table 8 and figure 7 shows the familiar boxplot representation of the predictionvector statistics. The overall classification success rate rises to 76% and all models besides and now show correct classification rates of 74% or clearly above. The probability of correctly classifying a single map in those two models are only 38% or 50%, respectively, however, a look at figure 7 reveals that they are correctly classified as an ensemble and at high significance. Finally, it is worth noting that none of the models shows any degeneracy with which is larger than 4% according to table 8.
CDM  
1618  0  7  0  10  3  0  0  0  
99%  0%  0%  0%  1%  0%  0%  0%  0%  
0  1501  0  1  0  0  1  20  115  
0%  92%  0%  0%  0%  0%  0%  1%  7%  
3  0  1257  2  375  1  0  0  0  
0%  0%  77%  0%  23%  0%  0%  0%  0%  
0  0  1  1470  96  22  7  2  40  
0%  0%  0%  90%  6%  1%  0%  0%  2%  
0  0  130  207  1289  12  0  0  0  
0%  0%  8%  13%  79%  1%  0%  0%  0%  
0  0  0  15  1  1206  275  30  111  
0%  0%  0%  1%  0%  74%  17%  2%  7%  
0  0  0  7  0  363  627  319  322  
0%  0%  0%  0%  0%  22%  38%  19%  20%  
0  3  0  3  0  77  385  827  343  
0%  0%  0%  0%  0%  5%  24%  50%  21%  
CDM  0  1  0  6  0  57  69  37  1468 
0%  0%  0%  0%  0%  3%  4%  2%  90%  
4.5 Remarks on extracted features
After presenting the raw classification results for different methods, we now briefly investigate what insight can be gathered into the actual meaning and importance of specific features that drive the classification success of different methods. To do so, we take a closer look at the training process. The first important observation is strikingly highlighted in table 3, which shows that almost all of the most discriminating wndcharm
features are Zernike coefficients derived from the Fourier transform of the raw image or from the Fourier transform of the edge or waveletprocessed image. This is interesting since Zernike polynomials were originally introduced to describe the effects of certain optical elements such as lenses or reflecting surfaces in optical imaging
(Zernike, 1934). This suggests that a decomposition of mass maps into a function set that has a welldefined physical meaning does indeed lead to a good general representation of our data. In addition, all those features are derived from transformations of the raw mass map which shows the power of filtering the input data as e.g. shown by Peel et al. (2018). The ranking of the standard features shown in table 2 is less dominated by a single class, although the power spectrum and peak counts seem most relevant. The good results with a neural network as classifier shows that the optimal combination of such classical features leads to a good classification even without the need for additional descriptors.CNNs often deliver superior results compared to other methods for certain tasks, but it is often believed that they are harder to understand and interpret. We are attempting to dissolve this believe by applying visualisation techniques for the different filters linked together in a deep neural network (Girshick et al., 2013; Zeiler & Fergus, 2013; Szegedy et al., 2013; Springenberg et al., 2014) and in order to reveal the inner workings of the complex model. We follow the approach of Simonyan et al. (2013) to extract our filter responses^{12}^{12}12 Also see https://github.com/kerasteam/keras/blob/master/examples/conv_filter_visualization.py. Starting from an image of random numbers with the same shape as our mass maps, we retrieve the output of every convolutional layer in the network and perform a gradient ascent in order maximise the response of those layers. While this is of course not a unique solution, the result of the final iteration of the ascent represents an example which triggered a strong response at a particular depth in the network. In figure 8 we show a few examples. The top row shows the four channels which had the strongest loss compared to the initial random image in CNN layer one. That is the convolution marked with index 2 in table 6. The second row shows the top four channel responses of the convolution with stride two just above the input layer in figure 9. The row marked with InceptionA shows the most responsive channels among all four convolutions just below the concatenation layer in figure 9 and equivalently for the figure rows marked InceptionB and C.
As is typical for CNNs (Zeiler & Fergus, 2013), the very first level extracts very regular horizontal and vertical stripe patterns from the image. The stripes turn into a grid pattern deeper into the network and once arriving at the end of the InceptionA layer we can identify patterns of peaks and troughs which are either grouped regularly or along larger structures. It is not surprising that the earlier layers of the network, up to InceptionA, perform a global filtering of the map that highlights structure as long as the image still consists of a relatively large number of pixels. It is just from the finer InceptionB layers on that more specific structures, like objects that look like individual clusters or voids are picked up. It is such detailed analyses of the inner structure of trained CNNs that will lead to a deeper understanding why those networks work so well. This can potentially lead to the development of more specific algorithms at lower numerical cost but with similar or better classification performance.
5 Conclusions
We studied the ability of different kinds of machine learning techniques to discriminate between highly degenerate cosmological models, which combine the effects of modified gravity and massive neutrinos on structure formation. For this purpose we used a subset of the DUSTGRAINpathfinder simulation suite which consists of and eight models of gravity in the range of . The neutrino masses in the simulations span . Lensing convergence maps produced from these simulations provided the input for the different classification methods.
In order to characterise the mass maps we used three different approaches to feature extraction. Commonly used statistics in astrophysics such as, and among others, the power spectrum, peak counts and Minkowski functionals were combined into a single feature vector. In order to probe features which are more common to the field of computer vision and digital image processing, we used the publicly available wndcharm algorithm which produces a large feature vector that combines a variety of common and more exotic descriptors and statistics. As the most flexible method of feature extraction we used a convolutional neural network (CNN). For classification we tested a nearestneighbour method in feature space and a fullyconnected neural network.
We provide an overview of the classification results from section 4 in table 9 and our results can be summarised as follows:
total accuracy  degenerate classes  performance  Reference  

classic  nearest neighbour  22%  8  14%/15% ()  repository  
wndcharm  nearest neighbour  25%  7  24%/24%()  repository  
classic  neural network  39%  3  47%/16%()  Tab. 5, Fig. 2  
wndcharm  neural network  36%  4  42%/24%()  repository  
CNN  neural network  44%  3  52%/15%()  Fig. 5  
CNN  neural network  52%  2  66%/11%()  Tab. 7, Fig. 4  
CNN  neural network  59%  1  53%/12%()  Fig. 6  
CNN  neural network  76%  0  90%/4%()  Tab. 8, Fig. 7 

Nearestneighbour classifiers based on distances in feature space are not delivering robust results. No matter if a small classical feature vector is used or a longer version based on computer vision, the total classification accuracy stays below 25%. Eight, out of the nine tested models, remain observationally degenerate^{13}^{13}13We declare a model as degenerate if the median and its error for the predictions of a true test set class overlaps with the median and its error of the predictions for any other class..

With the same classical or computer vision feature vectors, a neural network delivers a much more robust classification than the nearestneighbour method. The total success rate for the classical feature vector is 39% and the number of degenerate models reduces to three.

The longer feature vector containing 2919 features inspired by computer vision delivers a slightly worse classification of our models than the shorter vector with 99 classical descriptors. The total classification success rate is 3% lower and the method produces one additional degenerate model. Some of the computer vision feature may very well be useful, but currently we see no advantage of using features inspired by digital image processing compared to features wellestablished in cosmology.

A CNN delivers the best classification results with 52% correct classifications at source redshift . The number of degenerate models reduces to two, both of which are part of the same model of gravity.

Classification success is clearly a function of mass map source redshift. While going from to the success rate of the CNN decreases by 8% and the number of degenerate models increases by one. When going from to the accuracy increases by 7% and the number of degenerate models reduces by one. This increase of success rate with increasing redshift is not surprising since more information relevant to structure formation can be picked up along a deeper lineofsight.

When using a CNN in a tomographic analysis of four different mass map source redshifts along the same lineofsight all observational degeneracies are fully broken. The total classification success rate increases to 76%.
A number of improvements to our methodology come to mind and we reserve them for future work. Firstly, the flexible features derived by a CNN can be combined with fixed features that are known to contribute to a successful classification of degenerate models. Secondly, instead of working on the raw image data, a clever transformation can be applied to the input data to enhance features that allow for the desired discrimination. We attempt such an approach in the context of machine learning in Peel et al. (2018b, PRL submitted). In fact, the CNN used in this work applies such transformations as we discussed in section 4.5. A careful analysis of the filtering process of a CNN at the early levels of its filter chain can provide useful insight into the most powerful image transformation for a given classification task. Furthermore, the careful analysis of the filters at a much deeper level of the network might actually lead to more insights on structure formation in different models, since it is at this deeper level where individual structure is characterised and isolated by the algorithm.
Much work is left to be done before this machine learning approach to the classification of mass maps in different cosmological models can be applied to real data. In this work we limited ourselves to optimal noisefree maps in order to see how different methodologies compare under optimal conditions. The influence of pixel shotnoise, observational systematics and practical issues like masking and image artefacts needs to be studied in detail. Furthermore, since the currently most successful methods use a supervised training process with labelled data based on numerical simulations, it needs to be carefully investigated how closely those simulated maps resemble a real observation. Without this important sanity check, even the best machine learning technique is useless since it learns the wrong data.
Acknowledgements
We would like to thank Ofer Springer for useful discussions about deep learning. JM has received funding from the European Union’s Horizon 2020 research and Innovation programme under the Marie SkłodowskaCurie grant agreement No 664931. AP acknowledges support from an Enhanced Eurotalents Fellowship, a Marie SkłodowskaCurie Actions Programme cofunded by the European Commission and Commissariat à l’énergie atomique et aux énergies alternatives (CEA). CG and MB acknowledge support from the Italian Ministry for Education, University and Research (MIUR) through the SIR individual grant SIMCODE (project number RBSI14P4IH). CG and MM acknowledge support from the Italian Ministry of Foreign Affairs and International Cooperation, Directorate General for Country Promotion (Project "Crack the lens"). We also acknowledge the support from the grant MIUR PRIN 2015 "Cosmology and Fundamental Physics: illuminating the Dark Universe with Euclid"; and the financial contribution from the agreement ASI n.I/023/12/0 "Attività relative alla fase B2/C per la missione Euclid". The DUSTGRAINpathfinder simulations analyzed in this work have been performed on the Marconi supercomputing machine at Cineca thanks to the PRACE project SIMCODE1 (grant nr. 2016153604) and on the computing facilities of the Computational Center for Particle and Astrophysics (C2PAP) and of the Leibniz Supercomputer Center (LRZ) under the project ID pr94ji.We gratefully acknowledge the support of NVIDIA Corporation with the donation of the two Titan Xp GPUs used for this research.
References
 ATLAS Collaboration (2013) ATLAS Collaboration 2013, preprint, (arXiv:1309.4017)
 Aartsen et al. (2013) Aartsen M. G., et al., 2013, Physical Review Letters, 110, 131302
 Abbott et al. (2018) Abbott T. M. C., et al., 2018, Phys. Rev. D, 98, 043526
 Abraham et al. (2003) Abraham R. G., van den Bergh S., Nair P., 2003, ApJ, 588, 218
 Ackermann et al. (2017) Ackermann M., et al., 2017, ApJ, 840, 43
 Alam et al. (2017) Alam S., et al., 2017, MNRAS, 470, 2617
 Albert et al. (2017) Albert A., et al., 2017, ApJ, 834, 110
 Amendola et al. (2018) Amendola L., et al., 2018, Living Reviews in Relativity, 21, 2
 Arnold et al. (2014) Arnold C., Puchwein E., Springel V., 2014, MNRAS, 440, 833
 Arnold et al. (2015) Arnold C., Puchwein E., Springel V., 2015, Mon. Not. Roy. Astron. Soc., 448, 2275
 Arnold et al. (2016) Arnold C., Springel V., Puchwein E., 2016, MNRAS, 462, 1530
 Arnold et al. (2018) Arnold C., Fosalba P., Springel V., Puchwein E., Blot L., 2018, ArXiv eprints 1805.09824,
 Baldi & VillaescusaNavarro (2018) Baldi M., VillaescusaNavarro F., 2018, Mon. Not. Roy. Astron. Soc., 473, 3226
 Baldi et al. (2014) Baldi M., VillaescusaNavarro F., Viel M., Puchwein E., Springel V., Moscardini L., 2014, Mon. Not. Roy. Astron. Soc., 440, 75
 Barreira et al. (2016) Barreira A., Llinares C., Bose S., Li B., 2016, J. Cosmology Astropart. Phys., 5, 001
 Bartelmann & Schneider (2001) Bartelmann M., Schneider P., 2001, Phys. Rep., 340, 291
 Benitez et al. (2014) Benitez N., et al., 2014, ArXiv eprints 1403.5237,
 Bennett et al. (2013) Bennett C., et al., 2013, Astrophys.J.Suppl., 208, 20
 Bernabei et al. (2018) Bernabei R., et al., 2018, ArXiv eprints 1805.10486,

Bishop (2006)
Bishop C. M., 2006, Pattern Recognition and Machine Learning (Information Science and Statistics). SpringerVerlag New York, Inc., Secaucus, NJ, USA
 Buchdahl (1970) Buchdahl H. A., 1970, MNRAS, 150, 1
 CMS Collaboration (2016) CMS Collaboration 2016, preprint, (arXiv:1603.08914)
 Castro et al. (2018) Castro T., Quartin M., Giocoli C., Borgani S., Dolag K., 2018, MNRAS, 478, 1305
 Chetlur et al. (2014) Chetlur S., Woolley C., Vandermersch P., Cohen J., Tran J., Catanzaro B., Shelhamer E., 2014, CoRR, abs/1410.0759
 Chollet (2017) Chollet F., 2017, Deep Learning with Python, 1st edn. Manning Publications Co., Greenwich, CT, USA
 Dietrich & Hartlap (2010) Dietrich J. P., Hartlap J., 2010, MNRAS, 402, 1049
 Fogel & Sagi (1989) Fogel I., Sagi D., 1989, Biological Cybernetics, 61, 103
 Friedrich et al. (2018) Friedrich O., et al., 2018, Phys. Rev. D, 98, 023508
 Fu et al. (2008) Fu L., et al., 2008, Astron. Astrophys., 479, 9
 Giocoli et al. (2014) Giocoli C., Meneghetti M., Metcalf R. B., Ettori S., Moscardini L., 2014, MNRAS, 440, 1899
 Giocoli et al. (2016) Giocoli C., et al., 2016, MNRAS, 461, 209
 Giocoli et al. (2017) Giocoli C., et al., 2017, MNRAS, 470, 3574
 Giocoli et al. (2018a) Giocoli C., Baldi M., Moscardini L., 2018a, ArXiv eprints: 1806.04681,
 Giocoli et al. (2018b) Giocoli C., Moscardini L., Baldi M., Meneghetti M., Metcalf R. B., 2018b, MNRAS,
 Girshick et al. (2013) Girshick R., Donahue J., Darrell T., Malik J., 2013, preprint, (arXiv:1311.2524)
 Goodfellow et al. (2016) Goodfellow I., Bengio Y., Courville A., 2016, Deep Learning. The MIT Press
 Graves (2013) Graves A., 2013, CoRR, abs/1308.0850
 Gruen et al. (2018) Gruen D., et al., 2018, Phys. Rev. D, 98, 023507
 Gupta et al. (2018) Gupta A., Matilla J. M. Z., Hsu D., Haiman Z., 2018, Phys. Rev. D, 97, 103515
 Hagstotz et al. (2018) Hagstotz S., Costanzi M., Baldi M., Weller J., 2018, ArXiv eprints 1806.07400,
 Hanisch et al. (2001) Hanisch R. J., Farris A., Greisen E. W., Pence W. D., Schlesinger B. M., Teuben P. J., Thompson R. W., Warnock III A., 2001, A&A, 376, 359
 Haralick et al. (1973) Haralick R. M., Shanmugam K., Dinstein I., 1973, IEEE Transactions on Systems, Man, and Cybernetics, SMC3, 610
 He (2013) He J.h., 2013, arXiv:1307.4876,
 He et al. (2015) He K., Zhang X., Ren S., Sun J., 2015, preprint, (arXiv:1512.03385)
 Herbel et al. (2018) Herbel J., Kacprzak T., Amara A., Refregier A., Lucchi A., 2018, J. Cosmology Astropart. Phys., 7, 054
 Heymans et al. (2012) Heymans C., et al., 2012, MNRAS, 427, 146
 Hikage et al. (2018) Hikage C., et al., 2018, preprint, (arXiv:1809.09148)
 Hilbert et al. (2008) Hilbert S., White S. D. M., Hartlap J., Schneider P., 2008, MNRAS, 386, 1845
 Hilbert et al. (2009) Hilbert S., Hartlap J., White S. D. M., Schneider P., 2009, A&A, 499, 31
 Hildebrandt et al. (2017) Hildebrandt H., et al., 2017, MNRAS, 465, 1454
 Hu & Sawicki (2007) Hu W., Sawicki I., 2007, Phys. Rev., D76, 064004
 Ioffe & Szegedy (2015) Ioffe S., Szegedy C., 2015, CoRR, abs/1502.03167
 Ivezic et al. (2008) Ivezic Z., et al., 2008, arXiv:0805.2366
 Johnson et al. (2016) Johnson M., et al., 2016, CoRR, abs/1611.04558
 Joudaki et al. (2017) Joudaki S., et al., 2017, MNRAS, 465, 2033
 Kingma & Ba (2014) Kingma D. P., Ba J., 2014, CoRR, abs/1412.6980
 Kratochvil et al. (2010) Kratochvil J. M., Haiman Z., May M., 2010, Phys. Rev. D, 81, 043519
 Kratochvil et al. (2012) Kratochvil J. M., Lim E. A., Wang S., Haiman Z., May M., Huffenberger K., 2012, Phys. Rev. D, 85, 103513
 Krizhevsky et al. (2012) Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Pereira F., Burges C. J. C., Bottou L., Weinberger K. Q., eds, , Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp 1097–1105, link
 Laureijs et al. (2011) Laureijs R., et al., 2011, eprint arXiv: 1110.3193,
 LeCun et al. (2015) LeCun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436
 Lin & Kilbinger (2018) Lin C.A., Kilbinger M., 2018, A&A, 614, A36
 Lin et al. (2017) Lin T., Goyal P., Girshick R. B., He K., Dollár P., 2017, CoRR, abs/1708.02002
 LucieSmith et al. (2018) LucieSmith L., Peiris H. V., Pontzen A., Lochner M., 2018, MNRAS, 479, 3405
 Martinet et al. (2018) Martinet N., et al., 2018, MNRAS, 474, 712
 Motohashi et al. (2013) Motohashi H., Starobinsky A. A., Yokoyama J., 2013, Phys.Rev.Lett., 110, 121302
 Naik et al. (2018) Naik A. P., Puchwein E., Davis A.C., Arnold C., 2018, ArXiv eprints 1805.12221,
 Ntampaka et al. (2018) Ntampaka M., et al., 2018, preprint, (arXiv:1810.07703)
 Orlov et al. (2006) Orlov N., Johnston J., Macura T., Wolkow C., Goldberg I., 2006, in 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2006.. pp 1152–1155, doi:10.1109/ISBI.2006.1625127
 Orlov et al. (2008) Orlov N., Shamir L., Macura T., Johnston J., Eckley D. M., Goldberg I. G., 2008, Pattern recognition letters, 29, 1684
 Otsu (1979) Otsu N., 1979, IEEE Transactions on Systems, Man, and Cybernetics, 9, 62
 Parkinson et al. (2012) Parkinson D., et al., 2012, Phys. Rev. D, 86, 103518
 Peel et al. (2017) Peel A., Lin C.A., Lanusse F., Leonard A., Starck J.L., Kilbinger M., 2017, A&A, 599, A79
 Peel et al. (2018) Peel A., Pettorino V., Giocoli C., Starck J.L., Baldi M., 2018, preprint, (arXiv:1805.05146)
 Perlmutter et al. (1999) Perlmutter S., et al., 1999, Astrophys. J., 517, 565
 Petri (2016) Petri A., 2016, Astronomy and Computing, 17, 73
 Petri et al. (2015) Petri A., Liu J., Haiman Z., May M., Hui L., Kratochvil J. M., 2015, Phys. Rev. D, 91, 103511
 Petri et al. (2017) Petri A., Haiman Z., May M., 2017, Phys. Rev. D, 95, 123503
 Pezzotta et al. (2017) Pezzotta A., et al., 2017, A&A, 604, A33
 Planck Collaboration et al. (2016a) Planck Collaboration et al., 2016a, A&A, 594, A13
 Planck Collaboration et al. (2016b) Planck Collaboration et al., 2016b, A&A, 594, A24
 Planck Collaboration et al. (2018) Planck Collaboration et al., 2018, ArXiv eprints 1807.06209,
 Prewitt (1970) Prewitt J., 1970, Picture Processing and Psychopictorics. New York: Academic Press
 Puchwein et al. (2013) Puchwein E., Baldi M., Springel V., 2013, MNRAS, 436, 348
 Radon (1917) Radon J., 1917, in Berichte über die Verhandlungen der KöniglichSächsischen Akademie der Wissenschaften zu Leipzig. Teubner, pp 262–277, doi:doi:10.1109/TMI.1986.4307775, https://dx.doi.org/10.1109%2FTMI.1986.4307775
 Ravanbakhsh et al. (2017) Ravanbakhsh S., Oliva J., Fromenteau S., Price L. C., Ho S., Schneider J., Poczos B., 2017, preprint, (arXiv:1711.02033)
 Riess et al. (1998) Riess A. G., et al., 1998, Astron. J., 116, 1009
 Rodriguez et al. (2018) Rodriguez A. C., Kacprzak T., Lucchi A., Amara A., Sgier R., Fluri J., Hofmann T., Réfrégier A., 2018, preprint, (arXiv:1801.09070)
 Roncarelli et al. (2018) Roncarelli M., Baldi M., VillaescusaNavarro F., 2018, MNRAS, 481, 2497
 Rumelhart et al. (1986) Rumelhart D. E., Hinton G. E., Williams R. J., 1986, Nature, 323, 533
 Schäfer et al. (2012) Schäfer B. M., Heisenberg L., Kalovidouris A. F., Bacon D. J., 2012, MNRAS, 420, 455
 Schmidt et al. (1998) Schmidt B. P., et al., 1998, Astrophys.J., 507, 46
 Schneider (1996) Schneider P., 1996, MNRAS, 283, 837
 Schneider et al. (1998) Schneider P., van Waerbeke L., Jain B., Kruse G., 1998, MNRAS, 296, 873
 Shamir et al. (2008) Shamir L., Orlov N., Eckley D. M., Macura T., Johnston J., Goldberg I., 2008, Source Code for Biology and Medicine, 3, 13
 Shamir et al. (2010) Shamir L., Delaney J., Orlov N., Eckley D., Goldberg I., 2010, PLoS Comput Biol 6(11): e1000974
 Shan et al. (2018) Shan H., et al., 2018, MNRAS, 474, 1116
 Shirasaki et al. (2017) Shirasaki M., Nishimichi T., Li B., Higuchi Y., 2017, MNRAS, 466, 2402
 Simonyan & Zisserman (2014) Simonyan K., Zisserman A., 2014, preprint, (arXiv:1409.1556)
 Simonyan et al. (2013) Simonyan K., Vedaldi A., Zisserman A., 2013, preprint, (arXiv:1312.6034)
 Spergel et al. (2015) Spergel D., et al., 2015, ArXiv eprints 1503.03757,
 Springel (2005) Springel V., 2005, Mon. Not. Roy. Astron. Soc., 364, 1105
 Springel et al. (2005) Springel V., et al., 2005, Nature, 435, 629
 Springenberg et al. (2014) Springenberg J. T., Dosovitskiy A., Brox T., Riedmiller M., 2014, preprint, (arXiv:1412.6806)
 Springer et al. (2018) Springer O. M., Ofek E. O., Weiss Y., Merten J., 2018, preprint, (arXiv:1808.07491)
 Srivastava et al. (2014) Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., 2014, Journal of Machine Learning Research, 15, 1929
 Szegedy et al. (2013) Szegedy C., Zaremba W., Sutskever I., Bruna J., Erhan D., Goodfellow I., Fergus R., 2013, preprint, (arXiv:1312.6199)
 Szegedy et al. (2014) Szegedy C., et al., 2014, CoRR, abs/1409.4842
 Szegedy et al. (2016) Szegedy C., Ioffe S., Vanhoucke V., 2016, CoRR, abs/1602.07261
 Tamura et al. (1978) Tamura H., Mori S., Yamawaki T., 1978, IEEE Transactions on Systems, Man, and Cybernetics, 8, 460
 Teague (1980) Teague M. R., 1980, J. Opt. Soc. Am., 70, 920
 Tessore et al. (2015) Tessore N., Winther H. A., Metcalf R. B., Ferreira P. G., Giocoli C., 2015, J. Cosmology Astropart. Phys., 10, 036
 Troxel et al. (2018) Troxel M. A., et al., 2018, Phys. Rev. D, 98, 043528
 Van Waerbeke et al. (2013) Van Waerbeke L., et al., 2013, MNRAS, 433, 3373
 Viel et al. (2010) Viel M., Haehnelt M. G., Springel V., 2010, JCAP, 1006, 015
 Vikhlinin et al. (2009) Vikhlinin A., et al., 2009, ApJ, 692, 1060
 VillaescusaNavarro et al. (2017) VillaescusaNavarro F., Banerjee A., Dalal N., Castorina E., Scoccimarro R., Angulo R., Spergel D. N., 2017, preprint, (arXiv:1708.01154)
 White (1993) White S. D. M., 1993, in Gleiser R. J., Kozameh C. N., Moreschi O. M., eds, General Relativity and Gravitation 1992. pp 331–+
 White (1996) White S. D. M., 1996, in Schaeffer R., Silk J., Spiro M., ZinnJustin J., eds, Cosmology and Large Scale Structure. p. 349
 White & Rees (1978) White S. D. M., Rees M. J., 1978, MNRAS, 183, 341
 Winther et al. (2015) Winther H. A., et al., 2015, Mon. Not. Roy. Astron. Soc., 454, 4208
 Wright et al. (2017) Wright B. S., Winther H. A., Koyama K., 2017, J. Cosmology Astropart. Phys., 10, 054
 Wu et al. (1992) Wu C.M., Chen Y.C., Hsieh K.S., 1992, IEEE Transactions on Medical Imaging, 11, 141
 Wu et al. (2016) Wu Y., et al., 2016, CoRR, abs/1609.08144
 Zeiler & Fergus (2013) Zeiler M. D., Fergus R., 2013, preprint, (arXiv:1311.2901)
 Zennaro et al. (2017) Zennaro M., Bel J., VillaescusaNavarro F., Carbone C., Sefusatti E., Guzzo L., 2017, MNRAS, 466, 3244
 Zernike (1934) Zernike v. F., 1934, Physica, 1, 689
Appendix A Wndcharm features
The total length of the wndcharm feature vector entails 2919 descriptors, which can be divided into five families. We provide an overview of the features and their respective families in table 10. The algorithm does not only work on the image itself (raw), but also on its Fourier (F), Wavelet (W), Chebyshev (C) or Edge transformation (E) as indicated by the ’Input’ column of table 10. Transformations of transformations are considered by the bracket notation. While Fourier and Chebyshev transforms are implemented using common algorithms and methodologies, the Wavelet transformation is performed with a one level filter pass with a 5th order symlet (Orlov et al., 2008) and the Edge transformation is carried out using a Prewitt operator (Prewitt, 1970) to approximate the image gradient.
The pixel statistics family is made out of four different subclasses, with the simplest being the intensity statistics consisting of mean, median, standard deviation, minimum and maximum. The multiscale histograms are calculated by using three, five, seven or nine bins to order the pixel amplitudes. The counts in each of those bins makes up the 24 features in this subclass. The combined moments are mean, standard deviation, skewness and kurtosis, which are calculated in a horizontal stripe through the image centre and with a width which is half the total image width. The stripe is then rotated by 45, 90 and 135 degrees and the measurement is repeated. Those 16 numbers are sampled into three bins each, providing a total of 48 features. The Gini coefficient (Abraham et al., 2003) is a measure of how equal the spectrum of pixel intensities is distributed within the image.
The second feature family is comprised of polynomial decompositions. The coefficients of an order 20 Chebyshev and an order 23 ChebyshevFourier (Orlov et al., 2006) transformation are sorted into 32 bin histograms. Radon transformations are carried out along lines with an inclination angle of 0, 45, 90 and 135 degrees with respect to the image horizontal (Radon, 1917) and ordered in 3 bin histograms. The class of Zernike coefficients is derived from a 2D Zernike decomposition of the image (Teague, 1980) and the first 72 of those coefficients contribute to the feature vector.
The use of textures is common in image processing and is a way of describing spatial correlations of intensity values. We extract seven Gabor filters (e.g. Fogel & Sagi, 1989) using Gaussian harmonic functions and define their image occupation area as a feature. Tamura textures are described in detail in Tamura et al. (1978) and wndcharm uses contrast, directionality, coarseness sum and coarseness binned into a 3 sample histogram. The 28 Haralick textures are specific properties of the greylevel dependence matrix of the image and are described in Haralick et al. (1973). The fractal analysis is based on a Brownian motion model of the image following Wu et al. (1992) and wndcharm uses the first 20 parameters of this analysis as features.
Object statistics are only derived from the raw image data. The starting point is an edge transform using a Prewitt filter and mean, median, variance and 8bin histogram of both image gradient and its directionality add up to 22 features, which are supplemented by the total number of edge pixels, their genus and the differences between the directionality bins. Otsu features and their inverse are calculated after the application of an Otsu threshold (Otsu, 1979). Finally, for all objects the algorithm calculates minimum, maximum, mean, median, variance and 10bin histogram for area and imagecentre distance of all Otsu objects in the image.
Family  Class  Features  Input  Reference 

Pixel  Combined moments  48  raw, F, W, C, C(F), W(F)  – 
statistics  F(W), F(C), C(W), E, F(E), W(E)  
Gini coefficient  1  raw, F, W, C, C(F), W(F)  Abraham et al. (2003)  
F(W), F(C), C(W), E, F(E), W(E)  
Multiscale histograms  24  raw, F, W, C, C(F), W(F)  –  
F(W), F(C), C(W), E, F(E), W(E)  
Pixel intensity statistics  5  raw, F, W, C, C(F), W(F)  –  
F(W), F(C), C(W), E, F(E), W(E)  
Polynomial  Chebyshev coefficients  32  raw, F, W, C, F(W), E, F(E), W(E)  – 
decomposition  ChebyshevFourier coefficients  32  raw, F, W, C, F(W), E, F(E), W(E)  Orlov et al. (2006) 
Radon coefficients  12  raw, F, W, C, C(F), W(F)  Radon (1917)  
F(W), F(C), C(W), E, F(E), W(E)  
Zernike coefficients  72  raw, F, W, C, F(W), E, F(E), W(E)  Teague (1980))  
Textures  Fractal analysis  20  raw, F, W, C, C(F), W(F)  Wu et al. (1992) 
F(W), F(C), C(W), E, F(E), W(E)  
Gabor  7  raw  Fogel & Sagi (1989)  
Haralick  28  raw, F, W, C, C(F), W(F)  Haralick et al. (1973)  
F(W), F(C), C(W), E, F(E), W(E)  
Tamura  6  raw, F, W, C, C(F), W(F)  Tamura et al. (1978)  
F(W), F(C), C(W), E, F(E), W(E)  
Objects  Edge features  28  raw  Prewitt (1970) 
Otsu object features  34  raw  Otsu (1979)  
Inverse Otsu object features  34  raw  Otsu (1979) 
Appendix B Deep neural networks
In this appendix we collect some more detailed information about deep neural networks. The first section formally defines all network layers used in this work and the second section deals with activation functions. The third section provides a thorough description about the architecture of the convolutional neural network (CNN) that we use in our analyses.
b.1 Layers
Given a 2D input vector with a convolution layer applies the following operation to produce an output
(19)  
(20)  
(21)  
(22)  
(23) 
The stride parameters
Comments
There are no comments yet.