On the dissection of degenerate cosmologies with machine learning

25 October 2018 · Julian Merten et al. · Istituto Nazionale di Astrofisica

Based on the DUSTGRAIN-pathfinder suite of simulations, we investigate observational degeneracies between nine models of modified gravity and massive neutrinos. Three types of machine learning techniques are tested for their ability to discriminate lensing convergence maps by extracting dimensionally reduced representations of the data. Classical map descriptors such as the power spectrum, peak counts and Minkowski functionals are combined into a joint feature vector and compared to the descriptors and statistics that are common to the field of digital image processing. To learn new features directly from the data we use a Convolutional Neural Network (CNN). For the mapping between feature vectors and the predictions of their underlying model, we implement two different classifiers; one based on a nearest-neighbour search and one that is based on a fully connected neural network. We find that the neural network provides a much more robust classification than the nearest-neighbour approach and that the CNN provides the most discriminating representation of the data. It achieves the cleanest separation between the different models and the highest classification success rate of 59% when a single source redshift is used. When we perform a tomographic CNN analysis, the total classification accuracy increases significantly to 76%. Visualising the filter responses of the CNN at different network depths provides us with the unique opportunity to learn from very complex models and to understand better why they perform so well.


1 Introduction

The standard ΛCDM cosmological model – based on a cosmological constant as the source of the observed accelerated cosmic expansion (Riess et al., 1998; Perlmutter et al., 1999; Schmidt et al., 1998) and on cold dark matter particles as the bulk of the clustering mass in the universe (White & Rees, 1978; White, 1993, 1996; Springel et al., 2005) – has survived the past two decades of cosmological observations targeting a wide range of independent probes. These include the statistical properties of Cosmic Microwave Background (CMB) anisotropies (Bennett et al., 2013; Planck Collaboration et al., 2018), the large-scale distribution and dynamics of visible galaxies (Parkinson et al., 2012; Alam et al., 2017; Pezzotta et al., 2017), weak gravitational lensing signals (Fu et al., 2008; Joudaki et al., 2017; Hildebrandt et al., 2017; Troxel et al., 2018; Hikage et al., 2018), and the abundance of galaxy clusters as well as its time evolution (Vikhlinin et al., 2009; Planck Collaboration et al., 2016b).

Despite this astonishing success, the fundamental nature of the two main ingredients of the ΛCDM model – summing up to about 95 per cent of the total energy density of the universe – remains unknown. On the one hand, the energy scale associated with the cosmological constant does not find any reasonable explanation in the context of fundamental physics, with predictions based on the standard model of particle physics failing by tens of orders of magnitude. On the other hand, no clear detection – direct or indirect – of any new fundamental particle that may be associated with cold dark matter has been made, despite a longstanding chase through astrophysical observations (Aartsen et al., 2013; Ackermann et al., 2017; Albert et al., 2017) and laboratory experiments (see e.g. Bernabei et al., 2018; ATLAS Collaboration, 2013; CMS Collaboration, 2016).

This leaves the next generation of cosmological observations with the arduous challenge of clarifying the fundamental nature of the dark sector by systematically scrutinising the huge wealth of high-quality data that will be made available in the near future by several wide-field surveys (such as Laureijs et al., 2011; Ivezic et al., 2008; Spergel et al., 2015; Benitez et al., 2014). As a matter of fact, any possible insights from future datasets must come in the form of very small deviations from the expectations of the ΛCDM model, otherwise past observations would have already detected them. This suggests that either the fundamental physics behind dark energy and dark matter is indeed extremely close to that of General Relativity (GR) with a cosmological constant and heavy fundamental particles with negligible thermal velocities, respectively, or that a more radical shift from this standard paradigm is hidden and masked by other effects, such as an observational degeneracy with some not yet fully constrained cosmological parameter. The latter scenario may result in a severe limitation of the discriminating power of observations, thereby providing a particularly challenging testbed for the next generation of cosmological surveys.

A typical example of such a possible intriguing situation is given by the well-known degeneracy between some Modified Gravity (MG, see e.g. Amendola et al., 2018, for a recent review on a wide range of MG scenarios) theories and the yet unknown value of the neutrino mass. It is now generally accepted (Baldi et al., 2014; He, 2013; Motohashi et al., 2013; Wright et al., 2017) that MG theories such as f(R) gravity (Buchdahl, 1970) are strongly observationally degenerate with the effects of massive neutrinos on structure formation (see Baldi et al., 2014). Some commonly adopted statistics such as the matter auto-power spectrum (Giocoli et al., 2018a), the lensing convergence power spectrum (Peel et al., 2018), and the halo mass function (Hagstotz et al., 2018) may hardly distinguish standard expectations from some specific combinations of the f(R) gravity parameters and the total neutrino mass.

As such degeneracies extend down to the non-linear regime of structure formation, the use of full numerical simulations currently represents the only viable method to explore these scenarios, even though alternative approaches based on approximate methods (see e.g. Wright et al., 2017) have been developed in recent years and are being tested and calibrated against simulations. In the present work, we explore the prospects of using machine learning techniques applied to numerical simulations of combined MG and massive neutrino cosmologies that are highly observationally degenerate through standard observational statistics.

Several variants of higher-order statistics have been applied in the past to characterise cosmological data sensitive to the late-time evolution of structure in the Universe. Recent analyses of weak lensing (WL; Bartelmann & Schneider, 2001) data from CFHTLenS (Heymans et al., 2012) used either higher-order (>2) moments of the convergence field (Van Waerbeke et al., 2013) or Minkowski functionals (Petri et al., 2015) to draw cosmological inference from a data description that goes beyond two-point statistics. Martinet et al. (2018) and Shan et al. (2018) applied peak count statistics (Dietrich & Hartlap, 2010; Kratochvil et al., 2010) to shear and convergence fields from KiDS (Hildebrandt et al., 2017), and Gruen et al. (2018) used counts-in-cells (Friedrich et al., 2018) to extract information from the DES (Abbott et al., 2018) catalogues.

A new set of techniques based on deep learning (LeCun et al., 2015) has recently gained momentum in many scientific fields, including astrophysics. The extremely complex models which can be constructed through a modular building-block concept (e.g. Chollet, 2017) have been very successful for tasks like language translation (e.g. Wu et al., 2016; Johnson et al., 2016), text and handwriting recognition (e.g. Graves, 2013), as well as for the classification of images (starting with the seminal work of Krizhevsky et al., 2012). In cosmology, deep learning is used for the extraction of information from N-body simulations (Ravanbakhsh et al., 2017), to learn the connection between initial conditions and the final shape of structure (Lucie-Smith et al., 2018), for the characterisation of point spread functions (Herbel et al., 2018) or the measurement of shear for WL (Springer et al., 2018), the characterisation of non-Gaussian structure in mass maps (Gupta et al., 2018), the determination of galaxy cluster X-ray masses (Ntampaka et al., 2018), and the fast creation of simulated data using generative adversarial networks (Rodriguez et al., 2018). In this work, we will use such techniques to break the degeneracies between models of modified gravity in the presence of massive neutrinos.

The text is organised as follows: Section 2 gives an overview of the numerical simulations and the creation of the mass maps that constitute our main data set. In section 3 we introduce the different characterisation and classification techniques that we apply to the mass map data, and we show the results that they produce in section 4. We present our conclusions in section 5. Two appendices provide more details on certain technical aspects of the computer vision (appendix A) and deep neural network (appendix B) methods we are using.

2 Numerical simulations

We perform our analysis on a set of WL maps extracted from a suite of cosmological dark matter-only simulations called the DUSTGRAIN-pathfinder runs. These simulations represent a preliminary calibration sample for the DUSTGRAIN (Dark Universe Simulations to Test GRAvity In the presence of Neutrinos) project, an ongoing numerical effort aimed at investigating cosmological models characterised by a modification of the laws of gravity from their standard GR form and by a non-negligible fraction of the cosmic matter density being made of standard massive neutrinos.

2.1 Dustgrain-pathfinder

The modification of gravity considered in the DUSTGRAIN project consists in an f(R) model defined by the action (Buchdahl, 1970)

S = \int \mathrm{d}^4x \, \sqrt{-g} \left[ \frac{R + f(R)}{16\pi G} + \mathcal{L}_{\rm m} \right] .    (1)

We assume a specific analytical form for the function f(R) (Hu & Sawicki, 2007),

f(R) = -m^2 \frac{c_1 \left( R/m^2 \right)^n}{c_2 \left( R/m^2 \right)^n + 1} ,    (2)

where R is the Ricci scalar curvature, m^2 \equiv H_0^2 \Omega_{\rm M} is a mass scale, while c_1, c_2 and n are non-negative free parameters of the model. Such a form is particularly popular and widely investigated as it allows one to recover with arbitrary precision a ΛCDM background expansion history by choosing c_1/c_2 = 6 \Omega_\Lambda / \Omega_{\rm M}. Here \Omega_\Lambda and \Omega_{\rm M} are the vacuum and matter energy density, respectively, under the condition c_2 (R/m^2)^n \gg 1, so that the scalar field f_R \equiv \mathrm{d}f/\mathrm{d}R takes the approximate form

f_R \approx -n \frac{c_1}{c_2^2} \left( \frac{m^2}{R} \right)^{n+1} .    (3)

By restricting to the case n = 1 the only remaining free parameter of the model can be written as

\bar{f}_{R0} = -\frac{1}{c_2} \frac{6\,\Omega_\Lambda}{\Omega_{\rm M}} \left[ 3 \left( 1 + \frac{4\,\Omega_\Lambda}{\Omega_{\rm M}} \right) \right]^{-2} ,    (4)

and its absolute value will quantify how much the model departs from GR.

The DUSTGRAIN-pathfinder simulations have been devised to sample the joint f(R) and neutrino mass parameter space and to identify highly degenerate combinations of parameters. Some analyses of the corresponding WL signal have been presented by Giocoli et al. (2018a) and Peel et al. (2018), while Hagstotz et al. (2018) have used the simulations to calibrate a theoretical modelling of the halo mass function in f(R) gravity with and without the contribution of massive neutrinos. In this paper, we use machine learning techniques to tackle the issue of observational degeneracy in these combined models based on the WL reconstruction described in Giocoli et al. (2018a). A similar approach, focused on a subset of particularly degenerate models, is presented in Peel et al. (2018b, PRL submitted).

From a technical point of view, the DUSTGRAIN-pathfinder runs are cosmological collisionless simulations that include cold dark matter particles and – for the runs with massive neutrinos – an equal number of neutrino particles, in a cosmological volume with periodic boundary conditions evolving under the effect of a gravitational interaction defined by equation 1. The simulations have been performed with the MG-GADGET code (see Puchwein et al., 2013), a modified version of the GADGET code (Springel, 2005) that implements all the modifications that characterise f(R) gravity (see Puchwein et al., 2013, for more details on the algorithm). MG-GADGET has been extensively tested (see e.g. the Modified Gravity code comparison project described in Winther et al., 2015) and employed recently for a wide variety of applications (Baldi & Villaescusa-Navarro, 2018; Arnold et al., 2018; Arnold et al., 2014, 2015; Roncarelli et al., 2018; Arnold et al., 2016; Naik et al., 2018). For the DUSTGRAIN-pathfinder simulations, as was already done in Baldi et al. (2014), we have combined the MG-GADGET solver with the particle-based implementation of massive neutrinos developed by Viel et al. (2010). This allowed us to include massive neutrinos in the simulations as an independent family of particles with its own initial transfer function and velocity distribution. Initial conditions have been generated following the approach of e.g. Zennaro et al. (2017) and Villaescusa-Navarro et al. (2017) at the starting redshift of the simulation, with thermal neutrino velocities added on top of the gravitational velocities by random sampling the neutrino momentum distribution at that redshift.

Standard cosmological parameters are set to be consistent with the Planck constraints (Planck Collaboration et al., 2016a). Concerning non-standard parameters, the DUSTGRAIN-pathfinder simulations sampled a range of values for the scalar amplitude f_R0 and for the total neutrino mass m_ν. In the present work we consider a subset of the full DUSTGRAIN-pathfinder suite consisting of nine simulations, whose specifications are summarised in table 1.

Simulation Name   Gravity type   f_R0        m_ν [eV]   Ω_CDM     Ω_ν
ΛCDM              GR             –           0          0.31345   0
fR4               f(R)           -1×10^-4    0          0.31345   0
fR5               f(R)           -1×10^-5    0          0.31345   0
fR6               f(R)           -1×10^-6    0          0.31345   0
fR4_0.3eV         f(R)           -1×10^-4    0.3        0.30630   0.00715
fR5_0.15eV        f(R)           -1×10^-5    0.15       0.30987   0.00358
fR5_0.1eV         f(R)           -1×10^-5    0.1        0.31107   0.00238
fR6_0.1eV         f(R)           -1×10^-6    0.1        0.31107   0.00238
fR6_0.06eV        f(R)           -1×10^-6    0.06       0.31202   0.00143
Table 1: The subset of the DUSTGRAIN-pathfinder simulations considered in this work with their main parameters: f_R0 is the MG parameter, m_ν the total neutrino mass in electron volts, and Ω_CDM and Ω_ν the cold dark matter and neutrino density parameters, respectively.

2.2 Lensing light-cones

For all simulations we stored snapshots at different redshifts that allow us to construct lensing light-cones up to our highest source redshift without gaps. Different methods have been developed to produce lensing light-cones from large cosmological N-body simulations. Recent works have employed post-processing reconstructions based on the slicing of a set of comoving particle snapshots (as e.g. in Hilbert et al., 2008, 2009; Giocoli et al., 2016; Shirasaki et al., 2017), as well as on-the-fly algorithms capable of storing only the projected matter density on a given field-of-view without resorting to the flat-sky approximation (see e.g. Barreira et al., 2016; Arnold et al., 2018). In this work we use the MapSim routine (Giocoli et al., 2014; Tessore et al., 2015; Castro et al., 2018), which is based on the former strategy. We use the particles stored in the different snapshots to construct our continuous past light-cones, building lens planes of the projected matter density distribution and considering a square sky coverage of five degrees on a side. For each cosmological model we construct 256 different light-cone realisations by randomising the various comoving cosmological boxes (Giocoli et al., 2018a; Peel et al., 2018).

2.3 Convergence maps

The MapSim pipeline is composed of two algorithms. The first one – called i-MapSim – constructs lensing planes from the different simulation snapshots, saving for each lens plane l and for each pixel with coordinate indices (i, j) the particle surface mass density

\Sigma_l(i,j) = \frac{\sum_k m_k}{A_l} ,    (5)

where A_l represents the comoving pixel area of the lens plane and the sum runs over the masses m_k of all particles associated with the given pixel. The second algorithm, named ray-MapSim, projects the matter density distribution along the line-of-sight by weighing the lens planes with the lensing kernel in the Born approximation regime (Bartelmann & Schneider, 2001; Schäfer et al., 2012; Petri, 2016; Giocoli et al., 2016; Petri et al., 2017; Giocoli et al., 2017; Giocoli et al., 2018b; Castro et al., 2018). From \Sigma_l we can derive the convergence κ at a given source redshift z_s as

\kappa(i,j) = \sum_l \frac{\Sigma_l(i,j)}{\Sigma_{\mathrm{crit},l}} ,    (6)

where l varies over the different lens planes with lens redshift z_l smaller than z_s, and \Sigma_{\mathrm{crit},l} represents the critical surface density at the lens plane l for sources at redshift z_s,

\Sigma_{\mathrm{crit},l} = \frac{c^2}{4\pi G} \frac{D_{\rm s}}{D_{\rm l} D_{\rm ls}} .    (7)

Here c is the speed of light, G is Newton's constant, and D_l, D_s and D_ls are the angular diameter distances between observer and lens, observer and source, and source and lens, respectively. The final maps cover the 25 square degree field-of-view with 2048 × 2048 pixels, resulting in a map resolution of roughly 8.8 arc seconds.
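To make the stacking of equations (6) and (7) concrete, the minimal numpy sketch below sums pre-computed surface-density planes into a convergence map in the Born approximation. It is not the MapSim implementation: the function names, the distance inputs and the constants in these particular units are illustrative assumptions.

```python
import numpy as np

C_LIGHT = 299792.458   # speed of light in km/s (illustrative unit choice)
G_NEWTON = 4.301e-9    # Newton's constant in Mpc (km/s)^2 / M_sun

def sigma_crit(d_l, d_s, d_ls):
    """Critical surface density for one lens plane and one source plane (eq. 7)."""
    return (C_LIGHT ** 2 / (4.0 * np.pi * G_NEWTON)) * d_s / (d_l * d_ls)

def convergence_born(sigma_planes, d_l, d_s, d_ls):
    """Sum surface-density planes weighted by 1/Sigma_crit (eq. 6).

    sigma_planes : (n_planes, npix, npix) array with Sigma_l per pixel
    d_l, d_ls    : per-plane observer-lens and lens-source angular diameter distances
    d_s          : observer-source angular diameter distance
    Only planes with lens redshift below the source redshift should be passed in.
    """
    kappa = np.zeros(sigma_planes.shape[1:])
    for sigma_l, dl, dls in zip(sigma_planes, d_l, d_ls):
        kappa += sigma_l / sigma_crit(dl, d_s, dls)
    return kappa
```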

3 Methodology

A variety of machine learning techniques is applied to the DUSTGRAIN-pathfinder convergence maps. It was shown by Peel et al. (2018) that summary statistics up to second-order do not reliably separate such mass maps. Higher-order statistics, especially peak counts (e.g. Peel et al., 2017; Shan et al., 2018; Martinet et al., 2018; Lin & Kilbinger, 2018), do a better job but still leave room for improvement when distinguishing between a large number of models and in the presence of noise. Most commonly used methods to characterise observational data are naturally based on physical models. In the following we present an agnostic approach, which also applies techniques and algorithms found in the fields of computer science and specifically digital image processing.

We distinguish two subsequent steps in the process of mass map classification. The first is to find a feature extraction function f, which takes a high-dimensional data vector d as input and finds a general, dimensionally reduced representation of it in the form of a feature vector F,

F = f(d) .    (8)

The feature extraction function can have several parameters, which are stored in the feature weight vector w_f. In order to arrange the data vector in a meaningful way, we introduce an index notation d_{i,j,k}. The first two indices reflect a spatial ordering of the 2D data along the coordinate axes: all elements with i = 0 are located in the first row of the pixelised image and all elements with e.g. j = 9 are located in the tenth column of the image. This notation also includes 1D data, ordered or not, by setting the remaining spatial index to zero. The third index k – commonly dubbed a channel – allows us to collect multiple aspects of the same entity. For the example of an RGB image, k = 0 would be the red channel of the image, k = 1 the green and k = 2 the blue channel. Finally, we define the shape of a data vector with a bracket notation: the shape of our input convergence maps is shape(d) = (2048, 2048, 6), since we have 2048 × 2048 pixel maps with convergence values at six different source redshift channels, where we have introduced the shape operator shape(·), which returns the shape of a data vector.
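As a purely illustrative numpy example of this index and shape convention (the array sizes, dtype and function name are ours, not part of the pipeline):

```python
import numpy as np

def shape(d):
    """Shape operator: returns the (row, column, channel) extent of a data vector."""
    return d.shape

# A full-size convergence map with six source-redshift channels
d = np.zeros((2048, 2048, 6), dtype=np.float32)
print(shape(d))   # (2048, 2048, 6)

d[0, :, :]        # all elements of the first pixel row (i = 0)
d[:, 9, :]        # all elements of the tenth pixel column (j = 9)
d[..., 2]         # the third redshift channel (k = 2)
```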

The second step classifies the feature vector F into a set of target classes. The classification function C, which can again depend on a set of parameters w_C, should not only output a single class prediction, but rather a prediction vector p of shape (1, 1, n), with probabilities of belonging to each of the n target classes,

p = C(F) .    (9)

It must hold that \sum_i p_i = 1, and in our case n = 9.

In the following we explore different choices for the feature extraction and classification functions and find ways to optimise their parameters to achieve an optimal classification. We do so with the help of training sets, which are data vector–label pairs (d, y), meaning mass maps for which we know the underlying cosmological model a priori. Specifically, the label y is an indicator function for the true class c, with elements

y_i = \delta_{ic} ,    (10)

where \delta_{ic} denotes the Kronecker delta.

3.1 Definition of data sets

Our full data set consists of 256 convergence maps of shape (2048, 2048, 6) for each of the nine cosmological models. We split each map further into 64 smaller patches to define our main data vectors d with shape(d) = (256, 256, 6). 75% of those maps (12289 per model) are used as a training set in order to optimise the parameters of our models. We use 15% of the maps (2457 per model) as a validation set, for which the correct labels are known to us but not to the optimisation algorithm. Performing a classification on the validation data serves as quality control and helps us to decide if an optimisation is successful and when to stop it. The remaining 10% of the maps (1638 per model) are used as a test set, for which the labels are not known to us a priori and to which our trained and validated algorithms are applied blindly. The success rates on those test sets will be the main result of this work.
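A minimal sketch of this patching and splitting, assuming the 256 maps per model are already loaded as (2048, 2048, 6) arrays; the helper names and the random seed are illustrative, only the patch size and the 75/15/10 fractions follow the text.

```python
import numpy as np

def cut_patches(kappa_map, patch=256):
    """Cut a (2048, 2048, n_z) map into 64 non-overlapping (patch, patch, n_z) tiles."""
    n = kappa_map.shape[0] // patch                      # 8 tiles per side
    return [kappa_map[i*patch:(i+1)*patch, j*patch:(j+1)*patch, :]
            for i in range(n) for j in range(n)]

def split_dataset(patches, labels, fractions=(0.75, 0.15, 0.10), seed=42):
    """Random 75/15/10 split into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(patches))
    n_train = int(fractions[0] * len(idx))
    n_val = int(fractions[1] * len(idx))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    take = lambda ids: ([patches[i] for i in ids], [labels[i] for i in ids])
    return take(train), take(val), take(test)
```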

3.2 Mass map feature extraction

Two important subclasses of feature extraction functions are possible: in the first, the parameters w_f are free and can be optimised during a training phase; in the second, they are fixed. We want to highlight that we do not perform any initial transformations of the data, even though some have proven to be useful for the analysis of lensing mass maps. It was shown in e.g. Peel et al. (2018) that an aperture mass transformation (Schneider, 1996; Schneider et al., 1998) can largely improve the discrimination power of certain statistics, but we want to stay as general and agnostic as possible at this stage and use the raw pixel data of the convergence maps as the initial data vector.

3.2.1 Standard mass map descriptors

Examples of fixed feature extraction are the mass map descriptors which are commonly used in the cosmological community to describe convergence or shear catalogues. For the purposes of this work, such descriptors serve as the reference for the other techniques that we apply. We combine a number of mass map features, which we extract with the LensTools package (https://github.com/apetri/LensTools; Petri, 2016), into a feature vector of shape (1,1,99). The first four entries in this vector are the mean, variance, skewness and kurtosis of the convergence map. This is followed by eleven percentiles, from the 0th percentile (the minimum) to the 100th percentile (the maximum) in steps of ten per cent. The normalised histogram (PDF) of the convergence values in each map is sorted into 14 bins and the value of each bin is appended to the feature vector. Next, we calculate the power spectrum in 14 logarithmically spaced multipole bins which cover the angular size and resolution of our mass maps. Finally, we use the standard deviation of each map to define 14 signal-to-noise bins between -2 and 5. For each such bin we calculate the peak counts, as well as the first three Minkowski functionals (e.g. Kratochvil et al., 2012; Petri et al., 2015, and references therein), which concludes our collection of 99 features.
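The paper builds this vector with LensTools; the numpy/scipy sketch below reproduces a simplified subset of it (moments, percentiles, PDF bins, a crude azimuthally averaged power spectrum and 4-neighbour peak counts) to make the layout concrete. The Minkowski functionals are omitted and the binning details are illustrative, so this is not the exact 99-element vector.

```python
import numpy as np
from scipy import stats

def classical_features(kappa, n_bins=14):
    """Simplified version of the classical feature vector of section 3.2.1."""
    feats = [kappa.mean(), kappa.var(),
             stats.skew(kappa, axis=None), stats.kurtosis(kappa, axis=None)]
    feats += list(np.percentile(kappa, np.arange(0, 101, 10)))       # 11 percentiles
    pdf, _ = np.histogram(kappa, bins=n_bins, density=True)           # normalised PDF
    feats += list(pdf)

    # Azimuthally averaged power spectrum in logarithmically spaced bins
    power2d = np.abs(np.fft.fftshift(np.fft.fft2(kappa))) ** 2
    ny, nx = kappa.shape
    y, x = np.indices(kappa.shape)
    k = np.hypot(x - nx // 2, y - ny // 2)
    k_edges = np.logspace(0.0, np.log10(k.max()), n_bins + 1)
    for lo, hi in zip(k_edges[:-1], k_edges[1:]):
        sel = (k >= lo) & (k < hi)
        feats.append(power2d[sel].mean() if sel.any() else 0.0)

    # Peak counts in 14 signal-to-noise bins between -2 and 5
    nu = kappa / kappa.std()
    core = nu[1:-1, 1:-1]
    peaks = (core > nu[:-2, 1:-1]) & (core > nu[2:, 1:-1]) & \
            (core > nu[1:-1, :-2]) & (core > nu[1:-1, 2:])
    feats += list(np.histogram(core[peaks], bins=n_bins, range=(-2, 5))[0])
    return np.asarray(feats)
```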

3.2.2 Classical computer vision

We know from Peel et al. (2018) that at least some of the standard descriptors above are not optimally suited for the task at hand and it is, at this point, not entirely obvious how to define better ones. This is why we now aim to derive as many fixed features as possible. The publicly available wnd-charm algorithm (Shamir et al., 2008; Orlov et al., 2008; Shamir et al., 2010) was designed for the classification of microscopy images and derives a particularly large feature vector of shape (1,1,2919). This includes most of the common statistics and descriptors known to digital image processing. Many features are thereby not only calculated from the raw image, but also from some of its alternative representations such as the Fourier, Wavelet, Chebyshev or Edge transformation; some features are even extracted from transformations of transformations. While we stated earlier that we do not want to pre-process our data with transformations, we point out that the listed transformations are by no means inspired by the mechanisms of lensing or structure formation. We refer the interested reader to Orlov et al. (2008) for the full description of the algorithm and of the full feature vector, but we provide a short summary in appendix A and a compact overview in table 10.

3.2.3 Convolutional neural networks

As the class of feature extraction functions which are able to adapt their parameters during the training process we chose multi-layered neural networks (LeCun et al., 2015; Goodfellow et al., 2016). The input data vector d is manipulated and eventually reduced in dimension by a long – deep – chain of simple layers L_i, each of which implements a specific mathematical operation. The output of one layer becomes the input of the following layer, and each layer contains its own set of parameters w_i. The set of all layer parameters becomes the feature parameter vector w_f:

F_1 = L_1(d; w_1) ,    (11)
F_{i+1} = L_{i+1}(F_i; w_{i+1}) ,    (12)
F = L_N(F_{N-1}; w_N) .    (13)

Deep neural networks source their performance from the sheer number of layers they are comprised of and have gained much popularity in recent years. This is mainly due to advancements in numerical performance, e.g. by exploiting many-core architectures (General Purpose Graphics Processing Units, GPGPUs, are a popular example of a many-core architecture). This allows for the construction of particularly deep and complex networks with hundreds of millions of parameters. The functional forms of the layers that are used in a deep neural network depend on the problem at hand. For image classification, convolutional neural networks (CNNs) have proven to be particularly useful (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2014; He et al., 2015; Lin et al., 2017) and hence we chose this class of models for our purposes. The main functionality of a CNN is provided by a convolutional layer Conv(a, b, s_x, s_y, p, n), which applies a number n of convolutions with kernel size a × b to a 2D input vector. The stride parameters s_x and s_y allow one to implement dimensional reduction, and the parameter p controls whether the input data is padded (p = s) or unaltered (p = v). We provide a thorough mathematical definition of all deep neural network layers used in this work, including the convolutional one, in appendix B.1.

Convolution layers are often followed by pooling layers for dimensional reduction. We implement average pooling layers (AvgPool), which average the entries of the 2D data vector within a given window, apply a stride in analogy to the convolutional layers, and follow the same padding scheme that was introduced earlier. Maximum pooling layers (MaxPool) work in a similar manner, but instead of the average they return the maximum within a given window. Both pooling layers also exist as global versions, indicated by GlobalMaxPool and GlobalAvgPool, where all entries per channel are considered for either the maximum or averaging operation.

Up to this point, we only allowed layers to be placed strictly sequentially. In order to implement a horizontal layout, we connect several layers to the same input and combine their results with the help of a concatenation layer. This concept of performing not only one but several operations at a given depth of the network has proven very successful for image classification, as e.g. shown by Szegedy et al. (2014), who dubbed such horizontal layers Inception modules.

The output of a layer can be followed by a non-linear activation function. For convolution and pooling layers we mainly deploy rectified linear units (ReLU), and we give the full details about the activation functions used in this work in appendix B.2. To prevent the network from overfitting, so-called dropout layers are introduced as a regularisation: a given percentage of the elements of an input vector are chosen at random and subsequently discarded from the output (Srivastava et al., 2014). Finally, to compensate for fluctuations in the amplitudes of input vectors at different network depths, Ioffe & Szegedy (2015) introduced the concept of batch normalisation, which we use after each convolutional layer. The output of the last layer in the CNN, the feature vector F, is used for classification in a final section of the network which is commonly referred to as the top. The concrete architecture of the CNN that we use in this work is provided in section 4.3 and appendix B.3.
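The following Keras/TensorFlow snippet illustrates the layer types just introduced (a strided, unpadded convolution, batch normalisation, a leaky-ReLU activation, average and max pooling, an Inception-style concatenation of parallel branches, dropout and global average pooling). The filter counts and layout are arbitrary; this is not the network used in section 4.3.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 256, 256, 1))                             # one dummy single-channel patch

# Conv(3,3,2,2,v,32): 3x3 kernel, stride 2, unpadded ('valid'), 32 output channels
conv = layers.Conv2D(32, 3, strides=2, padding="valid")(x)            # (1, 127, 127, 32)
conv = layers.BatchNormalization()(conv)                              # normalise amplitudes
conv = layers.LeakyReLU(0.03)(conv)                                   # activation

avg = layers.AveragePooling2D(pool_size=2, strides=2, padding="same")(conv)
mx = layers.MaxPooling2D(pool_size=2, strides=2, padding="same")(conv)

# Inception-style horizontal layout: parallel convolutions joined by a concatenation layer
b1 = layers.Conv2D(16, 1, padding="same")(avg)
b2 = layers.Conv2D(16, 3, padding="same")(avg)
merged = layers.Concatenate()([b1, b2])                               # channels stack up

dropped = layers.Dropout(0.33)(merged)                                # regularisation
features = layers.GlobalAveragePooling2D()(dropped)                   # one value per channel
print(conv.shape, merged.shape, features.shape)
```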

3.3 Feature-based classification

We now turn our attention to the classification function . We investigate two different approaches to classification. The first one is a nearest-neighbour-search scheme based on distances in feature space. The other approach, based again on a class of neural networks, uses regression through a training set to find the optimal mapping between features and labels.

3.3.1 Feature space distances

In the following we denote with T all those feature vectors that belong to a sample from the training set, and with T_c the subset which belongs only to class c of the training set. We calculate a Fisher discriminant (e.g. Bishop, 2006) to find suitable classification weights W_f for each individual feature f,

W_f = \frac{\frac{1}{N} \sum_{c=1}^{N} \left( \bar{F}_{f,c} - \bar{F}_f \right)^2}{\frac{1}{N} \sum_{c=1}^{N} \sigma^2_{f,c}} ,    (14)

where N is the total number of classes, \bar{F}_{f,c} is the mean of feature f within class c, \bar{F}_f its mean over the full training set, and \sigma^2_{f,c} is the variance of the feature within class c.

Once we have found the weights we can define a weighted nearest-neighbour distance (WNN) of any feature vector F to each of the classes in the training set,

d_{\rm WNN}(F, c) = \min_{T \in T_c} \sum_{f=1}^{N_f} W_f \left( F_f - T_f \right)^2 ,    (15)

where N_f is the length of the feature vector. The problem with this WNN distance is the fact that it is based only on a single element of the training set, the one that minimises the sum in equation 15. To remedy this, Orlov et al. (2006) introduced a weighted neighbour distance (WND), which takes into account the distance to all elements in the training set but strongly penalises large distances through a free exponent p,

d_{\rm WND}(F, c) = \left[ \sum_{T \in T_c} \left( \sum_{f=1}^{N_f} W_f \left( F_f - T_f \right)^2 \right)^{-p} \right]^{-1/p} .    (16)

Orlov et al. (2006) found that the results do not strongly depend on the exact value of p once it is sufficiently large, and that p = 5 is a generally good, numerically stable, choice. The final step in order to make predictions is to define a similarity from a distance of choice, e.g. WNN or WND, and to normalise appropriately,

p_c = \frac{1 / d(F, c)}{\sum_{c'=1}^{N} 1 / d(F, c')} .    (17)
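A compact numpy sketch of this distance-based classifier, written to follow equations (14)-(17) as reconstructed above. The soft-minimum form of the WND and the exponent value are assumptions (p = 5 being the wnd-charm default), and the small epsilon terms merely guard against division by zero.

```python
import numpy as np

def fisher_weights(feats, labels):
    """Fisher discriminant per feature (eq. 14): variance of the class means over
    the mean within-class variance."""
    classes = np.unique(labels)
    class_means = np.array([feats[labels == c].mean(axis=0) for c in classes])
    class_vars = np.array([feats[labels == c].var(axis=0) for c in classes])
    return class_means.var(axis=0) / (class_vars.mean(axis=0) + 1e-12)

def wnd_distance(x, class_feats, weights, p=5.0):
    """Weighted neighbour distance (eq. 16) of one feature vector to one class."""
    d = np.sum(weights * (class_feats - x) ** 2, axis=1)    # weighted distance to each member
    return np.sum((d + 1e-12) ** -p) ** (-1.0 / p)          # soft minimum over the class

def predict(x, train_feats, train_labels, weights):
    """Similarity-based prediction vector (eq. 17), normalised to sum to one."""
    classes = np.unique(train_labels)
    dists = np.array([wnd_distance(x, train_feats[train_labels == c], weights)
                      for c in classes])
    sim = 1.0 / (dists + 1e-12)
    return sim / sim.sum()
```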

3.3.2 Fully connected neural networks

A different approach to the classification task is another form of neural network (equation 11). The main layer in such a network is a fully connected – sometimes called affine – layer FC(n), which implements a linear mapping between an input vector of length m and an output vector of length n using a matrix of free parameters and an additional bias parameter (see appendix B.1). Such layers are again chained together and the last layer produces an output vector of the same length as the number of classes. As before, in between those layers one may use dropout, activation and normalisation layers. The top of the network is followed by a specific activation function called a softmax (see appendix B.2), which provides the desired predictions p.

Since the optimal weights w_C are found by a regression, we define a loss function L, which in the case of this classification problem is a categorical cross entropy,

L(y, p) = - \sum_{i=1}^{n} y_i \ln p_i ,    (18)

where the y_i are the labels of the elements in the training data and the p_i their class predictions given the current set of parameters w_C. In order to minimise the loss, while continuously looping over the training data, we use a specific implementation of stochastic gradient descent called ADAM (Kingma & Ba, 2014). The gradients of our model are thereby calculated via a back-propagation algorithm (Rumelhart et al., 1986). We end this description of our methodology by noting that for a full feature extraction and classification chain C(f(d)), with a CNN as feature extractor f and a neural network as classifier C, the classification and feature extraction weights can be optimised at the same time.
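In Keras, such a fully connected classifier and its training loop reduce to a few lines. The sketch below uses the simple topology later listed in table 4 and the batch size of section 4.2; the loss and optimiser follow equation (18) and the text above, while the variable names and number of epochs are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Feature-vector classifier: FC(32) -> leaky ReLU(0.03) -> FC(9) -> softmax
model = models.Sequential([
    layers.Dense(32, input_shape=(99,)),   # use input_shape=(2919,) for the wnd-charm vector
    layers.LeakyReLU(0.03),
    layers.Dense(9),
    layers.Softmax(),
])

# Categorical cross entropy (eq. 18) minimised with the ADAM optimiser
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# x_train: (n_samples, 99) feature vectors; y_train: one-hot labels of shape (n_samples, 9)
# model.fit(x_train, y_train, batch_size=128, epochs=100, validation_data=(x_val, y_val))
```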

3.4 Numerical setup

As mentioned earlier, we use the Python package LensTools (https://github.com/apetri/LensTools; Petri, 2016) for the extraction of the standard map descriptors from section 3.2.1. For the computer vision fixed features from section 3.2.2 we slightly adapted the publicly available version of wnd-charm (https://github.com/wnd-charm/wnd-charm). We altered the C++ version of the feature extraction algorithm to accept FITS files (Hanisch et al., 2001) as an input image container, with pixel values as double-precision floating-point numbers. We then use the feature output files of wnd-charm as an input for our own distance-based classification pipeline written in Python. We make these routines publicly available in a repository (https://bitbucket.org/jmerten82/mydnn). All deep learning elements of our analysis stack use the widely adopted tensorflow framework (https://www.tensorflow.org/), which uses NVIDIA's cuDNN library (Chetlur et al., 2014) to carry out tensor operations on GPUs. We pair the tensorflow backend with the high-level deep learning Python interface keras (https://keras.io/) as a frontend. The network training was carried out on two NVIDIA Titan Xp GPUs. All convergence maps and Jupyter (http://jupyter.org/) notebooks used to produce the results in this work are either linked to or publicly available in the aforementioned repository; we refer the reader to its 'reproducible_science' folder.

4 Results

Section 3 introduced a number of methods to perform mass map characterisation and classification. We now present the results obtained by applying those methods and provide details on their training process with the help of the validation sets. For the most successful method we investigate the dependence of the results on the convergence map source redshift, and we end this section with a closer look at the most relevant features which are extracted by the different methods. If not stated otherwise, the results in this section are based on training, validation and test set maps at our reference source redshift.

4.1 Classification based on feature distance

For the case of the distance-based classifier from section 3.3.1, the training process is simply the derivation of the Fisher weights shown in equation 14. We calculate them using the training set and present the 20 top-ranked features for the classical descriptors in table 2 and for the wnd-charm features in table 3. For the first case, we see quite a mix of features near the top, with the power spectrum and peak counts being the most important ones. For the wnd-charm features, however, the ranking is completely dominated by Zernike coefficients of transformations of the image, with a few contributions from Haralick textures. One should keep in mind, though, that we extract a total of 2919 features, out of which 51 carry particularly high weights, 868 carry non-negligible weights, and only 193 have a vanishing weight. It is the combination of all the non-zero weights which will lead to the distance-based classification later on.

Rank Name Index Weight
1 Power spectrum 11 0.106
2 " 10 0.104
3 " 9 0.092
4 " 12 0.084
5 Peak counts 13 0.083
6 Power spectrum 8 0.078
7 Peak counts 12 0.078
8 Power spectrum 7 0.065
9 Peak counts 5 0.064
10 Skewness 0.059
11 Peak counts 14 0.055
12 Power spectrum 6 0.051
13 Percentile 100 0.051
14 Minkowski functional 1 14 0.049
15 Percentile 0 0.049
16 Power spectrum 13 0.046
17 Minkowski functional 2 14 0.045
18 Peak counts 11 0.044
19 Power spectrum 5 0.042
20 Kurtosis 0.042
Table 2: The top-ranked classical mass map features according to their Fisher score (equation 14). The meaning of each feature and the explanation of its index can be found in section 3.2.1.
Rank Transform Name Index Weight
1 F Zernike coefficients 20 0.285
2 F " 42 0.270
3 F(W) " 50 0.255
4 F(E) " 52 0.242
5 F(W) " 21 0.236
6 F(E) " 39 0.214
7 F " 12 0.205
8 F(W) " 22 0.204
9 F(E) " 37 0.196
10 F(W) " 56 0.183
11 F(E) " 5 0.174
12 F(E) " 28 0.170
13 W Haralick textures 5 0.169
14 F Zernike coefficients 17 0.166
15 F(E) " 34 0.166
16 F(E) Haralick textures 0 0.164
17 F(E) " 14 0.161
18 F(E) Zernike coefficients 24 0.159
19 F " 60 0.154
20 Edge features 0 0.152
Table 3: The same as table 2 but for the wnd-charm features. We refer the reader to appendix A for the definition of each feature and the exact meaning of the feature index and the transform column.

For the 99 standard features we find a total classification success rate of 22%, meaning that out of 14742 samples in the test set only 3243 were classified correctly. For some specific classes the classification success rate is barely above the success rate of a random guess (11%); one important class, for example, shows a success rate of only 13%. The picture improves marginally when using the 2919 wnd-charm features instead: the total classification success over all classes rises mildly to 25%. While several models still show success rates around or even below 11%, at least some classes are now significantly above the 20% level. We do not show more details on the distance-based classification (a full success rate analysis is provided in the repository associated with this article, https://bitbucket.org/jmerten82/mydnn), since it is clear from these results already that this classification method does not qualify for a successful discrimination of our models.

4.2 Classification based on neural network

We now use the same sets of fixed features but feed them into a fully-connected neural network for classification. For the case of the 99 standard features we show the very simple topology of the classification network in table 4. The same network is used to classify the 2919 wnd-charm features, but due to the larger input vector the number of free parameters is larger, which we indicate by a square-bracket notation in the same table. The regression to find the optimal parameters of the fully-connected layers is based on the training set. In total we train with 110601 feature vectors of shape (1,1,99) or (1,1,2919), where one iteration over all those elements during the regression is commonly called an epoch. Gradient evaluations and corresponding changes to the network parameters are made after a subset of an epoch, usually called a batch; the batch size in this case was set to 128. After each epoch, we evaluate the current performance of the network with the 22113 (2457 per class) feature vectors in the validation set. Figure 1 shows, for both feature sets, the evolution of the loss function for the training and validation data as a function of training epoch. For the larger feature vector, the validation loss starts to saturate around epoch 70 while the training loss keeps declining. This indicates that the network starts to overfit, meaning that it learns training-set-specific features which are of no use to characterise the validation set or any data unknown to the model. This is where we stop the training and save the model parameters which produced the smallest validation loss.

Figure 1: The evolution of the loss as a function of epoch for the training of the neural network. Shown are both cases, where either the smaller vector of standard features or the larger set of wnd-charm features is the input to the network.
Index Layer free parameters Output shape
1 Input 0 (1,1,99[2919])
2 FC(32) 3200 [93440] (1,1,32)
3 leakyReLU(0.03) 0 (1,1,32)
4 FC(9) +297 (1,1,9)
5 Softmax 0 (1,1,9)
Output =3497 [93737] 9
Table 4: The sequence of layers used in the neural network to classify fixed mass map features. The output shape notation follows the convention introduced in section 3. The description of all layers can be found in sections 3.2.3 and 3.3.2, their formal definition in appendix B.1 and B.2. The numbers in square brackets refer to the case where the wnd-charm feature vector is used instead of the smaller vector of classical features.

The neural network classification yields significantly better results than the classification based on feature-space distances. In the case of the 99 standard features the total classification rate rises to 39%, and to 35% in the case of the wnd-charm features. Most interestingly, the smaller vector of 99 classical features produces better results than the much larger feature vector provided by wnd-charm. Some of the most discriminative features from the computer vision method shown in table 3 certainly describe the data well and should be used in future analyses; however, once the information from all standard descriptors such as the binned power spectrum, peak counts and Minkowski functionals is combined in an optimal way by a neural network, there is no advantage in using features that are inspired by computer vision only. Table 5 shows the classification success matrix for the standard features, where each row refers to a subset of the test data comprising only maps from the true class labelled in the first column. The first number in each block of four shows how many times the 1638 members of this subset have been sorted into the respective predicted class, which is indicated by the label in the very first row. The second number is the percentage of predictions with respect to the total number of maps in the class. The third and fourth numbers are the mean and the standard error of the prediction probability for all maps in the subset given by the row, for the class predictions indicated by the label of the column. For an optimal classification, only the diagonal of this matrix (the fields typeset in boldface) would show non-zero values.

CDM
958 80 157 103 97 137 26 24 56
58% 5% 10% 6% 6% 8% 2% 1% 3%
70 1135 1 72 10 18 31 116 185
4% 69% 0% 4% 1% 1% 2% 7% 11%
232 6 857 158 275 84 10 5 11
14% 0% 52% 10% 17% 5% 1% 0% 1%
149 90 173 651 254 120 38 52 111
9% 5% 11% 40% 16% 7% 2% 3% 7%
223 22 485 333 413 98 18 14 32
14% 1% 30% 20% 25% 6% 1% 1% 2%
193 103 44 181 65 512 126 171 243
12% 6% 3% 11% 4% 31% 8% 10% 15%
128 194 30 163 39 377 147 257 303
8% 12% 2% 10% 2% 23% 9% 16% 18%
71 262 18 164 39 279 109 345 351
4% 16% 1% 10% 2% 17% 7% 21% 21%
CDM 69 265 5 135 9 158 60 166 771
4% 16% 0% 8% 1% 10% 4% 10% 47%
Table 5: The classification success matrix for the neural-network-based classification of the classical features. Each row represents a different subset of the test data, indicated by the first column. The first number in each block of four in a column is the number of samples in the subset that was assigned to the predicted class indicated by the column label at the top. The second number is the relative classification success rate for the subset. The third number is the mean of all predictions in the subset and the fourth number is its standard error. The success rates indicate that only two models, and to a lesser degree three further classes including ΛCDM, are well separated from the other models, with correct classification rates of 40% or above and misclassification rates into other models of 17% or less. The models with non-vanishing neutrino mass remain largely indistinguishable from one another, and even the model with a 31% success rate still shows a sizeable degeneracy with ΛCDM (15% misclassification rate).

While table 5 gives a good indication of what to expect from a classification of single maps, only the mean and its standard error on the class predictions give an idea of how well the full ensemble of test set maps is classified. We therefore further evaluate the statistics of the prediction vectors for each true test set class. Figure 2 shows nine panels of box plots, each of which represents the statistics for one such subset. The black box in each panel represents the correct predictions, equivalent to the bold diagonal of table 5. The horizontal line spanning each panel is the median of all the true class predictions, and the error band shows the scatter of medians derived from 1000 bootstrap samples (this number is of course arbitrary, but it is close to the sample size and we checked that the bootstrap-derived error does not depend significantly on the number of bootstraps). The upper and lower end of each box show the 75th and 25th percentiles, respectively, and the whiskers show the outlier-cleaned minimum and maximum value of the class predictions. Whenever a box which is not the true label is shown in green, it means that its median and its errors, indicated by the notches of each box, are lower than those of the correct-prediction box and do not overlap with its horizontal error band. If those criteria are not met, the respective box is shown in red.

Figure 2: The prediction statistics of the classical feature vector classified by a neural network. Each labelled panel represents all predictions for one true class of the test set. In every panel, each box summarises the statistics of the model predictions indicated by the bottom labels. The median and its bootstrap error for the correct prediction are shown by the red line with error band. For this method only two models are clearly distinguished from the others; in particular, the models with varying neutrino mass remain largely degenerate among themselves, but also with ΛCDM.

When looking at the results in table 5 and figure 2, the following observations catch the eye. Although the overall classification success rate is only 39%, none of the classes is classified incorrectly as an ensemble. For the two best-separated models we see a clear separation between the correct predictions and the other classes. This is confirmed by the classification matrix, which shows no substantial overlap with any other model. This however changes for the remaining f(R) families and their massive-neutrino variants. Although the median for the correct predictions is the highest for nearly all of these models, the degeneracies within the same model of gravity are strong, as one can see from the essentially equal heights of the centres of the boxes in figure 2 and from the classification matrix, which lists misclassification rates of up to 30% between members of the same family. More severe still is the situation for one of the families, for which we find substantial overlap of up to 21% with ΛCDM. Even the predictions for ΛCDM itself are not completely separate from these models, with an overlap of up to 16%.

4.3 Convolutional neural network

The convolutional neural network (CNN) extracts the characterising features directly from the pixel data of the training mass maps. We have experimented with a number of architectures, including classic topologies which implement a large number of convolutions inspired by VGG-net (Simonyan & Zisserman, 2014), as well as architectures presented in Ravanbakhsh et al. (2017) and Gupta et al. (2018). The model that worked best for our purposes is almost exclusively based on the Inception layers first presented in Szegedy et al. (2014). Here we adopt one of its latest iterations, version 4 introduced in Szegedy et al. (2016). The global linear structure of our CNN is shown in table 6 and we describe in detail the different elements of this network and their purpose in appendix B.3.

Layer free parameters Output shape
1 Input 0 (256,256,1)
2 Conv(3,3,2,2,v,32) 288 (127,127,32)
3 lReLU(0.03) 0 (127,127,32)
4 Conv(3,3,1,1,v,32) +9216 (125,125,32)
5 lReLU(0.03) 0 (125,125,32)
6 Conv(3,3,1,1,s,64) +9216 (125,125,32)
7 StemInception (Fig. 9) +555008 (29,29,384)
8 InceptionA (Fig. 10) +316416 (29,29,384)
9 ReductionA (Fig. 11) +2304000 (14,14,1024)
10 InceptionB (Fig. 12) +2931712 (14,14,1024)
11 ReductionB (Fig. 13) +2744320 (6,6,1536)
12 InceptionC (Fig. 14) +4546560 (6,6,1536)
13 GlobalAvgPool 0 (1,1,1536)
14 Dropout(0.33) 0 (1,1,1536)
15 FC(9) +13833 (1,1,9)
16 Softmax 0 (1,1,9)
17 Output =13469865 9
Table 6: The sequential structure of the CNN used in this work. All convolutional layers are batch normalised. The more complicated Inception layers are shown in the respective figures.
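The first rows of table 6 translate into Keras roughly as below. The Inception and Reduction modules of appendix B.3 are elided, so this is only a structural sketch of the backbone, not the full network with its roughly 13.5 million parameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_lrelu(x, n, stride=1, padding="valid"):
    """Batch-normalised 3x3 convolution with a leaky ReLU, as in rows 2-6 of table 6."""
    x = layers.Conv2D(n, 3, strides=stride, padding=padding, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.03)(x)

inputs = layers.Input(shape=(256, 256, 1))
x = conv_bn_lrelu(inputs, 32, stride=2, padding="valid")   # Conv(3,3,2,2,v,32) -> (127,127,32)
x = conv_bn_lrelu(x, 32, padding="valid")                  # Conv(3,3,1,1,v,32) -> (125,125,32)
x = conv_bn_lrelu(x, 64, padding="same")                   # Conv(3,3,1,1,s,64)
# ... StemInception, InceptionA/B/C and ReductionA/B modules would be inserted here ...
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.33)(x)
outputs = layers.Dense(9, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```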

We visualise the evolution of the network’s loss during training in figure 3.

Figure 3: The evolution of the loss as a function of epoch for the training of the CNN.

The total classification success rate of the CNN is 52% and its classification success matrix is shown in table 7.

CDM
1307 116 36 21 31 88 12 5 22
80% 7% 2% 1% 2% 5% 1% 0% 1%
51 1298 0 11 0 7 8 27 236
3% 79% 0% 1% 0% 0% 0% 2% 14%
105 1 1065 90 320 48 2 1 6
6% 0% 65% 5% 20% 3% 0% 0% 0%
103 37 161 721 347 78 14 31 146
6% 2% 10% 44% 21% 5% 1% 2% 9%
122 5 624 271 514 70 5 7 20
7% 0% 38% 17% 31% 4% 0% 0% 1%
51 27 11 42 44 968 74 307 114
3% 2% 1% 3% 3% 59% 5% 19% 7%
36 46 11 40 35 713 95 458 204
2% 3% 1% 2% 2% 44% 6% 28% 12%
20 79 5 35 17 558 64 565 295
1% 5% 0% 2% 1% 34% 4% 34% 18%
CDM 41 179 0 43 3 99 43 144 1086
3% 11% 0% 3% 0% 6% 3% 9% 66%
Table 7: The classification success matrix for the CNN. The general structure of the table is the same as in table 5. We see successful classifications for several models, including ΛCDM. However, large degeneracies remain within the neutrino-mass variants of the f(R) families, and in some cases the wrong predictions can even outnumber the correct ones.

Compared to the fixed-feature results in table 5 we find much larger true prediction rates for most models, with a few exceptions. Figure 4 shows the statistics of the predictions for all classes in the test set and reveals that two of the models, even as an ensemble, cannot be classified correctly by the CNN, since the error bars on the medians of their predictions overlap with those of other models.

Figure 4: Prediction statistics for the CNN at the reference source redshift. The structure of the figure is the same as in figure 2. The CNN discriminates more clearly between the models, and several classes that were previously degenerate are now clearly distinguished. Problems remain for the different neutrino-mass realisations within the f(R) families, and one model is incorrectly classified as another.

However, the degeneracy with ΛCDM is now broken for all models and the CNN robustly discriminates most of the nine models from each other.

4.4 Dependence on redshift

Our reference source redshift is realistic for future space- and ground-based surveys, but it is certainly optimistic for current ground-based ones. On the other hand, it also does not test the full potential of our classification methods, since one would expect a better classification accuracy for higher source redshifts. We therefore repeat the training and classification for one lower and one higher source redshift. For simplicity we restrict this analysis to the CNN, which delivered the best results.

For the lower source redshift the overall accuracy drops significantly, from 52% to 44%. When comparing the prediction statistics of the full set at this redshift in figure 5 with the reference case in figure 4, one can see that the decrease in overall accuracy mainly stems from a weaker separation of the two best-performing models. The known degeneracies between the different neutrino masses within the f(R) families are already present and become more prominent, and the issue of model misclassification within f(R) gravity gets worse, with now two misclassifications.

Figure 5: Prediction statistics for the CNN at the lower source redshift. Compared to figure 4, the separation between the models becomes washed out. Two misclassifications occur, and one set of samples can no longer be distinguished as an ensemble from another, since their prediction medians overlap within the error bars.

The improvements when going from the lower to the higher source redshift are highlighted by figure 6. At the higher redshift the network's ability to distinguish between the base models increases and the overall classification accuracy is now 59%. The discrimination accuracy for massive neutrinos within each gravity model also increases, and only two models with massive neutrinos still show significant overlap. Those models are also the only ones that show residual, but insignificant, overlap with ΛCDM. Given that one ensemble of maps still gets misidentified, it is clear that the discrimination within this family of models remains an issue even at a larger source redshift.

Figure 6: Prediction statistics for the CNN at the higher source redshift. Only one model remains degenerate given the error bar on the median of its sample predictions; in fact, it is misclassified by the CNN.

As a last analysis using the CNN we perform a tomographic classification. For each line-of-sight realisation we do not use a single mass map at a specific source redshift, but instead feed data vectors of shape (256, 256, 4) into the CNN, where the four channels correspond to four different source redshifts. The classification success matrix for this analysis is shown in table 8 and figure 7 shows the familiar box-plot representation of the prediction-vector statistics. The overall classification success rate rises to 76% and all but two models now show correct classification rates of 74% or clearly above. The probability of correctly classifying a single map in those two models is only 38% or 50%, respectively; however, a look at figure 7 reveals that they are correctly classified as an ensemble and at high significance. Finally, it is worth noting that the ΛCDM maps are misclassified into any other model at most 4% of the time according to table 8.
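The only change with respect to the single-redshift runs is the channel dimension of the input, as in this trivial numpy sketch (the ordering of the four channels and the function name are illustrative):

```python
import numpy as np

def tomographic_stack(patches_by_redshift):
    """Stack the convergence patches of one line of sight, ordered by source redshift,
    into a single multi-channel CNN input."""
    return np.stack(patches_by_redshift, axis=-1)

# four single-redshift (256, 256) patches -> one (256, 256, 4) data vector
channels = [np.zeros((256, 256)) for _ in range(4)]
print(tomographic_stack(channels).shape)   # (256, 256, 4)
```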

CDM
1618 0 7 0 10 3 0 0 0
99% 0% 0% 0% 1% 0% 0% 0% 0%
0 1501 0 1 0 0 1 20 115
0% 92% 0% 0% 0% 0% 0% 1% 7%
3 0 1257 2 375 1 0 0 0
0% 0% 77% 0% 23% 0% 0% 0% 0%
0 0 1 1470 96 22 7 2 40
0% 0% 0% 90% 6% 1% 0% 0% 2%
0 0 130 207 1289 12 0 0 0
0% 0% 8% 13% 79% 1% 0% 0% 0%
0 0 0 15 1 1206 275 30 111
0% 0% 0% 1% 0% 74% 17% 2% 7%
0 0 0 7 0 363 627 319 322
0% 0% 0% 0% 0% 22% 38% 19% 20%
0 3 0 3 0 77 385 827 343
0% 0% 0% 0% 0% 5% 24% 50% 21%
CDM 0 1 0 6 0 57 69 37 1468
0% 0% 0% 0% 0% 3% 4% 2% 90%
Table 8: The classification success matrix for the tomographic analysis using the CNN. The general structure of the table is the same as in table 5. We see good classification rates above 79%, and typically above 90%, for all models but one family; even the member of that family with vanishing neutrino mass is correctly classified 74% of the time. The remaining degeneracies are limited to two models with 38% and 50% classification accuracy, respectively, but given the error bars on the prediction mean the degeneracy is not significant for the ensemble of mass maps in the test set.
Figure 7: Prediction statistics for the tomographic analysis with the CNN. For many classes the classification is so good that the prediction samples cluster around the optimal value of 1. All models are correctly classified and, within the error bars of the prediction medians, no model remains observationally degenerate. Only small similarities remain between the members of one family of models with varying neutrino mass.

4.5 Remarks on extracted features

After presenting the raw classification results for the different methods, we now briefly investigate what insight can be gained into the actual meaning and importance of the specific features that drive the classification success of each method. To do so, we take a closer look at the training process. The first important observation is strikingly highlighted in table 3, which shows that almost all of the most discriminating wnd-charm features are Zernike coefficients derived from the Fourier transform of the raw image, or from the Fourier transform of the edge- or wavelet-processed image. This is interesting since Zernike polynomials were originally introduced to describe the effects of certain optical elements, such as lenses or reflecting surfaces, in optical imaging (Zernike, 1934). This suggests that a decomposition of mass maps into a function set that has a well-defined physical meaning does indeed lead to a good general representation of our data. In addition, all of those features are derived from transformations of the raw mass map, which shows the power of filtering the input data, as e.g. demonstrated by Peel et al. (2018). The ranking of the standard features shown in table 2 is less dominated by a single class, although the power spectrum and peak counts seem most relevant. The good results with a neural network as classifier show that the optimal combination of such classical features leads to a good classification even without the need for additional descriptors.

CNNs often deliver superior results compared to other methods for certain tasks, but it is often believed that they are harder to understand and interpret. We attempt to dispel this belief by applying visualisation techniques to the different filters linked together in a deep neural network (Girshick et al., 2013; Zeiler & Fergus, 2013; Szegedy et al., 2013; Springenberg et al., 2014) in order to reveal the inner workings of this complex model. We follow the approach of Simonyan et al. (2013) to extract our filter responses (see also https://github.com/keras-team/keras/blob/master/examples/conv_filter_visualization.py). Starting from an image of random numbers with the same shape as our mass maps, we retrieve the output of every convolutional layer in the network and perform a gradient ascent in order to maximise the response of those layers. While this is of course not a unique solution, the result of the final iteration of the ascent represents an example which triggered a strong response at a particular depth in the network. In figure 8 we show a few examples. The top row shows the four channels which had the strongest loss compared to the initial random image in CNN layer one, that is, the convolution marked with index 2 in table 6. The second row shows the top four channel responses of the convolution with stride two just above the input layer in figure 9. The row marked InceptionA shows the most responsive channels among all four convolutions just below the concatenation layer in figure 9, and equivalently for the rows marked InceptionB and C.
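A TensorFlow 2 sketch of this gradient ascent, in the spirit of the keras example referenced above; the layer name, channel index, step count and step size are illustrative assumptions, and the gradient normalisation follows common practice rather than any specific recipe from the paper.

```python
import tensorflow as tf

def visualise_filter(model, layer_name, channel, steps=100, step_size=10.0):
    """Gradient ascent on a random input that maximises the mean response of one
    channel of a chosen convolutional layer (after Simonyan et al. 2013)."""
    layer = model.get_layer(layer_name)
    extractor = tf.keras.Model(model.inputs, layer.output)
    image = tf.Variable(tf.random.uniform((1,) + tuple(model.input_shape[1:]), -0.1, 0.1))
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(extractor(image)[..., channel])
        grad = tape.gradient(loss, image)
        grad /= tf.sqrt(tf.reduce_mean(tf.square(grad))) + 1e-8   # normalised gradient ascent
        image.assign_add(step_size * grad)
    return image.numpy()[0]

# e.g. visualise_filter(model, layer_name="conv2d", channel=0)
```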

Figure 8: Visualisations of the convolutional filters applied by the CNN at different depths of the network.

As is typical for CNNs (Zeiler & Fergus, 2013), the very first level extracts regular horizontal and vertical stripe patterns from the image. The stripes turn into a grid pattern deeper into the network and, by the end of the InceptionA layer, we can identify patterns of peaks and troughs which are grouped either regularly or along larger structures. It is not surprising that the earlier layers of the network, up to InceptionA, perform a global filtering of the map that highlights structure, as long as the image still consists of a relatively large number of pixels. Only from the finer InceptionB layers onwards are more specific structures picked up, such as objects that resemble individual clusters or voids. It is such detailed analyses of the inner structure of trained CNNs that will lead to a deeper understanding of why these networks work so well, and potentially to the development of more specific algorithms with lower numerical cost but similar or better classification performance.

5 Conclusions

We studied the ability of different kinds of machine learning techniques to discriminate between highly degenerate cosmological models which combine the effects of modified gravity and massive neutrinos on structure formation. For this purpose we used a subset of the DUSTGRAIN-pathfinder simulation suite, comprising nine models that combine different strengths of modified gravity with a range of neutrino masses. Lensing convergence maps produced from these simulations provided the input for the different classification methods.

In order to characterise the mass maps we used three different approaches to feature extraction. Statistics commonly used in astrophysics, among them the power spectrum, peak counts and Minkowski functionals, were combined into a single feature vector. To probe features more common to the fields of computer vision and digital image processing, we used the publicly available wnd-charm algorithm, which produces a large feature vector combining a variety of common and more exotic descriptors and statistics. As the most flexible method of feature extraction we used a convolutional neural network (CNN). For classification we tested a nearest-neighbour method in feature space and a fully-connected neural network.

We provide an overview of the classification results from section 4 in table 9 and our results can be summarised as follows:

Features | Classifier | Total accuracy | Degenerate classes | Performance | Reference
classic | nearest neighbour | 22% | 8 | 14% / 15% () | repository
wnd-charm | nearest neighbour | 25% | 7 | 24% / 24% () | repository
classic | neural network | 39% | 3 | 47% / 16% () | Tab. 5, Fig. 2
wnd-charm | neural network | 36% | 4 | 42% / 24% () | repository
CNN | neural network | 44% | 3 | 52% / 15% () | Fig. 5
CNN | neural network | 52% | 2 | 66% / 11% () | Tab. 7, Fig. 4
CNN | neural network | 59% | 1 | 53% / 12% () | Fig. 6
CNN (tomographic) | neural network | 76% | 0 | 90% / 4% () | Tab. 8, Fig. 7
Table 9: A summary of the performance of the different methods used in this analysis. The first column indicates the feature extraction method as described in section 3.2 and the second the classification method introduced in section 3.3; the individual CNN rows correspond to different convergence map source redshifts. 'Degenerate classes' is the number of models for which the median and its error of the predictions for the true test set class overlap with the median and its error of the predictions for any other class. The 'Performance' column lists, for a particularly important class, the classification accuracy of each method for this model as well as the largest misclassification rate and the associated model. A reference to the detailed results of each model is given in the last column, where 'repository' points to the online repository mentioned in section 3.4.
  1. Nearest-neighbour classifiers based on distances in feature space do not deliver robust results. Whether the small classical feature vector or the longer version based on computer vision is used, the total classification accuracy stays below 25%. Eight out of the nine tested models remain observationally degenerate (we declare a model degenerate if the median and its error of the predictions for a true test set class overlap with the median and its error of the predictions for any other class; a minimal sketch of this check is given after this list).

  2. With the same classical or computer vision feature vectors, a neural network delivers a much more robust classification than the nearest-neighbour method. The total success rate for the classical feature vector is 39% and the number of degenerate models reduces to three.

  3. The longer feature vector containing 2919 features inspired by computer vision delivers a slightly worse classification of our models than the shorter vector of 99 classical descriptors. The total classification success rate is 3% lower and the method produces one additional degenerate model. Some of the computer vision features may well be useful, but currently we see no advantage in using features inspired by digital image processing over features well-established in cosmology.

  4. A CNN delivers the best classification results among the feature extraction methods, with 52% correct classifications at the reference source redshift. The number of degenerate models reduces to two, both of which belong to the same model of gravity.

  5. Classification success is clearly a function of mass map source redshift. Going from the reference to the lowest source redshift, the success rate of the CNN decreases by 8% and the number of degenerate models increases by one; going from the reference to the highest source redshift, the accuracy increases by 7% and the number of degenerate models reduces by one. This increase of the success rate with increasing redshift is not surprising, since more information relevant to structure formation is picked up along a deeper line-of-sight.

  6. When using a CNN in a tomographic analysis of four different mass map source redshifts along the same line-of-sight, all observational degeneracies are fully broken. The total classification success rate increases to 76%.
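As referenced in point 1 above, the degeneracy criterion can be written as a simple overlap test between per-class prediction medians and their errors. The sketch below assumes the per-map predicted class labels are available as arrays; the bootstrap error on the median is an illustrative choice, not necessarily the exact estimator used in the analysis.

```python
import numpy as np

def median_and_error(predictions, n_boot=1000, seed=None):
    """Median of the predicted class labels and a bootstrap error on that median."""
    rng = np.random.default_rng(seed)
    boot = [np.median(rng.choice(predictions, size=len(predictions))) for _ in range(n_boot)]
    return np.median(predictions), np.std(boot)

def degenerate_classes(predictions_per_true_class):
    """Count true classes whose median +/- error overlaps that of any other class.

    predictions_per_true_class: dict mapping true class label -> array of
    predicted labels for all test maps of that class.
    """
    stats = {c: median_and_error(p) for c, p in predictions_per_true_class.items()}
    n_degenerate = 0
    for c, (m, e) in stats.items():
        for c2, (m2, e2) in stats.items():
            if c2 != c and abs(m - m2) <= e + e2:  # error intervals overlap
                n_degenerate += 1
                break
    return n_degenerate
```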

A number of improvements to our methodology come to mind and we reserve them for future work. Firstly, the flexible features derived by a CNN can be combined with fixed features that are known to contribute to a successful classification of degenerate models. Secondly, instead of working on the raw image data, a clever transformation can be applied to the input data to enhance features that allow for the desired discrimination. We attempt such an approach in the context of machine learning in Peel et al. (2018b, PRL submitted). In fact, the CNN used in this work applies such transformations, as discussed in section 4.5. A careful analysis of the filtering performed by a CNN at the early levels of its filter chain can provide useful insight into the most powerful image transformation for a given classification task. Furthermore, a careful analysis of the filters at a much deeper level of the network might lead to additional insights into structure formation in the different models, since it is at this deeper level that individual structures are characterised and isolated by the algorithm.

Much work is left to be done before this machine learning approach to the classification of mass maps in different cosmological models can be applied to real data. In this work we limited ourselves to idealised noise-free maps in order to see how the different methodologies compare under optimal conditions. The influence of pixel shot-noise, observational systematics and practical issues such as masking and image artefacts needs to be studied in detail. Furthermore, since the currently most successful methods use a supervised training process with labelled data based on numerical simulations, it needs to be carefully investigated how closely those simulated maps resemble real observations. Without this important sanity check, even the best machine learning technique is useless, since it would be learning from the wrong data.

Acknowledgements

We would like to thank Ofer Springer for useful discussions about deep learning. JM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 664931. AP acknowledges support from an Enhanced Eurotalents Fellowship, a Marie Skłodowska-Curie Actions Programme co-funded by the European Commission and Commissariat à l’énergie atomique et aux énergies alternatives (CEA). CG and MB acknowledge support from the Italian Ministry for Education, University and Research (MIUR) through the SIR individual grant SIMCODE (project number RBSI14P4IH). CG and MM acknowledge support from the Italian Ministry of Foreign Affairs and International Cooperation, Directorate General for Country Promotion (Project "Crack the lens"). We also acknowledge the support from the grant MIUR PRIN 2015 "Cosmology and Fundamental Physics: illuminating the Dark Universe with Euclid"; and the financial contribution from the agreement ASI n.I/023/12/0 "Attività relative alla fase B2/C per la missione Euclid". The DUSTGRAIN-pathfinder simulations analyzed in this work have been performed on the Marconi supercomputing machine at Cineca thanks to the PRACE project SIMCODE1 (grant nr. 2016153604) and on the computing facilities of the Computational Center for Particle and Astrophysics (C2PAP) and of the Leibniz Supercomputer Center (LRZ) under the project ID pr94ji. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the two Titan Xp GPUs used for this research.

References

  • ATLAS Collaboration (2013) ATLAS Collaboration 2013, preprint, (arXiv:1309.4017)
  • Aartsen et al. (2013) Aartsen M. G., et al., 2013, Physical Review Letters, 110, 131302
  • Abbott et al. (2018) Abbott T. M. C., et al., 2018, Phys. Rev. D, 98, 043526
  • Abraham et al. (2003) Abraham R. G., van den Bergh S., Nair P., 2003, ApJ, 588, 218
  • Ackermann et al. (2017) Ackermann M., et al., 2017, ApJ, 840, 43
  • Alam et al. (2017) Alam S., et al., 2017, MNRAS, 470, 2617
  • Albert et al. (2017) Albert A., et al., 2017, ApJ, 834, 110
  • Amendola et al. (2018) Amendola L., et al., 2018, Living Reviews in Relativity, 21, 2
  • Arnold et al. (2014) Arnold C., Puchwein E., Springel V., 2014, MNRAS, 440, 833
  • Arnold et al. (2015) Arnold C., Puchwein E., Springel V., 2015, Mon. Not. Roy. Astron. Soc., 448, 2275
  • Arnold et al. (2016) Arnold C., Springel V., Puchwein E., 2016, MNRAS, 462, 1530
  • Arnold et al. (2018) Arnold C., Fosalba P., Springel V., Puchwein E., Blot L., 2018, ArXiv e-prints 1805.09824,
  • Baldi & Villaescusa-Navarro (2018) Baldi M., Villaescusa-Navarro F., 2018, Mon. Not. Roy. Astron. Soc., 473, 3226
  • Baldi et al. (2014) Baldi M., Villaescusa-Navarro F., Viel M., Puchwein E., Springel V., Moscardini L., 2014, Mon. Not. Roy. Astron. Soc., 440, 75
  • Barreira et al. (2016) Barreira A., Llinares C., Bose S., Li B., 2016, J. Cosmology Astropart. Phys., 5, 001
  • Bartelmann & Schneider (2001) Bartelmann M., Schneider P., 2001, Phys. Rep., 340, 291
  • Benitez et al. (2014) Benitez N., et al., 2014, ArXiv e-prints 1403.5237,
  • Bennett et al. (2013) Bennett C., et al., 2013, Astrophys.J.Suppl., 208, 20
  • Bernabei et al. (2018) Bernabei R., et al., 2018, ArXiv e-prints 1805.10486,
  • Bishop (2006) Bishop C. M., 2006, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA
  • Buchdahl (1970) Buchdahl H. A., 1970, MNRAS, 150, 1
  • CMS Collaboration (2016) CMS Collaboration 2016, preprint, (arXiv:1603.08914)
  • Castro et al. (2018) Castro T., Quartin M., Giocoli C., Borgani S., Dolag K., 2018, MNRAS, 478, 1305
  • Chetlur et al. (2014) Chetlur S., Woolley C., Vandermersch P., Cohen J., Tran J., Catanzaro B., Shelhamer E., 2014, CoRR, abs/1410.0759
  • Chollet (2017) Chollet F., 2017, Deep Learning with Python, 1st edn. Manning Publications Co., Greenwich, CT, USA
  • Dietrich & Hartlap (2010) Dietrich J. P., Hartlap J., 2010, MNRAS, 402, 1049
  • Fogel & Sagi (1989) Fogel I., Sagi D., 1989, Biological Cybernetics, 61, 103
  • Friedrich et al. (2018) Friedrich O., et al., 2018, Phys. Rev. D, 98, 023508
  • Fu et al. (2008) Fu L., et al., 2008, Astron. Astrophys., 479, 9
  • Giocoli et al. (2014) Giocoli C., Meneghetti M., Metcalf R. B., Ettori S., Moscardini L., 2014, MNRAS, 440, 1899
  • Giocoli et al. (2016) Giocoli C., et al., 2016, MNRAS, 461, 209
  • Giocoli et al. (2017) Giocoli C., et al., 2017, MNRAS, 470, 3574
  • Giocoli et al. (2018a) Giocoli C., Baldi M., Moscardini L., 2018a, ArXiv e-prints: 1806.04681,
  • Giocoli et al. (2018b) Giocoli C., Moscardini L., Baldi M., Meneghetti M., Metcalf R. B., 2018b, MNRAS,
  • Girshick et al. (2013) Girshick R., Donahue J., Darrell T., Malik J., 2013, preprint, (arXiv:1311.2524)
  • Goodfellow et al. (2016) Goodfellow I., Bengio Y., Courville A., 2016, Deep Learning. The MIT Press
  • Graves (2013) Graves A., 2013, CoRR, abs/1308.0850
  • Gruen et al. (2018) Gruen D., et al., 2018, Phys. Rev. D, 98, 023507
  • Gupta et al. (2018) Gupta A., Matilla J. M. Z., Hsu D., Haiman Z., 2018, Phys. Rev. D, 97, 103515
  • Hagstotz et al. (2018) Hagstotz S., Costanzi M., Baldi M., Weller J., 2018, ArXiv e-prints 1806.07400,
  • Hanisch et al. (2001) Hanisch R. J., Farris A., Greisen E. W., Pence W. D., Schlesinger B. M., Teuben P. J., Thompson R. W., Warnock III A., 2001, A&A, 376, 359
  • Haralick et al. (1973) Haralick R. M., Shanmugam K., Dinstein I., 1973, IEEE Transactions on Systems, Man, and Cybernetics, SMC-3, 610
  • He (2013) He J.-h., 2013, arXiv:1307.4876,
  • He et al. (2015) He K., Zhang X., Ren S., Sun J., 2015, preprint, (arXiv:1512.03385)
  • Herbel et al. (2018) Herbel J., Kacprzak T., Amara A., Refregier A., Lucchi A., 2018, J. Cosmology Astropart. Phys., 7, 054
  • Heymans et al. (2012) Heymans C., et al., 2012, MNRAS, 427, 146
  • Hikage et al. (2018) Hikage C., et al., 2018, preprint, (arXiv:1809.09148)
  • Hilbert et al. (2008) Hilbert S., White S. D. M., Hartlap J., Schneider P., 2008, MNRAS, 386, 1845
  • Hilbert et al. (2009) Hilbert S., Hartlap J., White S. D. M., Schneider P., 2009, A&A, 499, 31
  • Hildebrandt et al. (2017) Hildebrandt H., et al., 2017, MNRAS, 465, 1454
  • Hu & Sawicki (2007) Hu W., Sawicki I., 2007, Phys. Rev., D76, 064004
  • Ioffe & Szegedy (2015) Ioffe S., Szegedy C., 2015, CoRR, abs/1502.03167
  • Ivezic et al. (2008) Ivezic Z., et al., 2008, arXiv:0805.2366
  • Johnson et al. (2016) Johnson M., et al., 2016, CoRR, abs/1611.04558
  • Joudaki et al. (2017) Joudaki S., et al., 2017, MNRAS, 465, 2033
  • Kingma & Ba (2014) Kingma D. P., Ba J., 2014, CoRR, abs/1412.6980
  • Kratochvil et al. (2010) Kratochvil J. M., Haiman Z., May M., 2010, Phys. Rev. D, 81, 043519
  • Kratochvil et al. (2012) Kratochvil J. M., Lim E. A., Wang S., Haiman Z., May M., Huffenberger K., 2012, Phys. Rev. D, 85, 103513
  • Krizhevsky et al. (2012) Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Pereira F., Burges C. J. C., Bottou L., Weinberger K. Q., eds, Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp 1097–1105
  • Laureijs et al. (2011) Laureijs R., et al., 2011, eprint arXiv: 1110.3193,
  • LeCun et al. (2015) LeCun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436
  • Lin & Kilbinger (2018) Lin C.-A., Kilbinger M., 2018, A&A, 614, A36
  • Lin et al. (2017) Lin T., Goyal P., Girshick R. B., He K., Dollár P., 2017, CoRR, abs/1708.02002
  • Lucie-Smith et al. (2018) Lucie-Smith L., Peiris H. V., Pontzen A., Lochner M., 2018, MNRAS, 479, 3405
  • Martinet et al. (2018) Martinet N., et al., 2018, MNRAS, 474, 712
  • Motohashi et al. (2013) Motohashi H., Starobinsky A. A., Yokoyama J., 2013, Phys.Rev.Lett., 110, 121302
  • Naik et al. (2018) Naik A. P., Puchwein E., Davis A.-C., Arnold C., 2018, ArXiv e-prints 1805.12221,
  • Ntampaka et al. (2018) Ntampaka M., et al., 2018, preprint, (arXiv:1810.07703)
  • Orlov et al. (2006) Orlov N., Johnston J., Macura T., Wolkow C., Goldberg I., 2006, in 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2006.. pp 1152–1155, doi:10.1109/ISBI.2006.1625127
  • Orlov et al. (2008) Orlov N., Shamir L., Macura T., Johnston J., Eckley D. M., Goldberg I. G., 2008, Pattern recognition letters, 29, 1684
  • Otsu (1979) Otsu N., 1979, IEEE Transactions on Systems, Man, and Cybernetics, 9, 62
  • Parkinson et al. (2012) Parkinson D., et al., 2012, Phys. Rev. D, 86, 103518
  • Peel et al. (2017) Peel A., Lin C.-A., Lanusse F., Leonard A., Starck J.-L., Kilbinger M., 2017, A&A, 599, A79
  • Peel et al. (2018) Peel A., Pettorino V., Giocoli C., Starck J.-L., Baldi M., 2018, preprint, (arXiv:1805.05146)
  • Perlmutter et al. (1999) Perlmutter S., et al., 1999, Astrophys. J., 517, 565
  • Petri (2016) Petri A., 2016, Astronomy and Computing, 17, 73
  • Petri et al. (2015) Petri A., Liu J., Haiman Z., May M., Hui L., Kratochvil J. M., 2015, Phys. Rev. D, 91, 103511
  • Petri et al. (2017) Petri A., Haiman Z., May M., 2017, Phys. Rev. D, 95, 123503
  • Pezzotta et al. (2017) Pezzotta A., et al., 2017, A&A, 604, A33
  • Planck Collaboration et al. (2016a) Planck Collaboration et al., 2016a, A&A, 594, A13
  • Planck Collaboration et al. (2016b) Planck Collaboration et al., 2016b, A&A, 594, A24
  • Planck Collaboration et al. (2018) Planck Collaboration et al., 2018, ArXiv e-prints 1807.06209,
  • Prewitt (1970) Prewitt J., 1970, Picture Processing and Psychopictorics. New York: Academic Press
  • Puchwein et al. (2013) Puchwein E., Baldi M., Springel V., 2013, MNRAS, 436, 348
  • Radon (1917) Radon J., 1917, in Berichte über die Verhandlungen der Königlich-Sächsischen Akademie der Wissenschaften zu Leipzig. Teubner, pp 262–277, doi:10.1109/TMI.1986.4307775
  • Ravanbakhsh et al. (2017) Ravanbakhsh S., Oliva J., Fromenteau S., Price L. C., Ho S., Schneider J., Poczos B., 2017, preprint, (arXiv:1711.02033)
  • Riess et al. (1998) Riess A. G., et al., 1998, Astron. J., 116, 1009
  • Rodriguez et al. (2018) Rodriguez A. C., Kacprzak T., Lucchi A., Amara A., Sgier R., Fluri J., Hofmann T., Réfrégier A., 2018, preprint, (arXiv:1801.09070)
  • Roncarelli et al. (2018) Roncarelli M., Baldi M., Villaescusa-Navarro F., 2018, MNRAS, 481, 2497
  • Rumelhart et al. (1986) Rumelhart D. E., Hinton G. E., Williams R. J., 1986, Nature, 323, 533
  • Schäfer et al. (2012) Schäfer B. M., Heisenberg L., Kalovidouris A. F., Bacon D. J., 2012, MNRAS, 420, 455
  • Schmidt et al. (1998) Schmidt B. P., et al., 1998, Astrophys.J., 507, 46
  • Schneider (1996) Schneider P., 1996, MNRAS, 283, 837
  • Schneider et al. (1998) Schneider P., van Waerbeke L., Jain B., Kruse G., 1998, MNRAS, 296, 873
  • Shamir et al. (2008) Shamir L., Orlov N., Eckley D. M., Macura T., Johnston J., Goldberg I., 2008, Source Code for Biology and Medicine, 3, 13
  • Shamir et al. (2010) Shamir L., Delaney J., Orlov N., Eckley D., Goldberg I., 2010, PLoS Comput Biol 6(11): e1000974
  • Shan et al. (2018) Shan H., et al., 2018, MNRAS, 474, 1116
  • Shirasaki et al. (2017) Shirasaki M., Nishimichi T., Li B., Higuchi Y., 2017, MNRAS, 466, 2402
  • Simonyan & Zisserman (2014) Simonyan K., Zisserman A., 2014, preprint, (arXiv:1409.1556)
  • Simonyan et al. (2013) Simonyan K., Vedaldi A., Zisserman A., 2013, preprint, (arXiv:1312.6034)
  • Spergel et al. (2015) Spergel D., et al., 2015, ArXiv e-prints 1503.03757,
  • Springel (2005) Springel V., 2005, Mon. Not. Roy. Astron. Soc., 364, 1105
  • Springel et al. (2005) Springel V., et al., 2005, Nature, 435, 629
  • Springenberg et al. (2014) Springenberg J. T., Dosovitskiy A., Brox T., Riedmiller M., 2014, preprint, (arXiv:1412.6806)
  • Springer et al. (2018) Springer O. M., Ofek E. O., Weiss Y., Merten J., 2018, preprint, (arXiv:1808.07491)
  • Srivastava et al. (2014) Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., 2014, Journal of Machine Learning Research, 15, 1929
  • Szegedy et al. (2013) Szegedy C., Zaremba W., Sutskever I., Bruna J., Erhan D., Goodfellow I., Fergus R., 2013, preprint, (arXiv:1312.6199)
  • Szegedy et al. (2014) Szegedy C., et al., 2014, CoRR, abs/1409.4842
  • Szegedy et al. (2016) Szegedy C., Ioffe S., Vanhoucke V., 2016, CoRR, abs/1602.07261
  • Tamura et al. (1978) Tamura H., Mori S., Yamawaki T., 1978, IEEE Transactions on Systems, Man, and Cybernetics, 8, 460
  • Teague (1980) Teague M. R., 1980, J. Opt. Soc. Am., 70, 920
  • Tessore et al. (2015) Tessore N., Winther H. A., Metcalf R. B., Ferreira P. G., Giocoli C., 2015, J. Cosmology Astropart. Phys., 10, 036
  • Troxel et al. (2018) Troxel M. A., et al., 2018, Phys. Rev. D, 98, 043528
  • Van Waerbeke et al. (2013) Van Waerbeke L., et al., 2013, MNRAS, 433, 3373
  • Viel et al. (2010) Viel M., Haehnelt M. G., Springel V., 2010, JCAP, 1006, 015
  • Vikhlinin et al. (2009) Vikhlinin A., et al., 2009, ApJ, 692, 1060
  • Villaescusa-Navarro et al. (2017) Villaescusa-Navarro F., Banerjee A., Dalal N., Castorina E., Scoccimarro R., Angulo R., Spergel D. N., 2017, preprint, (arXiv:1708.01154)
  • White (1993) White S. D. M., 1993, in Gleiser R. J., Kozameh C. N., Moreschi O. M., eds, General Relativity and Gravitation 1992. pp 331–+
  • White (1996) White S. D. M., 1996, in Schaeffer R., Silk J., Spiro M., Zinn-Justin J., eds, Cosmology and Large Scale Structure. p. 349
  • White & Rees (1978) White S. D. M., Rees M. J., 1978, MNRAS, 183, 341
  • Winther et al. (2015) Winther H. A., et al., 2015, Mon. Not. Roy. Astron. Soc., 454, 4208
  • Wright et al. (2017) Wright B. S., Winther H. A., Koyama K., 2017, J. Cosmology Astropart. Phys., 10, 054
  • Wu et al. (1992) Wu C.-M., Chen Y.-C., Hsieh K.-S., 1992, IEEE Transactions on Medical Imaging, 11, 141
  • Wu et al. (2016) Wu Y., et al., 2016, CoRR, abs/1609.08144
  • Zeiler & Fergus (2013) Zeiler M. D., Fergus R., 2013, preprint, (arXiv:1311.2901)
  • Zennaro et al. (2017) Zennaro M., Bel J., Villaescusa-Navarro F., Carbone C., Sefusatti E., Guzzo L., 2017, MNRAS, 466, 3244
  • Zernike (1934) Zernike v. F., 1934, Physica, 1, 689

Appendix A Wnd-charm features

The full wnd-charm feature vector comprises 2919 descriptors, which can be divided into five families. We provide an overview of the features and their respective families in table 10. The algorithm works not only on the image itself (raw), but also on its Fourier (F), wavelet (W), Chebyshev (C) or edge (E) transformation, as indicated by the 'Input' column of table 10. Transformations of transformations are denoted by the bracket notation. While the Fourier and Chebyshev transforms are implemented using common algorithms, the wavelet transformation is performed with a one-level filter pass using a 5th-order symlet (Orlov et al., 2008) and the edge transformation is carried out with a Prewitt operator (Prewitt, 1970) to approximate the image gradient.
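As an illustration of these input transformations, the following sketch applies a Fourier transform, a Prewitt edge filter and a one-level 'sym5' wavelet decomposition to a map using NumPy, SciPy and PyWavelets. The exact normalisations and any further processing applied by wnd-charm are not reproduced here.

```python
import numpy as np
from scipy import ndimage
import pywt

def input_transforms(image):
    """Return a dictionary of transformed versions of a 2D map."""
    transforms = {"raw": image}
    # Fourier transform (F): amplitude of the 2D FFT
    transforms["F"] = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    # Edge transform (E): Prewitt gradient magnitude
    gx = ndimage.prewitt(image, axis=0)
    gy = ndimage.prewitt(image, axis=1)
    transforms["E"] = np.hypot(gx, gy)
    # Wavelet transform (W): one-level decomposition with a 5th-order symlet,
    # keeping the approximation coefficients
    approx, _details = pywt.dwt2(image, "sym5")
    transforms["W"] = approx
    # Compound transforms, e.g. Fourier of the edge image F(E)
    transforms["F(E)"] = np.abs(np.fft.fftshift(np.fft.fft2(transforms["E"])))
    return transforms
```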

The pixel statistics family consists of four subclasses, the simplest being the intensity statistics: mean, median, standard deviation, minimum and maximum. The multi-scale histograms are calculated by using three, five, seven or nine bins to sort the pixel amplitudes; the counts in these bins make up the 24 features of this subclass. The combined moments are the mean, standard deviation, skewness and kurtosis calculated in a horizontal stripe through the image centre whose width is half the total image width. The stripe is then rotated by 45, 90 and 135 degrees and the measurement is repeated. These 16 numbers are sampled into three bins each, providing a total of 48 features. The Gini coefficient (Abraham et al., 2003) measures how equally the pixel intensities are distributed within the image.
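A minimal sketch of the simpler pixel statistics (intensity statistics, multi-scale histograms and the Gini coefficient) is given below; the binning details and normalisations follow common conventions and may differ from the wnd-charm implementation.

```python
import numpy as np

def intensity_statistics(image):
    """Mean, median, standard deviation, minimum and maximum of the pixel values."""
    x = image.ravel()
    return np.array([x.mean(), np.median(x), x.std(), x.min(), x.max()])

def multiscale_histograms(image):
    """Counts of pixel amplitudes in 3-, 5-, 7- and 9-bin histograms (24 features)."""
    x = image.ravel()
    counts = [np.histogram(x, bins=n)[0] for n in (3, 5, 7, 9)]
    return np.concatenate(counts)

def gini_coefficient(image):
    """Gini coefficient of the (shifted, non-negative) pixel intensity distribution."""
    x = np.sort(image.ravel() - image.min())
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n
```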

The second feature family comprises polynomial decompositions. The coefficients of an order-20 Chebyshev and an order-23 Chebyshev-Fourier (Orlov et al., 2006) transformation are sorted into 32-bin histograms. Radon transformations are carried out along lines inclined by 0, 45, 90 and 135 degrees with respect to the image horizontal (Radon, 1917) and ordered into 3-bin histograms. The Zernike coefficients are derived from a 2D Zernike decomposition of the image (Teague, 1980), and the first 72 of these coefficients contribute to the feature vector.
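For illustration, Radon and Zernike features of this kind can be obtained with standard image-processing libraries, as in the sketch below; the histogramming of the Radon projections and the radius and degree used for the Zernike moments are illustrative choices, not the wnd-charm settings.

```python
import numpy as np
from skimage.transform import radon
import mahotas

def radon_features(image, n_bins=3):
    """Radon projections at 0, 45, 90 and 135 degrees, each summarised by a 3-bin histogram."""
    sinogram = radon(image, theta=[0, 45, 90, 135], circle=False)
    hists = [np.histogram(sinogram[:, i], bins=n_bins)[0] for i in range(sinogram.shape[1])]
    return np.concatenate(hists)  # 12 features in total

def zernike_features(image, degree=12):
    """Zernike moments of the image within a circle covering the map."""
    radius = min(image.shape) // 2
    return mahotas.features.zernike_moments(image, radius, degree=degree)
```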

Textures are commonly used in image processing to describe spatial correlations of intensity values. We extract seven Gabor filters (e.g. Fogel & Sagi, 1989) using Gaussian harmonic functions and define their image occupation area as a feature. Tamura textures are described in detail in Tamura et al. (1978); wnd-charm uses contrast, directionality, the coarseness sum and the coarseness binned into a 3-sample histogram. The 28 Haralick textures are specific properties of the grey-level dependence matrix of the image and are described in Haralick et al. (1973). The fractal analysis is based on a Brownian-motion model of the image following Wu et al. (1992), and wnd-charm uses the first 20 parameters of this analysis as features.
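Comparable texture descriptors can be computed as sketched below, using mahotas for the Haralick features and scikit-image for a Gabor response; the quantisation of the map into grey levels, the filter frequency and the occupation threshold are illustrative assumptions.

```python
import numpy as np
import mahotas
from skimage.filters import gabor

def haralick_features(image, levels=64):
    """Haralick texture features of the map quantised to a small number of grey levels."""
    quantised = np.digitize(image, np.linspace(image.min(), image.max(), levels)) - 1
    # mahotas returns one row per co-occurrence direction; average over directions
    return mahotas.features.haralick(quantised.astype(np.uint8)).mean(axis=0)

def gabor_occupation(image, frequency=0.1, n_sigma=1.0):
    """Fraction of the image area where the Gabor response exceeds n_sigma times its rms."""
    real, _imag = gabor(image, frequency=frequency)
    return np.mean(np.abs(real) > n_sigma * real.std())
```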

Object statistics are derived from the raw image data only. The starting point is an edge transform using a Prewitt filter; the mean, median, variance and an 8-bin histogram of both the image gradient and its directionality add up to 22 features, which are supplemented by the total number of edge pixels, their genus and the differences between the directionality bins. Otsu features and their inverse are calculated after applying an Otsu threshold (Otsu, 1979). Finally, the algorithm calculates the minimum, maximum, mean, median, variance and a 10-bin histogram of the area and the image-centre distance of all Otsu objects in the image.
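Object statistics of this kind can be sketched with scikit-image as below; the summary statistics retained here are a simplified subset of what is listed in table 10.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

def otsu_object_features(image):
    """Area and image-centre distance statistics of objects above the Otsu threshold."""
    mask = image > threshold_otsu(image)
    objects = regionprops(label(mask))
    centre = np.array(image.shape) / 2.0
    areas = np.array([obj.area for obj in objects])
    dists = np.array([np.linalg.norm(np.array(obj.centroid) - centre) for obj in objects])
    feats = []
    for values in (areas, dists):
        feats += [values.min(), values.max(), values.mean(), np.median(values), values.var()]
        feats += list(np.histogram(values, bins=10)[0])
    return np.array(feats)
```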

Family | Class | Features | Input | Reference
Pixel statistics | Combined moments | 48 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) |
Pixel statistics | Gini coefficient | 1 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Abraham et al. (2003)
Pixel statistics | Multiscale histograms | 24 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) |
Pixel statistics | Pixel intensity statistics | 5 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) |
Polynomial decomposition | Chebyshev coefficients | 32 | raw, F, W, C, F(W), E, F(E), W(E) |
Polynomial decomposition | Chebyshev-Fourier coefficients | 32 | raw, F, W, C, F(W), E, F(E), W(E) | Orlov et al. (2006)
Polynomial decomposition | Radon coefficients | 12 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Radon (1917)
Polynomial decomposition | Zernike coefficients | 72 | raw, F, W, C, F(W), E, F(E), W(E) | Teague (1980)
Textures | Fractal analysis | 20 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Wu et al. (1992)
Textures | Gabor | 7 | raw | Fogel & Sagi (1989)
Textures | Haralick | 28 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Haralick et al. (1973)
Textures | Tamura | 6 | raw, F, W, C, C(F), W(F), F(W), F(C), C(W), E, F(E), W(E) | Tamura et al. (1978)
Objects | Edge features | 28 | raw | Prewitt (1970)
Objects | Otsu object features | 34 | raw | Otsu (1979)
Objects | Inverse Otsu object features | 34 | raw | Otsu (1979)
Table 10: wnd-charm image features used in this analysis.

Appendix B Deep neural networks

In this appendix we collect some more detailed information about deep neural networks. The first section formally defines all network layers used in this work and the second section deals with activation functions. The third section provides a thorough description of the architecture of the convolutional neural network (CNN) that we use in our analyses.

B.1 Layers

Given a 2D input, a convolution layer applies the following operation to produce an output:

(19)–(23)
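As an illustration of this operation, the following NumPy sketch implements the standard single-channel, valid-padding form of a strided 2D convolution (written as a cross-correlation, as is conventional for CNNs). It is a generic illustration under these assumptions, not a reproduction of the exact definitions in equations (19)–(23).

```python
import numpy as np

def conv2d(x, kernel, stride=(1, 1)):
    """Strided 2D cross-correlation of a single-channel input with one kernel (valid padding)."""
    kh, kw = kernel.shape
    sy, sx = stride
    out_h = (x.shape[0] - kh) // sy + 1
    out_w = (x.shape[1] - kw) // sx + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * sy:i * sy + kh, j * sx:j * sx + kw]
            out[i, j] = np.sum(patch * kernel)  # inner product of patch and kernel
    return out
```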

The stride parameters