I Introduction
The analysis of multiple probes, including the Cosmic Microwave Background (CMB) and large scale structure (LSS), have yielded very precise estimates for the parameters that define the standard cosmological model,
CDM Planck Collaboration et al. (2016); Anderson et al. (2014). Early fluctuations in the CMB evolved through gravitational instability and formed the structures we observe in the late universe. The evolution of the matter distribution in the universe encodes rich cosmological information that can be mined to test the standard model and constrain the possible values for its defining parameters.Over 80% of the matter in the universe is nonbaryonic Dark Matter (DM), detectable through its gravitational effects. It contributes to gravitational lensing, distorting the shapes of background galaxies to an extent that is usually too small to be directly observed. Weak gravitational lensing (WL) can, nonetheless, be measured statistically through the correlation in the shapes of galaxies Refregier (2003); Kilbinger (2015). The lensed galaxies’ redshifts allow the reconstruction of the matter density field’s evolution Hu (2002), making WL one of the most promising cosmological probes. Lensing measurements and their analysis in a cosmological context are an essential part of experiments such as CFHTLenS ^{1}^{1}1http://www.cfhtlens.org/, KiDS ^{2}^{2}2http://kids.strw.leidenuniv.nl/index.php, the Dark Energy Survey (DES ^{3}^{3}3http://www.darkenergysurvey.org) or HSC ^{4}^{4}4http://hsc.mtk.nao.ac.jp/ssp/, and will be included in even wider ( larger) surveys (Large Synoptic Survey Telescope, LSST ^{5}^{5}5http://www.lsst.org, the Euclid mission ^{6}^{6}6http://sci.esa.int/euclid/ and the Wide Field Infrared Survey Telescope, WFIRST ^{7}^{7}7http://wfirst.gsfc.nasa.gov).
The large volume of upcoming datasets raises the question of how to extract all the cosmological information encoded in them. Nonlinear gravitational collapse distorts the Gaussian character of the initial fluctuations. Thus, twopoint statistics are insufficient to characterize weak lensing data and additional descriptors have been considered to extract additional information Weinberg et al. (2013). An alternative approach is to transform the data so that nonlinearities become less important and it is easier to recover the information encoded in the transformed field (e.g. with the power spectrum). Logarithmic transformations have been proposed for the 3D matter density field Neyrinck et al. (2009) and the 2D convergence Seo et al. (2011), as well as other local, Gaussianization transformations Shirasaki (2017).
Overall, nonGaussian statistics such as lensing peaks and moments involving gradients of the convergence field are promising, since they can improve parameter errors by a factor of 23 compared to using only secondorder statistics
Kratochvil et al. (2010, 2012); Dietrich and Hartlap (2010); Shirasaki and Yoshida (2014); Liu et al. (2015); Petri et al. (2015); Kacprzak et al. (2016); Martinet et al. (2018). It is not clear where the extra information lies, or if all of it is accessible Takada (2014). It has been investigated and partially understood only for lensing peaks, which derive some (but not all) information from underlying collapsed DM halos Yang et al. (2011); Zorrilla Matilla et al. (2016); Liu and Haiman (2016). This halopeak connection has inspired the development of approximate analytic models for peak counts Fan et al. (2010); Lin and Kilbinger (2015a).All these statistics compress the information in the original dataset, typically a map representing a noisy estimate of the projected matter density field, into a lowdimensional descriptor that can be used to infer the parameters that determine how the data was generated. An alternative approach is to use deep learning techniques, which have proven successful in a wide range of areas LeCun et al. (2015) to infer cosmological parameters directly from the uncompressed raw data.
Artificial neural networks (NNs) are pattern recognition algorithms, in which a series of processing nodes, capable of performing simple operations, are connected to each other in a network. The nodes of a NN are typically arranged in layers, with nodes in one layer connected to those in the next. Information is fed to the NN through the input layer, its outcome comes from the output layer, and all intermediate steps are called “hidden” layers. The strength of the connections is stored in a series of weights that can be adjusted to match a given output; this process is called ”learning”. This quality allows the use of NNs for forecasting and inference. While we do not have a full understanding on what drives NNs predictive power
Lin et al. (2016), they have been successfully used in Astronomy, from source detection and classification, to light curve analyses and even adaptive optics control (see reviews on NNs in astronomy in Miller (1993); Tagliaferri et al. (2003)).Convolutional Neural Networks (CNNs) are particularly well suited to work on datasets with spatial information, such as images, since the connection of their convolution layers’ nodes to subsets of the data take advantage of the high correlation of nearby points imprinted by the locality of physical processes. Recently they have been used to infer cosmological parameters from the 3D matter density field Ravanbakhsh et al. (2016), and have been found to outperform constraints estimated from its power spectrum. Weak lensing provides (in principle) an unbiased map of the projected matter distribution. One of the aims of this study is to assess if neural networks overperform relative to the power spectrum when analyzing 2D WL data, as they do for the 3D matter field. Similar techniques have also been used to generate data with the same statistical properties as the output of physicallymotivated simulations Mustafa et al. (2017); Rodriguez et al. (2018).
A similar study has recently applied convolutional neural networks to weak lensing data for inference Schmelzle et al. (2017). Our study shares the same motivation and reaches similar conclusions, but has some differences. While Schmelzle et al. (2017) focused on the ability of deep learning techniques to differentiate between models along a known {} degeneracy, Jain and Seljak (1997), we focus on the parameters’ constraints that can be inferred by extracting information through neural networks. To do so we trained our networks on a set of 96 cosmological models covering a large region of the parameter space (see Fig. 1
for the distribution of those models). Furthermore, we used different simulation techniques, the architecture of our network is different and we compared the neural network to a different set of observables (power spectrum and lensing peaks, instead of skewness and kurtosis). Finally, we restricted our analysis to noiseless data, leaving the analysis of the effect of shape noise for a followup study.
The paper is organized as follows, in §II we describe how we generated the data used to train and test the CNN, the architecture and training of the network, and the summary statistics used as benchmarks. In §III we compare the performance of the CNN to that of alternative summary statistics, in terms of its predictive accuracy and the cosmological constraints that can be inferred. In §IV we discuss the implications of our results and we summarize our conclusions in §V.
Ii Data
The goal of this paper is to assess the performance of CNNs predicting cosmological parameters from WL data. We do so comparing the network’s predictions with those that can be inferred from statistics measured on the maps, as well as the credible regions that can be inferred around the predicted parameters. In this section, we describe how the WL data used was generated, the design and training of the CNN, and describe the summary statistics measured on the WL data: the power spectrum and lensing peaks.
ii.1 Mock convergence maps
Our initial data set consists of mock convergence () maps generated assuming 96 different values for the matter density and the scale of the initial perturbations normalized at the late Universe, (see Fig. 1). We adjusted the Dark Energy density to enforce flatness, , and kept the rest of the parameters constant: baryon density (), Hubble constant (), scalar spectral index (
), effective number of relativistic degrees of freedom (
) and neutrino masses ().We singled out the cosmology with {} as a fiducial to compute the covariance of the observables used to assess the performance of the CNN (see § II.3). The density of the model sampling increases towards the fiducial and shows some correlation with the direction of the degeneracy, , as can be seen in Fig. 1. We refer the reader to Zorrilla Matilla et al. (2016), where this suite of simulations was also used, and our pipeline Lenstools Petri (2016) for a detailed description of our sampling algorithm and simulation processing. We provide a summary here for convenience.
We evolved the matter density field using the body code GADGET2 Springel (2005). For each cosmology we simulated a single volume from initial conditions computed with CAMB Lewis et al. (2000). The simulation boxes are cubes with a sidelength of Mpc, large enough to cover the maps’ field of view of to a redshift of . Each box is populated with Dark Matter (DM) particles, yielding a mass resolution of .
We raytraced the outputs of our simulations following the multiple lens plane algorithm Schneider et al. (1992). It has been shown that while the Born approximation is sufficient for an accurate estimation of the power spectrum even in the largest planned future WL surveys, full raytracing is necessary to avoid biased estimations for the counts of lensing peaks and higher order statistics Petri et al. (2017). The value of for each of our maps’ pixels is derived from the deflection experienced by a light ray as it crosses a series of lens planes stacked to form its past lightcone. For this study, we considered all the lensed galaxies located at a single fixed redshift of . Each resulting map has pixels, and was sliced in 16 smaller patches of pixels each to speed up the neural network’s training (§II.2).
Each lens plane was generated from the snapshot corresponding to its redshift by cutting a Mpc slab along one of its axes, estimating the matter density on a grid, and solving the Poisson equation in 2D for the gravitational potential. By cutting different slabs, combining different planes at each redshifts, and randomly translating and rotating them, we ultimately generated 512 independent maps from a single simulation box for each cosmology. Through this recycling process, it is possible to generate up to independent realizations of the convergence field from a single Nbody simulation Petri et al. (2016). The resulting unsmoothed, noiseless convergence maps, is analogous to a 2D version of the dataset used in Ravanbakhsh et al. (2016).
ii.2 Neural network training and architecture
Neural networks consist of interconnected nodes (or neurons), arranged in layers. Each neuron transforms a linear combination of its inputs through an “activation” function,
, where is a matrix of weights anda vector of inputs (in our case, the latter contains the values of the convergence in the pixelized 2D lensing map). The inputs can come from other neurons in the network, or from external data. The activation function is usually nonlinear (e.g. a sigmoid function). The weights used to linearly combine the inputs can be adjusted to minimize a loss function, in a process that is called “training” or “learning”. Some of the layers in the neural network used for this study convolve their input data with a kernel whose values are fitted during training. The resulting “convolutional neural network” takes advantage of the correlations between neighboring pixels and has been shown to yield good results when analyzing natural images.
Each “labeled example” the network is exposed to is a 1024 1024 map coupled with the “label” that corresponds to the cosmology used to generate that map. From each such example, we created 16 “labeled examples” by slicing the map into smaller, 256 256 maps. And these are the maps used as input for the neural net. This operation reduced the number of nodes in the CNN and, consequently, its training time. We do not expect the performance of the network to be adversely affected, because of the limited constraining power of the modes that are small enough to be captured by the full maps but not their slices, i.e. spherical harmonic indices in the range (see, e.g. ref Fang and Haiman (2007), for a demonstration that most of the information is on smaller scales). The prediction for each 1024 1024 map is the mean of the predictions for the 16, 256 256 maps that were sliced from the original, bigger map. Our whole dataset amounts to 96 different cosmological models, each having 512, 1024 1024 independent maps. We trained the neural networks using 70% of our data, and set aside the remaining 30% to test their performance.
The architecture of the CNN was inspired by that used in Ravanbakhsh et al. (2016). We sketch the architecture in Fig. 2 and summarize its elements in Table 1. The network is a combination of convolutions (transformed by a nonlinear “activation” function) and pooling layers that reduce the spatial dimensionality of the output, followed by fully connected layers in charge of the highlevel logic. For the convolutional layers, we chose a kernel for speed. Each convolution layer applies more than 1 filter to its input in sublayers. The weights (filter values) are the same for all the neurons within a sublayer. This parametersharing reduces the number of weights to fit during training and is a reasonable choice given the data’s translational and rotational symmetries.
The first layer convolves any input map with 4 different filters and applies the activation function to the resulting 4 feature maps. Each filter is defined by 10 parameters (9 determine the convolution kernel plus an overall additive bias). In total, 40 weights need to be adjusted during training for the first layer. The second layer downsamples the feature maps from the first layer substituting consecutive pixels by their mean (“average pooling”). The third and fourth layers are convolutional layers, and each applies 12 different kernels to all incoming feature maps, including all depth levels from the previous layer. While the convolution is a linear operation, the application of the activation function breaks the linearity. The number of tunable weights grows with each layer as new kernels are added. Another average pooling layer (layer 5) is followed by two sets of convolution + average pooling (layers 69).
At each layer, we can consider the neurons arranged along 3 dimensions, 2 that follow the spatial dimensions of the feature maps fed into the layer (width and height) and another that grows with the number of filters used to process the layer’s input (depth). As information flows through the network, the spatial dimensions of the feature maps shrink and the depth of nodes processing those maps grows. The convolutional layers 6 and 8 do not apply an activation function to their output. Another average pooling (layer 10), followed by a flattening layer (layer 11) reduce the spatial dimensionality to unity, with a depth of 2304.
A series of fully connected layers (layers 12, 14 and 16) are followed by dropout layers (layers 13, 15 and 17) that shrink the depth of the output. The final fully connected layer (layer 18) outputs the estimated values for and , which are compared with their true value through the loss function to adjust the weights in the network through backpropagation.
Layer  Type  Sublayers  Output dimension  Weights 

1  Convolution + LeakyReLU  4  40  
2  Average pooling  1  0  
3  Convolution + LeakyReLU  12  444  
4  Convolution + LeakyReLU  12  1308  
5  Average pooling  1  0  
6  Convolution  32  3488  
7  Average pooling  1  0  
8  Convolution  64  18496  
9  Average pooling  1  0  
10  Average pooling  1  0  
11  Flattening  1  2304  0 
12  Fully connected + LeakyReLU  1  1024  2360320 
13  Dropout  1  1024  0 
14  Fully connected + LeakyReLU  1  256  262400 
15  Dropout  1  256  0 
16  Fully connected + LeakyReLU  1  10  2570 
17  Dropout  1  10  0 
18  Fully connected  1  2  22 
Total  2649088 
The total number of parameters to be fitted during training is , a large number but very small compared with the total number of pixels in the training data set ().
The adopted activation function is the “leaky rectified linear unit” (LeakyReLU), with a leak parameter of 0.03, within the range suggested in Maas et al. (2013):
(1) 
This functions helps mitigate the “dying” ReLU problem, in which a neuron gets stuck in a region of zero gradient
Nair and Hinton (2010). To prevent overfitting, we enforced regularization applying “dropout” at the fully connected layers: the output of any neuron was ignored with a chance Hinton et al. (2012). This process took part only during training, and the output from the nodes that were not droppedout was doubled to compensate for the ignored neurons.We used two loss functions to minimize during the training of our neural networks. The first one is the sum of the absolute error on and , computed over batches of 32 maps each, in which the data is split for each pass of the training examples:
(2) 
This is a popular choice, and converges faster than the sum of the squares of errors because its gradient does not necessarily cancel near zero. Due to the heterogeneous sampling in parameter space of our simulated models, the network is exposed to fewer examples from cosmologies in sparsely sampled regions. This can induce a bias in the predictions. To assess the impact of the nonuniform sampling on parameter constraints, we also used a weighted loss function:
(3) 
where is a weight inversely proportional to the sampling density at the location of a cosmological model in parameter space. Errors in predictions for maps from cosmologies in sparsely sampled regions are more severely penalized than those for maps from densely sampled regions. We show in §IV.3 that such a weighted loss function reduces the bias in the predictions, at the cost of a longer network training, but has only a limited impact on the parameter constraints inferred from the predictions.
The algorithm used to minimize the loss function was an Adam optimizer Kingma and Ba (2014) with a learning rate of and first and second moment exponential decay rates of 0.9 and 0.999, respectively.
We trained each network until the loss function converged, which took in most cases 5 epochs (an epoch is a pass of all the training examples in the data set). The training maps were split in batches and randomly reshuffled after each epoch. The networks’ weights were recomputed after each batch, minimizing the total loss over the 32 tiles. Each batch took
s on a NVIDIA K20 GPU with 5GB of onboard memory, at the NSF XSEDE facility ^{8}^{8}8http://www.xsede.org. To further reinforce the rotationinvariance of the dataset, all maps were rotatedwith a 50% probability before feeding them to the network.
ii.3 Alternative descriptors
In order to assess the performance of the CNN, we compared the accuracy of its predictions with that achieved through analysis of summary statistics. We used two observables, the power spectrum and lensing peak counts. Both compress the information available in a given WL map in a data vector of dimension small compared with the number of pixels in the original map.
The power spectrum is defined as the Fourier transform of the twopoint correlation function of
Kilbinger (2015).(4) 
In the above expression is the Dirac delta function and is the 2D angular wave vector. We measured the power spectrum on all 512 mock maps for each of the 96 cosmological models. We evaluated the power spectra on 20 bins, logarithmically spaced in the interval . The minimum angular scale (maximum wavenumber ) is set to prevent any loss of information at the pixel level. The finite resolution of our simulations results in deviations from theory at wavenumbers with a significant loss of power for , as Fig. 3 shows for the fiducial cosmology.
The power spectrum is a widely used observable in cosmology, mainly because it fully characterizes Gaussian random fields and is a welldeveloped analytic tool. While the initial conditions for the matter perturbations are Gaussian (or nearly so), nonlinear evolution introduces significant nonGaussianities in the matter density field at late times.
Lensing peaks are local maxima in the field. In the absence of ellipticity noise, they probe high density regions, where nonlinear effects become relevant. We chose the peaks’ count as a function of their value as a second observable because they are sensitive to information not captured by the power spectrum. As an illustration, we compare in Fig. 4 the average peak counts measured on the 512 mock maps generated for the fiducial cosmology to those measured over Gaussian Random Fields (GRFs) that share their power spectra with the maps. That is, for each convergence map, we measured its power spectrum, built a GRF from it and measured the number of peaks in this new field. The distribution is clearly different, the peak histogram from convergence maps exhibiting a high tail resulting from the nonlinear growth of structures.
Peak counts yield tighter constraints than the power spectrum Kratochvil et al. (2010); Dietrich and Hartlap (2010); Liu et al. (2015); Kacprzak et al. (2016); Kratochvil et al. (2012); Shirasaki and Yoshida (2014); Petri et al. (2015); Martinet et al. (2018) and constitute a good benchmark for other methods which aim at extracting additional cosmological information. We counted the peaks in 20 bins, linearly spaced. We set the upper and lower limits of the bins to , in units of the mean r.m.s. for the fiducial maps, to fully cover the range of peaks present in the data; this corresponds to , and a bin width of .
Iii Results
We assessed the CNN’s performance in terms of the precision of their predictions for the cosmological parameters, and the constraints for those parameters for a given observation. The left and center panels of Fig. 5 display the predictions for and as a function of their “ground truth”, that is, the values that correspond to the cosmologies used to generate the data. The right panel shows the same comparison for the derived along the degeneracy between both parameters. Each point corresponds to one of the test maps available for each of the 96 cosmologies. For the neural network, the predicted for a given map are the average values for the network’s output when fed the 16 tiles in which the map was sliced. For the power spectrum and peak counts, the predictions are the values that minimize for that map. We estimated for each of the 96 sampled cosmologies as:
(5) 
where is the data vector measured on map (binned power spectrum or peak counts), is the mean of the same descriptor for the model and is the precision matrix for the data vector evaluated at the fiducial model. We used all 512 available maps per model to evaluate both the mean descriptor and the precision matrix, as in any realistic scenario in which a survey provides a mass map all the simulated data would be used for inference. We corrected for the bias in the precision matrix following Hartlap et al. (2007):
(6) 
is the number of realizations used to estimate the covariance (512) and is the dimension of the data vector (20, the number of bins).
The 96
values were used to interpolate
and find its minimum. We used a CloughTocher interpolator that builds a continuously differentiable piecewise cubic surface over a nonuniform grid Alfeld and other (1984); Renka and Cline (1984) . The minimum was found using the downhill simplex algorithm Nelder and Mead (1965). We verified that the results for the power spectrum and lensing peaks do not change when these observables are measured in a different number of bins (as long as they’re more than ) or a different interpolator is used to find the minimum of .For all cosmologies, the neural network is significantly more precise than both the power spectrum and lensing peaks: the scatter in its predictions for a given model is smaller. On average, the standard deviation of the CNN’s predictions is a factor of 47 lower than that of the statistical descriptors, and up to
smaller for the fiducial (see Table 2). In terms of accuracy (i.e. how close the predictions are to the ground truth), the network shows some bias that may degrade the constraints that can be inferred from the network’s predictions.CNN  Power spectrum  Peak counts  

Noiseless, unsmoothed  
5.1 (2.4)  21.7 (21.2)  35.6 (13.2)  
10.1 (7.9)  52.7 (84.1)  62.5 (63.8)  
7.2 (4.7)  32.3 (73.2)  36.0 (59.2) 
We note the presence of a small set of maps from models close to the fiducial for which both the power spectrum and lensing peaks tend to overpredict and
as a result (the outliers on both panels correspond to the same maps). These maps form a clearly detached clump on the rightmost panel of Fig.
5, where a dashed rectangle highlights their location. They represent of the maps forcosmologies not far from the fiducial model. We found through visual inspection that this overprediction seems to be due to an anomalous number of structures projected in the field of view. Interestingly, the CNN seems to be immune to such chance projections and classifies these maps correctly. This suggests that the neural network extracts different information from the maps than the power spectrum or lensing peaks. Alternatively, these fluctuations may be the result from cosmic variance, and the neural network may be underweighting those effects.
For a few cosmologies, parameter predictions from the CNN converged at different values from those of neighboring models. This is noticeable on the leftmost panel of Fig. 5 where a few red points show a relative overprediction in in the range . These outliers correspond to points in sparsely sampled areas near the boundaries of the explored parameter space. This highlights the importance of a wellsampled parameter space for the neural network to generalize accurately. In § B we analyze the effect of sampling on the predictions and credible contours inferred from the neural network. As these outliers lie far from the fiducial cosmology, they do not alter the parameter constraints presented in this study. Furthermore, they are identifiable in the training data, and as such could be removed if needed. We did not remove any model from our data set even when it was evident from the training data that they could be outliers.
The relevant metric to compare the performance of the neural network relative to summary statistics is the probability distribution for the cosmological parameters given our data. This posterior distribution is related to the easiertocompute probability of measuring a specific data vector given the cosmological parameters, or likelihood, by Bayes’ theorem:
(7) 
where is the set of cosmological parameters, a data vector and the underlying model, in our case DMonly simulations of CDM cosmologies. For the CNN, we define our data vector as the predicted values for the cosmological parameters, , and for the alternative statistics, the measured binned power spectra and peak histograms described in §II.3.
The term that multiplies the likelihood, or prior , and that on the denominator, or evidence, are the same when using the neural network or the statistical descriptors. The reason is we are using the same convergence maps from the same sampling of the parameter space. We can drop them as a normalization factor, as well as the explicit dependence on the underlying model used to generate the
maps, and compare directly the likelihoods derived from the different methods. For the likelihoods, we assumed a Gaussian distribution:
(8) 
with a precision matrix evaluated at the fiducial cosmology and as an expected value for the data, , the mean value measured from the simulations for the cosmology defined by .
Since we use the same covariance matrix for all cosmologies, we do not need to include the normalization prefactor. For the power spectrum, we expect the Gaussian likelihood to be accurate. Our simulated maps cover a small field of view of on which the power spectrum can be measured only for relatively high
. At those scales, many modes contribute to each measurement of the power spectrum, and the central limit theorem shows that its probability distribution function should converge to a Gaussian
Sato et al. (2010). For the lensing peaks and predictions from the neural network, we verified that the approximation remains valid (see §A). The alternative approach of estimating the probability density using a kernel density estimator (KDE) depends on the width of the kernel chosen, and the estimates for a large dimensional data vector such as our power spectra are noisy due to the relative limited amount of independent
maps realizations.To compute the likelihood, we used as data (observation) the average observable for the fiducial cosmology. For the power spectrum and lensing peaks, all 512 maps were used to estimate the means for each cosmology, and the covariance matrix for the fiducial. For the neural network, only the test maps were used ( per cosmology). We display the 68% and 95% credible contours for the likelihoods computed for the power spectrum, lensing peaks and neural network in the central panel of Fig. 6, and the marginalized distributions for and in the upper and right panels, respectively. At each point in parameter space, the expected data vector is interpolated linearly from the mean data vectors for the simulated cosmologies. Due to the choice of measurement (the predicted mean for the fiducial) all likelihoods peak at the true values for the fiducial cosmology. This is true also for the neural network. The smaller scatter in the CNN predictions translates into tighter parameter constraints, by a factor of compared with lensing peaks and compared with the power spectrum (see Table 3). The neural network seems capable of extracting more information from noiseless convergence maps than alternative methods such as the power spectrum or lensing peaks.
CNN  Power spectrum  Peak counts  

Area  1  5.9  1.9 
Area  1  6.1  1.9 
Iv Discussion
iv.1 NonGaussian information extracted by the neural network
The significantly tighter constraints obtained by the CNN, shown in Fig. 6, are encouraging and an indication that weak lensing maps encode more information than what is usually used for inference. Neural networks are capable of extracting some of it, at least more than the power spectrum and even more than some nonGaussian statistics such as lensing peaks. Given the large number of parameters that need to be fitted during training, there is the risk that the gain in precision comes from a form of overfitting, in the general sense of making predictions based on irrelevant information Zhang et al. (2016).
For instance, a Gaussian random field, GRF, is fully determined by its power spectrum. As a result, no other statistic or method used to extract information from it should outperform the power spectrum. To test whether the neural network satisfies this limit, we built a collection of GRFs and used it as a new dataset to train and test the CNN’s architecture. We generated the GRFs by Fourier transforming random fields with a Gaussian distribution defined by the power spectra measured over the maps raytraced from the outputs of cosmological Nbody simulations. The new suite, which has a onetoone correspondence with the original data, has no information encoded beyond the power spectrum.
The 68% and 95% credible contours from the power spectrum, lensing peaks and the newly trained CNN, as well as the marginalized distributions for and are displayed in Fig. 7, which is analogous to Fig. 6 but for GRFs instead of maps from Nbody simulations.
As before, the likelihoods peak on the true parameter values for the fiducial and the contours appear centered around . The likelihood for the power spectrum is the same as the one computed for the convergence maps. The likelihoods for lensing peaks and the neural network are different, and their contours larger than those derived from the power spectrum. In particular, the contours from lensing peaks are larger for the contours, and those from the neural network larger. This result is consistent with the absence of information beyond the power spectrum in the Gaussian Random Fields, and demonstrates that the small scatter in the parameters’ predictions from the neural network trained on convergence maps is not the result of a tendency to overfitting by its architecture or other spurious effects.
Comparing the predictions with the ground truth for the test GRFs, as done in § III for the test maps, we see that there is both an increase in the scatter and the bias of the neural network’s predictions (see Fig. 8). Both effects drive the deterioration in the parameter constraints that can be inferred from those predictions. Furthermore, the neural network seems almost insensitive to , as the predictions for all the test GRFs scatter around the median for the 96 cosmologies. The CNN cannot easily distinguish between models with different and defaults to the value that minimizes the loss function. The use of an unweighted loss function in this analysis may also have some influence, but the same behavior is not seen on . The power spectrum and lensing peaks are both sensitive to that parameter, indicating that they extract different information than the neural network.
iv.2 Effect of the smoothing scale on the results
The angular resolution of the mock convergence maps used for our analysis is arcmin per pixel. This high resolution is interesting from an academic perspective, but at present it is of little practical interest. Accurate shear estimates require measuring the shape of many galaxies to estimate their correlations. For instance, the upcoming LSST survey will reach an effective number of galaxies of , after considering losses due to blending and masking Chang et al. (2013). This means that arcmin is characteristic of the resolution achievable by future surveys. Furthermore, at small scales (), baryonic physics alter the matter distribution and can bias WL observables relative to estimates from DMonly simulations Semboloni et al. (2013); Osato et al. (2015).
To assess whether the neural network still outperforms alternative observables on arcmin resolution data, we trained a new network with the same architecture on the maps after smoothing them with a Gaussian kernel. The resulting constraints, for a smoothing scale of 1 arcmin, are displayed in Fig. 9.
The parameters’ constrains degrade for all three methods. In principle, we would expect the nonGaussian statistics’ performance to degrade relative to the power spectrum as small scale features are smoothed away from the maps. Up to 1 arcmin smoothing, the neural network keeps well its relative advantage to the power spectrum, yielding credible regions smaller at the level. Lensing peaks are more adversely affected than the CNN by smoothing, yielding contours that are only smaller than the power spectrum. This would indicate that any additional information extracted by the neural network is not confined to very small angular scales.
The first attempt at training the neural network on smoothed data failed. To guarantee the convergence in the training process, we gradually smoothed the maps in a similar way as Schmelzle et al. (2017) added noise to theirs. We fed the network with maps of growing smoothing scale, starting with a kernel of 0.2 arcmin of bandwidth. Once the network reached convergence at a smoothing scale, the kernel’s bandwidth was increased by 0.05 arcmin and the network retrained. In all cases the neural network kept its advantage (see Table 4). The ratio between the areas of the credible regions derived from the power spectrum and the neural network remained roughly constant, while the same ratio for the lensing peaks and neural network increased as the capability of peaks to extract information degraded faster with larger smoothing scales.
Smoothing  Power spectrum  Peak counts  

68 %  95%  68%  95%  
  5.9  6.1  1.9  1.9 
0.2  7.0  5.9  1.8  1.6 
0.3  7.7  7.9  2.3  2.7 
0.4  6.5  6.4  1.9  2.2 
0.5  7.1  6.5  2.5  2.5 
0.6  6.5  5.7  2.5  2.4 
0.7  6.4  5.2  2.8  2.4 
0.8  4.7  4.1  2.5  2.3 
0.9  5.2  4.4  3.0  2.8 
1.0  5.6  4.8  3.6  3.3 
iv.3 Bias in the CNN predictions
The parameter predictions from the neural network exhibit some bias (see Fig. 5). The bias is more severe when an unweighted loss function is used, as can be seen in Fig. 10. This can be due to the loss function being dominated by errors in the densely sampled regions of the parameter space.
Weighting the loss function according to the sampling density helps mitigate the bias. The effect is larger for the high region than for the high models. This can be due to the difference in sampling between both regions. The high region, corresponding to quadrant in Fig. 1 has more models further from the fiducial and with large spacing between them than the high region (quadrant ).
The weights in the loss function were computed using a kernel density estimator (KDE) to estimate the sampling density in parameter space. The KDE bandwidth used was 1.0, a value that yielded a smooth estimate.
Biases in predictions from neural networks have been found in other works (e.g. Ravanbakhsh et al. (2016)), so we cannot guarantee that the heterogeneous sampling of our data is the only source of the bias. Future work using a different dataset, uniformly sampled, will address this issue.
The parameter constraints for an observation near the fiducial model are not affected by the use of an unweighted loss function, as Fig. 11 illustrates. This is because the scatter of the predictions in densely populated areas does not increase significantly when the bias in the sparsely sampled areas is reduced with a modified loss function. We did not retrain our networks with a weighted loss function due to the additional computational cost, since the constraints from the network’s predictions are essentially unchanged.
V Conclusions
We trained a convolutional neural network on simulated, noiseless, weak lensing convergence maps. We demonstrated that neural networks can outperform methods based on traditional observables such as the power spectrum, or even statistics previously shown to extract nonGaussian information, such as lensing peaks. On data smoothed at 1 arcmin scales, within reach of upcoming surveys, the neural network outperformed the power spectrum by a factor of and the lensing peaks by a factor of (using the area of the confidence contour in the {} plane as a figureofmerit).
We performed null tests to verify that the improvement in the parameter constraints reflect the network’s ability to extract additional information present in the WL data, and is not a numerical artifact (for instance, some form of overfitting). This sets a lower limit to the cosmological information encoded in noiseless lensing maps, whether this is also the case in more realistic, noisy data sets, remains an open question. The network’s constraints are limited by both the precision and bias of its predictions. Whether further improvements are reachable through a different network architecture, or a richer training data set, remains an open question and calls for further investigation.
Our results are consistent with previous findings in Ravanbakhsh et al. (2016) for the 3D matter power spectrum and in Schmelzle et al. (2017) for the ability of neural networks to distinguish WL data generated from different cosmologies. Some of the questions that future work will address are:

Effect of noise on predictive power. The presence of realistic levels of noise (e.g. shape noise) can pose challenges to neural network training Schmelzle et al. (2017). It remains to be shown if the improvement in parameter constraints compared with the power spectrum is achievable with noisy data.

Propagation of systematics on constraints from neural networks. Before neural networks can be used to infer parameters from weak lensing data, we need to understand the effect of the systematics present in the data on the resulting parameter constraints.

Scaling with survey area. Since neural networks’ training time steeply increases with the map size, it is important to assess how the constraining power from their predictions scale with map size, and how the scaling compares with that for alternative methods such as the power spectrum.

Network analysis. While the interpretation of feature maps from deep networks (see Schmelzle et al. (2017)) is not straightforward, it may provide valuable insights to design new summary statistics capable of extracting cosmological information from lensing observations.

Improvements in the network’s training and architecture. An extended exploration of training parameters (density of models in parameter space, number of independent examples per model, loss function, etc.) and architecture’s features (convolutional kernel size, number of layers, etc.) will elucidate the effect of these choices in the resulting constraints.
Acknowledgments
We thank Martin Kilbinger and the anonymous referee for valuable comments on the manuscript. The simulations to generate the data, and the training of the neural networks were performed at the NSF XSEDE facility, supported by grant number ACI1053575 and the Habanero computing cluster at Columbia University. This work was supported in part by NSF Grant No. AST1210877 and by the Research Opportunities and Approaches to Data Science (ROADS) program at the Institute for Data Sciences and Engineering (IDSE) at Columbia University. DH acknowledges support from a Sloan Research Fellowship. ZH gratefully acknowledges sabbatical support by the Simons Fellowship in Theoretical Physics and thanks New York University, where some of this work was performed, for their hospitality.
Appendix A Gaussian likelihood approximation
One way to assess how valid the Gaussian approximation is for the likelihood of a given observable is to estimate its probability density function (PDF) from our simulations without assuming any specific functional form. A nonparametric method to do that estimation is the Kernel Density Estimator (KDE). The main challenge to apply this approach to lensing peaks is how to achieve a density estimator in a high dimensional space with a limited number of independent vectors (512 per model).
We performed an analysis with noisy data within the framework of a different study, that supports that a Gaussian likelihood is not a bad approximation for lensing peaks. The dataset corresponds to the same cosmologies used for this study, and the convergence maps have been smoothed with a characteristic scale of 1 arcmin, but they also have an ellipticity noise of
present. To reduce the dimensionality of the observable, we performed an Independent Component Analysis (ICA)
Herault et al. (1985); Comon (1994). This method provides the directions that maximize negative entropy, which can be interpreted as the directions in which the data is less Gaussian. As a preprocessing step, we whitened the data (i.e. we removed its mean and normalized its covariance), and then we projected the whitened data into the 9 directions found following ICA. We then used a KDE to estimate the PDF of the resulting data. While we found some nonGaussianities, specially for peak counts corresponding to high significance, the effect on the likelihood (and corresponding credible contours) is limited.As an illustration, in Fig. 12 we show the difference in credible contours obtained from a Gaussian likelihood from those obtained using a KDE. We display only the contours derived using only peaks with a signaltonoise greater than 3. These are the peaks for which the nonGaussianities are the most pronounced, and yet the contours obtained with both methods are comparable. Using a model to predict peak counts that does not rely on Nbody simulations, Lin and Kilbinger (2015b) also found that a Gaussian likelihood is a good approximation (to 10%) for lensing peaks.
To analyze whether a Gaussian distribution is a good approximation for the predictions from the neural network we used a modification of the KolmogorovSmirnoff test that can be applied to twodimensional distributions Fasano and Franceschini (1987). For each model, we computed the mean and covariance from the predictions for the test maps. Then, we tested the predictions against a Gaussian distribution defined by the estimated mean and covariance.
The null hypothesis, that there is no statistical difference between the distribution of our empirical samples (neural network predictions) and a Gaussian, cannot be rejected with a confidence of 99% except for 2 models which are far from the fiducial,
and . We conclude that a Gaussian likelihood is a reasonable approximation for the predictions from the neural network.Appendix B Sensitivity of results to interpolation
To assess how sensitive our results were to the models sampled from the parameter space , we trained an additional network on the same unsmoothed maps but removing the model from the training data set. When fed the test maps for that cosmology, the network that was not exposed to it during training yielded somewhat different predictions than the network which had seen maps from that model during training. The differences in the mean prediction were very small, with a shift of in and in . The change in scatter is more significant, the standard deviation in the predictions for increasing by and that for the predictions by . The larger degradation for may be related with the fact that the network’s architecture seems to have greater difficulty in distinguishing between models that differ in that parameter, as was shown in §IV for both GRFs and smoothed convergence maps.
While this sensitivity to interpolation highlights how relevant a wellsampled training data set is for proper generalization by the network’s architecture, we are mostly concerned about how interpolation errors propagate into the inferred parameters’ constraints. That effect is small, as Fig. 13 shows. The credible contours inferred from the predictions by both networks barely change, and the same applies to the marginal distributions inferred for both and . We show the contours computed for the worstcase scenario, that is, when the model missing from the training dataset is the “true” cosmology.
The small change in the parameter constraints’ from both networks indicate that our main conclusions would not change with a different sampling of the parameter space. Besides, as the priors on our cosmological parameters improve with new experiments, the parameter volume to be explored will shrink and the number of models that need to be simulated to sample that space properly will also decrease.
References
 Planck Collaboration et al. (2016) Planck Collaboration, P. A. R. Ade, N. Aghanim, M. Arnaud, M. Ashdown, J. Aumont, C. Baccigalupi, A. J. Banday, R. B. Barreiro, J. G. Bartlett, and et al., A&A 594, A13 (2016), arXiv:1502.01589 .
 Anderson et al. (2014) L. Anderson et al., MNRAS 441, 24 (2014), arXiv:1312.4877 .
 Refregier (2003) A. Refregier, ARA&A 41, 645 (2003), astroph/0307212 .
 Kilbinger (2015) M. Kilbinger, Reports on Progress in Physics 78, 086901 (2015), arXiv:1411.0115 .
 Hu (2002) W. Hu, Phys. Rev. D 66, 083515 (2002), astroph/0208093 .
 (6) http://www.cfhtlens.org/.
 (7) http://kids.strw.leidenuniv.nl/index.php.
 (8) http://www.darkenergysurvey.org.
 (9) http://hsc.mtk.nao.ac.jp/ssp/.
 (10) http://www.lsst.org.
 (11) http://sci.esa.int/euclid/.
 (12) http://wfirst.gsfc.nasa.gov.
 Weinberg et al. (2013) D. H. Weinberg, M. J. Mortonson, D. J. Eisenstein, C. Hirata, A. G. Riess, and E. Rozo, Phys. Rep. 530, 87 (2013), arXiv:1201.2434 .
 Neyrinck et al. (2009) M. C. Neyrinck, I. Szapudi, and A. S. Szalay, ApJ 698, L90 (2009), arXiv:0903.4693 [astroph.CO] .
 Seo et al. (2011) H.J. Seo, M. Sato, S. Dodelson, B. Jain, and M. Takada, ApJ 729, L11 (2011), arXiv:1008.0349 [astroph.CO] .
 Shirasaki (2017) M. Shirasaki, MNRAS 465, 1974 (2017), arXiv:1610.00840 .
 Kratochvil et al. (2010) J. M. Kratochvil, Z. Haiman, and M. May, Phys. Rev. D 81, 043519 (2010), arXiv:0907.0486 [astroph.CO] .
 Kratochvil et al. (2012) J. M. Kratochvil, E. A. Lim, S. Wang, Z. Haiman, M. May, and K. Huffenberger, Phys. Rev. D 85, 103513 (2012), arXiv:1109.6334 [astroph.CO] .
 Dietrich and Hartlap (2010) J. P. Dietrich and J. Hartlap, MNRAS 402, 1049 (2010), arXiv:0906.3512 .
 Shirasaki and Yoshida (2014) M. Shirasaki and N. Yoshida, ApJ 786, 43 (2014), arXiv:1312.5032 .
 Liu et al. (2015) J. Liu, A. Petri, Z. Haiman, L. Hui, J. M. Kratochvil, and M. May, Phys. Rev. D 91, 063507 (2015), arXiv:1412.0757 .
 Petri et al. (2015) A. Petri, J. Liu, Z. Haiman, M. May, L. Hui, and J. M. Kratochvil, Phys. Rev. D 91, 103511 (2015), arXiv:1503.06214 .
 Kacprzak et al. (2016) T. Kacprzak et al., MNRAS 463, 3653 (2016), arXiv:1603.05040 .
 Martinet et al. (2018) N. Martinet, P. Schneider, H. Hildebrandt, H. Shan, M. Asgari, J. P. Dietrich, J. HarnoisDéraps, T. Erben, A. Grado, C. Heymans, H. Hoekstra, D. Klaes, K. Kuijken, J. Merten, and R. Nakajima, MNRAS 474, 712 (2018), arXiv:1709.07678 .
 Takada (2014) M. Takada, in Statistical Challenges in 21st Century Cosmology, IAU Symposium, Vol. 306, edited by A. Heavens, J.L. Starck, and A. KroneMartins (2014) pp. 78–89, arXiv:1407.3330 .
 Yang et al. (2011) X. Yang, J. M. Kratochvil, S. Wang, E. A. Lim, Z. Haiman, and M. May, Phys. Rev. D 84, 043529 (2011), arXiv:1109.6333 .
 Zorrilla Matilla et al. (2016) J. M. Zorrilla Matilla, Z. Haiman, D. Hsu, A. Gupta, and A. Petri, Phys. Rev. D 94, 083506 (2016), arXiv:1609.03973 .
 Liu and Haiman (2016) J. Liu and Z. Haiman, Phys. Rev. D 94, 043533 (2016), arXiv:1606.01318 .
 Fan et al. (2010) Z. Fan, H. Shan, and J. Liu, ApJ 719, 1408 (2010), arXiv:1006.5121 .
 Lin and Kilbinger (2015a) C.A. Lin and M. Kilbinger, A&A 576, A24 (2015a), arXiv:1410.6955 .
 LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
 Lin et al. (2016) H. W. Lin, M. Tegmark, and D. Rolnick, ArXiv eprints (2016), arXiv:1608.08225 [condmat.disnn] .
 Miller (1993) A. S. Miller, Vistas in Astronomy 36, 141 (1993).
 Tagliaferri et al. (2003) R. Tagliaferri et al., Neural Networks 16, 297 (2003), neural Network Analysis of Complex Scientific Data: Astronomy and Geosciences.

Ravanbakhsh et al. (2016)
S. Ravanbakhsh, J. Oliva,
S. Fromenteau, L. Price, S. Ho, J. Schneider, and B. Poczos, in Proceedings of The 33rd International Conference on Machine
Learning
, Proceedings of Machine Learning Research, Vol. 48, edited by M. F. Balcan and K. Q. Weinberger (PMLR, New York, New York, USA, 2016) pp. 2407–2416.
 Mustafa et al. (2017) M. Mustafa, D. Bard, W. Bhimji, R. AlRfou, and Z. Lukić, ArXiv eprints (2017), arXiv:1706.02390 [astroph.IM] .
 Rodriguez et al. (2018) A. C. Rodriguez, T. Kacprzak, A. Lucchi, A. Amara, R. Sgier, J. Fluri, T. Hofmann, and A. Réfrégier, ArXiv eprints (2018), arXiv:1801.09070 .
 Schmelzle et al. (2017) J. Schmelzle, A. Lucchi, T. Kacprzak, A. Amara, R. Sgier, A. Réfrégier, and T. Hofmann, ArXiv eprints (2017), arXiv:1707.05167 .
 Jain and Seljak (1997) B. Jain and U. Seljak, ApJ 484, 560 (1997), astroph/9611077 .
 Petri (2016) A. Petri, Astronomy and Computing 17, 73 (2016), arXiv:1606.01903 .
 Springel (2005) V. Springel, MNRAS 364, 1105 (2005), astroph/0505010 .
 Lewis et al. (2000) A. Lewis, A. Challinor, and A. Lasenby, Astrophys. J. 538, 473 (2000), arXiv:astroph/9911177 [astroph] .
 Schneider et al. (1992) P. Schneider, J. Ehlers, and E. E. Falco, Gravitational Lenses, XIV, 560 pp. 112 figs.. SpringerVerlag Berlin Heidelberg New York. Also Astronomy and Astrophysics Library (1992) p. 112.
 Petri et al. (2017) A. Petri, Z. Haiman, and M. May, Phys. Rev. D 95, 123503 (2017), arXiv:1612.00852 .
 Petri et al. (2016) A. Petri, Z. Haiman, and M. May, Phys. Rev. D 93, 063524 (2016), arXiv:1601.06792 .
 Fang and Haiman (2007) W. Fang and Z. Haiman, Phys. Rev. D 75, 043010 (2007), astroph/0612187 .
 Maas et al. (2013) A. L. Maas, A. Y. Hannun, and A. Y. Ng, in Proc. ICML, Vol. 30 (2013).
 Nair and Hinton (2010) V. Nair and G. E. Hinton, in Proceedings of the 27th international conference on machine learning (ICML10) (2010) pp. 807–814.
 Hinton et al. (2012) G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, CoRR abs/1207.0580 (2012).
 Kingma and Ba (2014) D. P. Kingma and J. Ba, ArXiv eprints (2014), arXiv:1412.6980 [cs.LG] .
 (51) http://www.xsede.org.
 Kilbinger et al. (2009) M. Kilbinger, K. Benabed, J. Guy, P. Astier, I. Tereno, L. Fu, D. Wraith, J. Coupon, Y. Mellier, C. Balland, F. R. Bouchet, T. Hamana, D. Hardin, H. J. McCracken, R. Pain, N. Regnault, M. Schultheis, and H. Yahagi, A&A 497, 677 (2009), arXiv:0810.5129 .
 Takahashi et al. (2012) R. Takahashi, M. Sato, T. Nishimichi, A. Taruya, and M. Oguri, ApJ 761, 152 (2012), arXiv:1208.2701 .
 Hartlap et al. (2007) J. Hartlap, P. Simon, and P. Schneider, A&A 464, 399 (2007), astroph/0608064 .
 Alfeld and other (1984) P. Alfeld and other, Defense Technical Information Center Kingma, D. P. and Ba, J. (1984).
 Renka and Cline (1984) R. Renka and A. Cline, Rocky Mountain J. Math 14 (1984).
 Nelder and Mead (1965) J. A. Nelder and R. Mead, The computer journal 7, 308 (1965).
 Sato et al. (2010) M. Sato, K. Ichiki, and T. T. Takeuchi, Physical Review Letters 105, 251301 (2010), arXiv:1011.4996 [astroph.CO] .
 Zhang et al. (2016) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, ArXiv eprints (2016), arXiv:1611.03530 [cs.LG] .
 Chang et al. (2013) C. Chang, M. Jarvis, B. Jain, S. M. Kahn, D. Kirkby, A. Connolly, S. Krughoff, E.H. Peng, and J. R. Peterson, MNRAS 434, 2121 (2013), arXiv:1305.0793 .
 Semboloni et al. (2013) E. Semboloni, H. Hoekstra, and J. Schaye, MNRAS 434, 148 (2013), arXiv:1210.7303 .
 Osato et al. (2015) K. Osato, M. Shirasaki, and N. Yoshida, ApJ 806, 186 (2015), arXiv:1501.02055 .
 Herault et al. (1985) J. Herault, C. Jutten, and B. Ans, Dixieme colloque sur le traitement du signal et ses appications (1985).
 Comon (1994) P. Comon, Signal Processing 36, 287 (1994).
 Lin and Kilbinger (2015b) C.A. Lin and M. Kilbinger, A&A 583, A70 (2015b), arXiv:1506.01076 .
 Fasano and Franceschini (1987) G. Fasano and A. Franceschini, MNRAS 225, 155 (1987).
Comments
There are no comments yet.