Graph Neural Networks for IceCube Signal Classification

09/17/2018 ∙ by Nicholas Choma, et al. ∙ NYU college Imperial College London berkeley college Berkeley Lab USI Università della Svizzera italiana 2

Tasks involving the analysis of geometric (graph- and manifold-structured) data have recently gained prominence in the machine learning community, giving birth to a rapidly developing field of geometric deep learning. In this work, we leverage graph neural networks to improve signal detection in the IceCube neutrino observatory. The IceCube detector array is modeled as a graph, where vertices are sensors and edges are a learned function of the sensors' spatial coordinates. As only a subset of IceCube's sensors is active during a given observation, we note the adaptive nature of our GNN, wherein computation is restricted to the input signal support. We demonstrate the effectiveness of our GNN architecture on a task classifying IceCube events, where it outperforms both a traditional physics-based method as well as classical 3D convolution neural networks.



There are no comments yet.


page 2

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Graph- and manifold-structured data are ubiquitous in a broad range of research fields studying interactions and relations between objects. Social networks, molecules, point clouds, and 3D shapes are just a few of many different examples of such data. The remarkable success of deep learning techniques in computer vision has motivated in recent years the development of deep learning architectures capable of coping with data defined on non-Euclidean domains, referred to by the general term

Geometric Deep Learning [1].

A plethora of geometric deep learning models have been proposed, including learnable information diffusion processes [2, 3, 4, 5], generalizations of convolutional neural networks (CNN) [6] to graphs using spectral filters [7, 8, 9, 10, 11, 12, 13] and local neighborhood operations [14, 15, 16, 17, 18, 19]. Geometric deep learning has been successfully employed in a broad spectrum of applications, ranging from computer graphics, vision [20, 21, 19], and medicine [22, 23] to chemistry [14, 5] and high energy physics [24].

In this paper, we study the application of graph neural networks (GNN) to the challenging problem of neutrino detection in the IceCube observatory. One of the key motivations for the use of GNNs is the irregular geometry of the detectors, as well as the sparse nature of the signal recorded by the IceCube detectors. A significant challenge setting apart the neutrino detection problem from typical pattern classification problems is a very large asymmetry between positive and negative events; requiring good accuracy in the regime of extremely low False Positive Rates on the order of .

I-a IceCube Experiment

IceCube is a 1 km neutrino observatory located at the South Pole [25]. Its primary purpose is to look for high-energy (above 100 gigaelectronvolts (GeV)) neutrinos that are produced by the same cosmic particle accelerators that produce ultra-high energy cosmic-rays [26]. Its 5,160 sensors (digital optical modules, or DOMs) detect the Cherenkov light that is produced by the relativistic charged particles resulting from high-energy neutrinos interacting in the Antarctic ice. The Cherenkov light is emitted at an angle and may scatter before being observed by sensors that are typically 10 to 60 meters away from the track. Sixty sensors are deployed on each of 86 strings placed in holes drilled in the ice. Most of the strings are on a 125 m triangular grid, but 8 strings, forming the ’Deep Core’ infill array, have much tighter spacing. On most strings, the DOMs are deployed every 17 m, from 1450 m to 2450 m below the surface; in Deep Core, most of the DOMs are deployed with a 7 m spacing between 2100 and 2450 m. IceCube includes an array of 81 surface stations called IceTop, designed to study cosmic ray interactions in the atmosphere. The schematic view of the IceCube detector is shown in Fig. 1. The sensors record the photon arrival times using waveform digitizers. Across the array, the relative arrival times are known to better than 3 ns [25].

Fig. 1: The IceCube Neutrino Observatory with the in-ice array, its sub-array DeepCore, and the cosmic-ray air shower array IceTop. The string color scheme represents different deployment seasons. The top-right insert presents the top view of the IceCube detector. The DeepCore sub-array is represented by open circles.

IceCube observes two classes of events. Contained events occur when neutrinos interact within the detector. Through-going events are mostly long-lived muons which can travel many kilometers in the ice. They can be produced in neutrino interactions, or, in downward-going events, in cosmic-ray air showers that occur when high-energy cosmic-rays interact in the upper atmosphere.

IceCube has observed a spectrum of astrophysical neutrinos from energies of about 35 teraelectronvolts (TeV) to above 2 petaelectronvolts (PeV) [27, 28]. At lower energies, there is an overwhelming background of neutrinos produced when cosmic-rays interact in our atmosphere. At higher energies, several factors emerge. First, the neutrino-nucleon cross-section rises and the Earth becomes opaque to neutrinos, so we cannot observe neutrinos coming from much below the horizon [29]. Second, the flux drops sharply at higher energies.

A study of downward-going muons from astrophysical neutrino interactions could significantly expand IceCube’s reach in this area. Such a search is challenging because of the very high flux of muons from cosmic-ray air showers. In this work, we discuss methods to separate this signal (muons from neutrinos) from the background (muons from cosmic-ray showers). For the present purposes, the main difference between the signal and the background is the stochasticity of the energy deposition, which translates into clumpiness in the light emission from the track. Muons from neutrinos are single muons, which lose energy stochastically, leading to a very uneven light emission along the muon track. In contrast, muons from cosmic-ray interactions come in bundles containing from one (rarely) to hundreds of muons. With the large number of muons, the light emission averages out, and becomes quite smooth (see Fig. 2). There is still variation in light detection due to the depth-dependent optical properties of the ice.

Fig. 2: The characteristic pattern of light deposition for muon bundles (left) and a high-energy single muon with visible stochastic light emission along the track (right). The red line shows the actual (Monte Carlo track), while each colored bubble represents a DOM that saw light in the event. The colors indicate the relative light arrival time, from red (earliest) to blue (latest), while the size of the bubbles indicates the number of observed photons.

We generated two Monte Carlo data sets, one for signal and one for background, and used them for all three methods discussed in this paper. In this work, we consider only astrophysical muon neutrino events as a signal. In our signal simulation set, neutrino energies range from 100 GeV to 10 GeV. The astrophysical neutrino data set was assumed to follow a power law, with , while the cosmic-ray air shower background was generated with the CORSIKA simulation package [30] using H4a composition assumption [31]. The energy range of cosmic ray primaries in our background simulation set is 600 GeV to 10 GeV. No other backgrounds are taken into consideration. For both signal and background, more energetic events are likely to pass the event selection. So, to efficiently generate our samples, we generated the signal and background samples following a harder spectral index (i. e. biased toward more energetic events). Each event has an associated weight, so that weighted histograms reproduce the output from the original spectrum. This allows considerably more efficient use of computer resources, but it becomes harder to evaluate the statistical significance of any result.

The particles produced by these simulations were then run through an IceCube-specific detector simulation, which included the generation of Cherenkov light, light propagation through the ice, and the detector response. For uniformity, the DeepCore array and IceTop stations were excluded during the preselection process. The resulting simulated data were run through the standard IceCube calibration and reconstruction packages. To be usable by IceCube, the final event sample must have a reasonably high signal-to-noise ratio (SNR). Here, to compare different methods, we decided a-priori to evaluate the methods on the basis of how many events they could find, subject to maintaining a 1:1 SNR. Since the background is many orders of magnitude larger than the signal, this requires a very high level of rejection.

Ii Baseline Architectures

Ii-a Physics Baseline Method

IceCube has developed two conventional (i. e. not machine learning) criteria to measure stochasticity and reject muon bundles. The first divides the muon track into 120 m long segments, and determines the light output from the track in each segment [32]. This light output is then fit to a line, and a pseudo- as the sum of squared residuals is calculated. Events with large are very stochastic. The second approach also reconstructs the light deposition along the track, by examining the light output in 50 meter long segments, apportioning light from each DOM to the nearest segment. Then, the energy loss in the largest segment is divided by the mean energy loss, giving the peak/median ratio. The results from these two methods are correlated, but the correlation is low enough that there is value in using both. IceCube uses a combination of hand-tuned selections on these two variables to select single muon events. The baseline method utilizes stochasticity selections and selections in energy proxy versus zenith phase space to limit the cosmic muon background contribution [33].

Ii-B 3D Convolution Neural Network

The geometry of the IceCube detector is shown in Figure 1. Convolution neural networks typically perform convolution on orthogonal coordinate systems, so the input data is transformed and mapped onto an orthogonal grid [34]

. We mapped 86 strings of the detector onto a 10x20 grid and hexagonally shaped input was transformed to orthogonal by padding with zeros. Each string of the detector holds 60 DOMs, as a result input tensor with three spatial dimensions (10x20x60) is fed into a 3D CNN with 18 layers. ResNet-18


is a 18 layer convolution neural network, where the residual learning framework and shortcut connections ease the training of deep networks. These residual blocks consist of convolutions and a summed shortcut connection to the output; convolutions were done using Keras framework with Tensorflow backend. Deepcore strings were added with strings 86, 81, 82 on one row and strings 85, 79, 84, 83, 80 on another row. We extracted information from simulated background and signal data by aggregating events from hdf5 files events (collection of information from the sensors with a fixed time window).

Iii Graph Neural Networks

Iii-a Overview

First attempts of learning on graphs constructed features resulting from the steady state of a learnable diffusion process [2, 3]

. This approach has recently been improved with modern deep learning schemes using gated recurrent units

[4] and neural message passing [5].

Bruna et al. [7, 8]

proposed formulating convolution-like operations in the spectral domain, defined by the eigenvectors of the graph Laplacian operator. The key idea is that Laplacian eigenvectors form an orthogonal basis allowing performing Fourier decomposition of graph-based signals; filtering is thus performed by multiplying the Fourier coefficients of the input signal by learnable spectral representation of a filter. Among the drawbacks of this architecture is

computational complexity due to the cost of computing the forward and inverse graph Fourier transform (multiplication by the dense matrix of eigenvectors),

parameters per layer, and no guarantee of spatial localization of the filters.

A more efficient class of spectral graph CNNs was introduced in [9, 36] and follow up works, who proposed spectral filters that can be expressed in terms of simple operations (such as additions, scalar- and matrix multiplications) w.r.t. the Laplacian. In particular, Defferrard et al. [9] considered Chebyshev polynomial filters of degree , which require only multiplications by the Laplacian matrix (each such multiplication costs assuming that the graph is sparsely connected). Due to the local nature of the Laplacian operator, such filters are guaranteed to have a -hop support. Levie et al. [11] extended this framework using rational filter functions that also include inversions of the Laplacian, which are carried out approximately using an iterative Jacobi method. Monti et al. used multivariate polynomials w.r.t. multiple Laplacians defined by graph motifs [13] as well as Laplacians defined on products of graphs [12].

A different class of methods are spatial formulations of graph CNNs, which operate on local neighborhoods on the graph [14, 15, 16, 17, 18, 19]. Monti et al. [15] proposed the Mixture Model networks (MoNet), generalizing the notion of image ‘patches’ to graphs. The centerpiece of this construction is a system of local pseudo-coordinates assigned to a neighbor of each vertex . The spatial analogue of a convolution is then defined as a set of weighting kernels applied in these coordinates. In particular, when Gaussian kernels are used, the graph convolution can be interpreted as a Gaussian mixture,


where , denotes the - and

-dimensional input and output feature column vectors at vertex

, respectively, is the neighborhood of , and , and are the learnable Gaussian mixture parameters (mixture weights, mean vectors, and covariance matrices, respectively). Each Gaussian applied to the input features can be considered as one dimension of the local -dimensional patch (in the Euclidean setting, one pixel).

Veličkovic̀ et al. [18] reinterpreted this scheme as graph attention (GAT), learning the relevance of neighbor vertices for the filter result,




are attention scores representing the importance of vertex w.r.t. , and the matrix and -dimensional vector are the learnable parameters. The graph attention mechanism was extended in [37] using high-order tensor covariant models, in [38] using the non-backtracking operator defined on the line graph, and in [39] using alternating graph convolutions with dual graph convolutions.

Iii-B Graph neural networks on IceCube

The IceCube experiment can be considered as a geometric domain given by the coordinates of each DOM, resulting in a 3D irregular hexagonal grid. For a given event, active DOMs output three real-valued numbers (representing the energy and the time of the interaction), which can be thought of as a signal defined on a fixed graph. Since many DOMs are inactive, such a signal is very sparse. Alternatively, one can consider each event as a signal on a different graph, containing only the active DOMs. Graph neural networks, capable of dealing both with irregular geometry and graphs of different size, thus appear perfect candidates for the IceCube experiment. Further, a prior assumption typically made on data when using convolutional neural networks is that of stationarity, due to the common assumption of shift-invariance in e.g. natural images [6].

Because the IceCube detector array is hexagonal and irregular, stationarity is a too strong assumption in our case as DOMs may present different distances and orientations one w.r.t. the other. This would involve very fine-grained grids and consecutively large filters or many layers to extract meaningful hierarchical features from the provided input signal, potentially requiring vast amounts of parameters and thus leading to overfitting. Graph neural networks do not require this type of stationarity, as distances between pairs of DOMs can be represented with graph edges which are calculated using a learned kernel function. The flexibility that graph operators introduce in the model allows in this sense to realize rich filters in space while varying just a significantly small amount of parameters (e.g. in the MoNet framework presented above, we just need to change the mean and standard deviations of the considered Gaussian kernels for analyzing different regions of space). This naturally allows to implement significantly smaller models, improving generalization over unseen conditions and ultimately performance at test time.

Iv Network Architecture


The input to our network is an event, consisting of DOMs represented as graph vertices. To each input vertex, a 6-dimensional feature vector is associated, containing the position of its corresponding DOM, the sum of charge in the first pulse within the sensor, the sum of charge in all pulses within the sensor, and the time at which the first pulse crosses the activation threshold. The task is to classify events into positives (neutrino) or negatives (background). A typical event activates only a sparse set of DOMs (see example in Fig. 2); the GNN architecture naturally allows to handle such situations by ignoring the inactive vertices, considering graphs of different size.

Graph construction

The geometry of the detector is represented by a fixed, weighted, directed graph. The weights are defined only by the spatial positions () of the DOMs. The adjacency matrix of the graph is constructed similarly to [15] by applying a Gaussian kernel to the pairwise distances between the DOMs, followed by the softmax operation,


where is the number of DOMs (vertices of the graphs), denotes the spatial coordinates of the th DOM, and is a learnable scalar parameter controlling the locality of the kernel and how fast the matrix spreads information across distant vertices.

Graph convolution

We apply a sequence of graph convolutional layers to the input data, as follows. Each layer inputs -dimensional feature vector at each vertex arranged into an matrix and outputs -dimensional features represented as an matrix .

The graph convolution is performed by applying the adjacency matrix to the input


where the learnable parameters are a -dimensional vector of weights and the scalar bias , different for each layer. The output of the layer is produced by concatenating the result of the graph convolution with a non-linear activation thereof,


which was experimentally found beneficial to achieve better performance.


The output features of the th graph convolutional layer are pooled using a sum operation over the vertices,


producing a -dimensional vector

, to which logistic regression is applied to output the predicted class of the event,


where is a -dimensional vector of weights and is the scalar bias parameter of the output layer.

V Experiment and Results

V-a Training and Evaluation

Model training, validation, and testing were completed using a signal and a background dataset containing 25250 and 109491 events, respectively. Each dataset was then subdivided into 50% training, 25% validation, and 25% test sets. We performed early stopping once performance was maximized during training on the validation set, up to 100 epochs. Of our trained models, we selected for final evaluation the model which performed best on the test set.

Final model evaluation was performed using additional signal and background datasets containing 8487 and 366433 events, respectively.

Each event is assigned a weight during its generation corresponding to the event’s frequency of occurrence. As such, all training and evaluation was performed on weighted samples, keeping in line with the goal of selecting as many signal events per year as possible while maintaining a 1:1 SNR.

Table I displays the number of events for both the signal (SG) and background (BG) classes within each dataset, with and without event weighting. The number of weighted events for a given class within a given dataset is the sum of all weights of events in the dataset which belong to that class. Thus while the class imbalance is large for the unweighted samples, it is far larger once event weighting is applied.

# Unweighted # Weighted
Dataset Signal Background Signal Background
Training 12624 54745 7.8 37275
Validation 6313 27373 3.9 18648
Test 6313 27373 3.9 18632
Final Evaluation 8487 366433 5.2 250982
TABLE I: Unweighted and weighted number of signal and background events within each dataset

V-B Classification Performance

Table II displays the expected number of weighted signal and background events each model selects per year, as assessed on the final evaluation dataset. Here the GNN outperforms the physics baseline by identifying 6.3x more signal events while providing a signal-to-noise ratio (SNR) which is 3x improved, also outperforming the CNN.

Fig. 3 displays receiver operating characteristic (ROC) curves for both the GNN and CNN models. The physics baseline is directly optimized to maximize the number of signal events identified while attempting to maintain an SNR of 1.0; thus it does not have a ROC curve. Because there are so many more weighted background events than signal events present, of primary interest is the true positive rate when the FPR becomes small.

# events per year
Method Signal Background Signal:Noise
Physics Baseline 0.922 0.934 0.987
3D CNN 1.815 1.937 0.937
GNN 5.772 1.937 2.980
TABLE II: Performance of several methods

Fig. 3: Receiver operating characteristic curve for various methods considered in this paper. The green square and blue X indicate the evaluation point for the GNN and CNN, respectively.

V-C Implementation Details

All code for the GNN model was written in Python using the PyTorch deep learning framework. Models were trained using Nvidia Tesla P40 GPUs and required approximately 40 hours for 100 epochs of training over 134741 samples, though typically model convergence occurred after just 40 epochs.

Vi Discussion

The physics baseline result was obtained by optimizing selections on stochasticity in a selected phase space of muon energy proxy and muon zenith angle. The optimization was performed on the full simulated data sets leading to the final cut level numbers of 0.92 and 0.93 events per year for signal and background, respectively. There are some upgrades to the physics analysis performance level that may increase signal and decrease background. One of them is incorporation of data from the surface IceTop detector and exploitation of its veto capabilities [33].

Although the neural network studies were meant to selected events based on the same stochasticity criteria, studies of the selected events found that the networks were also selected starting tracks, where a neutrino interacted in the detector. Starting tracks are studied by IceCube, but in a different set of diffuse analyses [40, 41]. They were excluded from the current physics baseline analysis, and, in any case, the chosen cuts would strongly discriminate against these events. A fully parallel treatment of these events would change the comparison between the baseline and machine learning results.

Despite having small sample sizes, this study shows the possibility of a marked improvement from application of the GNN over the physics baseline and demonstrates a huge potential of this approach for signal classification within the IceCube detector.

Vii Conclusions

In this work, we studied the application of graph neural networks to the challenging problem of signal detection in the IceCube neutrino observatory. We noted the key difficulties within this domain, including the irregular geometry of the detector and the sparse nature of signal it records, and including the large asymmetry between frequency of positive and negative event occurrence.

We presented a graph neural network architecture which allows for adaptive computation by operating on the input signal support. Quantitatively, this architecture results in a 3x improvement in signal-to-noise ratio, and is capable of detecting on the order of 6.3x more signal events than traditional physics-based baselines. While our results are still preliminary, the application of GNNs for the IceCube signal classification task are very promising and we expect future work to further improve the accuracy.


We thank the IceCube Collaboration for their support on this project. This work was supported in part by the National Science Foundation under grant number PHY-1307472 and the U.S. Department of Energy under contract number No. DE-AC02-05CH11231. This research used computational and storage resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility. FM and MB are supported in part by ERC Consolidator Grant No. 724228 (LEMAN), Google Faculty Research Awards, an Amazon AWS Machine Learning Research grant, an Nvidia equipment grant, a Radcliffe Fellowship at the Institute for Advanced Study, Harvard University, and a Rudolf Diesel Industrial Fellowship at IAS TU Munich. JB and NC are partially supported by the Intel IPCC Program and by the Alfred P. Sloan Foundation.