Variational Autoencoders for Generative Modelling of Water Cherenkov Detectors

11/01/2019 ∙ by Abhishek Abhishek, et al.

Matter-antimatter asymmetry is one of the major unsolved problems in physics that can be probed through precision measurements of charge-parity symmetry violation at current and next-generation neutrino oscillation experiments. In this work, we demonstrate the capability of variational autoencoders and normalizing flows to approximate the generative distribution of simulated data for water Cherenkov detectors commonly used in these experiments. We study the performance of these methods and their applicability for semi-supervised learning and synthetic data generation.




1 Introduction

We currently cannot explain the observed matter-antimatter asymmetry in the universe. Neutrino oscillations [8, 14] may exhibit significant charge-parity symmetry violation (CPV) [23], which may lead to the answer [15]. These can be measured in experiments such as T2K [4], which produce a muon neutrino (νμ) or antineutrino beam directed towards a far detector, where the oscillation signal is electron neutrinos (νe) or antineutrinos, respectively.

The planned next-generation Hyper-Kamiokande (Hyper-K) [2] far detector and NuPRISM [11] intermediate water Cherenkov detector (IWCD) are multi-kilotonne water tanks surrounded by 20” photomultiplier tubes (PMTs) or multi-PMT (mPMT) modules consisting of 3” PMT arrays. They detect single photons of Cherenkov light from muons (μ) and electrons (e) produced in νμ and νe interactions with water, respectively. These are easily classified by traditional likelihood ratio-based methods [e.g. 17] due to the electromagnetic shower induced by the much lighter electron, which causes a fuzzier Cherenkov light ring projected onto the PMTs. These methods, however, have limited discriminative power between an e and a high-energy photon (γ), such as one produced by a π0 decay in which the second photon goes undetected. Such a photon can produce an e+e− pair, which then also showers, resulting in a ring very similar to that of a νe event. Such misclassified events constitute a significant and poorly understood background to the νe signal [6, 5, 3], with no existing experimental constraint.

As in existing WC detectors such as Super-Kamiokande (Super-K) [13], 1) accurate first-principles modelling of detector effects, such as varying water conditions or PMT responses, remains a significant challenge. Furthermore, 2) due to the improved granularity of IWCD and potentially the Hyper-K detector, it may be possible to detect the slight difference in Cherenkov light emission between an e from a νe interaction and an e+e− pair from a γ. Thus, the development and implementation of modern machine learning methods can 1) enable training directly on data to mitigate experimental uncertainties, and 2) maximize the rejection of γ events to limit exposure to unconstrained theoretical uncertainties. Both developments are expected to enhance the sensitivity of Hyper-K to CPV and other phenomena such as proton decay [10] and multi-messenger astronomical events [e.g. 16, 9, 1].

In this work, we study the capability of variational autoencoders (VAEs) and one of their extensions, normalizing flows (NFs), to learn the data-generating distribution, and benchmark the performance of VAEs in semi-supervised training on simulated IWCD datasets. In addition, we study the capability of VAEs to generate synthetic samples.

2 Methodology

2.1 Simulation and Data Preprocessing

Water Cherenkov Simulation (WCSim) is a GEANT4 [7] and ROOT [12] based Monte Carlo (MC) software package. It was used to generate 3 million events each of e, μ, and γ particles in the IWCD. For this initial study, the e and μ initial positions and the γ pair-production positions were fixed at the center of the detector. In order to use convolutional neural networks (CNNs) as our primary NN architecture, the top and bottom of the detector cylinder were ignored and the particle directions were constrained to be perpendicular to the detector wall. The azimuthal angles and the energies (in MeV) of the particles were varied uniformly over fixed ranges. The detector wall is instrumented with mPMT “pixels”, and the resulting data is structured as an image, with each pixel containing one channel per PMT corresponding to the light intensity measured by that PMT.
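The mapping from PMT hits to an image tensor can be sketched as follows; the 16×40 grid of mPMT pixels and the 19 channels per pixel are illustrative values, not the paper's exact geometry:

```python
import numpy as np

# Hypothetical detector layout: the barrel is unrolled into an
# n_rows x n_cols grid of mPMT "pixels", each with n_pmts channels.
n_rows, n_cols, n_pmts = 16, 40, 19

def hits_to_image(hit_rows, hit_cols, hit_chans, charges):
    """Accumulate PMT charges into an image of shape (n_rows, n_cols, n_pmts).

    np.add.at performs unbuffered accumulation, so repeated hits on the
    same PMT are summed rather than overwritten.
    """
    image = np.zeros((n_rows, n_cols, n_pmts), dtype=np.float32)
    np.add.at(image, (hit_rows, hit_cols, hit_chans), charges)
    return image

# Example: three hits, two of which land on the same PMT channel.
img = hits_to_image(
    hit_rows=np.array([2, 2, 5]),
    hit_cols=np.array([10, 10, 7]),
    hit_chans=np.array([0, 0, 3]),
    charges=np.array([1.5, 0.5, 2.0], dtype=np.float32),
)
```

Each event then becomes a fixed-size multi-channel image suitable for a CNN.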

2.2 Semi-supervised and Unsupervised Deep Generative Models

Variational autoencoders (VAEs) [20] are a powerful class of latent-variable generative models that maximize an evidence lower bound (ELBO) to the intractable log-likelihood of the data, log p_θ(x):

    log p_θ(x) ≥ L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z))    (1)

Here D_KL is the Kullback–Leibler divergence, p(z) is the prior distribution over the latent variables z, q_φ(z|x) is the variational posterior distribution, and p_θ(x|z) is the conditional generative distribution. Generally, p(z) and q_φ(z|x) are defined as factorized Gaussian distributions N(0, I) and N(μ(x), diag(σ²(x))), respectively. q_φ and p_θ are parameterized using neural networks with parameters φ and θ, learned by minimizing −L(θ, φ; x) with gradient-descent methods. The expectation over the conditional generative distribution corresponds to the fidelity of the reconstruction. Since the input variables are continuous and considered Gaussian distributed, this term is replaced by the negative mean squared error (MSE) loss.

Normalizing flows (NFs) [22] were developed to address a key limitation of VAEs: even in the asymptotic regime, one is unable to recover the true posterior distribution due to the simple form of q_φ(z|x). In normalizing flows, a random variable z₀ with an initial probability density q₀(z₀) is transformed into another random variable z_K with probability density q_K(z_K) through a sequence of smooth, invertible mappings f_k:

    z_K = f_K ∘ ⋯ ∘ f₁(z₀),    log q_K(z_K) = log q₀(z₀) − Σ_{k=1}^{K} log |det ∂f_k/∂z_{k−1}|    (2)

In planar normalizing flows, these mappings have the form:

    f(z) = z + u h(wᵀz + b)    (3)

where u, w, and b are the free parameters of the flow and h is a smooth element-wise non-linearity.
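One planar-flow step and its log-determinant can be sketched in numpy; taking h = tanh is a common choice but an assumption here, and the matrix-determinant lemma gives det(I + u ψᵀ) = 1 + uᵀψ with ψ(z) = h′(wᵀz + b) w:

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b) to a batch of latent vectors and
    return the transformed batch plus log|det df/dz| per example."""
    a = z @ w + b                               # pre-activation, shape (batch,)
    f_z = z + np.outer(np.tanh(a), u)           # shape (batch, dim)
    psi = (1.0 - np.tanh(a) ** 2)[:, None] * w  # h'(a) * w, shape (batch, dim)
    log_det = np.log(np.abs(1.0 + psi @ u))     # matrix-determinant lemma
    return f_z, log_det

rng = np.random.default_rng(1)
dim = 4
z0 = rng.standard_normal((8, dim))
u, w, b = 0.1 * rng.standard_normal(dim), 0.1 * rng.standard_normal(dim), 0.0
z1, log_det = planar_flow(z0, u, w, b)
# Density update per Eq. (2): log q1(z1) = log q0(z0) - log_det
```

Stacking K such steps yields the flow of Eq. (2), with the log-determinants accumulated into the KL term of the objective.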

2.3 Model architecture

The architecture of a VAE can be conceptually divided into an encoder, a “bottleneck”, and a decoder. We employed a simple LeNet-inspired [21] CNN as the encoder and a symmetric architecture comprising transposed-convolution layers as the decoder. The encoder comprised two blocks of four and two convolution layers, with the number of channels per pixel increasing successively through the stack, and ReLU nonlinearities after each layer. Finally, fully connected layers were used to successively compress the feature vectors to the size of the latent vectors. This simple choice of CNN architecture allowed the focus to remain on the development of the generative models rather than the NN architectures.
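The spatial sizes through such an encoder follow the standard convolution output-size formula; the kernel sizes, strides, and 40-pixel input width below are illustrative assumptions, not the paper's exact configuration:

```python
def conv2d_out(size, kernel, stride=1, pad=0):
    """Output spatial size of a convolution: floor((size + 2p - k)/s) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical LeNet-style stack: two unstrided then two strided 3x3 convs.
size = 40  # assumed input image width in mPMT pixels
for kernel, stride in [(3, 1), (3, 1), (3, 2), (3, 2)]:
    size = conv2d_out(size, kernel, stride)
# A transposed-convolution decoder mirrors these steps to recover the input size.
```

The flattened output of the final feature map is what the fully connected layers compress down to the latent dimension.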

3 Empirical Evaluations and Results

The 9-million-event dataset is split 80%-10%-10% into train, validation, and test subsets, respectively. During training, the model with the lowest loss on a validation mini-batch is retained for further evaluation. All models are trained for 10 epochs using the Adam optimizer [18] with a fixed initial learning rate. In addition, training of both the unsupervised VAE and the planar NF uses the ‘annealing trick’ [22, 24], where the KL divergence term in the loss is scaled by a factor β that increases linearly from 0 to 1 with the number of epochs. We observed that using the ‘annealing trick’ resulted in a significantly lower MSE loss, and a similar KL loss, on the test subset compared with the standard ELBO objective (1).
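The linear KL-annealing schedule described above can be sketched as (the 10-epoch horizon matches the training setup; the exact interpolation endpoints are an assumption):

```python
def kl_weight(epoch, n_epochs):
    """Linear KL 'annealing' schedule: the weight rises from 0 to 1 over
    training, so early epochs emphasize reconstruction before the KL term
    takes full effect."""
    return min(1.0, epoch / max(1, n_epochs - 1))

# Annealed objective per mini-batch, with mse and kl as in the ELBO:
#   loss = mse + kl_weight(epoch, n_epochs) * kl
schedule = [kl_weight(e, 10) for e in range(10)]
```

The weight reaches 1 by the final epoch, at which point the objective coincides with the standard ELBO.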

Dimensionality of the latent space

We tested the VAE with varying numbers of latent dimensions in order to determine the optimal setting empirically. As shown in Figure 1, the lowest MSE loss was obtained using 32 latent dimensions. However, with 128 latent dimensions we achieved better performance in terms of MSE loss for normalizing flows and cross-entropy loss for semi-supervised learning.

Planar Normalizing Flows

We employed planar normalizing flows (Eq. 3) with an amortization strategy similar to [22, 24], where the flow parameters are treated as functions of the input rather than as part of the global network parameters. As shown in Figure 1, with planar normalizing flows the average KL divergence loss is an order of magnitude lower than for the VAE, which can be attributed to the more flexible posterior distribution. However, the average MSE loss does not improve and is in fact worse than for the standard VAE.

Figure 1: Comparing VAEs with different numbers of latent dimensions (left) and planar normalizing flows with varying flow depth (right) using the MSE and KL losses on the test subset. The MSE loss is measured in arbitrary units equivalent to the number of photoelectrons produced at the PMT photocathode, and the KL loss is measured in nats.

Event reconstruction and synthetic sample generation

The comparison of simulated events and their reconstructions (Figure 2) shows that the VAE is able to capture important features of the Cherenkov ring, such as its position, shape, and size. However, the reconstructions show poor ring sharpness and a lack of the isolated channel charge deposits caused by PMT dark noise and scattered/reflected light. This is expected due to the limited capacity of the encoder and decoder, in addition to the mean-field approximation of the VAE. Images based on sampling from the prior (Figure 3) show similar deficiencies and occasional artifacts, most likely due to prior-posterior divergence.

Latent space interpolation

In order to perform interpolation in the latent space along the axis of some physically meaningful quantity, we used the k-nearest neighbors algorithm in the feature space of the event azimuthal angle and energy. The latent vectors of the 256 nearest neighbors to a reference point in the feature space were used to find the start and end positions in the latent space. We observed (Figure 4) that linear interpolation along the energy axis yields smooth transitions from one step to the next; however, interpolation along the azimuthal angle axis does not. This suggests that not all high-level features correspond linearly to directions in the latent space.
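The k-NN anchoring and linear interpolation procedure above can be sketched as follows; the random arrays stand in for the real (angle, energy) features and encoder latents:

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_latent_anchor(features, latents, query, k=256):
    """Average the latent vectors of the k nearest neighbours of `query`
    in the (angle, energy) feature space to get an interpolation anchor."""
    dists = np.linalg.norm(features - query, axis=1)
    idx = np.argsort(dists)[:k]
    return latents[idx].mean(axis=0)

def linear_interp(z_start, z_end, n_steps):
    """Linearly interpolate between two latent anchors."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

features = rng.uniform(size=(1000, 2))     # stand-ins for (angle, energy)
latents = rng.standard_normal((1000, 32))  # stand-ins for encoder outputs
z_lo = knn_latent_anchor(features, latents, np.array([0.1, 0.2]))
z_hi = knn_latent_anchor(features, latents, np.array([0.1, 0.8]))
path = linear_interp(z_lo, z_hi, n_steps=8)
```

Decoding each row of `path` yields an image sequence of the kind shown in Figure 4.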

Figure 2: Cherenkov ring images comparing actual simulated events (top) with their corresponding VAE reconstructed events (bottom).
Figure 3: Cherenkov ring images for events randomly sampled from the latent prior p(z) = N(0, I).
Figure 4: Linear interpolation in the latent space for events along the azimuthal angle axis (top) and the energy axis from 200 MeV to 800 MeV (bottom).

Semi-supervised learning

Inspired by the M1 strategy of [19], we pre-trained the VAE (128 latent dimensions) using approximately 7.2 million training examples and subsequently trained a 4-layer multilayer perceptron (MLP) that takes the deterministic output of the encoder (before the reparameterization) as input. We benchmarked this semi-supervised model against a fully supervised CNN with a similar architecture. The fully supervised CNN and the MLP classifier were trained for 10 epochs using the Adam optimizer with varying numbers of labelled examples. We found that in the low-labelled-data regime, the semi-supervised model consistently outperformed the fully supervised model and achieved state-of-the-art performance for the task of e versus μ event discrimination. As shown in Table 1, the SS-CNN model also outperforms the supervised model on e versus γ classification, considered unfeasible with existing likelihood-ratio approaches [6].

Table 1: Comparing the semi-supervised model performance to that of the fully supervised model for various labelled dataset sizes. Columns: number of training examples; background rejection at fixed signal efficiency, reported separately for the two background types.
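The M1-style pipeline, freezing a pre-trained encoder and training a small classifier on its deterministic features, can be illustrated with a toy numpy example; the random-projection "encoder", logistic-regression head, and synthetic two-blob data are all stand-ins for the real VAE encoder, MLP, and detector images:

```python
import numpy as np

rng = np.random.default_rng(3)

# Frozen "encoder": a fixed random projection plus nonlinearity, standing
# in for the pre-trained VAE encoder's deterministic output.
W_enc = rng.standard_normal((16, 8)) * 0.5

def encode(x):
    return np.tanh(x @ W_enc)

def train_logreg(feats, labels, lr=0.5, steps=200):
    """Train a logistic-regression head on frozen features by gradient descent."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(feats @ w, -30.0, 30.0)))
        w -= lr * feats.T @ (p - labels) / len(labels)
    return w

# Toy labelled set: two Gaussian blobs in input space.
x = np.vstack([rng.standard_normal((100, 16)) + 2.0,
               rng.standard_normal((100, 16)) - 2.0])
y = np.concatenate([np.ones(100), np.zeros(100)])
w = train_logreg(encode(x), y)
acc = np.mean((encode(x) @ w > 0) == (y == 1))
```

Only the small head is trained on labels, which is why the approach pays off when labelled data is scarce: the encoder's representation is learned from the much larger unlabelled set.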

4 Conclusion

We demonstrated the ability of VAEs and NFs to approximate the generative distribution of simulated water Cherenkov detector data, with NFs showing no significant improvement over the standard VAE. When used for classification, the semi-supervised approach demonstrated performance gains over a fully supervised model comparable to those demonstrated in other domains. Reconstruction and synthetic data generation are possible; however, the presence of artifacts suggests that an improved design of the loss function and/or the generative model is needed. Linear interpolation along the energy axis displays smooth behaviour, but interpolation along the azimuthal angle axis does not, suggesting that more sophisticated interpolation methods may be needed. Through this work, we show the promise of applying generative models in future work to address key challenges of neutrino oscillation experiments, such as e versus γ classification and the mitigation of experimental uncertainties.


The authors would like to thank Akira Konaka, Mark Hartz and Olivia Di Matteo for helpful discussions and feedback on a draft of this paper. This research was enabled in part by support provided by TRIUMF, NSERC and Compute Canada.


  • [1] M. Aartsen et al. (2018) Multimessenger observations of a flaring blazar coincident with high-energy neutrino IceCube-170922A. Science 361 (6398). External Links: ISSN 0036-8075 Cited by: §1.
  • [2] K. Abe et al. (2018) Hyper-Kamiokande design report. arXiv:1805.04163. Cited by: §1.
  • [3] K. Abe et al. (2017-11) Measurement of neutrino and antineutrino oscillations by the T2K experiment including a new additional sample of νe interactions at the far detector. Physical Review D 96, pp. 092006. Cited by: §1.
  • [4] K. Abe et al. (2018-10) Search for CP Violation in Neutrino and Antineutrino Oscillations by the T2K Experiment with 2.2×10²¹ Protons on Target. Physical Review Letters 121, pp. 171802. Cited by: §1.
  • [5] K. Abe et al. (2018-10) Search for CP Violation in Neutrino and Antineutrino Oscillations by the T2K Experiment with 2.2×10²¹ Protons on Target. Physical Review Letters 121, pp. 171802. Cited by: §1.
  • [6] K. Abe et al. (2019-06) Search for neutral-current induced single photon production at the ND280 near detector in T2K. Journal of Physics G: Nuclear and Particle Physics 46 (8), pp. 08LT01. Cited by: §1, §3.
  • [7] S. Agostinelli et al. (2003) GEANT4—a simulation toolkit. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 506 (3), pp. 250–303. Cited by: §2.1.
  • [8] Q. R. Ahmad et al. (2002-06) Direct Evidence for Neutrino Flavor Transformation from Neutral-Current Interactions in the Sudbury Neutrino Observatory. Physical Review Letters 89, pp. 011301. Cited by: §1.
  • [9] P. Antonioli et al. (2004-09) SNEWS: the SuperNova early warning system. New Journal of Physics 6, pp. 114–114. Cited by: §1.
  • [10] K. S. Babu et al. (2013) Working Group Report: Baryon Number Violation. In Proceedings, 2013 Community Summer Study on the Future of U.S. Particle Physics: Snowmass on the Mississippi (CSS2013): Minneapolis, MN, USA, July 29-August 6, 2013, External Links: 1311.5285 Cited by: §1.
  • [11] S. Bhadra et al. (2014) Letter of intent to construct a nuPRISM detector in the J-PARC neutrino beamline. arXiv:1412.3086. Cited by: §1.
  • [12] R. Brun and F. Rademakers (1997) ROOT—an object oriented data analysis framework. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 389 (1-2), pp. 81–86. Cited by: §2.1.
  • [13] S. Fukuda et al. (2003) The Super-Kamiokande detector. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 501 (2-3), pp. 418–462. Cited by: §1.
  • [14] Y. Fukuda et al. (1998-07) Evidence for oscillation of atmospheric neutrinos. Physical Review Letters 81 (8), pp. 1562. Cited by: §1.
  • [15] M. Fukugita and T. Yanagida (1986) Barygenesis without grand unification. Physics Letters B 174 (1), pp. 45 – 47. External Links: ISSN 0370-2693 Cited by: §1.
  • [16] K. S. Hirata et al. (1988-07) Observation in the Kamiokande-II detector of the neutrino burst from supernova SN1987A. Physical Review D 38, pp. 448–458. Cited by: §1.
  • [17] M. Jiang et al. (2019-05) Atmospheric neutrino oscillation analysis with improved event reconstruction in Super-Kamiokande IV. Progress of Theoretical and Experimental Physics 2019 (5). External Links: ISSN 2050-3911 Cited by: §1.
  • [18] D. P. Kingma and J. Ba (2014-12) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §3.
  • [19] D. P. Kingma et al. (2014) Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589. Cited by: §3.
  • [20] D. P. Kingma and M. Welling (2014) Stochastic gradient VB and the Variational Auto-Encoder. In Second International Conference on Learning Representations, ICLR, Cited by: §2.2.
  • [21] Y. LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.3.
  • [22] D. Rezende and S. Mohamed (2015) Variational Inference with Normalizing Flows. In International Conference on Machine Learning, pp. 1530–1538. Cited by: §2.2, §3, §3.
  • [23] A. D. Sakharov (1991-05) Violation of CP invariance, C asymmetry, and baryon asymmetry of the universe. Soviet Physics Uspekhi 34 (5), pp. 392–393. Cited by: §1.
  • [24] R. van den Berg et al. (2018-03) Sylvester Normalizing Flows for Variational Inference. arXiv:1803.05649. Cited by: §3, §3.