Muon Trigger for Mobile Phones

by   Maxim Borisyak, et al.
Higher School of Economics

The CRAYFIS experiment proposes to use privately owned mobile phones as a ground detector array for Ultra High Energy Cosmic Rays. Upon interacting with Earth's atmosphere, these cosmic rays produce extensive particle showers which can be detected by cameras on mobile phones. A typical shower contains minimally-ionizing particles such as muons. As these particles interact with CMOS image sensors, they may leave tracks of faintly-activated pixels that are sometimes hard to distinguish from random detector noise. Triggers that rely on the presence of very bright pixels within an image frame are not efficient in this case. We present a trigger algorithm based on Convolutional Neural Networks which selects images containing such tracks and is evaluated in a lazy manner: the response of each successive layer is computed only if the activation of the current layer satisfies a continuation criterion. The use of neural networks increases the sensitivity considerably compared with image thresholding, while the lazy evaluation allows the trigger to run within the limited computational power of mobile phones.



1 Introduction

The problem of pattern detection over a set of images arises in many applications. The CRAYFIS experiment is dedicated to observations of Ultra-High-Energy Cosmic Rays (UHECR) by a distributed network of mobile phones provided by volunteers. In the process of interaction with the Earth’s atmosphere, UHECRs produce cascades of particles called Extensive Air Showers (EAS). Some of the particles reach the ground, affecting areas of up to several hundreds of meters in radius. These particles can be detected by cameras on mobile phones, and a localized coincidence of particle detection by several phones can be used to observe very rare UHECR events [1].

This approach presents a number of challenges. In order to observe an EAS, each active smartphone needs to continuously monitor its camera output by scanning megapixel-scale images at rates of 15-60 frames per second. This generates a vast amount of raw data, which presents problems both for volunteers (for example, it may quickly exceed a smartphone’s storage capacity or introduce a considerable load on networks) and for experimenters if it is transmitted to data processing servers for later analysis. However, the recorded data consists almost entirely of random camera noise, as signals from cosmic ray interactions are expected to occur in fewer than 1 in 10,000 image frames. As there would be potentially millions of smartphones operating simultaneously, it is critical to utilize the local processing power available on each device to select only the most interesting data. Hence, a trigger algorithm is required to filter out background data and identify possible candidates for cosmic ray traces. It is also important that the camera monitoring is subject to negligible dead time; therefore any trigger must operate with an average evaluation time on the order of 30 ms to keep up with the raw data rate.

Some constituents of an EAS, such as electrons and gamma-ray photons, leave bright traces in the camera [1]. In this case, the simplest trigger strategy is a cut on brightness: if a fragment contains bright pixels, it is considered interesting. This is usually enough to provide an acceptable background rejection rate in the case of bright traces, and given a target background rejection rate it is possible to determine the decision threshold automatically. However, this strategy is much less effective against another component of the shower, comprising minimally-ionizing particles such as high-energy muons. These particles may leave relatively faint signals in the pixels they traverse, possibly at a level comparable to the sensor’s intrinsic noise.
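The automatic choice of a brightness threshold can be illustrated with a minimal sketch. It assumes that a sample of noise-only frames is available and that a frame fires the trigger when its brightest pixel exceeds the cut; the function names, the frame size, and the Gaussian noise model are hypothetical and are not taken from the CRAYFIS software.

```python
import numpy as np

def fit_brightness_threshold(background_frames, target_rejection):
    """Pick the cut so that about `target_rejection` of background frames fail it."""
    # Brightest pixel of each background-only frame.
    max_per_frame = background_frames.reshape(len(background_frames), -1).max(axis=1)
    # A frame fires when its brightest pixel exceeds the threshold, so the
    # threshold is the matching quantile of the per-frame maxima.
    return np.quantile(max_per_frame, target_rejection)

def brightness_trigger(frame, threshold):
    """Return True if any pixel in the frame is brighter than the threshold."""
    return bool((frame > threshold).any())

rng = np.random.default_rng(0)
# Hypothetical noise-only sample: 2000 frames of 16x16 pixels.
background = rng.normal(loc=10.0, scale=2.0, size=(2000, 16, 16))
threshold = fit_brightness_threshold(background, target_rejection=0.99)

fired = np.array([brightness_trigger(f, threshold) for f in background])
rejection = 1.0 - fired.mean()

# A frame with one very bright pixel passes the cut.
bright_frame = background[0].copy()
bright_frame[3, 3] = 255.0
bright_fires = brightness_trigger(bright_frame, threshold)
```

A cut fitted this way says nothing about faint tracks whose pixels stay below the threshold, which is the case the rest of the paper addresses.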

Nevertheless, these minimally-ionizing particles traverse the sensor’s pixels in distinctively straight lines. If these tracks are long enough in the plane of the sensor, there is a low probability of the same pattern emerging from intrinsic random camera noise. Thus it is still possible to discriminate even these faintly-interacting particles from background.

In this work, we propose a novel approach for fast visual pattern detection, realized here as a trigger designed for fast identification of muon traces in a mobile phone’s camera. The method is based on Convolutional Neural Networks and makes no assumptions specific to muon traces; hence, in principle, it can be applied to any visual pattern detection problem.

2 Related Work

Minimally-ionizing particles are characterized by the pattern of activated pixels they leave over a small region of an exposure. Hence the problem of minimally-ionizing particle detection can be transformed into the problem of pattern detection over the image. Several attempts have been made to solve the pattern detection problem in different setups. The solution proposed in [2] and [3] utilizes Convolutional Neural Networks (CNNs). Certain properties of CNNs, such as translation invariance, locality, and correspondingly few weights to be learned, make them particularly well suited to extracting features from images. The performance of even simple CNNs on image classification tasks has been shown to be very good compared to other methods [4]. However, the training and evaluation of CNNs require relatively intense computation, to which smartphones are not particularly well suited. Viola and Jones [5] introduced the idea of a simple feature-based cascading classifier. They proposed using small binary filters as feature detectors and increasing the computational power from cascade to cascade.

Bagherinezhad et al. [6] enumerated a wide range of methods which have been proposed to address efficient training and inference in deep neural networks. Although these methods are orthogonal to our approach, they may be combined with the method described here in order to improve efficiency in related tasks.

3 CNN trigger

The key insight of the proposed method is to view a Deep Convolutional Neural Network (CNN) as a chain of triggers, or cascades: each trigger filters its input stream of data before passing it further down the chain. The main feature of such chains is that the amount of data passing successfully through the chain gradually decreases, while the complexity of the triggers gradually increases, allowing finer selection with each subsequent trigger. This architecture allows one to effectively control the computational complexity of the overall chain and, usually, to substantially decrease the amount of computational resources required [5].

Convolutional Neural Networks are particularly well suited for adopting this approach because, instead of passing the image itself through the chain, the CNN computes a series of intermediate representations (activations of hidden layers) of the image [4]. Following the same reasoning as in Deeply Supervised Nets (DSN) [7], one can build a network for which the discriminative power of intermediate representations grows along the network, making it possible to treat such a CNN as a progressive trigger chain (in contrast to DSN, however, the growth of discriminative power is a requirement for an effective trigger chain rather than a means of network regularization).

In order to build a trigger chain from a CNN, we propose a method similar to DSN: each layer of the CNN is extended with a binary trigger based on the image representation obtained by this layer. In the present work we use logistic regression as the model for the trigger, although any binary classifier with a differentiable model would be sufficient.

The trigger is applied to each region of the output image to determine if that region should proceed further down the trigger chain. We call these layers with their corresponding triggers a convolutional cascade, by analogy with Viola-Jones cascades [5]. The output of the trigger at each stage produces what we refer to as an activation map, to be passed to the next layer, as illustrated in Fig. 1(a).

From another perspective, this approach can be seen as an extension of the CNN architecture in which network computations concerning regions of the input image are halted as soon as a negative outcome for a region becomes evident (lazy application can be viewed as a variation of attention mechanisms recently proposed in the Deep Learning literature, see e.g. [8]). This is effectively accomplished by generalizing the convolution operator to additionally accept as input an activation map indicating the regions to be computed. In the case where sparse regions of activation are expected, this lazy application can result in far fewer computations being performed.

After each application of the lazy convolution, the activation map for the subsequent layer is furnished by the trigger associated with the current layer. The whole chain is illustrated in Fig. 1(b).
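The lazy convolution and its trigger can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiment: pooling is omitted, one activation-map entry corresponds to one output position of a valid convolution, and the names `lazy_conv2d` and `logistic_trigger` as well as all array sizes are hypothetical.

```python
import numpy as np

def lazy_conv2d(image, kernels, active):
    """Valid cross-correlation evaluated only where `active` is True.

    image:   (H, W, C) array
    kernels: (L, k, k, C) array
    active:  (H - k + 1, W - k + 1) boolean activation map
    Inactive output positions are left at zero and cost nothing.
    """
    n_filters, k = kernels.shape[0], kernels.shape[1]
    out = np.zeros(active.shape + (n_filters,))
    for y, x in zip(*np.nonzero(active)):
        patch = image[y:y + k, x:x + k, :]
        # Elementwise product of each kernel with the patch, summed per filter.
        out[y, x] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out, int(active.sum())

def logistic_trigger(features, weights, bias, active, threshold=0.5):
    """Update the activation map: a region stays active only if its trigger fires."""
    prob = 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
    return active & (prob > threshold)

rng = np.random.default_rng(1)
image = rng.normal(size=(8, 8, 1))
kernels = rng.normal(size=(2, 3, 3, 1))

active = np.zeros((6, 6), dtype=bool)
active[2:4, 1:5] = True  # only 8 of 36 regions are still active

lazy_out, n_evaluated = lazy_conv2d(image, kernels, active)
dense_out, n_dense = lazy_conv2d(image, kernels, np.ones((6, 6), dtype=bool))
next_active = logistic_trigger(lazy_out, rng.normal(size=2), 0.0, active)
```

On active regions the lazy result matches the dense one, while the number of evaluated patches drops from 36 to 8. A region that has been switched off is never reactivated, because the new map is the logical AND of the trigger decision and the previous map.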

(a) convolutional cascade
(b) CNN trigger structure
Figure 1: Fig. 1(a) shows the building block of the CNN trigger, an individual convolutional cascade. In contrast to conventional convolutional layers, the convolutional cascade has an additional input, the activation map, which indicates regions to which the convolutional operator should be applied (lazy application, denoted by dashed lines). The activation map is updated by the associated trigger (represented by the s-shaped node) and may then be passed on to the subsequent cascade or interpreted as the final output indicating regions of interest. Fig. 1(b) shows the full structure of the CNN trigger as a sequence of convolutional cascades. Initially the whole image is activated (red areas). As the image proceeds through the chain, the activated area becomes smaller as each cascade refines the activation map.

Training of the CNN trigger may be problematic for gradient methods, since prediction is no longer a continuous function of network parameters. This is because the lazy convolution, described above, is in general non-differentiable.

In order to overcome this limitation, we propose to use a slightly different network architecture during training by substituting a differentiable approximation of the lazy convolution operator. The basic idea is that instead of skipping the evaluation of unlikely regions, we simply ensure that any region which has low activation on a given cascade will continue to have low activation on all subsequent cascades. In this scheme, the evaluation is no longer lazy, but since training may be performed on much more powerful hardware, this is not a concern.

To accomplish this, we first replace the binary activation maps (which result from the trigger classification) with continuous probability estimates. Secondly, we introduce intermediate activation maps, which are computed by the trigger function at each layer. The intermediate map is multiplied by the previous layer’s activation map to produce a refined activation map (if layers of the underlying CNN contain pooling, i.e. change the size of the image, pooling should be applied to the intermediate activation maps as well). In this way, the activation probability for any region is nonincreasing with each cascade layer. The process is depicted schematically in Fig. 2.

This differentiable version of the lazy application operation for the $n$-th cascade is described by the following equations:

$$I^{n} = f^{n}(I^{n-1}), \qquad (1)$$
$$\hat{A}^{n} = \sigma^{n}(I^{n}), \qquad (2)$$
$$A^{n}_{xy} = \hat{A}^{n}_{xy} \cdot A^{n-1}_{xy}, \qquad (3)$$
$$A^{0}_{xy} = 1, \qquad (4)$$

where $I^{n}$ is the intermediate representation of the input image after $n$ successive applications of CNN layers, and $f^{n}$ represents the transformation associated with the $n$-th layer of the CNN (typically this would be convolution, nonlinearity, and pooling). $\sigma^{n}$ is the function associated with the $n$-th trigger (in our case, logistic regression), and its result $\hat{A}^{n}$ is the intermediate activation map of the $n$-th cascade. Finally, $A^{n}$ is the differentiable version of the activation map, given by the element-wise product of the intermediate activation with the previous layer’s activation. The elements of the initial activation map $A^{0}$ are set to 1, and the subscripts $xy$ denote region position.

Note that the dimensions of the activation map define the granularity of regions-of-interest that may be triggered, and may in general be smaller than the dimensions of the input image. In this case, the trigger function $\sigma^{n}$ should incorporate some downsampling such as max-pooling.
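The chaining of intermediate activation maps by element-wise products can be written compactly. The sketch below assumes that all cascades share the same map size (no pooling) and takes the trigger outputs as precomputed logits; the function name and the random inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_activation_maps(trigger_logits):
    """Chain per-cascade trigger logits into differentiable activation maps."""
    activation = np.ones_like(trigger_logits[0])  # initial map: every region active
    maps = []
    for logits in trigger_logits:
        intermediate = sigmoid(logits)            # unconditional trigger response
        activation = intermediate * activation    # product with the previous map
        maps.append(activation)
    return maps

rng = np.random.default_rng(2)
# Hypothetical trigger logits for three cascades on a 4x4 grid of regions.
logits = [rng.normal(size=(4, 4)) for _ in range(3)]
maps = soft_activation_maps(logits)
```

Because every factor lies strictly between 0 and 1, a region that receives a low activation at one cascade keeps a low activation at all later ones, which mimics the effect of skipping it.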

Figure 2: Schematic of the CNN trigger used for training. To make the network differentiable, lazy application is replaced by an approximation which does not involve any “laziness”. Activation maps are approximated as the elementwise product of the unconditional trigger response (intermediate activation maps) and the previous activation map.

We also note that since similar but still technically different networks are used for training and prediction, special care should be taken while transitioning from probability estimations to binary classification. In particular, additional adjustment of classifier thresholds may be required. Nevertheless, in the present work no significant differences in the two networks’ behaviors were found (see Fig. 4).

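To train the network we utilize a cross-entropy loss. Since activation maps are also intermediate results, and the activation map of the last cascade, $A^{N}$, is the final result for a network of $N$ cascades, the loss can be written as:

$$\mathcal{L}_{ce} = -\frac{1}{W H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left[ Y_{xy} \log A^{N}_{xy} + \gamma \, (1 - Y_{xy}) \log\left(1 - A^{N}_{xy}\right) \right], \qquad (5)$$

where $Y$ denotes the ground truth map with width $W$ and height $H$. The truth map is defined with $Y_{xy} = 1$ if the region at coordinates $(x, y)$ contains a target object, otherwise $Y_{xy} = 0$. The coefficient $\gamma$ is introduced to provide control over the penalty for triggering on background.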
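A cross-entropy loss with a separately weighted background term can be sketched as follows. The sketch assumes a mean over all regions and uses `gamma` as a hypothetical name for the coefficient that controls the penalty for triggering on background; the toy maps are invented for illustration.

```python
import numpy as np

def weighted_cross_entropy(activation, truth, gamma=1.0, eps=1e-12):
    """Mean cross-entropy between the final activation map and the truth map."""
    a = np.clip(activation, eps, 1.0 - eps)   # avoid log(0)
    signal_term = truth * np.log(a)
    # gamma scales the penalty for activating background regions.
    background_term = gamma * (1.0 - truth) * np.log(1.0 - a)
    return -np.mean(signal_term + background_term)

truth = np.array([[1.0, 0.0],
                  [0.0, 0.0]])
activation = np.array([[0.9, 0.8],    # one false positive at (0, 1)
                       [0.1, 0.1]])

loss_plain = weighted_cross_entropy(activation, truth, gamma=1.0)
loss_strict = weighted_cross_entropy(activation, truth, gamma=2.0)
loss_perfect = weighted_cross_entropy(truth, truth)
```

Raising `gamma` makes the false positive more expensive relative to a missed signal region, which pushes the trained trigger toward higher background rejection.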

If the cross-entropy term (5) is the only component of the loss, the network will have no particular incentive to assign regions a low activation on early cascades, limiting the benefit of the lazy evaluation architecture. One approach to force the network to produce good results on intermediate layers is to directly introduce a penalty term for unnecessary computations:

$$\mathcal{L}_{cpu} = \sum_{n=1}^{N} c_{n} \sum_{x, y} A^{n-1}_{xy}, \qquad (6)$$

where $c_{n}$ represents the per-region cost of performing convolution and trigger in the $n$-th cascade, and $A^{n-1}$ is the activation map passed to that cascade.

We use a naive estimate of the coefficients $c_{n}$, assuming, for simplicity, that convolution is performed by elementwise multiplication of the convolutional kernel with a corresponding image patch. In this case, for $l$ filters of size $k \times k$ applied to an image with $m$ channels:

$$c = l \, k^{2} \, m. \qquad (7)$$

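Combining these terms, the resulting total loss function is given by:

$$\mathcal{L} = \mathcal{L}_{ce} + \beta \, \mathcal{L}_{cpu}, \qquad (8)$$

where $\mathcal{L}_{ce}$ is the cross-entropy term (5), $\mathcal{L}_{cpu}$ is the computational penalty (6), and the parameter $\beta$ is introduced to regulate the trade-off between computational efficiency and classification quality.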
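The computational penalty and its combination with the classification loss can be illustrated with a small sketch. It assumes the naive per-region cost of one multiplication per kernel element and a penalty equal to the cost-weighted sum of the activation maps entering each cascade; the function names, the filter sizes, and the value of `beta` are hypothetical.

```python
import numpy as np

def cascade_cost(n_filters, kernel_size, in_channels):
    """Naive per-region cost: one multiplication per kernel element."""
    return n_filters * kernel_size ** 2 * in_channels

def computation_penalty(costs, input_maps):
    """Expected work: each cascade's cost times the activation reaching it."""
    return sum(c * a.sum() for c, a in zip(costs, input_maps))

def total_loss(cross_entropy, costs, input_maps, beta):
    """Classification loss plus beta times the computational penalty."""
    return cross_entropy + beta * computation_penalty(costs, input_maps)

# Two hypothetical cascades: a 1x1 filter followed by a 3x3 filter.
costs = [cascade_cost(1, 1, 1), cascade_cost(1, 3, 1)]   # 1 and 9 multiplications
# Maps entering each cascade: everything active, then 25% activation everywhere.
input_maps = [np.ones((4, 4)), np.full((4, 4), 0.25)]

penalty = computation_penalty(costs, input_maps)
loss = total_loss(0.5, costs, input_maps, beta=0.01)
```

In this toy case the cheap first cascade contributes 16 operations and the second contributes 36, so lowering the activation that reaches the expensive cascade is what reduces the penalty most.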

Another approach is to apply a DSN technique:

$$\mathcal{L} = \sum_{n=1}^{N} \alpha_{n} \, \mathcal{L}^{n}. \qquad (9)$$

Here, $\mathcal{L}^{n}$ is the loss associated with the $n$-th cascade (i.e. the companion loss in DSN terminology), defined by analogy with (5). The coefficients $\alpha_{n}$ regulate the trade-off between losses on different cascades (one may find … to be a relatively good heuristic).


In the present work, we find that the objectives defined by (8) and (9) are highly correlated. However, (9) seems to propagate gradients more effectively, resulting in faster training.

4 Experiments

4.1 Dataset

As of this writing, no labeled dataset of CMOS images containing true muon tracks is available (to obtain real data and fully validate the performance of the algorithm, an experimental setup with muon scintillators is scheduled this year). Instead, an artificial dataset was constructed with properties similar to those expected from real data, in order to assess the CNN trigger concept.

(a) original traces
(b) composition
Figure 3: Test dataset creation steps: 3(a) bright photon tracks are selected; 3(b) the track brightness is lowered and the track is superimposed on a noisy background.

To construct the artificial dataset, images were taken from a real mobile phone exposed to radioactive Ra, an X-ray photon source. These photons interact in the sensor primarily via Compton scattering, producing energetic electrons which leave tracks as seen in Fig. 3(a). These tracks are similar to those expected from muons, the main difference being that the electron tracks tend to be much brighter than the background noise, rendering the classification problem almost trivial.

Therefore, the selected particle tracks are renormalized such that their average brightness is approximately at the level of the camera’s intrinsic noise. The dimmed traces are then superimposed on the background, with Gaussian noise added to model the intrinsic camera sensor noise. An example of the resulting artificial data is shown in Fig. 3(b). After these measures, the dataset better emulates the case of low-brightness muons, and also forces the classifier to use more sophisticated (geometric) features for classification.
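The two construction steps, dimming a bright track and superimposing it on noise, can be sketched as follows. This is a minimal illustration under stated assumptions: the track is a synthetic diagonal line rather than a recorded electron track, and the noise width, the pedestal level, and the function names are all hypothetical.

```python
import numpy as np

def dim_track(track, target_level):
    """Rescale a bright track so its mean brightness on track pixels is target_level."""
    mask = track > 0
    dimmed = np.zeros(track.shape, dtype=float)
    dimmed[mask] = track[mask] * (target_level / track[mask].mean())
    return dimmed, mask

def compose(dimmed, noise_sigma, pedestal, rng):
    """Superimpose the dimmed track on Gaussian noise around a constant pedestal."""
    noise = rng.normal(loc=pedestal, scale=noise_sigma, size=dimmed.shape)
    return dimmed + noise

rng = np.random.default_rng(3)
# Stand-in for a selected bright track: a diagonal line on a 16x16 patch.
track = np.zeros((16, 16))
track[np.arange(16), np.arange(16)] = rng.uniform(150.0, 250.0, size=16)

noise_sigma = 2.0
dimmed, truth_mask = dim_track(track, target_level=noise_sigma)
sample = compose(dimmed, noise_sigma, pedestal=10.0, rng=rng)
```

The boolean mask returned by `dim_track` can serve as the ground truth map for training, since it records which pixels carried the original track before noise was added.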

(a) intermediate activation maps and ground truth map
(b) activation maps and ground truth map
(c) binary activation maps and ground truth map
Figure 4: Evaluation of the trigger CNN (using the input image from Fig. 3(b)). Figs. 4(a) and 4(b) show activation maps for the training regime; Fig. 4(c) shows binary activation maps for the application regime. The resolution of the map is reduced after each cascade to match the downsampling of the internal image representation.

4.2 CNN trigger evaluation

To evaluate the performance of the method, we consider the case of a CNN trigger with 4 cascades. The first cascade has a single filter of size , equivalent to simple thresholding. The second, third, and fourth cascades have 1, 3, and 6 filters of size , respectively. Within each cascade, convolutions are followed by max-pooling.

Due to the simple structure of the first cascade, its coefficient from (6) is set to .

As motivated in Sec. 1, a successful trigger must run rapidly on hardware with limited abilities. One of the simplest algorithms that satisfies this restriction, thresholding by brightness, is chosen as the baseline for comparison (in order to obtain comparable results, the output of thresholding was max-pooled to match the size of the CNN trigger output). This strategy yields a background rejection rate of around (mainly due to assumptions built into the dataset) with perfect signal efficiency.

Two versions of the CNN trigger with average computational costs (as estimated by (6)) of and operations per pixel were trained. This computational cost is controlled by varying the coefficients in the loss function (9). For each, signal efficiency and background rejection rates at three working points are presented in Table 1. Fig. 4 shows some typical examples of activation maps for different network regimes.

complexity              op. per pixel                op. per pixel
signal efficiency       0.90    0.95    0.99         0.90    0.95    0.99
background rejection    0.60    0.39    0.12         0.65    0.44    0.15
Table 1: CNN trigger performance for two models with computational costs and operations per pixel. Different points for signal efficiency and background rejection were obtained by varying threshold on output of the CNN trigger (i.e. activation map of the last cascade).

These results indicate a significant improvement of background rejection rate relative to the baseline strategy, even for nearly perfect signal efficiency.
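Working points such as those in Table 1 are obtained by scanning the threshold on the trigger output. The sketch below shows one way to do this, assuming per-region scores are available for signal and background; the scores are drawn from arbitrary beta distributions for illustration and do not reproduce the numbers in the table.

```python
import numpy as np

def working_point(signal_scores, background_scores, target_efficiency):
    """Threshold that keeps `target_efficiency` of signal, and the resulting rates."""
    # The fraction (1 - target_efficiency) of signal scores falls below this cut.
    threshold = np.quantile(signal_scores, 1.0 - target_efficiency)
    efficiency = np.mean(signal_scores >= threshold)
    rejection = np.mean(background_scores < threshold)
    return threshold, efficiency, rejection

rng = np.random.default_rng(4)
# Hypothetical trigger outputs for signal and background regions.
signal_scores = rng.beta(5.0, 2.0, size=1000)
background_scores = rng.beta(2.0, 5.0, size=1000)

targets = (0.90, 0.95, 0.99)
points = [working_point(signal_scores, background_scores, t) for t in targets]
```

Demanding a higher signal efficiency lowers the threshold, so the background rejection can only stay the same or decrease, which is the trend visible across the three columns of each model.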

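Another interesting performance metric is the normalized computational complexity:

$$\hat{C} = \frac{\sum_{n=1}^{N} c_{n} \sum_{x, y} A^{n-1}_{xy}}{\sum_{n=1}^{N} c_{n} \sum_{x, y} 1},$$

where $c_{n}$ is the per-region cost of the $n$-th cascade and the inner sums run over the regions of its binary input activation map $A^{n-1}$.

For the models described above, the normalized computational complexity, $\hat{C}$, is around 4-5 percent, which indicates that a significant amount of computational resources is saved due to lazy application, as compared to a conventional CNN with the same structure.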
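The ratio of lazy to conventional cost can be computed directly from binary activation maps. The sketch assumes the metric is the cost actually spent divided by the cost of evaluating every region of every cascade; the per-region costs and the toy maps are hypothetical and are not meant to reproduce the 4-5 percent figure.

```python
import numpy as np

def normalized_complexity(costs, binary_input_maps):
    """Cost of lazy evaluation relative to evaluating every region of every cascade."""
    lazy = sum(c * m.sum() for c, m in zip(costs, binary_input_maps))
    full = sum(c * m.size for c, m in zip(costs, binary_input_maps))
    return lazy / full

# Hypothetical per-region costs of three cascades.
costs = [1, 9, 27]

# Binary maps entering each cascade; the resolution halves after each pooling step.
map0 = np.ones((8, 8), dtype=bool)    # whole image active
map1 = np.zeros((4, 4), dtype=bool)
map1[:2, :2] = True                   # 4 of 16 regions survive the first cascade
map2 = np.zeros((2, 2), dtype=bool)
map2[0, 0] = True                     # 1 of 4 regions survives the second cascade

ratio = normalized_complexity(costs, [map0, map1, map2])
```

In this toy configuration the lazy chain spends 127 of the 316 operations a conventional network would need. The saving grows when early cascades reject most regions, as expected for rare signals.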

5 Conclusion

We have introduced a novel approach to construct a CNN trigger for fast visual pattern detection of rare events, designed particularly for the use case of fast identification of muon tracks on mobile phone cameras. Nevertheless, the proposed method does not contain any application-specific assumptions and can be, in principle, applied to a wide range of problems.

The method extends Convolutional Neural Networks by introducing lazy application of convolutional operators, which can achieve comparable performance with lower computational costs. The CNN trigger was evaluated on an artificial dataset with properties similar to those expected from real data. Our results show significant improvement of background rejection rate relative to a simple baseline strategy with nearly perfect signal efficiency, while the per-pixel computational cost of the algorithm is increased by less than a factor of .

The effective computational cost is equivalent to 4-5 percent of the cost required by a conventional CNN of the same size. Therefore the method can enable the evaluation of powerful CNNs in instances where time and resources are limited, or where the network is very large. This is a promising result for CNNs in many other possible applications, such as very fast triggering with radiation-hard electronics, or power-efficient realtime processing of high resolution sensors.



  • [1] Whiteson D, Mulhearn M, Shimmin C, Cranmer K, Brodie K and Burns D 2016 Astroparticle Physics 79 1–9
  • [2] Li H, Lin Z, Shen X, Brandt J and Hua G 2015 A convolutional neural network cascade for face detection Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp 5325–5334
  • [3] Ren S, He K, Girshick R and Sun J 2015 Faster r-cnn: Towards real-time object detection with region proposal networks Advances in neural information processing systems pp 91–99
  • [4] LeCun Y and Bengio Y 1995 The handbook of brain theory and neural networks 3361 1995
  • [5] Viola P and Jones M 2001 Rapid object detection using a boosted cascade of simple features Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on vol 1 (IEEE) pp I–511
  • [6] Bagherinezhad H, Rastegari M and Farhadi A 2016 arXiv preprint arXiv:1611.06473
  • [7] Lee C Y, Xie S, Gallagher P, Zhang Z and Tu Z 2015 Deeply-supervised nets. AISTATS vol 2 p 6
  • [8] Xu K, Ba J, Kiros R, Cho K, Courville A C, Salakhutdinov R, Zemel R S and Bengio Y 2015 Show, attend and tell: Neural image caption generation with visual attention. ICML vol 14 pp 77–81