Multi-Scale Distributed Representation for Deep Learning and its Application to b-Jet Tagging

11/29/2018 ∙ by Jason Lee, et al. ∙ University of Seoul

Recently machine learning algorithms based on deep layered artificial neural networks (DNNs) have been applied to a wide variety of high energy physics problems such as jet tagging or event classification. We explore a simple but effective preprocessing step which transforms each real-valued observational quantity or input feature into a binary number with a fixed number of digits. Each binary digit represents the quantity or magnitude in different scales. We have shown that this approach improves the performance of DNNs significantly for some specific tasks without any further complication in feature engineering. We apply this multi-scale distributed binary representation to deep learning on b-jet tagging using daughter particles' momenta and vertex information.




1 Introduction

Recently, deep-layered artificial neural networks (DNNs) have been successfully applied to a wide range of physics problems, from phase transitions in statistical mechanics [1, 2] to quark/gluon jet discrimination in high energy physics [3]. However, there are no comprehensive rules or principles for selecting a network architecture or input features for a given task. The so-called hyper-parameters, which determine the architecture of the network and the overall learning process, are therefore tuned by painstaking trial and error for a given set of input features. The resulting parameters are believed to be optimal, but they are too specific to carry over to other similar tasks, so the trial and error often has to be repeated.

However, if we use good input features or representations for DNNs, we can reduce these repetitions, since good input features help DNNs learn good internal representations [4, 5] with less dependence on the detailed architecture. Our investigation stems from the following simple question: what happens if one uses multiple variables with a smaller dynamic range instead of one variable with a large dynamic range as an input feature? We tested this with a simple model. We designed a network to predict the sign of the sum of the spin states on a block within a lattice of random spins, given the coordinates of the center of the block, which is restricted to the inner lattice as shown in Fig. 1. The network had to find out what the uniformly distributed input variables for the coordinates of the block center meant and how to process them to perform the task, as this was not coded in explicitly. In this setting, when we provided the network with a binary representation of the site coordinates instead of their original values, the network learned more quickly and its training and test errors were lower, as shown in Fig. 1b. We attribute this to the increased sparsity that results from transforming the input values from a large base to a smaller base (see Fig. 8 for the related discussion), which yields a distributed representation [4, 5] implemented here simply with a number of binary digits for each input value.
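As a rough illustration of the binary input used in the toy model, the sketch below expands a lattice coordinate into 6 binary digits, most significant digit first, so that the leading digits encode the coarsest scale. The function name `to_binary_digits`, the 64 x 64 lattice size and the coordinate values are assumptions made for this example only.

```python
def to_binary_digits(value, n_digits):
    """Return the n_digits-bit binary expansion of a non-negative integer,
    most significant digit first."""
    if not 0 <= value < 2 ** n_digits:
        raise ValueError("value does not fit in n_digits binary digits")
    return [(value >> shift) & 1 for shift in range(n_digits - 1, -1, -1)]


# Hypothetical block-center coordinate on a 64 x 64 lattice (assumed size).
x, y = 37, 5
features = to_binary_digits(x, 6) + to_binary_digits(y, 6)
print(features)  # [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
```

Each coordinate then enters the network as six inputs with a dynamic range of one bit each, instead of one input spanning the whole lattice.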

Inspired by these findings, we transformed each real-valued feature of the jet constituents into a fixed number of binary digits and used these as the input features of deep neural networks for b-jet tagging [6], to test whether this simple preprocessing can improve the performance of the networks on a harder problem. The multi-scale distributed (MSD) representation could be implemented in various ways, but the representation with binary digits is the simplest and most convenient one. We therefore studied only this binary implementation, which we call the MSD digit representation from now on.

In section 2, we briefly describe the data set used in this work and then the preprocessing of the jet data and the MSD digit representation. Section 3 covers the network architectures and their learning processes, and section 4 presents the results, followed by our concluding remarks.

Figure 1: Multi-scale representation of the site position distributed over 6 binary digits: the leftmost digits represent the region at the largest scale.

2 Jet Data and MSD -digit Representation

To compare the MSD representation with the commonly used real-valued one on b-jet tagging, we generated events in collisions at TeV using the next-to-leading-order POWHEG [9] event generator matched to PYTHIA8 [7] for hadronization. Jets with transverse momentum GeV and pseudorapidity with at least two daughter constituents are selected. We used FastJet [8] with the anti-kT algorithm [11] with a jet radius for jet finding and clustering, and Delphes [10] for fast detector simulation. We did not separate the samples by range of jet for a quick and simple test of the MSD representation.

We used only jets initiated by light partons such as up, down and strange quarks and gluons (light-jets) as background. The numbers of b-jets and light-jets used in this study are the same, each with samples. The total jet sample is divided into a training set () of samples, an early validation set () of samples used for the early stop applied collectively to all member networks, and a test set () of samples for the final performance evaluation.

Figure 2: Mean and standard deviation (std) plots for the jet constituent features, along the constituent index, before (upper two rows) and after (lower two rows) normalization.
Figure 3: Mean and standard deviation plots for MSD -digits on of jet constituent in simulated samples: each horizontal axis represents constituent particle whereas each vertical axis is for mean or standard deviation (std).

We used each jet constituent’s relative transverse momentum ratio and its vertex positions as input features (,,,). Jets differ in their number of constituents. Because we used fully connected feed-forward networks, which take a fixed number of input variables, we truncated the sequence of jet constituent variables at the th constituent. If the number of jet constituents is less than , we set the remaining values to zero. So, represents the following data structure,

where the upper index represents the order of the jet constituent under some rule, e.g. ordering by .
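The truncation and zero-padding step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `pad_or_truncate`, the choice of four features per constituent (read off the feature list above) and all numerical values are assumptions.

```python
def pad_or_truncate(constituents, max_constituents, n_features=4):
    """Keep the first max_constituents rows and zero-pad shorter jets.

    `constituents` is a list of per-particle feature lists, assumed to be
    already ordered by the chosen rule.
    """
    kept = [list(row) for row in constituents[:max_constituents]]
    padding = [[0.0] * n_features for _ in range(max_constituents - len(kept))]
    return kept + padding


# Toy jet with 2 constituents and 4 features each (values are made up).
jet = [[0.6, 0.01, -0.02, 0.10], [0.3, 0.00, 0.01, -0.05]]
fixed = pad_or_truncate(jet, max_constituents=3)
flat = [value for row in fixed for value in row]  # flat input vector for the FCN
print(len(flat))  # 12
```

Flattening the padded array gives every jet the same input length, which is what a fully connected network requires.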

Before transforming the sequence of jet constituents variables to MSD -digit representation, zero-centering and normalization processes are performed to adjust the dynamic ranges of each component. This data-oriented processed representation will be called real-valued representation and it is defined as follows:


Fig. 2 shows distributions of ratio of constituent’s to jet , constituent , and before and after the normalization process (1).
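A minimal sketch of the zero-centering and normalization step, assuming it subtracts the per-component mean and divides by the per-component standard deviation, both estimated on the training set. The function names, the small `eps` guard and the toy numbers are illustrative assumptions.

```python
import statistics


def fit_normalization(samples):
    """Per-component mean and standard deviation over a training set.

    `samples` is a list of equal-length flat feature vectors.
    """
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def normalize(vector, means, stds, eps=1e-12):
    """Zero-center and scale each component; eps guards constant columns."""
    return [(v - m) / (s + eps) for v, m, s in zip(vector, means, stds)]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # made-up values
means, stds = fit_normalization(train)
scaled = normalize([2.0, 30.0], means, stds)
print(scaled)
```

The same `means` and `stds` would then be reused for the validation and test sets, so that all three are on a common scale.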

Transforming the real-valued representation, , to MSD -digit representation can be simply performed by converting decimal numbers to binary digits with signed magnitude representation after clipping and rounding process as follows:


and then the resulting decimal number is represented as a signed binary


where and . The transformation increases the number of features to . While the optimal interval or resolution may need a systematic analysis, we simply set it by , with for the real-valued representation. Fig. 3 shows the means and standard deviations of the vertex position, , for each of the digits in the MSD -digit representation. The first digit, , is the sign bit; it is followed by for the most significant figure, down to the last digit . Together these form the so-called signed MSD -digit representation (see footnote 1). The horizontal axis represents the jet constituent index. The plots of the mean values are smooth for the most significant binary digits and become more discontinuous as the digit significance decreases. This discontinuity is due to the sharp peaks near zero in the distribution of the simulated , as seen in Fig. 4.

Footnote 1: Let x be an n-digit binary number. Its signed (n+1)-digit representation is defined as follows: for negative x, one has the binary representation of 2^n + |x|, while for non-negative x, one simply adds an extra 0 in front of x to get the (n+1)-digit representation. For its two’s complement representation, used below, one has the (n+1)-digit binary representation of 2^(n+1) - |x| for negative x, whereas the definition of the (n+1)-digit representation remains the same for non-negative x.
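The clipping, rounding and signed-magnitude steps can be sketched in a few lines. This is an illustration under stated assumptions rather than the exact transformation of Eqs. (2) and (3): the function name `msd_digits`, the parameter names `n_digits` and `resolution`, and the example values are all hypothetical.

```python
def msd_digits(x, n_digits, resolution):
    """Signed-magnitude binary digits of a normalized real value.

    The first digit is the sign bit (1 for negative); the remaining
    n_digits - 1 digits hold the magnitude, most significant first.
    """
    max_steps = 2 ** (n_digits - 1) - 1             # largest representable magnitude
    limit = max_steps * resolution
    clipped = max(-limit, min(limit, x))            # clip to the representable range
    steps = round(clipped / resolution)             # round to an integer step count
    steps = max(-max_steps, min(max_steps, steps))  # guard against float round-off
    sign_bit = 1 if steps < 0 else 0
    magnitude = abs(steps)
    bits = [(magnitude >> s) & 1 for s in range(n_digits - 2, -1, -1)]
    return [sign_bit] + bits


print(msd_digits(-0.3, n_digits=4, resolution=0.1))  # [1, 0, 1, 1]
print(msd_digits(5.0, n_digits=4, resolution=0.1))   # [0, 1, 1, 1] (clipped)
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer; a different tie-breaking rule would only change values that sit exactly on a half step.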

Figure 4: The distribution of jet constituents’ in simulated samples: (top) log-scale vertex position significance and (bottom) digitized vertex position with interval near zero. The plots on the left are for b-jets and those on the right are for light-jets.

3 Network Architectures and Learning Processes

In this study, we used a fully connected neural network (FCN) for b-jet tagging. The schematic architecture of this network is shown in Fig. 5. We used the rectified linear unit (ReLU) activation function [12] for all units in each hidden layer, indexed by the superscript , and the function for the two output units and . The categorical cross-entropy is used as the cost function for the training.
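For orientation, a forward pass through such a network can be sketched in plain Python. The sketch assumes a softmax on the two output units, which is the usual companion of the categorical cross-entropy but is an assumption here; the layer widths, weight range and input vector are toy values, not the ones used in the paper.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j][i] links input i to unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """ReLU on every hidden layer, softmax on the final two-unit layer."""
    hidden = inputs
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))


def cross_entropy(probabilities, label):
    """Categorical cross-entropy for a single example with an integer label."""
    return -math.log(probabilities[label])


rng = random.Random(0)
sizes = [8, 5, 5, 2]  # toy layer widths for illustration only
layers = [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
           [0.0] * n_out)
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
probs = forward([1, 0, 1, 1, 0, 0, 1, 0], layers)
loss = cross_entropy(probs, label=1)
print(probs, loss)
```

The binary input vector stands in for a few MSD digits; the real-valued variant would feed floating-point features through the same layers.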

Figure 5: Schematic architecture of fully connected neural network (FCN) used for b-jet tagging with jet constituents variable in real-valued or MSD -digit representation. Blue lines with arrow connecting units of lower and higher layers represent the direction of information flow in which a signal from one unit in a layer enters into all units in its next layer.

Below are input units for each of the real-valued jet constituent features and its corresponding MSD -digit representation.


where binary position index and for the real-valued input feature. We prepared both “Deep” and “Not so deep” versions for each input representation to check the stability of our result.

Neural networks were trained using the Theano Python library [13] on GPUs using the NVIDIA CUDA platform. The connection weights of the networks were initialized with the Xavier initialization scheme [14]. The Adam [15] algorithm was used to update the weights up to each early-stop epoch specified in Table 1, using a batch size of 1000, which corresponds to 100 iterations per epoch in the training stage.
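The sketch below illustrates the uniform variant of Xavier (Glorot) initialization and the batch bookkeeping quoted above. Whether the uniform or the normal variant was used is not stated, so the uniform form, the layer sizes and the seed are assumptions for illustration.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng):
    """Glorot/Xavier uniform initialization for an n_out x n_in weight matrix."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]


rng = random.Random(42)  # a different seed gives a different starting point
weights = xavier_uniform(n_in=64, n_out=32, rng=rng)

# Batch bookkeeping stated in the text: 1000 samples per batch and
# 100 iterations per epoch imply 100,000 training samples per epoch.
batch_size, iterations_per_epoch = 1000, 100
samples_per_epoch = batch_size * iterations_per_epoch
print(samples_per_epoch)  # 100000
```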

Figure 6: Learning progress (mean of training and validation errors by member networks of the ensemble in the try-out phase, with ensemble size ) along iterations. Two curves showing higher error rates at early epochs are for the real-valued representation on each “Not so deep” and “Deep” plots, while the lower two curves are for the MSD representation. Curves on training error are lower than those of validation for each representation.

Hyper-Parameters                      “Not so deep”   “Deep”
, activation                          3,  ReLU        6,  ReLU
Early stop epoch for MSD-16           30              25
Early stop epoch for Real-valued      25              25
Ensemble Size                         100             100
penalty parameter for MSD-16          0.001           0.001
penalty parameter for Real-valued     0.0001          0.0001

Table 1: Hyper-parameters on regularization and architecture are presented in this table. While the number of input units of network for real-valued representation is , the number of input units of network for MSD-16 is .

For regularization, we used the simple ensemble voting method [16], in which the network ensemble is composed of networks trained with different random seeds for their initial connection weights and for the random sequence of mini-batches, up to epoch , as follows,


and the ensemble vote was defined as


where is the network model, ’s represent the connection weights and biases updated up to epoch , with different initialization and mini-batch sequence given by the random seed with index , and is the size of the ensemble of networks. We also used a penalty [17] and the early-stop method together with the ensemble method mentioned above. The early-stop epoch was determined by inspecting the learning progress plot obtained from try-out training up to epochs (Fig. 6), to find the epoch at which the mean of the validation errors of the member networks (try-out ensemble size ) stagnates, with the ensemble mean error at epoch defined as below,


where represents the prediction error rate of member network of the ensemble for data set at epoch . Table 1 summarizes the architectures and the regularization parameters.
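The two ensemble quantities used here can be sketched as follows, assuming that the vote is the plain average of the member networks' output vectors and that the ensemble mean error is the arithmetic mean of the members' error rates. Both definitions, the function names and the numbers are assumptions for illustration.

```python
def ensemble_vote(member_outputs):
    """Average the class probabilities predicted by the member networks.

    member_outputs[k] is the output vector of member k for one jet.
    """
    n_members = len(member_outputs)
    n_classes = len(member_outputs[0])
    return [sum(out[c] for out in member_outputs) / n_members
            for c in range(n_classes)]


def mean_error(member_error_rates):
    """Mean of the per-member error rates at a given epoch."""
    return sum(member_error_rates) / len(member_error_rates)


outputs = [[0.8, 0.2], [0.6, 0.4], [0.4, 0.6]]  # made-up member outputs
vote = ensemble_vote(outputs)
print(vote)
print(mean_error([0.21, 0.19, 0.20]))
```

Tracking `mean_error` on the validation set epoch by epoch gives the kind of curve from which a stagnation point, and hence an early-stop epoch, can be read off.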

Figure 7: Performance curves obtained from the output of the ensemble vote: the horizontal axis represents the b-jet efficiency and the vertical axis represents the probability with which the network ensemble incorrectly identifies a light-jet as a b-jet. (a) Comparison of the MSD -digit and the real-valued representations. The curves split into two groups: the upper two lines are for the real-valued representation and the lower two lines for the MSD -digit representation. (b) Comparison of the MSD -digit, two’s complement binary -digit, and real-valued representations. The performance of the two’s complement representation is rather similar to that of the real-valued representation.

4 Results

We obtained performance curves by extracting outputs of the networks, shown in Fig. 7. The MSD method shows significant improvement on both “Deep” and “Not so deep” networks compared to the real-valued method. On the other hand, only a marginal improvement was observed when one uses two’s complement binary representation over the real-valued method. Two’s complement representation showed lower performance than the signed MSD representation as demonstrated in Fig. 7b.
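A performance curve of this kind can be computed by scanning a threshold over the ensemble output. The sketch below assumes that a jet is tagged as a b-jet when its score is at or above the threshold; the function name and the scores are made up for illustration.

```python
def performance_curve(b_scores, light_scores, thresholds):
    """Return (b-jet efficiency, light-jet mistag rate) pairs.

    A jet is tagged as a b-jet when its score is >= the threshold.
    """
    curve = []
    for t in thresholds:
        efficiency = sum(s >= t for s in b_scores) / len(b_scores)
        mistag = sum(s >= t for s in light_scores) / len(light_scores)
        curve.append((efficiency, mistag))
    return curve


# Made-up ensemble scores for illustration only.
b_scores = [0.9, 0.8, 0.7, 0.4]
light_scores = [0.6, 0.3, 0.2, 0.1]
curve = performance_curve(b_scores, light_scores, [0.25, 0.5, 0.75])
print(curve)  # [(1.0, 0.5), (0.75, 0.25), (0.5, 0.0)]
```

A lower curve at the same b-jet efficiency means fewer light-jets are misidentified, which is the sense in which one representation outperforms another.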

A few comments on the sparsity of our representation are in order. We found that the MSD representation is sparser than the real-valued representation. This is demonstrated in Fig. 8, where the averages of the normalized activations are plotted. In Fig. 8a, the blue line is for the MSD representation in the “Not so deep” network. It clearly shows that the number of deactivated units for the MSD representation is much larger than that for the real-valued representation (see the green line). Hence we conclude that the MSD representation is sparser, which partially explains why it performs better in this task. Fig. 8b is for the “Deep” networks and shows essentially the same trend.

Figure 8: Effect on sparsity: the averages of normalized activation of hidden units in the first layer of “Not so deep” (a) and “Deep” (b) networks for each real-valued and MSD 16-digit representation to the problem of b-jet tagging, where is the sorted activation from the smallest to the largest as .
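One possible way to compute such a sparsity profile is sketched below. The exact normalization behind Fig. 8 is not recoverable from the text, so this version assumes that each sample's first-layer activations are divided by that sample's maximum, sorted from smallest to largest, and averaged rank by rank over samples. The function name and values are hypothetical.

```python
def sorted_normalized_activations(activations):
    """activations[s][u] is the ReLU output of hidden unit u for sample s.

    Each sample's activations are divided by that sample's maximum, sorted
    from smallest to largest, and then averaged over samples rank by rank.
    """
    n_units = len(activations[0])
    totals = [0.0] * n_units
    for sample in activations:
        peak = max(sample) or 1.0  # avoid dividing by zero for all-zero samples
        for rank, value in enumerate(sorted(v / peak for v in sample)):
            totals[rank] += value
    return [t / len(activations) for t in totals]


acts = [[0.0, 2.0, 0.0, 1.0], [0.0, 0.0, 4.0, 0.0]]  # made-up values
profile = sorted_normalized_activations(acts)
print(profile)  # [0.0, 0.0, 0.25, 1.0]
```

Under this definition, a sparser representation shows a longer run of values near zero at the low-rank end of the profile.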

5 Conclusions

We demonstrated that a simple transformation of each real-valued input feature into the MSD digit representation can lead to a large improvement in the performance of fully connected networks for b-jet tagging, without any additional domain-specific feature engineering.

In typical network optimization, one has to examine a broad range of hyper-parameters, such as the network depth, the number of units per layer and the regularization parameters. We have shown that our results from two groups of networks, “Deep” and “Not so deep”, are not sensitive to such parameters.

Compared with a typical binary transformation, our MSD conversion described in Eqs. (2) and (3) from the real-valued feature to signed binary digits reflects the multi-scale property of the original data. For example, typical two’s complement representation converts decimal digits to , while our signed binary representation converts to . Our choice not only strengthens the multi-scale property but also makes the result information-theoretically effective. For instance, this is because , two’s complement of , activates excessive input units compared with the signed representation of .
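The difference between the two encodings is easy to see for a small negative number. The sketch below compares the digits of -1 in an 8-digit signed-magnitude and an 8-digit two's complement representation; the value and the word length are illustrative choices, not taken from the paper.

```python
def signed_magnitude(value, n_digits):
    """Sign bit (1 for negative) followed by the magnitude, MSB first."""
    sign_bit = 1 if value < 0 else 0
    magnitude = abs(value)
    return [sign_bit] + [(magnitude >> s) & 1
                         for s in range(n_digits - 2, -1, -1)]


def twos_complement(value, n_digits):
    """Two's complement digits, MSB first."""
    wrapped = value % (2 ** n_digits)  # maps negative values to 2**n - |value|
    return [(wrapped >> s) & 1 for s in range(n_digits - 1, -1, -1)]


value, n_digits = -1, 8  # illustrative choice
sm = signed_magnitude(value, n_digits)
tc = twos_complement(value, n_digits)
print(sm, sum(sm))  # [1, 0, 0, 0, 0, 0, 0, 1] 2
print(tc, sum(tc))  # [1, 1, 1, 1, 1, 1, 1, 1] 8
```

For values of small magnitude, the signed-magnitude form switches on only the sign bit and a few low-order digits, whereas two's complement switches on almost every input unit for small negative values.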

In this note, we limited our study of the MSD digit representation to the problem of b-jet tagging. However, our method can be applied straightforwardly to many other deep learning problems. Further investigation is required in this direction.


This work was supported in part by NRF Grant 2017R1A2B4003095.