1 Introduction
In the data-rich environment of the Large Hadron Collider (LHC), machine learning techniques have the potential to significantly improve on many classification, regression, and generation problems in collider physics. There has been a recent surge of interest in applying deep learning and other modern algorithms to a wide variety of problems, such as jet tagging Cogan:2014oua ; Almeida:2015jua ; deOliveira:2015xxd ; Baldi:2016fql ; Barnard:2016qma ; Kasieczka:2017nvn ; Komiske:2016rsd ; deOliveira:2017pjk ; Komiske:2017ubm ; Louppe:2017ipp ; ATLASCONF2017064 ; ATLPHYSPUB2017013 ; ATLPHYSPUB2017004 ; CMSDP2017005 ; CMSDP2017013 ; ATLPHYSPUB2017003 ; Butter:2017cot ; Pearkes:2017hku ; Datta:2017rhs ; ATLPHYSPUB2017017 ; CMSDP2017027 . Despite the power of these methods, they all currently rely on significant input from simulations. Existing multivariate approaches for classification used by the LHC experiments all suffer from some degree of mismodeling by simulations and must be corrected post hoc using data-driven techniques Aad:2015ydr ; Chatrchyan:2012jua ; Aad:2014gea ; CMS:2013kfa ; CMSDP2016070 ; Aad:2015rpa ; Khachatryan:2014vla ; Aad:2016pux ; CMS:2014fya . The existence of these scale factors is an indication that the algorithms trained on simulation are suboptimal when tested on data. Adversarial approaches can be used to mitigate potential mismodeling effects during training at the cost of algorithmic performance Louppe:2016ylz . The only solution that does not compromise performance is to train directly on data. This is often thought to be impossible because data is unlabeled.

In this paper, we introduce classification without labels (CWoLa, pronounced “koala”), a paradigm which allows robust classifiers to be trained directly on data in scenarios common in collider physics. Remarkably, the CWoLa method amounts to only a minor variation on well-known machine learning techniques, as one can effectively utilize standard fully-supervised techniques on two mixed samples. As long as the two samples have different compositions of the true classes (even if the label proportions are unknown), we prove that the optimal classifier in the CWoLa framework is the optimal classifier in the fully-supervised case.^1 In practice, after training the classifier on large event samples without using label information, the operating points of the classifier can be determined from a small sample where at least the label proportions are known.

^1 After we developed this framework, we learned of a mathematically equivalent (but conceptually different) rephrasing of CWoLa in the language of learning from random noisy labels in Ref. scott2013 , where a version of Theorem 1 also appears. See the discussion in Sec. 2.3.
The CWoLa paradigm is part of a broader set of classification frameworks that fall under the umbrella of weak supervision. These frameworks go beyond the standard fully-supervised paradigm with the goal of learning from partial, non-standard, or imperfect label information. See Ref. hernandez2015 for a recent review and comprehensive taxonomy. Weak supervision was first applied in the context of high energy physics in Ref. Dery:2017fap to distinguish jets originating from quarks from those originating from gluons using only class proportions during training; this paradigm is known as learning from label proportions (LLP) quadrianto2009 ; patrini2014almost . For quark versus gluon jet tagging, LLP was an important development because useful quark/gluon discrimination information is often subtle and sensitive to low-energy or wide-angle radiation inside jets, which may not be modeled correctly in parton shower generators Gras:2017jty . The main drawback of LLP, however, is that there is still uncertainty in the quark/gluon labels themselves, since quark/gluon fractions are determined by matrix element calculations convolved with parton distribution functions, which carry their own uncertainties. The CWoLa paradigm sidesteps the issue of quark/gluon fractions entirely, and only relies on the assumption that the samples used for training are proper mixed samples without contamination or sample-dependent labeling.
The ideas presented below may prove useful for a wide variety of machine learning applications, but for concreteness we focus on classification. It is worth emphasizing that the CWoLa framework can be applied to a huge variety of classifiers^2 without modification to the training procedure, by simply training on mixed event samples instead of on pure samples. By contrast, LLP-style weak supervision such as in Ref. Dery:2017fap requires a nontrivial modification to the loss function.^3 For this reason, CWoLa can be applied even for classifiers that are not trained in terms of loss functions at all.

^2 CWoLa can be applied to train any classifier with a threshold that can be varied to sweep over operating points. Nearest neighbors classification, for instance, does not have this property.

^3 The recent study in Ref. Cohen:2017exh , which was initially inspired by the LLP paradigm, is actually performing weak supervision using the CWoLa approach. We thank Timothy Cohen, Marat Freytsis, and Bryan Ostdiek for clarifications on this point.

Despite the power and simplicity of the CWoLa approach, there are some important limitations to keep in mind. First, the optimality of CWoLa is only true asymptotically; for a finite training set and a realistic machine learning algorithm, there can be differences, as discussed more below. Second, CWoLa does not apply when one class does not already exist in the data, as may be the case in a search for physics beyond the Standard Model (SM) with an exotic signature. That said, if the new physics can be decomposed into SM-like components, such as different types of jets, then CWoLa may once again be possible. Third, when the CWoLa strategy is employed for training in one event topology and testing in another event topology, there may be systematic uncertainties associated with the extrapolation. Of course, this is also true for traditional fully-supervised classification, which may introduce residual dependence on simulation; indeed, one could even combine adversarial approaches with CWoLa in this case to mitigate simulation dependence Louppe:2016ylz . Finally, the CWoLa approach presented here only applies to mixtures of two categories, and further developments would be needed to disentangle multi-category samples.
The remainder of this paper is organized as follows. In Sec. 2, we explain the theoretical foundations of the CWoLa paradigm and contrast it with LLP-style weak supervision and full supervision. We illustrate the power of CWoLa with a toy example of two Gaussian random variables in Sec. 3. We then apply CWoLa to the challenge of quark versus gluon jet tagging in Sec. 4, using a dense network of five standard quark/gluon discriminants to highlight the performance of CWoLa on mixed samples. The paper concludes in Sec. 5 with a summary and future outlook.

2 Machine learning with and without labels
The goal of classification is to distinguish two processes from each other: signal $S$ and background $B$. Let $\vec{x}$ be a list of observables that are useful for distinguishing signal from background, and define $p_S(\vec{x})$ and $p_B(\vec{x})$ to be the probability distributions of $\vec{x}$ for the signal and background, respectively. A classifier $h: \vec{x} \mapsto \mathbb{R}$ is designed such that higher values of $h$ are more signal-like and lower values are more background-like. A classifier operating point is defined by a threshold cut $h > c$; the signal efficiency is then $\epsilon_S = \int d\vec{x}\, p_S(\vec{x})\, \Theta(h(\vec{x}) - c)$ and the background efficiency (i.e. mistag rate) is $\epsilon_B = \int d\vec{x}\, p_B(\vec{x})\, \Theta(h(\vec{x}) - c)$, for the Heaviside step function $\Theta$. The performance of a classifier can be described by its receiver operating characteristic (ROC) curve, which is the function $\epsilon_B(\epsilon_S)$. A classifier $h$ is optimal if, for any other classifier $h'$, $\epsilon_B^{h}(\epsilon_S) \le \epsilon_B^{h'}(\epsilon_S)$ for all possible $\epsilon_S$. By the Neyman-Pearson lemma nplemma , an optimal classifier is the likelihood ratio: $h_{\rm optimal}(\vec{x}) = p_S(\vec{x})/p_B(\vec{x})$. Therefore, the goal of classification is to learn $h_{\rm optimal}$ or any classifier that is monotonically related to it.

In practice, one learns to approximate $h_{\rm optimal}$ from a set of signal and background examples (training data). When the dimensionality of $\vec{x}$ is small and the number of examples large, it is often possible to approximate $p_S(\vec{x})$ and $p_B(\vec{x})$ directly by using histograms. When the dimensionality is large, an explicit construction is often not possible. In this case, one constructs a loss function that is minimized using a machine learning algorithm like a boosted decision tree or (deep) neural network. The following subsections describe three paradigms for learning $h$ with different amounts of information available at training time: full supervision, LLP, and CWoLa. The ideas presented here apply to any procedure for constructing $h$.

2.1 Full supervision
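Fully supervised learning is the standard classification paradigm. Each example $\vec{x}_i$ comes with a label $u_i \in \{S, B\}$. For models trained to minimize loss functions, typical loss functions are the mean squared error:

$$\ell_{\rm MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(h(\vec{x}_i) - \mathbb{I}(u_i = S)\right)^2 \tag{1}$$

for the indicator function $\mathbb{I}$, or the cross-entropy:

$$\ell_{\rm CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\mathbb{I}(u_i = S)\,\log h(\vec{x}_i) + \left(1 - \mathbb{I}(u_i = S)\right)\log\left(1 - h(\vec{x}_i)\right)\right] \tag{2}$$

where $N$ is the size of the subset (batch) of the available training data. With large enough training samples, a flexible enough model parameterization, and a suitable minimization procedure, the learned $h$ should approach the performance of $h_{\rm optimal}$.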
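As a minimal sketch of the two losses named above, the following pure-Python functions evaluate the mean squared error and the cross-entropy on a single batch. The function names, the string encoding of labels as "S"/"B", and the clipping constant `eps` are illustrative assumptions and are not taken from the paper.

```python
import math

def indicator(label):
    """Return 1.0 for a signal label ("S") and 0.0 otherwise (assumed encoding)."""
    return 1.0 if label == "S" else 0.0

def mse_loss(outputs, labels):
    """Mean squared error between classifier outputs and 0/1 targets."""
    n = len(outputs)
    return sum((h - indicator(u)) ** 2 for h, u in zip(outputs, labels)) / n

def cross_entropy_loss(outputs, labels, eps=1e-12):
    """Binary cross-entropy; outputs are clipped to [eps, 1 - eps] to avoid log(0)."""
    n = len(outputs)
    total = 0.0
    for h, u in zip(outputs, labels):
        h = min(max(h, eps), 1.0 - eps)
        target = indicator(u)
        total -= target * math.log(h) + (1.0 - target) * math.log(1.0 - h)
    return total / n

# Toy batch: classifier outputs in (0, 1) and their true labels.
outputs = [0.9, 0.2, 0.6, 0.1]
labels = ["S", "B", "S", "B"]
print(mse_loss(outputs, labels), cross_entropy_loss(outputs, labels))
```

Both functions assume a classifier output in the unit interval, which is the usual convention when the output is interpreted as a signal probability.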
2.2 Learning from label proportions
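For weak supervision, one does not have complete and/or accurate label information. Here, we consider the case of accurate labels, but in the context of mixed samples. Consider two processes $M_1$ and $M_2$ that are mixtures of the original signal and background processes:

$$p_{M_1}(\vec{x}) = f_1\, p_S(\vec{x}) + (1 - f_1)\, p_B(\vec{x}), \tag{3}$$

$$p_{M_2}(\vec{x}) = f_2\, p_S(\vec{x}) + (1 - f_2)\, p_B(\vec{x}), \tag{4}$$

with the signal fractions satisfying $0 \le f_2 < f_1 \le 1$.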
Instead of having training data labeled as being from $S$ or $B$, we are now only given examples drawn from $p_{M_1}$ and $p_{M_2}$ with the corresponding $M_1$ and $M_2$ labels. We are however told $f_1$ and $f_2$ ahead of time. The resulting optimization problems are much less constrained than those in Sec. 2.1, but learning is still possible. The key is to use several different mixed samples with sufficiently different fractions in order to avoid trivial failure modes, as discussed in Ref. Dery:2017fap . One possible loss function is given by:

$$\ell_{\rm LLP} = \left|\sum_{i=1}^{N_1}\frac{h(\vec{x}_i)}{N_1} - f_1\right| + \left|\sum_{j=1}^{N_2}\frac{h(\vec{x}_j)}{N_2} - f_2\right| \tag{5}$$

where $N_1$ and $N_2$ are the number of $M_1$ and $M_2$ examples in the batch. One could extend (and improve) this paradigm by adding in more samples with different fractions, but we consider only two here for simplicity.
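The idea of a proportion-matching loss can be sketched in a few lines. The version below assumes the loss is the absolute difference between the batch-averaged classifier output and the known signal fraction, summed over the two mixed samples; the function name `llp_loss` and the example numbers are hypothetical.

```python
def llp_loss(outputs_m1, outputs_m2, f1, f2):
    """Proportion-matching loss for two mixed samples.

    outputs_m1, outputs_m2: classifier outputs in [0, 1] for the examples
    of each mixed sample in the batch. f1, f2: the known signal fractions.
    Only the fractions enter the loss; no per-example labels are used.
    """
    mean_m1 = sum(outputs_m1) / len(outputs_m1)
    mean_m2 = sum(outputs_m2) / len(outputs_m2)
    return abs(mean_m1 - f1) + abs(mean_m2 - f2)

# A classifier whose average output reproduces the fractions has zero loss.
print(llp_loss([1, 1, 0, 1], [0, 0, 1, 0], f1=0.75, f2=0.25))

# A constant output cannot match two different fractions at once, which is
# why samples with sufficiently different fractions constrain the classifier.
print(llp_loss([0.5] * 4, [0.5] * 4, f1=0.75, f2=0.25))
```

The second call illustrates a trivial failure mode that distinct fractions help rule out: a constant classifier is penalized as soon as the two samples have different compositions.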
2.3 Classification without labels
CWoLa is an alternative strategy for weak supervision in the context of mixed samples. Rather than modifying the loss function to accommodate the limited information as in Sec. 2.2, the CWoLa approach is to simply train the model to discriminate the mixed samples $M_1$ and $M_2$ from one another. The classifier trained to distinguish $M_1$ from $M_2$ (using full supervision) is then directly applied to distinguish $S$ from $B$. An illustration of this technique is shown in Fig. 1. Remarkably, this procedure results in an optimal classifier (as defined in the beginning of Sec. 2) for the $S$ versus $B$ classification problem:
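Theorem 1. Given mixed samples $M_1$ and $M_2$ defined in terms of pure samples $S$ and $B$ using Eqs. (3) and (4) with signal fractions $f_1 > f_2$, an optimal classifier trained to distinguish $M_1$ from $M_2$ is also optimal for distinguishing $S$ from $B$.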
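Proof.
The optimal classifier to distinguish examples drawn from $p_{M_1}$ and $p_{M_2}$ is the likelihood ratio $L_{M_1/M_2}(\vec{x}) = p_{M_1}(\vec{x})/p_{M_2}(\vec{x})$. Similarly, the optimal classifier to distinguish examples drawn from $p_S$ and $p_B$ is the likelihood ratio $L_{S/B}(\vec{x}) = p_S(\vec{x})/p_B(\vec{x})$. Where $p_B$ has support, we can relate these two likelihood ratios algebraically:

$$L_{M_1/M_2} = \frac{p_{M_1}}{p_{M_2}} = \frac{f_1\, p_S + (1 - f_1)\, p_B}{f_2\, p_S + (1 - f_2)\, p_B} = \frac{f_1\, L_{S/B} + (1 - f_1)}{f_2\, L_{S/B} + (1 - f_2)} \tag{6}$$

which is a monotonically increasing rescaling of the likelihood as long as $f_1 > f_2$, since $\partial_{L_{S/B}} L_{M_1/M_2} = (f_1 - f_2)/(f_2\, L_{S/B} - f_2 + 1)^2 > 0$. If $f_1 < f_2$, then one obtains the reversed classifier. Therefore, $L_{S/B}$ and $L_{M_1/M_2}$ define the same classifier. ∎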
An important feature of CWoLa is that, unlike the LLP-style weak supervision in Sec. 2.2, the label proportions $f_1$ and $f_2$ are not required for training. Of course, this proof only guarantees that the optimal classifier from CWoLa is the same as the optimal classifier from fully-supervised learning; it says nothing about finite training samples or realistic learning algorithms. We explore the practical performance of CWoLa in Secs. 3 and 4.
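The monotonic relation at the heart of the proof can be checked numerically. The sketch below evaluates the mixed-sample likelihood ratio as a function of the signal/background likelihood ratio on a grid and confirms that it is strictly increasing when the first mixture is more signal-rich, and strictly decreasing when the fractions are swapped. The fractions 0.8 and 0.3 and the grid are arbitrary illustrative choices.

```python
def mixed_likelihood_ratio(l_sb, f1, f2):
    """Likelihood ratio between the two mixtures, written as a function of
    the signal/background likelihood ratio l_sb (algebraic relation of Eq. (6))."""
    return (f1 * l_sb + (1.0 - f1)) / (f2 * l_sb + (1.0 - f2))

f1, f2 = 0.8, 0.3                            # illustrative fractions, f1 > f2
grid = [0.01 * k for k in range(1, 2001)]    # l_sb from 0.01 to 20

values = [mixed_likelihood_ratio(l, f1, f2) for l in grid]
is_increasing = all(a < b for a, b in zip(values, values[1:]))

# Swapping the fractions reverses the ordering (the "reversed classifier").
swapped = [mixed_likelihood_ratio(l, f2, f1) for l in grid]
is_decreasing = all(a > b for a, b in zip(swapped, swapped[1:]))

print(is_increasing, is_decreasing)
```

Because the map is monotonic, any threshold on the mixed-sample ratio corresponds to a threshold on the signal/background ratio, so both define the same family of cuts.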
The problem of learning from unknown mixed samples can be shown to be mathematically equivalent to the problem of learning with asymmetric random label noise, where there have been recent advances scott2013 ; natarajan2013learning . The equivalence of these frameworks follows from the fact that randomly flipping the labels of pure samples, possibly with different flip probabilities for signal and background, produces mixed samples. In the language of noisy labels, Ref. scott2013 argues that even unknown class proportions can be estimated from mixed samples under certain conditions using mixture proportion estimation scott2015rate , which may have interesting applications in collider physics. There are also connections between learning from unknown mixed samples and the calibrated classifiers approach in Ref. Cranmer:2015bka , where measurement of the class proportions from unknown mixtures is also shown to be possible.

2.4 Operating points
While the optimal classifier from CWoLa is independent of the mixed sample compositions, some minimal input is needed in order to establish classification operating points. Specifically, to define a cut on the classifier $h$ at a value $c$ to achieve signal efficiency $\epsilon_S$, one requires some degree of label information.
One practical strategy is to use CWoLa to train on two large mixed samples without label or class proportion information, and then benchmark it on two smaller samples where the class proportions $f_1$ and $f_2$ are precisely known. In that case, one can solve a simple system of equations on the smaller samples:

$$p_{M_1}(h > c) = f_1\, \epsilon_S + (1 - f_1)\, \epsilon_B, \tag{7}$$

$$p_{M_2}(h > c) = f_2\, \epsilon_S + (1 - f_2)\, \epsilon_B, \tag{8}$$

where the probabilities can be estimated numerically by counting the number of events that pass the classifier cut in some sample, e.g. $p_{M_1}(h > c) \approx \sum_{\vec{x} \in \mathcal{M}_1} \Theta(h(\vec{x}) - c)/|\mathcal{M}_1|$, where $\mathcal{M}_1$ is the mixed sample data. Thus with class proportions only, the ROC curve of a classifier can be determined.^4

^4 We are grateful to Francesco Rubbo for bringing this to our attention.
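The two-equation system of Eqs. (7) and (8) can be inverted in closed form. The sketch below is one way to do so, assuming that the fraction of events passing the cut has already been counted in each benchmark sample; the function names and the example numbers are hypothetical.

```python
def pass_fraction(scores, cut):
    """Fraction of events in a sample whose classifier score exceeds the cut."""
    return sum(1 for s in scores if s > cut) / len(scores)

def efficiencies(p1, p2, f1, f2):
    """Solve p1 = f1*eps_s + (1-f1)*eps_b and p2 = f2*eps_s + (1-f2)*eps_b.

    p1, p2: pass fractions measured in the two mixed benchmark samples.
    f1, f2: their known signal fractions (must differ).
    Returns (eps_s, eps_b), the signal and background efficiencies at this cut.
    """
    det = f1 - f2  # determinant of the 2x2 system
    if det == 0:
        raise ValueError("identical signal fractions: the system is degenerate")
    eps_s = ((1.0 - f2) * p1 - (1.0 - f1) * p2) / det
    eps_b = (f1 * p2 - f2 * p1) / det
    return eps_s, eps_b

# Illustrative numbers: pass fractions generated from eps_s = 0.7, eps_b = 0.1
# with f1 = 0.8 and f2 = 0.2 are mapped back to the input efficiencies.
eps_s, eps_b = efficiencies(p1=0.58, p2=0.22, f1=0.8, f2=0.2)
print(eps_s, eps_b)
```

Repeating this for a range of cut values traces out the ROC curve. With finite samples the solved efficiencies fluctuate and can fall slightly outside the unit interval, so a practical implementation would likely add uncertainty estimates.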
For the purpose of establishing working points, one might need to rely on simulations to determine the label proportions of the test samples. In many cases, though, label proportions are better known than the details of the observables used to train the classifier. For instance, in jet tagging, the label proportions of kinematically-selected samples are largely determined by the hard scattering process, with only mild sensitivity to effects such as shower mismodeling. In this way, one is sensitive only to simulation uncertainties associated with sample composition, which in most cases are largely uncorrelated with uncertainties associated with tagging performance.
To summarize, the CWoLa paradigm does not need class proportions during training, and it only requires a small sample of test data where class proportions are known in order to determine the classifier performance and operating points, with minimal input from simulation.
3 Illustrative example: Two Gaussian random variables
Before demonstrating the combination of CWoLa with a modern neural network, we first illustrate the various forms of learning discussed in Sec. 2 through a simplified example where the optimal classifier can be obtained analytically. Consider a single observable $x$ for distinguishing a signal $S$ from a background $B$. For simplicity, suppose that the probability distribution of $x$ is a Gaussian with mean $\mu_S$ for the signal and a Gaussian with mean $\mu_B$ and standard deviation $\sigma_B$ for the background. We then consider the mixed samples $M_1$ and $M_2$ from Eqs. (3) and (4) with signal fractions $f_1$ and $f_2$.

In this one-dimensional case, the optimal fully-supervised classifier can be constructed analytically:

$$h_{\rm optimal}(x) = \frac{p_S(x)}{p_B(x)} \tag{9}$$

Of course, nonparametrically estimating Eq. (9) numerically requires a choice of binning which can introduce numerical fluctuations. To avoid this effect, we discretize $x$ into bins between and (under/overflow is added to the first/last bins). There are then a finite number of possibilities for the likelihood ratio in Eq. (9).
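Using a calligraphic font to denote explicit training samples, we test classifiers for each of the three paradigms of Sec. 2 (full supervision, LLP, and CWoLa; Eqs. (10)-(12)) on signal ($\mathcal{S}$), background ($\mathcal{B}$), and mixed ($\mathcal{M}_1$, $\mathcal{M}_2$) training samples of the same size.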
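The performance of the classifiers trained in this way is evaluated on a holdout set of signal and background examples that is large enough such that statistical fluctuations are negligible. We use the area under the curve (AUC) metric to quantify performance. For continuous random variables, the AUC can be defined as the probability that a signal example receives a higher classifier value than a background example, $\Pr\left(h(\vec{x}_S) > h(\vec{x}_B)\right)$. This notion extends well to discrete random variables (indexed by integers):

$$\text{AUC} = \sum_{i}\sum_{j} p_S(i)\, p_B(j)\left[\Theta\left(h_i - h_j\right) + \tfrac{1}{2}\,\delta_{h_i h_j}\right] \tag{13}$$

where ties contribute with weight one half. For a properly constructed classifier, the AUC $\geq 0.5$. In all of the numerical examples shown below, the classifier is inverted if necessary so that by construction, AUC $\geq 0.5$.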
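A direct way to compute the AUC for discrete classifier values is to count ordered pairs of signal and background examples. The sketch below assumes the common convention that ties count one half; the function names are illustrative, and the quadratic pair loop is chosen for clarity rather than speed.

```python
def auc(signal_scores, background_scores):
    """Probability that a signal example scores higher than a background
    example, counting ties with weight 1/2 (assumed convention)."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))

def oriented_auc(signal_scores, background_scores):
    """AUC after inverting the classifier if needed, so the result is >= 0.5."""
    value = auc(signal_scores, background_scores)
    return max(value, 1.0 - value)

# Four pairs: three wins for signal and one tie give (3 + 0.5) / 4.
print(auc([2, 3], [1, 2]))
# A reversed classifier is flipped back by the orientation step.
print(oriented_auc([1, 2], [2, 3]))
```

Inverting a classifier maps its AUC to one minus the AUC, which is why taking the larger of the two values implements the inversion described above.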
In Fig. 2, we illustrate the performance of the three classification paradigms described above with 100, 1k, and 10k training examples each of $S$ and $B$, or $M_1$ and $M_2$ in the LLP and CWoLa cases, taking for concreteness. Testing is performed on 100k $S$ and $B$ examples in all cases. The LLP and CWoLa paradigms have nearly the same dependence on the number of training events and the signal fraction $f_1$. Full supervision does not depend on the signal composition of $M_1$ and $M_2$, as it is trained directly on labeled signal and background examples. As expected, the performance is poor when the number of training examples is small or $f_1$ is close to $f_2$ (so the effective number of useful events is small). As $f_1 \to f_2$, the two mixtures become identical and there is thus no way to distinguish $S$ and $B$; in the context of LLP, this corresponds to attempting to solve a degenerate system of equations. With sufficiently many training examples and/or well-separated fractions $f_1$ and $f_2$, the techniques trained with $M_1$ and $M_2$ converge to the fully supervised case, as expected from Theorem 1.
One advantage of CWoLa over the LLP approach is that the fractions $f_1$ and $f_2$ are not required for training. In Fig. 3, we demonstrate the impact on the AUC for LLP when the wrong fractions are provided at training time. Here, the true fractions are and , but different fractions are used to calculate Eq. (11). For far from , there is little dependence on the fraction used for training. This insensitivity is likely due to the preservation of monotonicity to the full likelihood with small perturbations in , as discussed in detail in Ref. Cohen:2017exh .
With this one-dimensional example, the estimate for the optimal classifier under each of the three schemes is computable directly. It is often the case that $\vec{x}$ is highly multidimensional, though, in which case a more sophisticated learning scheme may be required. We investigate the performance of CWoLa in a five-dimensional space in the next section.
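The toy study can be imitated with a short, self-contained script. The sketch below trains a binned likelihood-ratio classifier once on two mixed samples (CWoLa) and once on pure samples (full supervision), then evaluates both on pure test samples. All numerical choices (Gaussian means and width, fractions, binning, sample sizes, random seed) are assumptions made for illustration and are not the values used in the paper.

```python
import random

def sample_mixture(rng, n, f, mu_s, mu_b, sigma):
    """Draw n values; each comes from the signal Gaussian with probability f."""
    return [rng.gauss(mu_s if rng.random() < f else mu_b, sigma) for _ in range(n)]

def histogram(values, n_bins, lo, hi):
    """Normalized histogram; under/overflow is added to the first/last bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        b = int((v - lo) / width)
        counts[min(max(b, 0), n_bins - 1)] += 1
    return [c / len(values) for c in counts]

def likelihood_ratio(p_num, p_den):
    """Per-bin ratio; bins that are empty in both samples get a neutral 1.0."""
    ratio = []
    for a, b in zip(p_num, p_den):
        if b > 0:
            ratio.append(a / b)
        else:
            ratio.append(float("inf") if a > 0 else 1.0)
    return ratio

def binned_auc(h, p_s, p_b):
    """AUC of a per-bin classifier h on binned test distributions (ties count 1/2)."""
    total = 0.0
    for h_i, ps in zip(h, p_s):
        for h_j, pb in zip(h, p_b):
            if h_i > h_j:
                total += ps * pb
            elif h_i == h_j:
                total += 0.5 * ps * pb
    return total

rng = random.Random(0)
MU_S, MU_B, SIGMA = 1.0, 0.0, 1.0    # assumed toy parameters
F1, F2 = 0.8, 0.2                    # assumed signal fractions
N_BINS, LO, HI = 20, -3.0, 4.0       # assumed binning
N_TRAIN, N_TEST = 20000, 20000

def binned(sample):
    return histogram(sample, N_BINS, LO, HI)

# CWoLa: likelihood ratio between the two mixed samples; fractions are not used.
m1 = sample_mixture(rng, N_TRAIN, F1, MU_S, MU_B, SIGMA)
m2 = sample_mixture(rng, N_TRAIN, F2, MU_S, MU_B, SIGMA)
h_cwola = likelihood_ratio(binned(m1), binned(m2))

# Full supervision: likelihood ratio between pure signal and pure background.
s_train = sample_mixture(rng, N_TRAIN, 1.0, MU_S, MU_B, SIGMA)
b_train = sample_mixture(rng, N_TRAIN, 0.0, MU_S, MU_B, SIGMA)
h_full = likelihood_ratio(binned(s_train), binned(b_train))

# Evaluate both classifiers on independent pure test samples.
p_s_test = binned(sample_mixture(rng, N_TEST, 1.0, MU_S, MU_B, SIGMA))
p_b_test = binned(sample_mixture(rng, N_TEST, 0.0, MU_S, MU_B, SIGMA))
auc_cwola = binned_auc(h_cwola, p_s_test, p_b_test)
auc_full = binned_auc(h_full, p_s_test, p_b_test)
print(auc_cwola, auc_full)
```

With well-separated fractions and ample training data, the two AUC values should be close, in line with Theorem 1. Moving the assumed fractions toward each other or shrinking `N_TRAIN` is a simple way to explore the degradation discussed in the text.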
4 Realistic example: Quark/gluon jet discrimination
Quark versus gluon-initiated jet tagging Nilles:1980ys ; Jones:1988ay ; Fodor:1989ir ; Jones:1990rz ; Lonnblad:1990qp ; Pumplin:1991kc ; Gallicchio:2011xq ; Gallicchio:2012ez ; Larkoski:2014pca is a particularly important classification problem in high energy physics where training on data would be beneficial. This is because correlations between key observables known to be useful for tagging are not always well-modeled by simulations as they depend on the detailed structure of a jet’s radiation pattern Aad:2014gea ; Aad:2016oit . Furthermore, even the LLP paradigm proposed in Ref. Dery:2017fap can be sensitive to the input fractions which are themselves dependent on nonperturbative information from parton distribution functions. In this section, we test the performance of CWoLa in a realistic context where a small number of quark/gluon discriminants are combined into one classifier, similar to the CMS quark/gluon likelihood CMS:2013kfa ; CMSDP2016070 .
A key limitation of this study is that we artificially construct mixed samples $M_1$ and $M_2$ from pure “quark” ($q$) and pure “gluon” ($g$) samples.^5 In the practical case of interest at the LHC, one would measure a quark-enriched sample in plus jet events and a gluon-enriched sample in dijet events, with more sophisticated selections possible as well Gallicchio:2011xc . However, the “quark” jet in event is not the same as the “quark” jet in , since there are soft color correlations with the rest of the event. Jet grooming techniques Butterworth:2008iy ; Ellis:2009su ; Ellis:2009me ; Krohn:2009th ; Dasgupta:2013ihk ; Larkoski:2014wba can mitigate the impact of soft effects to provide a more universal “quark” jet definition Frye:2016okc ; Frye:2016aiz . Still, one needs to validate the robustness of quark/gluon classifiers to the possibility of sample-dependent labels, and we leave a detailed study of this effect to future work.

^5 The reason for the scare quotes is discussed at length in Ref. Gras:2017jty , as the definition of a quark or gluon jet is fundamentally ambiguous.
This study is based on five key jet substructure observables which are known to be useful quark/gluon discriminants Gras:2017jty . The discriminants are combined using a modern neural network employing either CWoLa or fully-supervised learning. We do not show a benchmark curve for LLP since it is difficult to ensure a fair comparison. By contrast, CWoLa and full supervision use the same loss function with the same training strategy, so a direct comparison is meaningful. All of the observables can be written in terms of the generalized angularities Larkoski:2014pca (see also Berger:2003iw ; Almeida:2008yp ; Ellis:2010rwa ):
$$\lambda^{\kappa}_{\beta} = \sum_{i \in \text{jet}} z_i^{\kappa}\, \theta_i^{\beta}, \qquad z_i = \frac{p_{Ti}}{\sum_{j \in \text{jet}} p_{Tj}}, \qquad \theta_i = \frac{R_i}{R}, \tag{14}$$

where $R_i$ is the rapidity/azimuth distance to the $E$-scheme jet axis,^6 $p_{Ti}$ is the particle transverse momentum, and $R$ is the jet radius. The observables used to train the network use $(\kappa, \beta)$ values of:

$$(0, 0):\ \text{multiplicity}, \quad (2, 0):\ p_T^D, \quad (1, 0.5):\ \text{LHA}, \quad (1, 1):\ \text{width}, \quad (1, 2):\ \text{mass}, \tag{15}$$

where the names map onto the well-known discriminants in the quark/gluon literature.^7 For this study, we use the angularity definition of the five observables. Note that the first observable is infrared and collinear (IRC) unsafe, the second observable is IR safe but C unsafe, and the last three observables with $\kappa = 1$ are all IRC safe. LHA refers to the Les Houches Angularity from the eponymous study in Refs. Gras:2017jty ; Badger:2016bpw .

^6 This is in contrast to Ref. Gras:2017jty , which uses the winner-take-all axis Bertolini:2013iqa ; Larkoski:2014uqa ; Salam:WTAUnpublished .

^7 Strictly speaking $\lambda^{2}_{0}$ is the square of $p_T^D$ Chatrchyan:2012sn , and $\lambda^{1}_{2}$ is mass-squared over energy-squared in the soft-collinear limit.
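For concreteness, a generalized angularity can be computed from a list of jet constituents as sketched below. The sketch assumes the standard normalization from the angularity literature (transverse momentum fractions and angular distances scaled by the jet radius); the function name, the constituent format, the (kappa, beta) table, and the example jet with radius 0.4 are illustrative assumptions.

```python
import math

def generalized_angularity(constituents, jet_axis, jet_radius, kappa, beta):
    """Sum over constituents of z_i**kappa * theta_i**beta.

    constituents: list of (pt, rapidity, phi) tuples.
    jet_axis: (rapidity, phi) of the jet axis.
    z_i is the transverse momentum fraction and theta_i the rapidity/azimuth
    distance to the axis divided by the jet radius (assumed normalization).
    """
    total_pt = sum(pt for pt, _, _ in constituents)
    value = 0.0
    for pt, rapidity, phi in constituents:
        z = pt / total_pt
        # Wrap the azimuthal difference into [-pi, pi).
        dphi = (phi - jet_axis[1] + math.pi) % (2.0 * math.pi) - math.pi
        theta = math.hypot(rapidity - jet_axis[0], dphi) / jet_radius
        value += z ** kappa * theta ** beta
    return value

# Assumed (kappa, beta) pairs for the five named observables.
OBSERVABLES = {
    "multiplicity": (0, 0),
    "pTD": (2, 0),
    "LHA": (1, 0.5),
    "width": (1, 1),
    "mass": (1, 2),
}

# Hypothetical two-particle jet: one particle on the axis, one at distance 0.2.
jet = [(30.0, 0.0, 0.0), (10.0, 0.2, 0.0)]
for name, (kappa, beta) in OBSERVABLES.items():
    print(name, generalized_angularity(jet, (0.0, 0.0), 0.4, kappa, beta))
```

Python evaluates `0.0 ** 0` as 1.0, so with kappa = beta = 0 every constituent contributes one and the sum reduces to the particle multiplicity, including particles that sit exactly on the axis.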
Quark and gluon jets are simulated from the decay of a heavy scalar particle with GeV in either the or channel. Production, decay, and fragmentation are modeled with Pythia 8.183 Sjostrand:2007gs . Jets are clustered using the anti-$k_t$ algorithm Cacciari:2008gp with radius implemented in FastJet 3.1.3 Cacciari:2011ma . Only detector-stable hadrons are used for jet finding. Since the gluon color factor is larger than the quark color factor by about a factor of two, gluon jets have more particles and are “wider” on average as measured by the angularities listed above.
To classify quarks and gluons with either the CWoLa or fully-supervised method, we use a simple neural network consisting of two dense layers of 30 nodes with rectified linear unit (ReLU) activation functions connected to a 2-node output with a softmax activation function. All neural network training was performed with the Python deep learning library Keras keras with a TensorFlow tensorflow backend. The data consisted of 200k quark/gluon events, partitioned into 20k validation events, 20k test events, and the remainder used as training event samples of various sizes. He-uniform weight initialization heuniform was used for the model weights. The network was trained with the categorical cross-entropy loss function using the Adam algorithm adam with a learning rate of 0.001 and a batch size of 128.

Figure 4: AUC of the CWoLa-trained network as a function of the quark fraction. Shown is the range of AUC values obtained from 10 repetitions of training the neural network on (a) 25k events and (b) 150k events for 10 epochs.
In Fig. 4, we show the performance of CWoLa training for quark/gluon classification using mixed samples of different purities. These mixed samples of 25k and 150k training events were generated by shuffling the pure samples into two sets in different proportions. Performance is measured in terms of the classifier AUC. The behavior resembles that found in the toy model of Fig. 2, with more training data resulting in increased robustness to sample impurity. It is remarkable that such good performance can be obtained even when the signal/background events are so heavily mixed.
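The construction of mixed training samples by shuffling pure samples can be sketched as follows. The sketch assumes that a fraction f1 of the pure signal sample and a fraction (1 - f1) of the pure background sample go into the first mixed sample, with the remainder forming the second; for equal-size pure samples this yields signal fractions f1 and 1 - f1. The function names are hypothetical, and the true labels are carried along only so that the composition can be inspected afterwards.

```python
import random

def make_mixed_samples(signal, background, f1, rng):
    """Split shuffled pure samples into two mixed samples M1 and M2."""
    signal, background = list(signal), list(background)
    rng.shuffle(signal)
    rng.shuffle(background)
    n_s = round(f1 * len(signal))
    n_b = round((1.0 - f1) * len(background))
    m1 = signal[:n_s] + background[:n_b]
    m2 = signal[n_s:] + background[n_b:]
    return m1, m2

def cwola_training_set(m1, m2, rng):
    """Label every M1 example 1 and every M2 example 0, then shuffle.

    These sample-membership labels are the only labels a CWoLa classifier
    sees during training.
    """
    data = [(x, 1) for x in m1] + [(x, 0) for x in m2]
    rng.shuffle(data)
    features = [x for x, _ in data]
    labels = [y for _, y in data]
    return features, labels

rng = random.Random(1)
quarks = [("q", i) for i in range(100)]   # stand-ins for pure "quark" jets
gluons = [("g", i) for i in range(100)]   # stand-ins for pure "gluon" jets

m1, m2 = make_mixed_samples(quarks, gluons, f1=0.8, rng=rng)
features, labels = cwola_training_set(m1, m2, rng)
quark_fraction_m1 = sum(1 for kind, _ in m1 if kind == "q") / len(m1)
print(quark_fraction_m1, len(features), sum(labels))
```

The resulting `features` and `labels` can be handed to any standard binary classifier; the training code itself is unchanged from the fully-supervised case, and only the meaning of the labels differs.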
In Fig. 5, we show ROC and significance improvement (SI) curves for 150k training events, where SI is a curve of $\epsilon_q/\sqrt{\epsilon_g}$ at different $\epsilon_q$ values Gallicchio:2012ez . Results are given for the fully-supervised classifier trained on pure samples and the CWoLa classifier trained on mixed samples with and , along with the curves of the input observables. Both the fully-supervised and CWoLa dense networks achieve similar performance, with the expected improvement over the individual input observables. This suggests that the CWoLa optimality proven in Theorem 1 is achievable in practice, though many more studies are needed to demonstrate this in a wider range of contexts.
5 Conclusions
We introduced the CWoLa framework for training classifiers on different mixed samples of signal and background events, without using true labels or class proportions. The observation that the optimal classifier for mixed samples of signal and background is also optimal for pure samples of signal and background, proven in Theorem 1, could be of tremendous practical use at the LHC for learning directly from data whenever truth information is unknown or uncertain and whenever detailed and reliable simulations are unavailable. We highlight that no new specific code, loss function, or model architecture is needed to implement CWoLa. Any tools for training a classifier using truth information can be directly applied to discriminate mixed samples and thus to train in the CWoLa framework directly on data.
Using a toy example, we found that CWoLa performs as well as LLP (which requires knowledge of the class proportions), suggesting that CWoLa is a robust paradigm for weak supervision. Of course, to determine operating points and classification power for the CWoLa method, some label information is needed, but it can be furnished by a smaller sample of testing data that can be separate from the larger mixed samples used for training. It is also worth remembering that CWoLa assumes that the mixed samples are not subject to contamination or sample-dependent labeling, though one could imagine using data-driven cross-validation with more than two mixed samples to identify and mitigate such effects. More ambitiously, one could try to apply CWoLa to event samples that otherwise look identical, to try to tease out potential subpopulations of events.
As a realistic example, we applied the CWoLa framework to the important case of quark/gluon discrimination, a classification task for which simulations are typically unreliable and true labels are unknown. We showed that the CWoLa method can be successfully used to train a dense neural network for quark/gluon classification on mixed samples with five jet substructure observables as input. Though the realistic example made use of a neural network, the CWoLa paradigm can be used to train many other types of classifiers. While in this study we considered a relatively small network on a small (but important) number of inputs, the same principles apply for any type of model or input. In future work, we plan to study CWoLa in the context of deeper architectures and larger inputs.
Acknowledgements.
The authors are grateful to Timothy Cohen, Kyle Cranmer, Marat Freytsis, Patrick Komiske, Bryan Ostdiek, Francesco Rubbo, Matthew Schwartz, and Clayton Scott for helpful discussions and suggestions. Cloud computing resources were provided through a Microsoft Azure for Research award. The work of E.M.M. and J.T. is supported by the DOE under grant contract numbers DE-SC-00012567 and DE-SC-00015476. The work of B.N. is supported by the DOE under contract DE-AC02-05CH11231.

References
 (1) J. Cogan, M. Kagan, E. Strauss, and A. Schwartzman, Jet-Images: Computer Vision Inspired Techniques for Jet Tagging, JHEP 02 (2015) 118, [arXiv:1407.5675].
 (2) L. G. Almeida, M. Backović, M. Cliche, S. J. Lee, and M. Perelstein, Playing Tag with ANN: Boosted Top Identification with Pattern Recognition, JHEP 07 (2015) 086, [arXiv:1501.05968].
 (3) L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, Jet-images - deep learning edition, JHEP 07 (2016) 069, [arXiv:1511.05190].
 (4) P. Baldi, K. Bauer, C. Eng, P. Sadowski, and D. Whiteson, Jet Substructure Classification in HighEnergy Physics with Deep Neural Networks, Phys. Rev. D93 (2016), no. 9 094034, [arXiv:1603.09349].
 (5) J. Barnard, E. N. Dawe, M. J. Dolan, and N. Rajcic, Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks, Phys. Rev. D95 (2017), no. 1 014018, [arXiv:1609.00607].
 (6) G. Kasieczka, T. Plehn, M. Russell, and T. Schell, Deep-learning Top Taggers or The End of QCD?, JHEP 05 (2017) 006, [arXiv:1701.08784].
 (7) P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, JHEP 01 (2017) 110, [arXiv:1612.01551].
 (8) L. de Oliveira, M. Paganini, and B. Nachman, Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis, arXiv:1701.05927.
 (9) P. T. Komiske, E. M. Metodiev, B. Nachman, and M. D. Schwartz, Pileup Mitigation with Machine Learning (PUMML), arXiv:1707.08600.
 (10) G. Louppe, K. Cho, C. Becot, and K. Cranmer, QCD-Aware Recursive Neural Networks for Jet Physics, arXiv:1702.00748.
 (11) ATLAS Collaboration, Performance of Top Quark and W Boson Tagging in Run 2 with ATLAS, ATLAS-CONF-2017-064 (2017).
 (12) ATLAS Collaboration, Optimisation and performance studies of the ATLAS b-tagging algorithms for the 2017-18 LHC run, ATL-PHYS-PUB-2017-013 (2017).
 (13) ATLAS Collaboration, Identification of Hadronically-Decaying W Bosons and Top Quarks Using High-Level Features as Input to Boosted Decision Trees and Deep Neural Networks in ATLAS at $\sqrt{s}$ = 13 TeV, ATL-PHYS-PUB-2017-004 (2017).
 (14) CMS Collaboration, Heavy flavor identification at CMS with deep neural networks, CMS-DP-2017-005 (Mar, 2017).
 (15) CMS Collaboration, CMS Phase 1 heavy flavour identification performance and developments, CMS-DP-2017-013 (May, 2017).

 (16) ATLAS Collaboration, Identification of Jets Containing b-Hadrons with Recurrent Neural Networks at the ATLAS Experiment, ATL-PHYS-PUB-2017-003 (2017).
 (17) A. Butter, G. Kasieczka, T. Plehn, and M. Russell, Deep-learned Top Tagging using Lorentz Invariance and Nothing Else, arXiv:1707.08966.
 (18) J. Pearkes, W. Fedorko, A. Lister, and C. Gay, Jet Constituents for Deep Neural Network Based Top Quark Tagging, arXiv:1704.02124.
 (19) K. Datta and A. Larkoski, How Much Information is in a Jet?, JHEP 06 (2017) 073, [arXiv:1704.08249].
 (20) ATLAS Collaboration, Quark versus Gluon Jet Tagging Using Jet Images with the ATLAS Detector, ATL-PHYS-PUB-2017-017 (2017).
 (21) CMS Collaboration, New Developments for Jet Substructure Reconstruction in CMS, CMS-DP-2017-027 (2017).
 (22) ATLAS Collaboration, G. Aad et al., Performance of b-Jet Identification in the ATLAS Experiment, JINST 11 (2016), no. 04 P04008, [arXiv:1512.01094].
 (23) CMS Collaboration, S. Chatrchyan et al., Identification of b-quark jets with the CMS experiment, JINST 8 (2013) P04013, [arXiv:1211.4462].
 (24) ATLAS Collaboration, G. Aad et al., Light-quark and gluon jet discrimination in $pp$ collisions at $\sqrt{s} = 7$ TeV with the ATLAS detector, Eur. Phys. J. C74 (2014), no. 8 3023, [arXiv:1405.6583].
 (25) CMS Collaboration, Performance of quark/gluon discrimination in 8 TeV pp data, Tech. Rep. CMS-PAS-JME-13-002, 2013.
 (26) CMS Collaboration, Performance of quark/gluon discrimination in 13 TeV data, Tech. Rep. CMS-DP-2016-070, Nov, 2016.
 (27) ATLAS Collaboration, G. Aad et al., Identification of boosted, hadronically decaying W bosons and comparisons with ATLAS data taken at $\sqrt{s} = 8$ TeV, Eur. Phys. J. C76 (2016), no. 3 154, [arXiv:1510.05821].
 (28) CMS Collaboration, V. Khachatryan et al., Identification techniques for highly boosted W bosons that decay into hadrons, JHEP 12 (2014) 017, [arXiv:1410.4227].
 (29) ATLAS Collaboration, G. Aad et al., Identification of high transverse momentum top quarks in $pp$ collisions at $\sqrt{s}$ = 8 TeV with the ATLAS detector, JHEP 06 (2016) 093, [arXiv:1603.03127].
 (30) CMS Collaboration, Boosted Top Jet Tagging at CMS, Tech. Rep. CMS-PAS-JME-13-007, 2014.
 (31) G. Louppe, M. Kagan, and K. Cranmer, Learning to Pivot with Adversarial Networks, arXiv:1611.01046.
 (32) G. Blanchard, M. Flaska, G. Handy, S. Pozzi, and C. Scott, Classification with asymmetric label noise: Consistency and maximal denoising, Electron. J. Statist. 10 (2016), no. 2 2780–2824.
 (33) J. Hernández-González, I. Inza, and J. A. Lozano, Weak supervision and other non-standard classification problems: a taxonomy, Pattern Recognition Letters 69 (2016) 49–55.
 (34) L. M. Dery, B. Nachman, F. Rubbo, and A. Schwartzman, Weakly Supervised Classification in High Energy Physics, JHEP 05 (2017) 145, [arXiv:1702.00414].
 (35) N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le, Estimating labels from label proportions, Journal of Machine Learning Research 10 (2009), no. Oct 2349–2374.
 (36) G. Patrini, R. Nock, P. Rivera, and T. Caetano, (almost) no label no cry, in Advances in Neural Information Processing Systems, pp. 190–198, 2014.
 (37) P. Gras, S. Höche, D. Kar, A. Larkoski, L. Lönnblad, S. Plätzer, A. Siódmok, P. Skands, G. Soyez, and J. Thaler, Systematics of quark/gluon tagging, JHEP 07 (2017) 091, [arXiv:1704.03878].
 (38) T. Cohen, M. Freytsis, and B. Ostdiek, (Machine) Learning to Do More with Less, arXiv:1706.09451.
 (39) J. Neyman and E. S. Pearson, On the problem of the most efficient tests of statistical hypotheses, in Breakthroughs in statistics, pp. 73–108. Springer, 1992.
 (40) N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, Learning with noisy labels, in Advances in neural information processing systems, pp. 1196–1204, 2013.
 (41) C. Scott, A rate of convergence for mixture proportion estimation, with application to learning from noisy labels, in Artificial Intelligence and Statistics, pp. 838–846, 2015.
 (42) K. Cranmer, J. Pavez, and G. Louppe, Approximating Likelihood Ratios with Calibrated Discriminative Classifiers, arXiv:1506.02169.
 (43) H. P. Nilles and K. H. Streng, Quark-Gluon Separation in Three Jet Events, Phys. Rev. D23 (1981) 1944.
 (44) L. M. Jones, Tests for Determining the Parton Ancestor of a Hadron Jet, Phys. Rev. D39 (1989) 2550.
 (45) Z. Fodor, How to See the Differences Between Quark and Gluon Jets, Phys. Rev. D41 (1990) 1726.
 (46) L. Jones, Towards a Systematic Jet Classification, Phys. Rev. D42 (1990) 811–814.
 (47) L. Lönnblad, C. Peterson, and T. Rognvaldsson, Using neural networks to identify jets, Nucl. Phys. B349 (1991) 675–702.
 (48) J. Pumplin, How to tell quark jets from gluon jets, Phys. Rev. D44 (1991) 2025–2032.
 (49) J. Gallicchio and M. D. Schwartz, Quark and Gluon Tagging at the LHC, Phys. Rev. Lett. 107 (2011) 172001, [arXiv:1106.3076].
 (50) J. Gallicchio and M. D. Schwartz, Quark and Gluon Jet Substructure, JHEP 04 (2013) 090, [arXiv:1211.7038].
 (51) A. J. Larkoski, J. Thaler, and W. J. Waalewijn, Gaining (Mutual) Information about Quark/Gluon Discrimination, JHEP 11 (2014) 129, [arXiv:1408.3122].
 (52) ATLAS Collaboration, G. Aad et al., Measurement of the charged-particle multiplicity inside jets from $\sqrt{s} = 8$ TeV pp collisions with the ATLAS detector, Eur. Phys. J. C76 (2016), no. 6 322, [arXiv:1602.00988].
 (53) J. Gallicchio and M. D. Schwartz, Pure Samples of Quark and Gluon Jets at the LHC, JHEP 10 (2011) 103, [arXiv:1104.1175].
 (54) J. M. Butterworth, A. R. Davison, M. Rubin, and G. P. Salam, Jet substructure as a new Higgs search channel at the LHC, Phys. Rev. Lett. 100 (2008) 242001, [arXiv:0802.2470].
 (55) S. D. Ellis, C. K. Vermilion, and J. R. Walsh, Techniques for improved heavy particle searches with jet substructure, Phys. Rev. D80 (2009) 051501, [arXiv:0903.5081].
 (56) S. D. Ellis, C. K. Vermilion, and J. R. Walsh, Recombination Algorithms and Jet Substructure: Pruning as a Tool for Heavy Particle Searches, Phys. Rev. D81 (2010) 094023, [arXiv:0912.0033].
 (57) D. Krohn, J. Thaler, and L.-T. Wang, Jet Trimming, JHEP 02 (2010) 084, [arXiv:0912.1342].
 (58) M. Dasgupta, A. Fregoso, S. Marzani, and G. P. Salam, Towards an understanding of jet substructure, JHEP 09 (2013) 029, [arXiv:1307.0007].
 (59) A. J. Larkoski, S. Marzani, G. Soyez, and J. Thaler, Soft Drop, JHEP 05 (2014) 146, [arXiv:1402.2657].
 (60) C. Frye, A. J. Larkoski, M. D. Schwartz, and K. Yan, Precision physics with pile-up insensitive observables, arXiv:1603.06375.
 (61) C. Frye, A. J. Larkoski, M. D. Schwartz, and K. Yan, Factorization for groomed jet substructure beyond the next-to-leading logarithm, JHEP 07 (2016) 064, [arXiv:1603.09338].
 (62) C. F. Berger, T. Kucs, and G. F. Sterman, Event shape / energy flow correlations, Phys. Rev. D68 (2003) 014012, [hep-ph/0303051].
 (63) L. G. Almeida, S. J. Lee, G. Perez, G. F. Sterman, I. Sung, et al., Substructure of high-$p_T$ Jets at the LHC, Phys. Rev. D79 (2009) 074017, [arXiv:0807.0234].
 (64) S. D. Ellis, C. K. Vermilion, J. R. Walsh, A. Hornig, and C. Lee, Jet Shapes and Jet Algorithms in SCET, JHEP 1011 (2010) 101, [arXiv:1001.0014].
 (65) D. Bertolini, T. Chan, and J. Thaler, Jet Observables Without Jet Algorithms, JHEP 1404 (2014) 013, [arXiv:1310.7584].
 (66) A. J. Larkoski, D. Neill, and J. Thaler, Jet Shapes with the Broadening Axis, JHEP 1404 (2014) 017, [arXiv:1401.2158].
 (67) G. Salam, Scheme, Unpublished.
 (68) CMS Collaboration, S. Chatrchyan et al., Search for a Higgs boson in the decay channel $H \to ZZ^{(*)} \to q\bar{q}\ell^{-}\ell^{+}$ in pp collisions at $\sqrt{s} = 7$ TeV, JHEP 04 (2012) 036, [arXiv:1202.1416].
 (69) J. R. Andersen et al., Les Houches 2015: Physics at TeV Colliders Standard Model Working Group Report, in 9th Les Houches Workshop on Physics at TeV Colliders (PhysTeV 2015) Les Houches, France, June 1-19, 2015, 2016. arXiv:1605.04692.
 (70) T. Sjostrand, S. Mrenna, and P. Z. Skands, A Brief Introduction to PYTHIA 8.1, Comput. Phys. Commun. 178 (2008) 852–867, [arXiv:0710.3820].
 (71) M. Cacciari, G. P. Salam, and G. Soyez, The Anti-k(t) jet clustering algorithm, JHEP 04 (2008) 063, [arXiv:0802.1189].
 (72) M. Cacciari, G. P. Salam, and G. Soyez, FastJet User Manual, Eur. Phys. J. C72 (2012) 1896, [arXiv:1111.6097].
 (73) F. Chollet, “Keras.” https://github.com/fchollet/keras, 2017.
 (74) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in OSDI, vol. 16, pp. 265–283, 2016.

 (75) K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
 (76) D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv:1412.6980.