1 Introduction
With increasingly complex theoretical models and growing computing power, many scientific projects rely on simulations to design their analysis techniques. This is especially true for high energy particle physics, where high-fidelity Monte Carlo (MC) simulation is used to model physical processes at distances ranging from subnuclear scales all the way to the macroscopic dimensions of detectors. However, just as the models have become more complex, analysis techniques have also become more sophisticated. Numerous, often subtle, features of events are combined using powerful supervised learning algorithms trained on large simulated (labeled) datasets. Despite these advances, there is no guarantee that techniques highly optimized on simulation are also optimal in nature. One of the most ubiquitous analysis procedures is classifying events as originating from one of two different processes. It is sometimes the case that one knows the proportions of each class better than the properties of each class that are useful for classification.
Weakly supervised classification is a new machine learning paradigm in which training is performed directly on (unlabeled) data. The task of training a classifier on multidimensional data based only on class proportions is highly underconstrained. Neural network training is already a nonconvex problem, but removing all local information about class labels significantly increases the difficulty of the optimization. However, the field of multi-instance learning (MIL)
DIETTERICH199731 has shown that local information is not necessarily needed for classification (see Ref. Amores201381 for a review of recent work in MIL). The setup of MIL is a series of sets ('bags') of individual instances without individual labels. Consider the task of distinguishing two classes, called 0 and 1. For the training set, it is known whether a bag contains at least one instance of class 1. The algorithm is then optimized to identify the presence of at least one instance of class 1 in an unseen bag. Recent work has extended this procedure to identify the class of individual instances, still training only on bag-level labels Kotzias:2015:GIL:2783258.2783380 . In this paper we make the supervision even weaker, in that bag labels are only known on average. In particular, all that is known to the training is the expected fraction of class 1 in any particular bag. This paradigm is also referred to as Learning with Label Proportions (LLP) NIPS2014_5453 .

High energy quarks and gluons produced in reactions at the Large Hadron Collider (LHC) result in collimated streams of particles traveling at nearly the speed of light, known as jets. One of the most challenging classification tasks in high energy physics is to distinguish quark-induced jets from gluon-induced jets based on their radiation pattern. There is an extensive literature exploring discriminating observables Gallicchio:2011xq . However, standard quark and gluon discriminants are known to be poorly modeled by state-of-the-art simulations Aad:2014gea ; Badger:2016bpw . Despite this, the fraction of quark and gluon jets in a given sample is often well-known. At a fixed order in perturbation theory, the probability for an outgoing parton to be a quark or a gluon depends on well-known parton distribution functions and matrix elements. Therefore, quark versus gluon discrimination is well-suited for weakly supervised classification and is the main example used later in this paper.
This paper is organized as follows. Section 2 formally introduces weakly supervised classification and describes how it is applied in practice. The technique is illustrated using quark versus gluon jet tagging in Sec. 3. The paper ends in Sec. 4 with some concluding remarks. The source code implemented to produce the results presented in this paper is available at DOI: 10.5281/zenodo.322813.
2 Weakly supervised classification
Given a set of data originating from two classes labeled 0 and 1, the goal of classification is to construct a function $f : \mathbb{R}^n \to \{0, 1\}$, where $n$ is the dimensionality of the feature space used to discriminate the two classes. In the traditional classification paradigm of fully supervised training, the function $f$ is built by minimizing a loss function like the following:

(1)  $f = \mathrm{argmin}_{f'} \sum_{i=1}^{N} \ell(f'(x_i), y_i),$

where $N$ is the number of labeled examples available for training, $\ell$ is a loss function with $\ell(y, y) = 0$, $x_i$ is the feature vector of example $i$, and $y_i \in \{0, 1\}$ is its true label. A common choice for $\ell$ is the squared error. In order to provide flexibility and stability, one often modifies the original problem to take $f : \mathbb{R}^n \to [0, 1]$, and the output is interpreted as the probability for an event to be in class 0 or 1. The ideal classifier that one tries to approximate with Eq. 1 is based on the likelihood ratio $p_1(x)/p_0(x)$, where $p_y$ is the $n$-dimensional probability density of the feature vector $x$ for the class $y$. Weakly supervised classification is a new paradigm in which, instead of knowing the individual labels $y_i$, all that is known is the proportion of events in either class. Concretely, the training data are grouped into batches $b$, and for each batch only the fraction $y_b$ of class-1 events is known. The weakly supervised $f$ is then given by

(2)  $f = \mathrm{argmin}_{f'} \sum_{b} \ell\!\left( \frac{1}{N_b} \sum_{i \in b} f'(x_i),\; y_b \right),$

where $N_b$ is the number of events in batch $b$.
The argument of Eq. 2 is nonconvex, with many minima. In particular, with a single batch of proportion $y$, the trivial constant solution $f'(x) = y$ results in a loss of zero. However, using multiple batches of data with different proportions is sufficient to collapse the solution space, so long as the distribution of the discriminating features for a particular class, $p_y(x)$, is the same in every batch. To build intuition for why there is any hope of solving this problem, consider a case with two batches $a$ and $b$ with proportions $y_a \neq y_b$. Consider an $n$-dimensional histogram where the $j$-th dimension captures a discretized version of the $j$-th discriminating feature. If the $j$-th dimension has $k_j$ bins, then the total number of bins in the histogram is $\prod_j k_j$. One can always rearrange the bins so that instead of an $n$-dimensional histogram there is a one-dimensional histogram with $m = \prod_j k_j$ bins. As visualizing high-dimensional histograms can be cumbersome, let $h_a$ be the one-dimensional histogram with $m$ bins for batch $a$ and $h_b$ be the corresponding histogram for batch $b$. Then, for each bin $i$, one can write
(3)  $h_{a,i} = y_a\, f_i + (1 - y_a)\, g_i,$

(4)  $h_{b,i} = y_b\, f_i + (1 - y_b)\, g_i,$

where $h_{a,i}$ is the content of the $i$-th bin of the histogram $h_a$, and $f_i$ and $g_i$ are the class-1 and class-0 contents of bin $i$. Except for contrived scenarios, Eqs. 3 and 4 have a unique solution for $f_i$ and $g_i$, which are discretized versions of the probability densities $p_1$ and $p_0$. One can then form an (approximately) optimal classifier from the ratio of histograms with bin contents $f_i/g_i$. If the number of dimensions is large, one can add a further step that uses machine learning to approximate the optimal classifier from $f$ and $g$. As a result, the problem is completely solvable. Weakly supervised training combines the classification step with the first step and does so without binning. Solving Eqs. 3 and 4 'by hand' is intractable when $n$ is relatively large or the number of examples is relatively small. It is also complicated when there are more than two batches (an overconstrained system). These challenges are all naturally handled by the all-in-one machine learning approach of weakly supervised classification, as illustrated below.
In the weakly supervised training used in the following examples, $f$ in Eq. 2 is parametrized as a three-layer neural network with three inputs, a hidden layer with 30 neurons, and a sigmoid output. We use the Adam optimizer DBLP:journals/corr/KingmaB14 in Keras chollet2015keras with a learning rate of 0.009 and train for 25 iterations. As a reference, we consider a traditional classifier

(5)  $f = \mathrm{argmin}_{f'} \sum_{i} \ell(f'(x_i), y_i),$

where $i$ labels the individual instances and $f$ is parametrized as a three-layer neural network with three inputs, a hidden layer with 10 neurons, and a sigmoid output. Minimization is performed with stochastic gradient descent in Keras with a learning rate of 0.01 run for 40 iterations. For each training, both networks are initialized with random weights drawn from a normal distribution.
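To make the objective of Eq. 2 concrete without the full Keras setup, the sketch below is an illustrative toy, with made-up Gaussian data and a one-dimensional logistic model in place of the network described above. It minimizes, by plain gradient descent, the squared difference between each batch's mean classifier output and that batch's known class-1 proportion; the per-event labels are used only to generate the toy data, never in training:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three training batches with different known class-1 proportions.
proportions = [0.2, 0.3, 0.4]
batches = []
for y_b in proportions:
    labels = rng.random(5000) < y_b
    x = np.where(labels, rng.normal(2.0, 1.0, 5000), rng.normal(0.0, 1.0, 5000))
    batches.append((x, y_b))  # only x and the proportion y_b are kept

# One-dimensional logistic model sigmoid(w*x + c).
w, c, lr = 0.1, 0.0, 0.5

def loss(w, c):
    # Squared difference between each batch's mean output and its proportion.
    return sum((sigmoid(w * x + c).mean() - y_b) ** 2 for x, y_b in batches)

loss0 = loss(w, c)
for _ in range(2000):
    gw = gc = 0.0
    for x, y_b in batches:
        s = sigmoid(w * x + c)
        resid = s.mean() - y_b        # batch-level residual, as in Eq. 2
        gw += 2 * resid * (s * (1 - s) * x).mean()
        gc += 2 * resid * (s * (1 - s)).mean()
    w, c = w - lr * gw, c - lr * gc
```

With several batches of different proportions, a constant output cannot reach zero loss, so the optimizer is pushed toward a genuinely discriminating solution (here, a positive weight on the feature that separates the two Gaussians).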
Table 1:

  Feature   mean (class 0)   std. dev. (class 0)   mean (class 1)   std. dev. (class 1)
  1         26               8                     18               7
  2         0.09             0.04                  0.06             0.04
  3         0.28             0.04                  0.23             0.05

Mean ($\mu$) and standard deviation ($\sigma$) of the normal distributions of each feature for the two classes.

Figure 1 shows the weakly supervised classifier performance when training with 9 subsets of data with proportions between 0.2 and 0.4, compared with that of the fully supervised one. Three features are constructed so that the distribution of feature $j$ given class $y$ follows a normal distribution with mean $\mu_{j,y}$ and standard deviation $\sigma_{j,y}$. For reference, the values of $\mu$ and $\sigma$ used for the example shown in Fig. 1 are in Table 1. Both the traditional and weakly supervised classifiers have the same Receiver Operating Characteristic (ROC) curve and thus identical classification performance. Note that the loss for weakly supervised classification is symmetric with respect to swapping the class assignment; therefore, the classifier output for a given training can give higher values for class 0, while for a different training it can give higher values for class 1.
As with any machine learning algorithm with inherent randomness, the performance of a weakly supervised classifier has a stochastic component. This is quantified by retraining the same network many times with a different random number seed in each iteration. The interquartile range (IQR) of the Area Under the Curve (AUC) values over these trainings is a measure of the spread due to the inherent randomness. Figure 2 shows the AUC IQR for the toy example with one proportion fixed and the second proportion scanned over a range of values. The stability improves as the difference between the class proportions increases. In addition to the performance varying less as the proportions are further apart, the overall performance, quantified by the median AUC, also improves (increases). The improvement in the median AUC is not as dramatic as the reduction in the AUC IQR, but it does suggest that the learning problem is (slightly) easier when the proportions are very different. (Even when the proportions differ by only a few percent, stable performance can be achieved if multiple subsets with different proportions are used for training.) This makes sense in the context of the two-step intuition-building paradigm given above: the algorithm can spend more of its capacity on the classification task if it is easier to extract the class distributions.
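For concreteness, the spread statistics used here can be computed as follows; the AUC values in this sketch are made-up placeholders, not measurements from this paper:

```python
import numpy as np

# Hypothetical AUC values from repeated trainings with different random seeds.
aucs = np.array([0.86, 0.84, 0.87, 0.71, 0.85, 0.83, 0.88, 0.80])

q1, q3 = np.percentile(aucs, [25, 75])
iqr = q3 - q1                  # spread due to inherent training randomness
median_auc = np.median(aucs)   # overall performance
```

The IQR is preferred over the standard deviation here because occasional badly converged trainings (like the 0.71 outlier above) would otherwise dominate the spread estimate.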
3 Example: quark and gluon jet discrimination
Due to the strength of the strong force, there is a plethora of gluon jets produced at the LHC. However, many processes result in mostly quark jets. Prominent examples include the identification of hadronically decaying bosons CMS:2014joa ; Aad:2015owa , jets associated with vector boson fusion Khachatryan:2015bnx ; Khachatryan:2014dea ; Aaboud:2016cns , and multi-quark final states resulting from supersymmetry Bhattacherjee:2016bpy . The references given here are the small number of public results that mention quark/gluon tagging, but there are many more analyses that would benefit from a tagger if a robust technique existed.
The weakly supervised classification strategy is particularly useful for quark/gluon tagging because the fraction of quark jets in a particular set of events is well-known from parton distribution functions and matrix element calculations, while useful discriminating features have not been computed to high accuracy and simulations often mismodel the data. To illustrate this concrete example, quark and gluon jets are simulated and a weakly supervised classifier is trained on the generated event sample. Unlike for real data, in the simulated sample we also know per-event labels, which are used to additionally train a fully supervised classifier. Events with quark-gluon scattering (dijet events) are simulated using the Pythia 8.18 Pythia8 event generator. Jets are clustered using the anti-$k_t$ algorithm Cacciari:2008gp with distance parameter $R$ via the FastJet 3.1.3 fastjet package. Jets are classified as quark- or gluon-initiated by considering the type of the highest-energy quark or gluon in the full generator event record that is inside the radius of the jet axis. For simplicity, one transverse momentum range is considered: 45 GeV $< p_T <$ 55 GeV. Additionally, there is a pseudorapidity requirement that mimics the usual detector acceptance for charged-particle tracking. Heuristically, gluons have roughly twice as much strong-force charge as quarks, resulting in jets with more constituents and a broader radiation pattern. Therefore, the following variables are useful for quark/gluon discrimination: the number of jet constituents, the first radial moment in $p_T$ (the jet width), and the fraction of the jet $p_T$ carried by the leading anti-$k_t$ subjet. The constituents considered for computing these observables are the hadrons in the jet above a minimum transverse momentum threshold.

Figure 3 caption: the fully supervised classifier (blue line) is trained on a labeled simulated training sample; the weakly supervised classifier (red line) is trained on an unlabeled pseudo-data training sample. In both cases, the performance is evaluated on the same pseudo-data test sample. The ratios to the performance of a fully supervised classifier trained on a labeled pseudo-data sample are shown in the bottom pad.
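Two of these observables can be sketched directly from a jet's constituents. In the illustration below the constituent values are hypothetical, and the jet axis is approximated by the $p_T$-weighted centroid of the constituents (an assumption for simplicity; analyses typically measure $\Delta R$ from the jet four-vector axis):

```python
import numpy as np

# Hypothetical jet constituents: transverse momentum (GeV), eta, phi.
pt  = np.array([30.0, 10.0, 5.0, 2.0])
eta = np.array([0.00, 0.05, -0.10, 0.20])
phi = np.array([0.00, -0.04, 0.08, -0.15])

n_constituents = len(pt)

# Approximate jet axis as the pt-weighted centroid of the constituents.
axis_eta = np.average(eta, weights=pt)
axis_phi = np.average(phi, weights=pt)

# Angular distance of each constituent to the axis
# (ignoring phi wrap-around, fine for these small values).
dr = np.hypot(eta - axis_eta, phi - axis_phi)

# Jet width: first radial moment in pt, i.e. the pt-weighted mean distance.
width = np.sum(pt * dr) / np.sum(pt)
```

A narrow quark-like jet concentrates its $p_T$ near the axis and yields a small width, while a broader gluon-like radiation pattern yields a larger one.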
A weakly supervised classifier with one hidden layer of size 30 is trained by considering 12 bins of the distribution of the absolute difference in pseudorapidity between the two jets Aad:2016oit . The proportion of quark-initiated jets varies between 0.21 and 0.32 across these bins. Figure 3 shows that, while the individual observables perform differently in the high and low gluon efficiency (true positive rate) regimes, their combination in a neural network (NN) gives consistently better performance. The weakly supervised classifier matches the performance of the fully supervised NN, despite only knowing sample proportions instead of individual event labels. By construction, the weakly supervised classifier is also robust against a realistic amount of mismodeling in the input variables. This feature is tested by building a pseudo-data sample in which the probability distributions of two of the input variables are distorted in the training sample to emulate the difference in efficiency measured in Ref. Aad:2014gea . The study in Ref. Aad:2014gea found that a classifier extracted from simulation is more powerful than one extracted from the data. This is reflected in the results presented in the right plot of Fig. 3. When a fully supervised classifier is trained on a sample generated with the same distribution as the test sample (mimicking training and testing on simulation), it achieves better performance than when trained on the original sample and tested on the distorted pseudo-data (mimicking training on simulation and testing on data). In contrast, the weakly supervised classifier can be trained directly on the distorted pseudo-data sample (representing the data) and is therefore insensitive to the mismodeling of the input variables. The result is a roughly 10% bias from the standard procedure that is avoided by the weakly supervised classifier. Even larger differences may be expected for classification tasks that use more input features or are more severely mismodeled. The weakly supervised classifier is robust and outperforms standard supervised learning trained on simulation.

4 Conclusions
We have presented a new approach to classification with neural networks for cases where class proportions are known but individual labels are not readily available. This weakly supervised classification has broad applicability and has been demonstrated on one important discrimination task in high energy physics: quark versus gluon jet tagging. In the quark/gluon and related contexts, weakly supervised classification provides a robust and powerful approach because it can be trained directly on examples from (unlabeled) data instead of (labeled, but unreliable) simulation. The examples presented so far have used a small number of input features to illustrate the ideas, but there is no algorithmic limitation on the number of features. Figure 4 is a simple extension of Fig. 1 with 5 features instead of 3; in future work, we will study the extension to many more features (tens to hundreds). This paper has laid the conceptual groundwork for a new classification paradigm that can be applied to a wide variety of learning problems to boost performance and robustness when detailed simulations are unreliable or unavailable.
5 Acknowledgments
This work is supported by the Stanford Data Science Initiative and by the US Department of Energy (DOE) under grant DE-AC02-76SF00515. We would like to thank Russell Stewart for useful discussions about label-free supervision strategies and nonconvex optimization problems, and Gilles Louppe for useful discussion about related work on learning from label proportions.
References
 (1) T. G. Dietterich, R. H. Lathrop and T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1997), no. 1, 31–71.
 (2) J. Amores, Multiple instance classification: Review, taxonomy and comparative study, Artificial Intelligence 201 (2013) 81–105.

 (3) D. Kotzias, M. Denil, N. de Freitas and P. Smyth, From group to individual labels using deep features, in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, (New York, NY, USA), pp. 597–606, ACM, 2015.
 (4) G. Patrini, R. Nock, P. Rivera and T. Caetano, (Almost) no label no cry, in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger, eds.), pp. 190–198. Curran Associates, Inc., 2014.
 (5) J. Gallicchio and M. D. Schwartz, Quark and Gluon Tagging at the LHC, Phys. Rev. Lett. 107 (2011) 172001 [1106.3076].
 (6) ATLAS Collaboration, G. Aad et al., Light-quark and gluon jet discrimination in pp collisions at √s = 7 TeV with the ATLAS detector, Eur. Phys. J. C74 (2014), no. 8, 3023 [1405.6583].
 (7) J. R. Andersen et al., Les Houches 2015: Physics at TeV Colliders Standard Model Working Group Report, in 9th Les Houches Workshop on Physics at TeV Colliders (PhysTeV 2015), Les Houches, France, June 1–19, 2015, 2016. 1605.04692.
 (8) D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014).
 (9) F. Chollet, “Keras.” https://github.com/fchollet/keras, 2015.
 (10) CMS Collaboration, V Tagging Observables and Correlations, CMSPASJME14002 (2014).
 (11) ATLAS Collaboration, G. Aad et al., Search for high-mass diboson resonances with boson-tagged jets in proton-proton collisions at √s = 8 TeV with the ATLAS detector, JHEP 12 (2015) 055 [1506.00962].
 (12) CMS Collaboration, V. Khachatryan et al., Search for the standard model Higgs boson produced through vector boson fusion and decaying to bb̄, Phys. Rev. D92 (2015), no. 3, 032008 [1506.01010].
 (13) CMS Collaboration, V. Khachatryan et al., Measurement of electroweak production of two jets in association with a Z boson in proton-proton collisions at √s = 8 TeV, Eur. Phys. J. C75 (2015), no. 2, 66 [1410.3153].
 (14) ATLAS Collaboration, M. Aaboud et al., Search for the Standard Model Higgs boson produced by vector-boson fusion and decaying to bottom quarks in √s = 13 TeV pp collisions with the ATLAS detector, JHEP 11 (2016) 112 [1606.02181].
 (15) B. Bhattacherjee, S. Mukhopadhyay, M. M. Nojiri, Y. Sakaki and B. R. Webber, Quark-gluon discrimination in the search for gluino pair production at the LHC, JHEP 01 (2017) 044 [1609.08781].
 (16) T. Sjostrand, S. Mrenna and P. Z. Skands, A Brief Introduction to PYTHIA 8.1, Comput. Phys. Commun. 178 (2008) 852–867 [0710.3820].
 (17) M. Cacciari, G. P. Salam and G. Soyez, The anti-k_t jet clustering algorithm, JHEP 04 (2008) 063 [0802.1189].
 (18) M. Cacciari, G. P. Salam and G. Soyez, FastJet User Manual, Eur. Phys. J. C72 (2012) 1896 [1111.6097].
 (19) ATLAS Collaboration, G. Aad et al., Measurement of the charged-particle multiplicity inside jets from √s = 8 TeV pp collisions with the ATLAS detector, Eur. Phys. J. C76 (2016), no. 6, 322 [1602.00988].