I Introduction
Traditionally, a classification or detection model is trained on data with detailed annotations showing the class label and location of each instance. Preparing such training data requires a lot of manual work. Segmenting natural data such as images and speech into pieces belonging to different classes can be artificial and error prone because the boundaries among instances are blurred. Furthermore, models learned on such data typically expect the test data to be preprocessed in similar ways, which significantly limits their applicability. Let us take image classification as an example task to make these statements concrete. Both the classic convolutional neural network (CNN) LeNet5 [LeCun et al.(1998)] and modern ones like AlexNet [Alex et al.(2012)] assume that one image contains only one instance of a class during both the training and test stages. Their last one or several layers are fully connected, so they further require the training and test images to have the same size. These limitations complicate their use in practice, where images come with diverse sizes and resolutions, and one image can be cluttered with multiple instances from one or multiple classes. Hence, more complicated object detection and localization models have been developed on top of these classic one-instance-per-sample classification models. A recent review of such methods can be found in [Huang et al.(2017)]. However, learning these more advanced models requires marking out the class label and location of each object in all the training images. Needless to say, labeling work with such fine details is labor intensive and error prone.
This paper proposes an alternative solution to the classification or detection problem by relaxing the labeling requirements. For each class and each training sample, we only provide a binary label showing whether any instance of this class exists in this sample or not. As a result, the model has no need to predict details such as the number of instances, their locations, or their relative order. It only predicts the likelihood of the existence of any instance from a set of class labels. One might doubt the usefulness of learning from such a limited task. However, the very simplicity of this setting might enable us to design general and elegant solutions to many tough problems. Figure 1 demonstrates the use of an all convolutional network trained by our method to detect the house numbers in images from the street view house numbers (SVHN) dataset [Netzer et al.(2011)]. Without access to any detailed labeling information during the training stage, such as the ground truth bounding boxes of digits in the images, the learned all convolutional network can recognize and localize each digit in the images without further processing.
II Overview of our method
We use image recognition as the example task to introduce our notations and formulation of the problem. However, one should be aware that our method is not limited to image recognition with CNNs. For example, one might use our method to detect certain audio events in an audio stream using recurrent neural networks (RNNs).
II-A Binary labels
We consider an image classification or recognition problem. Let $I$ be an input image tensor of size $(K, H, W)$, where $K$, $H$ and $W$ are the number of channels, height and width of the image, respectively. The number of classes is $C$. Without loss of generality, we assume that these class labels are chosen from set $\mathcal{C} = \{1, 2, \ldots, C\}$. Each image $I$ is assigned a unique vector label $l = [l_1, l_2, \ldots, l_C]$ with length $C$, where $l_c$ can only take binary values, e.g., $0$ or $1$. By convention, $l_c = 1$ suggests that at least one instance of the $c$th class exists in the associated image, and $l_c = 0$ means that no instance of the $c$th class exists in the image. It is clear that label $l$ can take at most $2^C$ distinct values. It is convenient to represent $l$ as a set denoted by $\mathcal{L}$, where $l_c = 1$ if $c \in \mathcal{L}$, and $l_c = 0$ when $c$ is not in set $\mathcal{L}$. Clearly, $\mathcal{L}$ only needs to contain distinct class labels. Unlike $l$, $\mathcal{L}$ may have variable length. Specifically, we have $\mathcal{L} = \{\}$, an empty set, when $l$ is a vector of zeros, and $\mathcal{L} = \mathcal{C}$ when $l$ is a vector of ones. We typically put $I$ and its associated label $l$ or $\mathcal{L}$ together as pair $(I, l)$ or $(I, \mathcal{L})$. One should not confuse the class label of an instance in the image with the label $l$ or $\mathcal{L}$ for the whole image.
A few examples will clarify our conventions. We consider the house number recognition task demonstrated in Figure 1. There are ten classes, and we assign one class label in $\mathcal{C}$ to each of the ten digits. The third input image contains four digits. Its binary vector label $l$ has ones at the entries of the digit classes that appear in the image, and zeros elsewhere. Equivalently, we can set its label as the set $\mathcal{L}$ of those classes, listed in any order. When an image contains no digit, its label is either the zero vector $l = [0, 0, \ldots, 0]$, or simply $\mathcal{L} = \{\}$. We mainly adopt the notation of $\mathcal{L}$ in the rest of this paper.
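The two label forms described above carry the same information, and converting between them is mechanical. The following minimal Python sketch illustrates this under the assumption that class labels are the integers from 1 to the number of classes; the function names `vector_to_set` and `set_to_vector` are hypothetical and chosen only for this illustration.

```python
def vector_to_set(binary_label):
    """Return the set of 1-based class labels whose binary entry equals 1."""
    return {c for c, bit in enumerate(binary_label, start=1) if bit == 1}


def set_to_vector(label_set, num_classes):
    """Return the binary vector of length num_classes encoding label_set."""
    return [1 if c in label_set else 0 for c in range(1, num_classes + 1)]


# Hypothetical example: ten classes, instances of classes 2 and 4 are present.
example_vector = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
example_set = vector_to_set(example_vector)
round_trip = set_to_vector(example_set, num_classes=10)
```

The set form has variable length and ignores how many instances of a class are present, which is exactly the information the binary vector also discards.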
II-B Maximum likelihood model estimation
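Let $f(\theta; I)$ be an image recognition model parameterized with trainable parameter vector $\theta$. It accepts $I$ as its input, and produces an output tensor $P$ with shape $(C+1, M, N)$, where $C$ is the number of classes, and $M$ and $N$ are two positive integers determined by $f$ and the size of its input image. Clearly, $P$ is a function of $\theta$ and $I$, although we do not explicitly show this dependence to simplify our notations. We define $P$ as a probability tensor such that its $(\ell, m, n)$th element has meaning
$$p_{\ell,m,n} = \Pr(\text{emitting class label } \ell \text{ at location } (m, n)), \quad \ell \in \{\phi, 1, 2, \ldots, C\}, \tag{1}$$
where $\Pr$ denotes the probability of an event, and $\phi$ can be understood as a class label for the background. It is convenient to assign value $0$ to class label $\phi$ to simplify our notations. Thus, we will use $\phi$ and $0$ interchangeably. By definition, $P$ is a nonnegative tensor, and has property
$$\sum_{\ell=0}^{C} p_{\ell,m,n} = 1 \quad \text{for all } 1 \le m \le M,\ 1 \le n \le N. \tag{2}$$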
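A probability tensor with the normalization property above can be obtained by applying a softmax over the class axis at every location. The sketch below is a plain-Python illustration, assuming the raw model outputs are stored as a nested list `logits[c][m][n]` with index `c = 0` reserved for the background label; the function name `softmax_over_classes` and this storage layout are assumptions made for illustration only.

```python
import math


def softmax_over_classes(logits):
    """Normalize logits[c][m][n] over the class axis c at every location (m, n)."""
    num_labels = len(logits)
    height = len(logits[0])
    width = len(logits[0][0])
    probs = [[[0.0] * width for _ in range(height)] for _ in range(num_labels)]
    for m in range(height):
        for n in range(width):
            column = [logits[c][m][n] for c in range(num_labels)]
            peak = max(column)  # subtract the maximum for numerical stability
            exps = [math.exp(v - peak) for v in column]
            total = sum(exps)
            for c in range(num_labels):
                probs[c][m][n] = exps[c] / total
    return probs


# Hypothetical logits: background plus two classes, on a 1 x 2 grid.
example_logits = [[[0.0, 1.0]], [[2.0, 1.0]], [[-1.0, 1.0]]]
example_probs = softmax_over_classes(example_logits)
```

Every entry of the result is nonnegative, and the entries at a fixed location sum to one across the class axis.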
For example, $f(\theta; I)$ could be a CNN cascaded with a softmax layer to generate normalized probabilities at all locations. Then, it is possible to calculate the conditional probability of a label $\mathcal{L}$ given the model $f(\theta; \cdot)$ and the input image $I$. We denote this probability with $\Pr(\mathcal{L} \mid \theta, I)$. Now, the maximum likelihood model parameter vector is given by
$$\hat{\theta} = \arg\max_{\theta} E\left[\log \Pr(\mathcal{L} \mid \theta, I)\right], \tag{3}$$
where $E$ denotes taking expectation over independent pairs of images and their associated labels. To calculate $\Pr(\mathcal{L} \mid \theta, I)$ analytically, we need to make the following working assumption.
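Assumption 1: Given any input image $I$, the model generates independent probability predictions at different locations, such that for any two distinct locations $(m_1, n_1) \ne (m_2, n_2)$ and any class labels $i$ and $j$, the probability of emitting class label $i$ at location $(m_1, n_1)$ and emitting class label $j$ at location $(m_2, n_2)$ equals $p_{i,m_1,n_1}\, p_{j,m_2,n_2}$, where $p_{\ell,m,n}$ is the probability of emitting class label $\ell$ at location $(m, n)$.
Clearly, when the model consists of a CNN and a softmax layer, Assumption 1 only approximately holds for locations $(m_1, n_1)$ and $(m_2, n_2)$ close to each other, due to weight sharing in the CNN and correlations among adjacent pixels in most natural images. Nevertheless, calculating the likelihood of a label becomes intractable without Assumption 1.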
Let us consider two simple examples for the calculation of . In the first example, we have when , where is the order of . In the second example, we assume that and has shape . Recall that suggests that at least one instance of the th class appears in , and no instance of any other class shows up. Hence, only the following combinations of class labels emitted by the model are acceptable,
By Assumption 1, the probability of each combination is the product of the probabilities emitted at all locations. will be the sum of the probabilities of all feasible combinations, i.e.,
The above naive method for calculating has complexity even in the case of . We provide two affordable methods for calculating in the next section.
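The naive calculation can be made concrete with a brute-force enumeration. The sketch below assumes the probability tensor is stored as a nested list `p[c][m][n]` with index `c = 0` for the background, enumerates every combination of labels emitted over all locations, and sums the probabilities of the combinations consistent with a given label set. The function name `brute_force_likelihood` is hypothetical, and the number of combinations grows as (number of labels) raised to the number of locations, so this is only usable on toy inputs.

```python
from itertools import product


def brute_force_likelihood(p, label_set):
    """Sum the probabilities of all emission patterns consistent with label_set.

    A pattern is consistent when the set of non-background labels it emits
    equals label_set exactly. Locations are treated as independent.
    """
    num_labels = len(p)
    locations = [(m, n) for m in range(len(p[0])) for n in range(len(p[0][0]))]
    target = set(label_set)
    total = 0.0
    for pattern in product(range(num_labels), repeat=len(locations)):
        if set(pattern) - {0} != target:
            continue
        prob = 1.0
        for c, (m, n) in zip(pattern, locations):
            prob *= p[c][m][n]
        total += prob
    return total


# Hypothetical tensor: background plus two classes, on a 1 x 2 grid.
toy_p = [[[0.5, 0.2]], [[0.3, 0.7]], [[0.2, 0.1]]]
likelihood_class_1 = brute_force_likelihood(toy_p, {1})
```

Because every emission pattern corresponds to exactly one label set, the likelihoods of all possible label sets sum to one, which gives a simple sanity check for any faster method.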
III Model likelihood calculation
III-A Alternating series expression
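To begin with, let us introduce a quantity $\alpha_{\ell_1 \ell_2 \ldots \ell_L}$. It denotes the probability of the event that at least one instance of any class from set $\{\ell_1, \ell_2, \ldots, \ell_L\}$ shows up in $I$, and no instance from any other class shows up. The class label in this set can be $\phi$, i.e., $0$ by our convention. Still, all these class labels are required to be distinct. Thus, we have $L \le C + 1$. With this notation, and under Assumption 1, $\alpha_{\ell_1 \ell_2 \ldots \ell_L}$ is given by
$$\alpha_{\ell_1 \ell_2 \ldots \ell_L} = \prod_{m=1}^{M} \prod_{n=1}^{N} \sum_{i=1}^{L} p_{\ell_i, m, n}. \tag{4}$$
Specifically, we have
$$\alpha_{\phi} = \prod_{m=1}^{M} \prod_{n=1}^{N} p_{\phi, m, n}. \tag{5}$$
The following statement can be used to calculate $\Pr(\mathcal{L} \mid \theta, I)$ efficiently. With a slight abuse of notations, we use $\Pr(\mathcal{L} \mid P)$ to denote this probability since $P$ is uniquely determined by $\theta$ and $I$.
Proposition 1: Let $\mathcal{L} = \{\ell_1, \ell_2, \ldots, \ell_L\}$ be a set of distinct class labels excluding $\phi$. The probability $\Pr(\mathcal{L} \mid P)$ is given by the sum of the following $2^L$ terms,
$$\Pr(\mathcal{L} \mid P) = \sum_{\mathcal{S} \subseteq \mathcal{L}} (-1)^{L - |\mathcal{S}|}\, \alpha_{\mathcal{S} \cup \{\phi\}}, \tag{6}$$
where the sum runs over all subsets $\mathcal{S}$ of $\mathcal{L}$, including the empty set, and $\alpha_{\mathcal{S} \cup \{\phi\}}$ is the quantity in (4) whose subscripts are the labels in $\mathcal{S}$ together with $\phi$.
We have outlined its proof in Appendix A. Here, we consider a few special cases to have an intuitive understanding of Proposition 1. By definition, we have
$$\Pr(\{\} \mid P) = \alpha_{\phi}.$$
When only instances of the $\ell$th class can show up, we have
$$\Pr(\{\ell\} \mid P) = \alpha_{\ell \phi} - \alpha_{\phi}, \tag{7}$$
where the term $-\alpha_{\phi}$ compensates the probability that only the class label $\phi$ is emitted by the model. With order $L = 2$, we have
$$\Pr(\{\ell_1, \ell_2\} \mid P) = \alpha_{\ell_1 \ell_2 \phi} - \alpha_{\ell_1 \phi} - \alpha_{\ell_2 \phi} + \alpha_{\phi},$$
where the term $-\alpha_{\ell_1 \phi} - \alpha_{\ell_2 \phi}$ compensates the probability that an instance from either one of the two classes is missing, and $+\alpha_{\phi}$ is here since it is counted twice in sum $-\alpha_{\ell_1 \phi} - \alpha_{\ell_2 \phi}$. In general, we will observe the pattern of alternating signs in (6).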
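The alternating pattern can be checked numerically. The sketch below reads Proposition 1 as an inclusion-exclusion sum: for every subset of the label set, take the product over locations of the summed probabilities of the background and the labels in that subset, and add it with sign $(-1)^{L-k}$, where $k$ is the subset size. This reading, the nested-list layout `p[c][m][n]` with `c = 0` for the background, and the names `alpha` and `likelihood_alternating` are assumptions made for illustration.

```python
from itertools import combinations


def alpha(p, labels):
    """Product over all locations of the summed probabilities of `labels`."""
    value = 1.0
    for m in range(len(p[0])):
        for n in range(len(p[0][0])):
            value *= sum(p[c][m][n] for c in labels)
    return value


def likelihood_alternating(p, label_set):
    """Inclusion-exclusion likelihood of label_set; background is label 0."""
    labels = sorted(label_set)
    order = len(labels)
    total = 0.0
    for k in range(order + 1):
        sign = (-1) ** (order - k)
        for subset in combinations(labels, k):
            total += sign * alpha(p, (0,) + subset)
    return total


# Hypothetical tensor: background plus two classes, on a 1 x 2 grid.
toy_p = [[[0.5, 0.2]], [[0.3, 0.7]], [[0.2, 0.1]]]
likelihood_both = likelihood_alternating(toy_p, {1, 2})
```

The cost is exponential in the number of labels in the set, but no longer depends exponentially on the number of locations, in contrast with direct enumeration of emission patterns.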
III-B Recursive expression
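We introduce another quantity $\beta_{\ell_1 \ell_2 \ldots \ell_L}$. It denotes the probability of the event that at least one instance of each class from set $\{\ell_1, \ell_2, \ldots, \ell_L\}$ appears in $I$, and no instance of any other class shows up. Again, the class label in this set can be $\phi$, i.e., $0$ by our convention. All these class labels are distinct. Thus, we have $L \le C + 1$ as well. By this definition, we have
$$\beta_{\ell} = \prod_{m=1}^{M} \prod_{n=1}^{N} p_{\ell, m, n}. \tag{8}$$
However, for $L > 1$, we can only calculate $\beta_{\ell_1 \ell_2 \ldots \ell_L}$ recursively by relationship
$$\beta_{\ell_1 \ell_2 \ldots \ell_L} = \prod_{m=1}^{M} \prod_{n=1}^{N} \sum_{i=1}^{L} p_{\ell_i, m, n} - \sum_{\emptyset \ne \mathcal{S} \subsetneq \{\ell_1, \ldots, \ell_L\}} \beta_{\mathcal{S}}, \tag{9}$$
where the first term is the quantity given in (4), the sum runs over all nonempty proper subsets of $\{\ell_1, \ldots, \ell_L\}$, and the initial conditions are given by (8). With the above notations, $\Pr(\mathcal{L} \mid P)$ can be calculated using the following expression,
$$\Pr(\mathcal{L} \mid P) = \prod_{m=1}^{M} \prod_{n=1}^{N} \Big( p_{\phi, m, n} + \sum_{i=1}^{L} p_{\ell_i, m, n} \Big) - \sum_{\mathcal{T}} \beta_{\mathcal{T}}, \tag{10}$$
where $\mathcal{T}$ runs over all nonempty subsets of $\mathcal{L} \cup \{\phi\}$ that miss at least one class label in $\mathcal{L}$.
Note that both the expressions given in (6) and (10) have complexity $O(2^L)$. Neither one will be affordable for large enough $L$. The series in (6) has alternating signs, and generally, its truncated version gives neither an upper bound nor a lower bound of $\Pr(\mathcal{L} \mid P)$. Compared with (6), one advantage of (10) is that it allows us to truncate the series in (10) to obtain an upper bound of $\Pr(\mathcal{L} \mid P)$ since all $\beta$'s are nonnegative. A truncated version of the series given in (10) could provide a useful approximation for $\Pr(\mathcal{L} \mid P)$ for arbitrarily large $L$.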
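The recursion can be sketched in Python with memoization. The quantity defined in this subsection is the probability that the set of emitted labels equals a given set exactly, so it can be obtained by subtracting the same quantity for all nonempty proper subsets from the product-of-sums term. Rather than the series in (10), this sketch uses the two-term identity that follows directly from the definition: the likelihood of a label set is the probability of emitting exactly that set, plus the probability of emitting exactly that set together with the background. The nested-list layout `p[c][m][n]` with `c = 0` for the background and the names `make_beta` and `likelihood_recursive` are assumptions made for illustration.

```python
from functools import lru_cache
from itertools import combinations


def make_beta(p):
    """Return a memoized function giving the probability that the emitted
    label set equals the given sorted tuple of labels exactly."""
    locations = [(m, n) for m in range(len(p[0])) for n in range(len(p[0][0]))]

    def product_of_sums(labels):
        value = 1.0
        for m, n in locations:
            value *= sum(p[c][m][n] for c in labels)
        return value

    @lru_cache(maxsize=None)
    def beta(labels):
        value = product_of_sums(labels)
        # Remove the cases where some label in `labels` is never emitted.
        for k in range(1, len(labels)):
            for subset in combinations(labels, k):
                value -= beta(subset)
        return value

    return beta


def likelihood_recursive(p, label_set):
    """Likelihood of label_set, with or without background emissions."""
    beta = make_beta(p)
    labels = tuple(sorted(label_set))
    if not labels:
        return beta((0,))
    return beta(labels) + beta((0,) + labels)


# Hypothetical tensor: background plus two classes, on a 1 x 2 grid.
toy_p = [[[0.5, 0.2]], [[0.3, 0.7]], [[0.2, 0.1]]]
likelihood_class_1 = likelihood_recursive(toy_p, {1})
```

Since every value returned by `beta` is a probability, dropping some of the subtracted terms can only increase the result, which is the property that makes truncation of such a series give a one-sided bound.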
IV Relationship to existing techniques
IV-A Relationship to the classic classification problem
It is worthwhile to consider the classic classification problem where each sample only contains one instance of a class. For example, the commonly used image classification databases, e.g., MNIST [LeCun et al.(1998)] and ImageNet [Deng et al.(2009)], all assume the one-instance-per-sample setting. The image classification task itself is of great interest, and it is also one of the most important building blocks in achieving more complicated tasks, e.g., object detection and localization. In our notations, we have $\mathcal{L} = \{\ell\}$, a label set with a single class, for this setting. Still, we do not assume the number of instances of this specific class associated with the sample.
During the training stage, our method maximizes the logarithm likelihood objective function in (3), where the probability is given by (7). When the probability tensor $P$ has shape $(C+1, 1, 1)$, (7) reduces to
$$\Pr(\{\ell\} \mid P) = p_{\ell, 1, 1}. \tag{11}$$
Then, maximizing the logarithm likelihood objective function in (3) is equivalent to minimizing the cross entropy loss, a routine practice in training a classic classification model.
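During the test stage, we calculate the tensor $P$ for a test sample, and scan $\Pr(\{\ell\} \mid P)$ for all $\ell \in \{1, 2, \ldots, C\}$. The estimated class label is the one that maximizes $\Pr(\{\ell\} \mid P)$, i.e.,
$$\hat{\ell} = \arg\max_{1 \le \ell \le C} \Pr(\{\ell\} \mid P).$$
Noting that $\Pr(\{\ell\} \mid P) = \prod_{m,n} (p_{\phi,m,n} + p_{\ell,m,n}) - \prod_{m,n} p_{\phi,m,n}$ by (7), and the second product is independent of $\ell$, we simply have
$$\hat{\ell} = \arg\max_{1 \le \ell \le C} \sum_{m=1}^{M} \sum_{n=1}^{N} \log\left(p_{\phi,m,n} + p_{\ell,m,n}\right). \tag{12}$$
For the classic classification problem, we may have no need to consider the class label $\phi$ since, like the training samples, each test sample is supposed to be associated with a unique class label in range $[1, C]$. However, in many more general settings, e.g., object detection, class label $\phi$ plays an important role in separating one instance of a class from the background and other instances, either belonging to the same class or other classes.
For a well trained classifier, the model typically tends to emit either label $\ell$ or $\phi$ at any location $(m, n)$. Thus, we should have $p_{\phi,m,n} + p_{\ell,m,n} \approx 1$ for any pair $(m, n)$. With Taylor series
$$\log z = (z - 1) - \tfrac{1}{2}(z - 1)^2 + \cdots,$$
we have
$$\sum_{m=1}^{M} \sum_{n=1}^{N} \log\left(p_{\phi,m,n} + p_{\ell,m,n}\right) \approx \sum_{m=1}^{M} \sum_{n=1}^{N} \left(p_{\phi,m,n} + p_{\ell,m,n} - 1\right).$$
Thus, we could determine the class label simply by
$$\hat{\ell} = \arg\max_{1 \le \ell \le C} \sum_{m=1}^{M} \sum_{n=1}^{N} p_{\ell,m,n}. \tag{13}$$
Our experience suggests that (12) and (13) give identical results most of the time. However, (13) is more intuitive and easier to implement.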
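The two decision rules can be compared side by side in a few lines. The sketch below assumes that (12) amounts to maximizing, over the non-background classes, the sum over locations of the logarithm of the background probability plus the class probability, and that (13) amounts to maximizing the summed class probability. Both readings, the nested-list layout `p[c][m][n]` with `c = 0` for the background, and the names `classify_log_rule` and `classify_sum_rule` are assumptions made for illustration.

```python
import math


def classify_log_rule(p, eps=1e-12):
    """Pick the class maximizing sum over locations of log(p_bg + p_class)."""
    locations = [(m, n) for m in range(len(p[0])) for n in range(len(p[0][0]))]
    scores = {}
    for c in range(1, len(p)):
        # eps guards against log(0) when both probabilities vanish
        scores[c] = sum(math.log(p[0][m][n] + p[c][m][n] + eps) for m, n in locations)
    return max(scores, key=scores.get)


def classify_sum_rule(p):
    """Pick the class maximizing its summed probability over all locations."""
    locations = [(m, n) for m in range(len(p[0])) for n in range(len(p[0][0]))]
    scores = {c: sum(p[c][m][n] for m, n in locations) for c in range(1, len(p))}
    return max(scores, key=scores.get)


# Hypothetical tensor: background plus two classes, on a 1 x 2 grid.
toy_p = [[[0.5, 0.2]], [[0.3, 0.7]], [[0.2, 0.1]]]
label_from_log_rule = classify_log_rule(toy_p)
label_from_sum_rule = classify_sum_rule(toy_p)
```

The sum rule needs no logarithm and no special handling of the background channel, which is why it is the simpler of the two to implement.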
IV-B Relationship to learning with weak supervision
Considering that detailed data annotations are expensive, learning with weak supervision is promising, and has recently become an active research direction. A few examples of such work are [Papadopoulos et al.(2017), Blaschko et al.(2010)]. Weak supervision and weak label are rather ambiguous terms. They could take quite different forms and meanings. Generally, such learning requires no detailed labeling information, so that much more training data can be gathered cheaply. Still, the trained models are expected to have better performance and wider applicability than the ones learned on smaller but better annotated training data. Our method only requires binary labels suggesting whether any instance of a class exists or not in a sample. It fits well into the category of weak supervision or weak label learning.
V Experimental results
We only consider image classification problems here, although our method could apply to other tasks, e.g., audio event detection. For our method, the main difference is that the probability tensor $P$ will have different orders. In the following image recognition tasks, we only consider all convolutional networks so that the learned models are perfectly shift-invariant. We do not even use components like max pooling. Decimation, if necessary, is achieved by convolutional layers with stride larger than $1$. These constraints make our models simple, although there could be room for performance improvement after removing such constraints. A Newton type method [Anonymous(2019)] is adopted for optimization. We do not tune the optimization part a lot due to its use of normalized step size. With epochs of iterations, we typically set the learning rate to for the first epochs, and for the last epochs. PyTorch implementations reproducing the following reported results are available at
Any detail not reported here can be found in our implementation package.
V-A Classification with one-instance-per-sample setting
V-A1 MNIST handwritten digit recognition
We have tested a CNN model with five layers for feature extraction, and one last layer for detection. All the convolution filters have kernel size . Decimations and zero paddings are set so that $P$ has shape . A baseline CNN model having an almost identical structure is considered as well. We just apply less zero padding in the baseline model so that its $P$ has shape , such that the traditional cross entropy loss can be used to train it. With these settings, the baseline model achieves typical test classification error rates in range , while our method gives typical test error rates in range . Our method performs slightly better. To our knowledge, these test error rates are competitive, considering that we do not use any training data augmentation, model regularization, or fancy network structure.
V-A2 CIFAR10 image classification
We have tested a CNN model with nine layers for feature extraction, and one last layer for detection. All convolutional filters have kernel size . Decimations and zero paddings are set so that $P$ has shape . A similar baseline model trained by minimizing cross entropy loss is considered as well. With these settings, both our method and the baseline model achieve typical test classification error rates in range . We are aware that for this task, deeper and larger models can achieve test error rates in range . The purpose here is not to compete with these state-of-the-art performances, but to empirically show that replacing the traditional cross entropy loss with ours does not lead to a meaningful loss of classification performance. Actually, our performances are no worse than those of the all convolutional nets reported in [Springenberg et al.(2015)].
V-B Extended MNIST experiment
We use synthesized MNIST data to learn the same CNN model as in Section V-A1. We randomly select two handwritten digit images and nest them into a larger image. Then, this larger image is used as the training image. Its label only tells the model which digits appear in the image and which do not. As a result, the model never gets a chance to see any individual digit. As expected, the learned model can recognize randomly scattered handwritten digits in new test images with arbitrary sizes without any further processing, as shown in Figure 2. Here, the background class label $\phi$ plays an important role in separating one instance from another. We have tested the learned model on the same MNIST test data, and obtained typical test classification error rates in range . The lowest one is . To our knowledge, these are the best test error rates ever achieved without using any data augmentation like affine or elastic distortions.
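The data synthesis step can be sketched as follows. This is a minimal illustration only: the canvas size, the uniformly random placement, the choice to let digits overlap by taking the pixelwise maximum, and the function name `nest_digits` are all assumptions, since these details are left to the implementation package. Images are represented as plain nested lists of pixel intensities.

```python
import random


def nest_digits(digits, canvas_size, rng):
    """Paste each (image, class_label) pair at a random position on a blank
    square canvas and return the canvas with its set-valued label."""
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    label_set = set()
    for image, class_label in digits:
        height, width = len(image), len(image[0])
        top = rng.randint(0, canvas_size - height)
        left = rng.randint(0, canvas_size - width)
        for r in range(height):
            for c in range(width):
                # pixelwise maximum keeps both digits visible if they overlap
                canvas[top + r][left + c] = max(canvas[top + r][left + c], image[r][c])
        label_set.add(class_label)
    return canvas, label_set


# Hypothetical 2 x 2 stand-ins for two digit images of classes 3 and 7.
tiny_three = [[1.0, 1.0], [1.0, 1.0]]
tiny_seven = [[0.5, 0.5], [0.5, 0.5]]
canvas, label_set = nest_digits(
    [(tiny_three, 3), (tiny_seven, 7)], canvas_size=6, rng=random.Random(0)
)
```

The returned label records only which classes are present. When both pasted digits belong to the same class, the set contains a single element, so the label does not reveal the number of instances.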
V-C SVHN experiment
We consider the street view house number recognition task [Netzer et al.(2011)] in settings as realistic as possible. The task is to transcribe an image with house numbers to a string of digits. The training images come with very different sizes and resolutions. To facilitate the training, we take a tight square crop containing all the digits in an image, and rescale it to size . During the training, the model only knows which digits appear in the crops, and which do not. Thus, only limited labeling information is used to train our models. Note that most other works solving this same task exploit more labeling information, e.g., the complete digit sequences, their locations, the maximum sequence length, etc. [Goodfellow et al.(2014), Ba et al.(2015)].
We have trained two CNN models. Both have layers, and consist of convolutional filters with kernel size . They have and million coefficients, respectively. Both models are significantly smaller than the ones in [Goodfellow et al.(2014), Ba et al.(2015)]. Currently, we use a very coarse transcription method to convert the recognized digits into a sequence. We only consider horizontally oriented house numbers. As illustrated in Figure 1, we simply replace successive repetitions of a detected digit with a single copy of that digit to obtain the transcriptions. This method could yield incorrect transcriptions when the house numbers are not horizontally oriented, even if the model successfully recognizes all digits. We are still trying to improve this part. With these settings, we obtain sequence transcription error rates and for the small and large models, respectively. Our models detect no digit in about of the test images. With coverage rate , the sequence transcription error rates reduce to and for the small and large models, respectively. These performances are somewhat worse than the ones reported in [Goodfellow et al.(2014)], but comparable to those in [Ba et al.(2015)] on crops with similar sizes. We expect that larger models will perform significantly better. Salient advantages of our models are their wider applicability and interpretable decisions. Figure 3 shows some examples where our models can recognize house numbers in the original images without rescaling or ground truth bounding box information. Closer inspection suggests that many detection errors are due to detecting vertical edges as digit , or failing to detect digit , possibly because the models regard it as edges. Indeed, it could be difficult for our models to distinguish vertical edges and digit since they are all convolutional networks without fully connected layers. Increasing the receptive field of the last detection layer may help to alleviate this issue.
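The coarse transcription rule described above can be sketched in a few lines. The sketch assumes the detections have already been reduced to a left-to-right sequence of labels, one per horizontal position, with `0` standing for the background; how that sequence is extracted from the probability tensor is not specified here, and the function name `collapse_detections` is hypothetical.

```python
from itertools import groupby


def collapse_detections(labels, background=0):
    """Merge each run of identical successive labels into one label, then
    drop the background label."""
    return [label for label, _ in groupby(labels) if label != background]


# Hypothetical left-to-right detections for a house number reading 1, 9, 9.
detections = [0, 0, 1, 1, 1, 0, 9, 9, 0, 0, 9, 0]
transcription = collapse_detections(detections)
```

Two detections of the same digit are kept as separate digits only when a background label lies between them, which is one reason the background class matters for separating neighboring instances.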
VI Conclusions
We have proposed a novel method for machine learning with labels that only indicate whether any instance of a class exists in a sample or not. We demonstrate its applications to object detection and localization. With our method, weak labeling information and simple models are shown to be able to solve tough problems like street view house number recognition in reasonably realistic settings. As our models are already able to detect and approximately locate the instances, one interesting direction is to let the model mark out the pixels that account for the detection of each instance.
References
 [Alex et al.(2012)] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 [Anonymous(2019)] Anonymous. Learning preconditioners on matrix Lie groups. Submitted to International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bye5SiAqKX. Under review.
 [Ba et al.(2015)] Ba, J., Mnih, V., and Kavukcuoglu, K. Multiple object recognition with visual attention. In International Conference on Learning Representations, 2015.
 [Blaschko et al.(2010)] Blaschko, M., Vedaldi, A., and Zisserman, A. Simultaneous object detection and ranking with weak supervision. In NIPS, 2010.
 [Deng et al.(2009)] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
 [Goodfellow et al.(2014)] Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. Multi-digit number recognition from street view imagery using deep convolutional neural networks. 2014. URL https://arxiv.org/abs/1312.6082.
 [Huang et al.(2017)] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., and Murphy, K. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [LeCun et al.(1998)] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [Netzer et al.(2011)] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 [Papadopoulos et al.(2017)] Papadopoulos, D. P., Uijlings, J. R. R., Keller, F., and Ferrari, V. Training object class detectors with click supervision. In CVPR, 2017.
 [Springenberg et al.(2015)] Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: the all convolutional net. In International Conference on Learning Representations, 2015.
Appendix A: proof of Proposition 1
We start from (10), and repeatedly apply (9) to replace the recursively defined quantities of (8) and (9) with the product-form quantities of (4). This process is tedious, but eventually proves the correctness of Proposition 1. Starting from the end of (10) could make this process more manageable. By expanding the last term in (10) with (9), we obtain
(14) 
Next, we expand all the terms like
We continue this process until all ’s are replaced with ’s. Finally, the coefficient before will be
which is just , where denotes binomial coefficient. The coefficient before will be . This finishes the proof of Proposition 1.