Log In Sign Up

Joint Learning of Set Cardinality and State Distribution

by   S. Hamid Rezatofighi, et al.

We present a novel approach for learning to predict sets using deep learning. In recent years, deep neural networks have shown remarkable results in computer vision, natural language processing and other related problems. Despite their success, traditional architectures suffer from a serious limitation in that they are built to deal with structured input and output data, i.e. vectors or matrices. Many real-world problems, however, are naturally described as sets, rather than vectors. Existing techniques that allow for sequential data, such as recurrent neural networks, typically heavily depend on the input and output order and do not guarantee a valid solution. Here, we derive in a principled way, a mathematical formulation for set prediction where the output is permutation invariant. In particular, our approach jointly learns both the cardinality and the state distribution of the target set. We demonstrate the validity of our method on the task of multi-label image classification and achieve a new state of the art on the PASCAL VOC and MS COCO datasets.


Learn to Predict Sets Using Feed-Forward Neural Networks

This paper addresses the task of set prediction using deep feed-forward ...

Learning to Represent and Predict Sets with Deep Neural Networks

In this thesis, we develop various techniques for working with sets in m...

Stochastic Deep Networks

Machine learning is increasingly targeting areas where input data cannot...

DeepSetNet: Predicting Sets with Deep Neural Networks

This paper addresses the task of set prediction using deep learning. Thi...

Fine-grained Generalization Analysis of Structured Output Prediction

In machine learning we often encounter structured output prediction prob...

Localized Structured Prediction

Key to structured prediction is exploiting the problem structure to simp...


Recently, deep structured networks such as deep convolutional (CNN) and recurrent (RNN) neural networks have become increasingly popular in artificial intelligence, showing remarkable performance on many real-world problems, including scene classification 

[Krizhevsky, Sutskever, and Hinton2012], speech recognition [Hinton et al.2012], gaming [Mnih et al.2013, Mnih et al.2015], semantic segmentation [Papandreou et al.2015], and image captioning [Johnson, Karpathy, and Fei-Fei2016]

. However, like most machine learning techniques, current deep learning approaches are based on conventional statistics and require the problem to be formulated in a structured way. In particular, they are designed to learn a model for a distribution (or a function) that maps a structured input, typically a vector, a matrix, or a tensor, to a structured output.

Consider the task of image classification as an example. The goal here is to predict a label (or a category) of a given image. The most successful approaches address this task with CNNs, i.e. by applying a series of convolutional layers followed by a number of fully connected layers [Krizhevsky, Sutskever, and Hinton2012, Simonyan and Zisserman2014, Szegedy et al.2014, He et al.2016]. The final output layer is a fixed-sized vector with the length corresponding to the number of categories in the dataset (e.g. 1000 in the case of the ILSVR Challenge [Russakovsky et al.2015]

). Each element in this vector is a score or probability for one particular category such that the final prediction corresponds to a probability distribution over all classes. The difficulty arises when the number of classes is unknown in advance and in particular varies for each example. Then, the final output is generated heuristically by a discretization process such as choosing the

highest scores [Gong et al.2013a, Wang et al.2016], which is not part of the learning process. This shortcoming concerns not only image tagging but also other problems like detection or graph optimization, where connectivity and graph size can be arbitrary.

We argue that such problems can be naturally expressed with sets rather than vectors. As opposed to a vector, the size of a set is not fixed in advance, and it is invariant to the ordering of entities within it. Therefore, learning approaches built on conventional statistics cannot be the right choice for these problems. In this paper, we propose a learning approach based on point processes and finite set statistics to deal with sets in a principled manner. More specifically, in the presented model, we assume that the input (the observation) is still structured, but the output is modelled as a set. Our approach is inspired by a recent work on set learning using deep neural networks [Rezatofighi et al.2017]. The main limitation of that work, however, is that the approach employs two sets of independent weights (two independent networks) to generate the cardinality and state distributions of the output set. In addition, to generate the final output as a set, sequential inference has to be applied instead of joint inference. In this paper, we derive a principled formulation for performing both learning and inference steps jointly. The main contribution of the paper is summarised as follows:

  1. We present a novel way to learn both cardinality and state distributions jointly within a single deep network. Our model is learned end-to-end to generate the output set.

  2. We perform the inference step both jointly and optimally. We show how we can generate the most likely (the optimal) set using MAP inference for our given model.

  3. Our approach outperforms existing solutions and achieves state-of-the-art results on the task of multi-label image classification on two standard datasets.

Related Work

Handling unstructured input and output data, such as sets or point patterns, for both learning and inference is an emerging field of study that has generated substantial interest in recent years. Approaches such as mixture models [Blei, Ng, and Jordan2003, Hannah, Blei, and Powell2011, Tran et al.2016], learning distributions from a set of samples [Muandet et al.2012, Oliva, Póczos, and Schneider2013], model-based multiple instance learning [Vo et al.2017]

and novelty detection from point pattern data 

[Vo et al.2016]

, can be counted as few out many examples that use point patterns or sets as input or output and directly or indirectly model the set distributions. However, existing approaches often rely on parametric models,

e.g. the elements in output sets needs to be derived from an independent and identically distributed (i.i.d.) Poisson point process distribution [Adams, Murray, and MacKay2009, Vo et al.2016]. Recently, deep learning has enabled us to use less parametric models to capture highly complex mapping distributions between structured inputs and outputs. Somewhat surprisingly, there are only few works on learning sets using deep neural networks. One interesting exception in this direction is the recent work of Vinyals et al. vinyals2015order, which uses an RNN to read and predict sets. However, the output is still assumed to have an ordered structure, which contradicts the orderless (or permutation invariant) property of sets. Moreover, the framework can be used in combination with RNNs only and cannot be trivially extended to any arbitrary learning framework such as feed-forward architectures. Another recent work proposed by Zaheer et al. zaheer2017deep is a deep learning framework which can deal with sets as input with different sizes and permutations. However, the outputs are either assumed to be structured, e.g. a scalar as a regressing score, or a set with the same entities of the input set, which prevents this approach to be used for the problems that require output sets with arbitrary entities. Perhaps the most related work to our problem is a deep set network recently proposed by Rezatofighi et al. rezatofighi2017deepsetnet which seamlessly integrates a deep learning framework into set learning in order to learn to predict sets in two challenging computer vision applications, image tagging and pedestrian detection. However, the approach requires to train two independent networks to model a set, one for cardinality and one for state distribution. Our approach is largely inspired by this latter work but overcomes its limitation on independent learning and inference.

To validate our model, we apply it on the multi-label image classification task. Despite its relevance, there exists rather little work on this problem that makes use of deep architectures. One example is Gong et al. Gong:2013:arxiv, who combine deep CNNs with a top- approximate ranking loss to predict multiple labels. Wei et al. Wei:2014:arxiv propose a Hypotheses-Pooling architecture that is specifically designed to handle multi-label output. While both methods open a promising direction, their underlying architectures largely ignore the correlation between multiple labels. To address this limitation, recently, Wang et al. Wang_2016_CVPR proposed a model that combines CNNs and RNNs to predict an arbitrary number of classes in a sequential manner. RNNs, however, are not suitable for set prediction mainly for two reasons. First, the output represents a sequence and not a set, and is thus highly dependent on the prediction order, as was shown recently by Vinyals et al. vinyals2015order. Second, the final prediction may not result in a feasible solution (e.g. it may contain the same element multiple times), such that post-processing or heuristics such as beam search must be employed [Vinyals, Fortunato, and Jaitly2015, Wang et al.2016]. Here we show that our approach not only guarantees to always predict a valid set, but also outperforms previous methods.


To better explain our approach, we first review some mathematical background and introduce the notation used throughout the paper. In statistics, a continuous random variable

is a variable that can take an infinite number of possible values. A continuous random vector can be defined by stacking several continuous random variables into a fixed length vector,

. The mathematical function describing the possible values of a continuous random vector and their associated joint probabilities is known as a probability density function (PDF)

such that

In contrast, a random finite set (RFS) is a finite-set valued random variable . The main difference between an RFS and a random vector is that for the former, the number of constituent variables, , is random and the variables themselves are random and unordered. Throughout the paper, we use for a set with unknown cardinality, for a set with known cardinality and for a vector (or an ordered set) with known dimension .

A statistical function describing a finite-set variable is a combinatorial probability density function which consists of a discrete probability distribution, the so-called cardinality distribution, and a family of joint probability densities on both the number and the values of the constituent variables [Mahler2007, Vo et al.2017], i.e.


where is the cardinality distribution of the set and is a symmetric joint probability density distribution of the set given known cardinality . The normalisation factor between and appears because the probability density for a set with known cardinality must be equally distributed among all the possible permutations of the corresponding vector  [Mahler2007, Vo et al.2017]. is the unit of hyper-volume in the feature space, which cancels out the unit of the probability density making it unit-less, and thereby avoids the unit mismatch across the different dimensions (cardinalities). Without this normalizing constant, the comparison between probabilities of the sets with different cardinalities is not properly defined because a distribution with the smallest set size will always have the highest probabilities. For example, always holds regardless of the particular choice for and . Please refer to [Vo et al.2017] for an intuitive discussion.

Finite Set Statistics provides powerful and practical mathematical tools for dealing with random finite sets, based on the notion of integration and density that is consistent with the point process theory [Mahler2007].111A random finite set can be viewed as a simple finite point process [Baddeley, Bárány, and Schneider2007].

For example, similar to the definition of a PDF for a random variable, the PDF of an RFS must sum to unity over all possible cardinality values and all possible element values as well as their permutations. This type of statistics, which is derived from the point process stochastic process, defines basic mathematical operations on finite sets such as functions, derivatives and integrations as well as other statistical tools such as probability density function of a random finite set and its statistical moments 

[Mahler2007, Vo et al.2017]. For further details on point processes, we refer the reader to textbooks such as [Chiu et al.2013, Daley and Vere-Jones2007, Moller and Waagepetersen2003].

Conventional machine learning approaches, such as Bayesian learning and convolutional neural networks, have been proposed to learn the optimal parameters

of the distribution which maps the input vector to the output vector . In this paper, we instead propose an approach that can learn the parameters for a set distribution that allows one to map the input vector to the output set , i.e. . For mathematical convenience, we use an i.i.d.-cluster point process model. Moreover, we target applications where the order of the outputs during training is irrelevant, e.g. multi-label image classification. Modifying the application or the i.i.d. assumption to non-i.i.d. set elements, may require to deal with the complexity of permutation invariant property of sets during the learning step, which leads to serious mathematical complexities and is left for future work.

Joint Deep Set Network

We follow the convention introduced in [Rezatofighi et al.2017] and define a training set , where each training sample is a pair consisting of an input feature (e.g. image), and an output (or label) set . In the following we will drop the instance index for better readability. Note that denotes the cardinality of set . Following the definition in Eq.( 1), the probability density of a set with an unknown cardinality is defined as


where denotes the collection of parameters which model both the cardinality distribution of the set elements as well as the parameters of

that model the joint distribution of set element

values for a fixed cardinality . Note that in contrast to previous works [Rezatofighi et al.2017, Vo et al.2016, Vo et al.2017] that assume that two sets of independent parameters (two independent networks) are required to represent the set distribution , we will show that one set of parameters is sufficient to learn this distribution and as it turns out also yields better performance.

The above formulation represents the probability density of a set which is very general and completely independent of the choices of both cardinality and state distributions. It is thus straightforward to transfer it to many applications that require the output to be a set. However, to make the problem amenable to mathematical derivation and implementation, we adopt two assumptions: i) the outputs (or labels) in the set are derived from an independent and identically distributed (i.i.d.)-cluster point process model, and ii) their cardinality follows a categorical distribution parameterised by event probabilities . Thus, we can write the distribution as


where denotes the probability of taking on the state in a singleton set , and is the vector of event probabilities, i.e. and .

Posterior distribution

To learn the parameters , we assume that the training samples are independent from each other and that the distribution from which the input data is sampled is independent from both the output and the parameters. Then, the posterior distribution over the parameters can be derived as


A closed-form solution for the integral in Eq. (4

) can be obtained by using conjugate priors:

where and represent respectively a categorical distribution with the event probabilities and a Dirichlet distribution with parameters . Moreover,

can be assumed a zero-mean normal distribution with covariance equal to

, i.e. . The key difference between our method and [Rezatofighi et al.2017] is that we only need to use one network as opposed to two networks used in the previous work. It is important to note that our method jointly predicts both cardinality and the set elements as opposed to sequentially predicting the cardinality first and then the set elements as previously done in [Rezatofighi et al.2017]. We have provided a comparison between the graphical models of both methods in terms of plate notation in Fig. 1 to further illustrate their differences.

We assume that the cardinality follows a categorical distribution whose event probabilities vector

is estimated from a Dirichlet distribution with parameters

, which can be directly estimated from the input data . Note that the cardinality distribution in Eq. (3) can be replaced by any other discrete distribution, e.g. Poisson, binomial or negative binomial (cf. [Rezatofighi et al.2017]

). Here, we use the categorical distribution as the cardinality model, which better suits the task at hand. The rationale here is that Poisson and negative binomial are long-tailed distributions and their variance increases with their mean. Therefore, the final model will have more uncertainty (and possibly a higher error) in estimating the cardinality of the sets with high values. In contrast, the categorical distribution does not have the drawback of correlating its mean and variance. Moreover, in the image tagging application, the maximum cardinality is often known and there is no need to use long-tailed distributions, which are more suitable for the applications where the maximum cardinality is unknown.

Consequently, the integral in Eq. (4) is simplified and forms a Dirichlet-Categorical distribution


where is the number of samples in the training set with cardinality , and is the total number of training samples. Finally, the full posterior distribution can be written as

(a) (b)
Figure 1: Comparison of the graphical models: a) The set learning approach introduced in  [Rezatofighi et al.2017] by replacing Dirichlet-Categorical as its cardinality distribution; (b) our proposed joint set learning. The work in  [Rezatofighi et al.2017] first predicts the cardinality , and then the labels s given . There is a separation between parameters and . Consequently, an incorrect predicted via can not be fixed by . Our method only uses one joint parameter which aims to learn and predict the best and ’s jointly. and are shaded as they are observed in the training data. Note that is a variable in our model (b) which determines the repetition of the plates (i.e. the number of ’s). Removing the top-right chain in (a) recovers the traditional vector based (non-set based) method.


For simplicity, we use a point estimate for the posterior , i.e. , where is computed using the MAP estimator, i.e. . Therefore, we have


describes a neural network with coefficients learned to map the input to the output (label) . This function represents the state distribution of each set element over the state space. is the regularisation parameter, proportional to the predefined covariance parameter . This parameter is also known as the weight decay parameter and is commonly used in training neural networks.

For example, in the application to multi-label image classification, represents the existence of a specific label in the input image instance from all pre-defined labels. In this application, we can rewrite an equivalent binary formulation for the above MAP problem as


where represents the existence or non-existence of any specific label in the image .

can be defined as a binary logistic regression function

where is the network’s predicted output corresponding to the label.

Note that can generally be learned using a number of existing machine learning techniques. In this paper we rely on deep CNNs to perform this task. More formally, to estimate , we compute the partial derivatives of the objective function in Eq. (8

) and use standard backpropagation to learn the parameters of the deep neural network.


Having learned the network parameters , for a test image , we use a MAP estimate to generate a set output as


where , and as above. Therefore, the MAP estimate can be written as follows,


Since the unit of hyper-volume in this application is unknown, we assume it as a constant hyper-parameter, estimated from the validation set of the data.

To solve the above inference problem, we define the binary variable

for existence of each label similar to the learning process. Therefore, an equivalent formulation for Eq. (10) is


where The above problem can be seen as a combination of a higher-order term,


which accounts for the total number of selected variables, and a linear binary program, , where and


Therefore, we can re-write it as


Since for each specific cardinality , the most likely set corresponds to the highest values of , the optimal solution for can be found efficiently when the sorted values of , here denoted by , and is maximised w.r.t. :


Then, the optimal

can be obtained by solving a simple linear program:


Note that the optimal solution to the problem in Eq. (16) are exactly those variables that correspond to the highest values of .

Experimental Results

To validate our proposed joint set learning approach, we perform experiments on the task of multi-label image classification. This is an appropriate application for our model as its output is expected to be in the form of a set (a set of labels in this particular case) with an unknown cardinality while the order of its elements (labels) in the output list does not have any meaning. Moreover, we assume that the labels are derived from an i.i.d.-cluster process model. To make our work directly comparable to [Rezatofighi et al.2017], we use the same two standard and popular benchmarks, the PASCAL VOC 2007 dataset [Everingham et al.2007] and the Microsoft Common Objects in Context (MS COCO) dataset [Lin et al.2014].

Implementation details. We follow the same experimental setup used in [Rezatofighi et al.2017, Wang et al.2016]. Our model is built on the -layers VGG network [Simonyan and Zisserman2014]

, pretrained on the 2012 ImageNet dataset. We adapt VGG for our purpose by modifying the last fully connected prediction layer to predict both cardinality and classification distributions according to the loss proposed in Eq. (

8), i.e. DC for cardinality and binary cross-entropy for classification. We then fine-tune the entire network using the training set of these datasets with the same train/test split as in existing literature [Rezatofighi et al.2017, Gong et al.2013a, Wang et al.2016].

To train our network, which we call JDS in the following, we use stochastic gradient descent and set the weight decay to

, with a momentum of and a dropout rate of

. The learning rate is adjusted to gradually decrease after each epoch, starting from

. The network is trained for epochs for both datasets and the epoch with the lowest validation objective value is chosen for evaluation on the test set. The hyper-parameter is set to be , adjusted on the validation set.

To demonstrate that joint learning is helpful to learn a better classifier (state distribution) as well as a better cardinality distribution, we perform an additional baseline experiment where we replace the negative binomial (NB) distribution used in 

[Rezatofighi et al.2017] with the Dirichlet-Categorical (DC) distribution from Eq. (5). Then, an independent cardinality distribution network is trained using the same network structure as the one used in [Rezatofighi et al.2017] while modifying the final fully connected layer to predict the cardinality using the DC distribution. We fine-tune the network on cardinality distribution, initialised with the network weights learned for the classification task of each of the reported datasets, i.e. PASCAL VOC and MS COCO. To train the cardinality CNN, we use the exact same hyper-parameters and training strategy as described above.

Evaluation protocol.

We employ the common evaluation metrics for multi-label image classification also used in 

[Gong et al.2013a, Wang et al.2016, Rezatofighi et al.2017]. These include the average precision, recall and F1-score222

F1-score is calculated as the harmonic mean of precision and recall.

of the generated labels, calculated per-class (C-P, C-R and C-F1) and overall (O-P, O-R and O-F1). Since C-P, C-R and C-F1 can be biased to the performance of the most frequent classes, we also report the average precision, recall and F1-score of the generated labels per image/instance (I-P, I-R and I-F1).

We rely on F1-score to rank approaches on the task of label prediction. A method with better performance has a precision/recall value that has a closer proximity to the perfect point shown by the blue triangle in Fig. 2. To this end, for the classifiers such as BCE and Softmax, we find the optimal evaluation parameter that maximises the F1-score. For the deep set network (DS) [Rezatofighi et al.2017] and our proposed joint set network (JDS), prediction/recall is not dependent on the value of . Rather, one single value for precision, recall and F1-score is computed.


Figure 2: Precision/recall curves for the classification scores when the classifier is trained independently (red solid line) and when it is trained jointly with the cardinality term using our proposed joint approach (black solid line) on PASCAL VOC dataset. The circles represent the upper bound when ground truth cardinality is used for the evaluation of the corresponding classifiers. The ground truth prediction is shown by a blue triangle.
Classifier Eval. C-P C-R C-F1 O-P O-R O-F1 I-P I-R I-F1
Softmax k=(1)
BCE k=(1)

DS (BCE-NB) [Rezatofighi et al.2017]
DS (BCE-DC) k=
JDS (BCE-DC) k= 78.7 81.5 85.1
Table 1: Quantitative results for multi-label image classification on the PASCAL VOC dataset.

We first test our approach on the Pascal Visual Object Classes benchmark [Everingham et al.2007], which is one of the most widely used datasets for detection and classification. This dataset includes images with a 50/50 split for training and test, where objects from pre-defined categories have been annotated by bounding boxes. Each image contains between and unique objects.

We first investigate if the joint learning improves the performance of cardinality and classifier. Fig. 2 shows the precision/recall curves for the classification scores when the classifier is trained solely using binary cross-entropy (BCE) loss (red solid line) and when it is trained using the same loss jointly with the cardinality term (Joint BCE). We also evaluate the precision/recall values when the ground truth cardinality is provided. The results confirm our claim that the joint learning indeed improves the classification performance. We also calculate the mean absolute error of the cardinality estimation when the cardinality term using the DC loss is learned jointly and independently as proposed in [Rezatofighi et al.2017]. The mean absolute cardinality error of our prediction on PASCAL VOC is , while this error is when the cardinality is learned independently. We compare the performance of our proposed joint deep set network, i.e. JDS (BCE-DC), with softmax and BCE classifiers with the best value as well as the deep set network [Rezatofighi et al.2017] when the classifier is binary cross entropy and the cardinality loss is negative binomial, i.e. DS (BCE-NB). In addition, Table 1 reports the results for the deep set network when the cardinality loss is replaced by our proposed Dirichlet-Categorical loss, i.e. (BCE-DC). The results show that we outperform the other approaches w.r.t. all three types of F1-scores. In addition, our joint formulation allows for a single training step to obtain the final model, while the deep set network learns two VGG networks to generate the output sets.

Microsoft COCO.

Figure 3: Qualitative comparison between our proposed joint deep set network (JDS) and the deep set networks with Dirichlet-Categorical (DS (DC)) and Negative Binomial (DS (NB)) as the cardinality loss. For each image, the ground truth tags and the predictions for our JDS and the two baselines are denoted below. False positives are highlighted in red. Our JDS approach reduces both cardinality and classification error.
Classifier Eval. C-P C-R C-F1 O-P O-R O-F1 I-P I-R I-F1

BCE k=(3)
CNN-RNN [Wang et al.2016] k=(3)
DS (Softmax-NB) [Rezatofighi et al.2017] k=
DS (BCE-NB) [Rezatofighi et al.2017] k=
DS (BCE-DC) k=
JDS (BCE-DC) k= 65.5 70.7 75.6

Table 2: Quantitative results for multi-label image classification on the MS COCO dataset.

The MS-COCO [Lin et al.2014] benchmark is another popular benchmark for image captioning, recognition, and segmentation. The dataset includes K images, each labelled with per instance segmentation masks of classes. The number of unique objects for each image varies between and . Around images in the training set do not contain any of the classes and there are only a handful of images that have more than tags. Most images contain between one and three labels. We use images with identical training and validation split as [Rezatofighi et al.2017], and the remaining images as test data.

The classification results on this dataset are reported in Table 2. The results once again show that our approach consistently outperforms our baselines and the other methods measured by F1-score. Due to this improvement, we achieve state-of-the-art results on this dataset as well. Some examples of label prediction using our joint deep set network and comparison with other deep set networks are shown in Fig. 3. The results show that our joint learning can simultaneously reduce the cardinality and classification errors in these examples.


We proposed a framework to jointly learn and predict a set’s cardinality and state distributions by modelling both distributions using the same set of weights. This approach not only significantly reduces the number of learnable parameters, but also helps to model both distributions more accurately. We have demonstrated the effectiveness of this approach on multi-class image classification, outperforming previous state of the art on standard datasets. The main limitation of our framework is that we do not include the complexity of permutation invariance of sets in the learning step. Therefore, our method is only applicable to set problems that do not rely on permutation invariance during training, such as image tagging. In future, we plan to overcome this limitation by incorporating permutation variables during training procedure. Another potential avenue could be to exploit the Bayesian nature of the model to include uncertainty as opposed to relying on the MAP estimation alone.

Acknowledgments. This research was supported by the Australian Research Council through the Centre of Excellence in Robotic Vision, CE140100016, and through Laureate Fellowship FL130100102 to IDR.