Large-Scale Classification of Structured Objects using a CRF with Deep Class Embedding

05/21/2017 ∙ by Eran Goldman, et al. ∙ Bar-Ilan University

This paper presents a novel deep learning architecture to classify structured objects in datasets with a large number of visually similar categories. We model sequences of images as linear-chain CRFs, and jointly learn the parameters from both local-visual features and neighboring classes. The visual features are computed by convolutional layers, and the class embeddings are learned by factorizing the CRF pairwise potential matrix. This forms a highly nonlinear objective function which is trained by optimizing a local likelihood approximation with batch-normalization. This model overcomes the difficulties of existing CRF methods to learn the contextual relationships thoroughly when there is a large number of classes and the data is sparse. The performance of the proposed method is illustrated on a huge dataset that contains images of retail-store product displays, taken in varying settings and viewpoints, and shows significantly improved results compared to linear CRF modeling and unnormalized likelihood optimization.




I Introduction

Object recognition is one of the fundamental problems in computer vision. It involves finding and identifying objects in images, and plays an important role in many real-world applications such as advanced driver assistance systems, military target detection, diagnosis with medical images, video surveillance, and identity recognition.

Over the past few years deep convolutional neural networks (CNN) have led to remarkable progress in image classification [1, 2], and resulted in reliable appearance-based detectors, e.g. [3, 4, 5, 6]. Fine-grained object recognition aims to identify subcategory object classes, which includes finding subtle differences among visually similar subcategories. Recent studies achieved good performance on fine-grained tasks [7, 8]. However, the problem remains extremely difficult when the dataset categories are nearly identical in terms of their visual appearance. In this case, object categories are often virtually indistinguishable, since the discriminative features may be masked by inadequate observation or visual artifacts.

Fig. 1: Spot the difference: Examples of classes with similar appearance. Each product in each grouping in this image belongs to a different category.

This study addresses the problem of classifying a sequence of objects based on their visual appearance and their relative locations. Our dataset contains photos of retail-store product displays, taken in varying settings and viewpoints. We need to identify the class of each product at the front of the shelves. The dataset is characterized by a distinct geometric object structure (sequences of shelves), a large number of classes, and very subtle visual differences between groups of classes: some classes differ only in size or minor design details. The unique challenges in this task involve handling the large number of possible classes, and the fact that the classes are not clearly distinguishable by their appearance but rather by their context. For example, products with identical appearance but with different container volumes are considered different classes (see examples in Figs. 1 and 2).

Because the object local appearance may not suffice for accurate categorization, additional information needs to be considered. In real world images, contextual data provides useful information about spatial and semantic relationships between objects. Modeling a joint visual-contextual classifier is nontrivial in that some contextual cues are very informative, whereas others are irrelevant, or even misleading. Most deep learning detectors classify each detected object individually without taking the contextual information into account.

Context has been used to improve performance for image understanding tasks in various ways [9, 10, 11]. Graphical models have been widely applied to visual and auditory analysis tasks, by jointly modeling local features and contextual relations. Tasks addressed by these models include image segmentation and object recognition [12, 13, 14, 15, 16, 17], as well as speech [18], music [19], text [20] and video analysis [21].

Fig. 2: Scene example with the relevant classes presented on the left. For some of the products in this scene, classification based on local appearance alone would be extremely difficult even for expert humans. The object on the top right, for instance, is partially occluded, and visually distorted by reflections, illuminations, and focus. Additionally, it is facing backwards, making its flavor undetermined. When viewed separately, the volume cannot be determined either. Shelf-level classification exploits the information extracted from the other items and their spatial relations, including a-priori knowledge of structure statistics, to jointly determine the shelf classes.

Few studies have applied deep learning features or detection results to context models: Chen et al. [22] explored several techniques to learn structured models jointly with deep features that form MRF potentials. Chu and Cai [23] evaluated the performance of a joint CRF model on Faster R-CNN [3] detection results, using an a-priori statistical summary for the pairwise potentials. Korzeniowski and Widmer [19] introduced a two-stage learning model for musical chord recognition: one network learns a single-frame representation, and the other learns the potentials of a linear-chain CRF model, using the frame representations as the CRF input. The aforementioned CRF models use pairwise potentials to represent object-pair interaction. They allocate a different parameter for each class pair. This approach, which ignores class similarities, may be sufficient for small sets of distinct classes. However, it is not suitable for a large class set that contains visually similar classes. Our dataset, which includes many visually similar categories, nearly a thousand classes and nearly a million possible pairwise transitions overall, requires a more advanced learning mechanism. In most previous object recognition studies the visual information was dominant. In our task, context information also makes a significant contribution.

In this study we provide a CRF-based method that explicitly learns the embedding of classes with respect to their neighbors' visual features. This is achieved by factorizing the CRF pairwise potential matrix to impose the desired structure of class embedding in a low-dimensional space. Our model learns the factorized parameters, and yields a joint contextual-visual embedding of the classes. To efficiently train the network, we introduce a pairwise softmax architecture which optimizes a local approximation of the likelihood. Since the factorized loss function is not convex, we exploit the simplicity of the local approximation architecture to include batch-norm related regularization for the object samples, and attain a dramatic improvement not only in training time but also in the overall performance of the trained model. At test time, dynamic programming techniques are used for efficient exact inference of classes.

The contributions of this work are the following:

  1. Combining deep class embedding with a CRF formulation, which enables handling datasets with a huge number of classes.

  2. An approximate-likelihood training procedure that is computationally efficient and, unlike exact CRF likelihood training, easily incorporates batch-normalization.

The rest of the paper is organized as follows. In the next section we describe a CRF model with class embedding formulation and present the learning and inference algorithms. Section III contains a detailed data description and comparative experimental results. Finally, conclusions are given in Section IV.

II CRF With Class Embedding

II-A Model Formulation

Our study is motivated by images of store shelves, where a large number of objects with many possible classes appear in a single image and we want to classify each object using both visual and context information. A preprocessing detection stage is applied to extract the detection bounding boxes, crop their image patches and organize them according to their locations on the shelves. The input data used for our task are sequences of images: each image captures an individual product, and the images are organized in sequences ordered according to their relative positions on the shelves. Let $x_t^{(n)}$ denote the image in position $t$ of sequence $n$, and $y_t^{(n)}$ the corresponding class label. For notational simplicity, we omit the index $n$ when referring to an individual sequence.

We wish to predict the sequence of the target labels $y = (y_1, \dots, y_T)$, given a sequence of observations $x = (x_1, \dots, x_T)$. Standard classification approaches use a CNN to predict each object-level observation individually, implicitly assuming independence between object samples. In order to include context in the classification process, we model the sequences as a CRF.

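Linear-chain Conditional Random Field (LC-CRF) [24] is a type of discriminative undirected probabilistic graphical model, whose conditional distribution $p(y \mid x)$ obeys a conditional Markov property. The joint probability distribution of a linear-chain CRF is:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t=1}^{T} \phi(y_{t-1}, y_t, x_t)\Big) \tag{1}$$

$$Z(x) = \sum_{y'} \exp\Big(\sum_{t=1}^{T} \phi(y'_{t-1}, y'_t, x_t)\Big) \tag{2}$$

where $x = (x_1, \dots, x_T)$ is the sequence of observation feature vectors, $y = (y_1, \dots, y_T)$ the corresponding sequence of the target labels, $\phi$ the model potential function, $Z(x)$ the partition function defined as the global probability normalization over all possible sequence label-assignments of length $T$, and $y_0$ is the zero vector, since the first object has no left neighbor. Assume that the potential function is defined as:

$$\phi(y_{t-1}, y_t, x_t) = y_{t-1}^\top W y_t + x_t^\top U y_t + b^\top y_t \tag{3}$$

where matrix $W$ the pairwise potentials matrix, $U$ the unary potentials, and vector $b$ the label bias, are all model parameters, and we use a one-hot encoding for the labels. The log-likelihood function, therefore, is log-linear and concave.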

Combining neural networks with CRFs gives a fully differentiable model that can be learned jointly, as shown e.g. in [25, 26]. For the task at hand, however, we found it sufficient to train both parts separately, and apply transfer learning for faster convergence and easier regularization. We can train a local CNN to classify individual objects, and then interpret the hidden layer activations as a non-linear representation of the input image. Similarly to the concept of transfer learning, we can now discard the CNN softmax layer, and use the convolutional layers to compute the feature vectors of the input images. For image detection $x_t$ we define the feature vector $h_t = h(x_t)$ as the activations of the last fully-connected hidden layer, and use it as the CRF input observation feature vector:

$$\phi(y_{t-1}, y_t, h_t) = y_{t-1}^\top W y_t + h_t^\top U y_t + b^\top y_t$$
The score function (3) is still concave, but its input is computed from a non-concave source, a deep CNN. The rationale for using deep representation for the input images is clear: as introduced by Krizhevsky et al. [1], the immense complexity of the visual object recognition task requires a model with a very large learning capacity. Convolutional layers provide the structure required for learning visual features of the unary input. We would like to craft a suitable structure to learn the pairwise contextual relations as well.

CRF was originally applied to language processing tasks such as Part of Speech (POS) tagging and Named Entity Recognition (NER) [24]. In most applications of CRF to either language or image understanding, there are no more than a few dozen different classes. In our dataset, we have hundreds of classes. The pairwise transition between two classes has nearly a million possible states, whereas the CRF function (3) has a log-linear form, and contains a single parameter per transition ordered-pair. In order to properly learn and generalize the massive variety of possible neighboring patterns, we enforce a structure on the pairwise potential matrix: the goal is to learn neighboring-class embedding in a feature vector space. For this purpose, we define a low-dimensional decomposition of the pairwise potential matrix $W$ as the product of the left-side neighbor embedding matrix $\tilde{E}$ and the class embedding matrix $E$:

$$W = \tilde{E}^\top E \tag{4}$$

The columns of $E$ are low-dimensional embeddings of the target classes, and the columns of $\tilde{E}$ are embeddings of the classes of the left-side object. Assigning the matrix factorization (4) to the CRF potential function (3) we get:

$$\phi(y_{t-1}, y_t, h_t) = (\tilde{E} y_{t-1})^\top (E y_t) + h_t^\top U y_t + b^\top y_t \tag{5}$$
The objective function is no longer linear or concave with respect to the network parameters, but deep learning training techniques have been shown to yield good results for non-convex optimization tasks [28]. This simply means that we need to apply the deep learning approach not only for the input image representations, but also for the neighboring transition parameters.
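To make the effect of the factorization concrete, the following is a minimal sketch (not the authors' code) that compares the number of pairwise parameters before and after factorization and rebuilds a pairwise matrix from two embedding matrices. The embedding dimensionality of 32 is a hypothetical value chosen only for illustration, and the function names are assumptions.

```python
def pairwise_param_counts(num_classes, embed_dim):
    """Return (full, factorized) pairwise parameter counts.

    full:       one parameter per ordered class pair.
    factorized: two embedding matrices of shape embed_dim x num_classes.
    """
    full = num_classes * num_classes
    factorized = 2 * num_classes * embed_dim
    return full, factorized


def pairwise_from_embeddings(e_left, e_target):
    """Rebuild the pairwise matrix as (e_left transposed) times e_target.

    Both inputs are lists of rows with shape embed_dim x num_classes, so
    column i of e_left is the embedding of class i as a left neighbor.
    """
    dim = len(e_left)
    k = len(e_left[0])
    return [[sum(e_left[r][i] * e_target[r][j] for r in range(dim))
             for j in range(k)]
            for i in range(k)]


# 972 classes as in the dataset; embed_dim=32 is a hypothetical choice.
full, factorized = pairwise_param_counts(972, 32)
print(full, factorized)
```

With these illustrative numbers the full matrix has 944,784 entries, which matches the "nearly a million" transitions mentioned in the text, while the factorized form needs 62,208 parameters.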

Fig. 3: Examples of display shelves. Each arrow color represents a different class. Some typical patterns are evident.

II-B Training

The CRF model defined above can be trained in a supervised manner by maximizing the log-likelihood of all sequences in the training dataset [24]:

$$L(\theta) = \sum_{n} \log p\big(y^{(n)} \mid h^{(n)}; \theta\big) \tag{6}$$

where the vector $y^{(n)}$ contains the ground-truth labels of the $n$th sequence, $h^{(n)}$ contains the corresponding object feature vectors of the sequence of observations, $n$ goes over the sequences in the training data, and the loss function is defined at (1) with the potentials (5). Since the underlying graph is loop-free, it is tractable to compute the likelihood function and its gradient using the forward-backward algorithm [29]. However, the optimization is relatively slow for a large number of classes, because its complexity is quadratic in the number of possible classes. In order to speed up the training process, we can estimate the parameters locally, by optimizing an approximate objective function. A local approximation of the likelihood would require samples of individual objects and their immediate neighbors rather than entire sequences.
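The quadratic cost comes from the forward recursion used to compute the partition function. The sketch below is an illustration in plain Python, assuming the unary scores and the pairwise matrix have already been computed; the names `unary`, `pairwise` and the helper functions are not from the paper.

```python
import math


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(unary, pairwise):
    """Forward algorithm for a linear-chain CRF.

    unary[t][j]    : unary score of class j at position t
    pairwise[i][j] : score of class i followed by class j
    Every position costs O(k^2) for k classes.
    """
    k = len(unary[0])
    alpha = list(unary[0])  # the first object has no left neighbor
    for t in range(1, len(unary)):
        alpha = [unary[t][j]
                 + logsumexp([alpha[i] + pairwise[i][j] for i in range(k)])
                 for j in range(k)]
    return logsumexp(alpha)


def sequence_score(labels, unary, pairwise):
    """Unnormalized log-score of one label sequence."""
    score = unary[0][labels[0]]
    for t in range(1, len(labels)):
        score += pairwise[labels[t - 1]][labels[t]] + unary[t][labels[t]]
    return score


def log_likelihood(labels, unary, pairwise):
    """Exact CRF log-likelihood of a labeled sequence."""
    return sequence_score(labels, unary, pairwise) - log_partition(unary, pairwise)
```

The inner list comprehension visits every ordered class pair once per position, which is the quadratic term the text refers to.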

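Linear-chain CRFs were originally introduced as an improvement on the Maximum Entropy Markov model (MEMM), which is essentially a Markov model in which the transition distributions are given by a logistic regression model. The main difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next-states given the current-state, whereas the CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation-sequence. CRF and MEMM can be written using the same set of parameters. The MEMM directed graphical modeling in our case is:

$$p(y \mid h) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}, h_t) \tag{7}$$

$$p(y_t \mid y_{t-1}, h_t) = \frac{\exp\big(\phi(y_{t-1}, y_t, h_t)\big)}{\sum_{y'} \exp\big(\phi(y_{t-1}, y', h_t)\big)} \tag{8}$$

where $\phi$ is the potential function (5) and the sum in the denominator goes over all possible classes $y'$ of the current object.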
One major advantage of MEMMs over CRFs (and HMMs) is that training can be considerably more efficient. Unlike CRFs, in MEMMs the parameters of the maximum-entropy distributions used for the transition probabilities can be estimated for each transition distribution separately. When a MEMM is applied for inference, it suffers from the label bias problem [24, 31], which may lead to a drop in performance in some applications. Here, however, the MEMM objective is used only as a local approximation to learn the parameter set of the linear-chain CRF model, whereas the test-time inference uses the global normalization of CRF modeling and thus avoids the label bias problem. The objective function is now defined as the conditional probability of the current-object class, given the class of the left-side neighbor object:

$$L(\theta) = \sum_{n} \sum_{t} \log p\big(y_t^{(n)} \mid y_{t-1}^{(n)}, h_t^{(n)}\big) \tag{9}$$

where $n$ goes over the sequences and $t$ goes over the objects in the sequence, $h_t^{(n)}$ is the object CNN-based representation, $y_t^{(n)}$ is the true class label and $p(y_t \mid y_{t-1}, h_t)$ is as defined at (8). Note that the computational complexity of the MEMM likelihood (9) is linear in the number of classes, unlike the CRF likelihood whose computational complexity is quadratic.
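As an illustration of why the local objective is linear in the number of classes, here is a minimal sketch of the per-sample log-probability under a factorized pairwise term. The argument names and the list-of-rows matrix layout are assumptions made for this example; a real implementation would use a tensor library.

```python
import math


def local_log_prob(left_label, target_label, h, e_left, e_target, unary_w, bias):
    """log p(y_t | y_{t-1}, h_t) for one (left label, right object) pair.

    e_left, e_target : embed_dim x num_classes (columns are class embeddings)
    unary_w          : feature_dim x num_classes
    left_label       : class index of the left neighbor, or None if absent
    """
    dim = len(e_left)
    k = len(bias)
    # Context vector: embedding of the left label, or zeros without a neighbor.
    if left_label is None:
        ctx = [0.0] * dim
    else:
        ctx = [e_left[r][left_label] for r in range(dim)]
    # One score per candidate class: a single pass over the k classes.
    scores = []
    for j in range(k):
        pair = sum(ctx[r] * e_target[r][j] for r in range(dim))
        unary = sum(h[m] * unary_w[m][j] for m in range(len(h)))
        scores.append(pair + unary + bias[j])
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[target_label] - log_norm
```

Because the left label is given, the normalization runs over the candidate classes of a single object rather than over all ordered class pairs.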

This surrogate likelihood function whose samples are pairs of objects and corresponding neighboring labels can be used at train time to accelerate the training process. Because the model is stationary and conditionally independent of indirect neighbors, breaking the samples from sequences into adjacent pairs of direct neighbors does not necessarily eliminate significant contextual information. Rather, when learning the non-convex objective of class-embedding CRF, it may enrich the training dataset, improve the stochastic nature of the SGD optimization process, simplify and improve regularization techniques such as batch-normalization, and help prevent overfitting since there are many more object samples than sequence samples, and the mini-batches are composed of adjacent pairs of objects taken from random training samples. In contrast, restricting the mini-batches to contain full sequences, would decrease the model’s freedom to discover better solutions for the objective of pairwise transition parameters. In fact, as we empirically show in the next section, optimizing the local approximate likelihood with object-level batch-normalization yields better results than optimizing the unnormalized global LC-CRF likelihood. In the appendix we review standard likelihood approximation strategies for efficient CRF training and show that the training method we are using in this study can be viewed as a simplified version of the piecewise-pseudolikelihood approximation [32].

II-C Feature Scaling with Batch Normalization

In optimization, feature standardization or whitening is a common procedure that has been shown to improve convergence rates [33]. In deep neural networks, whitening the inputs to each layer may also prevent converging into poor local optima. However, training a deep neural network is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, and need to continuously adapt to the new distribution. The batch-normalization (BN) [34] method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch.

In our model, we found it advantageous to standardize the input features of the softmax layer. They are composed of the visual features of the CNN and the learned neighbor embeddings (see Fig. 4). The standardization of the softmax input feature vector is important in order to avoid inherent bias between the local-visual and contextual information. The goal is to encourage each feature of the softmax input to have standard mean and variance. Since we use a pre-trained CNN we can standardize the visual features by an offline pre-processing stage. In contrast, the embeddings are jointly learned with the softmax layer and hence we use the batch-normalization [34] method to learn their mini-batch normalization during the training process. In fact, since the input of the embedding layer is a one-hot vector, the batch-normalization process directly standardizes each feature in the embedding space.

Formally, by applying batch normalization to the context representation, Eq. (8) is replaced by:

$$p(y_t \mid y_{t-1}, h_t) = \frac{\exp\big(\mathrm{BN}(\tilde{E} y_{t-1})^\top E y_t + h_t^\top U y_t + b^\top y_t\big)}{\sum_{y'} \exp\big(\mathrm{BN}(\tilde{E} y_{t-1})^\top E y' + h_t^\top U y' + b^\top y'\big)} \tag{10}$$
A major advantage of the approximate-likelihood we are using is that, unlike sequential models such as CRF, here it is very simple and effective to apply embedding batch normalization for each neighboring-label sample.
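The mini-batch standardization of the neighbor embeddings can be sketched as follows. This is a simplified illustration that omits the learned scale and shift parameters of full batch normalization [34]; the function name and the `eps` default are assumptions.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each embedding feature over a mini-batch.

    batch: list of embedding vectors, one per neighboring-label sample.
    Returns the normalized batch together with the per-feature batch mean
    and variance, which can be accumulated into population statistics.
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    normed = [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
               for j in range(dim)]
              for row in batch]
    return normed, means, variances
```

Since each training sample contributes exactly one neighbor embedding, the statistics are computed over independent samples, which is what makes this step straightforward in the local objective.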

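Training data: Feature sequences $x^{(1)}, \dots, x^{(N)}$ with corresponding label sequences $y^{(1)}, \dots, y^{(N)}$.

Training algorithm:

  1. Train a CNN to maximize the likelihood: $\sum_{n} \sum_{t} \log p(y_t^{(n)} \mid x_t^{(n)})$

  2. Train a CRF to maximize the local likelihood approximation: $\sum_{n} \sum_{t} \log p(y_t^{(n)} \mid y_{t-1}^{(n)}, h_t^{(n)})$, where $p(y_t \mid y_{t-1}, h_t)$ is as defined at (10) and $h_t^{(n)}$ is the CNN-based representation of $x_t^{(n)}$.

Inference algorithm: Given an object sequence $x_1, \dots, x_T$:

  1. Apply the CNN to obtain a non-linear representation $h_1, \dots, h_T$.

  2. Apply the forward-backward (or Viterbi) algorithm on the CRF $p(y \mid h) \propto \exp\big(\sum_{t} \phi(y_{t-1}, y_t, h_t)\big)$ to find the labels of the object sequence.

TABLE I: CRF with deep class embedding algorithm.

Fig. 4: The approximate likelihood training architecture.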

II-D Inference

At test time, global classification is applied to the linear-chain CRF. Dynamic programming algorithms may be used for efficient and exact inference as follows: the Viterbi algorithm finds the most probable sequence label assignment, and the Forward-Backward algorithm extracts the marginal probability of each item, by summation over all possible assignments [29]. Note that although the training was done by local likelihood approximation, and we assumed that the predecessor label is known, at test time we apply the global normalization over all possible object sequences. The proposed training and inference methods are summarized in Table I. Fig. 4 shows an illustration of the training architecture.
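For reference, a compact sketch of Viterbi decoding on a linear chain is given below. It assumes the unary scores and the restored pairwise matrix are available as nested lists; the names are illustrative only.

```python
def viterbi(unary, pairwise):
    """Most probable label sequence of a linear-chain CRF.

    unary[t][j]    : unary score of class j at position t
    pairwise[i][j] : score of class i followed by class j
    """
    k = len(unary[0])
    score = list(unary[0])
    backpointers = []
    for t in range(1, len(unary)):
        new_score, pointers = [], []
        for j in range(k):
            # Best predecessor class for label j at position t.
            best_i = max(range(k), key=lambda i: score[i] + pairwise[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + pairwise[best_i][j] + unary[t][j])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(k), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# The second object is ambiguous on its own (0.4 vs. 0.5), but the
# pairwise scores favor repeating the left neighbor's class.
print(viterbi([[2.0, 0.0], [0.4, 0.5]], [[1.0, -1.0], [-1.0, 1.0]]))
```

In this toy example the per-object argmax would label the second object as class 1, whereas joint decoding assigns class 0 to both objects.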

III Experiments

III-A The Dataset

Our dataset contains photos of retail-store displays, taken in supermarkets and grocery-stores. The images capture arbitrary subsections of the displays, in varying settings and viewpoints. The objects are the inventory items positioned at the front of the displays, and the classes are their stock-keeping-unit (SKU) unique identifiers. Each object is annotated by its class label and bounding-box coordinates. The objects in each image are grouped into shelves - sequences of horizontal layouts, sorted from left to right.

The benchmark contains 24,024 images, 76,081 sequences, and 460,121 objects, each labeled as one of 972 different classes. Sequence lengths vary from 2 to 32, and are typically between 4 and 12. We split the dataset into 80% training and 20% testing.

Many groups of classes belong to the same archetype, and only differ in terms of minor details such as volume, flavor, nutrient content, etc. They often share similar visual features, which makes appearance-based classification very difficult (Fig. 2). On the other hand, the object layout behavior is very coherent: it is dictated by the supplier planograms (specified product layouts) and extracted from the image realograms (observed product layouts). Although realograms are non-deterministic by nature, consistent semantic patterns are frequently spotted. Class transition behavior may be discovered, revealing tendencies of pairs to appear as left-to-right neighbors, and of individual classes to appear multiple times successively (Fig. 3). The unique challenges we face in our task are derived from the large number of visually similar classes, which co-occur in distinct structures in large-scale images. Since the images capture arbitrary subsections of the shelf displays, the visual appearances of object sequences in a shelf vary unpredictably in terms of their relative positions and occasional unnoticed or absent elements. Nevertheless, the co-occurrence statistics remain stable in most cases, which justifies stationarity and Markovity assumptions for the structure modeling.

III-B Implementation Details

We first train an AlexNet CNN [1] to compute the hidden representation vector $h_t$ for each image-patch; we denote the hidden layer size by $m$. Then, as a preprocessing step for the CRF model, we calculate the mean and standard deviation of each feature of the hidden representation vector from the training dataset:

$$\mu_j = \frac{1}{N} \sum_{i=1}^{N} h_{i,j}, \qquad \sigma_j = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (h_{i,j} - \mu_j)^2}, \qquad j = 1, \dots, m$$

where $h_{i,j}$ is feature $j$ of training object $i$ and $N$ is the number of training objects.

The number of classes in our dataset is $k = 972$, and we denote the class embedding dimensionality by $d$. We learn a class embedding matrix $E \in \mathbb{R}^{d \times k}$, a neighbor embedding matrix $\tilde{E} \in \mathbb{R}^{d \times k}$, a unary potential matrix $U \in \mathbb{R}^{m \times k}$ and a bias vector $b \in \mathbb{R}^{k}$. We train the network as described in subsection II-B, using SGD with mini-batches of size 128, and maximizing the log-likelihood function (9) with $p(y_t \mid y_{t-1}, h_t)$ as defined at (10) and a regularization term for all network parameters. The training samples in each mini-batch are object-pairs selected randomly from the benchmark. Each sample is a horizontally adjacent pair of left-label $y_{t-1}$ in a one-hot encoding, and right image-patch representation $h_t$. If the object has no left neighbor, $y_{t-1}$ is assigned to the zero vector, so the pairwise related parameters are not affected. After convergence of the training stage, we apply the batch-normalization inference procedure [34] to standardize the context embedding matrix by the training population statistics, and multiply it by the target embedding matrix to restore the CRF pairwise potential matrix for the inference stage. At test-time, we compute the CNN representation vector for each object in the sequence, normalize each of its features with the pre-calculated $\mu_j$ and $\sigma_j$, and classify the objects as described in subsection II-D.
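The step that restores the pairwise potential matrix after training can be sketched as below. This is a simplified reading that assumes plain standardization by population mean and variance, without the learned scale and shift of full batch normalization; all names are illustrative assumptions.

```python
import math


def restore_pairwise(e_left, e_target, pop_mean, pop_var, eps=1e-5):
    """Fold population statistics into the neighbor embeddings and
    multiply by the target embeddings to obtain a k x k pairwise matrix.

    e_left, e_target : embed_dim x num_classes
    pop_mean, pop_var: per-feature statistics of the neighbor embeddings
                       collected over the training population
    """
    dim = len(e_left)
    k = len(e_left[0])
    normed = [[(e_left[r][i] - pop_mean[r]) / math.sqrt(pop_var[r] + eps)
               for i in range(k)]
              for r in range(dim)]
    return [[sum(normed[r][i] * e_target[r][j] for r in range(dim))
             for j in range(k)]
            for i in range(k)]
```

The resulting matrix can then be passed, together with the unary scores, to a standard forward-backward or Viterbi routine.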

III-C Comparisons with Other Methods

In order to validate the performance of the proposed method we implemented several alternative methods. All the methods are based on the same context-less CNN local information. They differ in the way they learn the object context information from the training dataset and the way they integrate the context model with the local CNN soft-decision. Below is a list of the context models we implemented.

Unary: The baseline comparison model is the original CNN without any context information.

Pairwise Statistics: Following the work of [23], we created a CRF model with unary potentials taken from the CNN classifier prediction results, and pairwise potentials given by the pairwise statistics $p(y_t \mid y_{t-1})$ that are estimated from the training dataset. In other words, the context information is modeled by a stationary first-order Markov chain. No additional NN training is applied. The only parameter we need to set is the relative weight of the unary and pairwise potentials. This weight, which adjusts the tradeoff between the local appearance and the contextual information, was selected via cross validation.
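A minimal sketch of estimating such pairwise statistics from labeled training sequences is shown below. The add-one smoothing is an assumption added here to avoid zero probabilities for unseen pairs; the source does not state how unseen transitions were handled.

```python
from collections import Counter


def transition_probs(sequences, num_classes, smoothing=1.0):
    """Estimate p(right class | left class) from label sequences.

    sequences: iterable of label lists, ordered left to right on a shelf.
    smoothing: additive smoothing constant (an assumption of this sketch).
    """
    counts = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            counts[(left, right)] += 1
    probs = []
    for i in range(num_classes):
        row_total = (sum(counts[(i, j)] for j in range(num_classes))
                     + smoothing * num_classes)
        probs.append([(counts[(i, j)] + smoothing) / row_total
                      for j in range(num_classes)])
    return probs
```

The logarithm of each entry, scaled by the cross-validated weight, could then serve as the pairwise potential alongside the CNN log-probabilities.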

Mixture of Statistics CRFs: Still relying on the pairwise statistics summary model, we can also model global context information; for instance, the fact that all the objects in the sequence have the same label. We clustered the sequences into a mixture of $K$ Markov models, using the Expectation-Maximization (EM) algorithm. The training sequences are eventually split into $K$ different groups, and pairwise statistics are separately calculated for each one of them. At test time, the most probable Markov model is selected for each sequence, and the corresponding pairwise statistics CRF model is used. The mixture of Markov models method was examined with values of $K$ ranging from 2 to 16. It revealed chain groupings to some extent, but did not lead to a significant improvement in the overall classification performance compared to the baseline CRF model of $K = 1$.

We also tried an alternative (or complementary) clustering approach, in which we grouped the classes into clusters that maximize the mutual information between consecutively visited groups [35]. The pairwise potential was then defined by pairwise statistics of the class clusters. Distinct clusters of classes were identified, but we did not manage to harness this information for the task of non-hierarchic class identification.

Log-linear CRF: This method learns the log-linear parameters of the linear-chain CRF (3). We implemented both global and local approximate likelihood training methods and tried both $L_1$ and $L_2$ regularizations for the pairwise potential matrix. We also applied standardization on the one-hot input vectors. The results in all cases were comparable, and provided noticeable improvement over the baseline contextless classifier. The local training procedure is much more efficient, because its time complexity is linear in the number of classes, whereas the global training procedure is quadratic in the number of classes. Since the number of classes is 972 and the training dataset is large, this significantly affects training time even when applying extensive GPU parallelization to compute the partition function. In our experiments, the local training method was about 25 times faster than the global training method, given the same amount of GPU memory, and as we empirically found, its performance is nearly identical to that of the globally trained network.

Class-embedding CRF: This is the main model described in this study, where the CRF is enhanced into a much richer, but non-convex model by factorizing the pairwise weight matrix as defined in Eq. (5). We implemented both global and local training procedures, and studied several alternatives of embedding structures and likelihood approximations as elaborated below. In all cases, local likelihood approximations are much faster to train, but most of the methods for either local or global training provided similar or worse performance compared to the linear CRF model. The sole variant which remarkably improved performance is the objective structure in which all the output embedding features are standardized by batch-normalization. In this case, the local approximate likelihood method has two major advantages over global maximum likelihood: global optimization of LC-CRF is not only much more time-consuming, but also lacks the ability to apply a straightforward batch normalization strategy, since the activations are shared in multiple locations in each sample in the mini-batch.

Similar issues occur when applying more complex embedding structures: We originally considered other variants of the class embedding concept, in which the embedding parameters of the target and neighboring labels are tied. For that purpose, we impose the structure of the embedding matrix on the current class as well as the neighboring class. The pairwise potential in this case is factorized as $W = E^\top E$ to get the same embedding for the class and its neighbor. We may also apply the class embedding on the unary potentials matrix by factorizing $U$ through the class embedding matrix $E$. In these parametrizations, applying embedding-batch-norm would require parameter tying between the softmax inputs and the softmax weights, and thus compromise the effectiveness of the batch normalization process.

The same problem appears in other known methods of local likelihood approximation: Close variants of our local training model are the piecewise, pseudolikelihood and Piecewise-Pseudolikelihood (PWPL) methods (see details in appendix A). Applying embedding-batch-norm to the pseudolikelihood or PWPL methods would once again require parameter tying between the softmax inputs and the softmax weights. However, the PWPL in our case can be reduced to the form of a forward term, which is equivalent to the MEMM-like objective (8), and an additive backwards term which is independent of the CRF input. Hence, the MEMM-like objective function is theoretically closely related to PWPL.

We also tried to replace BN by standardization of the one-hot input vectors at the input of the embedding layer, but this approach does not affect the output of the embedding layer as BN does, and did not achieve improvement in performance. Hence, we favor the pairwise softmax architecture with the MEMM-like objective (10) and a BN layer between the embedding output and the softmax input. In addition, we tried increasing the model's non-linearity by adding another fully connected layer and nonlinear ReLU between the one-hot vector input and the fully connected embedding layer. We also tried learning the embedding in a higher dimensional space. These enhancements, however, did not improve performance, and turned out to be redundant.

Recurrent Neural Network: Another modeling option for sequence estimation is a Bi-Directional Recurrent Neural Network [36] with LSTM [37] as memory block (BiLSTM). In that approach we compute the posterior distribution of the current object label based on all the visual information provided by the CNN: $p(y_t \mid h_1, \dots, h_T)$. The BiLSTM architecture learns a context vector for each object, which encapsulates the bidirectional information in the sequence input observations transferred from the CNN output $h_1, \dots, h_T$, and learns a softmax prediction for each object label. This approach, however, did not improve on the original unary CNN. In our case the visual features of the neighbors hardly provide any additional information to the local visual features. The most important information, in addition to the object local appearance, is the label relations between neighboring objects, which are not captured here. Note that the BiLSTM network uses a softmax output layer that provides a separate prediction for each class and thus ignores class similarities. It is interesting to compare our task of visual-sequence classification with NLP sequence-tagging tasks such as POS or NER, where both the neighboring words and tags may be very informative and thus both CRF and BiLSTM have been shown to improve accuracy. The BiLSTM-CRF model [38], which stacks a linear-chain CRF over the BiLSTM context vectors, produces more accurate results than each one of them separately. In our case, however, such complex models are not required.

| Architecture | Learning | Potentials | Error % |
|---|---|---|---|
| Unary (no context) | CNN | None | 15 |
| BiLSTM | RNN | None | 15 |
| Pairwise Statistics CRF | Cross-Validation | Distributions | 15 |
| Mixture of Statistics CRFs | EM | Distribution per Cluster | 15 |
| Log-linear CRF | Global | Linear | 14 |
| Log-linear CRF | Approximate | Linear | 14 |
| Class-embedding CRF | Approximate | Factorized | 14 |
| Class-embedding CRF | Global | Factorized | 13 |
| Class-embedding CRF with BN | Approximate | Factorized | 11 |

TABLE II: Comparison of the object-level error rate between the different methods. The testing benchmark contains 90592 objects.

| Architecture | Learning | Potentials | Recall % | Precision % |
|---|---|---|---|---|
| Unary (no context) | CNN | None | 79 | 91 |
| BiLSTM | RNN | None | 80 | 91 |
| Pairwise Statistics CRF | Cross-Validation | Distributions | 83 | 91 |
| Mixture of Statistics CRFs | EM | Distribution per Cluster | 83 | 91 |
| Log-linear CRF | Global | Linear | 81 | 91 |
| Log-linear CRF | Approximate | Linear | 81 | 91 |
| Class-embedding CRF | Approximate | Factorized | 81 | 91 |
| Class-embedding CRF | Global | Factorized | 84 | 91 |
| Class-embedding CRF with BN | Approximate | Factorized | 87 | 91 |

TABLE III: Comparison of recall % when the score acceptance threshold is calibrated to yield 91% precision.
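
The calibration behind Table III can be sketched as follows. This is an illustrative procedure, not necessarily the one used for the reported numbers. It assumes that each object has a single prediction score, that a prediction is accepted when its score reaches the threshold, that precision is measured over accepted predictions, and that recall counts correct accepted predictions over all objects. The function name `recall_at_precision` and the toy data are hypothetical.

```python
def recall_at_precision(scores, correct, target_precision):
    """Return (threshold, recall) for the lowest threshold that meets the target precision.

    scores  : confidence of the predicted label for every object (assumed distinct)
    correct : 1 if the predicted label matches the ground truth, else 0
    """
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    best_threshold, best_recall = None, 0.0
    hits = 0
    for accepted, (score, is_correct) in enumerate(ranked, start=1):
        hits += is_correct
        # Accepting the top `accepted` predictions gives precision hits / accepted.
        if hits / accepted >= target_precision:
            best_threshold, best_recall = score, hits / len(ranked)
    return best_threshold, best_recall

# Toy data: six objects, made-up scores and correctness flags.
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50]
toy_correct = [1, 1, 0, 1, 1, 0]
threshold, recall = recall_at_precision(toy_scores, toy_correct, target_precision=0.75)
```

In this toy run the lowest admissible threshold is 0.60, which accepts five predictions, four of them correct, for a recall of 4/6.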

III-D Results

Table II reports the object-level error rate of each model and shows the incremental improvement in accuracy across model variations. The non-linear method based on batch-normalized class embedding achieves a lower error rate than the linear model (11% vs. 14%). Table III addresses what we defined as our original objective: maximizing recall while preserving at least 91% precision. Notably, our model, which combines local training with normalized class embeddings, is the only one that yields a significant improvement over the pairwise statistical-summary model of [23] for this objective. Because our benchmark is large, this improvement means that we correctly identified roughly 7,200 more objects than the unary model, and roughly 3,600 more objects than the pairwise-statistics model.
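
The object counts quoted above follow from the recall values in Table III and the benchmark size in the caption of Table II. The short check below reproduces them. Since the recall percentages are rounded, the results are approximate, and the variable names are illustrative only.

```python
benchmark_objects = 90592  # benchmark size stated in the Table II caption

# Recall (%) at 91% precision, copied from Table III.
recall_percent = {
    "unary": 79,
    "pairwise_statistics": 83,
    "class_embedding_bn": 87,
}

def extra_objects(model, baseline):
    """Additional correctly identified objects of `model` relative to `baseline`."""
    gain = recall_percent[model] - recall_percent[baseline]
    return round(gain * benchmark_objects / 100)

gain_over_unary = extra_objects("class_embedding_bn", "unary")
gain_over_pairwise = extra_objects("class_embedding_bn", "pairwise_statistics")
```

With these inputs the gains come to 7,247 and 3,624 objects, in line with the rounded figures of 7,200 and 3,600.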

III-E Class Embedding Analysis

As a byproduct of the classification model we also obtain a low-dimensional embedding of the different classes. Each column of the neighbor embedding matrix is a vector representation of the corresponding class. We measure the similarity between classes by the cosine of the angle between their vector representations, which is a common similarity metric. Fig. 5 shows several examples of an object class and its most similar classes. This similarity does not reflect visual appearance, e.g. in the second example the similar classes have very different colors. This situation has been studied extensively for the linguistic problem of word embedding. The goal of word embedding algorithms is to represent similar words by similar vectors. It is often useful to distinguish two kinds of similarity or association between words [39]. Two words have first-order co-occurrence if they typically appear near each other (e.g. wrote is a first-order associate of book or poem). Two words have second-order co-occurrence if they have similar neighbors (e.g. wrote is a second-order associate of words like said or remarked). Second-order word similarity is thus expected to capture semantic meaning and to measure the extent to which two words are replaceable, based on their tendency to appear in similar contexts. Figs. 5 and 6 show that the object class embedding captures second-order information: proximity here corresponds to the mutual tendency to have similar neighbors. The figures show that similar classes, although visually different, represent products of similar container types, volumes and brands.
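
The nearest-class lookup described above can be sketched as follows, assuming the class embeddings are available as plain vectors. The class names and the 2-dimensional vectors are invented for illustration; they are not taken from the learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_classes(query, embeddings, k=5):
    """Return the k classes whose embeddings are most similar to the query class."""
    sims = [(name, cosine(embeddings[query], vec))
            for name, vec in embeddings.items() if name != query]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in sims[:k]]

# Hypothetical class embeddings (one vector per class).
toy_embeddings = {
    "cola_can": [1.0, 0.1],
    "lemon_soda_can": [0.9, 0.2],
    "water_bottle": [-0.2, 1.0],
    "juice_carton": [0.1, 0.9],
}
neighbors = nearest_classes("cola_can", toy_embeddings, k=2)
```

Here `neighbors` is `["lemon_soda_can", "juice_carton"]`: the ranking depends only on the direction of the vectors, not on their length.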

Fig. 5: Class similarity examples. For each class we show its five nearest neighbors based on the cosine distance computed on the class embeddings.
Fig. 6: A visualization of the embedded classes in the second-order similarity space, created by using t-SNE to reduce the 32D space to a 2D space. It can be seen that classes are clustered according to shelf “semantic” (rather than visual) similarity and relations.

IV Conclusion

We introduced a novel technique to learn deep contextual and visual features for fine-grained structured prediction of object categories, and tested it on a dataset that contains spatial sequences of objects, and a large number of visually similar classes.

Our model clearly outperforms all the other tested models. This architecture appears to be the most straightforward generalization of a context-less classifier to be context-dependent when both the input and the context data require a large learning capacity: the network learns deep feature vectors for neighboring classes, analogously to the learned deep input representations. The Markovity and stationarity assumptions make it sufficient to train with individual objects as samples to enrich the training data diversity, allow for a simple embedding batch normalization, and boost the non-convex optimization process both in terms of time and performance.

Shelf-level classification may not be sufficient in situations such as the one depicted in Fig. 7, where it may help identify a few probable shelf-classification possibilities but cannot clearly decide on the best one. This can be resolved by defining additional spatial relationships in the graph, such as top-bottom edges or hyperedges between shelves, and learning their embeddings. A scene-level classification may then be inferred using belief propagation, variational methods or MCMC methods. Additional improvement may be achieved by integrating the bounding-box dimensions into the model input, applying end-to-end joint training of the CNN and the CRF, and using recurrent layers to model relations between object proposals and labels, learn sequence embeddings, and perform training and inference in loopy graphs.

Fig. 7: Example of a case where shelf-level classifications are not sufficient, and additional spatial relations need to be included in the model. The arrows point at the thumbnails of the ground-truth labels of the objects on each shelf.

Appendix A Local Likelihood Approximation

Pseudolikelihood [40] is a classical approximation of the CRF likelihood function that simultaneously classifies each node given its neighbors in the graph. The pseudolikelihood objective function is hence only dependent on the object and its Markov blanket. The pseudolikelihood of our model is:




Piecewise training [41] is a heuristic method to predict the graph factors from separate “pieces” of the graph. The piecewise objective function is equivalent to the likelihood function of a node-split graph [32], which contains all the single-factor components split from the original graph. Using the factor defined at (5), the piecewise likelihood in our case is:


Note that computing the piecewise likelihood is quadratic in the number of classes. Piecewise pseudolikelihood (PWPL) is the standard pseudolikelihood applied to the node-split graph. Its computation is efficient because the objective function is simply the sum of local conditional probabilities. Sutton and McCallum [32] showed that in many cases PWPL is more accurate than standard pseudolikelihood, and that in some scenarios it performs nearly as well as the piecewise approximation and even global maximum likelihood. In our case, applying the pseudolikelihood approach to the piecewise objective (13) gives the PWPL form:


The first term inside the function is equivalent to the forward MEMM objective function (8), while the second term can be reduced to derive the PWPL form:


where the backwards term


The backwards term of the PWPL (16) is independent of the CRF input, and hence the MEMM-like objective function is closely related, in theory, to PWPL. Thus, the approximate likelihood we use for training, which is based on the MEMM model, can be viewed as a simplified version of the piecewise-pseudolikelihood objective (15), which was found to be the preferred likelihood approximation for language processing tasks [32].
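
To make the local approximations discussed in this appendix concrete, the sketch below evaluates two of them on a toy linear chain: the standard pseudolikelihood, which conditions each label on both neighbors, and a forward MEMM-style local likelihood, which conditions only on the previous label. It uses a generic parameterization with per-node unary scores and a single pairwise score matrix. This is an assumption made for illustration, and it does not reproduce the exact form of equations (12) to (16).

```python
import math

def log_softmax_at(scores, index):
    """Log of the softmax probability assigned to scores[index]."""
    m = max(scores)
    return scores[index] - m - math.log(sum(math.exp(s - m) for s in scores))

def pseudo_log_likelihood(unary, pairwise, labels):
    """Sum over nodes of log p(y_t | y_{t-1}, y_{t+1}, x) for a linear chain."""
    total = 0.0
    for t, y in enumerate(labels):
        scores = []
        for k in range(len(pairwise)):
            s = unary[t][k]
            if t > 0:                      # factor shared with the left neighbor
                s += pairwise[labels[t - 1]][k]
            if t + 1 < len(labels):        # factor shared with the right neighbor
                s += pairwise[k][labels[t + 1]]
            scores.append(s)
        total += log_softmax_at(scores, y)
    return total

def forward_local_log_likelihood(unary, pairwise, labels):
    """Sum over nodes of log p(y_t | y_{t-1}, x_t): only the left neighbor is used."""
    total = 0.0
    for t, y in enumerate(labels):
        scores = list(unary[t])
        if t > 0:
            scores = [s + pairwise[labels[t - 1]][k] for k, s in enumerate(scores)]
        total += log_softmax_at(scores, y)
    return total

# Toy chain: 3 objects, 2 classes (all numbers are made up).
toy_unary = [[2.0, 0.0], [0.5, 0.4], [0.0, 1.0]]
toy_pairwise = [[1.0, -1.0], [-1.0, 1.0]]  # pairwise[a][b]: score of label a followed by b
toy_labels = [0, 0, 1]
pl = pseudo_log_likelihood(toy_unary, toy_pairwise, toy_labels)
fwd = forward_local_log_likelihood(toy_unary, toy_pairwise, toy_labels)
```

Both objectives normalize over the classes of a single node, so their cost per object is linear in the number of classes, whereas normalizing over a pair of labels, as in the piecewise likelihood, is quadratic.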


  • [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
  • [4] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, “SSD: Single shot multibox detector,” in ECCV, 2016.
  • [5] Spyros Gidaris and Nikos Komodakis, “Locnet: Improving localization accuracy for object detection,” in CVPR, 2016.
  • [6] Joseph Redmon and Ali Farhadi, “Yolo9000: Better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2016.
  • [7] Di Lin, Xiaoyong Shen, Cewu Lu, and Jiaya Jia, “Deep lac: Deep localization, alignment and classification for fine-grained recognition,” in CVPR, 2015.
  • [8] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell, “Part-based R-CNNs for fine-grained category detection,” in ECCV, 2014.
  • [9] Antonio Torralba, “Contextual priming for object detection,” International journal of computer vision, vol. 53, no. 2, pp. 169–191, 2003.
  • [10] Santosh K Divvala, Derek Hoiem, James H Hays, Alexei A Efros, and Martial Hebert, “An empirical study of context in object detection,” in CVPR, 2009.
  • [11] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [12] Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge Belongie, “Objects in context,” in ICCV, 2007.
  • [13] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv preprint arXiv:1606.00915, 2016.
  • [14] Stephen Gould, Richard Fulton, and Daphne Koller, “Decomposing a scene into geometric and semantically consistent regions,” in ICCV, 2009.
  • [15] Jian Yao, Sanja Fidler, and Raquel Urtasun, “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in CVPR, 2012.
  • [16] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr, “Conditional random fields as recurrent neural networks,” in ICCV, 2015.
  • [17] Alexander G Schwing and Raquel Urtasun, “Fully connected deep structured networks,” arXiv preprint arXiv:1503.02351, 2015.
  • [18] Yuxuan Wang and DeLiang Wang, “Cocktail party processing via structured prediction,” in NIPS, 2012.
  • [19] Filip Korzeniowski and Gerhard Widmer, “A fully convolutional deep auditory model for musical chord recognition,” in MLSP, 2016.
  • [20] Gang Chen, Yawei Li, and Sargur N Srihari, “Word recognition with deep conditional random fields,” in ICIP, 2016.
  • [21] Ninghang Hu, Gwenn Englebienne, Zhongyu Lou, and Ben Kröse, “Learning latent structure for activity recognition,” in IEEE Int. Conf. on Robotics and Automation (ICRA), 2014, pp. 1048–1053.
  • [22] Liang-Chieh Chen, Alexander Schwing, Alan Yuille, and Raquel Urtasun, “Learning deep structured models,” in ICML, 2015.
  • [23] Wenqing Chu and Deng Cai, “Deep feature based contextual model for object detection,” arXiv preprint arXiv:1604.04048, 2016.
  • [24] John Lafferty, Andrew McCallum, and Fernando CN Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in ICML, 2001.
  • [25] Jian Peng, Liefeng Bo, and Jinbo Xu, “Conditional neural fields,” in NIPS, 2009.
  • [26] Thierry Artieres et al., “Neural conditional random fields,” in AISTATS, 2010.
  • [27] Yoshua Bengio, Aaron Courville, and Pascal Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  • [28] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun, “The loss surfaces of multilayer networks,” in AISTATS, 2015.
  • [29] Charles Sutton and Andrew McCallum, “An introduction to conditional random fields for relational learning,” Introduction to statistical relational learning, pp. 93–128, 2006.
  • [30] Andrew McCallum, Dayne Freitag, and Fernando CN Pereira, “Maximum entropy markov models for information extraction and segmentation,” in ICML, 2000.
  • [31] Sham Kakade, Yee Whye Teh, and Sam T Roweis, “An alternate objective function for markovian fields,” in ICML, 2002.
  • [32] Charles Sutton and Andrew McCallum, “Piecewise pseudolikelihood for efficient training of conditional random fields,” in ICML, 2007.
  • [33] Genevieve B Orr and Klaus-Robert Müller, Neural networks: tricks of the trade, Springer, 2003.
  • [34] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
  • [35] Amir Alush, Avishay Friedman, and Jacob Goldberger, “Pairwise clustering based on the mutual-information criterion,” Neurocomputing, vol. 182, pp. 284–293, 2016.
  • [36] Mike Schuster and Kuldip K Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [37] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [38] Zhiheng Huang, Wei Xu, and Kai Yu, “Bidirectional LSTM-CRF models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
  • [39] Hinrich Schütze and Jan Pedersen, “A vector model for syntagmatic and paradigmatic relatedness,” in Proc. of the 9th Annual Conference of the UW Centre for the New OED and Tex, 1993.
  • [40] Julian Besag, “Statistical analysis of non-lattice data,” The statistician, pp. 179–195, 1975.
  • [41] Charles Sutton and Andrew McCallum, “Piecewise training for undirected models,” in UAI, 2005.