I Introduction
Object recognition is one of the fundamental problems in computer vision. It involves finding and identifying objects in images, and plays an important role in many real-world applications such as advanced driver assistance systems, military target detection, diagnosis with medical images, video surveillance, and identity recognition.
Over the past few years, deep convolutional neural networks (CNNs) have led to remarkable progress in image classification [1, 2], and resulted in reliable appearance-based detectors, e.g. [3, 4, 5, 6]. Fine-grained object recognition aims to identify subcategory object classes, which includes finding subtle differences among visually similar subcategories. Recent studies achieved good performance on fine-grained tasks [7, 8]. However, the problem remains extremely difficult when the dataset categories are nearly identical in terms of their visual appearance. In this case, object categories are often virtually indistinguishable, since the discriminative features may be masked by inadequate observation or visual artifacts.
This study addresses the problem of classifying a sequence of objects based on their visual appearance and their relative locations. Our dataset contains photos of retail-store product displays, taken in varying settings and viewpoints. We need to identify the class of each product at the front of the shelves. The dataset is characterized by a distinct geometric object structure (sequences of shelves), a large number of classes, and very subtle visual differences between groups of classes: some classes differ only in size or minor design details. The unique challenges in this task involve handling the large number of possible classes, and the fact that the classes are not clearly distinguishable by their appearance but rather by their context. For example, products with identical appearance but with different container volumes are considered different classes (see examples in Figs. 1 and 2).
Because the local appearance of an object may not suffice for accurate categorization, additional information needs to be considered. In real-world images, contextual data provides useful information about spatial and semantic relationships between objects. Modeling a joint visual-contextual classifier is non-trivial in that some contextual cues are very informative, whereas others are irrelevant, or even misleading. Most deep learning detectors classify each detected object individually without taking the contextual information into account.
Context has been used to improve performance for image understanding tasks in various ways [9, 10, 11]. Graphical models have been widely applied to visual and auditory analysis tasks, by jointly modeling local features and contextual relations. Tasks addressed by these models include image segmentation and object recognition [12, 13, 14, 15, 16, 17], as well as speech [18], music [19], text [20] and video analysis [21].
Few studies have applied deep learning features or detection results to context models. Chen et al. [22] explored several techniques to learn structured models jointly with deep features that form MRF potentials. Chu and Cai [23] evaluated the performance of a joint CRF model on Faster R-CNN [3] detection results, using an a priori statistical summary for the pairwise potentials. Korzeniowski and Widmer [19] introduced a two-stage learning model for musical chord recognition: one network learns a single-frame representation, and the other learns the potentials of a linear-chain CRF model, using the frame representations as the CRF input. The aforementioned CRF models use pairwise potentials to represent object-pair interaction. They allocate a different parameter for each class pair. This approach, which ignores class similarities, may be sufficient for small sets of distinct classes. However, it is not suitable for a large class set that contains visually similar classes. Our dataset, which includes many visually similar categories, nearly a thousand classes, and a million possible pairwise transitions overall, requires a more advanced learning mechanism. In most previous object recognition studies the visual information was dominant. In our task, context information also makes a significant contribution.
In this study we provide a CRF-based method that explicitly learns the embedding of classes with respect to their neighbors' visual features. This is achieved by factorizing the CRF pairwise potential matrix to impose the desired structure of class embedding in a low-dimensional space. Our model learns the factorized parameters, and yields a joint contextual-visual embedding of the classes. To efficiently train the network, we introduce a pairwise softmax architecture which optimizes a local approximation of the likelihood. Since the factorized loss function is not convex, we exploit the simplicity of the local approximation architecture to include batch-normalization-related regularization for the object samples, and attain a dramatic improvement not only in training time but also in the overall performance of the trained model. At test time, dynamic programming techniques are used for efficient exact inference of classes.
The contributions of this work are the following:

Combining deep class embedding with a CRF formulation, which enables handling datasets with a huge number of classes.

An approximate-likelihood training procedure that is computationally efficient and, unlike exact CRF likelihood training, easily incorporates batch normalization.
The rest of the paper is organized as follows. In the next section we describe a CRF model with a class embedding formulation and present the learning and inference algorithms. Section III contains a detailed data description and comparative experimental results, and conclusions are given in Section IV.
II CRF With Class Embedding
II-A Model Formulation
Our study is motivated by images of store shelves, where a large number of objects with many possible classes appear in a single image, and we want to classify each object using both visual and context information. A preprocessing detection stage is applied to extract the detection bounding boxes, crop their image patches and organize them according to their locations on the shelves. The input data for our task are sequences of images: each image captures an individual product, and the images are organized in sequences ordered according to their relative positions on the shelves. Let $x_{i,t}$ denote the image in position $t$ of sequence $i$, and $y_{i,t}$ the corresponding class label. For notational simplicity, we omit the index $i$ when referring to an individual sequence.
We wish to predict the sequence of target labels $y = (y_1, \ldots, y_n)$, given a sequence of observations $x = (x_1, \ldots, x_n)$. Standard classification approaches use a CNN to predict each object-level observation individually, implicitly assuming independence between object samples. In order to include context in the classification process, we model the sequences as a CRF.
The linear-chain conditional random field (LC-CRF) [24] is a type of discriminative undirected probabilistic graphical model, whose conditional distribution $p(y \mid x)$ obeys a conditional Markov property. The joint probability distribution of a linear-chain CRF is:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t=1}^{n} \psi(y_{t-1}, y_t, x_t)\Big) \tag{1}$$
where $x = (x_1, \ldots, x_n)$ is the sequence of observation feature vectors, $y = (y_1, \ldots, y_n)$ is the corresponding sequence of target labels, $\psi$ is the model potential function, $Z(x)$ is the partition function, defined as the global probability normalization over all possible sequence label assignments of length $n$, and $y_0 = 0$. Assume that the potential function is defined as:
$$\psi(y_{t-1}, y_t, x_t) = y_{t-1}^\top W y_t + x_t^\top U y_t + b^\top y_t \tag{2}$$
where the matrix $W$ holds the pairwise potentials, the matrix $U$ holds the unary potentials, and the vector $b$ is the label bias; all are model parameters, and we use a one-hot encoding for the labels. The model is therefore log-linear, and its log-likelihood function is concave.
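To make the role of the partition function concrete, the following is a minimal pure-Python sketch of the log-likelihood of a linear-chain CRF whose potential is the sum of a pairwise term, a unary term and a label bias, as described above. The names `W`, `U`, `b`, the nested-list layout and the toy values are assumptions made for illustration, and the first object is assumed to have no pairwise term.

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def potential(prev, cur, x, W, U, b):
    """Score of label `cur` at one position; prev=None means no left neighbor."""
    pairwise = 0.0 if prev is None else W[prev][cur]
    unary = sum(x[d] * U[d][cur] for d in range(len(x)))
    return pairwise + unary + b[cur]

def sequence_score(labels, xs, W, U, b):
    """Sum of potentials along one labeled sequence."""
    total, prev = 0.0, None
    for y, x in zip(labels, xs):
        total += potential(prev, y, x, W, U, b)
        prev = y
    return total

def log_partition(xs, W, U, b):
    """Forward algorithm in log space: O(n * k^2) for n positions and k classes."""
    k = len(b)
    alpha = [potential(None, j, xs[0], W, U, b) for j in range(k)]
    for x in xs[1:]:
        alpha = [logsumexp([alpha[i] + potential(i, j, x, W, U, b) for i in range(k)])
                 for j in range(k)]
    return logsumexp(alpha)

def brute_force_log_partition(xs, W, U, b):
    """Enumerate all k^n labelings; only feasible for tiny problems."""
    k = len(b)
    return logsumexp([sequence_score(ys, xs, W, U, b)
                      for ys in product(range(k), repeat=len(xs))])

def log_likelihood(labels, xs, W, U, b):
    """log p(labels | xs) = sequence score minus log partition function."""
    return sequence_score(labels, xs, W, U, b) - log_partition(xs, W, U, b)

# Toy problem: k = 3 classes, d = 2 features, n = 3 positions (arbitrary values).
W = [[0.5, -0.2, 0.0], [0.1, 0.3, -0.4], [-0.3, 0.2, 0.6]]
U = [[1.0, -1.0, 0.5], [0.2, 0.4, -0.6]]
b = [0.0, 0.1, -0.1]
xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(log_likelihood([0, 1, 2], xs, W, U, b))
```

The inner loop over all pairs of classes in `log_partition` is where the quadratic dependence on the number of classes comes from; the brute-force function is only a sanity check for small cases.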
Combining neural networks with CRFs gives a fully differentiable model that can be learned jointly, as shown e.g. in [25, 26]. For the task at hand, however, we found it sufficient to train both parts separately, and apply transfer learning for faster convergence and easier regularization. We can train a local CNN to classify individual objects, and then interpret the hidden layer activations as a non-linear representation of the input image [27]. In the spirit of transfer learning, we can now discard the CNN softmax layer, and use the convolutional layers to compute the feature vectors of the input images. For an image detection $x_t$ we define the feature vector $h_t = h(x_t)$ as the activations of the last fully-connected hidden layer, and use it as the CRF input observation feature vector:
$$\psi(y_{t-1}, y_t, x_t) = y_{t-1}^\top W y_t + h_t^\top U y_t + b^\top y_t \tag{3}$$
The score function (3) is still concave, but its input is computed from a non-concave source, a deep CNN. The rationale for using a deep representation for the input images is clear: as introduced by Krizhevsky et al. [1], the immense complexity of the visual object recognition task requires a model with a very large learning capacity. Convolutional layers provide the structure required for learning visual features of the unary input. We would like to craft a suitable structure to learn the pairwise contextual relations as well.
CRFs were originally applied to language processing tasks such as part-of-speech (POS) tagging and named entity recognition (NER) [24]. In most applications of CRFs to either language or image understanding, there are no more than a few dozen different classes. In our dataset, we have hundreds of classes. The pairwise transition between two classes has nearly a million possible states, whereas the CRF function (3) has a log-linear form, and contains a single parameter per ordered transition pair. In order to properly learn and generalize the massive variety of possible neighboring patterns, we enforce a structure on the pairwise potential matrix: the goal is to learn a neighboring-class embedding in a feature vector space. For this purpose, we define a low-dimensional decomposition of the pairwise potential matrix $W$ as the product of the left-side neighbor embedding matrix $N$ and the class embedding matrix $E$:
$$W = N^\top E \tag{4}$$
The columns of $E$ are low-dimensional embeddings of the target classes, and the columns of $N$ are embeddings of the classes of the left-side object. Substituting the matrix factorization (4) into the CRF potential function (3) we get:
$$\psi(y_{t-1}, y_t, x_t) = (N y_{t-1})^\top (E y_t) + h_t^\top U y_t + b^\top y_t \tag{5}$$
The objective function is no longer linear or concave with respect to the network parameters, but deep learning training techniques have been shown to yield good results for non-convex optimization tasks [28]. This simply means that we need to apply the deep learning approach not only to the input image representations, but also to the neighboring transition parameters.
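The effect of the factorization can be illustrated with a short sketch. It assumes the two embedding matrices are stored as nested lists with one column per class (the names `neighbor_emb` and `class_emb`, the toy values, and the embedding size of 32 are hypothetical and not taken from the paper), and it shows that each pairwise potential becomes a dot product of two low-dimensional vectors.

```python
def pairwise_matrix(neighbor_emb, class_emb):
    """Full pairwise matrix as the product of the transposed neighbor embedding
    matrix and the class embedding matrix (both m x k, one column per class)."""
    m, k = len(neighbor_emb), len(neighbor_emb[0])
    return [[sum(neighbor_emb[r][i] * class_emb[r][j] for r in range(m))
             for j in range(k)] for i in range(k)]

def pairwise_score(neighbor_emb, class_emb, prev, cur):
    """Same entry [prev][cur], computed directly as a dot product of two embeddings."""
    return sum(neighbor_emb[r][prev] * class_emb[r][cur]
               for r in range(len(neighbor_emb)))

def parameter_counts(num_classes, embed_dim):
    """(full pairwise matrix, factorized form) parameter counts."""
    return num_classes * num_classes, 2 * embed_dim * num_classes

# Toy embeddings: m = 2 dimensions, k = 3 classes (arbitrary values).
N = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
E = [[0.2, 0.4, -0.1], [1.0, 0.0, 0.3]]
W = pairwise_matrix(N, E)

# 972 classes as in the dataset; 32 is a hypothetical embedding size.
full, factorized = parameter_counts(972, 32)
print(full, factorized)
```

With 972 classes the full matrix has 944,784 entries, which is the "nearly a million" transitions mentioned above, while the factorized form grows only linearly with the number of classes.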
II-B Training
The CRF model defined above can be trained in a supervised manner by maximizing the log-likelihood of all sequences in the training dataset [24]:
$$L(\theta) = \sum_{i} \log p(y_i \mid x_i; \theta) \tag{6}$$
where the vector $y_i$ contains the ground-truth labels of the $i$-th sequence, $x_i$ contains the corresponding object feature vectors of the sequence of observations, $i$ goes over the sequences in the training data, $\theta$ denotes the model parameters, and the likelihood is defined in (1) with the potentials (5). Since the underlying graph is loop-free, it is tractable to compute the likelihood function and its gradient using the forward-backward algorithm [29]. However, the optimization is relatively slow for a large number of classes, because its complexity is quadratic in the number of possible classes. In order to speed up the training process, we can estimate the parameters locally, by optimizing an approximate objective function. A local approximation of the likelihood would require samples of individual objects and their immediate neighbors rather than entire sequences.
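Linear-chain CRFs were originally introduced as an improvement on the maximum entropy Markov model (MEMM) [30], which is essentially a Markov model in which the transition distributions are given by a logistic regression model. The main difference between CRFs and MEMMs is that an MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, whereas the CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. A CRF and an MEMM can be written using the same set of parameters. The MEMM directed graphical model in our case is:
$$p(y \mid h) = \prod_{t=1}^{n} p(y_t \mid y_{t-1}, h_t) \tag{7}$$
where
$$p(y_t \mid y_{t-1}, h_t) = \frac{\exp\big((N y_{t-1})^\top E y_t + h_t^\top U y_t + b^\top y_t\big)}{\sum_{y'} \exp\big((N y_{t-1})^\top E y' + h_t^\top U y' + b^\top y'\big)} \tag{8}$$
Here $h_t$ is the CNN feature vector of the $t$-th object, $N$ and $E$ are the neighbor and class embedding matrices, $U$ and $b$ are the unary potentials and the label bias, and $y'$ ranges over the one-hot encodings of all classes.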
One major advantage of MEMMs over CRFs (and HMMs) is that training can be considerably more efficient. Unlike CRFs, in MEMMs the parameters of the maximum-entropy distributions used for the transition probabilities can be estimated for each transition distribution separately. When an MEMM is applied for inference, it suffers from the label bias problem [24, 31], which may lead to a drop in performance in some applications. Here, however, the MEMM objective is used only as a local approximation to learn the parameter set of the linear-chain CRF model, whereas the test-time inference uses the global normalization of the CRF model and thus avoids the label bias problem. The objective function is now defined as the conditional probability of the current object class, given the class of the left-side neighbor object:
$$L(\theta) = \sum_{i} \sum_{t} \log p(y_{i,t} \mid y_{i,t-1}, h_{i,t}) \tag{9}$$
where $i$ goes over the sequences and $t$ goes over the objects in the sequence, $h_{i,t}$ is the object CNN-based representation, $y_{i,t}$ is the true class label and $p(y_{i,t} \mid y_{i,t-1}, h_{i,t})$ is as defined in (8). Note that the computational complexity of the MEMM likelihood (9) is linear in the number of classes, unlike the CRF likelihood, whose computational complexity is quadratic.
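The local objective can be sketched as a softmax over classes that is conditioned on the true left label. The following pure-Python sketch assumes a full pairwise matrix `W`, unary weights `U` and bias `b` stored as nested lists (all names and values are illustrative assumptions); a factorized pairwise term would simply replace the lookup `W[prev][j]`.

```python
import math

def local_log_prob(prev, cur, x, W, U, b):
    """log p(cur | prev, x): one softmax over the k classes, so the cost per
    sample is linear in k rather than quadratic."""
    k = len(b)
    scores = []
    for j in range(k):
        pairwise = 0.0 if prev is None else W[prev][j]
        unary = sum(x[d] * U[d][j] for d in range(len(x)))
        scores.append(pairwise + unary + b[j])
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[cur] - log_norm

def local_log_likelihood(sequences, W, U, b):
    """Sum of local log-probabilities over all objects of all sequences."""
    total = 0.0
    for labels, xs in sequences:
        prev = None
        for y, x in zip(labels, xs):
            total += local_log_prob(prev, y, x, W, U, b)
            prev = y  # the true left label is observed at training time
    return total

# Toy problem: k = 3 classes, d = 2 features (arbitrary values).
W = [[0.5, -0.2, 0.0], [0.1, 0.3, -0.4], [-0.3, 0.2, 0.6]]
U = [[1.0, -1.0, 0.5], [0.2, 0.4, -0.6]]
b = [0.0, 0.1, -0.1]
xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(local_log_likelihood([([0, 1, 2], xs)], W, U, b))
```

Because each term depends only on one object and its left label, the terms can be evaluated independently and in any order, which is what makes pair-level mini-batches possible.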
This surrogate likelihood function, whose samples are pairs of objects and corresponding neighboring labels, can be used at train time to accelerate the training process. Because the model is stationary and conditionally independent of indirect neighbors, breaking the samples from sequences into adjacent pairs of direct neighbors does not necessarily eliminate significant contextual information. Rather, when learning the non-convex objective of the class-embedding CRF, it may enrich the training dataset, improve the stochastic nature of the SGD optimization process, simplify and improve regularization techniques such as batch normalization, and help prevent overfitting, since there are many more object samples than sequence samples, and the mini-batches are composed of adjacent pairs of objects taken from random training samples. In contrast, restricting the mini-batches to contain full sequences would decrease the model's freedom to discover better solutions for the objective of pairwise transition parameters. In fact, as we empirically show in the next section, optimizing the local approximate likelihood with object-level batch normalization yields better results than optimizing the unnormalized global LC-CRF likelihood. In the appendix we review standard likelihood approximation strategies for efficient CRF training and show that the training method used in this study can be viewed as a simplified version of the piecewise-pseudolikelihood approximation [32].
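Breaking sequences into adjacent pairs can be sketched as follows. The helper names `sequences_to_pairs` and `minibatches`, the tuple layout, and the fixed seed are assumptions for illustration; `None` stands in for a missing left neighbor.

```python
import random

def sequences_to_pairs(sequences):
    """Flatten (labels, features) sequences into (left_label, feature, label) samples."""
    pairs = []
    for labels, feats in sequences:
        left = None  # the first object on a shelf has no left neighbor
        for y, h in zip(labels, feats):
            pairs.append((left, h, y))
            left = y
    return pairs

def minibatches(pairs, batch_size, seed=0):
    """Yield shuffled mini-batches, so each batch mixes pairs from many sequences."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

# Two toy shelves; strings stand in for CNN feature vectors.
sequences = [([3, 3, 5], ["a", "b", "c"]), ([1, 2], ["d", "e"])]
pairs = sequences_to_pairs(sequences)
for batch in minibatches(pairs, batch_size=2):
    print(batch)
```

Two shelves yield five training samples here, which reflects the point above that there are many more object samples than sequence samples.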
II-C Feature Scaling with Batch Normalization
In optimization, feature standardization or whitening is a common procedure that has been shown to speed up convergence [33]. In deep neural networks, whitening the inputs to each layer may also prevent convergence to poor local optima. However, training a deep neural network is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, so each layer needs to continuously adapt to a new input distribution. The batch normalization (BN) [34] method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch.
In our model, we found it advantageous to standardize the input features of the softmax layer. They are composed of the visual features of the CNN and the learned neighbor embeddings (see Fig. 4). The standardization of this feature vector is important in order to avoid inherent bias between the local visual and contextual information. The goal is to encourage each feature of the softmax input to have zero mean and unit variance. Since we use a pre-trained CNN, we can standardize the visual features in an offline preprocessing stage. In contrast, the embeddings are learned jointly with the softmax layer, and hence we use the batch normalization [34] method to learn their mini-batch normalization during the training process. In fact, since the input of the embedding layer is a one-hot vector, the batch normalization process directly standardizes each feature in the embedding space.
Formally, by applying batch normalization to the context representation, Eq. (8) is replaced by:
$$p(y_t \mid y_{t-1}, h_t) = \frac{\exp\big(\mathrm{BN}(N y_{t-1})^\top E y_t + h_t^\top U y_t + b^\top y_t\big)}{\sum_{y'} \exp\big(\mathrm{BN}(N y_{t-1})^\top E y' + h_t^\top U y' + b^\top y'\big)} \tag{10}$$
where $\mathrm{BN}(\cdot)$ denotes batch normalization of the neighbor embedding $N y_{t-1}$, $E$ is the class embedding matrix, $h_t$ is the standardized CNN feature vector, $U$ and $b$ are the unary potentials and the label bias, and $y'$ ranges over all classes.
A major advantage of the approximate likelihood we are using is that, unlike sequential models such as the CRF, here it is very simple and effective to apply embedding batch normalization for each neighboring-label sample.
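A minimal sketch of the training-time standardization of neighbor embeddings is given below. It assumes the neighbor embedding matrix is an m x k nested list with one column per class, that every sample in the mini-batch has a left neighbor, and that no learned scale or shift is applied; these simplifications and the name `batch_norm_embeddings` are assumptions for illustration, not the authors' implementation.

```python
import math

def batch_norm_embeddings(neighbor_emb, left_labels, eps=1e-5):
    """Look up the embedding column of each left label and standardize every
    embedding feature over the mini-batch."""
    m = len(neighbor_emb)
    batch = [[neighbor_emb[r][y] for r in range(m)] for y in left_labels]
    n = len(batch)
    means = [sum(v[r] for v in batch) / n for r in range(m)]
    variances = [sum((v[r] - means[r]) ** 2 for v in batch) / n for r in range(m)]
    normed = [[(v[r] - means[r]) / math.sqrt(variances[r] + eps) for r in range(m)]
              for v in batch]
    return normed, means, variances

# Toy neighbor embeddings: m = 2 dimensions, k = 3 classes (arbitrary values).
neighbor_emb = [[1.0, 3.0, 5.0], [2.0, 2.0, 8.0]]
normed, means, variances = batch_norm_embeddings(neighbor_emb, [0, 1, 2, 1])
print(means, variances)
```

Since each sample contributes a single embedding column, the mini-batch statistics are computed over independent samples, which is straightforward in the pairwise setting.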
II-D Inference
At test time, global classification is applied to the linear-chain CRF. Dynamic programming algorithms may be used for efficient and exact inference as follows: the Viterbi algorithm finds the most probable sequence label assignment, and the forward-backward algorithm extracts the marginal probability of each item by summation over all possible assignments [29]. Note that although training was done by local likelihood approximation, under the assumption that the predecessor label is known, at test time we apply the global normalization over all possible object sequences. The proposed training and inference methods are summarized in Table I. Fig. 4 shows an illustration of the training architecture.
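Viterbi decoding for a linear chain can be sketched in a few lines. The sketch assumes that the local scores (unary term plus bias) have already been collected in `unary[t][j]` and that `W[i][j]` holds the pairwise potential for left label `i` and current label `j`; both names and the toy values are illustrative assumptions.

```python
from itertools import product

def viterbi(unary, W):
    """Return the highest-scoring label sequence for a linear chain."""
    n, k = len(unary), len(unary[0])
    score = list(unary[0])
    back = []
    for t in range(1, n):
        new_score, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + W[i][j])
            new_score.append(score[best_i] + W[best_i][j] + unary[t][j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    last = max(range(k), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    return path[::-1]

def brute_force(unary, W):
    """Exhaustive search over all labelings; only for checking tiny cases."""
    n, k = len(unary), len(unary[0])
    def total(ys):
        return (sum(unary[t][ys[t]] for t in range(n))
                + sum(W[ys[t - 1]][ys[t]] for t in range(1, n)))
    return list(max(product(range(k), repeat=n), key=total))

# Toy case: the pairwise term rewards repeating the same class.
unary = [[2.0, 0.0], [0.0, 0.1], [0.0, 0.2]]
W = [[1.0, -1.0], [-1.0, 1.0]]
print(viterbi(unary, W))  # [0, 0, 0]
```

In this toy case the per-object argmax would give `[0, 1, 1]`, while the joint decoding prefers `[0, 0, 0]` because the context term outweighs the weak local evidence at the last two positions.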
III Experiments
III-A The Dataset
Our dataset contains photos of retail-store displays, taken in supermarkets and grocery stores. The images capture arbitrary subsections of the displays, in varying settings and viewpoints. The objects are the inventory items positioned at the front of the displays, and the classes are their stock-keeping-unit (SKU) unique identifiers. Each object is annotated with its class label and bounding-box coordinates. The objects in each image are grouped into shelves: sequences of horizontal layouts, sorted from left to right.
The benchmark contains 24,024 images, 76,081 sequences, and 460,121 objects, each labeled as one of 972 different classes. Sequence lengths vary from 2 to 32, and are typically between 4 and 12. We split the dataset into 80% training and 20% testing.
Many groups of classes belong to the same archetype, and differ only in minor details such as volume, flavor, nutrient content, etc. They often share similar visual features, which makes appearance-based classification very difficult (Fig. 2). On the other hand, the object layout behavior is very coherent: it is dictated by the supplier planograms (specified product layouts) and extracted from the image realograms (observed product layouts). Although realograms are non-deterministic by nature, consistent semantic patterns are frequently spotted. Class transition behavior may be discovered, revealing tendencies of pairs to appear as left-to-right neighbors, and of individual classes to appear multiple times in succession (Fig. 3). The unique challenges we face in our task derive from the large number of visually similar classes, which co-occur in distinct structures in large-scale images. Since the images capture arbitrary subsections of the shelf displays, the visual appearance of an object sequence on a shelf varies unpredictably in terms of relative positions and occasional unnoticed or absent elements. Nevertheless, the co-occurrence statistics remain stable in most cases, which justifies the stationarity and Markovity assumptions for the structure modeling.
III-B Implementation Details
We first train an AlexNet CNN [1] to compute the hidden representation vector for each image patch. In our implementation the hidden layer size was. Then, as a preprocessing step for the CRF model, we calculate the mean and standard deviation of each feature of the hidden representation vector from the training dataset: $\mu$ and $\sigma$.
The number of classes in our dataset is 972, and the class embedding dimensionality we use is . We learn a class embedding matrix $E$, a neighbor embedding matrix $N$, a unary potential matrix $U$ and a bias vector $b$. We train the network as described in subsection II-B, using SGD with mini-batches of size 128, and maximizing the log-likelihood function (9), with the conditional probability as defined in (10), and regularization factor for all network parameters. The training samples in each mini-batch are object pairs selected randomly from the benchmark. Each sample is a horizontally adjacent pair consisting of the left label $y_{t-1}$ in a one-hot encoding and the right image-patch representation $h_t$. If the object has no left neighbor, $y_{t-1}$ is set to the zero vector, so the pairwise-related parameters are not affected. After convergence of the training stage, we apply the batch normalization inference procedure [34] to standardize the context embedding matrix by the training population statistics, and multiply it by the target embedding matrix to restore the CRF pairwise potential matrix for the inference stage. At test time, we compute the CNN representation vector for each object in the sequence, normalize each of its features with the precalculated $\mu$ and $\sigma$, and classify the objects as described in subsection II-D.
III-C Comparisons with Other Methods
In order to validate the performance of the proposed method we implemented several alternative methods. All the methods are based on the same context-less CNN local information. They differ in the way they learn the object context information from the training dataset and in the way they integrate the context model with the local CNN soft decision. Below is a list of the context models we implemented.
Unary: The baseline comparison model is the original CNN without any context information.
Pairwise Statistics: Following the work of [23], we created a CRF model with unary potentials taken from the CNN classifier prediction results, and pairwise potentials given by pairwise statistics that are estimated from the training dataset. In other words, the context information is modeled by a stationary first-order Markov chain. No additional NN training is applied. The only parameter we need to set is the relative weight of the unary and pairwise potentials. This weight, which adjusts the trade-off between the local appearance and the contextual information, was selected via cross-validation.
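A pairwise-statistics context model of this kind can be sketched as follows. The use of log transition probabilities, the add-one smoothing, and the function names are assumptions made for illustration; the source only states that the pairwise potentials are statistics estimated from the training data and weighted against the unary term.

```python
import math

def transition_log_probs(label_sequences, num_classes, smoothing=1.0):
    """Estimate log p(cur | prev) from counts of left-to-right neighbors.
    Additive smoothing (an assumption here) keeps unseen transitions finite."""
    counts = [[smoothing] * num_classes for _ in range(num_classes)]
    for labels in label_sequences:
        for prev, cur in zip(labels, labels[1:]):
            counts[prev][cur] += 1.0
    log_probs = []
    for row in counts:
        total = sum(row)
        log_probs.append([math.log(c / total) for c in row])
    return log_probs

def combined_potentials(cnn_log_probs, trans_log_probs, weight):
    """Unary term: CNN log-probabilities per position; pairwise term: weighted
    transition log-probabilities. `weight` would be chosen by cross-validation."""
    pairwise = [[weight * v for v in row] for row in trans_log_probs]
    return cnn_log_probs, pairwise

# Toy training labels with two classes.
train = [[0, 0, 0, 1], [1, 1]]
log_T = transition_log_probs(train, num_classes=2)
unary, pairwise = combined_potentials([[-0.1, -2.4]], log_T, weight=0.5)
print(math.exp(log_T[0][0]))
```

The resulting unary and pairwise scores can be passed to any linear-chain decoder; no gradient-based training is involved, only counting and a single weight.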
Mixture of Statistics CRFs: Still relying on the pairwise statistics summary model, we can also model global context information; for instance, the fact that all the objects in the sequence have the same label. We clustered the sequences into a mixture of $K$ Markov models, using the expectation-maximization (EM) algorithm. The training sequences are eventually split into $K$ different groups, and pairwise statistics are calculated separately for each one of them. At test time, the most probable Markov model is selected for each sequence, and the corresponding pairwise statistics CRF model is used. The mixture of Markov models method was examined with $K$ values ranging from 2 to 16. It revealed chain groupings to some extent, but did not lead to a significant improvement in the overall classification performance compared to the baseline CRF model of $K = 1$.
We also tried an alternative (or complementary) clustering approach, in which we grouped the classes into clusters that maximize the mutual information between consecutively visited groups [35]. The pairwise potential was then defined by pairwise statistics of the class clusters. Distinct clusters of classes were identified, but we did not manage to harness this information for the task of non-hierarchical class identification.
Log-linear CRF: This method learns the log-linear parameters of the linear-chain CRF (3). We implemented both global and local approximate likelihood training methods and tried both $\ell_1$ and $\ell_2$ regularization for the pairwise potential matrix. We also applied standardization to the one-hot input vectors. The results in all cases were comparable, and provided a noticeable improvement over the baseline context-less classifier. The local training procedure is much more efficient, because its time complexity is linear in the number of classes, whereas the global training procedure is quadratic in the number of classes. Since the number of classes is 972 and the training dataset is large, this significantly affects training time even when applying extensive GPU parallelization to compute the partition function. In our experiments, the local training method was about 25 times faster than the global training method, given the same amount of GPU memory, and, as we found empirically, its performance is nearly identical to that of the globally trained network.
Class-embedding CRF: This is the main model described in this study, where the CRF is enhanced into a much richer, but non-convex, model by factorizing the pairwise weight matrix as defined in Eq. (5). We implemented both global and local training procedures, and studied several alternative embedding structures and likelihood approximations, as elaborated below. In all cases, local likelihood approximations are much faster to train, but most of the methods, for either local or global training, provided similar or worse performance compared to the linear CRF model. The sole variant that remarkably improved performance is the objective structure in which all the output embedding features are standardized by batch normalization. In this case, the local approximate likelihood method has two major advantages over global maximum likelihood: global optimization of the LC-CRF is not only much more time-consuming, but also lacks the ability to apply a straightforward batch normalization strategy, since the activations are shared across multiple locations in each sample in the mini-batch.
Similar issues occur when applying more complex embedding structures. We originally considered other variants of the class embedding concept, in which the embedding parameters of the target and neighboring labels are tied. For that purpose, we impose the structure of the embedding matrix on the current class as well as the neighboring class, so that the factorized pairwise potential uses the same embedding for the class and its neighbor. We may also apply the class embedding to the unary potential matrix by factorizing it as well. In these parametrizations, applying embedding batch normalization would require parameter tying between the softmax inputs and the softmax weights, and would thus compromise the effectiveness of the batch normalization process.
The same problem appears in other known methods of local likelihood approximation. Close variants of our local training model are the piecewise, pseudolikelihood and piecewise-pseudolikelihood (PWPL) methods (see details in Appendix A). Applying embedding batch normalization to the pseudolikelihood or PWPL methods would once again require parameter tying between the softmax inputs and the softmax weights. However, the PWPL in our case can be reduced to the form of a forward term, which is equivalent to the MEMM-like objective (8), and an additive backward term, which is independent of the CRF input. Hence, the MEMM-like objective function is theoretically closely related to PWPL.
We also tried to replace BN by standardization of the one-hot input vectors at the input of the embedding layer, but this approach does not affect the output of the embedding layer as BN does, and did not improve performance. Hence, we favor the pairwise softmax architecture with the MEMM-like objective (10) and a BN layer between the embedding output and the softmax input. In addition, we tried increasing the model's non-linearity by adding another fully connected layer and a ReLU non-linearity between the one-hot vector input and the fully connected embedding layer. We also tried learning the embedding in a higher-dimensional space. Those enhancements, however, did not improve performance, and turned out to be redundant.
Recurrent Neural Network: Another modeling option for sequence estimation is a bidirectional recurrent neural network [36] with LSTM [37] as the memory block (BiLSTM). In that approach we compute the posterior distribution of the current object label based on all the visual information provided by the CNN: $p(y_t \mid h_1, \ldots, h_n)$. The BiLSTM architecture learns a context vector for each object, which encapsulates the bidirectional information in the sequence of input observations transferred from the CNN output, and learns a softmax prediction for each object label. This approach, however, did not outperform the original unary CNN. In our case the visual features of the neighbors hardly provide any additional information beyond the local visual features. The most important information, in addition to the local appearance of the object, is the label relations between neighboring objects, which are not captured here. Note that the BiLSTM network uses a softmax output layer that provides a separate prediction for each class and thus ignores class similarities. It is interesting to compare our task of visual-sequence classification with NLP sequence-tagging tasks such as POS or NER, where both the neighboring words and tags may be very informative and thus both CRF and BiLSTM have been shown to improve accuracy. The BiLSTM-CRF model [38], which stacks a linear-chain CRF over the BiLSTM context vectors, produces more accurate results than each one of them separately. In our case, however, such complex models are not required.

TABLE II: Error rate of the compared models.

| Architecture | Learning | Potentials | Error % |
|---|---|---|---|
| Unary (no context) | CNN | None | 15 |
| BiLSTM | RNN | None | 15 |
| Pairwise Statistics CRF | Cross-validation | Distributions | 15 |
| Mixture of Statistics CRFs | EM | Distribution per Cluster | 15 |
| Log-linear CRF | Global | Linear | 14 |
| Log-linear CRF | Approximate | Linear | 14 |
| Class-embedding CRF | Approximate | Factorized | 14 |
| Class-embedding CRF | Global | Factorized | 13 |
| Class-embedding CRF with BN | Approximate | Factorized | 11 |

TABLE III: Recall and precision of the compared models.

| Architecture | Learning | Potentials | Recall % | Precision % |
|---|---|---|---|---|
| Unary (no context) | CNN | None | 79 | 91 |
| BiLSTM | RNN | None | 80 | 91 |
| Pairwise Statistics CRF | Cross-validation | Distributions | 83 | 91 |
| Mixture of Statistics CRFs | EM | Distribution per Cluster | 83 | 91 |
| Log-linear CRF | Global | Linear | 81 | 91 |
| Log-linear CRF | Approximate | Linear | 81 | 91 |
| Class-embedding CRF | Approximate | Factorized | 81 | 91 |
| Class-embedding CRF | Global | Factorized | 84 | 91 |
| Class-embedding CRF with BN | Approximate | Factorized | 87 | 91 |

III-D Results
Table II reports the results in terms of model error rate. It shows the incremental improvement in accuracy over model variations, and reveals that the non-linear method based on batch-normalized class embedding gives better results than the linear model. Table III refers to what was defined as our original objective: maximizing recall while preserving at least 91% precision. It is interesting to note that our model, involving both local training and normalized class embeddings, is the only one that led to a significant improvement over the pairwise statistical-summary model of [23] for this objective. Since our benchmark is considerably large, this means that we correctly identified 7,200 more objects than the unary model, and 3,600 more objects than the pairwise-statistics model.
III-E Class Embedding Analysis
As a by-product of the classification model we also obtain a low-dimensional embedding of the different classes. Each column of the neighbor embedding matrix is a vector representation of the corresponding class. A common similarity metric is the cosine of the angle between the vectors, so we can measure the similarity between classes by the cosine of their vector representations. Fig. 5 shows several examples of an object class and its most similar classes. We can see that this similarity does not reflect visual appearance similarity, e.g. in the second example the similar classes have very different colors. This situation was extensively studied for the linguistic problem of word embedding. The goal of word embedding algorithms is to represent similar words by similar vectors. It is often useful to distinguish two kinds of similarity or association between words [39]. Two words have first-order co-occurrence if they typically appear near each other (e.g. wrote is a first-order associate of book or poem). Two words have second-order co-occurrence if they have similar neighbors (e.g. wrote is a second-order associate of words like said or remarked). Second-order similarity is thus expected to capture semantic meaning and to measure the extent to which two classes are interchangeable, based on their tendencies to appear in similar contexts. In Figs. 5 and 6 we show that the object class embedding captures second-order information. Proximity here corresponds to the mutual tendency to have similar neighbors. We can see in the figures that similar classes, although visually different, represent products of similar container types, volumes and brands.
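Ranking classes by cosine similarity of their embedding columns can be sketched as follows. The matrix layout (m x k nested list, one column per class), the function names and the toy values are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(neighbor_emb, query, top=3):
    """Rank all other classes by cosine similarity to the embedding of `query`."""
    m, k = len(neighbor_emb), len(neighbor_emb[0])

    def column(c):
        return [neighbor_emb[r][c] for r in range(m)]

    q = column(query)
    ranked = sorted((c for c in range(k) if c != query),
                    key=lambda c: cosine(q, column(c)), reverse=True)
    return ranked[:top]

# Toy embeddings: m = 2 dimensions, k = 4 classes (arbitrary values).
neighbor_emb = [[1.0, 2.0, -1.0, 0.0],
                [0.0, 0.1, 0.0, 1.0]]
print(most_similar(neighbor_emb, query=0))  # [1, 3, 2]
```

Cosine similarity ignores vector length, so class 1 ranks first for class 0 despite having roughly twice the norm, while the opposite-pointing class 2 ranks last.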
IV Conclusion
We introduced a novel technique to learn deep contextual and visual features for fine-grained structured prediction of object categories, and tested it on a dataset that contains spatial sequences of objects and a large number of visually similar classes.
Our model clearly outperforms all the other tested models. This architecture appears to be the most straightforward generalization of a contextless classifier to a context-dependent one when both the input and the context data require a large learning capacity: the network learns deep feature vectors for neighboring classes, analogously to the learned deep input representations. The Markovity and stationarity assumptions make it sufficient to train with individual objects as samples, which enriches the diversity of the training data, allows for a simple embedding batch normalization, and boosts the non-convex optimization process in terms of both time and performance.
Shelf-level classification may not be sufficient in situations such as the one depicted in Fig. 7, where it may help identify a few probable shelf-classification possibilities but cannot reach a clear decision on the best one. This can be resolved by defining additional spatial relationships in the graph, such as top-bottom edges or hyper-edges between shelves, and learning their embeddings. A scene-level classification may then be inferred using belief propagation, variational methods or MCMC methods. Additional improvement may be achieved by integrating the bounding-box dimensions as part of the model input, applying end-to-end joint training of the CNN and the CRF, and using recurrent layers to model relations between object proposals and labels, learn sequence embeddings, and perform training and inference in loopy graphs.
Appendix A Local Likelihood Approximation
Pseudolikelihood [40] is a classical approximation of the CRF likelihood function that simultaneously classifies each node given its neighbors in the graph. The pseudolikelihood objective function is hence only dependent on the object and its Markov blanket. The pseudolikelihood of our model is:
(11) 
where
(12) 
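As an illustration of the general idea, the sketch below evaluates the log-pseudolikelihood of a labeled sequence under a generic linear-chain model. It assumes per-node log-potentials `unary` (shape n x K) and a single shared pairwise log-potential table `pairwise` (shape K x K); these names and this parametrization are assumptions for illustration and are not the factor definition used in this paper.

```python
import numpy as np

def _logsumexp(v):
    # Numerically stable log(sum(exp(v))).
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def log_pseudolikelihood(unary, pairwise, labels):
    """Sum over nodes of log p(y_i | labels of the neighbors of i, input).

    unary    : (n, K) log-potentials of each label at each node (assumed form)
    pairwise : (K, K) log-potential of (left label, right label), shared by all edges
    labels   : length-n sequence of observed labels
    """
    unary = np.asarray(unary, dtype=float)
    pairwise = np.asarray(pairwise, dtype=float)
    n, _ = unary.shape
    total = 0.0
    for i in range(n):
        # Score of every candidate label at node i, with its Markov blanket held fixed.
        s = unary[i].copy()
        if i > 0:
            s += pairwise[labels[i - 1], :]   # edge to the left neighbor
        if i < n - 1:
            s += pairwise[:, labels[i + 1]]   # edge to the right neighbor
        total += s[labels[i]] - _logsumexp(s)  # local normalization over K labels only
    return total
```

Each term is normalized over the K labels of a single node, so the cost is linear in the sequence length and in the number of classes, and no global partition function is needed.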
Piecewise training [41] is a heuristic method to estimate the graph factors from separate “pieces” of the graph. The piecewise objective function is equivalent to the likelihood function of a node-split graph [32], which contains all the single-factor components split from the original graph. Using the factor defined in (5), the piecewise likelihood in our case is:
(13) 
Note that computing the piecewise likelihood is quadratic in the number of classes. Piecewise pseudolikelihood (PWPL) is the standard pseudolikelihood applied to the node-split graph. Its computation is efficient because the objective function is simply the sum of local conditional probabilities. Sutton and McCallum [32] showed that in many cases PWPL is more accurate than standard pseudolikelihood, and that in some scenarios its performance is nearly equivalent to that of the piecewise approximation and even to global maximum likelihood. In our case, applying the pseudolikelihood approach to the piecewise objective (13) gives the PWPL form:
(14) 
The first term inside the function is equivalent to the forward MEMM objective function (8), while the second term can be reduced to derive the PWPL form:
(15) 
where the backwards term
(16) 
The backwards term of the PWPL (16) is independent of the CRF input, and hence the MEMM-like objective function is closely related to PWPL in theory. Thus, the approximate likelihood we use for training, which is based on the MEMM model, can be viewed as a simplified version of the piecewise pseudolikelihood objective (15), which was found to be the preferred likelihood approximation for language processing tasks [32].
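The difference in cost between the piecewise and the PWPL objectives can be seen on a single pairwise factor. The sketch below assumes a generic K x K table of log-potentials `theta` for one edge with observed labels (`ya`, `yb`); the names and the table form are hypothetical and do not reproduce the factor in (5). The piecewise term normalizes over all K^2 label pairs, whereas PWPL sums two conditionals that each normalize over only K labels.

```python
import numpy as np

def _logsumexp(v):
    # Numerically stable log(sum(exp(v))) over all entries of v.
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def piecewise_loglik(theta, ya, yb):
    """Log-likelihood of one edge treated as an isolated piece (K^2 normalizer)."""
    theta = np.asarray(theta, dtype=float)
    return theta[ya, yb] - _logsumexp(theta)

def pwpl_loglik(theta, ya, yb):
    """Piecewise pseudolikelihood of the same edge (two K-sized normalizers)."""
    theta = np.asarray(theta, dtype=float)
    forward = theta[ya, yb] - _logsumexp(theta[ya, :])   # log p(yb | ya) within the piece
    backward = theta[ya, yb] - _logsumexp(theta[:, yb])  # log p(ya | yb) within the piece
    return forward + backward
```

In this generic form, the `forward` conditional plays the role of an MEMM-like term and the `backward` conditional plays the role of the backwards term; the sketch does not model the dependence on the CRF input.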
References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
[4] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, “SSD: Single shot multibox detector,” in ECCV, 2016.
[5] Spyros Gidaris and Nikos Komodakis, “Locnet: Improving localization accuracy for object detection,” in CVPR, 2016.
[6] Joseph Redmon and Ali Farhadi, “Yolo9000: Better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2016.
[7] Di Lin, Xiaoyong Shen, Cewu Lu, and Jiaya Jia, “Deep lac: Deep localization, alignment and classification for fine-grained recognition,” in CVPR, 2015.
[8] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell, “Part-based R-CNNs for fine-grained category detection,” in ECCV, 2014.
[9] Antonio Torralba, “Contextual priming for object detection,” International journal of computer vision, vol. 53, no. 2, pp. 169–191, 2003.
[10] Santosh K Divvala, Derek Hoiem, James H Hays, Alexei A Efros, and Martial Hebert, “An empirical study of context in object detection,” in CVPR, 2009.
[11] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[12] Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge Belongie, “Objects in context,” in ICCV, 2007.
[13] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv preprint arXiv:1606.00915, 2016.
[14] Stephen Gould, Richard Fulton, and Daphne Koller, “Decomposing a scene into geometric and semantically consistent regions,” in ICCV, 2009.
[15] Jian Yao, Sanja Fidler, and Raquel Urtasun, “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in CVPR, 2012.
[16] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr, “Conditional random fields as recurrent neural networks,” in ICCV, 2015.
[17] Alexander G Schwing and Raquel Urtasun, “Fully connected deep structured networks,” arXiv preprint arXiv:1503.02351, 2015.
[18] Yuxuan Wang and DeLiang Wang, “Cocktail party processing via structured prediction,” in NIPS, 2012.
[19] Filip Korzeniowski and Gerhard Widmer, “A fully convolutional deep auditory model for musical chord recognition,” in MLSP, 2016.
[20] Gang Chen, Yawei Li, and Sargur N Srihari, “Word recognition with deep conditional random fields,” in ICIP, 2016.
[21] Ninghang Hu, Gwenn Englebienne, Zhongyu Lou, and Ben Kröse, “Learning latent structure for activity recognition,” in IEEE Int. Conf. on Robotics and Automation (ICRA), 2014, pp. 1048–1053.
[22] Liang-Chieh Chen, Alexander Schwing, Alan Yuille, and Raquel Urtasun, “Learning deep structured models,” in ICML, 2015.
[23] Wenqing Chu and Deng Cai, “Deep feature based contextual model for object detection,” arXiv preprint arXiv:1604.04048, 2016.
[24] John Lafferty, Andrew McCallum, and Fernando CN Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in ICML, 2001.
[25] Jian Peng, Liefeng Bo, and Jinbo Xu, “Conditional neural fields,” in NIPS, 2009.
[26] Thierry Artieres et al., “Neural conditional random fields,” in AISTATS, 2010.
[27] Yoshua Bengio, Aaron Courville, and Pascal Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[28] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun, “The loss surfaces of multilayer networks,” in AISTATS, 2015.
[29] Charles Sutton and Andrew McCallum, “An introduction to conditional random fields for relational learning,” Introduction to statistical relational learning, pp. 93–128, 2006.
[30] Andrew McCallum, Dayne Freitag, and Fernando CN Pereira, “Maximum entropy markov models for information extraction and segmentation,” in ICML, 2000.
[31] Sham Kakade, Yee Whye Teh, and Sam T Roweis, “An alternate objective function for markovian fields,” in ICML, 2002.
[32] Charles Sutton and Andrew McCallum, “Piecewise pseudolikelihood for efficient training of conditional random fields,” in ICML, 2007.
[33] Genevieve B Orr and Klaus-Robert Müller, Neural networks: tricks of the trade, Springer, 2003.
[34] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
[35] Amir Alush, Avishay Friedman, and Jacob Goldberger, “Pairwise clustering based on the mutual-information criterion,” Neurocomputing, vol. 182, pp. 284–293, 2016.
[36] Mike Schuster and Kuldip K Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[37] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[38] Zhiheng Huang, Wei Xu, and Kai Yu, “Bidirectional LSTM-CRF models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
[39] Hinrich Schütze and Jan Pedersen, “A vector model for syntagmatic and paradigmatic relatedness,” in Proc. of the 9th Annual Conference of the UW Centre for the New OED and Tex, 1993.
[40] Julian Besag, “Statistical analysis of non-lattice data,” The statistician, pp. 179–195, 1975.
[41] Charles Sutton and Andrew McCallum, “Piecewise training for undirected models,” in UAI, 2005.