Adaptive Nonparametric Image Parsing

05/07/2015 · by Tam V. Nguyen, et al.

In this paper, we present an adaptive nonparametric solution to the image parsing task, namely annotating each image pixel with its corresponding category label. For a given test image, first, a locality-aware retrieval set is extracted from the training data based on super-pixel matching similarities, which are augmented with feature extraction for better differentiation of local super-pixels. Then, the category of each super-pixel is initialized by the majority vote of the k-nearest-neighbor super-pixels in the retrieval set. Instead of fixing k as in traditional non-parametric approaches, here we propose a novel adaptive nonparametric approach which determines the sample-specific k for each test image. In particular, k is adaptively set to be the number of the fewest nearest super-pixels which the images in the retrieval set can use to get the best category prediction. Finally, the initial super-pixel labels are further refined by contextual smoothing. Extensive experiments on challenging datasets demonstrate the superiority of the new solution over other state-of-the-art nonparametric solutions.




I Introduction

Image parsing, also called scene understanding or scene labeling, is a fundamental task in the computer vision literature [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. However, image parsing is very challenging since it implicitly integrates the tasks of object detection, segmentation, and multi-label recognition into one single process. Most current solutions follow a two-step pipeline. First, the category label of each pixel is initially assigned by a classification algorithm. Then, contextual smoothing is applied to enforce contextual constraints among neighboring pixels. The algorithms in the classification step can be roughly divided into two categories, namely parametric methods and nonparametric methods.

Fig. 1: The flowchart of our proposed nonparametric image parsing. Given a test image, we segment the image into super-pixels. Then the locality-aware retrieval set is extracted by super-pixel matching, and the initial category label of each super-pixel is assigned by adaptive nonparametric super-pixel classification. The initial labels, in combination with contextual smoothing, give a dense labeling of the test image. The red rectangle highlights the new contributions of this work; removing the keywords "locality-aware" and "adaptive" (in red) leads to the traditional nonparametric image parsing pipeline.

Parametric methods   Fulkerson et al. [13] constructed an SVM classifier on the bag-of-words histogram of local features around each super-pixel. Tighe et al. [14] combined super-pixel level features with per-exemplar sliding window detectors to improve the performance. Socher et al. [15] proposed a method to aggregate super-pixels in a greedy fashion using a trained scoring function. The originality of this approach is that the feature vector of the combination of two adjacent super-pixels is computed from the feature vectors of the individual super-pixels through a trainable function. Farabet et al. [16] later proposed to use a multiscale convolutional network trained on raw pixels to extract dense feature vectors that encode regions of multiple sizes centered at each pixel.

Nonparametric methods  Different from parametric methods, nonparametric or data-driven methods rely on k-nearest-neighbor classifiers [4, 5]. Liu et al. [4] proposed a nonparametric image parsing method based on estimating SIFT Flow, a dense deformation field between images. Given a test and a training image, the annotated category labels of the training pixels are transferred to the test ones via pixel correspondences. However, inference via pixel-wise SIFT Flow is very complex and computationally expensive. Therefore, Tighe et al. [5] instead transferred labels at the level of super-pixels, i.e., coherent image regions produced by a bottom-up segmentation method. In this scheme, given a test image, the system searches for the most similar training images based on global features, and the super-pixels of these images form a retrieval set. Then the label of each super-pixel in the test image is assigned based on the most similar super-pixels in the retrieval set. Eigen et al. [17] further improved [5] by learning per-descriptor weights that minimize classification error. In order to improve the retrieval set, Singh et al. [18] used adaptive feature relevance and semantic context. They adopted a locally adaptive distance metric, learned at query time, to compute the relevance of individual feature channels. Using the initial labeling as a contextual cue for the presence or absence of objects in the scene, they proposed a semantic context descriptor which helps refine the quality of the retrieval set. In a different work, Yang et al. [19] looked into the long-tailed nature of the label distribution. They expanded the retrieval set with rare-class exemplars and thus achieved more balanced super-pixel classification results. Meanwhile, Zhang et al. [20] proposed a method which exploits partial similarity between images. Namely, instead of retrieving globally similar images from the training database, they retrieved partially similar images so that for each region in the test image, a similar region exists in one of the retrieved training images.

Due to the limited discriminating power of classification algorithms, the initial pixel labels may be noisy. To further enhance label accuracy, contextual smoothing is generally used to exploit global context among the pixels. Rabinovich et al. [9] incorporated co-occurrence statistics of the category labels of super-pixels into a fully connected Conditional Random Field (CRF). Galleguillos et al. [10] proposed to exploit relative-location information, such as above, beside, or enclosed, between super-pixel categories. Meanwhile, Myeong et al. [6] introduced a context-link view of contextual knowledge, where the relationship between a pair of annotated super-pixels is represented as a context link on a similarity graph of regions, and link analysis techniques are used to estimate the pairwise context scores of all pairs of unlabeled regions in the input image. Later, [11] proposed a method to transfer high-order semantic relations of objects from annotated images to unlabeled images. Zhu et al. [21] proposed a hierarchical image model composed of rectangular regions with parent-child dependencies. This model captures long-distance dependencies and is solved efficiently using dynamic programming; however, it supports neither multiple hierarchies nor dependencies between variables at the same level. In another work, Tu et al. [22] introduced a unified framework that pools information from segmentation, detection, and recognition for image parsing. However, much effort must be spent designing such complex models, and due to this complexity the model might not scale well across different datasets.

In this work, our focus is placed on nonparametric solutions to the image parsing problem. However, existing nonparametric methods have several shortcomings. First, it is often quite difficult to find globally similar images to form the retrieval set; also, by only considering global features, some important local components or objects may be ignored. Second, the number of nearest neighbors k is fixed empirically in advance in the traditional nonparametric image parsing scheme. Tighe et al. [5] reported their best results by varying k on the test set; however, this strategy is impractical since ground-truth labels are not available in the testing phase. Therefore, the main issues in nonparametric image parsing are 1) how to obtain a good retrieval set, and 2) how to choose a good k for initial label transfer. In this work, we aim to improve both aspects, and the main contributions of this work are two-fold.

  1. Unlike the traditional retrieval set which consists of globally similar images, we propose the locality-aware retrieval set. The locality-aware retrieval set is extracted from the training data based on super-pixel matching similarities, which are augmented with feature extraction for better differentiation of local super-pixels.

  2. Instead of fixing k as in traditional nonparametric methods, we propose an adaptive method that sets the sample-specific k as the number of the fewest nearest neighbors which similar training super-pixels can use to obtain their best category label predictions.

II Adaptive Nonparametric Image Parsing

II-A Overview

Generally, for nonparametric solutions to the image parsing task, the goal is to label the test image at the pixel level based on the content of the retrieval set, but assigning labels on a per-pixel basis as in [4, 16] would be too inefficient. In this work, we choose to assign labels to super-pixels produced by bottom-up segmentation as in [5]. This not only reduces the complexity of the problem, but also gives better spatial support for aggregating features belonging to a single object than, say, fixed-size square patches centered at each pixel in an image.

The training images are first over-segmented into super-pixels using the fast graph-based segmentation algorithm of [23], and their appearance is described using 20 different features similar to those of [5]. The complete list of super-pixel features is summarized in Table I. Each training super-pixel is assigned a category label if 50% or more of the super-pixel overlaps with a ground-truth segment mask of that label. For each super-pixel, we perform feature extraction and then reduce the dimension of the extracted features.
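As a concrete illustration of the 50%-overlap labeling rule, the following sketch (our own, with hypothetical array inputs, not the authors' code) assigns each training super-pixel its dominant ground-truth label, or marks it unlabeled when no label covers at least half of its pixels:

```python
import numpy as np

def label_superpixels(seg, gt, min_overlap=0.5, unlabeled=-1):
    """Assign each super-pixel the ground-truth category covering at least
    min_overlap of its pixels, otherwise mark it as unlabeled.
    seg: integer map of super-pixel ids; gt: per-pixel category labels."""
    labels = {}
    for sp in np.unique(seg):
        mask = seg == sp
        cats, counts = np.unique(gt[mask], return_counts=True)
        best = counts.argmax()
        frac = counts[best] / mask.sum()
        labels[int(sp)] = int(cats[best]) if frac >= min_overlap else unlabeled
    return labels
```

In practice the overlap is computed between the super-pixel mask and each ground-truth object mask; the per-pixel majority used here is an equivalent formulation when the ground truth is a dense label map.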

Type                      Dim   Type                       Dim
Centered mask              64   SIFT histogram top         100
Bounding box                2   SIFT histogram right       100
Super-pixel area            1   SIFT histogram left        100
Absolute mask              64   Mean color                   3
Top height                  1   Color standard deviation
Texton histogram          100   Color histogram             33
Dilated texton histogram  100   Dilated color histogram     33
SIFT histogram            100   Color thumbnail            192
Dilated SIFT histogram    100   Masked color thumbnail     192
SIFT histogram bottom     100   GIST                       320
TABLE I: The list of all super-pixel features.

For the test image, as illustrated in Figure 1, over-segmentation and super-pixel feature extraction are also conducted. Next, we perform the super-pixel matching process to obtain the locality-aware retrieval set. The adaptive nonparametric super-pixel classification is proposed to determine the initial label of each super-pixel. Finally, the graphical model inference is performed to preserve the semantic consistency between adjacent pixels. More details of the proposed framework, namely the locality-aware retrieval set, adaptive nonparametric super-pixel classification, and contextual smoothing, are elaborated as follows.

Fig. 2: The process of extracting the retrieval set by super-pixel matching. The test image is first over-segmented into super-pixels. Then, we compute the similarity between the test image and each training image as described in Algorithm 1. (Best viewed at 400% magnification.)

II-B Locality-aware Retrieval Set

For nonparametric image parsing, one important step in parsing a test image is to find a retrieval set of training images that serves as the reference for candidate super-pixel-level annotations. This is done not only for computational efficiency, but also to provide scene-level context for the subsequent processing steps. A good retrieval set should contain images of a similar scene type to the test image, with similar objects and spatial layouts. Unlike [5], where global features are used to obtain the retrieval set, we utilize super-pixel matching as illustrated in Figure 2. The motivation is that it may sometimes be difficult to find globally similar images, especially when the training set is not big enough, yet locally similar ones are easier to obtain; moreover, if only global features are considered for retrieval set selection, some important local components or objects may be ignored. In this work, the retrieval set is selected based on local similarity measured over super-pixels. To enhance the discriminating power of super-pixels, we utilize Linear Discriminant Analysis (LDA) [24] for feature reduction to a lower feature dimension. Then we use the augmented super-pixel similarity, instead of global similarity, to extract the retrieval set.

Denote $x \in \mathbb{R}^d$ as the original feature vector of a super-pixel, where $d$ is the dimension of the feature vector. The corresponding feature vector after feature reduction is computed as

$\tilde{x} = W^{\top} x,$    (1)

where $W$ is the transformation matrix. In particular, LDA looks for the directions that are most effective for discrimination by minimizing the ratio between the intra-category ($S_w$) and inter-category ($S_b$) scatters:

$W^{*} = \arg\min_{W} \frac{|W^{\top} S_w W|}{|W^{\top} S_b W|},$    (2)

$S_w = \sum_{c=1}^{C} \sum_{i: y_i = c} (x_i - \mu_c)(x_i - \mu_c)^{\top}, \quad S_b = \sum_{c=1}^{C} n_c (\mu_c - \mu)(\mu_c - \mu)^{\top},$    (3)

where $n$ is the number of super-pixels in all training images, $C$ is the number of categories, $n_c$ is the number of super-pixels of the $c$-th category, $x_i$ is the feature vector of the $i$-th training super-pixel, $y_i$ is its category label, $\mu$ is the mean feature vector over all training super-pixels, and $\mu_c$ is the mean of the $c$-th category. Note that the category label of each super-pixel is obtained from the ground-truth object segment with the largest overlap with the super-pixel. As shown in [24], the projection matrix $W$ is composed of the eigenvectors of $S_w^{-1} S_b$. Note that there are at most $C-1$ eigenvectors with non-zero corresponding eigenvalues, since only the $C$ category means are used to compute $S_b$. In other words, the dimensionality of $\tilde{x}$ is $C-1$. Therefore, LDA naturally reduces the feature dimension to $C-1$ in the image parsing task. Since the category number is much smaller than the feature number, the benefits of the reduced dimension include the shrinkage of memory storage and the removal of less informative features for the subsequent super-pixel matching. Obviously, the reduction of feature dimension is also beneficial to the nearest super-pixel search in the super-pixel classification stage.
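A minimal numpy sketch of this LDA reduction (our own illustration; `lda_projection` and the synthetic data are hypothetical, not the paper's code) projects $d$-dimensional super-pixel features down to $C-1$ dimensions via the eigenvectors of $S_w^{-1} S_b$:

```python
import numpy as np

def lda_projection(X, y):
    """Compute the LDA projection W (columns = eigenvectors of Sw^{-1} Sb
    with the largest eigenvalues), reducing features to C - 1 dimensions."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)               # intra-category scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)  # inter-category scatter
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)[:len(classes) - 1]
    return evecs[:, order].real                     # d x (C - 1) projection

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 60))   # hypothetical 60-dim super-pixel features
y = np.arange(500) % 8           # C = 8 hypothetical categories
W = lda_projection(X, y)
X_red = X @ W                    # reduced to C - 1 = 7 dimensions
```

Since $S_b$ has rank at most $C-1$, only the first $C-1$ eigenvalues are non-zero, which is exactly why the output dimension is $C-1$.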

1:parameters: K, M, F_t, F_tr, m.
2:The unique index set U ← ∅; the similarity vector s ← 0.
3:for i = 1 : N_t do
4:     [idx, dist] ← Knn(F_t(i,:), F_tr, K);
5:     idx ← idx \ U;
6:     if |idx| > 0 then
7:          idx ← RefineIndexSet(idx, dist);
8:          v ← FindImageIndex(idx);
9:          U ← U ∪ idx;
10:         update s(v) with the new matches;
11:    end if
12:end for
13:s ← NormalizeAndSort(s);
14:return top M training images ranked by s.

15:function RefineIndexSet(idx, dist)
16:     r ← ∅.
17:     for i = 1 : |idx| do
18:          if r contains no super-pixel from the same training image as idx(i) then
19:               r ← r ∪ {idx(i)};
20:          else
21:               keep in r the one with the smaller distance in dist;
22:          end if
23:     end for
24:     return r.
25:end function

26:function FindImageIndex(idx)
27:     v ← ∅.
28:     for i = 1 : |idx| do
29:          v(i) = m(idx(i));
30:     end for
31:     return v.
32:end function

33:function NormalizeAndSort(s)
34:     for j = 1 : N_im do
35:          s(j) = s(j) / N_j;
36:     end for
37:     sort s in descending order.
38:     return s.
39:end function
Algorithm 1 Locality-aware Retrieval Set Algorithm
Fig. 3: The distribution of the best k's for all training images in the SIFTFlow dataset. It can be observed that there is no dominant k from k = 1 to k = 100.

The procedure to obtain the retrieval set is summarized in Algorithm 1. Denote $N_t$ as the number of super-pixels in the test image, $N_j$ as the number of super-pixels of the $j$-th training image, and $N_{im}$ as the number of training images. We impose the natural constraint that one super-pixel in a training image is matched with at most one super-pixel of the test image. We denote $U$ as the unique index set which stores the indices of the already matched super-pixels, $s$ as the similarity vector between the test image and all training images, $F_t$ as the feature matrix of all super-pixels in the test image, $F_{tr}$ as the feature matrix of all super-pixels in the training set, and $m$ as the mapping from each super-pixel index to its corresponding training image. As aforementioned, the over-segmentation of the image is performed using [23]. Then we extract the corresponding features for each super-pixel, similarly to [5], and use LDA to reduce the feature dimension.

We match each super-pixel in the test image against all super-pixels in the training set. In order to reduce the complexity, we perform Knn to find the $K$ nearest super-pixels in the training images for the $i$-th super-pixel in the test image; the Euclidean distance is used to calculate the dissimilarity between two super-pixels. As a result, we have $idx$ as the indices of the returned nearest super-pixels of the $i$-th test super-pixel, and $dist$ as the corresponding distances of these super-pixels to the $i$-th test super-pixel. We remove from $idx$ the super-pixels in $U$, where $U$ includes the training super-pixels matched by the first $i-1$ test super-pixels. Since there may be more than one super-pixel from the same training image, RefineIndexSet is performed to keep only the nearest one. Note that $|\cdot|$ denotes the number of elements in an array. Then, the index set $U$ is updated by adding $idx$.

The function FindImageIndex is invoked to retrieve the corresponding image indices of $idx$. Then we normalize the similarity vector, since the number of super-pixels is not the same for every image; for example, it varies considerably across the SIFTFlow training set. Therefore we perform NormalizeAndSort to obtain the final similarity vector: for each training image $j$, $s(j)$ is divided by $N_j$. The retrieval set then includes the top $M$ training images ranked by $s$, where the parameters $K$ and $M$ are selected by grid search over the training set based on the leave-one-out strategy. Namely, we choose a pair $(K, M)$ from a grid of candidate values and perform the adaptive nonparametric super-pixel classification for all images in the training set. The leave-one-out strategy means that when one training image is selected as a test image, the rest of the training images are used as the corresponding training set.
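The matching procedure can be sketched as follows. This is our reconstruction of Algorithm 1; in particular, the distance-to-similarity transform `1 / (1 + d)` is an assumption on our part, not the paper's exact scoring rule:

```python
import numpy as np

def locality_aware_retrieval(test_feats, train_feats, train_img_of_sp,
                             sp_per_img, K=10, M=200):
    """Sketch of locality-aware retrieval: each test super-pixel votes for
    the training images of its K nearest training super-pixels; a training
    super-pixel may be matched to at most one test super-pixel, and only the
    nearest candidate per training image is kept (cf. RefineIndexSet)."""
    matched = set()                             # already-matched training super-pixels
    sim = np.zeros(len(sp_per_img))             # similarity score per training image
    for f in test_feats:
        d = np.linalg.norm(train_feats - f, axis=1)
        votes = {}                              # nearest unmatched super-pixel per image
        for idx in np.argsort(d)[:K]:
            if idx in matched:
                continue
            img = train_img_of_sp[idx]
            if img not in votes:                # keep only the nearest one per image
                votes[img] = idx
        for img, idx in votes.items():
            matched.add(idx)
            sim[img] += 1.0 / (1.0 + d[idx])    # assumed similarity of a match
    sim /= np.asarray(sp_per_img, dtype=float)  # normalize by super-pixel count
    return np.argsort(-sim)[:M]                 # top-M training image indices
```

A brute-force nearest-neighbor scan is used here for clarity; the paper's Knn step would normally use an indexing structure for efficiency.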

II-C Adaptive Nonparametric Super-pixel Classification

Adaptive nonparametric super-pixel classification aims to overcome a limitation of the traditional k-nearest-neighbor (k-NN) algorithm, which usually assigns the same number of nearest neighbors to every test sample. For nonparametric algorithms, the label of each super-pixel in the test image is assigned based on the corresponding similar super-pixels in the retrieval set. Our improved k-NN algorithm focuses on finding a suitable k for each test sample.

Basically, the sample-specific k of each test image is propagated from its similar training images. In particular, each training image $I_j$ retrieved by the super-pixel matching process is considered as one test image, while the remaining images in the training set serve as the corresponding training set. Then we perform super-pixel matching to obtain the retrieval set for $I_j$ and assign the label of the $i$-th super-pixel by the majority vote of the $k$ nearest super-pixels in the retrieval set,

$\ell_i = \arg\max_{c} L(i, c),$    (4)

where $L(i, c)$ is the likelihood ratio for the $i$-th super-pixel to have the category $c$ based on the $k$ nearest super-pixels, defined as

$L(i, c) = \frac{n(c, N_i) / n(c, D)}{n(\bar{c}, N_i) / n(\bar{c}, D)}.$    (5)

Here $n(c, N_i)$ is the number of super-pixels with class label $c$ among the $k$ nearest super-pixels of the $i$-th super-pixel in the retrieval set, $\bar{c}$ is the set of all labels excluding $c$, and $D$ is the set of all super-pixels in the whole training set. $N_i$ consists of the $k$ nearest super-pixels of the $i$-th super-pixel from the retrieval set. Then we compute the per-pixel accuracy of each retrieved training image for different k's. We denote $Acc_j(k)$ as the per-pixel performance (the percentage of all ground-truth pixels that are correctly labeled) of the training image $I_j$ with the parameter value $k$. We vary $k$ from 1 to 100. As can be observed in Figure 3, there is no dominant k from 1 to 100 in the overall SIFTFlow training set. This motivates the necessity of adaptive nearest neighbors for the nonparametric super-pixel classification process. Thus, for each test image, we assign its $k^{*}$ by transferring the k's of the similar images returned by the super-pixel matching process,

$k^{*} = \arg\max_{k} \sum_{j=1}^{M} Acc_j(k),$    (6)

where $M$ is the number of images in the retrieval set for the test image. Then, based on the selected $k^{*}$, the initial label of a super-pixel in the test image is obtained in the same way as in Eqn. (4).
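A compact sketch of this classification step (our own illustration; `knn_vote` simplifies the likelihood ratio of Eqn. (5) to raw neighbor counts, and `transfer_k` is one possible reading of the k-transfer rule, with a hypothetical data layout):

```python
import numpy as np
from collections import Counter

def knn_vote(neighbor_labels, k):
    """Majority vote over the k nearest super-pixel labels (cf. Eqn. (4);
    here simplified to raw neighbor counts rather than the likelihood ratio)."""
    return Counter(neighbor_labels[:k]).most_common(1)[0][0]

def transfer_k(acc, retrieved, k_grid):
    """Pick the k maximizing the summed per-pixel accuracy Acc_j(k) over the
    retrieved training images. acc[j][k] holds the precomputed accuracy of
    training image j under the value k (hypothetical layout)."""
    totals = {k: sum(acc[j][k] for j in retrieved) for k in k_grid}
    return max(totals, key=totals.get)
```

The key point is that `acc` is computed entirely offline on the training set (leave-one-out), so at test time the adaptive k costs only a lookup and a sum over the retrieved images.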

Fig. 4: (Top) Label frequencies for the pixels in the SIFTFlow training set. (Bottom) The per-category classification rates of different k's and of our adaptive nonparametric method on the SIFTFlow dataset. The categories "bird", "cow", "desert", and "moon" are dropped from the figure since they are not present in the test split. (Best viewed at 400% magnification.)

II-D Contextual Smoothing

Generally, the initial labels of the super-pixels may still be noisy, and these labels need to be further refined with global context information. Contextual constraints are very important for parsing images; for example, a pixel assigned "car" is likely connected with "road". Therefore, the initial labels are smoothed with an MRF energy function defined over the field of pixels:

$E(L) = \sum_{p \in P} D(p, l_p) + \lambda \sum_{(p, q) \in \mathcal{E}} V_{pq}(l_p, l_q),$    (7)

where $P$ is the set of all pixels in the image, $\mathcal{E}$ is the set of edges connecting adjacent pixels, and $\lambda$ is a smoothing constant. The data term is defined as

$D(p, l_p) = -\log L(sp(p), l_p),$    (8)

where $sp(p)$ denotes the super-pixel containing the $p$-th pixel and the function $L$ is defined in Eqn. (5). The MRF model also includes a smoothness constraint reflecting spatial consistency (pixels or super-pixels close to each other are most likely to have similar labels). Therefore, the smoothing term imposes a penalty when two adjacent pixels ($p$, $q$) are similar but assigned different labels ($l_p$, $l_q$). $V_{pq}$ is defined based on the probabilities of label co-occurrence; it biases neighboring pixels to have the same label when no other information is available, and it depends on the edges of the image:

$V_{pq}(l_p, l_q) = -\log\left[\frac{P(l_p \mid l_q) + P(l_q \mid l_p)}{2}\right] \delta(l_p \neq l_q)\, e_{pq},$    (9)

where $P(l \mid l')$ is the conditional probability of one pixel having label $l$ given that its neighbor has label $l'$, estimated by counts from the training set. $e_{pq}$ is defined based on the normalized gradient value between the neighboring pixels:

$e_{pq} = \exp\left(-\beta \|\nabla_{pq} I\|^2\right),$    (10)

where $\|\nabla_{pq} I\|$ is the norm of the gradient of the test image between a pixel $p$ and its neighbor pixel $q$, and $\beta$ is a scaling constant. The stronger the luminance edge is, the more likely the neighboring pixels are to have different labels. Multiplication with the constant Potts penalty $\delta(l_p \neq l_q)$ is necessary to ensure that this energy term is semi-metric, as required by graph-cut inference [25]. We perform the inference using the α-β swap algorithm [25, 26, 27].
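To make the energy concrete, the following sketch (ours, with assumed array conventions) evaluates an energy of the form of Eqn. (7) for a given labeling on a 4-connected grid; the paper minimizes this energy with graph-cut swap moves, which we do not reimplement here:

```python
import numpy as np

def mrf_energy(labels, unary, V, lam=1.0):
    """Energy of a labeling on a 4-connected pixel grid:
    E(L) = sum_p D_p(l_p) + lam * sum_{(p,q)} V[l_p, l_q].
    unary[h, w, c]: data cost of label c at pixel (h, w);
    V[a, b]: pairwise penalty between labels a and b (zero on the diagonal)."""
    H, W = labels.shape
    E = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    E += lam * V[labels[:, :-1], labels[:, 1:]].sum()  # horizontal neighbors
    E += lam * V[labels[:-1, :], labels[1:, :]].sum()  # vertical neighbors
    return float(E)
```

In the full model the pairwise matrix would additionally be modulated per edge by the gradient term $e_{pq}$ of Eqn. (10); a fixed label-pair matrix `V` is used here to keep the sketch short.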

III Experiments

III-A Datasets and Evaluation Metrics

In this section, our approach is validated on two challenging datasets: SIFTFlow [4] and 19-Category LabelMe [28].

SIFTFlow dataset  This dataset is composed of 2,688 images that have been thoroughly labeled by LabelMe users. The image size is 256 × 256 pixels. Liu et al. [4] split this dataset into 2,488 training images and 200 test images and used synonym correction to obtain 33 semantic labels (sky, building, tree, mountain, road, sea, field, car, sand, river, plant, grass, window, sidewalk, rock, bridge, door, fence, person, staircase, awning, sign, boat, crosswalk, pole, bus, balcony, streetlight, sun, bird, cow, desert, and moon).

19-Category LabelMe dataset  Jain et al. [28] randomly collected 350 images from LabelMe [8] with 19 categories (grass, tree, field, building, rock, water, road, sky, person, car, sign, mountain, ground, sand, bison, snow, boat, airplane, and sidewalk). This dataset is split into 250 training images and 100 test images.

We evaluate our approach on both datasets, but perform additional analysis on the SIFTFlow dataset since it has a larger number of categories and images. In evaluating image parsing algorithms, two metrics are commonly used: the per-pixel and the per-category classification rate. The former is the total proportion of correctly labeled pixels, while the latter is the average proportion of correctly labeled pixels in each object category. If the category distribution were uniform, the two would coincide, but this is not the case for real-world scenes. Note that for all experiments, the smoothing constant λ in the contextual smoothing process, as well as the parameters K and M, are set empirically. In all of our experiments, we use the Euclidean distance metric to find the nearest neighbors.
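The two metrics can be computed as follows (a straightforward sketch with our own function name and an assumed ignore-label convention for unlabeled pixels):

```python
import numpy as np

def parsing_rates(pred, gt, num_classes, ignore=-1):
    """Per-pixel rate: fraction of all labeled ground-truth pixels predicted
    correctly. Per-category rate: mean of the per-class pixel accuracies,
    averaged over the classes present in the ground truth."""
    valid = gt != ignore
    per_pixel = float((pred[valid] == gt[valid]).mean())
    per_class = []
    for c in range(num_classes):
        m = valid & (gt == c)
        if m.any():
            per_class.append(float((pred[m] == c).mean()))
    return per_pixel, float(np.mean(per_class))
```

Because rare classes cover few pixels, a method can score well per-pixel while scoring poorly per-category, which is why both numbers are reported throughout the experiments.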


Algorithm                                          Per-Pixel (%)   Per-Category (%)

Parametric Baselines
Tighe et al. [14]                                  78.6            39.2
Farabet et al. [16]                                78.5            29.6

Nonparametric Baselines
Liu et al. [4]                                     –               –
Tighe et al. [5]                                   76.3            28.8
Tighe et al. [5] (adding geometric information)    76.9            29.4
Myeong et al. [11]                                 76.2            29.6
Eigen et al. [17]                                  77.1            32.5

Our Proposed Adaptive Nonparametric Algorithm
Super-pixel Classification                         77.2            34.9
Contextual Smoothing                               78.9            34.0

TABLE II: Performance comparison of our algorithm with other algorithms on the SIFTFlow dataset [4]. Per-pixel rates and average per-category rates are presented. The best performance values are marked in bold.


Parameter                     Per-Pixel (%)   Per-Category (%)

k = 1                         70.2            31.9
k = 5                         76.6            34.8
k = 10                        77.5            34.6
k = 20                        77.8            33.5
k = 30                        77.9            33.3
k = 40                        77.9            30.6
k = 50                        77.8            29.5
k = 60                        77.5            28.6
k = 70                        77.8            28.5
k = 80                        77.5            28.2
k = 90                        77.1            27.2
k = 100                       76.9            26.8

Adaptive k in Our Algorithm   78.9            34.0

TABLE III: Performance comparison of different k's and our algorithm on the SIFTFlow dataset [4]. Per-pixel rates and average per-category rates are presented.

III-B Performance on the SIFTFlow Dataset

Comparison of our algorithm with the state-of-the-art   Table II reports per-pixel and average per-category rates for image parsing on the SIFTFlow dataset. Even though the nonparametric methods are our main baselines, we still list parametric methods for reference. Our proposed method outperforms the baselines by a remarkable margin. We did not compare our work with [18] and [19], since [18] uses a different set of super-pixel features whereas [19] utilizes extra data to balance the distribution of the categories in the retrieval set. Compared with our initial super-pixel classification result, the final contextual smoothing improves the overall per-pixel rate on the SIFTFlow dataset by about 1.7%. Average per-category rates drop slightly due to the contextual smoothing on some of the smaller classes. Note that Tighe et al. [14] improved [5] by extensively adding multiple per-exemplar detectors. The addition of many object detectors brings a better per-category performance but also increases the processing time, since running object detection is very time-consuming. Note also that, to train the object detectors, [14] must use extra data, and [14] utilizes an SVM instead of k-NN as in our work, which may bring better classification results, especially for some rare categories. Meanwhile, our proposed method improves upon [5] with a simpler solution and even achieves a better performance in terms of per-pixel rate (78.9%). Also, our method performs better than [16], which heavily deploys deep learning features.

Fig. 5: Top 4 exemplar retrieval results of super-pixel matching, global matching [5], and GIST-based matching [4]. (a) Global matching returns “tall building” and “open country” scenes, and GIST-based matching obtains “inside city” and “mountain”. Meanwhile, our method obtains the reasonable images of “urban street”. (b) The “open country” images are retrieved in GIST-based matching and the “sunset coastal” scenes are returned in global matching instead of “highway” as in our method.
Fig. 6: Exemplar results from the SIFTFlow dataset. In (a), the adaptive nonparametric method successfully parses the test image. In (b), the "rock" is correctly classified, instead of the "river" or "mountain" predicted by the other two methods. In (c), our method recovers the "sun" and removes the spurious classification of the sun's reflection in the water as "sky". In (d), the regions labeled "sea" by the two other methods are recovered as "road". In (e), some of the trees are recovered by the adaptive nonparametric method. In (f), our method recovers "window" from "door". Best viewed in color.


Algorithm                                                    Performance

SuperParsing [5]                                             76.3 (28.8)

Our Improvements
SuperParsing + LDA + Global Matching + (fixed k = 20)        76.4 (31.2)
SuperParsing + LDA + Super-pixel Matching + (fixed k = 20)   77.8 (33.5)
SuperParsing + LDA + Super-pixel Matching + Adaptive k       78.9 (34.0)

TABLE IV: Performance comparison of different settings on the SIFTFlow dataset [4]. Per-pixel classification rates (with per-category rates in parentheses) are presented.

Performance of different k's   The impact of different k's is further investigated on the SIFTFlow dataset. In this experiment, the parameter k varies from 1 to 100. LDA and super-pixel matching are utilized in order to keep the comparison with our adaptive nonparametric method fair. Table III summarizes the performance of different k's on both the per-pixel and per-category criteria. The relationship between the per-pixel and per-category rates across different k's is inconsistent: the smaller k's tend to achieve a higher per-category rate, whereas the larger k's lean toward a higher per-pixel rate. A lower k responds well to rare categories (e.g., boat, pole, bus, as illustrated in Figure 4), and thus leads to improved per-category classification. Meanwhile, a higher k leads to better per-pixel accuracy since it works well for the more common categories such as sky, building, and tree. k = 5 yields the largest per-category rate, but its per-pixel performance is much lower than that of the larger k's. As a closer look, Figure 4 also shows the details of the per-category classification rates of different k's. The smaller k's yield better results on the categories with a small number of samples, while the larger k's favor the categories with a large number of samples such as sky, sea, etc. As observed in the same Figure 4, our adaptive nonparametric approach exhibits advantages over both the smaller and larger k's.

Fig. 7: Exemplar results on the 19-Category LabelMe dataset [28]. The test images, ground truth, and results from our proposed adaptive nonparametric method are shown in triple batches. Best viewed in color.

How each new component affects SuperParsing [5]   In order to study the impact of each newly proposed component, another experiment is conducted with different configuration settings. Namely, we report the results of incrementally adding LDA, super-pixel matching, and adaptive nonparametric super-pixel classification to the traditional nonparametric image parsing pipeline [5]. Keeping k fixed at 20 and the number of similar images in the retrieval set at 200, as recommended in [5], adding LDA increases the performance of [5] by a small margin. We observe a large gain, i.e., 1.4% in per-pixel rate, by adding super-pixel matching. Further adding adaptive nonparametric super-pixel classification increases the combination ([5], LDA, super-pixel matching, and fixed k) by another 1.1% in per-pixel rate. Overall, our work improves [5] by 2.6% in terms of per-pixel rate and 5.2% in terms of per-category rate. The results clearly show the effectiveness of the proposed super-pixel matching and adaptive nonparametric super-pixel classification. Figure 6 shows exemplar results of the different experimental settings on the SIFTFlow dataset.

Retrieval Set Algorithm     NDCG

GIST-based matching [4]     –
Global matching [5]         0.85
Super-pixel matching        0.88

TABLE V: The evaluation of the relevance of a retrieval set with respect to a query.

How good is the locality-aware retrieval set   We evaluate the performance of our retrieval set via the Normalized Discounted Cumulative Gain (NDCG) [29], which is commonly used to evaluate ranking systems. NDCG is defined as follows,

$NDCG@M = \frac{1}{Z} \sum_{i=1}^{M} \frac{2^{r_i} - 1}{\log_2(i + 1)},$    (11)

where $r_i$ is a binary value indicating whether the scene of the $i$-th returned image is relevant (with value 1) or irrelevant (with value 0) to the one of the query image, and $Z$ is a constant that normalizes the calculated score. Recall that $M$ is the number of images returned by the locality-aware retrieval set, which ensures a fair comparison. As shown in Table V, our super-pixel matching outperforms the other baselines, namely GIST-based matching and global matching, in terms of NDCG. Figure 5 also demonstrates the good results of our locality-aware retrieval set.
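A small sketch of the NDCG computation (ours; here the normalizer $Z$ defaults to the ideal DCG of the same relevance list when not supplied):

```python
import numpy as np

def ndcg(rel, Z=None):
    """NDCG for a ranked list of binary relevances r_i (cf. Eqn. (11)):
    DCG = sum_i (2^{r_i} - 1) / log2(i + 1), with rank i starting at 1;
    Z defaults to the ideal DCG of the same relevance multiset."""
    rel = np.asarray(rel, dtype=float)
    disc = 1.0 / np.log2(np.arange(len(rel)) + 2)    # log2(i + 1), i = 1..M
    dcg = ((2.0 ** rel - 1.0) * disc).sum()
    if Z is None:
        Z = ((2.0 ** np.sort(rel)[::-1] - 1.0) * disc).sum() or 1.0
    return dcg / Z
```

With binary relevances the numerator $2^{r_i} - 1$ reduces to $r_i$, so the score simply rewards placing the relevant images earlier in the ranking.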

Adaptive k on different scene classes  Based on our hypothesis that similar images should share the same k, we study how the adaptive selection of k works for different types (scene classes) of similar images. To this end, we divide the images in the SIFTFlow dataset into scene classes based on their filenames. For example, the test image “coast_arnat59.jpg” is classified into the coast scene class. In total, there are 8 scene classes, namely coast, forest, highway, inside city, mountain, open country, street, and tall building. We compute the mean number of categories (car, building, road, etc.) inside the test set of the SIFTFlow dataset. Next, we compute the selected k for each scene class by choosing the k with the highest confidence over all of the images in the same scene class. The mean number of categories and the selected k of each scene class are reported in Table VI. As we can observe, scene images with more object categories, i.e., highway, inside city, and street, have lower k's. In contrast, scene images with fewer object categories have larger k's. Note that our method is unaware of the scene class of the test image. This means our method adapts well to different scene classes and brings a remarkable improvement to image parsing. In preliminary experiments, we randomized the order of the test super-pixels, and the performance was similar; therefore, the order of the super-pixels of a test image does not affect the performance.

Scene Class      Mean No. of Categories    Selected k
Coast            3.8                       12
Forest           2.5                       36
Highway          6.5                       6
Inside City      7.2                       12
Mountain         2.6                       22
Open Country     3.9                       14
Street           7.5                       6
Tall Building    3.3                       43

TABLE VI: The mean number of categories and the corresponding selected k of each scene class on the SIFTFlow dataset.
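The adaptive selection underlying these per-scene values follows the rule stated earlier: pick the fewest nearest super-pixels that yield the best category prediction on the retrieval set. A minimal leave-one-out sketch of that rule is shown below; the function name, the Euclidean distance, and the candidate k values are illustrative, not the paper's exact procedure:

```python
from collections import Counter

def select_adaptive_k(features, labels, candidate_ks=(6, 12, 22, 36)):
    """Pick the sample-specific k for a retrieval set: the smallest k whose
    leave-one-out majority-vote accuracy on the retrieval set's own labeled
    super-pixels is highest (a sketch; names are illustrative)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(labels)
    # For each super-pixel, all other super-pixels sorted by distance.
    order = [sorted((j for j in range(n) if j != i),
                    key=lambda j: dist(features[i], features[j]))
             for i in range(n)]
    best_k, best_acc = candidate_ks[0], -1.0
    for k in candidate_ks:
        correct = 0
        for i in range(n):
            votes = Counter(labels[j] for j in order[i][:k])
            if votes.most_common(1)[0][0] == labels[i]:
                correct += 1
        acc = correct / n
        if acc > best_acc:  # strict '>' keeps the smallest best-performing k
            best_k, best_acc = k, acc
    return best_k
```

The strict inequality implements the "fewest nearest super-pixels" preference: among equally accurate candidates, the smallest k wins, matching the observation that cluttered scenes (many categories) favor small k while uniform scenes tolerate large k.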


Algorithm                                  Per-Pixel (%)    Per-Category (%)

Parametric Baselines
  Jain et al. [28]                         59.0             –
  Chen et al. [30]                         75.6             45.0

Nonparametric Baselines
  Myeong et al. [6]                        80.1             53.3

Adaptive Nonparametric Algorithm (ours)
  Super-pixel Classification               80.3             53.3
  + Contextual Smoothing                   82.7             55.1

TABLE VII: Performance comparison of our algorithm with other algorithms on the 19-Category LabelMe dataset [28]. Per-pixel rates and average per-category rates are presented. The best performance values are marked in bold.

III-C Performance on 19-Category LabelMe Dataset

Table VII shows the performance of our work compared with other baselines on the 19-Category LabelMe dataset. Our final adaptive nonparametric method achieves 82.7% per-pixel accuracy on this dataset, surpassing all state-of-the-art results. Among nonparametric methods, our result surpasses that of Myeong et al. [6] by a large margin. Compared with the parametric method [30], our work improves the per-pixel rate by 7.1%. Some exemplar results on this dataset are shown in Figure 7.

IV Conclusions and Future Work

This paper has presented a novel approach to image parsing that takes advantage of adaptive nonparametric super-pixel classification. To the best of our knowledge, we are the first to exploit a locality-aware retrieval set and adaptive nonparametric super-pixel classification in image parsing. Extensive experimental results have clearly demonstrated that the proposed method achieves state-of-the-art performance on diverse and challenging image parsing datasets.

For future work, we are interested in exploring possible extensions to improve the performance. For example, the combination weights of the different feature types can be learned. Another possible extension is to elegantly transfer other parameters apart from k, for example, the smoothing parameter of the contextual smoothing process, from the retrieved training images to the test image. Since the current solution is specific to image parsing, we are also interested in generalizing the proposed method to other recognition tasks, such as image retrieval and general k-NN classification. We also plan to extend our work to the video domain, e.g., action recognition [31] and human fixation prediction [32].

Last but not least, to speed up the super-pixel matching process, we can embed locality-sensitive hashing (LSH) [33] or the recently introduced Set Compression Tree (SCT) [34] to encode each feature representation in a few bits (instead of bytes) for large-scale matching. These coding methods, together with the small number of super-pixels per image, make our super-pixel matching process feasible at scale. In this paper, we only investigate the impact of the adaptive nonparametric method on scene parsing. The use of LSH or SCT, which are suitable for large-scale datasets, will be considered for building a practical system in the future.
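To illustrate how such hashing could compress a super-pixel feature into a few bits, the sketch below uses the classical random-hyperplane LSH family for cosine similarity, in the spirit of [33]; the paper does not commit to a specific hash family, and all names here are illustrative:

```python
import random

def make_lsh_hasher(dim, n_bits, seed=0):
    """Random-hyperplane LSH: map a dim-dimensional feature vector to an
    n_bits integer signature so that similar features tend to collide
    in the same bucket. A sketch for illustration only."""
    rng = random.Random(seed)
    # One random Gaussian hyperplane per output bit.
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def hash_feature(v):
        bits = 0
        for p in planes:
            # Each bit records which side of the hyperplane v falls on.
            dot = sum(a * b for a, b in zip(p, v))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits
    return hash_feature
```

Matching then compares n_bits-sized signatures (e.g. 16-64 bits) instead of full floating-point descriptors; positively scaled versions of a feature always land in the same bucket, since scaling preserves the sign of every dot product.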


  • [1] S. Gould, R. Fulton, and D. Koller, “Decomposing a scene into geometric and semantically consistent regions,” in IEEE International Conference on Computer Vision, 2009, pp. 1–8.
  • [2] M. P. Kumar and D. Koller, “Efficiently selecting regions for scene understanding,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3217–3224.
  • [3] V. S. Lempitsky, A. Vedaldi, and A. Zisserman, “Pylon model for semantic segmentation,” in Neural Information Processing Systems Conference, 2011, pp. 1485–1493.
  • [4] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing via label transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2368–2382, 2011.
  • [5] J. Tighe and S. Lazebnik, “Superparsing: Scalable nonparametric image parsing with superpixels,” in European Conference on Computer Vision, 2010, pp. 352–365.
  • [6] H. Myeong, J. Y. Chang, and K. M. Lee, “Learning object relationships via graph-based context model,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2727–2734.
  • [7] X. He and R. S. Zemel, “Learning hybrid models for image annotation with partially labeled data,” in Neural Information Processing Systems Conference, 2008, pp. 625–632.
  • [8] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “Labelme: A database and web-based tool for image annotation,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 157–173, 2008.
  • [9] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, “Objects in context,” in IEEE International Conference on Computer Vision, 2007, pp. 1–8.
  • [10] C. Galleguillos, B. McFee, S. J. Belongie, and G. R. G. Lanckriet, “Multi-class object localization by combining local contextual interactions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 113–120.
  • [11] H. Myeong and K. M. Lee, “Tensor-based high-order semantic relation transfer for semantic scene segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3073–3080.
  • [12] D. Munoz, J. A. Bagnell, and M. Hebert, “Stacked hierarchical labeling,” in European Conference on Computer Vision, 2010, pp. 57–70.
  • [13] B. Fulkerson, A. Vedaldi, and S. Soatto, “Class segmentation and object localization with superpixel neighborhoods,” in IEEE International Conference on Computer Vision, 2009, pp. 670–677.
  • [14] J. Tighe and S. Lazebnik, “Finding things: Image parsing with regions and per-exemplar detectors,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3001–3008.
  • [15] R. Socher, C. C.-Y. Lin, A. Y. Ng, and C. D. Manning, “Parsing natural scenes and natural language with recursive neural networks,” in International Conference on Machine Learning, 2011, pp. 129–136.
  • [16] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Scene parsing with multiscale feature learning, purity trees, and optimal covers,” in International Conference on Machine Learning, 2012, pp. 1–8.
  • [17] D. Eigen and R. Fergus, “Nonparametric image parsing using adaptive neighbor sets,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2799–2806.
  • [18] G. Singh and J. Kosecka, “Nonparametric scene parsing with adaptive feature relevance and semantic context,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3151–3157.
  • [19] J. Yang, B. Price, S. Cohen, and M.-H. Yang, “Context driven scene parsing with attention to rare classes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [20] H. Zhang, T. Fang, X. Chen, Q. Zhao, and L. Quan, “Partial similarity based nonparametric scene parsing in certain environment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2241–2248.
  • [21] L. Zhu, Y. Chen, Y. Lin, C. Lin, and A. L. Yuille, “Recursive segmentation and recognition templates for image parsing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 359–371, 2012.
  • [22] Z. Tu, X. Chen, A. L. Yuille, and S. C. Zhu, “Image parsing: Unifying segmentation, detection, and recognition,” International Journal of Computer Vision, vol. 63, no. 2, pp. 113–140, 2005.
  • [23] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
  • [24] A. M. Martínez and A. C. Kak, “PCA versus LDA,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, 2001.
  • [25] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
  • [26] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 147–159, 2004.
  • [27] Y. Boykov and V. Kolmogorov, “An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1124–1137, 2004.
  • [28] A. Jain, A. Gupta, and L. S. Davis, “Learning what and how of contextual models for scene labeling,” in European Conference on Computer Vision, 2010, pp. 199–212.
  • [29] B. Siddiquie, R. S. Feris, and L. S. Davis, “Image ranking and retrieval based on multi-attribute queries,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 801–808.
  • [30] X. Chen, A. Jain, A. Gupta, and L. S. Davis, “Piecing together the segmentation jigsaw using context,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2001–2008.
  • [31] T. Nguyen, Z. Song, and S. Yan, “STAP: Spatial-temporal attention-aware pooling for action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, 2015.
  • [32] T. Nguyen, M. Xu, G. Gao, M. S. Kankanhalli, Q. Tian, and S. Yan, “Static saliency vs. dynamic saliency: a comparative study,” in ACM Multimedia Conference, 2013, pp. 987–996.
  • [33] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Communications of the ACM, vol. 51, no. 1, pp. 117–122, 2008.
  • [34] R. Arandjelovic and A. Zisserman, “Extremely low bit-rate nearest neighbor search using a set compression tree,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.