The field of service robots aims to provide robots with functionalities which allow them to work in man-made environments. For instance, the robots should be able to categorize objects and estimate the pose of the objects to accomplish various robotics tasks, such as grasping objects [Kootstra_ijrr12]. Representation of object categories enables the robot to further refine the grasping strategy by giving context to the search for the pose of the object [lai_aaai11].
In this paper, we propose a joint object categorization and pose estimation approach which extracts information about the statistical and geometric properties of object poses and categories from the object parts and compositions that are constructed at different layers of the Learned Hierarchy of Parts (LHOP) [fidler_cvpr07, fidler_book, fidler_eccv10].
In the proposed approach, we first employ LHOP [fidler_cvpr07, fidler_book] to learn hierarchical part libraries which represent object parts and compositions across different object categories and views as shown in Fig. 1. Then, we extract statistical and geometric features from the part realizations of the objects in the images in order to represent the information about the object pose and category at each different layer of the hierarchy. We propose two novel feature extraction algorithms, namely Histogram of Oriented Parts (HOP) and Entropy of Part Graphs. HOP features measure local distributions of global orientations of part realizations of objects at different layers of a hierarchy. On the other hand, Entropy of Part Graphs provides information about the statistical and geometric structure of object representations by measuring the entropy of the relative orientations of parts. In addition, we compute a Histogram of Oriented Gradients (HOG) [HOG] of part realizations in order to obtain information about the co-occurrence of the gradients of part orientations.
Unlike traditional approaches which extract information from the object representations at specific layers of the hierarchy to accomplish specific tasks, we combine the information extracted at different layers to solve a joint object pose estimation and categorization problem using a distributed optimization algorithm. For this purpose, we first formulate the joint object pose estimation and categorization problem as a sparse optimization problem called Group Lasso [glasso]. We consider the pose estimation problem as a sparse regression problem and the object categorization problem as a multi-class logistic regression problem using Group Lasso. Then, we solve the optimization problems using a distributed and parallel optimization algorithm called the Alternating Direction Method of Multipliers (ADMM) [admm].
In this work, we extract information on object poses and categories from 2-D images to handle the cases where 3-D sensing may not be available or may be unreliable (e.g. glass, metal objects). We examine the proposed approach and the algorithms on two benchmark 2-D multiple-view image datasets. The proposed approach and the algorithms outperform state-of-the-art Support Vector Machine and Regression algorithms. In addition, the experimental results shed light on the relationship between object categorization, pose estimation and the part realizations observed at different layers of the hierarchy.
In the next subsection, related work is reviewed and the novelty of our proposed approach is summarized. In Section II, a brief presentation of the hierarchical compositional representation is given. Feature extraction algorithms are introduced in Section III. The joint object pose estimation and categorization problem is defined, and two algorithms are proposed to solve the optimization problem in Section IV. Experimental analyses are given in Section V. Section VI concludes the paper.
I-A Related Work and Contribution
In the field of computer vision, the problem of object categorization and pose estimation has been studied thoroughly, and some of the approaches are spreading to the robotics community. With the advent of devices based on PrimeSense sensors, uni-modal 3-D data or multi-modal integration of 2-D and 3-D data (e.g. RGB-D data) have been widely used by robotics researchers [jiang_corr12]. However, 3-D sensing may not be available or reliable due to limitations of object structures, lighting resources and imaging conditions in many cases where single or multiple view 2-D images are used for categorization and pose estimation [collet_icra09, collet_ijrr11, Damien]. In [Damien], a probabilistic approach is proposed to estimate the pose of a known object using a single image. Collet et al. [collet_icra09] build 3-D models of objects using SIFT features extracted from 2-D images for robotic manipulation, and combine single image and multiple image object recognition and pose estimation algorithms in a framework in [collet_ijrr11].
A promising approach to object categorization and scene description is the use of hierarchical compositional architectures [fidler_cvpr07, fidler_eccv10, lai_aaai11]. Compositional hierarchical models are constructed for object categorization and detection using single images in [fidler_cvpr07, fidler_eccv10]. Multiple view images are used for pose estimation and categorization using a hierarchical architecture in [lai_aaai11]. In the aforementioned approaches, the tasks are performed using either discriminative or generative top-down or bottom-up learning approaches in architectures. For instance, Lai et al. employ a top-down categorization and pose estimation approach in [lai_aaai11], where a different task is performed at each different layer of the hierarchy. Note that in this approach a categorization error occurring at the top layer of the hierarchy may propagate to the lower layers and affect the performance of other tasks such as pose estimation. In our proposed approach, we first construct generative representations of object shapes using LHOP [fidler_cvpr07, fidler_book, fidler_eccv10]. Then, we train discriminative models by extracting features from the object representations. In addition, we propose a new method, which enables us to combine the information extracted at each different layer of the hierarchy, for joint categorization and pose estimation of objects. We avoid the propagation of errors of performing multiple tasks through the layers and enable the shareability of parts among layers by employing optimization algorithms in each layer in a parallel and distributed learning framework.
The novelty of the proposed approach and the paper can be summarized as follows:
In this work, the Learned Hierarchy of Parts (LHOP) is employed in order to learn a hierarchy of parts using the shareability of parts across different views as well as different categories [fidler_cvpr07, fidler_book].
Two novel feature extraction algorithms, namely Histogram of Oriented Parts (HOP) and Entropy of Part Graphs, are proposed in order to obtain information about the statistical and geometric structure of objects’ shapes represented at different layers of the hierarchy using part realizations.
The proposed generative-discriminative approach enables us to combine the information extracted at different layers in order to solve a joint object pose estimation and categorization problem using a distributed and parallel optimization algorithm. Therefore, this approach also enables us to share the parts among different layers and avoid the propagation of object categorization and pose estimation errors through the layers.
II Learned Hierarchy of Parts
In this section, the Learned Hierarchy of Parts (LHOP) [fidler_cvpr07, fidler_book] is briefly described. In LHOP, the object recognition process is performed in a hierarchy starting from a feature layer through more complex and abstract interpretations of object shapes to an object layer. A learned vocabulary is a recursive compositional representation of shape parts. Unsupervised bottom-up statistical learning is employed in order to obtain such a description.
Shape representations are built upon a set of compositional parts which at the lowest layer use atomic features, e.g. Gabor features, extracted from image data. The object node is a composition of several child nodes located at one layer lower in the hierarchy, and the composition rule is recursively applied to each of its child nodes down to the lowest layer. All layers together form a hierarchically encoded vocabulary. The entire vocabulary is learned from the training set of images together with the vocabulary parameters [fidler_book].
The parts in the hierarchy are defined recursively in the following way. Each part in the layer represents the spatial relations between its constituent subparts from the layer below. Each composite part constructed at the layer is characterized by a central subpart and a list of remaining subparts with their positions relative to the center as
where denotes the relative position of the subpart , while
denotes the allowed variance of its position around ().
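The recursive part definition above can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class names `Subpart` and `CompositePart`, the field names, and the helper `expected_positions` are hypothetical and are not taken from the LHOP implementation; positions are assumed to be 2-D and the allowed variance is stored as a 2x2 matrix.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec2 = Tuple[float, float]
Mat2 = Tuple[Vec2, Vec2]


@dataclass
class Subpart:
    """A non-central subpart taken from the layer below (hypothetical type)."""
    part_id: int   # index of the subpart in the vocabulary of the layer below
    mu: Vec2       # mean position relative to the central subpart
    sigma: Mat2    # allowed variance of the position around mu (2x2)


@dataclass
class CompositePart:
    """A composite part: one central subpart plus a list of remaining subparts."""
    layer: int
    center_id: int
    subparts: List[Subpart] = field(default_factory=list)

    def expected_positions(self, center_xy: Vec2) -> List[Vec2]:
        """Expected image positions of the subparts for a given center position."""
        cx, cy = center_xy
        return [(cx + s.mu[0], cy + s.mu[1]) for s in self.subparts]


# Usage: a layer-2 part whose single extra subpart sits 4 px right, 1 px up.
part = CompositePart(
    layer=2,
    center_id=0,
    subparts=[Subpart(part_id=3, mu=(4.0, -1.0), sigma=((1.0, 0.0), (0.0, 1.0)))],
)
positions = part.expected_positions((10.0, 10.0))
```

Because each subpart refers to a part of the layer below by index, nesting such records layer by layer gives the recursive vocabulary described in the text.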
III Feature Extraction from Learned Parts
LHOP provides information about different properties of objects, such as poses, orientations and category memberships, at different layers [fidler_cvpr07]. For instance, the information on shape parts, which are represented by edge structures and textural patterns observed in images, is obtained using Gabor features at the first layer. In the second and the following layers, compositions of parts are constructed according to the co-occurrence of part realizations that are detected in the images among different views of the objects and across different object categories. In other words, a library of object parts and compositions is learned jointly for all object views and categories.
In order to obtain information about statistical and geometric properties of parts, we extract three types of features from the part realizations detected at each different layer of the LHOP.
III-A Histogram of Orientations of Parts
Histograms of orientations of parts are computed in order to extract information on the co-occurrence of orientations of the parts across different poses of objects. Part orientations are computed according to a coordinate system of an image whose origin is located at the center of the image, and the axes of the coordinate system are shown with blue lines in Fig. 2.
If we define as the realization of the detected part in the layer at an image coordinate of , then its orientation with respect to the origin of the coordinate system is computed as
Then, the image is partitioned into cells , and histograms of the part orientations of the part realizations that are located in each cell are computed. The aggregated histogram values are considered as variables of a dimensional feature vector .
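The HOP computation described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the orientation of a part realization is the angle of its position vector relative to the image center (computed here with `arctan2`), that cells form a regular grid, and that the function name `hop_features` and the parameters `n_cells` and `n_bins` are hypothetical.

```python
import numpy as np


def hop_features(positions, image_shape, n_cells=(2, 2), n_bins=8):
    """Cell-wise histograms of part orientations (illustrative sketch).

    positions   : (N, 2) array of (x, y) pixel coordinates of part realizations
    image_shape : (height, width) of the image
    n_cells     : (rows, cols) of the cell grid (assumed regular)
    n_bins      : number of orientation bins over [0, 2*pi)
    """
    positions = np.asarray(positions, dtype=float)
    height, width = image_shape
    center = np.array([width / 2.0, height / 2.0])

    # Orientation of each realization w.r.t. the image-centered coordinate system.
    rel = positions - center
    theta = np.mod(np.arctan2(rel[:, 1], rel[:, 0]), 2.0 * np.pi)

    # Cell index and orientation-bin index of each realization.
    rows = np.clip((positions[:, 1] / height * n_cells[0]).astype(int), 0, n_cells[0] - 1)
    cols = np.clip((positions[:, 0] / width * n_cells[1]).astype(int), 0, n_cells[1] - 1)
    bins = np.clip((theta / (2.0 * np.pi) * n_bins).astype(int), 0, n_bins - 1)

    # Accumulate one histogram per cell and concatenate them into a feature vector.
    hist = np.zeros((n_cells[0], n_cells[1], n_bins))
    np.add.at(hist, (rows, cols, bins), 1.0)
    return hist.ravel()


# Usage: three realizations in a 100x100 image, one cell, four orientation bins.
pts = [[80, 60], [20, 40], [40, 80]]
feat = hop_features(pts, (100, 100), n_cells=(1, 1), n_bins=4)
```

The length of the resulting vector is the number of cells times the number of bins, which is one way to read the aggregated histogram values as the variables of a fixed-dimensional feature vector.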
III-B Histogram of Oriented Gradients of Parts
In addition to the computation of histograms of orientations of part realizations , we compute histogram of oriented gradients (HOG) [HOG] of in order to extract information about the distribution of gradient orientations of . We denote the HOG feature vector extracted using in the layer as , where is the dimension of the HOG feature vector. The details of the implementation of HOG feature vectors are given in Section V.
III-C The Entropy of Part Graphs
We measure the statistical and structural properties of relative orientations of part realizations by measuring the complexity of a graph of parts. Mathematically speaking, we define a weighted undirected graph in the layer, where is the set of part realizations, is the set of edges, where each edge that connects the part realizations and is associated to an edge weight , which is defined as
where is the position vector of , is the norm or Euclidean norm, and is the inner product of and . In other words, the edge weights are computed according to the orientations of parts relative to each other.
We measure the complexity of the weighted graph by computing its graph entropy. First, we compute the normalized weighted graph Laplacian [von, ent] as
where is a weighted adjacency matrix or a matrix of weights , and is a diagonal matrix with members . Then, we compute the von Neumann entropy of [von, ent] as
are the eigenvalues of, is the trace of the matrix product and . We use as a feature variable .
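The entropy feature of this subsection can be sketched numerically as follows. Several details are assumptions made only for illustration: the edge weight is taken to be the angle between the position vectors of two part realizations (the arccosine of their normalized inner product), the normalized Laplacian is taken in the common form D^{-1/2}(D - W)D^{-1/2}, and the von Neumann entropy is computed from the eigenvalues of the trace-normalized Laplacian. The function name `part_graph_entropy` is hypothetical, and the paper's exact definitions may differ.

```python
import numpy as np


def part_graph_entropy(positions, eps=1e-12):
    """Von Neumann entropy (in bits) of a weighted part graph (illustrative sketch).

    positions : (N, 2) array of part position vectors, measured from the origin
                of the image-centered coordinate system.
    """
    a = np.asarray(positions, dtype=float)
    norms = np.linalg.norm(a, axis=1)

    # Assumed edge weight: angle between position vectors, in [0, pi].
    cos = (a @ a.T) / np.maximum(np.outer(norms, norms), eps)
    w = np.arccos(np.clip(cos, -1.0, 1.0))
    np.fill_diagonal(w, 0.0)

    deg = w.sum(axis=1)
    if deg.sum() <= eps:
        return 0.0  # no edges with positive weight

    # Normalized weighted graph Laplacian D^{-1/2} (D - W) D^{-1/2}.
    d_inv_sqrt = np.where(deg > eps, 1.0 / np.sqrt(np.maximum(deg, eps)), 0.0)
    lap = (np.diag(deg) - w) * np.outer(d_inv_sqrt, d_inv_sqrt)

    # Treat the trace-normalized Laplacian as a density matrix.
    lam = np.linalg.eigvalsh(lap / np.trace(lap))
    lam = lam[lam > eps]
    return float(-(lam * np.log2(lam)).sum())


# Usage: three parts placed symmetrically around the origin (120 degrees apart).
s = np.sqrt(3.0) / 2.0
h_triangle = part_graph_entropy([[1.0, 0.0], [-0.5, s], [-0.5, -s]])
h_pair = part_graph_entropy([[1.0, 0.0], [0.0, 1.0]])
```

Under these assumptions, two parts give zero entropy (the normalized Laplacian has a single non-zero eigenvalue), while three equally spaced parts give one bit, so the value grows with the structural complexity of the part configuration.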
IV Combination of Information Obtained at Different Layers of LHOP for Joint Object Pose Estimation and Categorization
In hierarchical compositional architectures, a different object property, such as object shape, pose and category, is represented at a different layer of a hierarchy in a vocabulary [lai_aaai11]. According to the structures of the abstract representations of the properties, i.e. vocabularies, recognition processes have been performed using either a bottom-up [fidler_cvpr07, fidler_book] or top-down [lai_aaai11] approach. It is worth noting that the information in the representations is distributed among the layers in the vocabularies. In other words, the information about the category of an object may reside at the lower layers of the hierarchy instead of the top layer. In addition, lower layer atomic features, e.g. oriented Gabor features, provide information about part orientations which can be used for the estimation of pose and view-points of objects at the higher layers. Moreover, the relationship between the pose and category of an object is bi-directional. Therefore, an information integration approach should be considered in order to avoid the propagation of errors that occur in multi-task learning and recognition problems such as joint object categorization and pose estimation, especially when only one of the bottom-up and top-down approaches is implemented.
For this purpose, we propose a generative-discriminative learning approach in order to combine the information obtained at each different layer of LHOP using the features extracted from part realizations. We represent the features defining a dimensional feature vector . The feature vector is computed for each training and test image, therefore we denote the feature vector of the image as , , in the rest of the paper.
We combine the feature vectors extracted at each layer for object pose estimation and categorization under the following Group Lasso optimization problem [glasso]
where is the squared norm, is a regularization parameter, is the weight vector computed at the layer, is a matrix of feature vectors , , and is a vector of target variables , . More specifically, where is a set of object poses, i.e. object orientation degrees, in a pose estimation problem.
where is the local estimate of the global variable for at the layer. Then, we solve (5) in the following three steps [admm, jsac],
At each layer , we compute as
where , is a penalty parameter, , is the average of , , and is a vector of scaled dual optimization variables computed at an iteration .
Then we update as
Finally, is updated as
These three steps are iterated until a halting criterion, such as for a given termination time , is achieved. Implementation details are given in the next section.
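To make the role of ADMM in a Group Lasso problem concrete, the sketch below solves min over w of (1/2)||Aw - b||^2 + lambda * sum_l ||w_l||_2, where each group w_l is assumed to hold the weights of the features of one layer. Note that this is the simpler consensus form of ADMM (splitting w = z, with a block soft-thresholding step), swapped in for readability; it is not the distributed feature-splitting scheme with per-layer updates described above. The function names and the fixed iteration count are hypothetical choices.

```python
import numpy as np


def block_soft_threshold(v, kappa):
    """Proximal operator of kappa * ||v||_2 (shrinks a whole group at once)."""
    norm = np.linalg.norm(v)
    if norm <= kappa:
        return np.zeros_like(v)
    return (1.0 - kappa / norm) * v


def group_lasso_admm(feature_blocks, b, lam, rho=1.0, n_iter=200):
    """Consensus-form ADMM for Group Lasso (illustrative sketch).

    feature_blocks : list of (M, d_l) arrays, one block of features per layer
    b              : (M,) vector of target values
    lam            : regularization parameter
    rho            : ADMM penalty parameter
    Returns a list of per-layer weight vectors.
    """
    A = np.hstack(feature_blocks)
    splits = np.cumsum([blk.shape[1] for blk in feature_blocks])[:-1]
    n = A.shape[1]

    gram = A.T @ A + rho * np.eye(n)   # fixed system matrix of the w-update
    Atb = A.T @ b
    z = np.zeros(n)                    # group-sparse copy of the weights
    u = np.zeros(n)                    # scaled dual variable

    for _ in range(n_iter):
        # w-update: ridge-like least squares step.
        w = np.linalg.solve(gram, Atb + rho * (z - u))
        # z-update: block soft thresholding applied to each layer's group.
        z = np.concatenate(
            [block_soft_threshold(v, lam / rho) for v in np.split(w + u, splits)]
        )
        # Dual update.
        u = u + w - z
    return np.split(z, splits)


# Usage on synthetic data with two hypothetical layers of three features each.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((60, 3)), rng.standard_normal((60, 3))]
targets = rng.standard_normal(60)
weights = group_lasso_admm(blocks, targets, lam=1.0)
```

A fixed number of iterations stands in for the halting criterion here; in practice one would monitor primal and dual residuals. With a sufficiently large regularization parameter every group is set exactly to zero, which is the mechanism by which uninformative layers can be switched off as a whole.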
In a class object categorization problem, is a category variable. In order to solve this problem, we employ 1-of-C coding for sparse logistic regression as
where , is a weight vector associated to the category, if , . Then, we define the following optimization problem
norm. Second, we employ the logistic regression loss function in the computation ofas
In the training phase of the pose estimation algorithm, we compute the solution vector using training data. In the test phase, we employ the solution vector on a given test feature vector of the part realizations of an object to estimate its pose as
In the categorization problem, we predict the category label of an object in the image as
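The test-phase computations can be sketched as follows, under the assumption that the pose estimate is the sum over layers of the inner products between each layer's feature vector and its learned weight vector, and that the category is the class with the largest softmax score of the corresponding linear scores. The function names `one_hot`, `predict_pose` and `predict_category` and the array layouts are hypothetical, chosen only for illustration.

```python
import numpy as np


def one_hot(labels, n_classes):
    """1-of-C coding: row i has a single 1 in the column of label i."""
    y = np.zeros((len(labels), n_classes))
    y[np.arange(len(labels)), labels] = 1.0
    return y


def predict_pose(feature_blocks, weight_blocks):
    """Pose estimate as the sum of per-layer linear responses (assumed form)."""
    return sum(f @ w for f, w in zip(feature_blocks, weight_blocks))


def predict_category(feature_blocks, weight_blocks):
    """Category prediction from per-layer weight matrices of shape (d_l, C)."""
    scores = sum(f @ W for f, W in zip(feature_blocks, weight_blocks))
    scores = scores - scores.max()              # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return int(np.argmax(probs)), probs


# Usage with two hypothetical layers.
pose = predict_pose(
    [np.array([1.0, 2.0]), np.array([3.0])],
    [np.array([10.0, 5.0]), np.array([2.0])],
)
label, probs = predict_category(
    [np.array([1.0, 0.0]), np.array([1.0])],
    [np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]), np.array([[0.0, 2.0, 0.0]])],
)
codes = one_hot([0, 2], 3)
```

Summing the per-layer responses is what lets information from every layer contribute to a single prediction, instead of assigning one task to one layer.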
We examine our proposed approach and algorithms on two benchmark object categorization and pose estimation datasets, namely the Amsterdam Library of Object Images (ALOI) [aloi] and the Columbia Object Image Library (COIL-100) [coil]. We have chosen these two benchmark datasets for two main reasons. First, images of objects are captured by rotating the objects on a turntable in regular angular steps, which enables us to analyze our proposed algorithm for multi-view object pose estimation and categorization in uncluttered scenes. Second, object poses and categories are labeled with acceptable precision, which is important for the statistical stability of training and test samples and their target values. In our experiments, we also re-calibrated the labels of pose and rotation values of the objects that are mis-recorded in the datasets.
We select the bin size () of the histograms and cell size of HOP (see Section III-A) and HOG features (see Section III-B) by greedy search on the parameter set , and take the optimal and which minimizes pose estimation and categorization errors in pose estimation and categorization problems using training datasets, respectively. In the employment of optimization algorithms, we compute , where , , is norm and parameter is selected from the set using greedy search by minimizing training error of object pose estimation and categorization as suggested in [admm]. In the implementation of LHOP, we learn the compositional hierarchy of parts and compute the part realizations for [fidler_cvpr07].
In the experiments, pose estimation and categorization performances of the proposed algorithms are compared with state-of-the-art Support Vector Regression (SVR), Support Vector Machines (SVM) [libsvm], Lasso and Logistic regression algorithms [stl] which use the state-of-the-art HOG features [HOG] extracted from the images as considered in [pose1]. In the results, we refer to an implementation of SVM with HOG features as SVM-HOG, SVM with the proposed LHOP features as SVM-LHOP, SVR with HOG features as SVR-HOG, SVR with the proposed LHOP features as SVR-LHOP, Lasso with HOG features as L-HOG, Logistic Regression with HOG features as LR-HOG, Lasso with LHOP features as L-LHOP, Logistic Regression with LHOP features as LR-LHOP.
We use RBF kernels in SVR and SVM. The kernel width parameter is searched in the interval and the SVR cost penalization parameter is searched in the interval using the training datasets.
V-A Experiments on Object Pose Estimation
We have conducted two types of experiments for object pose estimation, namely Object-wise and Category-wise Pose Estimation. In Object-wise Pose Estimation experiments, we analyze the shareability of the parts across different views of an object. In Category-wise Pose Estimation experiments, we analyze how incorporating category information affects the shareability of parts in the LHOP and the pose estimation performance.
V-A1 Experiments on Object-wise Pose Estimation
In the first set of experiments, we consider the objects belonging to each different category, individually. For instance, we select objects for training and objects for testing using objects belonging to cups category. The ID numbers of the objects and their category names are given in Table I. For each object, we have object instances each of which represents an orientation of the object on a turntable rotated with and .
| Object IDs for Training | 82 | 103 | 762 | 13 | 54 | 157 | 9 |
| Object IDs for Testing | 363, 540, 649, 710 | 164, 266, 291, 585 | 798, 829, 831, 965 | 110, 26, 46, 78 | 136, 138, 148, 158 | 36, 125, 153, 259 | 93, 113, 350, 826 |
In the experiments, we first analyze the variation of part realizations and feature vectors across different orientations of an object. We visualize the features , and in Fig. 3 for a cup which is oriented with and for each . In the first row at the top of the figure, the change of is visualized . In the second row, the original images of the objects are given. In the third to the sixth rows, are visualized by displaying the part realizations with pixel intensity values for each . features are visualized in the rest of the rows for each .
In Fig. 3, we first observe that values of the object change discriminatively across different object orientations . For instance, if the handle of the cup is not seen from the front viewpoint of the cup (e.g. at ), then we observe a smooth surface of the cup and the complexity of the part graphs, i.e. the entropy value, decreases. On the other hand, if the handle of the cup is observed at a front viewpoint (e.g. at ), then the complexity increases. In addition, we observe that the difference between values of the object parts across different orientations decreases as increases. In other words, the discriminative power of the generative model of the LHOP increases at the higher layers of the LHOP since the LHOP captures the important parts and compositions that co-occur across different views through different layers.
Given a ground truth and an estimated pose value , the pose estimation error is defined as . Pose estimation errors of state-of-the-art algorithms and the proposed Hierarchical Compositional Approach are given in Fig. 4. In these results, we observe that the pose estimation errors of the algorithms applied to symmetric objects, such as apples and balls, are greater than those of the algorithms applied to more structured objects such as cups.
In order to analyze this observation in detail, we show the ground truth and the estimated orientations of some of the objects from Apples, Balls, cups and Shoes categories in Fig. 5. We observe that some of the different views of the same object have the same shape and textural properties. For instance, the views of the ball at the orientations and represent the same pentagonal shape patterns. Therefore, similar parts are detected at these different views and the similar features are extracted from these detected parts. Then, the orientation of the ball, which is rotated by , is incorrectly estimated as .
V-A2 Experiments on Category-wise Pose Estimation
In Category-wise Pose Estimation experiments, we randomly select different numbers of objects from different numbers of categories as training images to estimate the pose of test objects. We conduct the experiments on both the ALOI and COIL datasets.
In the ALOI dataset, we randomly select number of training objects and test object which belong to the Cups, Cow, Car, Clock and Duck categories. We repeat the random selection process twice and give the average pose estimation error for each experiment. In order to analyze the contribution of the information that can be obtained from the parts to the pose estimation performance using the part shareability of the LHOP, we initially select the Cups and Cow categories () and add new categories (Car, Clock and Duck) to the dataset, incrementally. The results are given in Table II. The results show that the pose estimation error decreases as the number of training samples, , increases. This is due to the fact that the addition of new objects to the dataset increases the statistical representation capacity of the LHOP and the learning model of the regression algorithm. In addition, we observe that the pose estimation error observed in the experiments for decreases when the objects from the Car category are added to a dataset of objects belonging to the Cups and Cow categories in the experiments with . The performance boost is achieved by increasing the shareability of co-occurring object parts in different categories. For instance, the parts that construct the rectangular silhouettes of cows and cars can be shared in the construction of object representations in the LHOP (see Fig. 6).
We conducted two types of experiments on the COIL dataset, constructing balanced and unbalanced training and test sets, in order to analyze the effect of unbalanced data on the pose estimation performance. In the experiments, the objects are selected from the Cat, Spatula, Cups and Car categories which contain , , and objects. Each object is rotated on a turntable by from to .
In the experiments on balanced datasets, images of number of objects are initially selected from Cat and Spatula categories (for ), and then images of the objects selected from Cups and Car categories are incrementally added to the dataset for and category experiments. More specifically, objects are randomly selected from each category and the random selection is repeated two times for each experiment. The results are shown in Table III. We observe that the addition of new objects to the datasets decreases the pose estimation error. Moreover, we observe a remarkable performance boost when the images of the objects from the categories that have similar silhouettes, such as Cat and Cups or Spatula and Car, are used in the same dataset.
We prepared unbalanced datasets by randomly selecting the images of object from each category as a test sample and the images of the rest of the objects belonging to the associated category in the COIL dataset as training samples. For instance, the images of a randomly selected cat are selected as test samples and the images of the remaining two cats are selected as training samples. This procedure is repeated two times in each experiment and the average values of pose estimation errors are depicted in Fig. 7. The results show that SVR is more sensitive to the balance of the dataset and the number of training samples than the proposed approach. For instance, the difference between the pose estimation error of SVR given in Table III and Fig. 7 for is approximately , while that of the proposed Hierarchical Compositional Approach is approximately .
In the next subsection, the experiments on object categorization are given.
V-B Experiments on Object Categorization
In the Object Categorization experiments, we use the same experimental settings that are described in Section V-A2 for Category-wise Pose Estimation.
The results of the experiments conducted on the ALOI dataset and on balanced subsets of the COIL dataset are given in Table IV and Table V, respectively. In these experiments, we observe that the categorization performance decreases as the number of categories increases. In the previous sections, however, we observed that the pose estimation error decreases as the number of categories increases. The reason for this difference is that objects rotated on a turntable may provide similar silhouettes although they belong to different categories. Therefore, adding images of new objects that belong to different categories may boost pose estimation performance. On the other hand, adding the images of these new objects may decrease the categorization performance if the parts of the objects cannot be shared across different categories and thus increase the data complexity of the feature space.
In this paper, we have proposed a compositional hierarchical approach for joint object pose estimation and categorization using a generative-discriminative learning method. The proposed approach first exposes information about pose and category of an object by extracting features from its realizations observed at different layers of LHOP in order to consider different levels of abstraction of information represented in the hierarchy. Next, we formulate joint object pose estimation and categorization problem as a sparse optimization problem. Then, we solve the optimization problem by integrating the features extracted at each different layer using a distributed and parallel optimization algorithm.
We examine the proposed approach on benchmark 2-D multi-view image datasets. In the experiments, the proposed approach outperforms state-of-the-art Support Vector Machines for object categorization and the Support Vector Regression algorithm for object pose estimation. In addition, we observe that shareability of object parts across different object categories and views may increase pose estimation performance. On the other hand, object categorization performance may decrease as the number of categories increases if parts of an object cannot be shared across different categories and thus increase the data complexity of the feature space. The proposed approach can successfully estimate the pose of objects which have view-specific statistical and geometric properties. However, the proposed feature extraction algorithms cannot provide information about the view-specific properties of symmetric or semi-symmetric objects, which leads to a decrease of the object pose estimation and categorization performance. Therefore, ongoing work is directed towards alleviating the problems with symmetric or semi-symmetric objects.