Vision-Based Object Recognition in Indoor Environments Using Topologically Persistent Features

by Ekta U. Samani, et al.
University of Washington

Object recognition in unseen indoor environments remains a challenging problem for visual perception of mobile robots. In this letter, we propose the use of topologically persistent features, which rely on the shape information of the objects, to address this challenge. In particular, we extract two kinds of features, namely, sparse persistence image (PI) and amplitude, by applying persistent homology to multi-directional height function-based filtrations of the cubical complexes representing the object segmentation maps. The features are then used to train a fully connected network for recognition. For performance evaluation, in addition to a widely used shape dataset, we collect a new dataset comprising scene images from two different environments, namely, a living room and a mock warehouse. The scenes in both environments include up to five different objects that are chosen from a given set of fourteen objects. The objects have varying poses and arrangements, and are imaged under different illumination conditions and camera poses. The recognition performance of our methods, which are trained using the living room images, remains relatively unaffected on the unseen warehouse images. In contrast, the performance of the state-of-the-art Faster R-CNN method decreases significantly. In fact, the use of sparse PI features yields higher overall recall and accuracy, and better F1 scores on many of the individual object classes. We also implement the proposed method on a real-world robot to demonstrate its usefulness.




I Introduction

Perception is one of the core capabilities that autonomous robots must possess for successful operation. Object recognition forms an essential aspect of robot perception. Deep learning-based object detectors such as R-CNN and its variants [12, 13, 26], and YOLO and its variants [23, 24], have achieved tremendous success in localizing and recognizing objects. Such methods require a large amount of data. Therefore, a common technique used in training them is to start from a model pre-trained on large databases with millions of images, such as ImageNet [7], and then fine-tune it using images from the scene where the model is expected to operate.

In long-term autonomy, where robots operate in complex environments for extended periods, the environment changes continuously. Therefore, robustness in perception becomes critical [18]. Fine-tuning the object recognition models every time the environment changes is cumbersome. Moreover, the fine-tuned models are sensitive to changes in illumination and object texture [32, 5]. These limitations make such models unsuitable for long-term autonomy applications in environments that change frequently. Different approaches have recently been explored, such as open-ended learning of new object categories based on human-robot interaction [17], or capturing domain-relevant data [8]. Alternatively, for object recognition in such continually changing environments, features that depend on shape and are independent of illumination, context, color, and texture are also promising.

Extracting shape information from high-dimensional data using algebraic topology tools is at the core of topological data analysis (TDA). Persistent homology, a predominant tool in TDA, has been successfully applied to extract persistent features for machine learning tasks, especially computer vision [25, 14, 10]. Reininghaus et al. [25] propose a new stable summary representation known as persistence images using which they show remarkable improvements in 3D shape classification/retrieval and texture recognition using topological features. Guo et al. [14] use sparsely sampled persistence images generated using persistent homology for human posture recognition and texture classification. Garin et al. [10] use persistent features obtained using different filtrations to classify hand-written digits. Generating approximate persistence images using a deep neural network for image classification and human activity recognition using accelerometer data has also been explored.

In this work, we propose the use of topologically persistent features for object recognition in indoor environments. We propose extracting two different kinds of persistent features by applying persistent homology to cubical complexes constructed from binary segmentation maps of objects. These features are then fed to a fully connected network for recognition. Additionally, we present the UW Indoor Scenes Dataset, with scenes from two different environments, namely a living room and a warehouse, for evaluating the robustness of trained object recognition and detection models on unseen environments. We show that, since topologically persistent features are based entirely on shape information, recognition networks trained using such features are more robust to changing environments than the state of the art. Lastly, we also successfully implement the proposed framework on a real-world robot.

The paper is organized as follows. Section II covers some mathematical preliminaries associated with TDA. Section III provides details of the proposed approach, and Section IV describes the datasets used. Experimental details and results are summarized in Section V, followed by a discussion in Section VI. We highlight conclusions and point to directions for future work in Section VII.

II Mathematical Preliminaries

II-A Cubical complexes

For TDA, data is often represented by cubical or simplicial complexes, depending on the type of data. Image data can be considered a point cloud by treating each pixel as a point in a two-dimensional Euclidean space. Such a point cloud is commonly represented using a simplicial complex. However, since images are made up of pixels, they have a natural grid-like structure. Such a structure can be more efficiently represented as a cubical complex [35] in various ways [27, 10]. A cubical complex $K$ in $\mathbb{R}^n$ is a finite set of elementary cubes aligned on the grid $\mathbb{Z}^n$. An elementary cube is a finite product of elementary intervals, and its dimension is defined as the number of its non-degenerate components [16]. An $n$-dimensional image is a map $\mathcal{I}: I \subseteq \mathbb{Z}^n \to \mathbb{R}$. A voxel is an element $v \in I$, and its value $\mathcal{I}(v)$ is the intensity. When $n = 2$, the voxel is known as a pixel, and the intensity is known as the grayscale value. To construct a cubical complex from an $n$-dimensional image, Garin et al. [10] adopt an approach in which an $n$-cube represents a voxel, and all adjacent lower-dimensional cubes (faces of the $n$-cube) are added. The values of the voxels are extended to all the cubes $\sigma$ in the resulting cubical complex $K$ as follows:

$$\mathcal{I}'(\sigma) := \min_{\sigma \text{ face of } \tau} \mathcal{I}(\tau) \tag{1}$$


After data is represented as a suitable complex, a filtration is constructed from the complex as described in the following section.

II-B Filtration

A filtration is a collection of cubical (or simplicial) complexes $\{K_i\}$ such that $K_i$ is a subcomplex of $K_j$ for each $i \le j$. A grayscale image comes with a natural filtration embedded in the grayscale values of its pixels. These values can be used to obtain sublevel sets of the corresponding cubical complex. Let $K_i$ denote the $i$-th sublevel set of $K$, obtained as follows:

$$K_i := \{\sigma \in K \mid \mathcal{I}'(\sigma) \le i\} \tag{2}$$

The set $\{K_i\}_{i \in \mathrm{Im}(\mathcal{I})}$ defines a filtration of cubical complexes, which is indexed by the value of the function $\mathcal{I}$. For binary images, various functions known as descriptor functions (or filtration functions) are used to construct grayscale images from them [10]. Similarly, such functions are also defined for generating filtrations from point cloud data or mesh data [25].
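The nesting property of sublevel sets can be checked on a toy image. The following is a minimal sketch in plain Python, not the authors' implementation; it tracks only the top-dimensional cells (pixels), and the function name `sublevel_set` and the 3x3 image are illustrative assumptions.

```python
def sublevel_set(image, threshold):
    """Return the coordinates of all pixels whose value is <= threshold."""
    return {
        (r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value <= threshold
    }


# Toy grayscale image (hypothetical values).
image = [
    [0, 2, 1],
    [3, 1, 2],
    [0, 3, 0],
]

# One sublevel set per distinct grayscale value, in increasing order.
levels = sorted({value for row in image for value in row})
filtration = [sublevel_set(image, t) for t in levels]

# Each sublevel set is contained in the next one, which is what makes
# the sequence a filtration.
is_nested = all(a <= b for a, b in zip(filtration, filtration[1:]))
```

Here the filtration has four steps (one per distinct value), starting from the three pixels with value 0 and ending with the full image.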

II-C Persistent homology

Persistent homology, a prevalent tool in topological data analysis, studies the topological changes of the sublevel sets $K_i$ as $i$ increases from $-\infty$ to $\infty$. During filtration, topological features, interpreted as $h$-dimensional holes, appear and disappear at different scales, referred to as their birth and death times, respectively. This information is summarised in an $h$-dimensional persistence diagram (PD). An $h$-dimensional PD is a countable multiset of points in $\mathbb{R}^2$. Each point $(b, d)$ represents an $h$-dimensional hole born at a time $b$ and filled at a time $d$. [Footnote 1: In this work, we consider only $0$-th order homology, i.e., appearance and disappearance of $0$-dimensional holes (connected components).] The diagonal of a PD is a multiset $\Delta = \{(b, b) \in \mathbb{R}^2 \mid b \in \mathbb{R}\}$, where every point in $\Delta$ has infinite multiplicity.
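For connected components, the birth and death times can be computed with a union-find pass over the pixels in increasing order of value. The sketch below is an illustration under stated assumptions, not the library routine used in the paper: it assumes pixels are top-dimensional cubes, so that pixels touching at an edge or a corner become connected (8-connectivity), it applies the usual elder rule (the younger component dies at a merge), and it drops zero-persistence pairs. The name `h0_persistence` is hypothetical.

```python
def h0_persistence(image):
    """0-dimensional sublevel-set persistence pairs (birth, death) of a 2D image."""
    rows, cols = len(image), len(image[0])
    order = sorted((image[r][c], r, c) for r in range(rows) for c in range(cols))
    parent, birth = {}, {}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    pairs = []
    for value, r, c in order:
        p = (r, c)
        parent[p], birth[p] = p, value
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                q = (r + dr, c + dc)
                if q == p or q not in parent:
                    continue
                a, b = find(p), find(q)
                if a == b:
                    continue
                if birth[a] < birth[b]:
                    a, b = b, a  # make `a` the younger component
                if birth[a] < value:
                    pairs.append((birth[a], value))  # younger component dies
                parent[a] = b
    # Components that never merge into an older one live forever.
    roots = {find(p) for p in parent}
    pairs.extend((birth[root], float("inf")) for root in roots)
    return sorted(pairs)


toy = [
    [0, 5, 1],
    [5, 5, 5],
    [2, 5, 0],
]
diagram = h0_persistence(toy)
```

In the toy image, four isolated low-valued corners are born at 0, 0, 1, and 2, and all of them merge when the value-5 pixels enter the filtration, leaving one component that persists indefinitely.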

Several other stable representations of persistence that can be obtained from a PD are also proposed [1, 2]. One such representation is the Persistence Image (PI), a stable and finite-dimensional vector representation of persistent homology [1]. The PD is first mapped to an integrable function known as the persistence surface, which is defined as the weighted sum of Gaussian functions centered at each point in the PD. A grid is obtained by discretizing a sub-domain of the persistence surface. Integrating the persistence surface over each grid box results in a matrix known as the PI.
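The construction of a PI can be sketched in a few lines of plain Python. This is a simplified illustration, not the Persim implementation: it assumes the PD is first mapped to (birth, persistence) coordinates, that each Gaussian is isotropic with standard deviation `spread`, and that the weight is linear in persistence. The function name, grid bounds, and default values are hypothetical.

```python
import math


def persistence_image(diagram, grid=5, spread=1.0, lo=0.0, hi=10.0):
    """Integrate a persistence surface over a grid x grid set of boxes.

    `diagram` holds finite (birth, death) pairs. Rows index persistence
    bins and columns index birth bins, both covering [lo, hi].
    """
    step = (hi - lo) / grid

    def gauss_mass(a, b, mu):
        # Mass of a 1D normal density N(mu, spread^2) on the interval [a, b].
        z = spread * math.sqrt(2.0)
        return 0.5 * (math.erf((b - mu) / z) - math.erf((a - mu) / z))

    image = [[0.0] * grid for _ in range(grid)]
    for birth, death in diagram:
        persistence = death - birth
        weight = persistence  # linear weighting: vanishes on the diagonal
        for i in range(grid):
            mass_y = gauss_mass(lo + i * step, lo + (i + 1) * step, persistence)
            for j in range(grid):
                mass_x = gauss_mass(lo + j * step, lo + (j + 1) * step, birth)
                # An isotropic 2D Gaussian factorizes into two 1D masses.
                image[i][j] += weight * mass_x * mass_y
    return image


pi = persistence_image([(1.0, 4.0)])
```

For the single point (1, 4), the largest entry falls in the box covering birth in [0, 2] and persistence in [2, 4], and a point on the diagonal contributes nothing because its weight is zero.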

III Method

Given an RGB scene image, our goal is to recognize objects present in the scene using the shape information captured in topologically persistent features. We first generate segmentation maps for all the objects present in the input scene image. Section III-A describes the adopted approach. We then extract topologically persistent features from the object segmentation maps, as described in Section III-B. These features are then fed to a fully connected network for recognition. Fig. 1 illustrates the proposed framework.

Fig. 1: Proposed framework for object recognition using topologically persistent features

III-A Object mask generation

To generate segmentation maps for all the objects in an input scene image, we follow a foreground segmentation approach similar to the one proposed in [37]. We use the state-of-the-art DeepLabv3+ architecture [4] pre-trained on a large number of classes, hypothesized to have a strong representation of ’objectness.’ Subsequently, we train the network using pixel-level foreground annotations for a limited number of images. A shape-based object recognition method relies on the objects’ contours for distinguishing between multiple objects. However, when segmentation is performed on images taken from distances as large as two meters, the number of foreground pixels is very low compared to the background. Hence, it is difficult for a segmentation model to capture minor details in the objects’ shape in a single shot. Therefore, to preserve the details of the objects’ contours, we employ a two-stage framework for segmentation. In the first stage, the segmentation model predicts a segmentation map for the input scene image. Contour detection is performed on this scene segmentation map to obtain bounding boxes of the objects. These bounding boxes are used to divide the scene image into multiple sub-images, each containing a single object. In the second stage, these sub-images are fed to the same trained segmentation model to predict individual object segmentation maps.
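The step between the two stages (scene mask to per-object sub-images) can be illustrated as follows. This sketch swaps in connected-component labeling for the contour detection used in the paper, which gives the same boxes for non-touching blobs; the function names `object_bounding_boxes` and `crop` are hypothetical, and the segmentation model itself is not shown.

```python
from collections import deque


def object_bounding_boxes(mask):
    """Return inclusive (top, left, bottom, right) boxes, one per foreground blob.

    Blobs are 8-connected groups of nonzero pixels in a binary scene mask.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            top, left, bottom, right = r0, c0, r0, c0
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:  # breadth-first flood fill of one blob
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and mask[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


def crop(image, box):
    """Cut the sub-image covered by an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


scene_mask = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
boxes = object_bounding_boxes(scene_mask)
sub_images = [crop(scene_mask, box) for box in boxes]
```

In the actual pipeline the crops would be taken from the RGB scene image and passed back through the segmentation model; here the mask is cropped only to keep the example self-contained.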

III-B Persistent features generation

Segmentation maps are essentially binary images made of only black and white voxels. A grayscale image, suitable for building a filtration of cubical complexes, can be built from such binary images by using various filtration functions. A commonly used filtration function for highlighting topological features in data is the height function. The height function is introduced in [33] to compute the Persistent Homology Transform, a sufficient statistic that can uniquely represent shapes in $\mathbb{R}^2$ and surfaces in $\mathbb{R}^3$. For cubical complexes, we adopt the definition of the height function given in [10]. Consider a binary object segmentation map $\mathcal{B}: I \subseteq \mathbb{Z}^2 \to \{0, 1\}$. A grayscale image for the segmentation map can be obtained as follows. A direction $v \in \mathbb{R}^2$ of unit norm is chosen for the height function. All voxels $p \in I$ are reassigned values such that

$$\mathcal{H}(p) := \begin{cases} \langle p, v \rangle & \text{if } \mathcal{B}(p) = 1 \\ \mathcal{H}_\infty & \text{otherwise} \end{cases} \tag{3}$$

where $\langle p, v \rangle$ is the distance of voxel $p$ from the hyperplane defined by $v$, and $\mathcal{H}_\infty$ is the filtration value of the voxel that is farthest away from the hyperplane.
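A direct transcription of the height function into plain Python might look as follows. This is a sketch under assumptions: pixel coordinates are taken as (column, row), the direction is normalized inside the function, and the background value is the largest height over the whole image. The name `height_filtration` is hypothetical and this is not the giotto-tda routine.

```python
import math


def height_filtration(binary, direction):
    """Turn a binary map into a grayscale image via the height function.

    Foreground pixels get the scalar product of their (x, y) position with
    the unit direction; background pixels get the largest height in the image.
    """
    vx, vy = direction
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm  # enforce unit norm
    heights = [
        [x * vx + y * vy for x in range(len(row))]
        for y, row in enumerate(binary)
    ]
    h_inf = max(max(row) for row in heights)
    return [
        [h if b == 1 else h_inf for h, b in zip(h_row, b_row)]
        for h_row, b_row in zip(heights, binary)
    ]


binary_map = [
    [0, 1, 0],
    [1, 1, 1],
]
gray = height_filtration(binary_map, direction=(1.0, 0.0))
```

Sweeping along the positive x direction, the foreground pixels enter the filtration column by column, while background pixels only appear at the final value.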

We obtain such grayscale images for each object segmentation map considering $c$ directions that are evenly spaced on the unit 1-sphere $S^1$. We construct a cubical complex from each grayscale image according to Eq. 1. Sublevel sets of the complexes are then obtained according to Eq. 2 to get filtrations. Persistent homology is applied to the filtrations to obtain persistence diagrams (PDs) for every object segmentation map.
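Evenly spaced directions on the unit circle can be generated as below. This is only an illustrative helper (the name `unit_directions` and the starting angle of zero are assumptions); each returned direction yields one grayscale image and hence one PD per object segmentation map.

```python
import math


def unit_directions(count):
    """Return `count` unit vectors evenly spaced on the unit circle."""
    return [
        (math.cos(2.0 * math.pi * k / count), math.sin(2.0 * math.pi * k / count))
        for k in range(count)
    ]


directions = unit_directions(8)  # 8 directions, 45 degrees apart
```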

We investigate the performance of two types of persistent features computed from the generated PDs. The following subsections III-B1 and III-B2 describe the details of their computation.

III-B1 Sparse Persistence Image features

Since the number of points in a PD varies from shape to shape, such a representation of persistence is not suitable for machine learning tasks. Instead, we use the PI representation to generate features suitable for training the recognition network. However, only a few key pixel locations of the persistence images (PIs), which contain nonzero entries, sparsely encode topological information. Therefore we adopt the QR-pivoting based sparse sampling to obtain a Sparse PI as proposed in [14].

For every object segmentation mask, PIs are generated from the corresponding PDs. Sparse sampling is performed separately for PIs of different directions. For every direction $j$, corresponding PIs for all training object segmentation maps are vectorized and arranged into columns of a matrix $X^j = [x^j_1 \mid x^j_2 \mid \dots \mid x^j_N]$, where $N$ is the number of training object segmentation maps, and $j \in \{1, \dots, c\}$ indicates the direction among the $c$ directions used. Fig. 2 shows sample PIs generated for different objects using height functions in multiple directions.

Fig. 2: Persistence images (PIs) for two sample objects. PIs, obtained using height functions in different directions, together have enough information to distinguish between the two objects.

The dominant PI variation patterns $\hat{U}^j$ are obtained by computing the truncated singular value decomposition of $X^j$ as follows:

$$X^j = U^j \Sigma^j (V^j)^\top \approx \hat{U}^j \hat{\Sigma}^j (\hat{V}^j)^\top \tag{4}$$

where the truncation rank $r_j$ is the optimal singular value threshold [11] for the $j$-th direction PIs. $\hat{U}^j$ is then discretely sampled using the pivoted QR factorization as follows:

$$(\hat{U}^j)^\top (P^j)^\top = Q^j R^j \tag{5}$$

The numerically well-conditioned row permutation $P^j$ is then multiplied with $X^j$ to give a matrix $X^j_s = P^j X^j$ of sparsely sampled PIs. Each column of $X^j_s$ represents the $j$-th direction sparse PI for the corresponding object segmentation map. For a particular object segmentation map, sparse PIs for all directions are stacked to form a single feature vector.
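The sampling step for one direction can be sketched with NumPy and SciPy. This is an illustration of QR-pivoting-based sparse sampling under assumptions, not the authors' code: the truncation rank is passed in as a parameter instead of being derived from the optimal singular value threshold, the data are random, and the name `sparse_pi_rows` is hypothetical.

```python
import numpy as np
from scipy.linalg import qr


def sparse_pi_rows(X, rank):
    """Select `rank` pixel locations from vectorized PIs via pivoted QR.

    X has one vectorized PI per column (pixels x training maps).
    """
    # Dominant variation patterns: the leading left singular vectors.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_r = U[:, :rank]
    # Column-pivoted QR of U_r^T ranks pixel locations by importance.
    _, _, pivots = qr(U_r.T, pivoting=True)
    return np.sort(pivots[:rank])


rng = np.random.default_rng(0)
X_train = rng.random((25, 12))  # toy data: 25-pixel PIs, 12 training maps
rows = sparse_pi_rows(X_train, rank=4)
X_sparse = X_train[rows, :]     # sparse PIs: only the selected pixel rows
```

At test time the same row indices would be applied to the vectorized PI of a new segmentation map, so that training and test features share the same pixel locations.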

III-B2 Amplitude features

An alternative method of generating topologically persistent features is using the amplitude, or distance of a PD, from an empty diagram. For each of the generated PDs corresponding to an object segmentation map, we compute the bottleneck amplitude, defined as follows:

$$A^j := \max_{k} \, \operatorname{dist}\big((b^j_k, d^j_k), \Delta\big) \tag{6}$$

where $(b^j_k, d^j_k)$ are all the non-diagonal points in the $j$-th direction PD and $\operatorname{dist}(\cdot, \Delta)$ denotes the distance of a point from the diagonal $\Delta$. These amplitudes are stacked to form a $c$-dimensional feature vector, where $c$ is the number of directions. Such $c$-dimensional feature vectors are generated for all the training object segmentation maps and used for training the recognition network.
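As a concrete illustration, the amplitude of each PD reduces to a single maximum. The sketch below assumes the Euclidean distance from a point (birth, death) to the diagonal, which is (death - birth) / sqrt(2); other conventions, such as the L-infinity distance, differ only by a constant factor. The function names are hypothetical.

```python
import math


def bottleneck_amplitude(diagram):
    """Largest distance from any (birth, death) point to the diagonal.

    Assumes the Euclidean ground metric; returns 0.0 for an empty diagram.
    """
    return max(((d - b) / math.sqrt(2.0) for b, d in diagram), default=0.0)


def amplitude_features(diagrams):
    """Stack one amplitude per direction into a feature vector."""
    return [bottleneck_amplitude(diagram) for diagram in diagrams]


# Two hypothetical PDs, e.g. from two height-function directions.
features = amplitude_features([[(0.0, 2.0), (1.0, 5.0)], [(3.0, 4.0)]])
```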

IV Datasets

IV-A MPEG-7 Shape Silhouette Dataset

To exclusively evaluate the performance of the persistent features-based recognition network, we choose the MPEG-7 Shape Silhouette Dataset, which is a widely used dataset for image retrieval. We use a subset of this dataset, namely, the MPEG-7 CE Shape 1 Part B dataset. The subset is specifically designed for evaluating the performance of 2D shape descriptors for similarity-based image retrieval [19]. It includes shapes of 70 different classes and 20 images for each class, for a total of 1400 images. Fig. 3 shows sample classes of the dataset.

Fig. 3: Sample images from the MPEG-7 Shape Silhouette Dataset

IV-B UW Indoor Scenes Dataset

The shapes in the MPEG-7 dataset are detailed and fairly distinguishable from each other. However, ordinary objects in indoor environments such as warehouses are often less detailed and, therefore, more challenging for topological methods. Most deep learning-based object detectors work exceptionally well in detecting such everyday objects in their training environments. However, the same models face challenges when used in new environments without any retraining.

Therefore, we introduce a new dataset designed for testing the performance of object detection in indoor scenes when the training and test environments differ. We pick fourteen objects from the Yale-CMU-Berkeley (YCB) object and model set [3] for our dataset. The dataset consists of indoor scenes taken in two completely different environments. The first environment is a living room scene where objects are placed on a tabletop. The second environment is a mock warehouse setup where objects are placed on a shelf. For the living room environment, we have a total of 347 scene images. The images are taken in four different illumination settings, from three different perspectives and varying distances up to two meters. Of the 347 scene images, 16 contain two different objects, 135 contain three different objects, 156 contain four different objects, and 40 contain five different objects. For the warehouse environment, we have a total of 200 scene images taken from distances up to 1.6 meters. Of the 200 images, 60 contain three different objects, 68 contain four different objects, and 72 contain five different objects. Fig. 4(a) shows some sample living room scene images, and Fig. 4(b) shows sample images from the warehouse environment. Fig. 4(c) shows all the fourteen objects used in our dataset.

(a) Living room environment
(b) Warehouse environment
(c) Objects
Fig. 4: Representative images from the UW Indoor Scenes Dataset

V Experiments

V-A Implementation details

For the MPEG-7 Shape Silhouette Dataset, we divide the 1400 images into five sets of 280 images each (four images of each class). We perform fivefold training and testing using these sets, such that each set is used once as a test set while the remaining four sets are used for training and validation. We use the giotto-tda [30] library to generate the PDs and the Persim package in the Scikit-TDA Toolbox to generate the PIs. We choose a grid size of 50x50, a spread of 10, and a linear weighting function for generating the PIs. PDs and PIs are generated using height functions in 8 directions evenly spaced on $S^1$. We use a three-layered, fully connected network for recognition using amplitude features. For sparse PI features, we use a five-layered, fully connected network for recognition. We use rectified linear unit (ReLU) activation for all layers except the last layer, which uses softmax activation. We use the Adam optimizer and the categorical cross-entropy loss function. We use an initial learning rate of 0.01 for the first 500 epochs. We decrease the learning rate by a factor of 10 after every 100 epochs for the next 200 epochs, and by a factor of 100 for the last 100 epochs.
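The learning rate schedule can be written as a small step function. The sketch below encodes one possible reading of the description, which is an assumption: 800 epochs in total, with the rate held at 0.01 for epochs 0 to 499, divided by 10 at epoch 500 and again at epoch 600, and divided by a further 100 for the final 100 epochs. The function name is hypothetical.

```python
def learning_rate(epoch, base=0.01):
    """Step schedule under one reading of the described training setup."""
    if epoch < 500:
        return base            # first 500 epochs
    if epoch < 600:
        return base / 10       # first drop by a factor of 10
    if epoch < 700:
        return base / 100      # second drop by a factor of 10
    return base / 10000        # last 100 epochs: further factor of 100


schedule = [learning_rate(e) for e in (0, 499, 500, 650, 750)]
```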

For the UW Indoor Scenes Dataset, we use the living room scene images for training and testing, whereas the warehouse scene images are exclusively used for testing. We use the Xception-65 network backbone [6] for the DeepLabv3+ architecture. We initialize the network using a model pre-trained on the ImageNet dataset [7] and the PASCAL VOC2012 dataset [9]. The model is trained with 200 living room scene images and corresponding segmentation maps generated using LabelMe [36]. The segmentation maps consist of only two classes: the foreground class and the background class. Horizontally flipped counterparts of the 200 images are also included in the training. We use the TensorFlow implementation of DeepLabv3+ and the pre-trained model from [31]. We train the network for 20,000 steps using the categorical cross-entropy loss with 1% hard example mining after 2500 steps. In other words, the fraction of pixels used for computing the loss is gradually reduced from 100% to 1% over the first 2500 steps, after which only the top 1% of pixels (with respect to loss values) are used. The model is trained on a workstation running the Ubuntu 18.04 LTS operating system, equipped with a 3.7 GHz 8-core Intel Xeon W-2145 CPU, a ZOTAC GeForce GTX 1080 Ti GPU, and 64 GB RAM.
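The hard example mining rule can be illustrated on a flat list of per-pixel losses. This sketch assumes a linear decay of the kept fraction over the warm-up steps (the text only says "gradually") and averages the kept losses; it is not the DeepLabv3+ implementation, and the function names are hypothetical.

```python
def kept_fraction(step, warmup_steps=2500, final_fraction=0.01):
    """Fraction of pixels used for the loss at a given training step.

    Assumes a linear decay from 100% to `final_fraction` during warm-up.
    """
    if step >= warmup_steps:
        return final_fraction
    return 1.0 - (1.0 - final_fraction) * step / warmup_steps


def mined_loss(pixel_losses, step):
    """Mean loss over the hardest pixels (largest loss values) only."""
    k = max(1, round(kept_fraction(step) * len(pixel_losses)))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k


losses = [float(i) for i in range(200)]  # toy per-pixel loss values
early = mined_loss(losses, step=0)       # all 200 pixels contribute
late = mined_loss(losses, step=5000)     # only the top 1% (2 pixels)
```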

We then perform fivefold training and testing for object recognition. All the 347 living room scene images are divided into five sets (three sets with 69 images and two sets with 70 images). We also include horizontally flipped counterparts of the 347 living room scene images in their respective sets. Object segmentation maps generated from each set are used once as the test set, while the object segmentation maps generated from the remaining four sets are used for training and validation. We augment the training data by rotating every training object segmentation map by three different angles. Since all the objects are of different sizes and observed from different distances, their corresponding segmentation maps also have different sizes. Therefore, we pad the object segmentation maps with zeros to obtain a square segmentation map without distorting the object contour. We then resize all segmentation maps to a common resolution of 125x125. PDs and PIs are generated in the same manner as described for the MPEG-7 Shape Silhouette Dataset, except that the spread value is chosen to be 20. We also use the same fully connected network architectures and hyperparameters as those used for training recognition networks using amplitude features and sparse PI features for the MPEG-7 Shape Silhouette Dataset.
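Zero-padding to a square can be sketched as follows. Centering the original map inside the square is an assumption (the text does not say where the padding is placed), the function name is hypothetical, and the subsequent resize to a common resolution is not shown.

```python
def pad_to_square(mask):
    """Pad a binary map with zeros so that height equals width.

    The original map is centered, so the object contour is not distorted.
    """
    rows, cols = len(mask), len(mask[0])
    side = max(rows, cols)
    top = (side - rows) // 2
    left = (side - cols) // 2
    square = [[0] * side for _ in range(side)]
    for r in range(rows):
        for c in range(cols):
            square[top + r][left + c] = mask[r][c]
    return square


wide = [[1, 1, 1, 1]]          # a 1x4 object map
padded = pad_to_square(wide)   # becomes 4x4 with the object in row 1
```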

We compare the performance of both persistent features-based methods against Faster R-CNN [26], a state-of-the-art object detection method, on the UW Indoor Scenes Dataset. Similar to the persistent features-based methods, we perform fivefold training and testing of Faster R-CNN using the same five sets of living room images. Ground truth bounding box annotations are generated using LabelImg [34]. We use the InceptionResnet-V2 feature extractor [29] for the Faster R-CNN framework. For training, we use the implementation, pre-trained model, and hyperparameters available with the TensorFlow Object Detection API [15]. The recognition networks for our methods and the Faster R-CNN models are trained on a workstation running the Windows 10 operating system, equipped with a 2.20 GHz Intel Xeon E5-2630 CPU, a GeForce GTX 1080 GPU, and 32 GB RAM.

V-B Results

We first examine the performance of both amplitude features and sparse PI features on the MPEG-7 Shape Silhouette Dataset. We use the weighted F1 score, weighted precision, weighted recall, and accuracy for evaluating the performance. Table I shows the test-time performance of the trained, fully connected object recognition networks. We observe that recognition using the sparse PI features is better than recognition using amplitude features with respect to all four reported metrics. Recognition using sparse PI features achieves an accuracy of 0.87 ± 0.01. A performance of 100% using 2D shape knowledge is not possible for this dataset, since some classes contain shapes that are significantly different from others in the same class; they are more similar to shapes in other classes than to shapes in their own class [19].

                 Amplitude      Sparse PI
F1 score (w)     0.75 ± 0.01    0.87 ± 0.01
Precision (w)    0.77 ± 0.01    0.89 ± 0.02
Recall (w)       0.76 ± 0.01    0.87 ± 0.01
Accuracy         0.76 ± 0.01    0.87 ± 0.01
TABLE I: Performance comparison of amplitude and sparse PI features on the MPEG-7 Shape Silhouette Dataset
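The support-weighted metrics reported in the tables can be computed as in the following sketch, which assumes single-label predictions and weights each class by its share of the true labels. The function name and the toy labels are illustrative, not taken from the experiments.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1, plus plain accuracy."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = count / total  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


scores = weighted_scores(
    ["cup", "cup", "bowl", "fork"],
    ["cup", "bowl", "bowl", "cup"],
)
```

With this weighting, weighted recall is algebraically equal to accuracy for a single-label classifier, which is consistent with the matching Recall (w) and Accuracy rows in the tables.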

Table II shows the performance of amplitude features and sparse PI features along with the performance of Faster R-CNN on the five splits of living room scene images from the UW Indoor Scenes Dataset. We observe that both persistent features-based methods, which use only segmentation map information, perform reasonably well. Recognition with sparse PI features, which achieves an accuracy of 0.71 ± 0.01, is somewhat better than recognition with amplitude features, whose accuracy is 0.69 ± 0.01. [Footnote 2: To provide a fair judgement of the effectiveness of persistent features-based methods, we do not account for false negatives corresponding to objects that the segmentation model fails to detect.] The difference in performance, however, is not as large as in the case of the MPEG-7 Shape Silhouette Dataset. Moreover, Faster R-CNN, which uses RGB images as input, outperforms both of them with an accuracy of 0.88 ± 0.01.

Metric          Class              Amplitude      Sparse PI      Faster R-CNN
F1 score        Spoon              0.62 ± 0.04    0.57 ± 0.01    0.84 ± 0.3
                Fork               0.15 ± 0.05    0.16 ± 0.07    0.81 ± 0.02
                Plate              0.83 ± 0.02    0.89 ± 0.01    0.98 ± 0.01
                Bowl               0.95 ± 0.01    0.94 ± 0.02    0.98 ± 0.01
                Cup                0.67 ± 0.02    0.78 ± 0.04    0.91 ± 0.01
                Pitcher base       0.81 ± 0.03    0.87 ± 0.02    0.92 ± 0.02
                Bleach cleanser    0.73 ± 0.02    0.72 ± 0.02    0.87 ± 0.03
                Mustard bottle     0.63 ± 0.02    0.68 ± 0.03    0.84 ± 0.03
                Soup can           0.68 ± 0.04    0.69 ± 0.02    0.84 ± 0.04
                Chips can          0.72 ± 0.02    0.75 ± 0.02    0.92 ± 0.03
                Meat can           0.55 ± 0.05    0.56 ± 0.04    0.82 ± 0.03
                Gelatin box        0.55 ± 0.02    0.64 ± 0.04    0.91 ± 0.02
                Screwdriver        0.58 ± 0.04    0.72 ± 0.04    0.91 ± 0.02
                Padlock            0.74 ± 0.03    0.72 ± 0.03    0.97 ± 0.02
F1 score (w)    -                  0.68 ± 0.01    0.71 ± 0.01    0.89 ± 0.01
Precision (w)   -                  0.69 ± 0.01    0.71 ± 0.02    0.91 ± 0.01
Recall (w)      -                  0.69 ± 0.01    0.71 ± 0.01    0.88 ± 0.01
Accuracy        -                  0.69 ± 0.01    0.71 ± 0.01    0.88 ± 0.01
TABLE II: Performance comparison of proposed persistent features-based methods with Faster R-CNN on the living room images from the UW Indoor Scenes Dataset

We believe that the performance of persistent features-based methods is affected by the quality of the generated segmentation maps. Notably, performance for the fork is considerably poorer than for other objects. Therefore, we compare the performance of both persistent features-based methods against the performance of a human on the object segmentation maps. Table III summarizes the results of the comparison. For persistent features-based methods, we report the accuracy considering only those images where the human recognizes the object correctly. We observe that a human achieves an accuracy of 0.84 ± 0.01, which is lower than the Faster R-CNN performance, and finds it difficult to recognize objects based on the generated segmentation maps, especially for the spoon and fork classes. Refer to Section VI for further discussion regarding the quality of segmentation maps.

Class              Human performance    Amplitude      Sparse PI
Spoon              0.37 ± 0.07          0.44 ± 0.05    0.47 ± 0.09
Fork               0.40 ± 0.11          0.16 ± 0.08    0.12 ± 0.07
Plate              0.96 ± 0.02          0.90 ± 0.03    0.90 ± 0.04
Bowl               0.97 ± 0.01          0.96 ± 0.01    0.97 ± 0.01
Cup                0.92 ± 0.03          0.65 ± 0.05    0.81 ± 0.03
Pitcher base       0.97 ± 0.01          0.87 ± 0.04    0.92 ± 0.02
Bleach cleanser    0.91 ± 0.01          0.86 ± 0.02    0.77 ± 0.02
Mustard bottle     0.91 ± 0.03          0.63 ± 0.04    0.73 ± 0.04
Soup can           0.76 ± 0.05          0.83 ± 0.06    0.81 ± 0.06
Chips can          0.89 ± 0.04          0.81 ± 0.04    0.74 ± 0.07
Meat can           0.86 ± 0.03          0.55 ± 0.07    0.61 ± 0.03
Gelatin box        0.88 ± 0.04          0.59 ± 0.03    0.56 ± 0.07
Screwdriver        0.83 ± 0.03          0.63 ± 0.06    0.74 ± 0.06
Padlock            0.85 ± 0.04          0.81 ± 0.05    0.82 ± 0.03
Total              0.84 ± 0.01          0.74 ± 0.01    0.76 ± 0.01
TABLE III: Comparison of the performance of proposed persistent features-based methods with a human

Despite the challenges posed by the use of segmentation maps, their main benefit is that recognition performance is expected to remain almost unchanged even when the objects’ environments vary considerably, provided the segmentation maps’ quality remains consistent. Therefore, we test the recognition performance of persistent features-based methods and Faster R-CNN on all the warehouse scene images of the UW Indoor Scenes Dataset without any retraining of the models. [Footnote 3: To ensure that the segmentation maps’ quality remains unchanged, we fine-tune the segmentation model on 131 out of the 200 warehouse scene images.] Similar to the living room case, we also add horizontally flipped counterparts to the test set. Table IV summarizes the performance of all three methods on the warehouse test environment. Numbers in bold indicate cases where the performance is significantly better according to a paired two-tailed Student’s t-test. We observe that the overall recognition accuracy using sparse PI features is almost unchanged, and better than the accuracy using amplitude features. The recognition accuracy using amplitude features drops by 4% (from 0.69 to 0.66), whereas the overall accuracy of Faster R-CNN drops by approximately 25% (from 0.88 to 0.63) without any fine-tuning. Moreover, recognition performance obtained using sparse PI features is better than Faster R-CNN for many object classes. We discuss these results in further detail, along with some more observations, in Section VI.

Metric          Class              Amplitude      Sparse PI      Faster R-CNN
F1 score        Spoon              0.44 ± 0.04    0.52 ± 0.02    0.39 ± 0.03
                Fork               0.28 ± 0.04    0.20 ± 0.04    0.24 ± 0.03
                Plate              0.16 ± 0.07    0.04 ± 0.02    0.27 ± 0.16
                Bowl               0.94 ± 0.00    0.91 ± 0.01    0.97 ± 0.01
                Cup                0.51 ± 0.02    0.90 ± 0.01    0.94 ± 0.01
                Pitcher base       0.86 ± 0.01    0.94 ± 0.00    0.93 ± 0.03
                Bleach cleanser    0.79 ± 0.02    0.78 ± 0.01    0.84 ± 0.04
                Mustard bottle     0.76 ± 0.02    0.74 ± 0.01    0.36 ± 0.05
                Soup can           0.73 ± 0.02    0.66 ± 0.02    0.85 ± 0.02
                Chips can          0.76 ± 0.01    0.74 ± 0.01    0.71 ± 0.03
                Meat can           0.58 ± 0.09    0.71 ± 0.01    0.75 ± 0.05
                Gelatin box        0.23 ± 0.04    0.25 ± 0.02    0.70 ± 0.03
                Screwdriver        0.78 ± 0.02    0.72 ± 0.01    0.42 ± 0.04
                Padlock            0.61 ± 0.03    0.84 ± 0.01    0.89 ± 0.03
F1 score (w)    -                  0.65 ± 0.01    0.70 ± 0.00    0.65 ± 0.01
Precision (w)   -                  0.67 ± 0.01    0.70 ± 0.01    0.77 ± 0.02
Recall (w)      -                  0.66 ± 0.01    0.71 ± 0.01    0.63 ± 0.01
Accuracy        -                  0.66 ± 0.01    0.71 ± 0.01    0.63 ± 0.01
TABLE IV: Comparison of the performance of proposed persistent features-based methods and Faster R-CNN on the warehouse images from the UW Indoor Scenes Dataset.

V-C Robot Implementation

We also implement our proposed framework on the LoCoBot platform built on the Yujin Robot Kobuki Base (YMR-K01-W1) and powered by the Intel NUC NUC7i5BNH Mini PC. We mount the ZED2 camera with stereo vision on top of the LoCoBot. The LoCoBot is controlled using the PyRobot interface [20]. Images captured by the camera are fed to the trained segmentation model and recognition networks, which run on an NVIDIA Jetson AGX Xavier Developer Kit. It is equipped with a 512-core Volta GPU with Tensor Cores and an 8-core ARM v8.2 64-bit CPU. We use TensorRT [21] for optimizing the trained segmentation model. Fig. 5 shows a screenshot of the setup.

Fig. 5: Screenshot of the LoCoBot operating in a warehouse setup

VI Discussion

We observe in Table II and Table IV that the performance of persistent features-based methods is particularly poor for the fork and spoon classes, as compared to the other object classes. A similar observation can be made from Table III for human performance. We attribute this trend to the inherent similarity between forks and spoons combined with segmentation difficulty. Fig. 6(a) shows one such example where the object in question is a fork but the shape in the segmentation map looks more similar to a spoon. In multiple poses of the fork, especially when observed from distances as large as two meters, the current segmentation model finds it hard to segment the tines.

We also observe from Table III that our methods reach up to 76% of human performance using sparse PI features and 74% using amplitude features. We believe this difference in performance is largely due to the additional human capability of completing shapes. The segmentation model sometimes produces incomplete shapes due to poor segmentation at the scene segmentation stage, or poorly segmented shapes due to poor segmentation at the object segmentation stage. For instance, Fig. 6(b) shows a segmentation map where the padlock is only partially segmented. Humans can look at the shackle and still infer that it is a padlock. However, topologically, the shape in the segmentation map is reasonably different from that of a padlock.

On comparing the classwise performance metrics for the living room and warehouse environments from Tables II and IV, respectively, we observe only some variation for most classes; the exceptions are the plate and gelatin box classes, where the drop is significant. These variations can be attributed to changes in the camera viewing angle that result in new, unseen object poses. Additionally, there are naturally occurring variations in object placement across the two environments. Fig. 6(c) illustrates this problem for the gelatin box. The leftmost image and its corresponding segmentation map come from the living room scene. The middle image and its segmentation map show the same object pose captured from a different camera viewing angle in the warehouse. In this case, the shape becomes extremely similar to that of the chips can in the rightmost image. Unlike for other objects such as the pitcher base and cup, the shape in this 2D segmentation map contains very little discriminative information for correctly recognizing the object. Similarly, Fig. 6(d) shows how the change in the camera viewing angle from the leftmost image (living room) to the middle image (warehouse) results in a completely different 2D shape. The plate’s shape resembles the half-visible spoon in the rightmost image more than it resembles the plate in the leftmost image. Since this shape never appears when training the recognition network on living room images, it is almost always wrongly recognized as a spoon. Not only our methods but also Faster R-CNN finds this new shape particularly challenging to recognize. We believe that incorporating depth information when extracting persistent features can address this problem.

Fig. 6: Sample failure cases. Fig. 6(a) shows a case where a poorly segmented fork is confused with a spoon. Fig. 6(b) shows a case where, unlike the proposed methods, a human succeeds because of shape completion. Fig. 6(c) and 6(d) illustrate that a change in perspective results in new shapes for some objects, making them similar to other objects.

VII Conclusions

In this letter, we propose the use of topologically persistent features for object recognition in indoor environments. We construct cubical complexes from binary segmentation maps of the objects. For every cubical complex, we obtain multiple filtrations using height functions in multiple directions. Persistent homology is applied to these filtrations to obtain topologically persistent features that capture the objects’ shape information used for recognition. We propose using two different kinds of persistent features, namely sparse PI features and amplitude features, for training a fully connected recognition network. Unlike that of a state-of-the-art object detector, the overall recognition performance of the proposed persistent features-based methods remains relatively unaffected in a different test environment without retraining, provided that the quality of the segmentation maps used for persistent feature extraction is maintained. Moreover, our methods also outperform the state-of-the-art detector for certain object classes, making them a promising first step toward robust object recognition.
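As a rough illustration of the height function-based filtrations summarized above, the sketch below assigns filtration values to the pixels of a toy binary segmentation map for several sweep directions. It is a simplified stand-in and not the implementation used in this letter: the function name `height_filtration`, the 5x5 L-shaped mask, the choice of eight evenly spaced directions, and the rule of placing background pixels one unit above the largest height are all assumptions made for the example. Computing persistent homology on the resulting arrays would additionally require a cubical persistence routine, which is omitted here.

```python
import numpy as np


def height_filtration(mask, direction):
    """Assign each foreground pixel its height along `direction`.

    Background pixels receive a value strictly above every foreground
    height, so they enter the filtration last (an assumed convention).
    """
    mask = np.asarray(mask, dtype=bool)
    v = np.asarray(direction, dtype=float)
    v = v / np.linalg.norm(v)  # unit vector given as (row, column) components
    rows, cols = np.indices(mask.shape)
    heights = rows * v[0] + cols * v[1]
    heights = heights - heights.min()  # shift so the sweep starts at 0
    background_value = heights.max() + 1.0
    return np.where(mask, heights, background_value)


# Eight evenly spaced sweep directions (an illustrative choice).
angles = np.arange(8) * np.pi / 4
directions = [(np.sin(a), np.cos(a)) for a in angles]

# Toy 5x5 segmentation map: an L-shaped object.
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1] = True
mask[3, 1:4] = True

# One filtration (an array of pixel values) per direction.
filtrations = [height_filtration(mask, d) for d in directions]
print(filtrations[0])  # sweep along increasing column index
```

Each array defines a sublevel-set filtration of the cubical complex built on the pixel grid: thresholding at increasing values reveals the object progressively along the chosen direction, so different directions expose different parts of the shape first.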

In the future, we intend to enhance the recognition performance of our system by adding depth information obtained using RGB-D or stereo cameras to deal with the challenges associated with the camera viewing angle discussed in Section VI. Depth information would also help extend the segmentation capabilities to instance segmentation and deal with incomplete segmentation maps and partial occlusion of objects. Finally, we also plan to explore the use of topologically persistent features in estimating 6D poses of objects using few-shot deep learning methods.


  • [1] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier (2017) Persistence images: A stable vector representation of persistent homology. J. Mach. Learn. Res. 18 (1), pp. 218–252. Cited by: §II-C.
  • [2] P. Bubenik (2015) Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16 (1), pp. 77–102. Cited by: §II-C.
  • [3] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar (2015) Benchmarking in manipulation research: using the Yale-CMU-Berkeley object and model set. IEEE Robot. Automat. Mag. 22 (3), pp. 36–52. Cited by: §IV-B.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conf. Comput. Vis., Cited by: §III-A.
  • [5] B. Cheng, Y. Wei, H. Shi, R. Feris, J. Xiong, and T. Huang (2018) Revisiting R-CNN: On awakening the classification power of Faster R-CNN. In European Conf. Comput. Vis., pp. 453–468. Cited by: §I.
  • [6] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1251–1258. Cited by: §V-A.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A large-scale hierarchical image database. In IEEE Conf. Comput. Vis. Pattern Recognit., pp. 248–255. Cited by: §I, §V-A.
  • [8] C. Eriksen, A. Nicolai, and W. Smart (2018) Learning object classifiers with limited human supervision on a physical robot. In IEEE Intl Conf. Robot. Comput., pp. 282–287. Cited by: §I.
  • [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. Intl. J. Comput. Vis. 88 (2), pp. 303–338. Cited by: §V-A.
  • [10] A. Garin and G. Tauzin (2019) A topological “reading” lesson: classification of MNIST using TDA. In 2019 18th IEEE Intl. Conf. Mach. Learn. Applicat. (ICMLA), pp. 1551–1556. Cited by: §I, §II-A, §II-B, §III-B.
  • [11] M. Gavish and D. L. Donoho (2014) The optimal hard threshold for singular values is 4/√3. IEEE Trans. Inf. Th. 60 (8), pp. 5040–5053. Cited by: §III-B1.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014-06) Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recognit., Cited by: §I.
  • [13] R. Girshick (2015) Fast R-CNN. In IEEE Intl. Conf. Comput. Vis., pp. 1440–1448. Cited by: §I.
  • [14] W. Guo, K. Manohar, S. L. Brunton, and A. G. Banerjee (2018) Sparse-TDA: Sparse realization of topological data analysis for multi-way classification. IEEE Trans. Knowl. Data. Eng. 30 (7), pp. 1403–1408. Cited by: §I, §III-B1.
  • [15] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7310–7311. Cited by: §V-A.
  • [16] T. Kaczynski, K. Mischaikow, and M. Mrozek (2006) Computational homology. Vol. 157, Springer Science & Business Media. Cited by: §II-A.
  • [17] S. H. Kasaei, M. Oliveira, G. H. Lim, L. S. Lopes, and A. M. Tomé (2018) Towards lifelong assistive robotics: A tight coupling between object perception and manipulation. Neurocomputing 291, pp. 151–166. Cited by: §I.
  • [18] L. Kunze, N. Hawes, T. Duckett, M. Hanheide, and T. Krajník (2018) Artificial intelligence for long-term robot autonomy: A survey. IEEE Robot. Autom. Lett. 3 (4), pp. 4023–4030. Cited by: §I.
  • [19] L. J. Latecki, R. Lakamper, and T. Eckhardt (2000) Shape descriptors for non-rigid shapes with a single closed contour. In IEEE Conf. Comput. Vis. Pattern Recognit., pp. 424–429. Cited by: §IV-A, §V-B.
  • [20] A. Murali, T. Chen, K. V. Alwala, D. Gandhi, L. Pinto, S. Gupta, and A. Gupta (2019) PyRobot: An open-source robotics framework for research and benchmarking. arXiv preprint arXiv:1906.08236. Cited by: §V-C.
  • [21] Nvidia NVIDIA/tensorrt. External Links: Link Cited by: §V-C.
  • [22] D. Pachauri, C. Hinrichs, M. K. Chung, S. C. Johnson, and V. Singh (2011) Topology-based kernels with application to inference problems in Alzheimer’s disease. IEEE Trans. Med. Imag. 30 (10), pp. 1760–1770. Cited by: §I.
  • [23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: Unified, real-time object detection. In IEEE Conf. Comput. Vis. Pattern Recognit., pp. 779–788. Cited by: §I.
  • [24] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7263–7271. Cited by: §I.
  • [25] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt (2015) A stable multi-scale kernel for topological machine learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4741–4748. Cited by: §I, §II-B.
  • [26] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In Adv. Neural Inform. Process. Syst., pp. 91–99. Cited by: §I, §V-A.
  • [27] V. Robins, P. J. Wood, and A. P. Sheppard (2011) Theory and algorithms for constructing discrete Morse complexes from grayscale digital images. IEEE Trans. Pattern Anal. Mach. Intell. 33 (8), pp. 1646–1658. Cited by: §II-A.
  • [28] A. Som, H. Choi, K. Natesan Ramamurthy, M. P. Buman, and P. Turaga (2020) PI-Net: A deep learning approach to extract topological persistence images. In IEEE Conf. Comput. Vis. Pattern Recognit. Workshop, pp. 834–835. Cited by: §I.
  • [29] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi (2016) Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261. Cited by: §V-A.
  • [30] G. Tauzin, U. Lupo, L. Tunstall, J. B. Pérez, M. Caorsi, A. Medina-Mardones, A. Dassatti, and K. Hess (2020) Giotto-tda: A topological data analysis toolkit for machine learning and data exploration. External Links: 2004.02551 Cited by: §V-A.
  • [31] Tensorflow Tensorflow/models. External Links: Link Cited by: §V-A.
  • [32] S. Thys, W. Van Ranst, and T. Goedemé (2019) Fooling automated surveillance cameras: adversarial patches to attack person detection. In IEEE Conf. Comput. Vis. Pattern Recognit. Workshop, pp. 0–0. Cited by: §I.
  • [33] K. Turner, S. Mukherjee, and D. M. Boyer (2014) Persistent homology transform for modeling shapes and surfaces. Information and Inference: A Journal of the IMA 3 (4), pp. 310–344. Cited by: §III-B.
  • [34] Tzutalin (2015) Tzutalin/labelimg. External Links: Link Cited by: §V-A.
  • [35] H. Wagner, C. Chen, and E. Vuçini (2012) Efficient computation of persistent homology for cubical data. In Topological methods in data analysis and visualization II, pp. 91–106. Cited by: §II-A.
  • [36] Wkentaro (2019) GitHub -wkentaro/labelme. External Links: Link Cited by: §V-A.
  • [37] B. Xiong, S. D. Jain, and K. Grauman (2018) Pixel objectness: learning to segment generic objects automatically in images and videos. IEEE Trans. Pattern Anal. Mach. Intell. 41 (11), pp. 2677–2692. Cited by: §III-A.
  • [38] A. W. Yu, L. Huang, Q. Lin, R. Salakhutdinov, and J. Carbonell (2017) Block-normalized gradient method: An empirical study for training deep neural network. arXiv preprint arXiv:1707.04822. Cited by: §V-A.