Unseen Object Instance Segmentation with Fully Test-time RGB-D Embeddings Adaptation

by Lu Zhang, et al.

Segmenting unseen objects is a crucial ability for the robot since it may encounter new environments during the operation. Recently, a popular solution is leveraging RGB-D features of large-scale synthetic data and directly applying the model to unseen real-world scenarios. However, even though depth data have fair generalization ability, the domain shift due to the Sim2Real gap is inevitable, which presents a key challenge to the unseen object instance segmentation (UOIS) model. To tackle this problem, we re-emphasize the adaptation process across Sim2Real domains in this paper. Specifically, we propose a framework to conduct the Fully Test-time RGB-D Embeddings Adaptation (FTEA) based on parameters of the BatchNorm layer. To construct the learning objective for test-time back-propagation, we propose a novel non-parametric entropy objective that can be implemented without explicit classification layers. Moreover, we design a cross-modality knowledge distillation module to encourage the information transfer during test time. The proposed method can be efficiently conducted with test-time images, without requiring annotations or revisiting the large-scale synthetic training data. Besides significant time savings, the proposed method consistently improves segmentation results on both overlap and boundary metrics, achieving state-of-the-art performances on two real-world RGB-D image datasets. We hope our work could draw attention to the test-time adaptation and reveal a promising direction for robot perception in unseen environments.



I Introduction

In recent years, an inevitable development trend in robotics is the transition from controlled labs to unstructured environments. When encountering new environments and objects, the robot system must have the ability to adjust itself and recognize unseen objects. Such ability is essential for robots to better understand working environments and perform various manipulation tasks. To take a step toward this goal, we approach the task of Unseen Object Instance Segmentation (UOIS) [41, 39, 40, 42], which aims to conduct instance-aware segmentation of unseen objects in tabletop scenes. In UOIS, the robot system needs to learn the concept of “object” and generalize it to unseen ones.

Fig. 1: Overview of the proposed method. (a) Existing UOIS models [41, 39, 40, 42] are typically trained with large-scale synthetic data and then directly run on the unseen realistic scenarios. (b) FTEA aims to mitigate the domain shift between the synthetic and realistic data by further adapting the model during test time. Specifically, the convolutional layers are frozen and BN layers are modulated with two novel unsupervised objectives, i.e. the NEO and the CKD.

However, while ImageNet [21] has spurred great progress in classification and object detection for natural images, no comparable large-scale dataset of real images containing sufficient objects for robotic manipulation scenes currently exists [39]. Therefore, current UOIS methods generally utilize large-scale synthetic RGB-D data for training. For example, Xie et al. [41] propose to use synthetic scenes that can be rendered into RGB-D images from different viewpoints with automatically generated labels. Since depth inputs promise better generalization than non-photorealistic RGB images, previous works [41, 39, 40, 42] typically take depth or RGB-D images as inputs to train a model that can separately segment each object instance. Afterward, the model is directly deployed on unseen realistic datasets. Though such a workaround achieves reasonable performance, it neglects the domain shift caused by the “sim2real gap”.

In general, no simulation perfectly replicates reality. For robot perception, synthetic data fail to model many aspects of the real world, such as object texture, lighting conditions, and depth noise. Such a sim2real gap degrades the model’s performance on realistic data, especially in unseen environments. Therefore, rather than relying solely on the generalization ability of an elaborate model, we emphasize adaptation to the domains in which the model is deployed. A natural solution for reducing such domain shift is unsupervised domain adaptation (UDA) [6, 7, 23], which takes data from the labelled source (synthetic) domain and the unlabelled target (realistic) domain to bridge the distribution gap. However, the UOIS model cannot consistently access the large-scale synthetic data once it is deployed. There is also an extreme imbalance between the synthetic and realistic data, which is undesirable for UDA. Besides, it is infeasible to fully collect and label the unseen realistic data. Thus, without relying on synthetic data or supervision, the model must adapt given only its parameters and unseen realistic images.

In this paper, we consider the open-set nature of UOIS and propose a Fully Test-time RGB-D Embeddings Adaptation (FTEA) method to mitigate the domain gap between synthetic data and unseen realistic data. The overview of our method is illustrated in Figure 1. First, we propose a novel Non-parametric Entropy Objective (NEO), since the entropy is shown to be related to error and shift [34]. More confident predictions are generally more correct and have fewer shifts. Specifically, NEO leverages unsupervised clustering to obtain centroids that represent pseudo instance labels, and calculates the classification probabilities in a non-parametric way. Thus we can use the distributions to formulate the Shannon entropy as the learning objective for test-time adaptation. Second, we design a Cross-modality Knowledge Distillation (CKD) module to encourage knowledge transfer during test time. CKD aims to distill the knowledge from the multimodal feature map to unimodal ones, since fused features provably perform better with more aggregated information [13]. Specifically, we use the full multimodal network as the teacher and the smaller partial unimodal network as the student. Finally, to avoid the model divergence and instability caused by tuning all parameters of the model, we utilize the affine transformation provided by Batch-Normalization (BN) [14] as the modulation parameters. By re-calibrating the channels of the RGB and depth feature maps, better-fused RGB-D embeddings for unseen object instance segmentation are obtained.

Given the pre-trained model, FTEA is independent of the large-scale synthetic training data and does not introduce extra parameters, which establishes a key advantage of flexibility. Meanwhile, compared to the inference process itself, the computation overhead of the proposed adaptation is negligible (taking a total of 20s on a single GPU), which makes FTEA particularly efficient. The proposed method is evaluated on two real-world RGB-D image datasets for unseen object instance segmentation, i.e. OSD [28] and OCID [31]. Extensive experiments show that FTEA consistently improves the segmentation performance and achieves state-of-the-art results on various evaluation metrics, demonstrating the effectiveness of our method.

The main contributions of this work are summarized as follows. (1) We pay attention to the understudied domain shift problem in UOIS and introduce a novel framework for fully test-time RGB-D embeddings adaptation. (2) To realize effective test-time adaptation for UOIS, we design two unsupervised learning objectives, i.e. the NEO and the CKD. (3) The proposed method can be efficiently implemented along with the inference process and introduces little computation overhead. (4) FTEA achieves state-of-the-art results on two standard real-world RGB-D datasets, i.e. OSD and OCID.

II Related Work

II-A Unseen Object Instance Segmentation

UOIS aims to conduct instance-aware segmentation of unseen objects in tabletop environments, which is useful for robots to perform various manipulation tasks. As a pioneer, Xie et al. [41, 42] tackle this problem by proposing a two-stage framework. The framework first operates only on depth to produce rough initial segmentation masks and then refines those masks with RGB. Then, Xiang et al. [39] introduce a fully convolutional network based model called UCN that can be trained end-to-end. Different from previous approaches that mainly rely on depth for segmentation, UCN utilizes both depth images and non-photorealistic RGB images to produce feature embeddings for every pixel, which can be used to learn a distance metric for clustering to segment unseen objects.

Recently, several methods have been proposed to tackle specific challenges in UOIS. For instance, RICE [40] focuses on the occlusion problem in cluttered scenes and utilizes a graph-based representation of instance masks to refine the outputs of previous methods. UOAIS-Net [1] presents a new unseen object amodal instance segmentation (UOAIS) task to emphasize amodal perception for robotic manipulation in cluttered scenes. Besides, UOAIS-Net introduces a large-scale photo-realistic synthetic dataset named UOAIS-SIM to improve the sim2real transferability.

Though these methods have been demonstrated to be effective, they do not explicitly address the domain shift problem. It is worth noting that the photo-realistic data used by UOAIS-Net are also generated from rendered scenes in a simulator and are thus synthetic. In contrast, we offer a new perspective on the domain shift problem in UOIS and propose an efficient framework to conduct the adaptation during test time.

Fig. 2: Illustration of the proposed FTEA. We use the two-stream CNN architecture for RGB-D inputs, which is largely simplified for better visualization. The model pre-trained with synthetic data is based on UCN [39]. During test time, we construct a novel NEO module to calculate the entropy in a non-parametric way. Then a CKD module is further proposed to encourage the cross-modality information transfer. In fully test-time adaptation, all convolutional layers are frozen, and only the affine parameters and normalization statistics of the BN layers are modulated.

II-B Unsupervised Domain Adaptation

Our work is also related to unsupervised domain adaptation (UDA) since we aim to mitigate the domain shift between the synthetic and realistic data in UOIS. The goal of UDA is to bridge the gap between an annotated source domain and an unannotated target domain. Current UDA approaches can be roughly classified into three categories: data generation, self-training based on pseudo-labels, and domain-invariant representation learning.

Data generation is a popular solution for UDA. Hoffman et al. [11] align input images to match the style of target images by using image translation mechanisms based on GANs [48]. The method of [44] swaps a portion of the frequency information of the source image with that of the target image to perform style transfer. In recent years, pseudo-labels generated for the target domain have been used as additional training material to reach better performance [20, 25]. In order to reduce the effect of noise, Zhang et al. [47] propose pseudo-category estimation and an online correction mechanism to improve label quality. Besides, approaches from different categories can be combined to obtain state-of-the-art results [15, 36, 24].

Domain-invariant representation learning aims to reduce the representation discrepancy between the source and target domains in a specific feature space so that a domain-invariant representation can be learned. Adversarial learning [33, 3, 27, 18] is widely used to learn domain-invariant features. These methods employ a discriminator network and a gradient reversal layer (GRL) [6] to align the features either at the global or local image level. By hypothesizing that domain-related knowledge is represented by the statistics of the BN layer, Li et al. [19] propose AdaBN to adopt domain-specific normalization for different domains. Due to its effectiveness, several recent works [2, 49, 16] share the same inspiration. Moreover, modulating BN parameters is also efficient since it does not introduce an extra layer or structure to learn domain-invariant representations.

However, during adaptation, the UDA method typically needs: 1) annotated source data, 2) unannotated target data, and 3) domain annotations for each instance, which is impractical for the UOIS. Thus, it is difficult to simply utilize the UDA method to bridge the domain gap in UOIS, which makes domain adaptation in UOIS a much more challenging task.

II-C Test-time Adaptation

When the model is deployed, it inevitably encounters unlabelled images that have not been observed before. This is a key characteristic of robot perception and the UOIS task. Therefore, developing strategies to adapt the model at test time is essential. Recently, test-time training (TTT, TTT+) [32, 22] uses unlabelled test instances to conduct self-supervised learning as the adaptation. But such methods rely heavily on the choice of proxy tasks and also need to revisit training data that could be unavailable in practice. To address the above limitations, Wang et al. [34] propose a fully test-time adaptation method named Tent to reduce generalization error by test-time entropy minimization. Tent only needs the model parameters and unlabelled testing data for adaptation. By minimizing the entropy, the model optimizes itself according to its own predictions, without relying on proxy tasks.

However, the aforementioned test-time adaptation methods are mainly applied to traditional closed-set tasks such as image classification, and exploit the output distributions of classifiers as the test-time optimization objective. For the UOIS task, the model generally cannot be trained with an explicit discriminative layer with a fixed number of classes, which poses a major obstacle for test-time adaptation in UOIS.

III Method

III-A Overview

III-A1 Network Architecture

To deal with RGB-D inputs, we adopt a two-stream CNN with late fusion as our basic network architecture. As shown in Figure 2, the RGB and depth images of a pair are processed by separate CNN streams, and the resulting RGB and depth feature maps are bilinearly upsampled to the full resolution of the input images. Then, late fusion is conducted to obtain a joint representation. We use UCN [39] as our pre-trained model due to its conciseness and end-to-end fashion.

III-A2 The Pipeline

During test time, we construct a non-parametric entropy objective (NEO) with the unsupervised clustering and non-parametric classification probability, which is explained in detail in the following Section III-B. To encourage test-time information transfer, a cross-modality knowledge distillation (CKD) module is further proposed, as described in Section III-C. Finally, we fix all convolutional layers and minimize the proposed NEO and CKD loss to modulate the channel-wise affine transformations of Batch-Normalization (BN) [14]. Details of the adaptation process are described in Section III-D.

III-B Non-parametric Entropy Objective

To modulate features during test time, a learning objective based on the model’s predictions on test data is typically needed. Recent works [34, 26] use the discriminative outputs (e.g. classification probabilities) to calculate the entropy objective and have shown its effectiveness. However, different from the standard recognition model, UOIS does not explicitly train a discriminative layer that produces logits and probabilities for the direct calculation of entropy. To address this problem, we propose a novel non-parametric entropy objective (NEO). First, we utilize unsupervised clustering to obtain centroids that represent pseudo-classes. Then, the non-parametric classification probability is designed to calculate the entropy objective. Next, we describe each step in detail.

III-B1 Unsupervised clustering

Given a set of RGB-D embeddings on a feature map, we aim to cluster all pixels into groups to segment unseen objects. But the number of unseen objects is uncertain, which prevents the usage of clustering algorithms that require a known number of clusters, such as $k$-means or spectral clustering. Thus, we follow UCN [39] to use the mean shift [4] clustering algorithm with the von Mises-Fisher (vMF) distribution [17], which automatically discovers the number of objects and generates a segmentation mask for each object. After the mean shift clustering, we can calculate the centroid of each cluster. The $k$-th cluster’s centroid vector $\mathbf{c}_k$ is obtained by averaging all feature map vectors $\mathbf{f}_{ij}$ which belong to the $k$-th cluster $\mathcal{M}_k$ as

$$\mathbf{c}_k = \frac{1}{|\mathcal{M}_k|} \sum_{(i,j) \in \mathcal{M}_k} \mathbf{f}_{ij}, \tag{1}$$

where $i$ and $j$ denote locations on the feature map along the $x$-axis and $y$-axis. After performing the average operation for each cluster, we obtain the set of all centroids as

$$\mathcal{C} = \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_N\}, \tag{2}$$

where $N$ is the number of objects, which is estimated by the mean shift clustering algorithm.
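The averaging step can be illustrated with a minimal sketch in plain Python. It assumes the cluster assignment has already been produced (for example by mean shift); the function name `cluster_centroids` and the toy data are hypothetical and not taken from the paper or from the UCN code.

```python
from collections import defaultdict


def cluster_centroids(embeddings, labels):
    """Average the embedding vectors assigned to each cluster label.

    embeddings: list of equal-length float lists, one per pixel.
    labels: list of cluster ids (same length), assumed to come from a
            clustering step such as mean shift.
    Returns a dict mapping cluster id to its centroid vector.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        for d, value in enumerate(vec):
            sums[lab][d] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


# Toy example: four 2-D pixel embeddings split into two clusters.
pixels = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
assignment = [0, 0, 1, 1]
centroids = cluster_centroids(pixels, assignment)
```

The number of keys in `centroids` plays the role of the estimated object count.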

III-B2 Non-parametric classification probability

Different from classification with standard supervised learning, UOIS segments objects at the instance level and thus has an uncertain number of “classes”. Inspired by recent self-supervised learning methods with instance discrimination [38, 43], we propose to calculate classification probabilities in a non-parametric way. Below we describe its key feature and highlight the differences between the parametric and non-parametric formulations.

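Parametric classification probability. We formulate the parametric classification objective based on the popular softmax criterion. For a candidate feature vector $\mathbf{f}$, its probability to be recognized as the $k$-th class is

$$p(k \mid \mathbf{f}) = \frac{\exp(\mathbf{w}_k^{\top} \mathbf{f})}{\sum_{n=1}^{N} \exp(\mathbf{w}_n^{\top} \mathbf{f})}, \tag{3}$$

where $\mathbf{w}_k$ denotes the classification weights optimized by supervised learning.

Non-parametric classification probability. Without specifying the number of classes, the classification weights cannot be formulated and learned. Hence we propose to conduct non-parametric classification by using the metric between the candidate feature vector $\mathbf{f}$ and the $k$-th cluster’s centroid vector $\mathbf{c}_k \in \mathcal{C}$ as follows:

$$p(k \mid \mathbf{f}, \mathcal{C}) = \frac{\exp\big(s(\mathbf{f}, \mathbf{c}_k)\big)}{\sum_{n=1}^{N} \exp\big(s(\mathbf{f}, \mathbf{c}_n)\big)}, \tag{4}$$

where

$$s(\mathbf{f}, \mathbf{c}_k) = \frac{\mathbf{f}^{\top} \mathbf{c}_k}{\|\mathbf{f}\| \, \|\mathbf{c}_k\|} \tag{5}$$

is the cosine similarity to measure how well $\mathbf{f}$ matches the $k$-th cluster/object.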
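As a rough illustration of the non-parametric probability, the sketch below applies a softmax to the cosine similarities between one feature vector and a list of centroids. It assumes no temperature scaling at this stage and non-zero vectors; the function names are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nonparametric_probs(feature, centroids):
    """Softmax over cosine similarities to every centroid.

    No learned classifier weights are involved: the centroids act as
    the class prototypes. (Assumption: no temperature is applied here.)
    """
    sims = [cosine_similarity(feature, c) for c in centroids]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: one feature vector against three centroids.
probs = nonparametric_probs([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

Because cosine similarity is bounded in [-1, 1], the resulting distribution is fairly flat unless the similarities are rescaled, which is one reason a temperature factor is natural in the distillation step later on.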

III-B3 Entropy objective

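As an unsupervised objective, the Shannon entropy [29] is widely used and has been demonstrated to be effective in test-time adaptation [34]. With the proposed non-parametric classification probability, we can calculate the Shannon entropy as

$$\mathcal{L}_{\mathrm{NEO}} = -\sum_{k \in \mathcal{K}} p(k \mid \mathbf{f}, \mathcal{C}) \log p(k \mid \mathbf{f}, \mathcal{C}), \tag{6}$$

where $\mathcal{K}$ is the set of indices of the nearest clusters for the candidate $\mathbf{f}$, and $p(k \mid \mathbf{f}, \mathcal{C})$ is the probability that $\mathbf{f}$ is recognized as the $k$-th cluster/object given a collection of centroids $\mathcal{C}$.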
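A minimal sketch of an entropy restricted to the nearest clusters is given below. It assumes that "nearest" means the clusters with the highest probability (equivalently, highest similarity) and that the probabilities are not renormalized over the selected subset; both are assumptions, and `topk_entropy` is a hypothetical name.

```python
import math


def topk_entropy(probs, k):
    """Shannon entropy (natural log) summed over the k largest probabilities.

    Assumption: probabilities are computed over all clusters and are
    not renormalized after selecting the k nearest ones.
    """
    nearest = sorted(probs, reverse=True)[:k]
    return -sum(p * math.log(p) for p in nearest if p > 0.0)


# A confident prediction has lower entropy than an uncertain one.
confident = topk_entropy([0.98, 0.01, 0.01], k=2)
uncertain = topk_entropy([0.40, 0.35, 0.25], k=2)
```

Minimizing this quantity pushes each pixel embedding toward a single centroid, which is the behaviour the entropy objective is meant to encourage.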

III-C Cross-modality Knowledge Distillation

To enhance the cross-modality knowledge transfer for better-fused representations during test-time adaptation, we further propose a cross-modality knowledge distillation (CKD) module. CKD aims to distill the knowledge from the multimodal feature map to unimodal ones, since fused features provably perform better with more aggregated information [13]. Therefore, CKD uses the full multimodal network as the teacher, and the smaller partial unimodal network is treated as the student.

For each candidate feature vector $\mathbf{f}$ on the fused/teacher feature map, the soft targets (i.e. probabilities of the input belonging to the classes) are calculated as

$$q(k \mid \mathbf{f}, \mathcal{C}) = \frac{\exp\big(s(\mathbf{f}, \mathbf{c}_k) / T\big)}{\sum_{n=1}^{N} \exp\big(s(\mathbf{f}, \mathbf{c}_n) / T\big)}, \tag{7}$$

where $T$ is the temperature factor to control the importance of each soft target. Due to the absence of a parametric classifier, here we also use the non-parametric way as in Section III-B to calculate the classification probabilities.

For the individual RGB and depth modalities (i.e. the students), we use the same spatial cluster assignment as in the teacher, but they hold different representations $\mathbf{f}^{rgb}$ and $\mathbf{f}^{d}$, thus producing different cluster centroids $\mathbf{c}_k^{rgb}$ and $\mathbf{c}_k^{d}$. Then, the probabilities $p^{rgb}$ and $p^{d}$ for the RGB and depth modalities can be calculated similarly to Equation 7. Accordingly, the cross-modality knowledge distillation objective can be formulated as

$$\mathcal{L}_{\mathrm{CKD}} = \mathcal{L}_{\mathrm{KL}}\big(p^{rgb}, q\big) + \mathcal{L}_{\mathrm{KL}}\big(p^{d}, q\big), \tag{8}$$

where $\mathcal{L}_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence loss. By optimizing Equation 8, we can make the outputs of the student network match those of the teacher network.
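The distillation idea can be sketched as follows: temperature-scaled softmax distributions are computed for the teacher (fused) and for each student (RGB, depth), and the two teacher-to-student KL divergences are summed. The direction KL(teacher || student) and the unweighted sum are assumptions made for this sketch, and all function names are hypothetical.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Softmax of scores / temperature, computed stably."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete distributions."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


def ckd_loss(teacher_sims, rgb_sims, depth_sims, temperature=1.0):
    """Sum of KL terms between the fused teacher and each unimodal student.

    Each argument is a list of similarities between one pixel embedding
    and the cluster centroids of the corresponding feature map.
    """
    q = softmax_with_temperature(teacher_sims, temperature)
    p_rgb = softmax_with_temperature(rgb_sims, temperature)
    p_depth = softmax_with_temperature(depth_sims, temperature)
    return kl_divergence(q, p_rgb) + kl_divergence(q, p_depth)


# Students that agree with the teacher incur no loss; disagreement is penalized.
matched = ckd_loss([0.9, 0.1], [0.9, 0.1], [0.9, 0.1])
mismatched = ckd_loss([0.9, 0.1], [0.1, 0.9], [0.9, 0.1])
```

With a temperature of 1 the teacher and student distributions are compared as they are, without extra sharpening or softening.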

III-D Fully Test-time RGB-D Embeddings Adaptation

By minimizing the aforementioned entropy objective $\mathcal{L}_{\mathrm{NEO}}$ and distillation objective $\mathcal{L}_{\mathrm{CKD}}$, we can adapt our model fully at test time. However, tuning all parameters as in the training phase is inefficient and could easily cause model divergence [34]. As each channel of a feature map can be considered as a feature detector [45, 37], re-calibrating channel responses has been widely studied and utilized in network pruning [10], multimodal fusion [35, 46], representation learning [12, 30], etc. Therefore, to stabilize the test-time adaptation and aim for an effective fusion of RGB-D embeddings, we use the channel-wise affine transformation provided by Batch-Normalization (BN) [14] as our modulation parameters.

The BN layer is widely used in deep learning to reduce internal covariate shift and improve generalization. It performs a linear transformation following the convolutional or fully-connected layers. We denote by $\mathbf{x}_{m,l,c}$ the $c$-th channel for the $l$-th layer feature maps of the $m$-th modality (RGB or depth in our setting); then the transformation of the BN layer can be written as

$$\mathbf{x}'_{m,l,c} = \gamma_{m,l,c} \, \frac{\mathbf{x}_{m,l,c} - \mu_{m,l,c}}{\sqrt{\sigma_{m,l,c}^{2} + \epsilon}} + \beta_{m,l,c}, \tag{9}$$

where the scaling and shift factors $\gamma_{m,l,c}$ and $\beta_{m,l,c}$ are adjustable affine parameters, the normalization statistics $\mu_{m,l,c}$ and $\sigma_{m,l,c}$ are updated with momentum, and $\epsilon$ is a small constant for numerical stability.

Thus, by adapting the scaling and shift factors $\gamma$, $\beta$ and updating the statistics $\mu$, $\sigma$ of the BN layer, the proposed method does not introduce new parameters. Given the pre-trained model, our method is independent of the large-scale training data, i.e. if the model can be run with testing data, it can be adapted. Besides, the fully test-time adaptation re-calibrates channel-wise responses of RGB and depth feature maps, thus providing better-weighted fused RGB-D embeddings for unseen object instance segmentation. Finally, the overall objective for test-time adaptation can be formulated as

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{NEO}} + \lambda_2 \mathcal{L}_{\mathrm{CKD}}, \tag{10}$$

where the two terms $\mathcal{L}_{\mathrm{NEO}}$ and $\mathcal{L}_{\mathrm{CKD}}$ are weighted by the balancing parameters $\lambda_1$ and $\lambda_2$.
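To make the role of the BN parameters concrete, here is a small single-channel sketch in plain Python. It assumes that the running statistics are first moved toward the statistics of the current test input with a momentum factor and are then used for normalization, and that the biased variance is used; the paper does not spell out these details, so they are assumptions, and the class name is hypothetical.

```python
import math


class ChannelBatchNorm:
    """Single-channel batch normalization with adaptable statistics."""

    def __init__(self, gamma=1.0, beta=0.0, running_mean=0.0,
                 running_var=1.0, momentum=0.5, eps=1e-5):
        self.gamma = gamma              # scaling factor (adapted by gradient steps)
        self.beta = beta                # shift factor (adapted by gradient steps)
        self.running_mean = running_mean
        self.running_var = running_var
        self.momentum = momentum
        self.eps = eps

    def forward(self, values):
        """Update the running statistics, then normalize and apply the affine map."""
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        m = self.momentum
        self.running_mean = (1.0 - m) * self.running_mean + m * mean
        self.running_var = (1.0 - m) * self.running_var + m * var
        scale = self.gamma / math.sqrt(self.running_var + self.eps)
        return [scale * (v - self.running_mean) + self.beta for v in values]


# Statistics pre-trained on "synthetic" data drift toward the test data.
bn = ChannelBatchNorm(running_mean=0.0, running_var=1.0, momentum=0.5)
out = bn.forward([2.0, 4.0])
```

In a real network, only `gamma` and `beta` of each BN layer would receive gradient updates from the overall objective, while the convolutional weights stay frozen.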

Method            Input  | OCID [31]                                   | OSD [28]
                         | Overlap           | Boundary                | Overlap           | Boundary
                         | P     R     F     | P     R     F     F@.75 | P     R     F     | P     R     F     F@.75
Mask RCNN [8]     RGB-D  | 80.8  73.9  76.1  | 68.2  58.4  61.8  -     | 74.4  72.7  73.4  | 53.1  48.1  49.8  -
UOIS-Net-2D [41]  Depth  | 88.3  78.9  81.7  | 82.0  65.9  71.4  69.1  | 80.7  80.5  79.9  | 66.0  67.1  65.6  71.9
UOIS-Net-3D [42]  Depth  | 86.5  86.6  86.4  | 80.0  73.4  76.2  77.2  | 85.7  82.5  83.3  | 75.7  68.9  71.2  73.8
UCN [39]          RGB-D  | 86.0  92.3  88.5  | 80.4  78.3  78.8  82.2  | 84.3  88.3  86.2  | 67.5  67.5  67.1  79.3
UCN+ [39]         RGB-D  | 91.6  92.5  91.6  | 86.5  87.1  86.1  89.3  | 87.4  87.4  87.4  | 69.1  70.8  69.4  83.2
UOAIS-Net [1]     RGB-D  | 70.7  86.7  71.9  | 68.2  78.5  68.8  78.7  | 85.3  85.4  85.2  | 72.7  74.3  73.1  79.1
FTEA (Ours)       RGB-D  | 86.2  93.9  89.5  | 79.5  79.5  79.1  85.1  | 85.8  92.0  88.6  | 69.2  75.7  71.7  87.3
FTEA+ (Ours)      RGB-D  | 92.0  93.3  92.3  | 86.5  88.0  86.7  91.1  | 89.9  89.4  89.5  | 72.6  76.0  73.8  88.3

TABLE I: The unseen object instance segmentation (UOIS) performances of the proposed FTEA and other state-of-the-art methods on OCID [31] and OSD [28] datasets. In spite of different types of input modalities, we show the best performance for each method. “+” denotes the zoom-in operation [39] to refine segmentation results.

TTA  NEO  CKD  | OCID [31]                                   | OSD [28]
               | Overlap           | Boundary                | Overlap           | Boundary
               | P     R     F     | P     R     F     F@.75 | P     R     F     | P     R     F     F@.75
-    -    -    | 86.0  92.3  88.5  | 80.4  78.3  78.8  82.2  | 84.3  88.3  86.2  | 67.5  67.5  67.1  79.3
✓    ✓    -    | 85.7  94.0  89.3  | 78.0  78.9  78.0  84.9  | 85.0  91.9  88.2  | 67.6  74.6  70.4  89.8
✓    -    ✓    | 86.2  92.8  88.9  | 80.5  78.9  79.2  83.1  | 85.3  89.2  87.1  | 70.1  70.2  69.8  81.9
✓    ✓    ✓    | 86.2  93.9  89.5  | 79.5  79.5  79.1  85.1  | 85.8  92.0  88.6  | 69.2  75.7  71.7  87.3

TABLE II: Ablation studies of the proposed FTEA. Note that test-time adaptation (TTA) is in conjunction with at least one objective function, i.e. $\mathcal{L}_{\mathrm{NEO}}$ or $\mathcal{L}_{\mathrm{CKD}}$.

IV Experiments

IV-A Datasets and Evaluation Metrics

IV-A1 Datasets

The model is pre-trained on the synthetic Tabletop Object Dataset (TOD) [41], which consists of 40k synthetic scenes of cluttered objects in tabletop environments. The proposed method is evaluated and compared on two real-world benchmarks, OSD [28] and OCID [31]. OSD consists of 111 images of tabletop environments with an average of 3.3 objects per image. OCID has 2,346 images of both tabletop and floor scenes with an average of 7.5 objects per image. It is worth noting that OSD has manually labeled segmentation masks while OCID uses semi-automatically generated annotations, which are easily influenced by the noise of depth images.

IV-A2 Evaluation metrics

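Following previous works [41, 42, 39], we use the precision/recall/F-measure (Overlap P/R/F) metrics to evaluate the object segmentation performance. These metrics are first computed between all pairs of predicted masks and ground truth masks. Then the Hungarian method is used to match the predictions and ground truth. Given the matching, the final P/R/F is calculated by

$$P = \frac{\sum_i |\mathcal{S}_i \cap g(\mathcal{S}_i)|}{\sum_i |\mathcal{S}_i|}, \quad R = \frac{\sum_i |\mathcal{S}_i \cap g(\mathcal{S}_i)|}{\sum_j |\mathcal{G}_j|}, \quad F = \frac{2PR}{P + R}, \tag{11}$$

where $\mathcal{S}_i$ and $\mathcal{G}_j$ respectively denote the set of pixels belonging to the predicted object $i$ and the ground truth object $j$, and $g(\mathcal{S}_i)$ is the set of pixels of the matched ground truth object of $\mathcal{S}_i$.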
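The overlap metrics can be sketched as below, assuming the matching between predicted and ground truth objects has already been computed (for example with the Hungarian method). Masks are represented as sets of pixel coordinates; the function name and toy masks are hypothetical.

```python
def overlap_prf(predicted, ground_truth, matching):
    """Overlap precision, recall and F-measure, scaled to [0, 100].

    predicted, ground_truth: dicts mapping an object id to a set of pixels.
    matching: dict mapping a predicted id to its matched ground truth id
              (assumed to be given; unmatched predictions are simply absent).
    """
    intersection = sum(
        len(predicted[i] & ground_truth[j]) for i, j in matching.items()
    )
    predicted_total = sum(len(mask) for mask in predicted.values())
    truth_total = sum(len(mask) for mask in ground_truth.values())
    precision = intersection / predicted_total if predicted_total else 0.0
    recall = intersection / truth_total if truth_total else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    return 100.0 * precision, 100.0 * recall, 100.0 * f_measure


# Toy example with two predicted and two ground truth objects.
pred = {0: {(0, 0), (0, 1), (0, 2)}, 1: {(1, 0), (1, 1)}}
truth = {"a": {(0, 0), (0, 1)}, "b": {(1, 0), (1, 1), (1, 2)}}
p, r, f = overlap_prf(pred, truth, {0: "a", 1: "b"})
```

In this toy case four of the five predicted pixels and four of the five ground truth pixels overlap, so precision, recall, and F-measure all come out at about 80.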

To take object boundaries into account, as introduced in [41], we also use the Boundary P/R/F to evaluate how well the predicted boundary matches the ground truth boundary, where the true positives are counted by the pixel overlap of the two boundaries. Moreover, F@.75 denotes the percentage of segmented objects with Overlap F-measure $\geq 0.75$ [39]. All P/R/F and F@.75 measures are reported in the range of $[0, 100]$.
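As a small illustration of F@.75, the sketch below counts the share of objects whose per-object Overlap F-measure reaches the threshold. The inclusive comparison and the function name are assumptions made for this example.

```python
def f_at_threshold(per_object_f, threshold=0.75):
    """Percentage of objects whose F-measure (in [0, 1]) reaches the threshold."""
    if not per_object_f:
        return 0.0
    hits = sum(1 for value in per_object_f if value >= threshold)
    return 100.0 * hits / len(per_object_f)


# Three of the four objects reach the 0.75 threshold.
score = f_at_threshold([0.90, 0.80, 0.50, 0.75])
```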

IV-B Implementation Details

For fair comparisons, we follow previous work [39] to use a 34-layer, stride-8 ResNet (ResNet34-8s) as the backbone, and the full-resolution feature map of embeddings is obtained by bilinear upsampling. We select the top 2 nearest clusters in Equation 6 for the calculation of the non-parametric entropy objective $\mathcal{L}_{\mathrm{NEO}}$. The temperature factor $T$ in cross-modality knowledge distillation is 1. The weight factor for the overall loss is set as . The momentum of BN layers is set to 0.5. During test time, our model is adapted with the SGD optimizer. We use a batch size of 1 as in the typical inference phase. For the first 100 iterations, the learning rate is linearly warmed up to the base value, then decayed with a cosine scheduler for another 400 iterations. We use the same learning rate schedule for the OSD and OCID datasets. We do not shuffle the dataset in test-time adaptation unless otherwise stated. All experiments are conducted on a single NVIDIA 2080Ti GPU with PyTorch.
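The warm-up and cosine decay schedule can be sketched as a plain function of the iteration index. The base learning rate is left as a parameter because its value is not given here; the example uses 1.0 purely as a relative scale, and the exact warm-up formula is an assumption.

```python
import math


def learning_rate(step, base_lr, warmup_steps=100, decay_steps=400):
    """Linear warm-up to base_lr, followed by cosine decay to zero.

    step: zero-based iteration index.
    base_lr: peak learning rate (placeholder; not specified in the text).
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Relative schedule over the 500 adaptation iterations.
schedule = [learning_rate(step, base_lr=1.0) for step in range(500)]
```

With these defaults the rate peaks at iteration 100 and has decayed to half of the base value by iteration 300.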

IV-C Comparison to State-of-the-art Methods

In this section, we compare the proposed FTEA with several state-of-the-art methods. FTEA+ adopts the zoom-in refinement strategy as in [39]. As shown in Table I, our method achieves the best Overlap F-measure, Boundary F-measure, and F@.75 among all competitors on both OCID and OSD datasets. Specifically, FTEA+ achieves 92.3 and 89.5 Overlap F-measure, 86.7 and 73.8 Boundary F-measure, and 91.1 and 88.3 F@.75 on OCID and OSD, respectively, which validates the effectiveness of the proposed approach. Additionally, when we use the end-to-end model without extra zoom-in refinement, i.e. FTEA in the next-to-last row of Table I, the improvement of our method is consistent. Compared to UCN [39], FTEA improves the Overlap F-measure by 1.0 and 2.4, the Boundary F-measure by 0.3 and 4.6, and the F@.75 by 2.9 and 8.0 on OCID and OSD, respectively.

Adaptation consumption. Besides the performance, we further investigate the computation cost of the proposed method. Table III shows the average time consumption of the inference and adaptation processes. The adaptation step is much faster than inference, i.e. on average 0.04s vs. 1.22s per iteration. With the setting of 500 iterations stated in Section IV-B, the total time consumption of the adaptation is 500 × 0.04s = 20s on a single NVIDIA 2080Ti GPU, which is efficient for practical applications.

# Iters for avg   BatchSize   Inference time   Adaptation time
500               1           1.22s            0.04s

TABLE III: The time consumption of the inference and adaptation process, which is averaged over 500 iterations. Results are evaluated on the same single NVIDIA 2080Ti GPU.
Fig. 3: Qualitative results of the proposed approach, (a) results on the OCID dataset, and (b) results on the OSD dataset.

IV-D Ablation Studies

IV-D1 Non-parametric entropy objective

This objective is crucial for UOIS to conduct test-time adaptation. As shown in Table II, when equipped with the non-parametric entropy objective $\mathcal{L}_{\mathrm{NEO}}$, the model’s performance on the two real-world datasets is improved overall. In particular, $\mathcal{L}_{\mathrm{NEO}}$ significantly improves the overlap and boundary recall, which is desirable for discovering potential objects in unseen environments.

IV-D2 Cross-modality knowledge distillation

The proposed cross-modality knowledge distillation provides an effective learning objective for test-time adaptation. Table II shows that $\mathcal{L}_{\mathrm{CKD}}$ works well on improving the overlap and boundary precision of segmentation results. Additionally, as shown in the last row of Table II, the cross-modality knowledge distillation can be effectively combined with the former unsupervised entropy objective $\mathcal{L}_{\mathrm{NEO}}$, thus further improving the performance on most evaluation metrics for unseen object instance segmentation.

IV-D3 Test-time adaptation

Table II shows the performance with and without test-time adaptation (TTA), which is a key part of our method. TTA needs to be enabled with at least one proposed learning objective. In Table II we can observe that whether TTA is used in conjunction with $\mathcal{L}_{\mathrm{NEO}}$ or $\mathcal{L}_{\mathrm{CKD}}$, it can effectively improve the segmentation results on both OCID and OSD.

IV-E Discussions

IV-E1 Qualitative Results

We visualize some qualitative results with and without the proposed FTEA in Figure 3. We can observe in Figure 3(a) that the proposed adaptation process mitigates the under-segmentation problem of two close objects. Besides, different from the smooth depth images in synthetic training data, realistic depth images are generally noisy, especially on the object’s boundary. This problem makes outputs of boundaries blurred and causes over-segmentation around the boundary, as illustrated in Figure 3(b). The last row in Figure 3(b) shows that, with the proposed adaptation process in FTEA, this problem can be largely alleviated.

# nearest clusters   1      2      3      all
Overlap F            89.5   89.4   89.3   88.9
Boundary F           73.8   73.6   73.2   72.8
F@.75                88.3   88.1   88.0   87.6

Temperature T        0.8    1      1.2    1.5
Overlap F            88.9   89.5   89.5   89.3
Boundary F           72.5   73.8   73.5   73.0
F@.75                87.8   88.3   87.2   87.5

TABLE IV: Segmentation results with varying hyperparameters on OSD.

IV-E2 Hyperparameters

We study the impact of varying two hyperparameters: (1) the number of nearest clusters in $\mathcal{L}_{\mathrm{NEO}}$ and (2) the temperature factor $T$ in $\mathcal{L}_{\mathrm{CKD}}$. As shown in Table IV, using a small number of nearest clusters achieves the best performance since it avoids introducing noise from outliers. In Table IV, we also observe that the model achieves the best results when we set $T = 1$, which means that directly matching the distributions of the teacher and student without extra sharpening or softening is effective. Overall, when varying the hyperparameters, the proposed method exhibits stable performance that is better than the baseline.

Modalities
RGB    Depth    Overlap F    Boundary F    F@.75
                87.4         69.4          83.2
                87.6         70.0          86.1
                88.6         70.9          87.4
                89.5         73.8          88.3
TABLE V: Segmentation performances with different adaptation modalities on OSD.
Fig. 4: The evolution of performances on OSD when we gradually increase the number of BN layers for adaptation.
Fig. 5: 20 Repeated runs of different sampling orders.
Fig. 6: Visualizations of segmentation failure cases from our method.

IV-E3 Modalities for adaptation

Table V shows results with different adaptation modalities on OSD. First, segmentation performance improves whether the RGB or the depth modality is tuned. Second, by tuning the RGB and depth modalities simultaneously, we obtain better-fused RGB-D embeddings. The performance therefore becomes significantly better (see the last row in Table V) thanks to the effective use of multimodal data.

IV-E4 How many BN layers to use

For brevity, we split all layers into four main building blocks as in ResNet [9]. Figure 4 illustrates the segmentation performance on OSD when we gradually increase the number of BN layers used for adaptation. The Overlap F-measure, Boundary F-measure, and F@.75 all improve consistently with more BN layers, demonstrating the effectiveness of the proposed method. On the other hand, the improvement has not plateaued, which implies that the adaptation process could be further enhanced by adapting other parameters in addition to the BN layers of the model.
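The block-wise selection of BN parameters can be sketched as a simple filter over parameter names. This is only an illustration under stated assumptions: the names follow a torchvision-style ResNet layout (`layer1` to `layer4`, with BN modules named `bn…`), blocks are enabled from the shallowest to the deepest, and the stem BN is left out. None of these details is specified in the text, and `select_bn_parameters` is a hypothetical helper.

```python
def select_bn_parameters(param_names, num_blocks):
    """Return names of BN affine parameters in the first `num_blocks` blocks.

    Assumes names such as "layer2.0.bn1.weight"; every other parameter
    (for example convolution weights) is treated as frozen.
    """
    if not 0 <= num_blocks <= 4:
        raise ValueError("num_blocks must be between 0 and 4")
    allowed_blocks = {f"layer{i}" for i in range(1, num_blocks + 1)}
    selected = []
    for name in param_names:
        parts = name.split(".")
        in_block = parts[0] in allowed_blocks
        is_bn_affine = (
            any(part.startswith("bn") for part in parts)
            and parts[-1] in ("weight", "bias")
        )
        if in_block and is_bn_affine:
            selected.append(name)
    return selected


# Hypothetical parameter names, for illustration only.
example_names = [
    "conv1.weight",
    "layer1.0.conv1.weight", "layer1.0.bn1.weight", "layer1.0.bn1.bias",
    "layer2.0.conv1.weight", "layer2.0.bn1.weight", "layer2.0.bn1.bias",
    "layer3.0.bn1.weight", "layer3.0.bn1.bias",
    "layer4.0.bn1.weight", "layer4.0.bn1.bias",
]

for num_blocks in range(5):
    print(num_blocks, select_bn_parameters(example_names, num_blocks))
```

Only the selected scale and shift parameters would be passed to the test-time optimizer; all remaining parameters stay fixed.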

IV-E5 Robustness to the sampling order

Additionally, we run our method multiple times with different random sampling orders. Figure 5 illustrates the cumulative means of performance with confidence intervals across 20 repeated runs on OCID and OSD. The overall performance of the model is stable, demonstrating the robustness of the proposed method. Note that the experiments on OCID also use random samples (not just a random sampling order), since we conduct the adaptation with only 500 test-time images out of the 2,346 images in OCID.
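A summary of this kind can be computed as in the minimal Python sketch below: a cumulative mean over the image sequence of each run, followed by a mean and a confidence half-width across runs at every step. The normal-approximation 95% interval (z = 1.96), the function names, and the synthetic scores are assumptions for illustration; the text does not state how the intervals in Figure 5 were obtained.

```python
import math
import random
import statistics


def cumulative_means(scores):
    """Running mean of a sequence of per-image scores."""
    means, running_sum = [], 0.0
    for count, score in enumerate(scores, start=1):
        running_sum += score
        means.append(running_sum / count)
    return means


def summarize_runs(runs, z=1.96):
    """Per-step (mean, half-width) of the cumulative means across runs.

    `runs` holds one list of per-image scores for each repeated run
    (at least two runs are needed for a standard deviation).
    """
    curves = [cumulative_means(run) for run in runs]
    num_steps = min(len(curve) for curve in curves)
    summary = []
    for step in range(num_steps):
        values = [curve[step] for curve in curves]
        mean = statistics.fmean(values)
        half_width = z * statistics.stdev(values) / math.sqrt(len(values))
        summary.append((mean, half_width))
    return summary


# Synthetic scores standing in for 20 runs over 100 images each.
rng = random.Random(0)
runs = [[rng.gauss(0.88, 0.05) for _ in range(100)] for _ in range(20)]
summary = summarize_runs(runs)
print(summary[0], summary[-1])
```

In an actual experiment, each run would supply the scores obtained while adapting on its own sampling order, so the curves differ between runs for reasons beyond reordering.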

IV-E6 Failure cases

Figure 6 illustrates some failure cases of our method. Two close objects with similar depth are not correctly separated, e.g., the fruit on top of a can, or side-by-side boxes or pens. In addition, over-segmentation occurs when one object is truncated by another, as shown in the last three columns of Figure 6. In the future, active vision with robots and amodal perception strategies could be explored to address these failures.

V Conclusions

In this paper, we target the task of unseen object instance segmentation and emphasize the adaptation process for unseen realistic data. To mitigate the domain shift between the synthetic training data and the realistic testing data, we propose a novel FTEA framework that conducts fully test-time adaptation. Specifically, during test time, we fix all convolutional layers and adjust the affine transformations provided by the BN parameters by optimizing two novel unsupervised objectives, i.e., the NEO and the CKD. NEO calculates the entropy of the probability distributions of UOIS in a non-parametric way. CKD further encourages cross-modality knowledge transfer during test time. Extensive experiments on the realistic RGB-D datasets OCID and OSD demonstrate the effectiveness of the proposed approach. In recent years, simulation has played an increasingly important role in robotics, in both manipulation and perception. It typically creates strong AI in the virtual world and applies it in the real world. However, the realistic scenarios for various applications are inexhaustible and even unpredictable. Therefore, how to build an effective test-time adaptation strategy that better leverages the intrinsic information in the inference data is worth exploring in the future.


  • [1] S. Back, J. Lee, T. Kim, S. Noh, R. Kang, S. Bak, and K. Lee (2021) Unseen object amodal instance segmentation via hierarchical occlusion modeling. arXiv preprint arXiv:2109.11103. Cited by: §II-A, TABLE I.
  • [2] W. Chang, T. You, S. Seo, S. Kwak, and B. Han (2019) Domain-specific batch normalization for unsupervised domain adaptation. In CVPR, pp. 7354–7362. Cited by: §II-B.
  • [3] M. Chen, H. Xue, and D. Cai (2019) Domain adaptation for semantic segmentation with maximum squares loss. In ICCV, pp. 2090–2099. Cited by: §II-B.
  • [4] D. Comaniciu and P. Meer (2002) Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5), pp. 603–619. Cited by: §III-B1.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §I.
  • [6] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180–1189. Cited by: §I, §II-B.
  • [7] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §I.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2961–2969. Cited by: TABLE I.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §IV-E4.
  • [10] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In ICCV, pp. 1389–1397. Cited by: §III-D.
  • [11] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In ICML, pp. 1989–1998. Cited by: §II-B.
  • [12] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: §III-D.
  • [13] Y. Huang, C. Du, Z. Xue, X. Chen, H. Zhao, and L. Huang (2021) What makes multi-modal learning better than single (provably). NeurIPS 34. Cited by: §I, §III-C.
  • [14] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. Cited by: §I, §III-A2, §III-D.
  • [15] M. Kim and H. Byun (2020) Learning texture invariant representation for domain adaptation of semantic segmentation. In CVPR, pp. 12975–12984. Cited by: §II-B.
  • [16] M. Klingner, J. Termöhlen, J. Ritterbach, and T. Fingscheidt (2022) Unsupervised batchnorm adaptation (ubna): a domain adaptation method for semantic segmentation without using source domain representations. In WACV, pp. 210–220. Cited by: §II-B.
  • [17] T. Kobayashi and N. Otsu (2010) Von mises-fisher mean shift for clustering on a hypersphere. In ICPR, pp. 2130–2133. Cited by: §III-B1.
  • [18] S. Li, C. H. Liu, B. Xie, L. Su, Z. Ding, and G. Huang (2019) Joint adversarial domain adaptation. In ACM MM, pp. 729–737. Cited by: §II-B.
  • [19] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, pp. 109–117. Cited by: §II-B.
  • [20] Y. Li, L. Yuan, and N. Vasconcelos (2019) Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, pp. 6936–6945. Cited by: §II-B.
  • [21] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: §I.
  • [22] Y. Liu, P. Kothari, B. van Delft, B. Bellot-Gurlet, T. Mordan, and A. Alahi (2021) TTT++: when does self-supervised test-time training fail or thrive?. NeurIPS 34. Cited by: §II-C.
  • [23] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In ICML, pp. 2208–2217. Cited by: §I.
  • [24] J. Lv, K. Liu, and S. He (2021) Differentiated learning for multi-modal domain adaptation. In ACM MM, pp. 1322–1330. Cited by: §II-B.
  • [25] K. Mei, C. Zhu, J. Zou, and S. Zhang (2020) Instance adaptive self-training for unsupervised domain adaptation. In ECCV, pp. 415–430. Cited by: §II-B.
  • [26] C. K. Mummadi, R. Hutmacher, K. Rambach, E. Levinkov, T. Brox, and J. H. Metzen (2021) Test-time adaptation to distribution shift by confidence maximization and input transformation. arXiv preprint arXiv:2106.14999. Cited by: §III-B.
  • [27] F. Qi, X. Yang, and C. Xu (2018) A unified framework for multimodal domain adaptation. In ACM MM, pp. 429–437. Cited by: §II-B.
  • [28] A. Richtsfeld, T. Mörwald, J. Prankl, M. Zillich, and M. Vincze (2012) Segmentation of unknown objects in indoor environments. In IROS, pp. 4791–4796. Cited by: §I, TABLE I, TABLE II, §IV-A1.
  • [29] C. E. Shannon (1948) A mathematical theory of communication. The Bell System Technical Journal 27 (3), pp. 379–423. Cited by: §III-B3.
  • [30] W. Shao, S. Tang, X. Pan, P. Tan, X. Wang, and P. Luo (2020) Channel equilibrium networks for learning deep representation. In ICML, pp. 8645–8654. Cited by: §III-D.
  • [31] M. Suchi, T. Patten, D. Fischinger, and M. Vincze (2019) EasyLabel: a semi-automatic pixel-wise object annotation tool for creating robotic rgb-d datasets. In ICRA, pp. 6678–6684. Cited by: §I, TABLE I, TABLE II, §IV-A1.
  • [32] Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt (2020) Test-time training with self-supervision for generalization under distribution shifts. In ICML, pp. 9229–9248. Cited by: §II-C.
  • [33] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, pp. 2517–2526. Cited by: §II-B.
  • [34] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell (2020) Tent: fully test-time adaptation by entropy minimization. In ICLR. Cited by: §I, §II-C, §III-B3, §III-B, §III-D.
  • [35] Y. Wang, W. Huang, F. Sun, T. Xu, Y. Rong, and J. Huang (2020) Deep multimodal fusion by channel exchanging. NeurIPS 33, pp. 4835–4845. Cited by: §III-D.
  • [36] Z. Wang, M. Yu, Y. Wei, R. Feris, J. Xiong, W. Hwu, T. S. Huang, and H. Shi (2020) Differential treatment for stuff and things: a simple unsupervised domain adaptation method for semantic segmentation. In CVPR, pp. 12635–12644. Cited by: §II-B.
  • [37] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) Cbam: convolutional block attention module. In ECCV, pp. 3–19. Cited by: §III-D.
  • [38] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pp. 3733–3742. Cited by: §III-B2.
  • [39] Y. Xiang, C. Xie, A. Mousavian, and D. Fox (2020) Learning rgb-d feature embeddings for unseen object instance segmentation. In CoRL. Cited by: Fig. 1, §I, §I, Fig. 2, §II-A, §III-A1, §III-B1, TABLE I, §IV-A2, §IV-A2, §IV-B, §IV-C.
  • [40] C. Xie, A. Mousavian, Y. Xiang, and D. Fox (2021) Rice: refining instance masks in cluttered environments with graph neural networks. In CoRL. Cited by: Fig. 1, §I, §I, §II-A.
  • [41] C. Xie, Y. Xiang, A. Mousavian, and D. Fox (2019) The best of both modes: separately leveraging rgb and depth for unseen object instance segmentation. In CoRL. Cited by: Fig. 1, §I, §I, §II-A, TABLE I, §IV-A1, §IV-A2, §IV-A2.
  • [42] C. Xie, Y. Xiang, A. Mousavian, and D. Fox (2021) Unseen object instance segmentation for robotic environments. IEEE Transactions on Robotics. Cited by: Fig. 1, §I, §I, §II-A, TABLE I, §IV-A2.
  • [43] C. Yang, Z. Wu, B. Zhou, and S. Lin (2021) Instance localization for self-supervised detection pretraining. In CVPR, pp. 3987–3996. Cited by: §III-B2.
  • [44] Y. Yang and S. Soatto (2020) Fda: fourier domain adaptation for semantic segmentation. In CVPR, pp. 4085–4095. Cited by: §II-B.
  • [45] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In ECCV, pp. 818–833. Cited by: §III-D.
  • [46] L. Zhang, Z. Liu, S. Zhang, X. Yang, H. Qiao, K. Huang, and A. Hussain (2019) Cross-modality interactive attention network for multispectral pedestrian detection. Information Fusion 50, pp. 20–29. Cited by: §III-D.
  • [47] P. Zhang, B. Zhang, T. Zhang, D. Chen, Y. Wang, and F. Wen (2021) Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In CVPR, pp. 12414–12424. Cited by: §II-B.
  • [48] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2223–2232. Cited by: §II-B.
  • [49] Z. Zhuang, L. Wei, L. Xie, T. Zhang, H. Zhang, H. Wu, H. Ai, and Q. Tian (2020) Rethinking the distribution gap of person re-identification with camera-based batch normalization. In ECCV, pp. 140–157. Cited by: §II-B.