DIRL: Domain-Invariant Representation Learning for Sim-to-Real Transfer

11/15/2020 · Ajay Kumar Tanwani, et al. · UC Berkeley

Generating large-scale synthetic data in simulation is a feasible alternative to collecting/labelling real data for training vision-based deep learning models, albeit the modeling inaccuracies do not generalize to the physical world. In this paper, we present a domain-invariant representation learning (DIRL) algorithm to adapt deep models to the physical environment with a small amount of real data. Existing approaches that only mitigate the covariate shift by aligning the marginal distributions across the domains, assuming the conditional distributions to be domain-invariant, can lead to ambiguous transfer in real scenarios. We propose to jointly align the marginal (input domains) and the conditional (output labels) distributions to mitigate the covariate and the conditional shift across the domains with adversarial learning, and combine it with a triplet distribution loss to make the conditional distributions disjoint in the shared feature space. Experiments on digit domains yield state-of-the-art performance on challenging benchmarks, while sim-to-real transfer of object recognition for vision-based decluttering with a mobile robot improves from 26.8% to 91.0% on a wide variety of objects. Code and supplementary details are available at https://sites.google.com/view/dirl


1 Introduction

Data-driven deep learning models tend to perform well when plenty of labelled training data is available and the testing data is drawn from the same distribution as the training data. Collecting and labelling large-scale domain-specific training data for robotics applications, however, is time-consuming and cumbersome [Levine_armfarm_18]. Additionally, the sample selection bias in data collection limits the model's use to very specific environmental situations. Training deep models in simulation for robot manipulation is becoming a popular alternative [Tobin_sim2real_17, Peng_sim2real_17, openai_2019]. Despite the efforts to build good dynamic models and high-fidelity simulators for scalable data collection, the modeling inaccuracies make it difficult to transfer the desired behaviour to real robot environments.

This paper investigates the problem of sample-efficient domain adaptation to learn a deep model that can be transferred to a new domain [Wang_da_survey_18]. We analyze the situation where we have a lot of labeled training data from the simulator or the source distribution, and we are interested in adapting the model with limited or no labelled training data drawn from the target distribution. Existing popular approaches to domain adaptation learn common feature transformations by aligning marginal distributions across domains in an unsupervised manner [Ganin_DANN_16, Tzeng_adda_17, Hoffman_cycada_17, Saito_MCDDA_18]. They implicitly assume that the class conditional distributions of the transformed features are also similar across domains, i.e., if the model performs well on the source domain and the features overlap across domains, the model will perform well on the target domain. This, however, creates ambiguity in class alignment across the source and target domains in the shared feature space that can result in negative transfer to a new domain, as shown in Fig. 1 (middle). More intuitively, the feature space can mix apples of the simulator with oranges of the target domain and yet have a perfect alignment of marginal distributions.

To this end, we present a domain-invariant representation learning (DIRL) algorithm that resolves the ambiguity in domain adaptation by leveraging upon a few labeled examples of the target domain to align both the marginal and the conditional distributions in the shared feature space with adversarial learning, while increasing the inter-class variance and reducing the intra-class variance of the shared feature space with a triplet distribution loss. We apply the approach to vision-based robot decluttering, where a mobile robot picks objects from a cluttered floor and sorts them into respective bins [Gupta15, Gupta18, Tanwani_2019]. We seek to train the deep models with simulated images of object meshes and use a small number of labeled real-world images to mitigate the domain shift between the image and the object distributions of the simulation and the real environments.

Figure 1: Domain-invariant representation learning on 2D synthetically generated data with two classes (see App. A.1 for details): (top) conventional supervised learning on labeled source data does not generalize to the target domain drawn from a different distribution; (middle) aligning the marginal distributions only across domains can lead to negative transfer: (a) cross-label match with swapped labels across the decision boundary for the same target distribution, (b) label-shift in the decision boundary with uneven mixing of class boundaries under class imbalance across domains; (bottom) DIRL leverages upon a few labeled target examples to semantically align both the marginal and the conditional distributions with domain and policy alignment. Note that the density plots at the top and the side of each panel indicate the marginal domain distributions, while the contour plots in the middle represent the class conditional distributions.

This paper makes three contributions: 1) we illustrate the limitations of cross-label mismatch and/or label shift with unsupervised domain adaptation approaches that align marginal distributions only in a shared feature space, 2) we propose a domain-invariant representation learning (DIRL) algorithm to mitigate the covariate and conditional shift by aligning the marginal and the conditional distributions across domains, and making them disjoint with a novel triplet distribution loss in a shared feature space, and 3) we present experiments on MNIST benchmark domains and sim-to-real transfer of a single-shot object recognition model for mobile vision-based decluttering, suggesting performance improvements over state-of-the-art domain adaptation methods.

2 Related Work

Domain Adaptation with Adversarial Learning: A popular class of domain adaptation methods learns invariant transferable representations by matching feature and/or pixel distributions with similarity metrics like maximum mean discrepancy [Long_DA_16], associative adaptation [Hausser_ADA_17], reconstruction from latent features [Bousmalis_DSN_16, Ghifary_reconstructionDA_16], and/or adversarial learning [Ganin_DANN_16, Bousmalis_PixelDA_16, Tzeng_adda_17, Hoffman_cycada_17, Zhu_cyclegan_17]. A common trend among these prior methods is to align the marginal distributions across the domains. This, however, does not guarantee that samples from the same class across the domains are mapped nearby in the feature space (see Fig. 1 for an illustrative example). The lack of semantic alignment is a major limitation when the feature representation has conditional and/or label shift across the domains.

Label and Conditional Domain Adaptation: An underexplored line of work addresses matching the conditional or the label distributions in the feature space. Prior work in this direction makes use of linear projections in the shared feature space [Long13, gong_icml_16], estimates importance weights for label shift [Lipton_LS_18, Azizzadenesheli_19], uses domain discriminators with class-specific adaptation [Hoffman_fcns_16, Chen_cdan_17], combines class predictions with shared features as input to the discriminator [Long_cada_18], maximizes the conditional discrepancy with two classifiers [Saito_MCDDA_18], augments domain adversarial training with conditional entropy regularization [shu_dirtt_18], bounds the density ratio with asymmetrical distribution alignment [Wu_19], and aligns the weighted source and target distributions with importance weights under generalized label shift [Combes_2020].

Learning class conditional distributions of the unlabeled target domain requires pseudo-labels to encourage a low-density separation between classes [Kang_cvpr_19]. Tuning the thresholds for reliable prediction of pseudo-labels on unlabeled target data using only the source domain can be cumbersome and domain-specific. Semi-supervised methods provide a practical alternative when target data is scarce (for example, when acquired via kinesthetic or teleoperation interfaces). Prior examples of semi-supervised domain adaptation include coupling of domain and class input to the discriminator [Motiian_fada_17] and minimax conditional entropy optimization [Saito_ssda_19].

In comparison to these approaches, the proposed domain-invariant representation learning approach provisions for both marginal and conditional alignment of domain distributions, while encouraging the feature representation of each class to be disjoint with a triplet distribution loss. We leverage upon a few target examples, which stabilizes adaptation to challenging domains with large domain shift.

Sim-to-Real Transfer: Perception and control policies learned in simulation often do not generalize to real robots due to the modeling inaccuracies. Domain randomization methods [Tobin_sim2real_17, Peng_sim2real_17, Chebotar_simopt_18, James_simlearn_18, Seita_2020] treat the discrepancy between the domains as variability in the simulation parameters, assuming the real-world distribution to be one such randomized instance of the simulation environment. In contrast, domain adaptation methods learn an invariant mapping function for matching distributions between the simulator and the robot environment. Related examples to the work presented here include using synthetic objects to learn a vision-based grasping model [Saxena_grasping_08, mahler_dexnet2_17], sim-to-real transfer of visuomotor policies for goal-directed reaching movements by adversarial learning [zhang_adda_grasp_19], adapting dynamics in reinforcement learning [Eysenbach_2020], and adapting object recognition models to new domains [Saenko_daor_10, Chen_darcnn_18, zhu_cvpr_19]. This paper focuses on a sample-efficient approach to learn invariant features across domains while reducing the cost of labeling real data. We show state-of-the-art performance on MNIST benchmarks and sim-to-real transfer of a single-shot object recognition model used in vision-based surface decluttering with a mobile robot.

3 Problem Statement

We consider two environments: one belonging to the simulator or source domain, comprising the dataset $\mathcal{D}_s = \{(\bm{x}_i^s, \bm{y}_i^s)\}_{i=1}^{n_s}$, and the other belonging to the real or target domain, $\mathcal{D}_t = \{\bm{x}_i^t\}_{i=1}^{n_t}$, with a few labeled samples $\{(\bm{x}_i^t, \bm{y}_i^t)\}_{i=1}^{n_{tl}}$, $n_{tl} \ll n_s$. The samples are drawn from $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space, $\mathcal{Y}$ is the output space, and the superscripts $s$ and $t$ indicate draws from the two different distributions of the source and target random variables $(X_s, Y_s)$ and $(X_t, Y_t)$, respectively. Each output labeling function is represented as $f: \mathcal{X} \rightarrow \mathcal{Y}$, with $f_s$ and $f_t$ as the output policies of the source and the target domain. Let $\mathcal{Z}$ denote the intermediate representation space induced from $\mathcal{X}$ by a feature transformation $g: \mathcal{X} \rightarrow \mathcal{Z}$, which is mapped to the output conditional distribution by $h: \mathcal{Z} \rightarrow \mathcal{Y}$ under the composite transformation $f = h \circ g$. The composite function $f$ defines the policy of a domain, and the loss of a candidate policy $f$ represents its disagreement with respect to the source policy $f_s$ on domain $\mathcal{D}_s$ as $\varepsilon_s(f, f_s) = \mathbb{E}_{\bm{x} \sim \mathcal{D}_s}\big[\ell\big(f(\bm{x}), f_s(\bm{x})\big)\big]$. For example, the loss function for classification on the target domain is represented by $\varepsilon_t(f, f_t) = \mathbb{E}_{\bm{x} \sim \mathcal{D}_t}\big[\,|f(\bm{x}) - f_t(\bm{x})|\,\big]$, while for regression it is $\varepsilon_t(f, f_t) = \mathbb{E}_{\bm{x} \sim \mathcal{D}_t}\big[(f(\bm{x}) - f_t(\bm{x}))^2\big]$.

The goal is to learn the policy $f = h \circ g$ using the labeled source examples and a few or no labeled target examples such that the error on the target domain is low. We seek to minimize the joint discrepancy across the domains in a shared feature space and map the output policy on the shared features to minimize the target error. Denoting the joint distributions of input domains and output labels across the source and the target domain as $P_s(X, Y)$ and $P_t(X, Y)$ respectively, the DIRL objective is to minimize

$$\min_{g,\, h} \;\; \varepsilon_s(h \circ g) \;+\; d\big(P_s(g(X), Y),\, P_t(g(X), Y)\big), \tag{1}$$

where $d(\cdot, \cdot)$ is a probability distance measure. The joint discrepancy consequently depends on both the marginal discrepancy $d\big(P_s(g(X)), P_t(g(X))\big)$ and the conditional discrepancy $d\big(P_s(Y \mid g(X)), P_t(Y \mid g(X))\big)$ between the source and the target domains, and is minimized by aligning both the marginal and the conditional distributions.

Definition 3.1 (Marginal Distributions Alignment): Given two domains $\mathcal{D}_s$ and $\mathcal{D}_t$ drawn from two different distributions with non-zero covariate shift $P_s(X) \neq P_t(X)$, marginal alignment corresponds to finding the feature transformation $g$ such that the discrepancy between the transformed marginal distributions is minimized, i.e., $d\big(P_s(g(X)),\, P_t(g(X))\big) \rightarrow 0$.

Definition 3.2 (Conditional Distributions Alignment): Given two domains $\mathcal{D}_s$ and $\mathcal{D}_t$ drawn from random variables $(X_s, Y_s)$ and $(X_t, Y_t)$ with different output conditional probability distributions $P_s(Y \mid X) \neq P_t(Y \mid X)$, conditional alignment corresponds to finding the transformation $g$ such that the discrepancy between the transformed conditional distributions is minimized, i.e., $d\big(P_s(Y \mid g(X)),\, P_t(Y \mid g(X))\big) \rightarrow 0$.

Note that the methods aligning marginal distributions only implicitly assume the same output conditional distribution across the domains for the adaptation to be effective. This assumption of different marginal distributions across domains but similar conditional distributions is known as covariate shift [Candela_DS_09]. Several methods attempt to find an invariant transformation to mitigate the covariate shift such that $P(g(X))$ is similar across the domains by minimizing a domain discrepancy measure [Ganin_DANN_16, Tzeng_adda_17, Saito_MCDDA_18]. However, it is not clear whether $P(Y \mid g(X))$ also remains the same across the domains after the transformation. As the target labels may not be available in unsupervised domain adaptation, the class conditional distributions are naively assumed to be invariant under the transformation $g$. In this paper, we consider the joint marginal and conditional distributions alignment problem across the domains. This is a non-trivial problem since we have access to only a few labels in the target domain; we therefore consider the problem in a semi-supervised setting to resolve the ambiguity in conditional distributions alignment with a few labeled target examples.
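To make Definition 3.1 concrete, the sketch below estimates the marginal discrepancy between transformed source and target features with maximum mean discrepancy [Long_DA_16], one instance of a probability distance measure $d$; DIRL itself uses an adversarial estimate (Section 5), and the kernel bandwidth here is an illustrative choice.

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel values between two batches of features.
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def marginal_mmd(z_s, z_t, sigma=1.0):
    # Empirical MMD^2 between source features z_s = g(x_s) and target
    # features z_t = g(x_t): small values indicate that the transformed
    # marginals P_s(g(X)) and P_t(g(X)) are aligned.
    return (rbf_kernel(z_s, z_s, sigma).mean()
            + rbf_kernel(z_t, z_t, sigma).mean()
            - 2 * rbf_kernel(z_s, z_t, sigma).mean())
```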

4 Theoretical Insights and Limitations

Theoretical insights for a family of domain adaptation algorithms are based on the hypothesis that a provably low target error can be obtained by minimizing the marginal discrepancy between the domains, empirically measured via the disagreement between classifiers [Ben-David_datheory_10, Zhao_DA_19, Johansson_DA_19, zhang_uda_2019, Li_2020].

Theorem 4.1 (Ben-David et al. [Ben-David_datheory_10]). Given two domains $\mathcal{D}_s$ and $\mathcal{D}_t$, the error of a hypothesis $h$ in the target domain $\varepsilon_t(h)$ is bounded by the sum of: 1) the error of the hypothesis in the source domain $\varepsilon_s(h)$, 2) the marginal discrepancy of the hypothesis class between the domains $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t)$, and 3) the best-in-class joint hypothesis error $\lambda^* = \min_{h \in \mathcal{H}} \big(\varepsilon_s(h) + \varepsilon_t(h)\big)$,

$$\varepsilon_t(h) \;\leq\; \varepsilon_s(h) \;+\; d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) \;+\; \lambda^*. \tag{2}$$

The divergence $d_{\mathcal{H}\Delta\mathcal{H}}$ can be empirically measured by training a classifier that discriminates between source and target instances, and subsequently minimized by aligning the marginal distributions between (unlabeled) source and target instances. The joint hypothesis error $\lambda^*$ is widely assumed to be small, i.e., there exists a policy that performs well on the induced feature distributions of the source and the target examples after marginal alignment. More generally, the optimal joint error represents the cross-domain performance of the optimal policies $f_s$ and $f_t$. Hence, the upper bound on the target error can more appropriately be represented as

$$\varepsilon_t(h) \;\leq\; \varepsilon_s(h) \;+\; d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) \;+\; \min\big\{\mathbb{E}_{\mathcal{D}_s}\big[\ell(f_s, f_t)\big],\; \mathbb{E}_{\mathcal{D}_t}\big[\ell(f_s, f_t)\big]\big\}. \tag{3}$$

In Fig. 1 (middle), showing conventional domain adaptation by aligning marginal distributions, the cross-overlapping area between class categories in the shared feature space (blue and orange) represents the optimal joint error given by $\mathbb{E}_{\mathcal{D}_s}[\ell(f_s, f_t)]$ or $\mathbb{E}_{\mathcal{D}_t}[\ell(f_s, f_t)]$. A high joint error signifies that the conventional marginal alignment approaches fail in the presence of conditional or label shift between the domains (see also [Zhao_DA_19, zhang_uda_2019]). Consequently, we highlight two main limitations of aligning marginal distributions only with unsupervised domain adaptation, as illustrated in Fig. 1: 1) cross-label matching: labels of different classes are swapped in the shared feature space, and 2) label shift: the class distributions across the source and the target domains are imbalanced, leading to samples of one class being mixed with another class. In this work, we align both the marginal and the conditional distributions across domains and use a few labeled target samples to avoid ad-hoc mixing of class categories and negative transfer from cross-label assignments.

5 DIRL Algorithm

The central idea of the DIRL approach is to minimize the joint distribution discrepancy in a shared feature space by aligning the marginal and the conditional distributions across domains. To align the marginal distributions, we use a domain classifier to discriminate between the shared features of the source and the target domains. The shared features are adapted to deceive the domain classifier with adversarial learning [Ganin_DANN_16, Tzeng_adda_17, shu_dirtt_18]. Aligning the conditional distributions, however, is non-trivial due to the lack of labeled target examples. We align the conditional distributions in a semi-supervised manner by using a class-wise domain discriminator for each class, in a similar spirit to [Chen_cdan_17]. We further use a novel triplet distribution loss to make the conditional distributions disjoint in the feature space. The overall architecture of DIRL is shown in Fig. 2.

Marginal Distributions Alignment: Given the joint distribution $P(X, Y)$, we align the marginal distributions of the transformed source and target domains with adversarial learning. The generator $g$ encodes the data in a shared feature space, and the discriminator $D$ predicts the binary domain label, i.e., whether the data point is drawn from the source or the target distribution. The discriminator loss is optimized using the domain labels,

$$\mathcal{L}_{D} = -\,\mathbb{E}_{\bm{x} \sim \mathcal{D}_s}\big[\log D(g(\bm{x}))\big] \;-\; \mathbb{E}_{\bm{x} \sim \mathcal{D}_t}\big[\log\big(1 - D(g(\bm{x}))\big)\big]. \tag{4}$$

The generator subsequently adapts the target domain features to confuse the discriminator with inverted domain labels to avoid vanishing gradients [Tzeng_adda_17]; note that a gradient reversal layer can also be used [Ganin_DANN_16]. The generator loss adapts the feature extractor,

$$\mathcal{L}_{g} = -\,\mathbb{E}_{\bm{x} \sim \mathcal{D}_t}\big[\log D(g(\bm{x}))\big]. \tag{5}$$

Without loss of generality, we denote the adaptation of the feature extractor with respect to the target data only. The objective is optimized in a minimax fashion where the discriminator maximizes the empirical marginal discrepancy for a given features distribution, and the feature extractor minimizes the discrepancy by aligning the marginal distributions.
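A minimal PyTorch sketch of this minimax optimization, assuming a feature extractor `g` and a single-logit domain discriminator `d` (both hypothetical `nn.Module`s); it mirrors Eqs. (4)-(5) with the inverted-label generator objective.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(g, d, x_s, x_t):
    # Eq. (4): d learns to output 1 for source features, 0 for target.
    z_s, z_t = g(x_s).detach(), g(x_t).detach()  # features fixed for this step
    logits = torch.cat([d(z_s), d(z_t)]).squeeze(-1)
    labels = torch.cat([torch.ones(len(x_s)), torch.zeros(len(x_t))])
    return F.binary_cross_entropy_with_logits(logits, labels)

def generator_loss(g, d, x_t):
    # Eq. (5): inverted domain labels -- target features are pushed to
    # look like source features, avoiding vanishing gradients.
    logits = d(g(x_t)).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, torch.ones(len(x_t)))
```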

Conditional Distributions Alignment: Conditional distributions alignment can overcome the issues of cross-label matching and label shift that arise with aligning marginal distributions only. Effective alignment of conditional distributions depends upon two factors: 1) prediction of target pseudo-labels, and 2) balanced sampling of features per class across domains.

To this end, we leverage upon a few labeled target examples and train the output network $h$ with the labeled source and target examples using the cross-entropy loss,

$$\mathcal{L}_{h} = -\,\mathbb{E}_{(\bm{x}, \bm{y}) \sim \mathcal{D}_s \cup \mathcal{D}_{tl}}\Big[\sum_{k=1}^{K} \mathbb{1}[\bm{y} = k]\, \log h\big(g(\bm{x})\big)_k\Big]. \tag{6}$$

We predict the pseudo-labels for the unlabeled target data by querying the network pre-trained on labeled data at an earlier stage during training, and retain only the top-$k$ pseudo-labels of each class category based on their confidence. We sample with replacement to create a balanced mini-batch with half source and half labeled target examples, and augment the mini-batch with the pool of pseudo-labeled target examples only after the pre-training stage.
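A sketch of the confidence-based top-$k$ pseudo-labeling step, under the assumption that `model` maps images to class logits; the function name and the value of `k` are illustrative.

```python
import torch

def topk_pseudo_labels(model, x_target, k=50):
    # Predict pseudo-labels for unlabeled target data and retain only
    # the k most confident predictions per class category.
    with torch.no_grad():
        probs = torch.softmax(model(x_target), dim=1)
    conf, pseudo = probs.max(dim=1)
    keep = []
    for c in pseudo.unique():
        idx = (pseudo == c).nonzero(as_tuple=True)[0]
        keep.append(idx[conf[idx].argsort(descending=True)[:k]])
    keep = torch.cat(keep)
    return x_target[keep], pseudo[keep]
```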

A minimax game with adversarial learning aligns the conditional distributions with a domain discriminator $D_k$ for each class $k$. First, each class discriminator estimates the conditional discrepancy with respect to the source and the target data for a fixed feature extractor. Second, the generator adapts the feature extractor to minimize the conditional discrepancy for a fixed discriminator. Formally, the adversarial loss for each (ground-truth and predicted) class $k$ is,

$$\mathcal{L}_{D_k} = -\,\mathbb{E}_{\bm{x} \sim \mathcal{D}_s^{(k)}}\big[\log D_k(g(\bm{x}))\big] \;-\; \mathbb{E}_{\bm{x} \sim \mathcal{D}_t^{(k)}}\big[\log\big(1 - D_k(g(\bm{x}))\big)\big], \tag{7}$$

where $\mathcal{D}_s^{(k)}$ and $\mathcal{D}_t^{(k)}$ denote the source and target samples with (ground-truth or pseudo-) label $k$.

Minimizing the conditional discrepancy penalizes the feature extractor for cross-domain overlap within each class, yielding a low joint error in Eq. (3) for provably effective adaptation to the target domain.
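The per-class discriminator loss of Eq. (7) can be sketched as below, with one hypothetical discriminator module per class and features grouped by (ground-truth or pseudo-) label.

```python
import torch
import torch.nn.functional as F

def conditional_adversarial_loss(class_discriminators, feats_s, feats_t):
    # feats_s[k] / feats_t[k]: source / target features with label k.
    # Sums Eq. (7) over all classes that appear in both domains.
    loss = 0.0
    for k, d_k in enumerate(class_discriminators):
        if len(feats_s[k]) == 0 or len(feats_t[k]) == 0:
            continue  # need samples of class k from both domains
        logits = torch.cat([d_k(feats_s[k]), d_k(feats_t[k])]).squeeze(-1)
        labels = torch.cat([torch.ones(len(feats_s[k])),
                            torch.zeros(len(feats_t[k]))])
        loss = loss + F.binary_cross_entropy_with_logits(logits, labels)
    return loss
```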

Figure 2: (left) DIRL aligns marginal and conditional distributions of source and target domains, and uses a soft metric learning triplet loss to make the feature distributions disjoint in a shared feature space, (right) experimental setup for decluttering objects into bins with HSR.

Triplet Distribution Loss: To increase the inter-class variance and reduce the intra-class variance, we introduce a variant of the triplet loss [Schroff15, Rippel15] that operates on distributions of anchor, positive and negative examples, instead of tuples of individual examples (see also [Florence_thesis_2019]). Given labeled examples in a mini-batch, the loss posits that the Kullback-Leibler (KL) divergence between the anchor and the positives distribution in the shared feature space is less than the KL-divergence between the anchor and the negatives distribution by some constant margin $m$. Mathematically,

$$\mathcal{L}_{trip} = \Big[\, \mathrm{KL}\big(q_a \,\|\, p^{+}\big) \;-\; \mathrm{KL}\big(q_a \,\|\, p^{-}\big) \;+\; m \,\Big]_{+}, \tag{8}$$

where $[\cdot]_+$ is the hinge loss, the feature $\bm{z} = g(\bm{x})$ is normalized to extract scale-invariant features similar to [Schroff15], $p^{+}$ and $p^{-}$ are short for the distributions of the positive and negative examples, $q_a(\bm{z}) \propto \exp\big(-\|\bm{z} - \bar{\bm{z}}_a\|^2/\sigma^2\big)$ is a Gaussian situated on the normalized anchor example $\bar{\bm{z}}_a$ in the feature space, $N_p$ and $N_n$ are the numbers of positive and negative examples in a mini-batch, and $\sigma$ is the hyper-parameter that controls the variance of the Gaussian distribution. For each anchor, the positive examples belong to the same class as the anchor, while the negative examples are sampled from the other classes. This soft variant of the triplet loss encourages the feature distributions to be robust to outliers, while increasing the inter-class variance and reducing the intra-class variance across similar examples in the shared feature space.
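A simplified sketch of one reading of Eq. (8), modeling the anchor, positive and negative sets as diagonal Gaussians over L2-normalized features; the paper's exact estimator may differ, so treat the distribution fitting, margin and bandwidth below as illustrative.

```python
import torch
import torch.nn.functional as F

def diag_gaussian_kl(mu_q, var_q, mu_p, var_p):
    # KL divergence between two diagonal Gaussians, summed over dimensions.
    return 0.5 * (torch.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1).sum()

def triplet_distribution_loss(z_anchor, z_pos, z_neg, margin=1.0, sigma=0.5):
    # Hinge on KL(anchor || positives) - KL(anchor || negatives) + margin.
    z_anchor, z_pos, z_neg = (F.normalize(z, dim=-1)
                              for z in (z_anchor, z_pos, z_neg))
    var_a = torch.full_like(z_anchor.mean(0), sigma ** 2)  # Gaussian on anchor
    kl_pos = diag_gaussian_kl(z_anchor.mean(0), var_a,
                              z_pos.mean(0), z_pos.var(0) + 1e-6)
    kl_neg = diag_gaussian_kl(z_anchor.mean(0), var_a,
                              z_neg.mean(0), z_neg.var(0) + 1e-6)
    return F.relu(kl_pos - kl_neg + margin)
```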

Overall Algorithm: The overall DIRL algorithm comprises the classification loss on the labeled source and target examples, the marginal and conditional distributions alignment losses on the features generator, and the triplet distribution loss on the labeled source and target examples. Given the weight coefficients $\alpha, \beta, \gamma$ of the respective losses, the overall loss function that DIRL optimizes is,

$$\mathcal{L}_{\mathrm{DIRL}} = \mathcal{L}_{h} \;+\; \alpha\, \mathcal{L}_{g} \;+\; \beta \sum_{k=1}^{K} \mathcal{L}_{g_k} \;+\; \gamma\, \mathcal{L}_{trip}. \tag{9}$$
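Putting the pieces together, one training step might combine the losses as sketched below, reusing the sketches above; the coefficient names, the update order (discriminators first, then generator/classifier), and the helpers `group_by_class` and `make_triplets` are our assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def dirl_step(g, h, d, class_ds, batch, opt_disc, opt_model,
              alpha=1.0, beta=1.0, gamma=1.0):
    # One DIRL training step (sketch); class-discriminator updates are
    # omitted for brevity.
    x_s, y_s, x_tl, y_tl, x_tu = batch  # source, labeled/unlabeled target

    # 1) Update the domain discriminator for fixed features.
    opt_disc.zero_grad()
    discriminator_loss(g, d, x_s, torch.cat([x_tl, x_tu])).backward()
    opt_disc.step()

    # 2) Update the feature extractor and classifier with Eq. (9).
    opt_model.zero_grad()
    feats_s = group_by_class(g(x_s), y_s)    # hypothetical helper
    feats_t = group_by_class(g(x_tl), y_tl)  # hypothetical helper
    loss = (F.cross_entropy(h(g(torch.cat([x_s, x_tl]))),
                            torch.cat([y_s, y_tl]))
            + alpha * generator_loss(g, d, torch.cat([x_tl, x_tu]))
            + beta * conditional_adversarial_loss(class_ds, feats_s, feats_t)
            + gamma * triplet_distribution_loss(*make_triplets(g, x_s, y_s)))
    loss.backward()
    opt_model.step()
```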

6 Experiments, Results and Discussions

In this section, we first benchmark the DIRL algorithm on digits domains, followed by sim-to-real transfer of vision-based decluttering with a mobile robot (see Appendix A.1 for the 2D synthetic example in Fig. 1). We empirically investigate which representations transfer better with unsupervised approaches, and the effect of a few labeled target examples on transfer learning across domains.

Methods | MNIST→MNISTM | MNIST→USPS | SVHN→MNIST | USPS→SVHN | USPS→MNIST | MNIST→SVHN
unsupervised
RDA | – | – | – | – | – | –
DANN | – | – | – | – | – | –
Triplet | – | – | – | – | – | –
ADA | 0.895 | – | – | – | – | 0.359
MCD | 0.941 | 0.978 | – | 0.288 | 0.932 | –
semi-supervised
DANN | – | – | – | – | – | –
FADA* | – | – | – | – | – | –
DIRL | 0.948 | 0.951 | 0.903 | 0.802 | 0.962 | 0.837

Table 1: Average test accuracy on target domains of the Digits datasets with unsupervised and semi-supervised domain adaptation. DIRL consistently performs well across all target domains in comparison to other baselines. *Results from [Motiian_fada_17].

6.1 Digits Datasets

We compare the DIRL approach with state-of-the-art methods including DANN [Ganin_DANN_16], associative domain adaptation (ADA) [Hausser_ADA_17], reconstruction-based domain adaptation (RDA) [Ghifary_reconstructionDA_16], MCD [Saito_MCDDA_18], and FADA [Motiian_fada_17] on six source→target benchmarks, namely: MNIST→MNISTM, MNIST→USPS, SVHN→MNIST, USPS→SVHN, USPS→MNIST, and MNIST→SVHN.

Results in Table 1 suggest that the unsupervised methods aligning marginal distributions only often do not perform well when the target domain discrepancy relative to the source domain increases. MCD performs better among the unsupervised approaches by minimizing the conditional discrepancy loss using two classifiers, but gives unsatisfactory results in challenging adaptation situations such as USPS→SVHN and MNIST→SVHN. DIRL addresses large domain shift by aligning both the marginal and the conditional distributions using only a few target examples, and consistently outperforms the compared unsupervised and semi-supervised DANN and FADA approaches. As an example, MNIST→SVHN accuracy increases markedly as the number of labeled target examples per class grows across the few-shot settings.

6.2 Vision-Based Decluttering by Sim-to-Real Transfer

Robots picking diversely shaped and sized novel objects in cluttered environments have a wide range of near-term applications in homes, schools, warehouses, offices and retail stores. We consider this scenario with a mobile Toyota Human Support Robot (HSR) that observes the state of the floor as an RGB image and a depth image. The robot recognizes the objects as belonging to one of 12 object categories, and subsequently plans a grasp action corresponding to the 3D object position and the planar orientation of the most likely recognized object. After grasping an object, the robot places it into the appropriate bin (see Fig. 2 (right) for an overview).

In this work, we investigate the feasibility of vision-based decluttering with a mobile robot by sim-to-real transfer with the proposed DIRL approach in a semi-supervised manner. We simulate the decluttering environment in PyBullet similar to the setup in Fig. 2, and collect synthetic RGBD images of cluttered object meshes on the floor, each containing multiple objects drawn from a distribution of screwdriver, wrench, fruit, cup, bottle, assembly part, hammer, scissors, tape, toy, tube and utility object meshes collected from publicly available repositories. We vary the camera viewpoint, the background texture and the color of object meshes in each image, and store the ground-truth bounding box locations, object categories and analytically evaluated grasps for uniformly sampled antipodal pairs on the object meshes in an image. Additionally, we collect RGBD images with the HSR on a white-tiled floor in a similar manner from a distribution of household and machine-shop objects, and hand-label the bounding boxes and object categories.

Methods | mAP | sim accuracy | real accuracy | SS
Sim Only | – | – | – | –
Real Only | – | – | – | –
Sim + Real | – | – | – | –
Triplet | – | – | – | –
DANN | – | – | – | –
MCD | – | – | – | –
DIRL | – | 94.2 | 91.0 | 0.69

Table 2: Performance evaluation of domain-invariant object recognition by sim-to-real transfer on the target test set using mean Average Precision (mAP), classification accuracy on synthetic and real test images, and Silhouette score (SS). DIRL performs better than the other compared approaches across both domains.

We modify the single-shot multi-box detector (SSD) [Lin17] with focal loss and a feature pyramid as the base model for domain-invariant object recognition. Domain classifiers for the marginal and conditional discrepancies are added on top of the feature pyramid network (see supplementary materials for architecture and training details). Results with labeled target examples on the test set are summarized in Table 2. Note that the Silhouette score (SS) metric in Table 2 measures the tightness of a cluster relative to the other clusters without using any labels (unsupervised), while the classification accuracy and the mAP are supervised metrics that use the target test set labels. We observe that the object recognition model trained on synthetic data only gives poor performance on real data (26.8% accuracy), substantially below training on real labeled data only. Naively combining the synthetic and real data in a mini-batch is also sub-optimal. Using the triplet loss on labeled source examples preserves the structure of the features for transfer to real examples. DANN improves performance in both domains by aligning marginal distributions. MCD further improves the performance with conditional alignment of distributions. DIRL outperforms the compared approaches by combining marginal and conditional distributions alignment with the triplet distribution loss.

Decluttering with Toyota HSR: We test the performance on the Toyota HSR for picking objects from the floor and depositing them in target bins, as shown in Fig. 2 (right). We load objects in a bin from a set of physical objects and drop them on the floor in front of the robot. The objects may overlap after dropping; a pushing primitive is used to singulate the cluttered objects if the overlap exceeds a threshold. The domain-invariant object recognition model performs reliably in real experiments. The cropped depth image from the output bounding box of the object recognition model is fed as input to the grasp planning model trained on simulated depth images, adapted from [Tanwani_2019, staub_hsr_2019, mahler_dexnet2_17]. The grasping network picks up the identified object more reliably than a baseline that grasps orthogonal to the principal axis of the point cloud at the predicted object location. We observe that the robot performs well in grasping compliant objects and objects with well-defined geometry such as cylinders, screwdrivers, tape, cups, bottles and utilities, while assembly parts and small bowls in inverted pose induced repeated failures in grasping the target object (see supplementary materials for details).

7 Conclusion

In this paper, we have presented a sample-efficient domain-invariant representation learning approach for adapting deep models to a new environment. The proposed DIRL approach overcomes the ambiguity in transfer with unsupervised approaches by mitigating both the marginal and the conditional discrepancies across domains with a small amount of labeled target examples, and uses a triplet distribution loss to make the feature distributions disjoint in a shared feature space. Experiments with digit domains yield state-of-the-art performance across challenging transfer scenarios with large domain shift, while vision-based decluttering with a mobile robot suggests the feasibility of grasping diversely shaped and sized novel objects by sim-to-real transfer. In future work, we plan to investigate domain-invariance while deploying models across multiple environments with a fleet of robots [tanwani_rilaas_20] and to close the real-to-sim loop in transferring models across new environments.

The research work was performed at UC Berkeley, in collaboration with the AutoLab, the Berkeley Artificial Intelligence Research (BAIR), and the Real-Time Intelligent Secure Execution (RISE) Lab. The authors would like to thank Matthew Trepte, Daniel Zeng, Kate Sanders, Yi Liu, Lerrel Pinto, Trevor Darrell, and Ken Goldberg for their contributions, feedback and suggestions.

References

Appendix A Experimental Details

A.1 2D Synthetic Example

We consider a 2-dimensional problem comprising two classes. Source data is generated from Gaussian distributions with a distinct mean for each of the two classes, and target data is drawn from Gaussian distributions with the respective class means shifted. Covariance matrices for all the Gaussians are diagonal with standard deviations of 0.25. We sample instances from the source and the target distributions for training, and held-out instances from the target distribution are used for testing the learned model. We use the same weight for all the constituent loss functions. The network architecture comprises fully connected hidden layers with ReLU activations for each of the shared feature extractor, output classifier, domain discriminator and class-conditional domain discriminator. During training, we use mini-batches comprising half source, labeled target (if applicable) and unlabeled target examples. For the conditional domain discriminators, we sub-sample a new mini-batch comprising half labeled source and half labeled target examples for each class. We use the Adam optimizer to minimize the constituent losses. Note that we do not use the labeled target instances for the classification loss, and only use them for the triplet distribution and conditional discriminator losses, in order to better analyze the effect of the constituent losses on the target accuracy. Results are summarized in Fig. 1 and animations are available on the project website: https://sites.google.com/view/dirl
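A sketch of generating this synthetic setup; the exact class means were not recoverable from the text, so the values below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
COV = np.diag([0.25, 0.25]) ** 2  # diagonal covariance for every Gaussian

def sample_domain(means, n_per_class):
    # Draw n_per_class points from a Gaussian per class and stack them.
    x = np.vstack([rng.multivariate_normal(m, COV, n_per_class) for m in means])
    y = np.repeat(np.arange(len(means)), n_per_class)
    return x, y

# Illustrative means: target classes are shifted versions of the source.
x_src, y_src = sample_domain([(-1.0, 0.0), (1.0, 0.0)], 500)
x_tgt, y_tgt = sample_domain([(-1.0, 1.5), (1.0, 1.5)], 500)
```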

A.2 Digits Datasets

We choose four commonly used digits datasets for the benchmarks (see Fig. 3): MNIST [lecun_mnist_2010], MNISTM [Ganin_DANN_16], USPS [hull_usps_94], and SVHN [netzer_svhn_11]. We select the first dataset as the source and the second as the target for adaptation in the following configurations: MNIST→MNISTM, MNIST→USPS, SVHN→MNIST, USPS→SVHN, USPS→MNIST, MNIST→SVHN.

Dataset | Train Instances | Test Instances
MNIST | 60,000 | 10,000
MNISTM | 59,001 | 9,001
SVHN | 73,257 | 26,032
USPS | 7,291 | 2,007

Table 3: Digits dataset instances.

Experimental details are as follows: inputs from all datasets are reshaped to a common resolution, and the output classifier has 10 dimensions corresponding to the ten digit classes. The network architecture consists of a shared feature extractor with sets of convolution layers, each set separated by a max-pool layer. The output of the final layer is flattened and fed to the output classification network, the domain discriminator and the conditional domain discriminator, each implemented with dense layers.
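A sketch of this architecture in PyTorch; the exact filter counts and layer widths were lost in extraction, so the sizes below are illustrative (assuming 32x32 RGB inputs).

```python
import torch.nn as nn

# Shared feature extractor: two sets of convolution layers, each
# followed by max pooling (sizes illustrative).
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),  # 128 x 8 x 8 features for a 32x32 input
)
classifier = nn.Sequential(nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
                           nn.Linear(256, 10))           # 10 digit classes
domain_discriminator = nn.Sequential(nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
                                     nn.Linear(256, 1))  # source vs. target
```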

We use the same weight for all the constituent loss functions. We perform experiments with a small number of labeled target instances to evaluate the proposed approach. The overall batch comprises half-source and half-target examples; source examples are always labeled, while target examples are both labeled and unlabeled within a mini-batch. No pseudo-labels are used for the target examples in the digits experiments. We use the Adam optimizer to minimize the constituent loss functions. We also attempted instance normalization with this setup similar to [shu_dirtt_18], but it did not have any significant effect on the results.

We compare the average test accuracy on the target domains in the unsupervised and semi-supervised settings. Methods aligning marginal distributions only, such as DANN, often do not perform well due to the lack of conditional alignment across domains. Metric learning with the triplet loss on the source domain increases the separation between class categories, which helps transfer to the new environment. Adding a reconstruction loss on top of DANN to force the feature transformation to be invertible decreases the performance on the target domain; this degradation is likely due to the additional constraint of having distinct features for each sample, making it difficult to align the marginal distributions, as also suggested in [Johansson_DA_19]. Associative domain adaptation imposes a cyclic loss to bring source and target examples close in the shared feature space, but yields unreliable performance across datasets. MCD performs better among the unsupervised baselines by minimizing the conditional discrepancy loss using two classifiers, but like other unsupervised domain adaptation methods gives unsatisfactory results in challenging adaptation situations such as USPS→SVHN and MNIST→SVHN.

DIRL addresses the limitations of the existing approaches and outperforms the unsupervised baselines as well as the semi-supervised DANN and FADA in the few-shot scenarios. DIRL uses a few target labels for effective conditional distributions alignment (see Fig. 3 for a qualitative comparison). The results presented in this paper do not make use of the pseudo-labeled target data. Note that DIRL performs well across all datasets, especially on the challenging problems of USPS→SVHN and MNIST→SVHN with large domain shift, by aligning the marginal and the conditional distributions using only a few samples of each target class; MNIST→SVHN accuracy in particular increases steadily with the number of labeled target shots.

Figure 3: (top) Sample images from the MNIST, MNISTM, SVHN and USPS digits datasets, (bottom) t-SNE visualization of MNIST→MNISTM (source in blue, target in red). DIRL compactly clusters the class distributions across datasets for transfer learning, in comparison to DANN and source-only transfer.

A.3 Vision-Based Decluttering by Sim-to-Real Transfer

Simulation and Real Dataset: We simulate the decluttering environment in the PyBullet simulator. The simulated dataset comprises synthetic RGB and depth images of cluttered object meshes on the floor. Each image contains a random number of objects split across 12 categories, namely screwdriver, wrench, fruit, cup, bottle, assembly part, hammer, scissors, tape, toy, tube and utility. The object meshes are collected from the TurboSquid, KIT, 3DNet and ShapeNet repositories. We use domain randomization to vary the camera viewpoint, background texture and color of the object meshes in each image, and store the ground-truth bounding box locations, object categories, segmentation masks and analytically evaluated grasps for uniformly sampled antipodal pairs on the object meshes in an image.

The physical dataset comprises real RGB and depth images collected with the Toyota HSR looking at a white-tiled floor, with objects drawn from a distribution of household and machine-shop objects. We hand-label the bounding boxes and object categories consistent with the synthetic classes above.

Object Recognition Network: We use the MobileNet Single Shot MultiBox Detector (SSD) [Liu15, Lin17] with focal loss and feature pyramids as the base model for object recognition. The input RGB image is fed to a pre-trained VGG network, followed by feature resolution maps and a feature pyramid network. The feature pyramid network produces feature maps across multiple resolutions that are concatenated before being fed to the class prediction and box prediction networks.

We modify the base model by flattening the output of the class predictions with background features and adding domain and class discriminators on top. The domain discriminator consists of three fully connected layers followed by an output layer for classifying simulator vs. real images. The class-conditional domain discriminator consists of two fully connected layers before its output layer. We use an input batch that is split across half-sim, and labeled and unlabeled real data. We sample the most likely class predictions (without background) in the same proportion as the batch size for the triplet distribution loss and the marginal and conditional discriminators. After pre-training the network, we assign pseudo-labels to the unlabeled real images and use them with the triplet distribution loss and the conditional discriminators. The class embeddings are uniformly sampled for each conditional discriminator to mitigate the effect of the imbalanced proportion of classes.

Results with labeled target examples on the test set are summarized in the main body of the paper. We use three performance metrics, namely: mean Average Precision (mAP), classification accuracy on real and sim images on a held-out test set, and Silhouette score (SS). Note that the Silhouette score is an unsupervised metric that measures the tightness of a cluster relative to the other clusters without using any labels, while the classification accuracy and the mAP are supervised metrics that use the labels of the test set. We observe that the object recognition model trained on synthetic data only gives poor performance on real data (26.8% accuracy), substantially below training on real labeled data only. Naively combining the synthetic and real data in a mini-batch is also sub-optimal. Using the triplet loss on labeled source and target examples preserves the structure of the features for transfer to real examples. DANN improves performance in both domains by aligning the marginal distributions. MCD further improves the performance with conditional alignment of distributions. DIRL outperforms the compared approaches by combining marginal and conditional distributions alignment with the triplet distribution loss.

Figure 4: Vision-based decluttering examples of object recognition and grasp planning. (top) Output bounding boxes from the object recognition network are fed to the grasp planning network to yield the most likely grasp to succeed, shown with red whisker plots, (bottom) examples of grasping a target object with real and simulated depth images.

Grasp Planning Network: The grasp planning model comprises two parts: grasp sampling and grasp evaluation. The model samples antipodal pairs on the cropped depth image of the object and feeds them to a convolutional neural network to predict the probability of a successful grasp as determined by the wrench resistance metric. Good grasps are successively filtered with the cross-entropy method to return the most likely grasp for the robot to pick and place the object into the corresponding bin.

The cropped depth image is centered around the sampled grasp center and orientation to make the network predictions rotation-invariant. The network takes a downsampled depth image around the grasp center and processes it with a stack of convolution layers followed by a fully connected layer. The average height of the grasp-center region from the camera is processed through a separate fully connected layer before it is concatenated with the image-stream features, followed by another fully connected layer and an output layer that predicts the grasp quality score.
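A sketch of this two-stream network; the layer sizes and filter counts are illustrative since the exact values were lost, and a 32x32 depth crop is assumed.

```python
import torch
import torch.nn as nn

class GraspQualityNet(nn.Module):
    # Two-stream grasp evaluation: a depth-image crop centered on the
    # grasp, plus the average grasp-center height as a scalar input.
    def __init__(self):
        super().__init__()
        self.image_stream = nn.Sequential(
            nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(16 * 16 * 16, 128), nn.ReLU())
        self.height_stream = nn.Sequential(nn.Linear(1, 16), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(128 + 16, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, depth_crop, grasp_height):
        feats = torch.cat([self.image_stream(depth_crop),
                           self.height_stream(grasp_height)], dim=1)
        return torch.sigmoid(self.head(feats))  # grasp quality in [0, 1]
```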

We test the performance of the trained models on the Toyota HSR for picking objects from the floor and depositing them in target bins. We load objects in a bin from a set of physical objects and drop them on the floor in front of the robot. The objects may overlap after dropping; a pushing primitive is used to singulate the cluttered objects if the overlap exceeds a threshold. The domain-invariant object recognition model performs reliably in real experiments, and the grasping network picks up the identified object more often than a baseline that grasps orthogonal to the principal axis of the point cloud at the predicted object location. We observe that the robot performs well in grasping compliant objects and objects with well-defined geometry such as cylinders, screwdrivers, tape, cups, bottles and utilities, while assembly parts and small bowls in inverted pose induced repeated failures in grasping the target object (see the video on the project website for more details).