Automated Synthetic-to-Real Generalization

07/14/2020 ∙ by Wuyang Chen, et al. ∙ 9

Models trained on synthetic images often face degraded generalization to real data. As a convention, these models are often initialized with ImageNet pre-trained representation. Yet the role of ImageNet knowledge is seldom discussed despite common practices that leverage this knowledge to maintain the generalization ability. An example is the careful hand-tuning of early stopping and layer-wise learning rates, which is shown to improve synthetic-to-real generalization but is also laborious and heuristic. In this work, we explicitly encourage the synthetically trained model to maintain similar representations with the ImageNet pre-trained model, and propose a learning-to-optimize (L2O) strategy to automate the selection of layer-wise learning rates. We demonstrate that the proposed framework can significantly improve the synthetic-to-real generalization performance without seeing and training on real data, while also benefiting downstream tasks such as domain adaptation. Code is available at:




1 Introduction

Training a deep convolutional neural network (DCNN) can require large amounts of labeled data in computer vision tasks such as segmentation (Ros et al., 2016; Richter et al., 2016, 2017), depth/flow estimation (Dosovitskiy et al., 2015; Mayer et al., 2016; Gaidon et al., 2016), object detection (Johnson-Roberson et al., 2016), visual navigation (Savva et al., 2019), and grasping (Coumans and Bai, 2016). When there is label scarcity, a popular approach is to resort to training with synthetic images, where full supervision can be obtained at a low cost. This finds applications in label-scarce domains such as robotics and autonomous driving where simulation can play an important role.

However, training with synthetic images poses many challenges. Models trained on synthetic images often generalize poorly to the real domain. This domain gap is usually caused by limitations in rendering quality, including unrealistic texture, appearance, illumination, and scene layout. As a result, networks are prone to overfitting to the synthetic domain, learning representations that differ from those obtained on real images. To address this, domain generalization methods (Li et al., 2017; Pan et al., 2018; Yue et al., 2019) have been proposed to overcome these domain gaps and improve model generalization on real target domains.

Figure 1: Both heuristic solutions (early stopping, small learning rates, etc.) and recent works (e.g. IBN-Net (Pan et al., 2018)) suffer from poor generalization in synthetic-to-real transfer learning, owing to the huge appearance gap between the source and the target domain. Here, we studied different learning rates (“LR”) or optimization strategies for the backbone and the last fully-connected classification layer (“FC”). All settings start with an ImageNet pre-trained backbone and a randomly initialized classification layer. Please see section 3.2 for experiment details.

Synthetic-to-real transfer learning involves training a model only on synthetic images (source domain) without seeing any real ones, and targets generalization performance on unseen real images (target domain). Recent synthetic-to-real generalization algorithms often start with an ImageNet pre-trained model. To achieve the best generalization performance, it is common practice to fine-tune the pre-trained model on synthetic images for only a few epochs (i.e. early stopping) with a small learning rate. Figure 1 illustrates the evaluation dynamics of several popular heuristic solutions on the VisDA-17 dataset (Peng et al., 2017). One can clearly see the high performance in early epochs, and the improvement of fine-tuning with a small learning rate (or even a fixed backbone) over training with a large one (red dashed line). Similar behavior exists in recent works (e.g. IBN-Net (Pan et al., 2018)). This observation offers an important clue: all these heuristics try to retain the ImageNet domain knowledge during the synthetic-to-real transfer learning. It explains why the heuristic solutions in Figure 1 work: they allow the classifier to quickly adjust from ImageNet to the task defined by the synthetic images, while preventing the ImageNet pre-trained representations of natural images from being “washed out” due to catastrophic forgetting.

Unfortunately, existing solutions (e.g. IBN-Net) still face degraded generalization and are highly dependent on manual selections of training epochs and schedules (learning rates). Motivated by this open issue, we propose an Automated Synthetic-to-real Generalization (ASG) framework to improve synthetic-to-real transfer learning. This method is automated from two aspects: (1) It stably improves the generalization during transfer learning, avoiding the difficulty of choosing epochs to stop. (2) It automates the complicated tuning of layer-wise learning rates towards better generalization. The core of our work is the intuition that a good synthetically-trained model should share similar representations with ImageNet-models, and we leverage this intuition as a proxy guidance to search layer-wise training schedules through learning-to-optimize (L2O).

Summary of Contributions:

  • We examine the behaviors of various training heuristics, in order to study the role of the ImageNet domain knowledge in synthetic-to-real generalization, which is not thoroughly discussed by the literature to the best of our knowledge.

  • We provide a novel perspective to address synthetic-to-real generalization, by formulating it as a lifelong learning problem. We enforce the representation similarity between synthetically trained models and ImageNet pre-trained model, and treat their similarity as a proxy guidance of generalization performance. An overall design is illustrated in Figure 2.

  • We demonstrate that proxy guidance not only dramatically improves the generalization performance, but can also be easily integrated by existing transfer learning frameworks as a simple drop-in module, without requiring any additional training beyond synthetic images. Experiments also prove the cross-task generalizability of our proxy guidance, which magnifies the strength of synthetic-to-real transfer learning.

  • We design a reinforcement learning based learning-to-optimize (RL-L2O) approach to make synthetic-to-real generalization more convenient in practice, by automating the complicated heuristic design of layer-wise learning rates. We demonstrate that our RL-L2O method outperforms hand-crafted decisions and learns an explainable learning rate strategy.

Figure 2: We formulate the synthetic-to-real transfer learning as a lifelong learning problem: training on synthetic images (new task) while still memorizing ImageNet classification (old task), acting as our proxy guidance during the transfer learning.

2 Automated Syn-to-Real Generalization

In our work, we propose an automated framework to address the synthetic-to-real transfer learning, dubbed Automated Synthetic-to-real Generalization (ASG). We assume an ImageNet pre-trained model as our starting point. Our target is to maximize the performance of the model on a target domain which consists of unseen real images, by utilizing only synthetic images from the source domain.

2.1 Syn-to-Real Generalization with Proxy Guidance

Access to a model pre-trained on ImageNet (Deng et al., 2009) implicitly provides the domain knowledge of real images. As we are transferring a model trained on synthetic data to unseen real images, retaining the ImageNet domain knowledge is potentially beneficial to the generalization. Motivated by this, we force the model to memorize how to capture the representation learned from ImageNet while training on synthetic images, so that it maintains both the domain knowledge of real images and the task-specific information provided by the synthetic data.

We start with an ImageNet pre-trained model $M$, and formulate our transfer learning as a lifelong learning problem: training on synthetic images as the new task while still memorizing the old ImageNet classification task. While updating the model $M$ with synthetic images, we also keep a copy of the original ImageNet pre-trained model, $M_o$, which is frozen during the training. In addition to the cross-entropy loss $\mathcal{L}_{new}$ calculated on the synthetic dataset, we also forward the synthetic images through $M_o$ and minimize the KL divergence between the outputs of $M$ and $M_o$. Formally, we leverage the minimization of $\mathcal{L}_{KL}$ as a proxy guidance during our transfer learning process:

$$\theta_s^*, \theta_n^* = \arg\min_{\theta_s, \theta_n} \mathcal{L} \qquad (1)$$
$$\mathcal{L} = \mathcal{L}_{new} + \lambda \, \mathcal{L}_{KL} \qquad (2)$$
$$\mathcal{L}_{new} = -\frac{1}{N} \sum_{i=1}^{N} y_i \log f(x_i; \theta_s, \theta_n) \qquad (3)$$
$$\mathcal{L}_{KL} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\big( f(x_i; \theta_s^o, \theta_o) \,\|\, f(x_i; \theta_s, \theta_o) \big) \qquad (4)$$

Here, $\lambda$ is a balancing factor that controls how much ImageNet domain knowledge the model should retain. $\theta_n$ denotes the parameters for the synthetic-to-real transfer learning (i.e. the classifier layers for the new task), and $\theta_o$ denotes the parameters for the ImageNet classifier, which outputs the predicted probabilities on the ImageNet domain. $\theta_s$ denotes the parameters for the feature extractor (a.k.a. backbone) updated for the new task, and $\theta_s^o$ denotes the parameters for the feature extractor which is frozen with ImageNet pre-trained weights. $M$ and $M_o$ share the same structure. $N$ is the current batch size, and $x_i$ and $y_i$ are the sample and ground truth from the new task in the current batch. This synthetic-to-real transfer learning with proxy guidance is illustrated in Figure 2. The new task and the old ImageNet task are jointly optimized during the training.
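As a minimal sketch of the joint objective described above, the snippet below combines a batch-averaged cross-entropy on the new task with a weighted KL term between the ImageNet predictions of the frozen and the updated model. It is an illustration in plain Python rather than the authors' implementation: the function names, the argument layout (lists of logits), the direction of the KL term (frozen model as reference), and the default `lam=0.1` are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one sample against an integer class label."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def proxy_guided_loss(new_logits, labels, frozen_imagenet_logits,
                      current_imagenet_logits, lam=0.1):
    """Return (total, new-task loss, proxy guidance loss), averaged over the batch.

    frozen_imagenet_logits: ImageNet-classifier outputs on the frozen backbone.
    current_imagenet_logits: the same classifier applied to the updated backbone.
    """
    n = len(labels)
    loss_new = sum(cross_entropy(z, y) for z, y in zip(new_logits, labels)) / n
    loss_kl = sum(
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(frozen_imagenet_logits, current_imagenet_logits)
    ) / n
    return loss_new + lam * loss_kl, loss_new, loss_kl
```

When the updated backbone still reproduces the frozen model's ImageNet predictions, the proxy term vanishes and the total reduces to the new-task loss; any drift away from those predictions is penalized in proportion to `lam`.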

Cross-task proxy guidance: Note that the new task is not necessarily limited to image classification. For some models in semantic segmentation (e.g. ResNet-based FCN (Long et al., 2015a)), a pixel-wise $\mathcal{L}_{new}$ provides much denser supervision than the image-wise $\mathcal{L}_{KL}$ in Eq. 4. To spatially balance $\mathcal{L}_{new}$ and $\mathcal{L}_{KL}$, we also make $\mathcal{L}_{KL}$ denser by applying it on cropped feature map patches:


Here, are cropped patches from . Later in section 3.4 we will demonstrate that this formulation also works well for cross-task training.

2.2 Automate LR Selection via Learning-to-Optimize

As observed in Figure 1, different convolution blocks contribute differently to the generalizability. This leads to a question: do different layers in a deep network require different training strategies to reach optimal synthetic-to-real generalization performance during transfer learning?

To avoid manually tuning the hyperparameters, we propose a reinforcement learning based learning-to-optimize (RL-L2O) framework to automatically adjust the learning rates for layers. In the RL-L2O framework, we aim to learn a parameterized policy $\pi$ to dynamically control the learning rates given the training statistics of our model during transfer learning.

Generally, the goal of the reinforcement learning algorithm is to learn a policy that maximizes the total expected reward over time. More precisely,

$$\pi^* = \arg\max_{\pi} \, \mathbb{E}\Big[ \sum_{t} r_t \Big]$$

where the expectation is taken over the sequence of states (or observations) $s_t$ and actions $a_t$. In short, an action $a_t$ produced by $\pi$ will update the learning rates for $M$ in the RL-L2O framework. A state $s_t$ contains optimization-related statistics of the model during the transfer learning, and the reward $r_t$ measures how well the optimization performs.

Design of Optimization Coordinates: One challenge in applying reinforcement learning in our setting is that we want to control the training schedules of a deep network of up to a hundred layers (ResNet-101), each of them requiring an action from our policy. As layers may have strong correlations during the optimization (Ghiasi et al., 2018), the policy may fall into sub-optimal solutions in this large-scale action space. To avoid this difficulty and simplify our policy training, we leverage the underlying structures in current deep networks. Specifically, layers in $M$ with similar input resolution are grouped into a block, named an optimization coordinate. Taking the ResNet family as an example, we group layers into a new coordinate whenever the feature map resolution is reduced. This grouping strategy keeps the action space of the policy small and speeds up the L2O training.
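The grouping rule above can be sketched in a few lines: walk through the layers in order and start a new coordinate whenever the feature-map resolution changes. The layer names and resolutions below are hypothetical placeholders chosen for illustration, not the actual ResNet layer list.

```python
from itertools import groupby

def group_into_coordinates(layers):
    """Group consecutive (name, resolution) layers that share a resolution.

    Returns a list of coordinates, each a list of layer names.
    """
    return [
        [name for name, _ in block]
        for _, block in groupby(layers, key=lambda item: item[1])
    ]

# Hypothetical layer list: (layer name, feature-map side length).
layers = [
    ("conv1", 112),
    ("res2a", 56), ("res2b", 56),
    ("res3a", 28), ("res3b", 28),
    ("res4a", 14),
    ("res5a", 7),
]
coordinates = group_into_coordinates(layers)
```

With this toy input the policy would only need to emit five actions per step, one per coordinate, instead of one per layer.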

Design of Action Space: Intuitively, our policy could directly output a learning rate for each coordinate. However, the model could be very sensitive to the learning rate (as observed in Figure 1), and the learning rate usually resides in a small value range. Directly predicting the value of the learning rate could be very unstable. Instead, we propose a learning rate scaling factor as the action. We first provide the policy with a base learning rate. In the following steps, $\pi$ outputs discrete coordinate-wise learning rate scale factors as its actions, one for each optimization coordinate in $M$. We formulate these as categorical actions, where each learning rate scale factor takes one of a fixed set of discrete values. The learning rate for each coordinate is set to be the base learning rate multiplied by the corresponding scale factor, and we leverage the gradients and momentums calculated by stochastic gradient descent (SGD) (Rumelhart et al., 1986) to update the parameters in $M$.
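To make the action design concrete, here is a small sketch of one SGD-with-momentum step in which each optimization coordinate uses the base learning rate multiplied by its own scale factor. Parameters are represented as nested lists of floats for simplicity; the function name, the data layout, and the particular momentum formulation (velocity accumulates raw gradients) are assumptions for this example.

```python
def sgd_momentum_step(params, grads, velocities, base_lr, scale_factors, momentum=0.9):
    """Update parameters in place, one learning rate per coordinate.

    params, grads, velocities: lists with one inner list of floats per coordinate.
    scale_factors: one scale factor per coordinate, as chosen by the policy.
    """
    for k, scale in enumerate(scale_factors):
        lr = base_lr * scale  # coordinate-wise learning rate
        for j in range(len(params[k])):
            velocities[k][j] = momentum * velocities[k][j] + grads[k][j]
            params[k][j] -= lr * velocities[k][j]
    return params, velocities
```

A scale factor of zero freezes the corresponding coordinate, which is how a policy of this kind can keep selected blocks at their pre-trained weights.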

Figure 3: Workflow of the proposed L2O framework. is the learning rate scale factor for the coordinates, and indicates the learning rate. is dot product.
Figure 4: Architecture of the policy network.

Design of Observation Space and Reward: At each step, the state (observation) for $\pi$ includes: the current $\mathcal{L}_{new}$ (Eq. 3) and $\mathcal{L}_{KL}$ (Eq. 4, Eq. 5), the training progress of $M$ (i.e. the current step divided by the total number of training steps, where the total equals “total epochs” × “iterations per epoch”), the mean and standard deviation of the weights of the classifier, and finally the scale factors from the last step. The policy learning is guided by reward .
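The observation described above can be assembled as a flat feature vector. The sketch below is one possible layout, assuming the two loss values, a progress ratio, the classifier weight statistics, and the previous scale factors are simply concatenated; the function name, argument names, and ordering are hypothetical.

```python
import statistics

def build_observation(loss_new, loss_kl, step, total_steps,
                      classifier_weights, last_scales):
    """Concatenate training statistics into one state vector for the policy."""
    progress = step / total_steps  # total_steps = total epochs * iterations per epoch
    w_mean = statistics.fmean(classifier_weights)
    w_std = statistics.pstdev(classifier_weights)
    return [loss_new, loss_kl, progress, w_mean, w_std, *last_scales]

# Example: step 50 of 200, a toy two-weight classifier, three coordinates.
obs = build_observation(1.2, 0.3, 50, 200, [0.0, 2.0], [0.0, 0.5, 1.0])
```

The vector length is five plus the number of optimization coordinates, so it stays small even for deep backbones.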

Policy Training: We update our policy via the REINFORCE algorithm (Williams, 1992) to minimize:

$$\mathcal{L}_{\pi} = -\sum_{t=1}^{T} r_t \log \pi(a_t \mid s_t)$$

where $T$ is the unroll length for $\pi$. The RL-L2O algorithm listing illustrates the procedure of our framework.


Once we obtain the learned policy $\pi$, we freeze it and apply it to the synthetic-to-real transfer learning of $M$ together with SGD, as illustrated in Figure 3.
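For readers unfamiliar with REINFORCE, the following toy sketch shows the update rule on a categorical policy with a single state, where each action index stands for one candidate scale factor and has a fixed reward. It is a deliberately simplified stand-in (no recurrent policy network, no unrolling, no baseline); the reward values, step count, and learning rate are made up for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr):
    """One ascent step on reward * log pi(action) for a softmax policy.

    The gradient of log pi(action) w.r.t. logit i is (1 if i == action else 0) - pi_i.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

def train_bandit_policy(rewards, steps=2000, lr=0.1, seed=0):
    """Sample actions from the policy and apply REINFORCE updates."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        logits = reinforce_step(logits, action, rewards[action], lr)
    return softmax(logits)

# Hypothetical per-action rewards; the last action is the best one.
final_probs = train_bandit_policy([-1.0, 0.0, 1.0])
```

Ascending on `reward * log pi(action)` is the same as descending on the negated quantity, so actions that earned higher reward become more likely over time.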

3 Experiments

3.1 Datasets

VisDA-17 (Peng et al., 2017) We perform ablation study on the VisDA-17 image classification benchmark. The VisDA-17 dataset provides three subsets (domains), each with the same 12 object categories. Among them, the training set (source domain) is collected from synthetic renderings of 3D models under different angles and lighting conditions, whereas the validation set (target domain) contains real images cropped from the Microsoft COCO dataset (Lin et al., 2014).

GTA5 (Richter et al., 2016) is a vehicle-egocentric image dataset collected in a computer game with pixel-wise semantic labels. It contains 24,966 images with a resolution of 1052×1914. There are 19 classes that are compatible with the Cityscapes dataset.

Cityscapes (Cordts et al., 2016) contains urban street images taken from a vehicle in several European cities. There are 5,000 images with pixel-wise annotations. The images have a resolution of 1024×2048 and are labeled into 19 semantic categories.

3.2 Implementation

Image classification: For VisDA-17, we choose ResNet-101 (He et al., 2016) as the backbone, and one fully-connected layer as the classifier. The backbone is pre-trained on ImageNet (Deng et al., 2009), and then fine-tuned on the source domain, with learning rate = , weight decay = , momentum = 0.9, and batch size = 32. The model is trained for 30 epochs and $\lambda$ for $\mathcal{L}_{KL}$ is set as . In section 3.3, we will additionally study how to choose $\lambda$.

Semantic segmentation: We study both FCN with ResNet-50 and FCN with VGG-16 (Long et al., 2015a). Backbones are pre-trained on ImageNet. Our learning rate is , weight decay is , momentum is 0.9, and batch size is six. We crop the images into patches of 512×512 and train the model with multi-scale augmentation (0.75 to 1.25) and horizontal flipping. The model is trained for 50 epochs, and $\lambda$ for $\mathcal{L}_{KL}$ is set as . Note that $\lambda$ in segmentation is considerably larger since $\mathcal{L}_{new}$ is a pixel-wise dense loss.
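The augmentation settings above (random scale in the stated range, a 512×512 crop, and horizontal flipping) could be sampled as in the sketch below. The order of operations (rescale first, then crop), the uniform sampling of the scale, and the 0.5 flip probability are assumptions for illustration; the source only lists the ingredients.

```python
import random

def sample_augmentation(height, width, crop=512, scale_range=(0.75, 1.25), rng=None):
    """Sample rescale factor, crop window, and flip flag for one training image."""
    rng = rng or random.Random()
    scale = rng.uniform(*scale_range)
    scaled_h, scaled_w = round(height * scale), round(width * scale)
    # Top-left corner of the crop inside the rescaled image.
    top = rng.randint(0, max(scaled_h - crop, 0))
    left = rng.randint(0, max(scaled_w - crop, 0))
    flip = rng.random() < 0.5
    return {"scale": scale, "scaled_size": (scaled_h, scaled_w),
            "top": top, "left": left, "crop": crop, "flip": flip}

# Example with the GTA5 resolution quoted in section 3.1.
aug = sample_augmentation(1052, 1914, rng=random.Random(0))
```

Even at the smallest scale a GTA5 frame remains larger than the crop, so the sampled window always fits inside the rescaled image.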

RL-L2O policy: We set the learning rate for policy training as . The size of the hidden state vector is set to 20, and the unroll length . We train for 50 epochs. For the ResNet family, we follow the convention (He et al., 2016) to group the layers into coordinates: , and the . For VGG-16 (Long et al., 2015a), we also group the layers into coordinates: , and the remaining layers.

Proxy guidance: For all backbones we studied (ResNet-50, ResNet-101, and VGG-16), we forward the feature maps extracted by group into the ImageNet classifier (parameterized by ) to calculate .

3.3 ASG for Image Classification

We first perform the ablation studies on the VisDA-17 image classification task. (Footnote 1: There is no previous synthetic-to-real transfer work on the VisDA-17 classification task, only domain adaptation works.)

Generalization with Proxy Guidance. To evaluate the effect of our proxy guidance, we apply our proxy guidance loss to the different learning rate settings we studied in Figure 1. As demonstrated in Figure 5, once we force the model to memorize the ImageNet domain knowledge, we achieve stably increasing and eventually better generalization performance for each setting we explored in Figure 1. The relative ranking still holds among the different learning rate settings, while the degraded generalizability is addressed. Early stopping is no longer needed, as models enjoy improved generalization given sufficient training epochs. This ablation study validates the contribution of retaining the ImageNet domain knowledge during the synthetic-to-real transfer learning. It is also worth noting that our proxy guidance can be applied to different networks (e.g. the IBN-Net (Pan et al., 2018), green line in Figure 5), which demonstrates the easy integration of our approach as a simple drop-in module with existing synthetic-to-real generalization works, without requiring any additional training beyond synthetic images.

Figure 5: The degraded generalization during the synthetic-to-real transfer learning (studied in Figure 1) can be solved by forcing the model to retain the ImageNet domain knowledge via our proxy guidance (see footnote 2). Task: ResNet-101 VisDA-17 Classification.
Footnote 2: We could not utilize the proxy guidance when the backbone is fixed (“Train FC Only”, blue dashed curve in Figure 1). $\mathcal{L}_{KL}$ is always zero in this case as the backbone is not updated.

Moreover, a vital conclusion from Figure 5 is that reporting only the final performance as a single number is far from sufficient for analyzing and comparing synthetic-to-real transfer learning methods. Instead, the curve of the target performance during training better demonstrates how well a model generalizes. Meanwhile, a stably increasing training curve implies that the model is both better leveraging synthetic images and retaining ImageNet domain knowledge, instead of overfitting on synthetic appearance and leaving the domain gap an open issue.

How to choose $\lambda$: We also study the effect of different strengths of the proxy guidance loss by adjusting $\lambda$ in Equation 2 for a ResNet-101 model trained with a small learning rate for the backbone and a large one for the classification layer (blue line in Figure 5). In Table 1, we vary $\lambda$ over a wide range from 0.01 to 1. While we obtain the best generalization accuracy with $\lambda = 0.1$, our proxy guidance is very robust to different values of $\lambda$. Therefore, choosing $\lambda$ is much easier than tuning hyperparameters in heuristic solutions, such as the number of epochs.

$\lambda$       0.01   0.05   0.1    0.5    1
Accuracy (%)    58.9   59.4   60.1   58.5   59.7
Table 1: Ablation of $\lambda$ for the proxy guidance loss $\mathcal{L}_{KL}$. Model: ResNet-101. Task: VisDA-17 Classification.
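The robustness claim can be checked directly from the numbers in Table 1. The short calculation below picks the best balancing factor and measures the gap between the best and worst accuracy across the tested range; only the variable names are introduced here.

```python
# Accuracy (%) for each tested balancing factor, copied from Table 1.
accuracy_by_lambda = {0.01: 58.9, 0.05: 59.4, 0.1: 60.1, 0.5: 58.5, 1.0: 59.7}

best_lambda = max(accuracy_by_lambda, key=accuracy_by_lambda.get)
spread = max(accuracy_by_lambda.values()) - min(accuracy_by_lambda.values())

print(best_lambda, round(spread, 1))  # prints: 0.1 1.6
```

A spread of 1.6 percentage points across a hundredfold change in the balancing factor is small compared with the gaps between learning rate settings reported in Table 2.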

Automated Syn-to-Real Generalization. We next evaluate the performance of our RL-L2O framework. Specifically, we want to make sure the policy learned by our RL-L2O performs better than both a random policy and the best hand-tuned learning rate policy we explored in Figure 5. A random policy means that the controller always picks a random action as the learning rate scale factor. In all three settings we start from the same base learning rate. Figure 6 demonstrates that, although the hand-tuned learning rate strategy is better than a random policy, RL-L2O still outperforms it (blue line).

Figure 6: Our RL-L2O framework can outperform both the random policy and a carefully hand-tuned learning rate strategy. All three settings include $\mathcal{L}_{KL}$ with the same $\lambda$ during training. Model: ResNet-101. Task: VisDA-17 Classification.

Additional Ablation Study on VisDA-17. We conduct additional ablation studies on VisDA-17 to further analyze the learning behaviors of ASG. Specifically, as both the proxy guidance and the RL-L2O framework are motivated by carefully preserving the ImageNet representations while applying updates from the new task on synthetic data, it is interesting and important to examine the relation between the level of retained ImageNet knowledge and the synthetic-to-real generalization. In our experiment, we compute the ImageNet validation accuracy as well as the generalization performance on the VisDA-17 target domain for the classification task.

Table 2 demonstrates two conclusions: 1) Heuristic solutions that retain more ImageNet domain knowledge achieve higher synthetic-to-real generalization (#3 versus #1), i.e., using hand-crafted small learning rates to prevent the ImageNet pre-trained representations of natural images from being “washed out” due to catastrophic forgetting; 2) By leveraging Proxy Guidance, the generalization performance on VisDA-17 is dramatically improved, while the ImageNet accuracy is also maintained with almost no drop. It is interesting that Proxy Guidance leads to learned model parameters that achieve high accuracy simultaneously on both ImageNet and VisDA-17. In contrast, naively freezing the backbone and only fine-tuning the classifier layer (“Oracle” #5) results in inferior synthetic-to-real generalization despite high ImageNet performance.

#   Model                                        VisDA-17        ImageNet
1   Large LR for all layers                      28.2            0.8
2   + our Proxy Guidance                         58.7 (+30.5)    76.2 (+75.4)
3   Small LR for backbone and large LR for FC    49.3            33.1
4   + our Proxy Guidance                         60.2 (+10.9)    76.5 (+43.4)
5   Oracle on ImageNet (see footnote 3)          53.3 (+4.0)     77.4
6   ROAD (Chen et al., 2018)                     57.1 (+7.8)     77.4
7   Vanilla L2 distance                          56.4 (+7.1)     49.1
8   SI (Zenke et al., 2017)                      57.6 (+8.3)     53.9
9   ASG (ours)                                   61.1            76.7
Table 2: Our Proxy Guidance improves the synthetic-to-real generalization (VisDA-17) by retaining the ImageNet domain knowledge. Learning rate (LR) settings were studied in Figures 1 and 5. FC: the last fully-connected classification layer. Top-1 accuracies are in percentage (%). Model: ResNet-101.

Footnote 3: Oracle is obtained by freezing the ResNet-101 backbone while only training the last new fully-connected classification layer on the VisDA-17 source domain (the FC layer for ImageNet remains unchanged). We use the PyTorch official model of ImageNet pre-trained ResNet-101.

In addition, we compare ASG with several other lifelong learning algorithms, including both feature-level regularization (Chen et al., 2018) and weight-level importance-reweighted constraints (Zenke et al., 2017). Rows #5-8 in Table 2 show that although the three compared methods indeed retain ImageNet domain knowledge while improving over the baseline (49.3%), they do not perform as well as the proxy guidance (60.2%) under the same LR policy.

3.4 ASG for Semantic Segmentation

We also conduct comprehensive experiments to evaluate the synthetic-to-real generalization performance of ASG on the semantic segmentation task. In particular, we treat GTA5 as the synthetic source domain and train segmentation models on it. We then treat the Cityscapes validation/test sets as target domains where we directly evaluate the segmentation performance of the synthetically trained models.

Figure 7: Dynamics of evaluation accuracy with training epochs. Models are trained on GTA5 and directly tested on the Cityscapes validation set. We use FCN-VGG16 as the backbone for segmentation models. In addition, $\mathcal{L}_{KL}$ in all compared methods shares the same parameter $\lambda$ during synthetic source training.

Figure 7 shows the dynamics of evaluation accuracy on the Cityscapes validation set. Again, ASG demonstrates significantly improved generalization performance on semantic segmentation over naive synthetic training. In addition, integrating proxy guidance with RL-L2O also consistently outperforms baselines where proxy guidance is integrated with other policy strategies. Note that in this case, both and are oriented to the classification task, while and designed for segmentation. This showcases the ability of ASG to generalize across different tasks.

In Table 3, we compare our method with prior domain generalization methods for semantic segmentation. One can see that ASG achieves the best performance gain. Among the comparing methods, IBN-Net (Pan et al., 2018) improves domain generalization by fine-tuning the mixed IN-BN residual building blocks, while (Yue et al., 2019) transfers the styles from images in ImageNet to synthetic images. It is worth noting that (Yue et al., 2019) requires ImageNet images during training and implicitly leverages ImageNet label information (i.e. “Auxiliary Domains”) which brings potential advantages. In contrast, our method requires minimum extra information without using any additional images or labels, therefore can be conveniently applied to existing frameworks as a drop-in training strategy.

Methods              Model        mIoU (%)   mIoU gain (%)
No Adapt             FCN-Res50    22.17
IBN-Net (2018)       FCN-Res50    29.64      +7.47
No Adapt             FCN-Res50    32.45
Yue et al. (2019)    FCN-Res50    37.42      +4.97
No Adapt             FCN-Res50    23.29
Ours                 FCN-Res50    31.89      +8.60
No Adapt             FCN-VGG16    29.81
Yue et al. (2019)    FCN-VGG16    36.11      +6.30
No Adapt             FCN-VGG16    19.89
Ours                 FCN-VGG16    31.47      +11.58
Table 3: Comparison to prior methods on domain generalization for semantic segmentation (GTA5 → Cityscapes). Each method is listed below its own No Adapt baseline, and the gain is measured against that baseline.

Policy Behaviors. Figure 8 shows clear and explainable behavior patterns of our policy for FCN-VGG16 on the segmentation task. In FCN-VGG16, groups belong to the ImageNet pre-trained backbone, while and the remaining layers act as the classifier for the dense predictions. The feature map captured by is forward into to calculate . As is close to the calculation of , fixing (i.e. selecting action = 0 which represents the learning rate scale factor = 0) can effectively minimize and retain the ImageNet domain knowledge. As parameters from group to are gradually far from the supervision, the corresponding selected actions also increase.

Figure 8: Action behavior of our RL-L2O framework during the policy training for $M$ = FCN-VGG16 for the GTA5 → Cityscapes segmentation transfer learning. Categorical actions are smoothed for better visualization. Actions of indicate learning rate scale factors .

On the other hand, to perform dense prediction in semantic segmentation, the extracted feature maps are first forwarded to and then to . In addition, similar trend holds for the classifier part: as is the closest group to , it is assigned with the highest scale factor for learning rate.

Figure 9: Generalization results on GTA5 → Cityscapes. Rows correspond to sample images in Cityscapes. From left to right, columns correspond to original images, ground truth, prediction results of the baseline (FCN-VGG16 (Long et al., 2015a)), and predictions by the model trained with our ASG framework.

3.5 ASG for Unsupervised Domain Adaptation

The proposed ASG framework not only can improve the synthetic-to-real generalization performance, but also can considerably benefit downstream tasks such as unsupervised domain adaptation. Here we present synthetic-to-real domain adaptation results on VisDA-17 (Peng et al., 2017) in Table 4, where the model trained by ASG (which did not use any real target images during training) is leveraged as the source model (i.e., starting point for the unsupervised domain adaptation training), and the CBST/CRST frameworks are adopted exactly following (Zou et al., 2018, 2019) for fair comparison purposes.

Starting from a much better initialization (our 61.1% compared with 51.6% in (Zou et al., 2019)), we boost the adaptation performance by over 6% compared with CBST/CRST, achieving 84.6% on VisDA-17. It is important to emphasize that this improvement is obtained without any extra supervision or external knowledge. The only difference lies in smarter synthetic-to-real source training, which ultimately leads to improved adaptation.
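The size of the improvement can be verified from the accuracies reported in Table 4. The calculation below compares each adaptation method with and without the ASG-trained source model; the dictionary names are introduced only for this check.

```python
# Mean accuracies (%) copied from Table 4.
baseline = {"CBST": 76.4, "CRST (MRKLD)": 77.9, "CRST (MRKLD + LRENT)": 78.1}
with_asg = {"CBST": 82.5, "CRST (MRKLD)": 84.6, "CRST (MRKLD + LRENT)": 84.5}

gains = {name: round(with_asg[name] - baseline[name], 1) for name in baseline}
source_gain = round(61.1 - 51.6, 1)  # ASG source model vs. the source model of Zou et al. (2019)

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print(f"Source model: +{source_gain}")
```

Every adaptation variant gains more than six points, and the source model alone starts 9.5 points higher before any target images are used.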

Method Tgt Img Accuracy
Source (Saito et al., 2017) 52.4
DANN (Ganin et al., 2016) 57.4
MCD (Saito et al., 2018b) 71.9
ADR (Saito et al., 2017) 74.8
SimNet-Res152 (Pinheiro, 2018) 72.9
GTA-Res152 (2018) 77.1
Source-Res101 (Zou et al., 2019) 51.6
CBST (Zou et al., 2018) 76.4 (0.9)
CRST (MRKLD) (2019) 77.9 (0.5)
CRST (MRKLD + LRENT) (2019) 78.1 (0.2)
Source-Res101 (ASG) 61.1
ASG + CBST 82.5 (0.7)
ASG + CRST (MRKLD) 84.6 (0.4)
ASG + CRST (MRKLD + LRENT) 84.5 (0.4)
Table 4: Synthetic-to-real adaptation on Visda-17. We follow the same settings in (Zou et al., 2019) to set the weights as 0.1 and 0.25 for MRKLD and LRENT respectively, and report the averages and standard deviations (in brackets) of the evaluation results over five runs. Model: ResNet-101. “Tgt Img”: whether the method leveraged target real images during training. Top-1 accuracies are in percentage (%).
Figure 10: t-SNE visualization of feature embeddings of different models on the target domain of VisDA-17. From left to right: source model (Zou et al., 2019), CBST (Zou et al., 2018), CRST (MRKLD+LRENT) (Zou et al., 2019), and ASG + CRST (MRKLD+LRENT).

Feature visualization. We show the t-SNE visualization of the feature embeddings extracted by the backbone (ResNet-101) of different models in Fig. 10. Compared with both CBST (Zou et al., 2018) and CRST (MRKLD+LRENT) (Zou et al., 2019), feature embeddings obtained by ASG + CRST form purer clusters in terms of semantic labels.

4 Related Work

4.1 Domain Generalization and Adaptation

Domain generalization considers the problem of generalizing a model on the unseen target domain without leveraging any target domain images (Gan et al., 2016; Muandet et al., 2013; Yuan et al., 2020). Muandet et al. (2013) proposed to use the MMD (Maximum Mean Discrepancy) to align the distributions from different domains and train the network with adversarial learning. Li et al. (2017) built separate networks for each source domain and used the shared parameters for the test. Li et al. (2018) improved the generalization performance by using a meta-learning approach on the split training sets. Pan et al. (2018) boosted a CNN’s generalization by carefully integrating the Instance Normalization and Batch Normalization as building blocks.

Unsupervised domain adaptation (UDA) trains a model towards a specific target domain, where the (unlabeled) images from the target domain are available for training. One major idea is to learn domain-invariant embeddings by minimizing the distribution divergence between the source and target domain (Long et al., 2015b; Sun and Saenko, 2016; Tzeng et al., 2014). Hoffman et al. (2017) reduced the domain gap by first translating the source images into the target style with a cycle consistency loss, and then aligning the feature maps of the network across different domains through adversarial training. Other works that leverage image-level translation to bridge the domain gap include domain stylization (Dundar et al., 2020) and DLOW (Gong et al., 2019). Besides image-level translation, a number of works also perform adversarial learning at the feature level (Saito et al., 2018a; Chen et al., 2019; Liu et al., 2019) or output level (Tsai et al., 2018) for improved domain adaptation performance. In addition, Zou et al. (2018; 2019) proposed an expectation-maximization-like UDA framework based on an iterative self-training process, where the loss of the latent variable is minimized. This is achieved by alternately generating pseudo labels on target data and re-training the model with the mixed source and pseudo target labels.

In contrast to the above domain generalization and adaptation methods, we leverage the ImageNet pre-trained model as a proxy guidance during synthetic-to-real transfer learning, without any extra adversarial training or modification to the model architecture.

4.2 Lifelong Learning

Lifelong learning (Thrun, 1998) focuses on flexibly appending new tasks to the model’s training schedule while maintaining the knowledge captured from previous tasks. Li & Hoiem (2017) leveraged only new task data to train the network while preserving the original capabilities by minimizing the discrepancy between the outputs of the old network and the newly learned one. Lopez-Paz and Ranzato (2017) proposed Gradient Episodic Memory (GEM) to alleviate forgetting while transferring knowledge from previous tasks. Shin et al. (2017) developed a Deep Generative Replay framework, which is used to sample training data from previous tasks when training the new task. Other works on lifelong learning with related or similar applications include (Zenke et al., 2017; Kirkpatrick et al., 2017; Shafahi et al., 2019), where lifelong learning is shown to avoid catastrophic forgetting and benefit tasks such as incremental task learning, domain adaptation and adversarial defense. One work that is particularly related to our synthetic-to-real generalization theme is (Chen et al., 2018), where the authors propose a spatial-aware adaptation scheme and also leverage a distillation loss to avoid overfitting to synthetic data. Our work differs from the above prior works by carefully examining the important role played by layer-wise learning rate policies in synthetic-to-real transfer learning problems, and accordingly proposing a principled solution to automate the policy search.
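The idea of keeping the new network's outputs close to those of the old network can be sketched with a distillation-style penalty. The version below uses a temperature-softened KL divergence, which is one common choice and an assumption here; it is not necessarily the exact loss used in any of the cited works, and the function names and `temperature` value are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """Penalty that grows as the new network's outputs drift from the old ones."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return kl_divergence(p_old, p_new)


# Toy logits for one input under the old (frozen) and new (trained) networks.
unchanged = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
drifted = distillation_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

The penalty vanishes when both networks agree and increases as the new network's predictions diverge, so adding it to the task loss discourages forgetting of the old behavior.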

4.3 Learning to Optimize

Andrychowicz et al. (2016) proposed the first learning-to-optimize framework, where both the optimizee’s gradients and loss function values were formulated as the input features for a recurrent neural network (RNN) optimizer. Their RNN optimizer adopted coordinate-wise weight sharing to alleviate the dimensionality challenge. Li and Malik (2016) used the gradient history and objective values as observations and step vectors as actions in their reinforcement learning framework. Chen et al. (2017) leveraged an RNN to train a meta-optimizer to optimize black-box functions (e.g. Gaussian process bandits). Recently, Wichrowska et al. (2017) introduced an optimizer with a multi-level hierarchical RNN architecture augmented with additional architectural features, in order to improve generalization across optimization tasks. Cao et al. (2019) and You et al. (2020) further extended learned optimizers to Bayesian swarm optimization and graph network training, respectively. In our work, we leverage the learning-to-optimize approach to control the layer-wise learning rates for the training of deep CNNs, where the deep CNN (i.e. the optimizee) will be transferred from the synthetic source domain to the real target domain, extending the application range of current learning-to-optimize methods.
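As a rough illustration of what controlling layer-wise learning rates means for the optimizee, the sketch below applies one SGD step in which each layer group has its own learning rate, obtained by scaling a shared base rate. The group names, the scale values, and the function name are hypothetical; in a learning-to-optimize setting the scales would be produced by a learned policy, whereas here they are fixed by hand.

```python
def layerwise_sgd_step(params, grads, base_lr, lr_scales):
    """One SGD step with a separate learning rate per layer group.

    params, grads: dicts mapping a layer-group name to a list of floats.
    lr_scales: dict mapping each group name to a multiplier on base_lr
               (a stand-in for the output of a learned policy).
    """
    updated = {}
    for name, weights in params.items():
        lr = base_lr * lr_scales[name]
        updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
    return updated


# Toy example: the pre-trained backbone moves ten times slower than the head.
params = {"backbone": [1.0, 2.0], "head": [1.0, 2.0]}
grads = {"backbone": [0.5, 0.5], "head": [0.5, 0.5]}
lr_scales = {"backbone": 0.1, "head": 1.0}
new_params = layerwise_sgd_step(params, grads, base_lr=0.1, lr_scales=lr_scales)
```

With identical gradients, the backbone weights change by 0.005 per step while the head weights change by 0.05, which shows how a small scale on early layers limits how far they drift from their pre-trained values.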

5 Conclusion

In this paper, we present an Automated Synthetic Generalization (ASG) method for the synthetic-to-real transfer learning problem. We carefully analyzed the pitfall in existing generalization approaches where the ImageNet domain knowledge is catastrophically forgotten. By minimizing the discrepancy between the predictions of the ImageNet pre-trained model and those of the model for the new task as a proxy guidance, the generalization performance is dramatically improved throughout the training process. We further include a reinforcement learning based learning-to-optimize strategy to automate the selection of layer-wise learning rates towards better generalization performance. Our experiments demonstrate both the superior generalization performance and the automated learning schedules of our ASG framework.

6 Acknowledgments

Work done during an internship at NVIDIA. We appreciate the computing power provided by the NVIDIA GPU infrastructure. We also thank the four anonymous reviewers for their discussion and suggestions, and Yang Zou for help with the domain adaptation experiments. The research of Z. Wang was partially supported by NSF Award RI-1755701.


  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas (2016) Learning to learn by gradient descent by gradient descent. In NeurIPS, Cited by: §4.3.
  • Y. Cao, T. Chen, Z. Wang, and Y. Shen (2019) Learning to optimize in swarms. In NeurIPS, Cited by: §4.3.
  • Y. Chen, W. Li, X. Chen, and L. V. Gool (2019) Learning semantic segmentation from synthetic data: a geometrically guided input-output adaptation approach. In CVPR, Cited by: §4.1.
  • Y. Chen, W. Li, and L. Van Gool (2018) Road: reality oriented adaptation for semantic segmentation of urban scenes. In CVPR, Cited by: §3.3, Table 2, §4.2.
  • Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas (2017) Learning to learn without gradient descent by gradient descent. In ICML, Cited by: §4.3.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §3.1.
  • E. Coumans and Y. Bai (2016) Pybullet, a python module for physics simulation for games, robotics and machine learning. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §2.1, §3.2.
  • A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In ICCV, Cited by: §1.
  • A. Dundar, M. Liu, Z. Yu, T. Wang, J. Zedlewski, and J. Kautz (2020) Domain stylization: a fast covariance matching framework towards domain adaptation. IEEE Trans. PAMI. Cited by: §4.1.
  • A. Gaidon, Q. Wang, Y. Cabon, and E. Vig (2016) Virtual worlds as proxy for multi-object tracking analysis. In CVPR, Cited by: §1.
  • C. Gan, T. Yang, and B. Gong (2016) Learning attributes equals multi-source domain generalization. In CVPR, Cited by: §4.1.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR 17 (1), pp. 2096–2030. Cited by: Table 4.
  • G. Ghiasi, T. Lin, and Q. V. Le (2018) Dropblock: a regularization method for convolutional networks. In NeurIPS, Cited by: §2.2.
  • R. Gong, W. Li, Y. Chen, and L. V. Gool (2019) Dlow: domain flow for adaptation and generalization. In CVPR, Cited by: §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.2, §3.2.
  • J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2017) Cycada: cycle-consistent adversarial domain adaptation. arXiv:1711.03213. Cited by: §4.1.
  • M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan (2016) Driving in the matrix: can virtual worlds replace human-generated annotations for real world tasks?. arXiv:1610.01983. Cited by: §1.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §4.2.
  • D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017) Deeper, broader and artier domain generalization. In ICCV, Cited by: §1, §4.1.
  • D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In AAAI, Cited by: §4.1.
  • K. Li and J. Malik (2016) Learning to optimize. arXiv:1606.01885. Cited by: §4.3.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE Trans. PAMI 40 (12), pp. 2935–2947. Cited by: §4.2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §3.1.
  • X. Liu, S. Li, L. Kong, W. Xie, P. Jia, J. You, and B. Kumar (2019) Feature-level frankenstein: eliminating variations for discriminative recognition. In CVPR, Cited by: §4.1.
  • J. Long, E. Shelhamer, and T. Darrell (2015a) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §2.1, Figure 9, §3.2, §3.2.
  • M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015b) Learning transferable features with deep adaptation networks. arXiv:1502.02791. Cited by: §4.1.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In NeurIPS, Cited by: §4.2.
  • N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, Cited by: §1.
  • K. Muandet, D. Balduzzi, and B. Schölkopf (2013) Domain generalization via invariant feature representation. In ICML, Cited by: §4.1.
  • X. Pan, P. Luo, J. Shi, and X. Tang (2018) Two at once: enhancing learning and generalization capacities via ibn-net. In ECCV, Cited by: Figure 1, §1, §1, §3.3, §3.4, Table 3, §4.1.
  • X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko (2017) VisDA: the visual domain adaptation challenge. arXiv:1710.06924. Cited by: §1, §3.1, §3.5.
  • P. O. Pinheiro (2018) Unsupervised domain adaptation with similarity learning. In CVPR, Cited by: Table 4.
  • S. R. Richter, Z. Hayder, and V. Koltun (2017) Playing for benchmarks. In ICCV, Cited by: §1.
  • S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In ECCV, Cited by: §1, §3.1.
  • G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, Cited by: §1.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. Nature 323 (6088), pp. 533–536. Cited by: §2.2.
  • K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2017) Adversarial dropout regularization. arXiv:1711.01575. Cited by: Table 4.
  • K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2018a) Adversarial dropout regularization. In ICLR, Cited by: §4.1.
  • K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018b) Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, Cited by: Table 4.
  • S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2018) Generate to adapt: aligning domains using generative adversarial networks. In CVPR, Cited by: Table 4.
  • M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. (2019) Habitat: a platform for embodied ai research. In ICCV, Cited by: §1.
  • A. Shafahi, P. Saadatpanah, C. Zhu, A. Ghiasi, C. Studer, D. Jacobs, and T. Goldstein (2019) Adversarially robust transfer learning. arXiv:1905.08232. Cited by: §4.2.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In NeurIPS, Cited by: §4.2.
  • B. Sun and K. Saenko (2016) Deep coral: correlation alignment for deep domain adaptation. In ECCV, Cited by: §4.1.
  • S. Thrun (1998) Lifelong learning algorithms. In Learning to learn, pp. 181–209. Cited by: §4.2.
  • Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §4.1.
  • E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. arXiv:1412.3474. Cited by: §4.1.
  • O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. Sohl-Dickstein (2017) Learned optimizers that scale and generalize. In ICML, Cited by: §4.3.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §2.2.
  • Y. You, T. Chen, Z. Wang, and Y. Shen (2020) L-gcn: layer-wise and learned efficient training of graph convolutional networks. In CVPR, Cited by: §4.3.
  • Y. Yuan, W. Chen, T. Chen, Y. Yang, Z. Ren, Z. Wang, and G. Hua (2020) Calibrated domain-invariant learning for highly generalizable large scale re-identification. In WACV, Cited by: §4.1.
  • X. Yue, Y. Zhang, S. Zhao, A. Sangiovanni-Vincentelli, K. Keutzer, and B. Gong (2019) Domain randomization and pyramid consistency: simulation-to-real generalization without accessing target domain data. In ICCV, Cited by: §1, §3.4, Table 3.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In ICML, Cited by: §3.3, Table 2, §4.2.
  • Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang (2019) Confidence regularized self-training. In ICCV, Cited by: Figure 10, §3.5, §3.5, §3.5, Table 4, §4.1.
  • Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, Cited by: Figure 10, §3.5, §3.5, Table 4, §4.1.