Deep Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on several image processing and computer vision tasks such as image classification, object detection, and segmentation [1, 2]. Numerous applications depend on the ability to infer knowledge about the environment through image acquisition and processing. Hence, scene understanding as a core computer vision problem has received a lot of attention. Semantic segmentation, the task of labelling pixels by their semantics (like 'person', 'dog', 'horse'), paves the road for complete scene understanding. Current state-of-the-art methods for semantic segmentation are dominated by deep convolutional neural networks (DCNNs) [3, 4, 5]. However, training end-to-end CNNs requires large-scale annotated datasets. Even with a large enough dataset, training segmentation models with only image-level annotations is quite challenging [6, 7, 8], as the architecture needs to learn from higher-level image labels and then predict low-level pixel labels. The significant problem here is the need for pixel-wise annotated labels for training, which becomes a time-consuming and expensive annotation effort. The PASCAL Visual Object Classes (VOC) challenge, considered a standard benchmark for tasks like classification, detection, segmentation, action classification, and person layout, provides only 1464 training and 1449 validation pixel-wise labelled images for the semantic segmentation challenge. To counter this problem, some researchers have extended the training dataset with 8.5k strong pixel-wise annotations (covering the same 21 classes as PASCAL VOC). In practical applications the challenge still stands, since various classes of objects need to be detected, and an annotated dataset for such training is always required.
To reduce the annotation effort, recent reports use weakly annotated datasets to train deep CNN models for semantic segmentation. Typically, such weak annotations take the form of bound-boxes, because drawing bound-boxes around every instance of a class is around 15 times faster than pixel-level labelling. These approaches rely either on defining constraints or on multiple instance learning techniques. One approach uses GraphCut to approximate bound-box annotations with semantic labels. Although deep CNNs (such as the one proposed using the DeepLab model) significantly improved segmentation performance using such weakly annotated datasets, they failed to provide good visualization on test images.
Solutions have also been proposed to reduce annotation efforts by employing transfer learning or by simulating scenes. The research community has proposed multiple approaches for adapting vision-based models trained in one domain to a different domain [16, 17, 18, 19, 20]. Examples include: re-training a model in the target domain; adapting the weights of a pre-trained model; using pre-trained weights for feature extraction; and learning common features between domains. Augmenting datasets with synthetically rendered images, or using datasets composed entirely of synthetic images, is another technique being explored to address the dearth of annotated data for training all kinds of CNNs. Significant research on transfer learning from synthetically rendered images to real images has been published [25, 26]. Most researchers have used gaming or physics rendering engines to produce synthetic images, especially in the automotive domain. Peng et al. have done progressive work in the object detection context, studying the various cues that affect transfer learning from synthetic images. However, they train individual classifiers for each class after extracting features from a pre-trained CNN. They show that adding cues like background, object texture, and shape to the synthetic images increases object detection performance [28, 26]. There has not yet been an attempt to benchmark performance on the standard PASCAL VOC semantic segmentation benchmark using synthetic images.
To the best of our knowledge, our report is the first attempt at combining weak annotations (semantic labels generated from bound-box labels) and synthetically rendered images from freely available 3D models for semantic segmentation. We demonstrate a significant increase in segmentation performance (as measured by the mean of pixel-wise intersection-over-union (IoU)) by using semantic labels from weak annotations together with synthetic images. We use the Fully Convolutional Network (FCN-8s) architecture and evaluate it on the standard PASCAL VOC semantic segmentation dataset. The rest of this paper is organized as follows: our methodology is described in section 2, our results are reported in section 3, and section 4 concludes the paper.
2 Methodology

Given an RGB image capturing one or more of the 20 objects included in the PASCAL VOC 2012 semantic segmentation challenge, our goal is to predict a label image with a pixel-wise segmentation for each object of interest. Our approach, represented in Figure 1, is to train a deep CNN with synthetic images rendered from available 3D models. We divide the training of the FCN into two stages: fine-tuning the FCN with the Weak(10k) dataset (real images with semantic labels generated from bound-box annotations), and then fine-tuning with our own Syn(2k) dataset (synthetic images rendered from 3D models). Our methodology thus has two major parts, dataset generation and fine-tuning of the FCN, which are explained in the following subsections.
2.1 Dataset generation
Weakly supervised semantic annotations: To train the CNN for semantic segmentation, we use the bound-box annotations available in the PASCAL VOC object detection challenge training set (10k images with 20 classes). Since a bound-box fully surrounds the object, including pixels from the background, we separate those pixels into foreground and background. The foreground pixels are then given their corresponding object label, also in cases where multiple objects are present in an image. Two methods were considered for converting bound-boxes to semantic segmentation, namely GrabCut and Conditional Random Fields (CRF) as deployed by [5, 15]. Based on the performance on a few selected images, we use the labels from CRF for training the CNN. Figure 2 compares the results of both methods. GrabCut tends to miss smaller objects but is precise in labelling larger ones, whereas CRF labels objects of interest accurately, with a small amount of noise around the edges.
Synthetic images rendered from 3D models: We use the open-source 3D graphics software Blender for this purpose. The Blender-Python API facilitates the loading of 3D models and the automation of scene rendering. We use the Cycles render engine available with Blender, since it supports ray tracing for rendering synthetic images. Since all the information required for annotation is available, we use the PASCAL segmentation label format, with pixels labelled for the 20 classes.
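For reference, the PASCAL segmentation label format maps each class index to a fixed color derived by bit-interleaving the index. A small sketch of that standard palette computation (the function name is ours):

```python
def voc_colormap(n=256):
    """Standard PASCAL VOC label palette: class index -> (R, G, B).

    Each of the 8 color bits is taken from the low 3 bits of the
    (repeatedly shifted) class index, most significant bit first.
    """
    def bitget(value, bit):
        return (value >> bit) & 1

    cmap = []
    for k in range(n):
        r = g = b = 0
        c = k
        for j in range(8):
            r |= bitget(c, 0) << (7 - j)
            g |= bitget(c, 1) << (7 - j)
            b |= bitget(c, 2) << (7 - j)
            c >>= 3
        cmap.append((r, g, b))
    return cmap
```

For example, class 0 (background) maps to black and class 1 (aeroplane) to (128, 0, 0).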
Real-world images embed a lot of information about the environment: illumination, surface materials, shapes, etc. Since the trained model must generalize to real-world images at test time, we take the following aspects into consideration when generating each scenario:
Number of objects
Shape, texture, and materials of the objects
Background of the object
Position and orientation of the camera
Illumination via light sources
To simulate a scenario, we need 3D models, their texture information, and metadata. Thousands of 3D CAD models are available online. We chose the ShapeNet database, since it provides a large variety of models in the 20 categories of the PASCAL segmentation challenge. Figure 3a shows a few of the models used for rendering images. This variety helps randomize the shape, texture, and materials of the objects. We use images from the SUN database as background images. From its large set of categories, we select a few that are relevant as backgrounds for the object classes to be recognized.
To generate a training set of rendered images, the 3D scenes need to be distinct. For every object class, multiple models are randomly chosen from the model repository. Each object is scaled, rotated, and placed at a random location within the field of view of the camera, which sits at a pre-defined location. The scene is illuminated by a directional light source. A background image is then chosen from the database and the image is rendered with the Cycles render engine, finally producing an RGB image and a pixel-wise labelled image. Figure 3b shows a few rendered images used in the training set, while Figure 3c shows the subset of real images from the PASCAL object detection dataset (Weak(10k)) used in training.
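The randomization described above can be sketched as a parameter-sampling step. The following minimal example is illustrative only: the ranges and field names are our assumptions, and the actual pipeline drives Blender through its Python API using these sampled values:

```python
import random

def sample_scene(models_by_class, backgrounds, rng=random):
    """Sample one synthetic scene configuration.

    models_by_class: dict mapping class name -> list of 3D model paths.
    backgrounds: list of background image paths (e.g. from SUN).
    The numeric ranges below are illustrative, not the paper's values.
    """
    cls = rng.choice(sorted(models_by_class))
    return {
        "class": cls,
        "model": rng.choice(models_by_class[cls]),
        # random scale and orientation (Euler angles, degrees)
        "scale": rng.uniform(0.5, 1.5),
        "rotation": [rng.uniform(0.0, 360.0) for _ in range(3)],
        # random location within the camera's field of view
        "location": [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0), 0.0],
        # direction of the directional light source
        "light_direction": [rng.uniform(-1.0, 1.0) for _ in range(3)],
        "background": rng.choice(backgrounds),
    }
```

Rendering each sampled configuration yields one RGB image plus its pixel-wise label image.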
2.2 Fine-tuning the Deep CNN
We fine-tune FCN-8s pretrained on ImageNet, initially with the 10k real images along with the semantic labels generated from bound-boxes using CRF. All layers in the network are fine-tuned with a base learning rate of . We refer to this model as the baseline model. In the next stage, we fine-tune the baseline model with the synthetic images generated from Blender. Selected layers (score_pool3, score_pool4, upscore2, upscore_pool4, and upscore8, shown in Figure 4), comprising 2 convolutional and 3 deconvolutional layers, are fine-tuned with a base learning rate of . The network is trained with the Adam optimizer on a pixel-wise softmax loss function. Since the images rendered from 3D models are not rich in cues like textures and shadows, and hence are not photo-realistic, we choose to fine-tune only a few layers, so that they mainly capture higher hierarchical features such as the shape of the object.
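This selective fine-tuning amounts to assigning a zero learning rate to all layers except the selected ones. A framework-agnostic sketch (the helper function is illustrative, not our actual Caffe configuration):

```python
# Layers kept trainable in the second fine-tuning stage (Figure 4).
SELECTED_LAYERS = {"score_pool3", "score_pool4",
                   "upscore2", "upscore_pool4", "upscore8"}

def lr_multipliers(layer_names, base_lr):
    """Per-layer learning rates: 0.0 freezes a layer, base_lr trains it.

    layer_names: iterable of all layer names in the network.
    """
    return {name: (base_lr if name in SELECTED_LAYERS else 0.0)
            for name in layer_names}
```

In Caffe this corresponds to setting `lr_mult: 0` on the frozen layers' parameters in the training prototxt.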
3 Results and Discussion
The experiments were carried out on a workstation with an Intel Core i7-5960X processor accelerated by an NVIDIA GeForce GTX 1070. NVIDIA DIGITS (v5.0) was used with the Caffe library to train and manage the datasets. The proposed CNN was evaluated on the PASCAL VOC 2012 segmentation dataset, consisting of 21 classes (20 foreground classes and 1 background class) with 1464 training and 1449 validation images.
Table 1 compares the CNN models trained on the datasets listed in its first column. Performance is reported according to the standard metric, the mean of pixel-wise intersection-over-union (IoU). The header row lists the 21 classes and the mean IoU over all 21 classes.
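For reference, the metric can be computed per class and averaged as in this sketch (a common formulation; minor conventions, such as how absent classes are handled, vary between implementations):

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """Mean of per-class pixel-wise intersection-over-union.

    pred, gt: integer label arrays of identical shape.
    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class not present; do not count toward the mean
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```

In benchmark evaluations the intersections and unions are accumulated over the whole dataset before taking the per-class ratios.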
The first row shows the performance when FCN is fine-tuned on real images with strong pixel-wise annotations from the PASCAL VOC 2012 segmentation training set, referred to as Real(1.5k). Using bound-box annotated images as weak annotations was proposed earlier by , and performed better than using only the standard training dataset (of approximately 1.5k images). The second row shows that the model trained on the 10k weakly bound-box annotated images (converted to pixel-wise labels using CRF) improved the mean IoU from 47.68% to 52.80%. The predictions from the CNN trained on Weak(10k) are shown in the third column of Figure 5. Comparing them with the ground truth, we observe that the predictions miss the shape and sharp boundaries of the objects.
Taking the CNN fine-tuned on the Weak(10k) dataset as the baseline, we further fine-tuned it with rendered images of a single class. Table 1 highlights the effect of using a synthetic dataset on a few classes, namely car, bottle, and aeroplane. Syn_Car(100) denotes a dataset of 100 synthetic images with car as the object of interest. We observed that with a few synthetic images from a single class, segmentation performance improved for car as well as for 7 other classes. The improvement in the other classes can be explained by common features learned from the car images. The same trend can be observed for other classes, such as bottle (Syn_Bot(100)) and aeroplane (Syn_Aero(100)).
Finally, we fine-tuned the baseline model with the complete set of synthetic images (100 images per class, 20 classes), referred to as Syn(2k). The mean IoU of this model increased from 52.80% to 55.47%, as shown in Table 1, which supports our hypothesis of supplementing a weakly annotated dataset with synthetic images. Some classes (car, bottle) showed a significant improvement (10% for car, 8% for bottle), indicating that the synthetic images are more informative in such cases. Classes like bicycle, dog, person, and TV-monitor had lower IoU values, since fewer 3D models were available for those object types. Moreover, since objects like cow, cat, and person have a highly variable appearance compared to other object classes, we observe less improvement for them.
To further explore the usefulness of the synthetic and weakly annotated datasets in conjunction with the strongly annotated real dataset, we fine-tune FCN with Real(1.5k)+Weak(10k)+Syn(2k). This model achieves 58.27% mean IoU, while Real(1.5k)+Syn(2k) achieves 5.08% mean IoU, indicating the negative effect of non-photorealistic rendered images on the strongly annotated real dataset.
Figure 5 compares the semantic labels generated by the network trained on the Weak(10k) dataset and on the Weak(10k)+Syn(2k) dataset. The latter predictions are better, producing sharper edges and shapes. This suggests that the shape information from the synthetic models helps eliminate the noise that CRF introduces into the labels. It is worth noting that even though the synthetic images are not photo-realistic and lack visual information such as relevant backgrounds, multiple object classes in a single image, and rich textures, they do capture higher hierarchical features such as shape, and can thus be used alongside weakly annotated images to achieve better performance on semantic segmentation tasks. The benchmark performance of FCN-8s on the PASCAL test data, when trained on the augmented real-image dataset with strong annotations released by , is 62.2% mean IoU. Against this benchmark, our model performs reasonably well at 55.47% mean IoU, trained on a total of 12k images (Weak(10k)+Syn(2k)).
4 Conclusion

Our report demonstrates a promising approach to minimizing annotation and dataset collection efforts by using rendered images from freely available 3D models. The comparison shows that using 10k weakly annotated images (which approximately equals the annotation effort for 1.5k strong labels) together with just 2k synthetic rendered images gives a significant rise in segmentation performance.
This work can be extended by training the CNN with a larger synthetic dataset, richer 3D models, and relevant backgrounds. Adding features such as relative scaling and occlusion can further strengthen the synthetic dataset. The effect of using synthetic datasets with improved architectures for semantic segmentation is being explored further. Factors such as domain adaptation and co-adaptation among deeper layers, which affect transfer learning from synthetic to real images, also merit further investigation.
We acknowledge funding support from an Innit Inc. consultancy grant.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems. Curran Associates Inc., 2012, pp. 1097–1105.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, jun 2015, pp. 1–9.
-  E. Shelhamer, J. Long, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” Arxiv Preprint, vol. 1605.06211, may 2016.
-  V. Badrinarayanan, A. Handa, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling,” Arxiv Preprint, vol. 1505.07293, may 2015.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” Arxiv Preprint, vol. 1606.00915, jun 2016.
-  A. Vezhnevets, V. Ferrari, and J. M. Buhmann, “Weakly supervised structured output learning for semantic segmentation,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, jun 2012, pp. 845–852.
-  J. Verbeek and B. Triggs, “Region Classification with Markov Field Aspect Models,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, jun 2007, pp. 1–8.
-  J. Xu, A. G. Schwing, and R. Urtasun, “Tell me what you see and i will show you where it is,” in CVPR, 2014.
-  M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, jan 2015.
-  B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in 2011 International Conference on Computer Vision. IEEE, nov 2011, pp. 991–998.
-  T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” Lecture Notes in Computer Science, vol. 8693, pp. 740–755, 2014.
-  D. Pathak, P. Krähenbühl, and T. Darrell, “Constrained Convolutional Neural Networks for Weakly Supervised Segmentation,” Arxiv Preprint, vol. 1506.03648, jun 2015.
-  D. Pathak, E. Shelhamer, J. Long, and T. Darrell, “Fully Convolutional Multi-Class Multiple Instance Learning,” Arxiv Preprint, vol. 1412.7144, dec 2014.
-  C. Rother, V. Kolmogorov, and A. Blake, “"GrabCut": Interactive foreground extraction using iterated graph cuts,” in ACM SIGGRAPH 2004 Papers (SIGGRAPH ’04), vol. 23, no. 3. New York, NY, USA: ACM Press, 2004, p. 309.
-  G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille, “Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation,” in 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, dec 2015, pp. 1742–1750.
-  W. Li, L. Duan, D. Xu, and I. W. Tsang, “Learning With Augmented Features for Supervised and Semi-Supervised Heterogeneous Domain Adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1134–1148, jun 2014.
-  J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko, “Efficient Learning of Domain-invariant Image Representations,” Arxiv Preprint, vol. 1301.3224, jan 2013.
-  J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, “LSDA: Large Scale Detection Through Adaptation,” in Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, 2014, pp. 3536–3544.
-  B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” in CVPR 2011. IEEE, jun 2011, pp. 1785–1792.
-  M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37. JMLR.org, 2015, pp. 97–105.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, 2014, pp. 3320–3328.
-  Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, “Revisiting batch normalization for practical domain adaptation,” International Conference on Learning Representations Workshop, 2017.
-  A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic Data for Text Localisation in Natural Images,” Arxiv Preprint, vol. 1604.06646, apr 2016.
-  E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep Domain Confusion: Maximizing for Domain Invariance,” Arxiv Preprint, vol. 1412.3474, dec 2014.
-  G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez, “The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in CVPR, 2016.
-  X. Peng and K. Saenko, “Synthetic to Real Adaptation with Generative Correlation Alignment Networks,” Arxiv Preprint, vol. 1701.05524, jan 2017.
-  X. Peng, B. Sun, K. Ali, and K. Saenko, “Learning Deep Object Detectors from 3D Models,” in 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, dec 2015, pp. 1278–1286.
-  X. Peng and K. Saenko, “Combining Texture and Shape Cues for Object Recognition With Minimal Supervision,” Arxiv Preprint, vol. 1609.04356, sep 2016.
-  A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An Information-Rich 3D Model Repository,” Arxiv Preprint, vol. 1512.03012, dec 2015.
-  J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, jun 2010, pp. 3485–3492.
-  Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, jun 2009, pp. 248–255.
-  “Image Segmentation Using DIGITS 5.” [Online]. Available: https://devblogs.nvidia.com/parallelforall/image-segmentation-using-digits-5/