Modern supervised learning algorithms need plenty of data to help train classifiers. More data with higher quality is always desired in real-world applications; but sometimes, it is beneficial to turn to synthetic data. For example, to help identify criminals, many criminal investigations can only rely on a synthetic face sketch rather than a facial photograph of a suspect which may not be available. Such synthetic face data is normally drawn by an expert based on descriptions of eyewitnesses and/or victim(s). Several photo-sketch examples are shown in Fig.1. In this application, recognition based on synthetic data is very crucial.
Directly using synthetic data in a learning algorithm is unfortunately very challenging since synthetic data is different from real data at least to some extent, e.g. exaggerated facial shapes in sketch images in Fig. 1 as compared with real images. As a result, the feature distributions of synthetic data may be shifted away from those of real data as illustrated in Fig. 2. We term such shift in distributions as synthetic gap. Synthetic gap is largely caused by the generating process of synthetic data: whereas the synthetic data are generated by replicating principal patterns such as eyes, mouth, nose and hairstyle, rather than replicating every detail of real data. The synthetic gap is a major obstacle in using synthetic data in recognition problems, since synthetic data may fail to simulate potentially useful patterns of real data which are important to a successful recognition. To solve this problem, we associate synthetic data with real data, and jointly learn from them in a Stacked Multichannel Autoencoder (SMCAE) which can help bridge the synthetic gap by transforming characteristics of synthetic data to better simulate real data.
This paper addresses the problem of learning a mapping from synthetic data to real data. Specifically, we propose a novel framework – SMCAE. The training process of SMCAE facilitates the bridging of the synthetic gap between the real and the synthetic data by learning how to transform: (1) synthetic to real data and (2) real to real data. In (2) the model learns most essential ‘characters’ and ‘patterns’ of real data, while in (1) it learns how to augment the synthetic data to best reproduce the distribution of real data. Because the two tasks are learned simultaneously, with shared parameters, the essential ‘characteristics’ learned in (2) help to regularize results in (1) and vice versa as we will illustrate in the Handwritten Digit experiments.
We highlight two main contributions of this paper: (1) To the best of our knowledge, this is the first attempt to address the problem of synthetic gap, by demonstrating that the synthetic data could be used to improve the performance on a recognition task. (2) We propose a Stacked Multichannel Autoencoder (SMCAE) model to bridge the synthetic gap and jointly learn from both real and synthetic data.
Ii Related Work
aims to extract the knowledge from one, or more, source tasks and apply it to a target task. Transfer learning can be used in many different applications, such as web page classification and zero-shot classification . A more detailed survey of transfer learning is given by . Our method is a specific form of transfer learning, termed domain adaptation [6, 32, 34]. Nonetheless, different from previous domain adaptation approaches, we assume the the synthetic gap is caused by the shift in feature distribution of synthetic data from real data and so we assume that the main ’characters’ and ’patterns’ strongly co-exist in both the synthetic and real data. Our SMCAE is thus developed based on this assumption.
Autoencoder29]. Autoencoder with its different variants [10, 12, 2, 20] was shown to be successful in learning and transferring shared knowledge among data source from different domains [5, 8, 11]
, and thus benefit other machine learning tasks. Our framework borrows the idea of autoencoder to jointly learn two different and yet related tasks: mapping synthetic to real data; and real to real data. It is worth noting that in
, a multimodal autoencoder with structure similar to ours is proposed. Their multimodal autoencoder put two normal autoencoders together by sharing a hidden layer. In their structure, data at input end and output end are fully symmetric and each modal of data occupy one branch of the antuencoder. In contrast to their structure, the proposed SMCAE composes the structure of both normal autoencoder and denoising autoencoder. With this composition, one branch of SMCAE is capable exploring intrinsic features of data in one domain, and another branch of SMCAE is going to transfer data from one domain to another domain using features discovered from both branches. The structure of SMCAE could be easily expanded to more branches to compensate more complicated multi-task learning problems. Our experiments show that our SMCAE is better than other autoencoders in this regard.
Learning from synthetic templates. Some recent works of learning from synthetic data [26, 27, 4] mostly generate synthetic data either by applying a simple geometric transformation or adding image degradation to real data. To help offline recognition of handwritten text [26, 27], a perturbation model combined with morphological operation is applied to real data. To enhance the quality of degraded document , degradation models such as brightness degradation, blurring degradation, noise degradation, and texture-blending degradation, were used to create a training dataset for a handwritten text recognition problem. These methods did not address the synthetic gap problem, and thus have been limited to a small performance improvements by using synthetic data. In , computer graphics 3D models are used to ease training data generation. To simulate pedestrian in a picture, authors track volunteers pose from multiple views and human bodies are reshaped using a morphable 3D human model. The reshaped picture of human bodies later are composed with real world backgrounds. The same idea has been adopted in 
where in addition to render a 3D model to simulate an object in a real scene, features extracted from synthetic data are adapted to better train an object detector.
Iii Stacked Multichannel Autoencoder (SMCAE)
We propose the SMACE model to learn a mapping from synthetic and real data. To learn this mapping, the SMCAE model is formulated as a stacked structure of multichannel autoencoders which facilitates an efficient and flexible way of jointly learning from both synthetic and real data. The structure and configuration of the SMCAE is illustrated in Fig. 3. Specifically, we set the left and right tasks in two channels of the SMCAE respectively. The left task, as illustrated in left channel of Fig. 3, takes synthetic data as input and real data as reconstruction target; while the right task of the right channel in Fig. 3 uses real data in both input and reconstruction target. All between-layer connections that are colored in gray are shared by tasks of the two channels. The SMCAE structured in this way attempts to transform synthetic data to real data in left task using representation learned from real data in right task.
Iii-a Problem setup
We first illustarte the setup of a single layer in each channel of our SMCAE. For a single channel of our SMCAE is basically an autoencoder . Assume an input dataset with instances where . To encode the input data, we have where
is a sigmoid function and, is a set of encoding parameters in -th layer. In contrast, the decoding process is defined as with the decoding parameters , and the encoded representations .
To minimize the reconstruction error, we have
where is a weight decay term added to improve generalisation of the autoencoder and leverages the importance of this term. To avoid learning the identity mapping in the autoencoder, a regularisation term that penalizes over-activation of the nodes in the hidden layer is added111 is a sparsity parameter and is empirically set to in all our experiments.. is an averaged activation of all nodes in the hidden layer and is computed as: . Thus the objective of single channel is updated to:
where controls sparsity of representation in hidden layer.
Iii-B The SMCAE model
The structure of the SMCAE model is extended from an autoencoder so that it can simultaneously deal with tasks in both the left and right channels. Specifically, we use the notation to denote the configuration of input data (short for ) and reconstruction target at the output layer (short for ) in one channel of SMCAE. We thus label the tasks in the left and right channels of SMCAE as and individually, where and indicate the left and right channel branch of SMCAE. , stand for synthetic and real data respectively. The tasks in the two channels share the same parameters in all hidden layers which enforces the autoencoder to learn common structures of both tasks. At the output layer, we divide the SMCAE into two separate channels with their own parameters and .
Our target is to minimize the reconstruction error of the two tasks of SMCAE together while taking into account the balance between two channels. The new objective function of SMCAE is thus,
We add as a regularisation term to balance the learning rate between the two channels.
The regularization term of is a novel contribution of our SMCAE. Basically, penalizes a situation where the difference of learning errors between two channels are large. Since in the configuration of the SMCAE the data at the input and output end of two channels are not symmetric, the learning error resulted by optimizing learning process in two channels are very different. Having in our objective will prevent from a situation where the optimization of one channel dominates the entire SMCAE so as to help SMCAE to better leverage the learning process and find a compromising balance between two channels. For importance of in our objective, we show the learning results of setting different for in Fig. 7
The minimization of Eq. 6
is achieved by back propagation and stochastic gradient descent using a Quasi-Newton method – LBFGS. In the SMCAE, with balance regularization added to the objective, the only difference as opposed to sparse autoencoder is the gradient computation of unknown parametersand . We clarify these differences in the following equations:
We train a SMCAE in a greedy manner where one layer gets trained at a time. The configuration for training one layer of SMCAE is shown in Fig. 3(right). The output of a trained layer is then sent as input to the next layer for training. A fine-tuning is implemented to the entire stacked structure once all layers are trained. Thus, after SMCAE has been trained, to transform new synthetic data, the data is sent to the left channel of the SMCAE . We take output of this process as transformed synthetic data.
As shown in Fig. 4, we compare the SMCAE configuration to three alternative configurations: (1) SMCAE-II which places two separate channels on the structure, i.e. and . (2) Stacked autoencoder type-I (SAE-I) which merges the tasks in a single channel stacked autoencoder, with the configuration of :. (3) Stacked autoencoder type-II (SAE-II) which simply transforms source data to target data, and configures as: .
Compared with SAE-I and SAE-II, our two channel structures endow more flexibility. Critically, the single channel models force synthetic data to fit real data, which causes synthetic data to lose information and become less useful for recognition. In contrast, SMCAE can explore ‘characters’ and ‘patterns’ common in both synthetic and real data. Intrinsically, SMCAE first encodes both synthetic and real data into common hidden layers which model common information useful for recognition. Then the decoding process transforms the synthetic data to better simulate real data. Although SMCAE-II has the same two branches in the structure, it does not learn such transformation between synthetic data and real data.
Iv Experiments and Results
We first compare SMCAE on the challenging task of face-sketch recognition [31, 33] using the CUFSF dataset. We show that SMCAE is better than alternative configurations. To further validate the efficacy of our framework, we train SMCAE on handwritten digit images and generate synthetic data to simulate real images. We show that the synthetic data can help train classifiers for recognition.
Dataset. We conduct our experiments on two different datasets: (1) The CUFSF dataset [31, 33] containing the photos and sketches of people with lighting variations. We employ the standard split defined in [31, 33] which selects persons as the training set, and the remaining persons as the testing set. (2) handwritten digits dataset222collected from UCI machine learning repository (HWDUCI) . (HWDUCI) containing instances in total in which samples are used for training and samples are used for testing. The handwritten digits from to in this dataset are collected from people: contributed to the training set and the other to the test set. For all experiments, we empirically set the number of hidden layers in SMCAE to two and each layer has 1000 nodes. The same settings are used to make SMCAE, SMCAE-II, SAE-I and SAE-II more comparable.
Evaluation Metrics. We report the following metrics when they are available: (1) F1-score, which is defined as . (2)Receiving Operator Characteristic (ROC) curves and VR@0.1FAR which is the performance of Verification Rate (VR) at 0.1 False Acceptance Rate (FAR). VR@0.1
FAR is a standard evaluation metric and proposed in. (3) Rank-1 recognition accuracy.
Features.(1) Similar to , in the CUFSF dataset we use Histogram of Oriented Gradients (HOG). To further reduce the computational cost, the resolution of all photos and sketches is reduced to . So the cell size of HOG features is set to 3. (2)The HWDUCI dataset uses HOG features with cell size 3.
For CUFSF dataset, nearest-neighbor search with Euclidean metric is used in retrieving the most similar photo to the query sketch. In the handwritten digit classification, a Support Vector Machine (SVM) with RBF kernel333The parameters are cross-validated is used in the experiments.
Iv-a Results on the CUFSF dataset
In all experiments on this dataset, HOG features of sketch images are first transformed by the SMCAE and then used as queries. We first compare the results of photo-sketch matching using HOG feature transformed by SMCAE, SMCAE-II, SAE-I and SAE-II. The results are reported as ROC curve starting with VR@0.1FAR. The dissimilarity between a photo and a sketch is computed as the Euclidean distance between descriptors.
The ROC curves and VR@0.1FAR are shown in Fig. 5. Clearly, the proposed SMCAE achieves the highest results on AUC values and VR@0.1FAR accuracy and significantly outperforms the alternative configurations. Note that we also report the state-of-the-art approaches of VR@0.1FAR including LFDA , CITE  and classic eigenfaces(PCA). It is worth noting that in some of previous works, a better result could be obtained by combining multiple features. For example in 
, multiple CITE features generated by a random forest are used to batter matching photos and sketches. Here, to enable a comparison with more fairness, we focus our comparison on matching results obtained by using uncombined feature only.
There are several reasons why our SMCAE outperform the other approaches. First, compared with SMCAE-II, the configuration of SMCAE involves a task that handles the transformation from synthetic to real data, and thus better eliminates the distance between them. Second, compared with SAE-I, rather than merging two tasks in a single channel SMCAE employs two channels to better clarify each task with the aim of reconstructing the main ‘characters’ and ‘patterns’ co-existing in both tasks. Thus synthetic data can be more easily transformed to real data with less error. Finally, SMCAE is better than SAE-II as SMCAE learns features of real data in task . These features will better compensate the difference between synthetic data and real data during the transformation.
We further validate the results by using Rank-1 recognition accuracy which is also reported in [13, 30]. The results are shown in Fig. 6. The methods of [13, 30] are comparable to our SMCAE. Method  employed a discriminant common subspace to maximize the between-class variations and minimize the within-class variations. Method  used a structure composed of two autoencoders. As can be seen Fig. 6, the SMCAE outperforms all other methods.
Parameter Validation in Eq. 6. To validate the significance of in Eq. 6. We set with different values and report the rank-1 accuracy in Fig 7. Particularly, when is , it takes 2 times longer for SMCAE to converge compared with used in this work, Further with the rank-1 accuracy is dropped by more than . This validates the importance of term discussed in Sec. 3.2.
Qualitative results. Some qualitative results are shown in Fig. 8. It shows that a sketch HOG transformed by our SMCAE is more similar to the ground truth photo HOG.
Iv-B Handwritten Digit Recognition
Generating synthetic data. A synthetic version of each real character is generated as a variant of a centralized model learned from real characters. The centralized model of digit is shaped by control points settled on the boundary of the digit. A technique called migration is used to locate corresponding control points on each real digit image. A synthetic digit image then could be generated by filling areas closed by the control points 444Please refer to supplementary material for details.. Examples of generated synthetic digits are shown in Fig. 9
. To generate more synthetic data which is used to train the classifier once transformed by the trained SMCAE, we assume that locations of the control points follow a multivariate normal distributionwith and estimated using control points on the synthetic digit images. For each digit, new synthetic images are generated by randomly drawing samples from .
We compare our SMCAE with SMCAE-II, SAE-I, SAE-II, LeNet-5  and the best results  reported on this data set. The classification performance is evaluated by F1-score. A Support Vector Machine (SVM) classifier with RBF kernel is used in the experiments. For SMCAE, SMCAE-II, SAE-I and SAE-II in the test, real training data together with transformed synthetic data are used to train the SVM.
As shown in Fig. 10 (left), the SVM classifier with our SMCAE is better than all the alternative methods. This validates the effectiveness of our framework in generating synthetic data to better help training a classifier.
To further demonstrate how transformed synthetic data improve the classification results, we conducted more evaluations by training classifiers using different combinations of training sets in Fig. 10 (right). Particularly, four combinations of training sets are used. First, to have a performance baseline of SVM, we trained the SVM using real data only. To investigate how much improvement we could obtain in classification using a SVM trained by transformed synthetic data, we compare a SVM trained by synthetic data and transformed synthetic data respectively. The best performance is obtained with a SVM trained by real data together with transformed synthetic data.
With more synthetic training data generated by SMCAE, we gain a large margin of improvement in the classification. We notice that we can get the same result () by using Transformed synthetic and Real+Transformed Synthetic separately in Fig. 10 (right), which highlights the effectiveness of SMCAE in transforming synthetic data to simulate real data.
Finally, it is interesting to evaluate how the amount of synthetic data affects the classification results. We increasingly add more transformed synthetic data (from 300 to 3,300 samples) when training the SVM. The classification results are reported in Fig. 11. The curve shows an ascending trend when adding more samples, which means that all transformed synthetic data added to this test are highly effective and useful in the classification.
In this paper we identify the synthetic gap problem. To solve this problem, we propose a novel Stacked Multichannel autoencoder (SMCAE) model. SMCAE has multiple channels in its structure and is an extension of a standard autoencoder. We show that SMCAE not only bridges the synthetic gap between real data and synthetic data, but also jointly learns from both real and synthetic data.
-  Fevzi Alimoglu and Ethem Alpaydin. Combining multiple representations and classifiers for handwritten digit recognition. In ICDAR, 1997.
-  Fares Alnajar, Zhongyu Lou, Jose Alvarez, and Theo Gevers. Expression-invariant age estimation. In BMVC, 2014.
-  Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013.
-  Gungor Bal, Gady Agam, Ophir Frieder, and Gideon Frieder. Interactive degraded document enhancement and ground truth generation. In Electronic Imaging 2008. International Society for Optics and Photonics, 2008.
Autoencoders, unsupervised learning, and deep architectures.Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7, page 43, 2012.
-  Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Mach. Learn., 2010.
-  Yoshua Bengio. Learning deep architectures for ai. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
-  Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7, page 19, 2012.
-  Gunilla Borgefors. Distance transforms in digital images. In Computer Vision, Graphics, and Image Processing, volume 34, pages 344–371. Elsevier, 1986.
-  Minmin Chen, Zhixiang Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. In International Conference on Machine Learning, 2012.
-  Jun Deng, Zixing Zhang, Erik Marchi, and Bjorn Schuller. Sparse autoencoder-based feature transfer learning for speech emotion recognition. In Affective Computing and Intelligent Interaction (ACII), 2013.
-  Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.
-  Meina Kan, Shiguang Shan, Haihong Zhang, Shihong Lao, and Xilin Chen. Multi-view discriminant analysis. In ECCV, 2012.
-  Brendan F Klare, Zhifeng Li, and Anil K Jain. Matching forensic sketches to mug shot photos. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(3):639–646, 2011.
-  Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 2013.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Erik G Miller, Nicholas E Matsakis, and Paul A Viola.
Learning from one example through shared densities on transforms.
Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 1, pages 464–471. IEEE, 2000.
-  Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE TKDE, 2010.
-  Leonid Pishchulin, Arjun Jain, Christian Wojek, Mykhaylo Andriluka, Thorsten Thormählen, and Bernt Schiele. Learning people detection models from few training samples. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1473–1480. IEEE, 2011.
-  Adria Ruiz, Joost Van de Weijer, and Xavier Binefa. Regularized multi-concept mil for weakly-supervised facial behavior categorization. In BMVC, 2014.
-  Kanoksri Sarinnapakorn and Miroslav Kubat. Combining subclassifiers in text categorization: A dst-based solution and a case study. IEEE TKDE, 2007.
Nitish Srivastava and Ruslan R Salakhutdinov.
Multimodal learning with deep boltzmann machines.In Advances in neural information processing systems, pages 2222–2230, 2012.
-  Baochen Sun and Kate Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
-  Matthew A Turk and Alex P Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, pages 586–591. IEEE, 1991.
-  Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(2579-2605):85, 2008.
-  Tamás Varga and Horst Bunke. Effects of training set expansion in handwriting recognition using synthetic data. In In 11th Conf. of the International Graphonomics Society. Citeseer, 2003.
-  Tamás Varga and Horst Bunke. Comparing natural and synthetic training data for off-line cursive handwriting recognition. In Frontiers in Handwriting Recognition, 2004. IWFHR-9 2004. Ninth International Workshop on, pages 221–225. IEEE, 2004.
-  Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
-  Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2011.
-  Wen Wang, Zhen Cui, Hong Chang, Shiguang Shan, and Xilin Chen. Deeply coupled auto-encoder networks for cross-view classification. arXiv preprint arXiv:1402.2031, 2014.
-  Xiaogang Wang and Xiaoou Tang. Face photo-sketch synthesis and recognition. IEEE TPAMI, 2009.
-  Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In ICML, 2009.
-  Wei Zhang, Xiaogang Wang, and Xiaoou Tang. Coupled information-theoretic encoding for face photo-sketch recognition. In CVPR, pages 513–520. IEEE, 2011.
-  Fan Zhu, Ling Shao, and Jun Tang. Boosted cross-domain categorization. In British Machine Vision Conference, 2014.
Vi Optimization of SMCAE
With two branches in the SMCAE, we target to minimize the reconstruction error of two tasks together while taking into account the balance between two branches. The new objective function is given as:
is a regularization added to balance the learning rate between two branches. In the SMCAE, with balance regularization added to the objective, the only difference as opposed to sparse autoencoder is the gradient computation of unknown parameters and . We clarify these differences in the following equations:
The exact form of gradients of and varies according to different sparsity regularization used in the framework.
Vii Generating Synthetic Data
Synthetic data are created to highlight the potential useful pattern in real images. In the proposed approach, the synthetic data are represented as a parametric model of a set of control points and edges associated to these points in the images. From the control points, the synthetic images could be generated to simulate the real images in terms of having the same structure or a similar appearance. Initially, the control points are selected from a centralized prototype that generalize all images in the same class. Then the locations of the control points are iteratively optimized until convergence in order to minimize the distance between synthetic images generated by control points and the real image. We annotate the control points and edges associated to them as, where is the set of the control points, and is the set of edges connecting control points. A generalized algorithm of getting the best matching synthetic image is provided in Algorithm 1.
Vii-a Learning Synthetic Prototype from Data
In hand written digit dataset used in this work, we learn a centralized prototype from given data. A digit prototype is generated for all images with the same digit. Congealing algorithm proposed in  is employed in this step to produce the synthetic prototypes for digits. In congealing, the project transformations are applied to images to minimize a joint entropy. Thus the prototype is considered to be an average image of all images after congealing, shown in Fig. 12.
Then control points are evenly sampled from the boundary detected from the prototype image. The control points needs to be mapped to each digit image in order to generate a synthetic image. To find this mapping we implement an approach that migrates the control points from the prototype images to destination image.
This point migration algorithm is based on a series of intermediate images generated in between synthetic prototype and destination image. To generate the intermediate images, we binarize all the images and the distance transformed images
of the synthetic prototype and the real image are generated. Given the number of steps, an intermediate image then is generated as a binarized image of linear interpolation between two distance transformed images. In each step, the control points are snapped to the closest boundary pixels of the intermediate image. The algorithm of OptimizeControlPoints() in this situation is given in Algorithm 2, we fix the number of steps to in this algorithm. A step by step examples is given in Fig. 14. A zoom in example showing how control points moved from one digit to another is shown in Fig. 15.
To generate more synthetic digit images, We assume the distribution of control points on each digit image follows a multivariant normal distribution that where and are computed using existing control points. The visualization of the distribution of control points of each digit is then shown in Fig. 16.
Illustration of distributions of control points on each digit image, where colors from blue to red are used to represent the probability density from low to high.