1 Introduction
Every concept in the world can be presented in many different digital modalities, such as an image, a video, a sound clip, or a short piece of text. In recent years, many researchers in the multimedia field have been devoted to cross-domain generation, which aims to generate data of the same modality but with different representations or styles, for example, transferring a photo to an art painting. Other researchers try to generate cross-modal data, for example, generating an audio clip of violin sound from a video of violin playing. In general, a pair of cross-domain data shows different representations or styles of a concept, but the paired data still share a common shape structure. A pair of cross-modal data, on the other hand, usually has heterogeneous features with quite different distributions, and it is therefore much more challenging to model the relationship between cross-modal data.
Deep learning has made tremendous progress on style transfer and cross-domain generation. For example, the fully convolutional network (FCN) [1] was first proposed to achieve image-to-image transformation by using inverse convolutional layers; it can successfully process an input image into an output segmentation result. Isola et al. introduced Pix2Pix [2], which combines a conditional GAN with an L1/L2 distance loss on paired data to generate images with clearer visual content. Moreover, DiscoGAN [3], CycleGAN [4] and DualGAN [5] achieved unsupervised cross-domain transformation (i.e. the training process can be done without paired data) by using a cycle-consistency structure to find the common distribution. Coupled GAN [6] performed cross-domain paired data generation by coupling two GANs with shared weights. As for cross-modal generation, Reed et al. [7] successfully generated images from text by using a conditional GAN and a matching-aware discriminator. Chen et al. [8] used the same method to transfer between sound and images. All of the above cross-modal GANs are conditional; therefore, they can only perform one-directional transformation. That is, these GANs cannot generate a pair of synchronous cross-modal data simultaneously.
In this work, we propose a new GAN model named SyncGAN, which can learn a synchronous latent space representing the cross-domain or cross-modal data. The main contributions of this work are summarized as follows: (1) A synchronizer is introduced to estimate the synchronous probability and constrain the latent space of the generators using paired data without any class label. (2) Our SyncGAN model can successfully generate synchronous data from identical random noise. Moreover, the latent code of data in one modality can be recovered by inverting the mapping of the generator, and the corresponding data in another modality can be generated from that latent code. (3) The proposed SyncGAN can also achieve semi-supervised learning by using a small amount of training data labeled as synchronous/asynchronous, which makes our model more flexible for practical applications. To the best of our knowledge, SyncGAN is the first generative model that can perform synchronous data generation and bidirectional modality transformation.
2 Model description
2.1 Generative Adversarial Nets
Generative Adversarial Nets [9] are widely applied in generative models and have outstanding performance. The main concept of Generative Adversarial Nets is the adversarial relationship between two models: a generative model $G$ and a discriminative model $D$. The objective of $G$ is to capture the distribution of the training data, and $D$ estimates the probability that a sample came from the training data rather than from the generator $G$. With training data $x$ and sampled noise $z$, $D$ is trained to maximize $\log D(x) + \log(1 - D(G(z)))$ and $G$ is trained to minimize $\log(1 - D(G(z)))$. $D$ and $G$ play the following minimax game with the value function $V(D, G)$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
2.2 Synchronous Generative Adversarial Nets
Fig. 1 illustrates the architecture of the proposed SyncGAN model, which is built on two general GANs, ($G_1$, $D_1$) and ($G_2$, $D_2$). Let $i \in \{1, 2\}$ indicate the data modality: each generator $G_i$ is used to generate data of modality $i$ (i.e. $G_i(z_i)$), and each discriminator $D_i$ is used to estimate the probability that a sample is real data of modality $i$. Moreover, the two generators are connected to an additional network named the synchronizer ($S$), which is used to synchronize the latent spaces of the two generators. The details of each block are described as follows.
Discriminators: Our discriminators work in a similar way to those of general GANs. The objective of a discriminator $D_i$ is to estimate the probability that a sample came from the training data rather than from the generator $G_i$, by maximizing the loss function defined in Eq. 2.2. In this loss function, $x_i$ is the real data from the distribution $p_{data}(x_i)$, and $z_i$ is the latent vector of modality $i$. Both $z_1$ and $z_2$ are sampled from the same distribution $p_z$.
Synchronizer: The synchronizer takes data from different modalities as input. The objective of the synchronizer is to estimate the probability that two input data are synchronous (i.e. that they represent the same concept). We use synchronous and asynchronous real data to train the synchronizer by maximizing the loss function defined in Eq. 2.2. In this loss function, $id_1$ and $id_2$ denote the IDs of pairwise data in modality 1 and modality 2, respectively. Two synchronous data have the same ID, while two different IDs imply that the two data are asynchronous.
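One way the discriminator and synchronizer objectives described above could be written, assuming the standard log-likelihood form of the GAN loss. The names $\mathcal{L}_{D_i}$ and $\mathcal{L}_{S}$, and the conditioning on ID equality, are notation introduced here for illustration and are not necessarily the exact form of the original equations.

```latex
\begin{align*}
\mathcal{L}_{D_i} &= \mathbb{E}_{x_i \sim p_{data}(x_i)}\big[\log D_i(x_i)\big]
                   + \mathbb{E}_{z_i \sim p_z}\big[\log\big(1 - D_i(G_i(z_i))\big)\big], \\
\mathcal{L}_{S}   &= \mathbb{E}_{id_1 = id_2}\big[\log S(x_1, x_2)\big]
                   + \mathbb{E}_{id_1 \neq id_2}\big[\log\big(1 - S(x_1, x_2)\big)\big].
\end{align*}
```

Here $D_i$ and $G_i$ are the discriminator and generator of modality $i$, $S$ is the synchronizer, and $(x_1, x_2)$ is a pair of real data with IDs $id_1$ and $id_2$. Both terms are maximized: the first by the discriminator of each modality, the second by the synchronizer.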
Fig. 2 shows the structure of the synchronizer $S$. Note that we design different synchronizer structures for different tasks. For the cross-modal data generation tasks, each modality has its own network for feature extraction, and the extracted feature maps are concatenated as the input of a network composed of fully connected layers (Fig. 2a). For the style-transfer tasks, data from different domains are concatenated directly as the input of the synchronizer network (Fig. 2b).
Generators: We use the noise $z_i$ sampled from a normal distribution as the input of the generator $G_i$. The two generators are trained to capture the distributions of their target modalities. Like most existing GAN methods, the goal of the generators is to fool the discriminators by maximizing the loss function defined in Eq. 1. Moreover, to constrain the latent space of the generators, we also consider the synchronous loss defined in Eq. 2. To be more precise, the network tends to generate synchronous data when the same noise is used for $G_1$ and $G_2$, and asynchronous data when different noises are used for $G_1$ and $G_2$.
(1)
(2)
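A possible form of the two generator objectives referred to as Eq. 1 and Eq. 2, assuming the non-saturating log-likelihood GAN loss. The names $\mathcal{L}_{G_i}$ and $\mathcal{L}_{sync}$ are notation introduced here for illustration; the original expressions may differ in detail.

```latex
\begin{align*}
\mathcal{L}_{G_i}  &= \mathbb{E}_{z_i \sim p_z}\big[\log D_i(G_i(z_i))\big], \\
\mathcal{L}_{sync} &= \mathbb{E}_{z_1 = z_2}\big[\log S(G_1(z_1), G_2(z_2))\big]
                    + \mathbb{E}_{z_1 \neq z_2}\big[\log\big(1 - S(G_1(z_1), G_2(z_2))\big)\big].
\end{align*}
```

Under this reading, the generators maximize both terms. The first rewards samples that the discriminator of modality $i$ accepts as real. The second rewards pairs that the synchronizer judges synchronous when the two generators share the same noise, and asynchronous when their noises differ.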
Note that in the training process, if we only consider the data sampled with $z_1 = z_2$, the mode collapse issue will be serious. It is important to also sample data with $z_1 \neq z_2$. In this work, the ratio of identical and distinct pairs used for training the synchronizer is 0.5.
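The sampling of identical and distinct noise pairs can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_latent_pairs` and the parameter `identical_frac` are hypothetical, and the value 0.5 is read here as "half of the pairs share the same noise", which is one interpretation of the stated ratio.

```python
import numpy as np

def sample_latent_pairs(batch_size, latent_dim, identical_frac=0.5, rng=None):
    """Sample noise for two generators, with a synchronous flag per pair.

    Returns (z1, z2, is_sync), where is_sync[k] == 1.0 means z1[k] and z2[k]
    are the same vector, and 0.0 means they were drawn independently.
    """
    rng = np.random.default_rng() if rng is None else rng
    z1 = rng.standard_normal((batch_size, latent_dim))
    z2 = rng.standard_normal((batch_size, latent_dim))  # independent draw

    # Assumption: identical_frac is the fraction of pairs sharing one noise vector.
    n_same = int(round(batch_size * identical_frac))
    is_sync = np.zeros(batch_size, dtype=np.float32)
    is_sync[:n_same] = 1.0
    z2[:n_same] = z1[:n_same]

    # Shuffle so identical and distinct pairs are mixed within the batch.
    perm = rng.permutation(batch_size)
    return z1[perm], z2[perm], is_sync[perm]
```

The flag `is_sync` would serve as the target of the synchronizer when it scores the generated pair.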
2.3 Semi-supervised Training
Based on the loss functions defined in Eq. 1 and Eq. 2, we can train the synchronizer and the discriminators independently. In other words, SyncGAN can achieve semi-supervised learning by using a small amount of training data labeled as synchronous/asynchronous to learn the synchronous concept, and a large amount of training data without such labels to learn the data distribution. During an iteration of the training stage, $z_1$ and $z_2$ are sampled from a normal distribution, and unpaired data $x_1$ and $x_2$ are used to compute the losses related to the data distribution, i.e., the adversarial losses of the discriminators and the generators. We then construct another training batch by concatenating synchronous/asynchronous latent vectors and data labeled as synchronous/asynchronous to compute the synchronization-related losses, i.e., the loss of the synchronizer and the synchronous loss of the generators. Finally, we update the network parameters of each part. Please refer to Alg. 1 for the overall training procedure.
3 Experiments
To validate the proposed SyncGAN model, we conducted experiments on several datasets: MNIST [10], Fashion-MNIST [11], UT Zappos50K [12][13] and an instrument dataset that we collected ourselves. We performed cross-modal and cross-domain data generation on these datasets; the details are described in Section 3.1 and Section 3.2, respectively. We also used our model to transfer data between different modalities/domains, and the results are shown in Section 3.3. Section 3.4 evaluates the synchronous rate of our model.
3.1 Cross-modal Generation
MNIST and Fashion-MNIST: The MNIST and Fashion-MNIST datasets each contain ten classes of 28x28 grayscale images, of handwritten digits and clothing items, respectively. Since the digits and the clothing items do not have a similar structure, we can consider the two datasets as two different modalities even though both are image datasets. We used 30000 pairs of synchronous data from these two datasets for the experiments. Following Table 1, each pair is obtained by sampling one image from a class of the MNIST dataset and another image from the class with the same index in the Fashion-MNIST dataset. Fig. 3(a) shows that our SyncGAN model successfully generated image pairs of corresponding digits and clothing items. Note that our model only needs to know the correspondence between two data rather than the exact class labels.
Image and audio of instruments: We also conducted experiments on a cross-modal dataset containing images and audio clips of 5 kinds of instruments: violin, trumpet, tuba, clarinet and sax. Each kind of instrument has 250 images and corresponding audio clips. Every image was cropped and normalized to 64x64 pixels. For the audio data, we randomly clipped 512 samples from the wave file and downsampled these samples to 128 values. We further transformed the 1D sequence into a 2D form of dimension 64x128, as illustrated in Fig. 4. Fig. 5 shows the paired cross-modal data generated by our SyncGAN model. The generated audio waves are well synchronized with the generated images; the synchronous rate is reported in Section 3.4.
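The clipping and downsampling step for the audio data can be sketched as follows. This is only an illustration under stated assumptions: the function name `clip_and_downsample` is hypothetical, and averaging each block of 4 consecutive samples is an assumed downsampling method, since the text does not say which one was used. The later 1D-to-2D transform of Fig. 4 is not covered.

```python
import numpy as np

def clip_and_downsample(wave, clip_len=512, out_len=128, rng=None):
    """Randomly clip `clip_len` samples from a 1D waveform and reduce them to `out_len` values."""
    rng = np.random.default_rng() if rng is None else rng
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1 or wave.shape[0] < clip_len:
        raise ValueError("wave must be 1D with at least clip_len samples")
    if clip_len % out_len != 0:
        raise ValueError("clip_len must be a multiple of out_len")

    # Random start position such that the clip fits inside the waveform.
    start = rng.integers(0, wave.shape[0] - clip_len + 1)
    clip = wave[start:start + clip_len]

    # Assumed downsampling: mean over blocks of clip_len // out_len samples.
    step = clip_len // out_len
    return clip.reshape(out_len, step).mean(axis=1)
```

With the default arguments, each output value summarizes 4 consecutive input samples.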
Class Index   MNIST   Fashion-MNIST
C0            0       T-shirt/top
C1            1       Trouser
C2            2       Pullover
C3            3       Dress
C4            4       Coat
C5            5       Sandal
C6            6       Shirt
C7            7       Sneaker
C8            8       Bag
C9            9       Ankle boot
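The construction of synchronous pairs from the class correspondence in the table can be sketched as follows. This is a minimal sketch, not the authors' data pipeline: the function name `make_synchronous_pairs` and its arguments are hypothetical. Class labels are used only to build the pairs; the model itself would see only which items are paired.

```python
import numpy as np

def make_synchronous_pairs(labels_a, labels_b, n_pairs, rng=None):
    """Return index arrays (idx_a, idx_b) such that labels_a[idx_a] == labels_b[idx_b]."""
    rng = np.random.default_rng() if rng is None else rng
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)

    # Pick items of the first dataset uniformly at random.
    idx_a = rng.integers(0, len(labels_a), size=n_pairs)
    idx_b = np.empty(n_pairs, dtype=np.int64)

    # For each class drawn, pick partners of the same class index in the second dataset.
    for c in np.unique(labels_a[idx_a]):
        mask = labels_a[idx_a] == c
        candidates = np.flatnonzero(labels_b == c)
        if candidates.size == 0:
            raise ValueError(f"class {c} has no sample in the second dataset")
        idx_b[mask] = rng.choice(candidates, size=int(mask.sum()))
    return idx_a, idx_b
```

Asynchronous pairs for the synchronizer could be obtained the same way by drawing partners from a different class index.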
3.2 Cross-domain Generation
Our SyncGAN model achieves good performance on cross-modal generation. In addition, it can also be applied to cross-domain generation for data with similar structures. We rotated 30000 images in the MNIST dataset to create the image data of another domain. As shown in Fig. 6, our model successfully generated synchronous pairs of original and rotated MNIST images. We also applied SyncGAN to the shoe images/sketches of the UT Zappos50K dataset [2]. 20000 pairs of cross-domain data were used to train SyncGAN, and Fig. 7 shows the synchronous generation of shoe sketches/photos.
3.3 Modality and Domain Transfer
For the tasks of cross-modal and cross-domain transfer, we follow the method proposed by Lipton et al. [14], which reconstructs latent vectors by performing gradient descent over the components of the latent representations (Eq. 3).
(3) 
With the reconstructed latent vector and the generator of the other modality/domain, we can achieve bidirectional cross-modal or cross-domain transfer. Please refer to Fig. 8 to Fig. 10.
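The transfer procedure can be illustrated with a toy example. This is a sketch under strong assumptions: two random linear maps `W1` and `W2` stand in for the trained generators of the two modalities, and the recovery objective is taken to be the squared error between the generated sample and the target, which is an assumed form of Eq. 3. With real networks the gradient would come from automatic differentiation rather than a closed form.

```python
import numpy as np

def recover_latent(W, x, steps=2000, lr=None):
    """Gradient descent on ||W z' - x||^2 over z' for a toy linear generator G(z) = W z."""
    if lr is None:
        # Step size below 1/L, where L = 2 * sigma_max(W)^2 bounds the gradient's Lipschitz constant.
        lr = 0.5 / (2.0 * np.linalg.norm(W, 2) ** 2)
    z = np.zeros(W.shape[1])
    for _ in range(steps):
        residual = W @ z - x
        z -= lr * (2.0 * W.T @ residual)  # gradient of the squared error w.r.t. z'
    return z

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 4))  # toy stand-in for the generator of modality 1
W2 = rng.standard_normal((12, 4))  # toy stand-in for the generator of modality 2

z_true = rng.standard_normal(4)
x1 = W1 @ z_true                   # observed sample in modality 1

z_rec = recover_latent(W1, x1)     # invert the first generator
x2_transferred = W2 @ z_rec        # generate the counterpart in modality 2
```

Because both generators share one latent space, feeding the recovered code to the second generator yields the counterpart of `x1` in the other modality. Swapping the roles of `W1` and `W2` gives the reverse direction.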
3.4 Evaluation of Synchronous Rate
For each cross-modal dataset, we trained a classifier for each modality based on data with concept labels. Given a cross-modal data pair generated by our SyncGAN, we examined whether the data pair is synchronous, i.e. whether the two classifiers output the same label. The synchronous rate is defined as the number of synchronous data pairs divided by the total number of generated data pairs. Fig. 11(a) shows the synchronous rate for the MNIST and Fashion-MNIST datasets under different semi-supervised rates, where the semi-supervised rate is defined as the number of paired data divided by the total number of data. The batch size in these experiments is 128. We observed that a semi-supervised rate of 0.4 achieves almost the same performance as supervised training (i.e. a semi-supervised rate of 1.0). Fig. 11(b) shows the synchronous rate on the image and audio clip dataset. The result shows that our model can be applied to real cross-modal data of image and sound, with a synchronous rate of about 0.7. The batch size in this experiment is 64.
4 Conclusions and Future work
Cross-domain GANs adopt several special mechanisms, such as cycle-consistency and weight-sharing, to extract the common structure of cross-domain data automatically. However, such a common structure does not exist between most cross-modal data due to the heterogeneous gap. Therefore, the model needs paired information to relate the different structures of data from various modalities that represent the same concept. In this paper, we present a novel network named synchronizer, which can constrain the latent space of the generators in the GANs. Our SyncGAN model can successfully generate synchronous cross-modal/cross-domain data from identical random noise and perform transformation between different modalities/domains. Our model can generate synchronous image and audio data of instruments, and can also transfer data between these two modalities. However, each audio clip contains only 512 samples, which is around 0.01 seconds. In the future, we will apply our model to sequential data such as longer audio clips and text.
References

[1] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[2] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-image translation with conditional adversarial networks,” CVPR, 2017.
[3] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim, “Learning to discover cross-domain relations with generative adversarial networks,” arXiv preprint arXiv:1703.05192, 2017.
[4] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.
[5] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” arXiv preprint arXiv:1704.02510, 2017.
[6] Ming-Yu Liu and Oncel Tuzel, “Coupled generative adversarial networks,” in Advances in Neural Information Processing Systems, 2016, pp. 469–477.
[7] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee, “Generative adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016.
[8] Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, and Chenliang Xu, “Deep cross-modal audio-visual generation,” arXiv preprint arXiv:1704.08292, 2017.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[11] Han Xiao, Kashif Rasul, and Roland Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
[12] A. Yu and K. Grauman, “Fine-grained visual comparisons with local learning,” in Computer Vision and Pattern Recognition (CVPR), Jun 2014.
[13] A. Yu and K. Grauman, “Semantic jitter: Dense supervision for visual comparisons via synthetic images,” in International Conference on Computer Vision (ICCV), Oct 2017.
[14] Zachary C Lipton and Subarna Tripathi, “Precise recovery of latent vectors from generative adversarial networks,” arXiv preprint arXiv:1702.04782, 2017.