
SyncGAN: Synchronize the Latent Space of Cross-modal Generative Adversarial Networks

by Wen-Cheng Chen, et al.

Generative adversarial networks (GANs) have achieved impressive success on cross-domain generation, but they face difficulty in cross-modal generation due to the lack of a common distribution between heterogeneous data. Most existing conditional cross-modal GANs adopt a strategy of one-directional transfer and have achieved preliminary success on text-to-image transfer. Instead of learning the transfer between different modalities, we aim to learn a synchronous latent space representing the cross-modal common concept. A novel network component named synchronizer is proposed in this work to judge whether paired data are synchronous (corresponding) or not, which constrains the latent space of the generators in the GANs. Our GAN model, named SyncGAN, can successfully generate synchronous data (e.g., a pair of image and sound) from identical random noise. To transform data from one modality to another, we recover the latent code by inverting the mapping of a generator and use it to generate data of the other modality. In addition, the proposed model supports semi-supervised learning, which makes it more flexible for practical applications.





1 Introduction

Every concept in the world can be presented in many different digital modalities such as an image, a video, a clip of sound, or a short text. In recent years, many researchers in the multimedia field have been devoted to cross-domain generation, which aims to generate data of the same modality but with different kinds of representations/styles, for example, transferring a photo to an art painting. Moreover, some researchers try to generate cross-modal data, for example, generating an audio clip of violin sound from a video of violin playing. In general, a pair of cross-domain data show different representations/styles of a concept, but the paired data still have a common shape structure. On the other hand, a pair of cross-modal data usually have heterogeneous features with quite different distributions, and therefore it is much more challenging to model the relationship between cross-modal data.

Deep learning has made tremendous progress on style transfer and cross-domain generation. For example, the fully convolutional network (FCN) [1] was first proposed to achieve image-to-image transformation by using inverse convolutional layers. It can successfully process an input image into an output segmentation result. Isola et al. introduced Pix2Pix [2], which combines a conditional GAN and an L1/L2 distance loss on paired data to generate images with clearer visual content. Moreover, DiscoGAN [3], CycleGAN [4] and DualGAN [5] achieved unsupervised cross-domain transformation (i.e. the training process can be done without paired data) by using a cycle-consistency structure to find the common distribution. Coupled GAN [6] performed cross-domain paired data generation by coupling two GANs with shared weights. As for cross-modal generation, Reed et al. [7] successfully generated images from text by using a conditional GAN and a matching-aware discriminator. Chen et al. [8] used the same method to transfer between sound and images. All of the above cross-modal GANs are conditional; therefore, they can only perform one-directional transformation. That is, these GANs cannot generate a pair of synchronous cross-modal data simultaneously.

In this work, we propose a new GAN model named SyncGAN, which can learn a synchronous latent space representing the cross-domain or cross-modal data. The main contributions of this work are summarized as follows: (1) A synchronizer is introduced to estimate the synchronous probability and constrain the latent space of generators using paired data without any class label. (2) Our SyncGAN model can successfully generate synchronous data from identical random noise. Moreover, the latent code of data in one modality can be recovered by inverting the mapping of the generator, and the corresponding data in another modality can be generated based on the latent code. (3) The proposed SyncGAN can also achieve semi-supervised learning by using a small amount of training data annotated as synchronous/asynchronous, which makes our model more flexible for practical applications. To the best of our knowledge, SyncGAN is the first generative model which can perform synchronous data generation and bidirectional modality transformation.

2 Model description

Figure 1: The structure of the proposed SyncGAN model.

2.1 Generative Adversarial Nets

Generative Adversarial Nets [9] are widely applied in generative models and have outstanding performance. The main concept of the Generative Adversarial Nets is the adversarial relationship between two models: a generative model $G$ and a discriminative model $D$. The objective of $G$ is to capture the distribution of training data, and $D$ estimates the probability that a sample came from the training data rather than the generator $G$. With training data $x$ and sampled noise $z$, $D$ is used to maximize $\log D(x)$ and $G$ is for minimizing $\log(1 - D(G(z)))$. $D$ and $G$ play the following minimax game with the value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
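As a rough illustration of the minimax value function of [9], the sketch below estimates it by Monte Carlo averaging over samples. The one-dimensional `toy_generator` and `toy_discriminator` are hypothetical stand-ins chosen for illustration only; they are not the networks used in this work.

```python
import math
from typing import Callable, Sequence


def value_function(
    real: Sequence[float],
    noise: Sequence[float],
    D: Callable[[float], float],
    G: Callable[[float], float],
) -> float:
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in real) / len(real)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in noise) / len(noise)
    return real_term + fake_term


def toy_generator(z: float) -> float:
    """Hypothetical generator (assumption): a fixed shift of the noise."""
    return z + 0.5


def toy_discriminator(x: float) -> float:
    """Hypothetical discriminator (assumption): logistic score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(x - 1.0)))
```

A discriminator that outputs 0.5 for every input yields a value of $-2\log 2$, which is the value reached at the equilibrium described in [9].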

2.2 Synchronous Generative Adversarial Nets

Fig. 1 illustrates the architecture of the proposed SyncGAN model, which is constructed based on two general GANs ($G_1$, $D_1$) and ($G_2$, $D_2$). Let $m \in \{1, 2\}$ indicate the data modality; each generator $G_m$ is used to generate data of modality $m$ (i.e. $G_m(z_m)$) and each discriminator $D_m$ is used to estimate the probability that a sample is real data of modality $m$. Moreover, the two generators are connected to an additional network named synchronizer ($S$), which is used to synchronize the latent space of the two generators. The details of each block are described as follows.

Discriminators: Our discriminators work in a similar way as the general GANs. The objective of a discriminator $D_m$ is to estimate the probability that a sample came from the training data rather than the generator $G_m$ by maximizing the loss function defined in Eq. 1. In this loss function, $x_m$ is the real data from the distribution $p_{data}(x_m)$, and $z_m$ is the latent vector of the modality $m$. Both $z_1$ and $z_2$ are sampled from the same distribution $p_z$.

Synchronizer: The synchronizer takes the data from different modalities as input. The objective of the synchronizer is to estimate the probability that two input data are synchronous (i.e. representing the same concept). We use synchronous and asynchronous real data to train the synchronizer by maximizing the loss function defined in Eq. 2. In this loss function, $i$ and $j$ denote the IDs of pairwise data in modality 1 and modality 2, respectively. Two synchronous data will have the same ID, while two different IDs imply that the two data are asynchronous.
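The ID-based labelling of real pairs can be sketched as follows. This is a minimal sketch that assumes the synchronizer objective is a binary log-likelihood (target 1 when the two IDs match, 0 otherwise); the exact loss of this work is not reproduced here, and `sync_prob` is a hypothetical stand-in for the synchronizer network.

```python
import math
from typing import Callable, Sequence, Tuple

# Each sample is (pair ID, data); the data type is left abstract here.
Sample = Tuple[int, object]


def synchronizer_objective(
    batch: Sequence[Tuple[Sample, Sample]],
    sync_prob: Callable[[object, object], float],
    eps: float = 1e-12,
) -> float:
    """Average log-likelihood of the synchronous/asynchronous targets.

    Assumed form (not taken from the paper): log p for pairs with equal IDs,
    log(1 - p) for pairs with different IDs. Larger is better.
    """
    total = 0.0
    for (id1, x1), (id2, x2) in batch:
        # Clamp the predicted probability so that the logarithm stays finite.
        p = min(max(sync_prob(x1, x2), eps), 1.0 - eps)
        # Same ID -> synchronous pair; different IDs -> asynchronous pair.
        total += math.log(p) if id1 == id2 else math.log(1.0 - p)
    return total / len(batch)
```

Only the equality of the two IDs is used, so no class label is needed to build the training targets.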

Figure 2: Structure of the synchronizer. (a) is for cross-modal tasks and (b) is for style-transfer tasks.

Fig. 2 shows the structure of the synchronizer. Note that we design different synchronizer structures for different tasks. For the cross-modal data generation tasks, each modality has its own network for feature extraction, and the extracted feature maps are concatenated as the input of a network composed of fully connected layers (Fig. 2a). For the style-transfer tasks, data from different domains are concatenated directly as the input of the synchronizer network (Fig. 2b).
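The two wiring patterns of Fig. 2 can be sketched with plain functions. Everything below is illustrative: the feature extractors and the single linear unit built by `make_linear_sigmoid` are hypothetical placeholders for the real sub-networks, whose layer configurations are not given in the text.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Head = Callable[[Sequence[float]], float]


def make_linear_sigmoid(weights: Sequence[float], bias: float) -> Head:
    """Hypothetical stand-in for the fully connected part: one unit + sigmoid."""
    def head(features: Sequence[float]) -> float:
        if len(features) != len(weights):
            raise ValueError("feature length does not match the weights")
        score = sum(w * f for w, f in zip(weights, features)) + bias
        return 1.0 / (1.0 + math.exp(-score))
    return head


def cross_modal_synchronizer(
    x1: Vector,
    x2: Vector,
    extract1: Callable[[Vector], Vector],
    extract2: Callable[[Vector], Vector],
    head: Head,
) -> float:
    """Fig. 2(a) pattern: per-modality feature extraction, then concatenation."""
    features = list(extract1(x1)) + list(extract2(x2))
    return head(features)


def style_transfer_synchronizer(x1: Vector, x2: Vector, network: Head) -> float:
    """Fig. 2(b) pattern: the two inputs are concatenated directly."""
    return network(list(x1) + list(x2))
```

Separate extractors let the two inputs have different shapes (e.g. an image and an audio clip), whereas direct concatenation relies on both domains sharing a compatible layout.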

Generators: We use the noise $z_m$ sampled from a normal distribution as the input of the generator $G_m$. Two generators are trained to capture the distribution of their target modalities. Like most existing GAN methods, the goal of the generators is to fool the discriminators by maximizing the loss function defined in Eq. 1. Moreover, to constrain the latent space of generators, we should also consider the synchronous loss defined in Eq. 2. To be more precise, the network tends to generate synchronous data when the same noise is used for $G_1$ and $G_2$. In addition, asynchronous data would be generated when different noises are used for $G_1$ and $G_2$.

Algorithm 1 Training the SyncGAN model
  repeat
     // Data distribution loss
     // Synchronous loss
     // Update network parameters
  until convergence

Note that in the training process, if we only consider the data sampled with $z_1 = z_2$, the mode collapse issue will be serious. It is important to also sample data with $z_1 \neq z_2$. In this work, the ratio of identical and distinct pairs used for training the synchronizer is 0.5.
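A batch that mixes identical and distinct noise pairs could be drawn as in the sketch below. It assumes that a ratio of 0.5 means half of the pairs share the same noise vector; the function name `sample_noise_pairs` and the returned target flag are illustrative choices, not part of the original algorithm.

```python
import random
from typing import List, Optional, Tuple

Vector = List[float]


def sample_noise_pairs(
    batch_size: int,
    dim: int,
    sync_fraction: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[Tuple[Vector, Vector, int]]:
    """Return (z1, z2, target) triples; target is 1 when z1 and z2 are identical."""
    rng = rng or random.Random()
    n_sync = round(batch_size * sync_fraction)
    batch = []
    for k in range(batch_size):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if k < n_sync:
            # Identical noise: the generators should produce a synchronous pair.
            z2, target = list(z1), 1
        else:
            # Independent noise: the pair is treated as asynchronous.
            z2, target = [rng.gauss(0.0, 1.0) for _ in range(dim)], 0
        batch.append((z1, z2, target))
    rng.shuffle(batch)  # avoid a fixed ordering of the two kinds of pairs
    return batch
```

For example, `sample_noise_pairs(128, 8)` yields 64 identical pairs and 64 distinct pairs in random order.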

2.3 Semi-supervised Training

Based on the loss functions defined in Eq. 1 and Eq. 2, we can train the synchronizer and discriminator independently. In other words, the SyncGAN can achieve semi-supervised learning by using a small amount of training data annotated as synchronous/asynchronous to learn the synchronous concept and a large amount of training data without such annotation to learn the data distribution. During an iteration of the training stage, $z_1$ and $z_2$ are sampled from a normal distribution, and unpaired data $x_1$ and $x_2$ are used to compute the loss related to data distribution (Eq. 1). We then construct another training batch by concatenating synchronous/asynchronous latent vectors and data annotated as synchronous/asynchronous to compute the synchronous loss (Eq. 2). Finally, we update the network parameters of each part. Please refer to Alg. 1 for the overall training procedure.
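One way to set up such a split is sketched below, under the assumption that a random fraction of the pair IDs keeps its synchronous/asynchronous information while the rest is treated as unpaired. The helper `split_by_pairing` and its parameter names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_by_pairing(
    pair_ids: Sequence[int],
    paired_fraction: float,
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Split pair IDs into (paired, unpaired) subsets.

    IDs in the first list keep their pairing information and can feed the
    synchronous loss; IDs in the second list are used only for the data
    distribution loss.
    """
    if not 0.0 <= paired_fraction <= 1.0:
        raise ValueError("paired_fraction must lie in [0, 1]")
    shuffled = list(pair_ids)
    rng.shuffle(shuffled)
    n_paired = int(len(shuffled) * paired_fraction)
    return shuffled[:n_paired], shuffled[n_paired:]
```

With 10 IDs and a fraction of 0.4, four IDs retain their pairing information and six are treated as unpaired.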

3 Experiments

To validate the proposed SyncGAN model, we conducted experiments on several datasets: MNIST [10], Fashion-MNIST [11], UT Zappos50K [12][13] and an instrument dataset collected by ourselves. We perform cross-modal and cross-domain data generation on these datasets and the details are described in Section 3.1 and Section 3.2, respectively. We also used our model to transfer data between different modalities/domains and the results are shown in Section 3.3. Section 3.4 evaluates the synchronous rate of our model.

3.1 Cross-modal Generation

MNIST and Fashion-MNIST: The MNIST and Fashion-MNIST datasets contain ten image classes with 28x28 grayscale images of handwritten digits and clothes, respectively. Since the digits and the clothes do not have similar structure, we can consider the two datasets as two different modalities even though they are both image datasets. We used 30000 pairs of synchronous data from these two datasets for experiments. According to Table 1, each pair is obtained by sampling an image from a class of the MNIST dataset and another image from the class with the same index in the Fashion-MNIST dataset. Fig. 3(a) shows that our SyncGAN model successfully generated image pairs of corresponding digits and clothing items. Note that our model only needs to know the correspondence between two data rather than exact class labels.
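The pairing rule of Table 1 can be sketched as follows, assuming the class labels of both datasets are available as integer lists while the pairs are being built. The function `make_synchronous_pairs` is a hypothetical helper; the labels are used only to construct the pairs and are not passed to the model.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def make_synchronous_pairs(
    labels_1: Sequence[int],
    labels_2: Sequence[int],
    n_pairs: int,
    rng: random.Random,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) such that labels_1[i] == labels_2[j]."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels_2):
        by_class[label].append(index)
    # Keep only samples of the first dataset whose class exists in the second.
    usable = [i for i, label in enumerate(labels_1) if label in by_class]
    if not usable:
        raise ValueError("the two label sets share no class")
    pairs = []
    for _ in range(n_pairs):
        i = rng.choice(usable)
        pairs.append((i, rng.choice(by_class[labels_1[i]])))
    return pairs
```

Each returned pair shares a class index, so the pair itself carries the correspondence without exposing the class label.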

Image and audio of instruments: We also conducted experiments on a cross-modal dataset containing images and audio clips of 5 kinds of instruments: violin, trumpet, tuba, clarinet and sax. Each kind of instrument has 250 images and corresponding audio clips. Every image was cropped and normalized to 64x64 pixels. For audio data, we randomly clipped 512 samples from the wave file and down-sampled these samples to 128 values. We further transformed the 1D sequence to a 2D form with dimension of 64x128 as illustrated in Fig. 4. Fig. 5 shows the results of paired cross-modal data generated by our SyncGAN model. The generated audio waves are well synchronized with the generated images, and we report the synchronous rate in Section 3.4.
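The clipping and down-sampling step can be sketched as below. The text does not say how the 512 samples are reduced to 128 values, so the sketch assumes plain decimation (keeping every fourth sample); the subsequent 1D-to-2D transformation of Fig. 4 is not covered. The function name `clip_and_downsample` is illustrative.

```python
import random
from typing import List, Optional, Sequence


def clip_and_downsample(
    wave: Sequence[float],
    clip_len: int = 512,
    out_len: int = 128,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Cut a random window of clip_len samples and keep out_len of them."""
    if len(wave) < clip_len:
        raise ValueError("wave is shorter than the clip length")
    if clip_len % out_len != 0:
        raise ValueError("clip_len must be a multiple of out_len")
    rng = rng or random.Random()
    start = rng.randrange(len(wave) - clip_len + 1)
    step = clip_len // out_len  # 4 for the sizes quoted in the text
    # Assumption: decimation by a fixed stride, without low-pass filtering.
    return list(wave[start:start + clip_len:step])
```

Averaging each group of four samples would be an equally plausible reading of "down-sampled" and is a one-line change.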

Table 1: Corresponding class of MNIST and Fashion-MNIST

Class Index    MNIST    Fashion-MNIST
C0             0        T-shirt/top
C1             1        Trouser
C2             2        Pullover
C3             3        Dress
C4             4        Coat
C5             5        Sandal
C6             6        Shirt
C7             7        Sneaker
C8             8        Bag
C9             9        Ankle boot

Figure 3: Synchronous generation of MNIST and Fashion-MNIST.

Figure 4: Illustration of the audio processing procedure.

Figure 5: Synchronous generation of images and sound for instruments.

Figure 6: Synchronous generation of original and rotated images for the MNIST dataset.

Figure 7: Synchronous generation of sketches and photos for shoes.

3.2 Cross-domain Generation

Our SyncGAN model can achieve good performance on cross-modal generation. In addition, it can also be applied to cross-domain generation for data with similar structures. We rotated 30000 images in the MNIST dataset by a fixed angle to create the image data in another domain. As shown in Fig. 6, our model successfully generated synchronous pairs of original and rotated images for the MNIST dataset. We also applied SyncGAN to the shoe images/sketches of the UT Zappos50K dataset [2]. 20000 pairs of cross-domain data were used to train the SyncGAN, and Fig. 7 shows the synchronous generation of shoe sketches/photos.

3.3 Modality and Domain Transfer

For the tasks of cross-modal and cross-domain transfer, we follow the method proposed by Lipton et al. [14], which reconstructs latent vectors by performing gradient descent over the components of the latent representations (Eq. 3).


With the reconstructed latent vector and the generator of another modality/domain, we can successfully achieve bidirectional cross-modal or cross-domain transfer. Please refer to Fig. 8 - Fig. 10.
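The transfer procedure can be illustrated on a toy problem. The sketch below assumes a squared reconstruction error as the recovery objective and uses two hypothetical linear generators (weight matrices `w1` and `w2` in the usage note) so that the gradient can be written by hand; real generators are deep networks and would need automatic differentiation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Toy linear generator: G(z) = m @ z."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def recover_latent(w: Matrix, x: Vector, steps: int = 500, lr: float = 0.05) -> Vector:
    """Gradient descent on ||G(z) - x||^2 over the latent vector z."""
    z = [0.0] * len(w[0])
    for _ in range(steps):
        residual = [g - t for g, t in zip(matvec(w, z), x)]
        # Gradient of the squared error with respect to z: 2 * w^T * residual.
        grad = [
            2.0 * sum(w[r][c] * residual[r] for r in range(len(w)))
            for c in range(len(z))
        ]
        z = [zc - lr * gc for zc, gc in zip(z, grad)]
    return z
```

Usage: with `w1 = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]` as the generator of the source modality and `w2 = [[2.0, 0.0], [0.0, 1.0]]` as the generator of the target modality, `matvec(w2, recover_latent(w1, x1))` maps a source sample `x1` to its counterpart in the target modality. Swapping the roles of the two generators gives the transfer in the opposite direction.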

3.4 Evaluation of Synchronous Rate

For each cross-modal dataset, we trained a classifier for each modality based on data with concept labels. Given a cross-modal data pair generated by our SyncGAN, we examined whether the data pair is synchronous, i.e. whether the two classifiers output the same label. The synchronous rate is defined as the number of synchronous data pairs divided by the total number of generated data pairs. Fig. 11(a) shows the synchronous rate for the MNIST dataset and the Fashion-MNIST dataset with different semi-supervised rates, where the semi-supervised rate is defined as the number of paired data divided by the number of total data. The batch size in these experiments is 128. We observed that using a semi-supervised rate of 0.4 can achieve almost the same performance as supervised training (i.e. using a semi-supervised rate of 1.0). Fig. 11(b) shows the synchronous rate for the image and audio clip datasets. The result shows that our model can be applied to real cross-modal data of image and sound, with a synchronous rate of about 0.7. The batch size in this experiment is 64.
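The synchronous rate defined above reduces to a simple agreement count. In the sketch below, `labels_1` and `labels_2` are assumed to hold the labels predicted by the two per-modality classifiers for the same sequence of generated pairs; the function name is illustrative.

```python
from typing import Sequence


def synchronous_rate(labels_1: Sequence[int], labels_2: Sequence[int]) -> float:
    """Fraction of generated pairs whose two predicted labels agree."""
    if len(labels_1) != len(labels_2):
        raise ValueError("both label sequences must have the same length")
    if not labels_1:
        raise ValueError("at least one generated pair is required")
    matches = sum(a == b for a, b in zip(labels_1, labels_2))
    return matches / len(labels_1)


# Three of the four generated pairs receive matching labels.
print(synchronous_rate([0, 1, 2, 3], [0, 1, 5, 3]))  # prints 0.75
```

Because the measure depends on the two classifiers, their own errors also count against the reported rate.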

4 Conclusions and Future work

Cross-domain GANs adopt several special mechanisms such as cycle-consistency and weight-sharing to extract the common structure of cross-domain data automatically. However, such a common structure does not exist between most cross-modal data due to the heterogeneous gap. Therefore, the model needs paired information to relate the different structures of data from various modalities that represent the same concept. In this paper, we present a novel network named synchronizer, which can constrain the latent space of generators in the GANs. Our SyncGAN model can successfully generate synchronous cross-modal/cross-domain data from identical random noise and perform transformation between different modalities/domains. Our model can generate synchronous image and audio data of instruments, and can also transfer data between these two modalities. However, the number of samples for each audio clip is only 512, which is around 0.01 secs. In the future, we will apply our model to sequential data such as longer audio clips and text.

Figure 8: Bidirectional cross-modal transfer of MNIST and Fashion-MNIST.

Figure 9: Bidirectional cross-modal transfer of images and sound for instruments.

Figure 10: Bidirectional cross-modal transfer of sketches and photos for shoes.

Figure 11: Synchronous rates for (a) MNIST and Fashion-MNIST and (b) images and sounds.