It is widely accepted in the cognitive science community that infants begin learning their native language not by learning words, but by discovering the correlations between the speech signal and visual information. Infants know some aspects of their language by 6–12 months, although they do not understand common native-language words until 12 months. When communicating with their parents, infants receive only continuous speech signals from the parents and visual signals from their surroundings, and yet they can learn the correlation between high-frequency spoken words and objects or local visual textures. It is therefore interesting to explore whether a machine can translate speech signals into images directly, without the help of words. Translating data between different modalities has recently become an active research area. However, speech-to-image translation has not been well studied, whereas the similar topic of text-to-image translation has been investigated in recent literature [56, 55, 53]. Besides, many languages have no written form, which calls for approaches that understand and visualize speech directly. There are also potential applications in human-computer interaction, art creation and computer-aided design, where speech is the natural input and an intermediate text representation is not necessary. Exploring speech-to-image translation is therefore both necessary and meaningful.
As illustrated in Fig. 1, given the raw speech description “this bird has a red head and a white tail”, the corresponding images can be synthesized, which means that the machine has understood the speech signal to some extent and is able to translate the semantic information in the speech signal into an image.
Speech and images are different modalities, and the modality gap between them makes direct speech-to-image translation nontrivial. Text-to-image translation [44, 56, 55, 53], a closely related topic, has been investigated for several years. In some text-to-image models, zero-shot learning based methods [44, 40] and generative adversarial networks (GANs) have been used to extract features and synthesize realistic images, respectively. These models generalize better to new testing classes by leveraging teacher-student learning to train the text encoder [44, 45]. Compared with text-to-image translation, speech-to-image translation may be more challenging because speech signals are continuous, unaligned and noisy. In addition to text-to-image translation, several models for audio-to-image generation have been presented in recent years. Chen et al. and Hao et al. synthesized instrument images from different music inputs; Oh et al. and Duarte et al. reconstructed human face images from input speech based on the positive correlation between a person’s appearance and voice, and both of their frameworks contain a speech encoder and a face decoder. Unlike these audio-to-image generation works, which mainly model acoustic or phonetic information, our speech-to-image translation aims to model the linguistic information in the input speech and translate it into images. Recent works on audio-visual correlation learning [23, 24] have shown that it is feasible to learn the correlation between spoken descriptions and the objects or local textures in images, which forms the basis of our speech-to-image translation task.
In other areas of speech processing, such as speech-to-speech translation and speech keyword search, recent works [34, 5] have attempted to translate or search speech without the help of transcribed text, indicating that it is feasible to understand the linguistic information in speech without an intermediate text representation.
Inspired by these related works, we design a model to extract features from speech data and train it via teacher-student learning. In particular, the speech signal is first represented as a low-dimensional embedding feature by a speech encoder; this feature is then fed into a conditional generative adversarial network as the condition, and the generator synthesizes the corresponding image with semantic consistency. To the best of our knowledge, our work is the first attempt to translate speech signals into images without the help of text. Our method outperforms both the straightforward “two-stage” method and the classifier-based method, and even achieves performance comparable to the text-to-image models on the synthesized datasets. Experiments on real speech data also show its potential for real application scenarios such as human-computer interaction.
The main contributions of this work can be summarized as follows:
We propose a framework to translate the speech signals into images directly. Experiments on the synthesized data and real data demonstrate the effectiveness of our proposed framework.
We train the speech encoder via teacher-student learning that transfers the knowledge in a pretrained image encoder into the speech encoder. Experiments on the synthesized data show that our method can learn the semantic information in the speech descriptions better than the previous classifier-based method, and provide better translation results.
An ablation study on the loss items, image scales and feature interpolation gives more insight into our method and the speech-to-image translation problem.
The rest of this paper is organized as follows. Section II briefly reviews related works on generative adversarial networks, text-to-image translation, audio-to-image generation, audio-visual correlation learning, teacher-student learning, and direct speech translation; Section III presents our speech-to-image model in detail; Section IV introduces and analyzes the experimental results on both synthesized and real data; and Section V conducts the ablation study. Finally, Section VI concludes the paper.
II Related Works
In this section, we review the related works on generative adversarial networks, text-to-image translation, audio-to-image generation, audio-visual correlation learning, teacher-student learning, and direct speech translation.
II-A Generative Adversarial Networks
Generative adversarial networks (GANs) have drawn much attention since they were presented by Goodfellow et al., owing to their ability to generate high-dimensional data, e.g. images. In this model, the generator aims to generate fake data that cannot be distinguished from the real data, while the discriminator aims to differentiate the generated fake data from the real data. The generator $G$ and the discriminator $D$ are optimized jointly via their respective loss functions [16, 55]. Training is a two-player zero-sum game that seeks a local Nash equilibrium, at which neither the discriminator nor the generator can decrease its respective loss. The generator learns a mapping from a noise distribution (e.g. uniform or Gaussian) to the real data distribution (e.g. images or text). When synthesizing images via GANs, attributes, text descriptions, sketches, or images in another style have been used as conditions to control the appearance of the generated images.
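To make the adversarial objective concrete, the following minimal sketch computes a discriminator loss and a generator loss from discriminator output probabilities. It assumes the standard cross-entropy form with a non-saturating generator loss; the function names and the toy probabilities are illustrative assumptions, not the original implementation.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Mean of -log D(x) over real samples plus mean of -log(1 - D(G(z))) over fakes."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss (an assumed variant): mean of -log D(G(z))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs, each a probability in (0, 1).
d_real = [0.9, 0.8]   # D(x) for real images
d_fake = [0.2, 0.1]   # D(G(z)) for generated images
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

When the discriminator cannot tell real from fake and outputs 0.5 everywhere, the two losses in this sketch take the values 2 log 2 and log 2, respectively.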
II-B Text-to-Image Translation
Text-to-image translation aims to synthesize images that are semantically consistent with the input text descriptions. It is challenging because of the modality gap between text and images. The computer vision and machine learning communities paid little attention to this problem until Reed et al. used a GAN to synthesize images conditioned on a low-dimensional representation extracted from the text description. Following Reed’s work, StackGAN and StackGAN-v2 were proposed to generate high-resolution photo-realistic images from text descriptions via a pretrained text encoder. Multi-scale discriminators for increasing resolutions were used in StackGAN and StackGAN-v2 to generate images progressively, because synthesizing high-resolution images in one stage had been shown to be difficult. Besides, spatial attention was applied to text-to-image translation by training a multimodal similarity model to calculate the similarity between the word embedding features and the local image features. With text encoders trained by teacher-student learning, these text-to-image translation models generalize well to new testing classes.
II-C Audio-to-Image Generation
Based on the correlation between audio and images, such as music and instruments or human voices and facial appearance, audio-to-image generation aims to generate images paired with the input audio signals. Chen et al. first attempted to generate instrument images from music by leveraging a classifier-based feature extractor and a GAN, but a separate model is needed to generate music from images. Hao et al. proposed a unified framework using a cycle constraint for visual-audio mutual generation. Recently, Oh et al. presented a model that generates face images from a voice using a pretrained face decoder, but the generated results are not sharp owing to privacy concerns. Duarte et al. generated sharp face images conditioned on input speech segments using GANs. Unlike these audio-visual generation works, speech-to-image translation aims to capture the linguistic information in the speech signals and generate images that are semantically consistent with the input speech descriptions.
II-D Audio-Visual Correlation Learning
Audio-visual correlation learning aims to learn a joint embedding feature space over both audio (e.g. music, speech, natural sound) and visual (e.g. images, videos) data using an embedding alignment model. Building on prior work on text embedding, Harwath et al. first investigated this task, aligning visual objects and speech signals with a region convolutional neural network (RCNN) and a spectrogram convolutional neural network. Furthermore, Harwath et al. used vision as an interlingual semantic embedding of unaligned audio, without linguistic transcriptions or conventional speech recognition technologies. Recently, Harwath et al. [23, 24] operated directly on image pixels and speech waveforms to associate segments of spoken audio captions with portions of natural images, without relying on any labels, segmentation information or alignment between these modalities. In addition to learning an embedding feature space for speech and images, our work further generates images from the speech embeddings.
II-E Teacher-Student Learning
Teacher-student learning is a transfer learning approach in which a pretrained teacher model is used to “teach” a student model. It is widely used in model compression [6, 30] and domain adaptation. Reed et al. first used GoogLeNet pretrained on ImageNet as the teacher network to learn deep representations for zero-shot tasks. Following this work, several text-to-image models [45, 56, 55] were proposed that use the same teacher-student learning method to train the text encoder. Recently, teacher-student learning was used to generate the face behind a voice. By comparison, traditional audio-to-image generation models used a classifier as the feature extractor. In our experiments, we compare the teacher-student learning method with the classifier-based method and show that teacher-student learning performs better.
II-F Direct Speech Translation
Speech-to-speech translation is one of the most challenging tasks in the speech processing and machine learning communities, and it has tremendous practical value in daily life. A speech translation system typically has three components: automatic speech recognition (ASR), machine translation (MT) and text-to-speech synthesis (TTS). Using text as an intermediate representation to divide this difficult task into three stages has been the standard approach for several decades. Recently, with the development of deep learning, which shows great potential for modeling complicated data distributions, researchers have attempted to solve speech translation without the intermediate text representation. Bérard et al. first attempted to build an end-to-end speech-to-text translation system without text transcription and showed comparable performance on synthesized data. Subsequently, Duong et al. introduced an attentional model for speech-to-speech translation without speech-to-text transcription, showing its superiority on low-resource languages. Recently, Jia et al. proposed a sequence-to-sequence model that directly translates speech into speech in another language; on two Spanish-to-English speech translation datasets, it only slightly underperforms a baseline that cascades a direct speech-to-text translation model and a text-to-speech synthesis model. These results demonstrate that extracting semantic information from raw speech signals without an intermediate text representation is practical.
Motivated by the different principles for understanding speech signals, we design a framework to translate speech signals into images directly, without the help of an intermediate text representation. Specifically, a speech encoder is designed to encode the raw speech signals into a low-dimensional embedding feature. The speech encoder is trained via teacher-student learning with a pretrained image encoder as the teacher. Subsequently, the speech embedding features are fed into a generator to synthesize images with semantic consistency. Experimental results on both synthesized and real data demonstrate that our proposed model is capable of translating speech into images without an intermediate text representation.
III Speech-to-Image Model
The modality gap between speech signals and images makes it infeasible to regress pixel values directly from speech signals. Inspired by common text-to-image architectures [45, 56, 55, 53], we design a speech encoder to encode the speech signals into a low-dimensional embedding feature. This embedding feature is then used to synthesize the corresponding images with semantic consistency. The diagram of the proposed algorithm is shown in Fig. 2.
III-A Speech Encoder
The input raw speech signal is first represented as a time-frequency spectrogram, which is then encoded by our speech encoder into a low-dimensional embedding feature using convolutional neural networks (CNNs) and recurrent neural networks (RNNs). In ASR, the speech spectrogram is typically modeled with an RNN [17, 19, 1]; however, the long input spectrograms in our setting are not convenient for RNN optimization. Inspired by the character-based text embedding architecture and audio-visual cross-modal embedding learning, we insert a multi-layer CNN before the RNN to reduce the signal length, as shown in Fig. 2. Given a speech signal and its spectrogram, the speech encoder maps the spectrogram to the speech embedding.
The input time-frequency spectrogram is first normalized along the frequency axis in the first layer of the CNN, and the output of the CNN is then encoded as a 1024-dimensional embedding feature by an RNN. The input length of the RNN can be variable, with the only constraint that the input length of the CNN be longer than 64, because the model shortens the input sequence by a factor of 64.
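The length constraint can be illustrated with a small sketch. It assumes, purely for illustration, that the 64-fold reduction comes from six successive stride-2 stages (2^6 = 64) with floor rounding; the actual layer configuration is not specified here, and `encoded_length` is a hypothetical helper.

```python
def encoded_length(num_frames, num_stages=6, stride=2):
    """Sequence length seen by the RNN after the CNN has shortened the input.

    Assumption: the reduction by 64 is realized as six stride-2 stages,
    each halving the length with floor rounding.
    """
    reduction = stride ** num_stages  # 2 ** 6 = 64
    if num_frames <= reduction:
        raise ValueError("input must be longer than %d frames" % reduction)
    length = num_frames
    for _ in range(num_stages):
        length //= stride
    return length


# Roughly 10 s of speech at a 10 ms frame shift gives about 1000 frames.
print(encoded_length(1000))
```

Under these assumptions, any input longer than 64 frames leaves at least one time step for the RNN, which is why shorter inputs are rejected.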
III-B Teacher-Student Learning
Optimizing our speech encoder is not trivial because the ground truth of the speech embedding is not available. Although the class label of the speech description is accessible, there may be a generalization problem when the trained model is tested on unseen data (classes that do not appear in the training set). Inspired by cross-modal generation models [44, 56, 22, 24, 41], we use teacher-student learning in this work to mitigate this problem to some extent.
Given an image and its speech description, an image encoder and a speech encoder are used to represent the image and the speech, respectively, as low-dimensional embedding features. The speech encoder takes the time-frequency spectrogram of the speech description as input. It is worth mentioning that the image encoder is pretrained on a large dataset, such as ImageNet, and is kept fixed while the speech encoder is trained. In our model, the pretrained image encoder is the “teacher” and the speech encoder is the “student”. The goal of teacher-student learning is to optimize the student model so that it learns a feature space similar to the teacher’s; the dimension of the student’s feature space should therefore be the same as the teacher’s. In our model, GoogLeNet is used as the teacher model to represent the input image as a 1024-dimensional feature. As a result, our speech encoder also encodes the input time-frequency spectrogram into a 1024-dimensional feature.
III-C Generative Network
The generative network is used to synthesize images conditioned on the speech embedding feature. Following recent works on text-to-image translation [56, 55, 53], we use a stacked conditional GAN, also known as StackGAN-v2, to synthesize the images because of its promising performance in generating photo-realistic images. As illustrated in Fig. 2, three branches in the generator synthesize images at three increasing resolutions, and three discriminators distinguish the generated images at the corresponding resolutions from the real images. Each upsampling block in the generator contains an upsampling layer and two residual blocks to synthesize details based on the input low-resolution image. Given the condition and an input noise vector sampled from a Gaussian distribution, the stacked generative network synthesizes the fake images. In our model, the embedding feature of the input speech (see Fig. 2) is used as the condition for the generator.
The whole model is trained in two steps. First, the speech encoder is optimized with the image encoder. Second, the generator is trained conditioned on the speech embedding representations extracted by the pretrained speech encoder.
III-D1 Training the Speech Encoder
The speech encoder is trained via teacher-student learning, which transfers knowledge from a large model into a small model. In our model, the knowledge in the pretrained image encoder needs to be transferred into our speech encoder. We could train the speech encoder by optimizing a norm loss alone; however, training the student network with norm optimization alone is slow and unstable. To stabilize and accelerate the training, additional loss items are introduced. Taking inspiration from text-image embedding learning and audio-to-image generation models, we combine a norm loss, a joint embedding loss (JEL) and a knowledge distilling loss (KDL) in our teacher-student loss (TSL) for training the speech encoder. The objective between the image encoder and the speech encoder is defined over a batch of triplets, each consisting of an image, its speech description (represented by its spectrogram) and the class label shared by the two.
Two hyper-parameters fuse the three items. Following prior work, they are tuned so that the gradient magnitudes of the three items with respect to the speech embedding are on a similar scale at an early training iteration. In the margin-based item, one margin is used for pairs with different class labels in a batch and another for pairs with the same class label, controlling the inter-class and intra-class distances, respectively. For the norm loss, we optimize the 1-norm between the embedding feature of the input speech description and the embedding representation of the input image.
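As a rough illustration of how such a loss could be assembled, the sketch below combines a 1-norm item with a margin-based item over cross pairs in a batch. The exact JEL and KDL formulas are not reproduced here: the hinge form of `margin_loss`, the default weights and margins, and the externally supplied `distill_term` are all assumptions made for this example.

```python
def l1_distance(a, b):
    """1-norm of the difference between two embedding vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def norm_loss(speech_emb, image_emb):
    """Mean 1-norm distance between paired speech and image embeddings."""
    pairs = list(zip(speech_emb, image_emb))
    return sum(l1_distance(s, v) for s, v in pairs) / len(pairs)


def margin_loss(speech_emb, image_emb, labels, margin_diff, margin_same):
    """Assumed hinge form: push different-class cross pairs apart,
    pull same-class cross pairs together."""
    total, count = 0.0, 0
    for i, s in enumerate(speech_emb):
        for j, v in enumerate(image_emb):
            if i == j:
                continue  # matched pairs are handled by the norm item
            d = l1_distance(s, v)
            if labels[i] == labels[j]:
                total += max(0.0, d - margin_same)   # intra-class distance
            else:
                total += max(0.0, margin_diff - d)   # inter-class distance
            count += 1
    return total / count if count else 0.0


def teacher_student_loss(speech_emb, image_emb, labels, distill_term=0.0,
                         w_margin=1.0, w_distill=1.0,
                         margin_diff=1.0, margin_same=0.1):
    """Weighted sum of the three items; weights and margins are placeholders."""
    return (norm_loss(speech_emb, image_emb)
            + w_margin * margin_loss(speech_emb, image_emb, labels,
                                     margin_diff, margin_same)
            + w_distill * distill_term)


# Toy batch: two perfectly matched pairs from different classes.
speech = [[0.0, 0.0], [0.5, 0.0]]
image = [[0.0, 0.0], [0.5, 0.0]]
print(teacher_student_loss(speech, image, labels=[0, 1]))
```

In this toy batch the norm item is zero because the pairs match exactly, so the remaining loss comes only from the two classes lying closer together than the assumed inter-class margin.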
III-D2 Training the Generator
Following prior work, the generator is trained with a multi-scale triplet loss (MSTL), in which each discriminator’s loss includes three items: a conditional item, an unconditional item and a wrong-pair item. The loss is defined on triplets that include a wrong image, i.e. an image that belongs to a different class from the paired image, and the images are provided at three scales.
The loss for each discriminator thus has three items that model both the conditional and the unconditional distributions. By contrast, the traditional conditional GAN contains only the conditional item and does not take the unconditional distribution into account.
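The three items can be sketched for one discriminator as follows. This is a simplified illustration that assumes a binary cross-entropy form and a discriminator exposing both conditional and unconditional probabilities; the dictionary keys and helper names are hypothetical, and the sketch is not the exact loss used in training.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one probability and a 0/1 target."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # guard against log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def discriminator_scale_loss(out):
    """Loss of one discriminator (one image scale) with three items."""
    # Conditional item: (real image, speech embedding) vs (fake image, speech embedding).
    conditional = bce(out["cond_real"], 1) + bce(out["cond_fake"], 0)
    # Unconditional item: real image vs fake image, ignoring the condition.
    unconditional = bce(out["uncond_real"], 1) + bce(out["uncond_fake"], 0)
    # Wrong-pair item: a real image of another class paired with this embedding.
    wrong_pair = bce(out["cond_wrong"], 0)
    return conditional + unconditional + wrong_pair


def multi_scale_loss(outputs_per_scale):
    """Sum the per-discriminator losses over all scales."""
    return sum(discriminator_scale_loss(o) for o in outputs_per_scale)


# Hypothetical outputs of an undecided discriminator at three scales.
undecided = {"cond_real": 0.5, "cond_fake": 0.5, "uncond_real": 0.5,
             "uncond_fake": 0.5, "cond_wrong": 0.5}
print(multi_scale_loss([undecided] * 3))
```

Dropping the unconditional and wrong-pair items from `discriminator_scale_loss` would reduce it to a loss with only the conditional item, as in the traditional conditional GAN.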
In the inference phase, the time-frequency spectrograms of the input speech descriptions are encoded as low-dimensional embedding features by the speech encoder. The generator then synthesizes images conditioned on these embedding features and a noise vector, and the synthesized images are semantically consistent with the input speech descriptions.
III-F Implementation Details
The raw speech signals are represented as log Mel filter bank spectrograms, following prior work. Specifically, the DC component of each audio signal is removed via mean subtraction, followed by pre-emphasis filtering and computation of the short-time Fourier transform (STFT) using a 25 ms Hamming window with a 10 ms shift. The squared magnitude spectrum of each frame is then taken, and the log energies in each of 40 Mel filter bands are computed. As a result, we obtain spectrograms with 40 Mel bands and a variable number of frames. When training the speech encoder, we use GoogLeNet as the image encoder. The loss hyper-parameters are set after tuning them to balance the gradients with respect to the speech embedding feature.
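The first preprocessing steps can be sketched in plain Python as follows. The 16 kHz sample rate and the 0.97 pre-emphasis coefficient are assumptions (neither value is stated above), and the STFT and Mel filter bank stages are omitted; only DC removal, pre-emphasis and windowed framing are shown.

```python
import math


def remove_dc(signal):
    """Remove the DC component by subtracting the mean."""
    mean = sum(signal) / len(signal)
    return [x - mean for x in signal]


def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, sample_rate=16000, win_ms=25, shift_ms=10):
    """Split a signal into overlapping Hamming-windowed frames."""
    win = sample_rate * win_ms // 1000      # 400 samples at the assumed 16 kHz
    shift = sample_rate * shift_ms // 1000  # 160 samples at the assumed 16 kHz
    window = hamming(win)
    frames = []
    for start in range(0, len(signal) - win + 1, shift):
        frames.append([signal[start + n] * window[n] for n in range(win)])
    return frames


# One second of a hypothetical 440 Hz tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = frame_signal(pre_emphasis(remove_dc(tone)))
print(len(frames), len(frames[0]))
```

Each frame would then go through an FFT, squared magnitude, a 40-band Mel filter bank and a logarithm to give one column of the spectrogram.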
IV Experimental Results and Analyses
In this section, we verify the effectiveness of the proposed model for translating speech signals into images without an intermediate text representation and show how well it performs. First, we generate images from synthesized speech data to demonstrate the effectiveness of our model and compare it with the straightforward “two-stage” method, the traditional classifier-based method, and text-to-image models. Second, experiments on real data are conducted to explore our model’s robustness to real noise and its potential for real application scenarios. Finally, an ablation study on the loss items, image scales, and feature interpolation gives more insight into our model.
IV-A Datasets and Metrics
Datasets. Several datasets, such as the Places 205 dataset, Caltech-UCSD Birds 200-2011 (CUB-200), Oxford Flower with 102 categories (Oxford-102) and Microsoft COCO, are used in audio-visual correlation learning or text-to-image translation. Harwath et al. used the Places 205 dataset with speech descriptions [23, 25] to learn the association between spoken audio caption segments and portions of natural images. Reed et al. used CUB-200 to learn deep representations of visual text descriptions. Most investigations of text-to-image translation [45, 56, 55] conducted their experiments on the CUB-200, Oxford-102 or COCO datasets. However, these datasets contain only text descriptions and have no speech labels, which makes it difficult to use them off the shelf for our experiments. Fortunately, with the development of deep-learning-based text-to-speech technologies [3, 4, 43] and large-scale labeled speech datasets [27, 42], we can use mature commercial text-to-speech systems, such as the Baidu TTS engine (https://cloud.baidu.com/product/speech/tts) or Microsoft Bing TTS (https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/), to synthesize large-scale, high-quality speech from the text descriptions. The synthesized speech data are continuous and unaligned, just like real speech, although they have no background noise or speaker variance.
To compare our model with text-based models, we test it on the CUB-200 and Oxford-102 datasets, with speech descriptions synthesized from the text captions in both datasets by the Baidu TTS engine. As shown in Table I, the CUB-200 dataset contains 11788 bird images classified into 200 categories, with one bounding box and 10 text descriptions per image. The Oxford-102 dataset has 8189 flower images classified into 102 classes. Following prior work, we crop all the images in the CUB-200 dataset to ensure that the bounding boxes have object-image size ratios greater than 0.75. We also conduct experiments on a subset of the Places 205 dataset [23, 25] with real speech descriptions to explore the potential for real applications.
Evaluation Metrics. Generally, it is difficult to evaluate generative models fairly. As in [45, 56, 55], we use the inception score and the Fréchet inception distance [12, 28] to evaluate our models quantitatively. In their standard form, they are formulated as
$$\mathrm{IS} = \exp\left(\mathbb{E}_{x}\, D_{KL}\left(p(y|x)\,\|\,p(y)\right)\right), \qquad \mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right),$$
where $x$ is a generated image, $y$ is the label predicted by the Inception model, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features of the real and generated images, respectively.
Inception score (IS) is a metric for both image quality and diversity, and it has been found to correlate well with human evaluation. The conditional label distribution $p(y|x)$ measures the quality of the generated images: the images are of high quality when this distribution has low entropy. The marginal distribution $p(y)$ measures diversity: the generated images are more diverse when this distribution has high entropy. By considering the two jointly, the KL-divergence between $p(y|x)$ and $p(y)$ evaluates both image quality and diversity. When calculating IS in our experiments, we use the Inception-v3 model finetuned on the CUB-200 or Oxford-102 dataset, following prior work. Fréchet inception distance (FID) measures the distance between the generated and real data; a lower FID means that the generated and real data distributions are more similar. To calculate the FID between the generated and real images, we use all speech labels in the testing set to generate a large number of images (e.g. 30k for CUB-200, 11k for Oxford-102). When calculating FID in our experiments, we use the Inception-v3 model pretrained on the ImageNet dataset, following prior work.
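The interplay between quality and diversity in IS can be seen in a small numerical sketch. It assumes the standard definition of the score as the exponential of the mean KL-divergence between the per-image label distribution and the marginal label distribution; the toy probability vectors stand in for classifier outputs and are not real model predictions.

```python
import math


def inception_score(probs):
    """Inception score from per-image class distributions.

    probs: list of probability vectors, one per generated image,
    each summing to 1 (hypothetical classifier outputs).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal label distribution: average prediction over all images.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        # KL divergence between this image's distribution and the marginal.
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)


confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident_and_diverse), inception_score(uninformative))
```

With two classes, confident predictions spread evenly over both classes reach the maximum score of 2, while uninformative predictions give the minimum score of 1.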
IV-B Experimental Results on Synthesized Data
To demonstrate the effectiveness of our model, experiments are conducted on synthesized speech data for the CUB-200 and Oxford-102 datasets. The synthesized speech data are continuous and unaligned, although they are well normalized and less noisy than real speech. The CUB-200 dataset contains 200 classes in total, of which 150 are used to train our model and the rest are used as the testing set. The Oxford-102 dataset contains 102 classes in total, of which 82 are used for training and 20 for testing. The statistics of the training/testing split of these two datasets are shown in Table I. The split into training and testing sets follows prior work. The qualitative and quantitative results are shown in Fig. 3 and Table II, respectively. Several conclusions can be drawn from the results:
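Table I: Training/testing split of the datasets.
| Dataset | Training set | Testing set | Total |
| CUB-200 | 150 classes | 50 classes | 200 classes, 11788 images |
| Oxford-102 | 82 classes | 20 classes | 102 classes, 8189 images |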
IV-B1 Our model can synthesize images semantically consistent with the input speech
Qualitative results on the CUB-200 and Oxford-102 testing sets are shown in Fig. 3. For each row, we show the waveform of the input speech description (left) and 8 images (right) synthesized with different input noises conditioned on the same input speech embedding. It is worth mentioning that the transcriptions are shown only for readability and are not used in our model. The results show that our model can generate realistic images from the input speech description even though the input speech is continuous and unaligned. Moreover, the visual information in the synthesized images is mostly in accordance with the semantic information in the speech descriptions. The results generated for the CUB-200 dataset (rows 1–5 in Fig. 3) show that the input speech description controls the color of different parts of the generated bird, such as the head, feathers and tail. In comparison, the background, pose and even the shape of the generated bird change when the input noise is different. This means that our model has disentangled the bird’s color from the background and pose to some extent. A similar conclusion can be drawn from the results on the Oxford-102 dataset (rows 5–10 in Fig. 3). The input speech description mainly controls the color of the generated flowers, while the background, size and even the categories of the generated flowers differ when the input noise changes. This demonstrates that our model has learned the semantic information in the input speech descriptions to some extent and visualized it in the images.
IV-B2 Our model is better than the “two-stage” model with text
To compare our “one-stage” method (translating speech to images without text) with the “two-stage” method (using text as an intermediate representation), we train a “two-stage” model on the CUB-200 and Oxford-102 datasets. In our experiments, we use a pretrained ASR model, DeepSpeech (https://github.com/mozilla/DeepSpeech) [18, 2], to transcribe the speech into text, and use the (text, image) paired data to train a text encoder and a generator following prior work. For a fair comparison, the generator and its training hyper-parameters in the “two-stage” method are the same as in our model. The results of the “two-stage” method on CUB-200 and Oxford-102 are shown in the 1st and 6th rows of Table II, respectively. The “two-stage” model is slightly inferior to our “one-stage” model on both datasets. On the CUB-200 dataset, our “one-stage” model performs better on FID, although the IS of the “two-stage” method is comparable to ours. Similarly, on the Oxford-102 dataset, our “one-stage” method surpasses the “two-stage” method on FID, while the IS scores of the two models are the same. We attribute the worse performance of the “two-stage” method to word errors in speech recognition.
It is worth mentioning that DeepSpeech may not be a strong baseline in our experiments, even though it has millions of parameters and is trained on a large dataset, because our speech data are synthesized whereas DeepSpeech is trained on real speech. The word error rates (WERs) of DeepSpeech on our datasets are high, showing that DeepSpeech does not perform well on our synthesized speech data. To address this problem, in the following we compare our proposed model with text-to-image models, which can be viewed as “two-stage” frameworks with an ideal ASR model (zero WER), to evaluate the performance of our model.
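For reference, WER is conventionally computed as the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below assumes that conventional definition; the function name and the example hypothesis are illustrative and are not actual DeepSpeech output.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "this bird has a red head and a white tail"
hypothesis = "this bird has red head and a wide tail"  # hypothetical ASR output
print(word_error_rate(reference, hypothesis))
```

A transcript identical to the reference gives a WER of zero, which corresponds to the ideal ASR model assumed when text-to-image models are treated as the upper bound of the “two-stage” approach.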
IV-B3 Our method is comparable to the text-to-image methods
To compare with the upper bound of the “two-stage” method, we compare our method with text-to-image methods, which can be seen as “two-stage” methods with a perfect ASR model. Two text-to-image methods [56, 55], whose structures are similar to our model’s, are used in the experiments, as shown in Table II. On the CUB-200 dataset, our model is slightly inferior to StackGAN-v2 on FID but performs better on IS. We believe the reason is that our model generates more diverse results because the speech signal is higher-dimensional than the text signal. Compared with StackGAN-v1, our model performs better by a large margin because we use a stronger generator. On the Oxford-102 dataset, our proposed model surpasses StackGAN-v1 on both FID and IS, although it surpasses StackGAN-v2 only slightly.
This result demonstrates that our model performs close to the upper bound of the “two-stage” models, and that our speech encoder has extracted the semantic information in the input speech descriptions into the embedding feature, which is subsequently used as the condition in the generator, just as in the text-to-image models.
IV-B4 Our teacher-student learning method is better than the classifier-based method
To compare our teacher-student learning method with previous classifier-based methods, we train our speech encoder via the classifier-based method rather than teacher-student learning. Specifically, we use the speech encoder in our model as a feature extractor, add a linear layer after it as the classifier, and train this classifier-based model on the training set with the cross-entropy loss. At test time, only the feature extractor is used. In this way, the number of parameters and the hyper-parameters of the classifier-based method and our proposed method are the same; the only difference is the training method. The results are shown in the 2nd and 7th rows of Table II. On the CUB-200 dataset, our model performs better than the classifier-based method on both IS and FID, indicating that our model generalizes better when using the same structure and parameters. On the Oxford-102 dataset, the FID shows that our model is better, whereas the IS leads to a different conclusion. IS takes both image quality and diversity into account, so these results indicate that, on the Oxford-102 dataset, our model generates data that are closer to the real data but less diverse than those of the classifier-based method. Consequently, compared with the classifier-based method, our proposed method shows better FID and comparable IS on both the CUB-200 and Oxford-102 datasets, owing to its teacher-student learning.
IV-C Experimental Results on Real Data
In addition to the synthesized speech data, we also evaluate our model on real speech data to assess its potential for real applications. The Places Audio Captions dataset [25, 24], collected via Amazon Mechanical Turk (AMT), is a real speech dataset of visual descriptions for the Places 205 dataset. To evaluate the robustness of our model on real data, we use a subset of this dataset (Places-Subset), which includes 13803 paired samples from 7 classes in Places 205: bedroom, dinette, dining room, home office, hotel room, kitchenette and living room. The testing set contains 2870 images, which are randomly selected from the dataset. Because the diversity of the dataset and the variance between speakers make it difficult to generate the details of high-resolution images, we generate images at a low resolution. Some sampled results are shown in Fig. 4. Although the details are not sharp due to the low resolution, the color and the scene of the generated images are semantically consistent with the input speech descriptions, which means that the model has captured the semantic information in the input speech descriptions to some extent. In addition to the visual examples, we also evaluate the results with objective metrics.
Our model achieves an FID of 83.06, as shown in Table III (IS is not used because no fine-tuned Inception model is available for Places-Subset). Our method is considerably better than the classifier-based method (83.06 vs 232.39), demonstrating the effectiveness of teacher-student learning on real data. However, unlike on the synthesized datasets, our model does not perform as well as the “two-stage” method (83.06 vs 64.59), because it is much more challenging for our model to extract the semantic feature from real speech than from synthesized speech. These results are nevertheless encouraging and show the potential of our model in real scenarios such as human-computer interaction and computer-aided design.
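For readers unfamiliar with the metric, FID is the squared Fréchet distance between two Gaussians fitted to image features of real and generated images, so lower values are better. The sketch below illustrates the idea under a simplifying assumption that is not part of the standard metric: the covariances are treated as diagonal, so that no matrix square root is needed. The function names are hypothetical, and the toy feature vectors stand in for Inception features.

```python
import math

def mean_and_variance(features):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    count = len(features)
    dim = len(features[0])
    mean = [sum(f[i] for f in features) / count for i in range(dim)]
    var = [sum((f[i] - mean[i]) ** 2 for f in features) / count for i in range(dim)]
    return mean, var

def frechet_distance_diagonal(mean_a, var_a, mean_b, var_b):
    """Squared Frechet distance between two Gaussians with diagonal covariances.

    With diagonal covariances, Tr(A + B - 2 (A B)^(1/2)) reduces to a
    per-dimension sum of var_a + var_b - 2 * sqrt(var_a * var_b).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb)
                   for va, vb in zip(var_a, var_b))
    return mean_term + cov_term

# Toy "real" and "generated" features (assumed values, for illustration).
real_features = [[0.0, 1.0], [2.0, 3.0]]
fake_features = [[1.0, 1.0], [3.0, 3.0]]

real_mean, real_var = mean_and_variance(real_features)
fake_mean, fake_var = mean_and_variance(fake_features)
distance = frechet_distance_diagonal(real_mean, real_var, fake_mean, fake_var)
```

Here the two toy sets have the same spread and differ only by a shift of 1 in the first dimension, so only the mean term contributes to the distance.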
V Ablation Study
In this section, we conduct an ablation study to analyze the different components of the proposed model, the influence of some hyperparameters, and the embedding feature space. For a fair comparison, all models are trained from scratch for the same number of iterations with the same hyperparameters, except for the hyperparameter under study, such as the loss items of the speech encoder or the scale of the synthesized images. Specifically, the speech encoders are trained for 100 epochs on both datasets, and the generators are trained for 220k and 100k iterations on the CUB-200 and Oxford-102 datasets, respectively. The number of training iterations is set to avoid overfitting or model collapse.
V-A The Loss Function
In the training of our speech encoder, the teacher-student loss (TSL) (Eq. 6), which combines three loss items, is used to stabilize and accelerate the training. Experiments are conducted on the CUB-200 and Oxford-102 datasets to study how these items influence the final result. For a fair comparison, all training hyperparameters are the same except for the loss items. The same structure and parameters are used in the generator training, and the generator has three branches. The weights of the loss items are set so that the gradients to the speech embedding are on a similar scale, following prior work. Table IV lists the results on the testing sets of the CUB-200 and Oxford-102 datasets. When only a single loss item is used to train the speech encoder (1st and 4th rows in Table IV), the model performs worst on both datasets, demonstrating that this item alone is not enough to obtain good performance. A second item improves the model by a large margin on both datasets (2nd and 5th rows in Table IV) and even achieves the best FID on the CUB-200 dataset (2nd row in Table IV). By comparison, the remaining item only boosts the model slightly on the Oxford-102 dataset (6th row in Table IV) and even degrades the FID on the CUB-200 dataset (3rd row in Table IV). In general, both of these items can improve the model to some degree.
V-B Different Scales
Higher resolution provides more details, making the generated images more realistic. However, higher resolution also increases the complexity of the model, making the model less stable. The scale of the generated images therefore affects the performance of our model in different ways, and in this subsection we conduct experiments to study the influence of different scales. Experiments are conducted on the CUB-200 and Oxford-102 datasets. All the experiments use the same training hyperparameters and network structure, except for the resolution of the generated images and the number of discriminators for the different scales. Specifically, all the experiments use the same pretrained speech encoder to extract embedding features of the speech descriptions, and the generator for each of the three resolutions uses a corresponding number of discriminators to model the data distribution. IS and FID are used to evaluate the performance of the generator. As shown in Table V, with the tree-like structure, the higher the resolution our model synthesizes, the better the performance. The only exception is on the Oxford-102 dataset, where a lower-resolution model obtains the best IS (5th row in Table V); however, its advantage in IS is marginal. Taking the FID into account, we can still conclude that higher resolution leads to better performance. A similar conclusion is also drawn in text-to-image translation [56, 55].
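As a reminder of what IS measures, the score is the exponential of the average KL divergence between the class distribution predicted for each generated image and the marginal class distribution over all generated images; it is high when individual predictions are confident and the predicted classes are diverse. The following is a minimal sketch under stated assumptions: `class_probs` is a hypothetical list of softmax outputs that would come from a (fine-tuned) Inception classifier, and the usual averaging over several splits is omitted.

```python
import math

def inception_score(class_probs):
    """exp of the mean KL(p(y|x) || p(y)) over a list of predicted distributions."""
    count = len(class_probs)
    num_classes = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in class_probs) / count for j in range(num_classes)]
    kl_sum = 0.0
    for p in class_probs:
        # Terms with p_j == 0 contribute nothing to the KL divergence.
        kl_sum += sum(pj * math.log(pj / mj)
                      for pj, mj in zip(p, marginal) if pj > 0.0)
    return math.exp(kl_sum / count)

# Confident and diverse predictions give a high score ...
diverse_score = inception_score([[1.0, 0.0], [0.0, 1.0]])
# ... while identical, uninformative predictions give the minimum score of 1.
flat_score = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

Because the marginal term rewards diversity, a model can obtain a higher IS while its samples are further from the real data distribution, which is why FID is reported alongside IS.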
V-C Feature Interpolation
To further study the feature space derived from our speech encoder, we conduct feature interpolation experiments on the CUB-200 and Oxford-102 datasets to show that the feature space learned by our speech encoder is linear to some extent. Specifically, given two embedding features, 9 features are sampled by combining the two linearly. These embedding features are then fed into the generator to study the semantic transition from the first feature to the second. As shown in Fig. 5, in the first row, the color of the small bird’s back transitions smoothly from brown to blue, while the color of the belly changes from yellow to white, which matches the semantics of the input speech descriptions. In the second row, the breast of the small bird changes smoothly from brown to red, just as described in the input speech. The third and fourth rows are from Oxford-102. Similar to the first two rows, most of the flowers are realistic and the color of the flower changes gradually from left to right, although some images contain artifacts. The results of feature interpolation verify that our model has learned a linear embedding feature space to some extent.
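The interpolation procedure can be sketched as follows. This is an illustrative sketch, not the exact sampling scheme of the experiments: the function name is hypothetical, and we assume that the 9 interpolation weights are evenly spaced between 0 and 1 with both endpoints included.

```python
def interpolate_features(start, end, num_steps=9):
    """Return num_steps features on the straight line from start to end.

    Assumption: the weights are evenly spaced in [0, 1], endpoints included.
    """
    features = []
    for k in range(num_steps):
        alpha = k / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
        features.append([(1.0 - alpha) * s + alpha * e
                         for s, e in zip(start, end)])
    return features

# Toy 2-dimensional embeddings (assumed values, for illustration).
first_embedding = [0.0, 2.0]
second_embedding = [1.0, 0.0]
interpolated = interpolate_features(first_embedding, second_embedding)
```

Each interpolated feature would then be fed to the generator in place of a speech embedding; if the embedding space is approximately linear, the generated images change gradually from the first description to the second.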
In this paper, we have described a new framework that translates speech signals into images without an intermediate text representation. We addressed this problem by extracting a low-dimensional embedding feature from the speech descriptions and synthesizing images from this feature via a stacked GAN. We have demonstrated that our proposed model can synthesize images that are semantically consistent with the input speech description on both synthesized and real data. Moreover, our model performed better than the “two-stage” method and the classifier-based method, and even achieved performance comparable to the text-to-image models on the synthesized datasets. We believe that synthesizing images from speech signals without text offers a new perspective for understanding the semantic information in speech signals and can open up new research directions.
This work was supported by the National Natural Science Foundation of China and Royal Society (61961130392), National Natural Science Foundation of China (61632001), which are gratefully acknowledged.
-  (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pp. 173–182. Cited by: §III-A.
-  (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pp. 173–182. Cited by: §IV-B2.
-  (2017) Deep voice 2: multi-speaker neural text-to-speech. arXiv preprint arXiv:1705.08947. Cited by: §IV-A.
-  (2017) Deep voice: real-time neural text-to-speech. In International Conference on Machine Learning, pp. 195–204. Cited by: §IV-A.
-  (2017-12) End-to-end asr-free keyword search from speech. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1351–1359. Cited by: §I.
-  (2014) Do deep nets really need to be deep?. In Advances in neural information processing systems, pp. 2654–2662. Cited by: §II-E.
-  (2014) Word embeddings for speech recognition. In Fifteenth Annual Conference of the International Speech Communication Association, Cited by: §II-D.
-  (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744. Cited by: §II-F.
-  (2012) At 6–9 months, human infants know the meanings of many common nouns. Proceedings of the National Academy of Sciences 109 (9), pp. 3253–3258. Cited by: §I.
-  (2017) Deep cross-modal audio-visual generation. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 349–357. Cited by: §I, §II-C, §II-E, §IV-B4, TABLE II.
-  (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Cited by: §II-E, §III-B, §IV-A.
-  (1982) The fréchet distance between multivariate normal distributions. Journal of multivariate analysis 12 (3), pp. 450–455. Cited by: §IV-A.
-  (2019) Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 3. Cited by: §I, §II-C.
-  (2016) An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 949–959. Cited by: §II-F.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §II-D.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §I, §II-A.
-  (2013) Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pp. 6645–6649. Cited by: §III-A.
-  (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §IV-B2.
-  (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §III-A.
-  (2018) Cmcgan: a uniform framework for cross-modal visual-audio mutual generation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §I, §II-C.
-  (2018-04) Vision as an interlingua: learning multilingual semantic embeddings of untranscribed speech. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4969–4973. Cited by: §II-D.
-  (2015) Deep multimodal semantic embeddings for speech and images. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pp. 237–244. Cited by: §II-B, §II-D, §III-B.
-  (2017) Learning word-like units from joint audio-visual analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 506–517. Cited by: §I, §II-D, §IV-A, §IV-A.
-  (2018-09) Jointly discovering visual objects and spoken words from raw sensory input. In The European Conference on Computer Vision (ECCV), Cited by: §I, §II-D, §III-A, §III-B, §III-F, §IV-A, §IV-C.
-  (2016) Unsupervised learning of spoken language with visual context. In Advances in Neural Information Processing Systems, pp. 1858–1866. Cited by: §IV-A, §IV-A, §IV-C.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §III-C.
-  (2018) TED-lium 3: twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer, A. Karpov, O. Jokisch, and R. Potapova (Eds.), Cham, pp. 198–208. Cited by: §IV-A.
-  (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §IV-A, §V-B.
-  (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §II-A.
-  (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §II-E, §III-B, §III-D1.
-  (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A field guide to dynamical recurrent neural networks. IEEE Press. Cited by: §III-A.
-  (2017) Beyond face rotation: global and local perception gan for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2439–2448. Cited by: §II-A.
-  (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §II-A.
-  (2019) Direct speech-to-speech translation with a sequence-to-sequence model. arXiv preprint arXiv:1904.06037. Cited by: §I, §II-F.
-  (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §II-F.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §IV-A.
-  (2018) A teacher-student learning approach for unsupervised domain adaptation of sequence-trained asr models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 250–257. Cited by: §II-E.
-  (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §III-D2.
-  (2006) A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1447–1454. Cited by: Fig. 3, §IV-A, §IV-A, §IV-B1, §IV-B2, §IV-B3, §IV-B4, §IV-B, TABLE I, TABLE II, Fig. 5, §V-A, §V-B, §V-C, TABLE IV, TABLE V, §V.
-  (2019) Speech2Face: learning the face behind a voice. arXiv preprint arXiv:1905.09773. Cited by: §I, §II-C.
-  (2019) Speech2Face: learning the face behind a voice. arXiv preprint arXiv:1905.09773. Cited by: §I, §II-E, §III-B, §III-D1, §V-A.
-  (2015) Librispeech: an asr corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 5206–5210. Cited by: §IV-A.
-  (2018) Deep voice 3: 2000-speaker neural text-to-speech. In International Conference on Learning Representations, Cited by: §IV-A.
-  (2016) Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58. Cited by: §I, §II-B, §II-E, §III-A, §III-B, §III-D1, §IV-A, §IV-B.
-  (2016) Generative adversarial text to image synthesis. In 33rd International Conference on Machine Learning, pp. 1060–1069. Cited by: §I, §II-A, §II-B, §II-E, §III, §IV-A, §IV-A.
-  (2016) Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242. Cited by: §IV-A, §V-B.
-  (2010) Connecting modalities: semi-supervised segmentation and annotation of images using unaligned text corpora. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 966–973. Cited by: §II-D.
-  (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §II-E, §III-B, §III-F.
-  (1982) Spoken and written language: exploring orality and literacy. Vol. 32, ABLEX Publishing Corporation. Cited by: §I.
-  (1981) Semantic comprehension in infancy: a signal detection analysis. Child development, pp. 798–803. Cited by: §I.
-  (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: Fig. 3, §IV-A, §IV-A, §IV-B1, §IV-B2, §IV-B3, §IV-B4, §IV-B, TABLE I, TABLE II, Fig. 5, §V-A, §V-B, §V-C, TABLE IV, TABLE V, §V.
-  (2015-March 3) Enhanced speech-to-speech translation system and methods for adding a new word. Google Patents. Note: US Patent 8,972,268 Cited by: §II-F.
-  (2018-06) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §I, §II-B, §III-C, §III.
-  (2013) KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7893–7897. Cited by: §II-E.
-  (2018) StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §I, §I, §II-A, §II-B, §II-E, §III-C, §III-D2, §III, §IV-A, §IV-A, §IV-A, §IV-B2, §IV-B3, §IV-B3, TABLE II, §V-B.
-  (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5908–5916. Cited by: §I, §I, §II-B, §II-E, §III-B, §III-C, §III, §IV-A, §IV-A, §IV-B3, TABLE II, §V-B.
-  (2014) . In Advances in neural information processing systems, pp. 487–495. Cited by: §IV-A, §IV-C.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §II-A.