
Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph

This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with manually designed constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. The proposed Retriever is modal-agnostic and is evaluated in both the speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks. Project page at





Code repository: [ICLR2022] Code for "Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph"

1 Introduction

Human perceptual systems routinely separate content and style to better understand their observations (Tenenbaum and Freeman, 2000). In artificial intelligence, a content and style decomposed representation is also highly desired. However, we notice that existing work lacks a unified definition of content and style. Some definitions are dataset-dependent (Chou and Lee, 2019; Ren et al., 2021), while others must be defined on a certain modality (Lorenz et al., 2019; Wu et al., 2019). We wonder: since content-style separation is helpful to our entire perceptual system, why is there no unified definition that applies to all perceptual data?

In order to answer this question, we must first study the characteristics of data. The data of interest, including text, speech, image, and video, are structured. They can be divided into standardized tokens, either naturally, as words in language and speech, or intentionally, as patches in images. Notably, the order of these tokens matters. Disrupting the order of words can make an utterance express a completely different meaning. Reversing the order of frames in a video can turn a stand-up action into a sit-down action. But there is also information that is not affected by the order of tokens. For example, scrambling the words in an utterance does not change the speaker’s voice, and shuffling the frames of a video does not change how the person in the video looks. We notice that what we intuitively think of as content is affected by the order of tokens, while the style usually is not. Therefore, we can generally define style as token-level permutation invariant (P.I.) information and define content as the rest of the information in structured data.

However, merely dividing data into two parts is not enough. As Bengio et al. (2013) pointed out, if we are to take the notion of disentangling seriously, we require a richer interaction of features than that offered by simple linear combinations. An intuitive example is that we could never generate a colored pattern by linearly combining a generic color feature with a gray-scale stimulus pattern. Inspired by this, we propose to model content and style by a token-level bipartite graph, as Figure 1 illustrates. This representation includes an array of content tokens, a set of style tokens, and a set of links modeling the interaction between them. Such a representation allows for fine-grained access and manipulation of features, and enables exciting downstream tasks such as part-level style transfer.

Figure 1: Illustration of the proposed modal-agnostic bipartite-graph representation of content and style. The content is an array of position-sensitive tokens and the style is a set of P.I. tokens.

In this paper, we design a modal-agnostic framework, named Retriever, for learning the bipartite graph representation of content and style. Retriever adopts an autoencoder architecture and addresses two main challenges within it. One is how to decompose the defined content and style in an unsupervised setting. The other is how to compose these two separated factors to reconstruct data.

To tackle the first challenge, we employ a cross-attention module that takes a dataset-shared prototype as the query to retrieve the style tokens from the input data (Carion et al., 2020). A cross-attention operation only allows the P.I. information to pass (Lee et al., 2019), which is exactly what we want for style. On the content path, we employ a vector quantization (VQ) module (van den Oord et al., 2017) as the information bottleneck. In addition, we enforce manually designed constraints to make the content tokens interpretable. To tackle the second challenge, we introduce a novel link attention module for the reconstruction from the bipartite graph. Specifically, the content and style serve as the query and value, respectively. Links between the content and style are learned and stored in the linking keys. Link attention allows us to retrieve style by content query. As such, interpretability is propagated from content to style, and the entire representation is friendly to fine-grained editing.

We evaluate Retriever in both the speech and image domains. In the speech domain, we achieve state-of-the-art (SOTA) performance in zero-shot voice conversion, demonstrating a complete and precise decomposition of content and style. In the image domain, we achieve competitive results in the part discovery task, which demonstrates the interpretability of the decomposed content. More excitingly, we try part-level style transfer, which most existing content-style disentanglement approaches cannot offer. Vivid and interpretable results are achieved.

To summarize, our main contributions are threefold: i) We provide an intuitive and modal-agnostic definition of content and style for structured data. We are the first to model content and style with a token-level bipartite graph. ii) We propose an unsupervised framework, named Retriever, for learning the proposed content-style representation. A novel link attention module is designed for data reconstruction from the content-style bipartite graph. iii) We demonstrate the power of Retriever in challenging downstream tasks in both the speech and image domains.

2 Related Work

Content-style decomposed representation can be approached in supervised or unsupervised settings. When style labels, such as the speaker labels of speeches (Kameoka et al., 2018; Qian et al., 2019; Yuan et al., 2021) and the identity labels of face images (Mathieu et al., 2016; Szabó et al., 2018; Jha et al., 2018; Bouchacourt et al., 2018; Gabbay and Hoshen, 2020), are available, latent variables can be divided into content and style based on group supervision.

Recently, there has been increased interest in the unsupervised learning of content and style. Since there is no explicit supervision signal, the basic problem one must first solve is the definition of content and style. We discover that all existing definitions are either domain-specific or task-specific. For example, in the speech domain, Chou and Lee (2019) assume that style is the global statistical information and content is what is left after instance normalization (IN). Ebbers et al. (2021) suggest that style captures long-term stability and content captures short-term variations. In the image domain, the definitions are even more diverse. Lorenz et al. (2019) try to discover the invariants under spatial and appearance transformations and treat them as style and content, respectively. Wu et al. (2019) define content as 2D landmarks and style as the rest of the information. Ren et al. (2021) define content as the most important factor across the whole dataset for image reconstruction, which is rather abstract. In this work, we attempt to find a general and modal-agnostic definition of content and style.

Style transfer is partially related to our work, as it concerns the combination of content and style to reconstruct data. AdaIN (Huang and Belongie, 2017; Chou and Lee, 2019) goes beyond the linear combination of content and style (Tenenbaum and Freeman, 2000) and proposes to inject style into content by aligning the mean and variance of the content features with those of the style features. However, style is not separated from content in this line of research.

Liu et al. (2021) touch upon the part-based style transfer task as we do. They model the relationship between content and style by a one-to-one mapping. They follow the common definition of content and style in the image domain as shape and appearance, and try to disentangle them with hand-crafted data augmentation methods.

Besides, the term “style” often appears in image generative models such as StyleGAN (Karras et al., 2019). However, the style mentioned in this line of work is conceptually different from the style in ours. In StyleGAN, there is no concept of content, and style is the whole latent variable containing all the information, including the appearance and shape of an image. Following StyleGAN, Hudson and Zitnick (2021) employ a bipartite structure to enable long-range interactions across the image, iteratively propagating information from a set of latent variables to the evolving visual features. Recently, researchers have become interested in disentangling content and style from the latent variables of StyleGAN (Alharbi and Wonka, 2020; Kwon and Ye, 2021). However, these methods only work for well-aligned images and are difficult to apply to other modalities.

3 Content-Style Representation for Structured Data

In this section, we provide definitions of content and style for structured data, introduce the framework for content-style decomposition, and propose the token-level bipartite graph representation.

3.1 Definition of Content and Style

The data of interest is structured data that can be tokenized, denoted by $\mathbf{x} = [x_1, x_2, \dots, x_N]$. We think of text, speech, and image, among many others, as structured data. Each token $x_i$ can be a word in text, a phoneme in speech, or a patch in an image. These data are structured because non-trivial instances (examples of trivial instances are a silent speech or a blank image) cannot keep their full information when the order of tokens is not given. Inspired by this intuition, we define the style of $\mathbf{x}$ as the information that is not affected by the permutation of tokens, i.e., the permutation invariant (P.I.) information. Content is the rest of the information in $\mathbf{x}$.
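To make this definition concrete, here is a toy illustration (ours, not from the paper): a set statistic such as the token mean is permutation invariant and thus style-like, while an order-weighted sum changes under token permutation and thus carries content.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))        # 6 tokens, 4-dim features each
x_perm = np.roll(x, 2, axis=0)     # same tokens, different order

# A set statistic ignores token order: a candidate carrier of "style".
def style(tokens):
    return tokens.mean(axis=0)

# A position-weighted sum depends on token order: it carries "content".
def order_sensitive(tokens):
    pos = np.arange(len(tokens))[:, None]
    return (pos * tokens).sum(axis=0)

assert np.allclose(style(x), style(x_perm))              # P.I. -> style
assert not np.allclose(order_sensitive(x), order_sensitive(x_perm))
```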

Figure 2: An illustration of our content-style separation mechanism.

3.2 Content-Style Separation

The information in a piece of structured data is carried either in the content or in the style. By definition, style can be extracted by a P.I. function $f_s$, which satisfies $f_s(\mathbf{x}) = f_s(\pi(\mathbf{x}))$, where $\pi$ represents a permutation of the tokens. To achieve content-style decomposition, we naturally adopt an autoencoder architecture, as shown in Figure 2. The bottom path is the style path, which implements $f_s$. We shall find a powerful P.I. function that lets all the P.I. information pass. The top path is responsible for extracting the content, but the challenge is that there does not exist a function which passes only the non-P.I. information. Therefore, we employ in the content path a permutation-variant function, which lets all information pass, including the P.I. information. Obviously, if we do not pose any constraint on the content path, as shown in Figure 2 (a), the style information will be leaked to the content path. To squeeze the style information out of the content path, an information bottleneck is required. A perfect bottleneck, as shown in Figure 2 (c), avoids the style leak while achieving perfect reconstruction. But an imperfect bottleneck, being too wide or too narrow, will cause style leak or content loss, as shown in Figure 2 (b) and Figure 2 (d), respectively.

3.3 Token-level Bipartite Graph Representation of Content and Style

After dividing the structured input $\mathbf{x}$ into the content feature $\mathbf{c}$ and the style feature $\mathbf{s}$, we continue to explore how the relationship between $\mathbf{c}$ and $\mathbf{s}$ should be modeled. Ideally, we do not want an all-to-one mapping, in which the style is applied to all the content as a whole. Nor do we want a one-to-one mapping where each content token is associated with a fixed style. In order to provide a flexible way of interaction, we propose a novel token-level bipartite graph modeling of $\mathbf{c}$ and $\mathbf{s}$.

The token-level representations of $\mathbf{c}$ and $\mathbf{s}$ are $\mathbf{c} = [c_1, c_2, \dots, c_N]$ and $\mathbf{s} = \{s_1, s_2, \dots, s_M\}$, respectively. Note that there is a one-to-one correspondence between $c_i$ and $x_i$, so the order of the structured data is preserved in the content $\mathbf{c}$. While the order is preserved, the semantic meaning of each $c_i$ is not fixed. This suggests that the bipartite graph between $\mathbf{c}$ and $\mathbf{s}$ cannot be static. In order to model a dynamic bipartite graph, we introduce a set of learnable linking keys $\mathbf{k} = \{k_1, k_2, \dots, k_M\}$. The linking keys and style tokens form a set of key-value pairs $\{(k_j, s_j)\}_{j=1}^{M}$. Our linking-key design allows for a soft and learnable combination of content and style. The connection weight between a content token $c_i$ and a style token $s_j$ is calculated by $a_{ij} = \phi_{\theta}(c_i, k_j)$, where $\phi_{\theta}$ is a learnable linking function parameterized by $\theta$. For a content token $c_i$, its content-specific style feature can now be calculated by $\hat{s}_i = \sum_{j} \bar{a}_{ij} s_j$, where $\bar{a}_{ij}$ are the normalized weights.
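The weighted retrieval described above can be sketched in a few lines. This is an illustrative sketch of one concrete choice of linking function — a softmax over scaled dot products between content tokens and linking keys, mirroring standard attention — not the paper's exact implementation:

```python
import numpy as np

def content_specific_style(c, k, s):
    """For content tokens c (N, d), linking keys k (M, d), and style
    tokens s (M, ds), compute normalized link weights and return the
    content-specific styles  s_hat_i = sum_j a_ij * s_j."""
    logits = c @ k.T / np.sqrt(c.shape[-1])        # (N, M) link scores
    a = np.exp(logits - logits.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)                  # normalize per content token
    return a @ s, a

rng = np.random.default_rng(1)
c = rng.normal(size=(5, 8))      # N=5 content tokens (queries)
k = rng.normal(size=(3, 8))      # M=3 learnable linking keys
s = rng.normal(size=(3, 4))      # M=3 style tokens (values)

s_hat, a = content_specific_style(c, k, s)
assert s_hat.shape == (5, 4)                 # one style vector per content token
assert np.allclose(a.sum(-1), 1.0)           # weights sum to 1 per token
```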

4 The Retriever Framework

4.1 Overview

Figure 3: Overview of the proposed Retriever framework. The name is dubbed because of the dual-retrieval operations: the cross-attention module retrieves style for content-style separation, and the link attention module retrieves content-specific style for data reconstruction.

This section presents the design and implementation of the Retriever framework, which realizes the bipartite graph representation of content and style described in Section 3. Retriever, as shown in Figure 3, is an autoencoder comprising four major blocks, two of which implement the most important style retrieval functions. This is why the name Retriever is coined.

Retriever is modal-agnostic. The input of the framework is an array of input tokens, denoted by $\mathbf{x}$, obtained by modal-specific tokenization operations on the raw data. The first major component is a pre-processing module that extracts the feature $\mathbf{z}$ from $\mathbf{x}$. We implement it as a stack of non-P.I. transformer encoder blocks (Vaswani et al., 2017) to keep all the important information. Then $\mathbf{z}$ is decomposed by the content encoder and the style encoder. The content encoder is implemented by vector quantization to extract the content $\mathbf{c}$. A structural constraint can be imposed on $\mathbf{c}$ to improve interpretability. A cross-attention module is employed as the style encoder to retrieve the style $\mathbf{s}$. At last, the decoder, implemented with the novel link attention module, reconstructs $\mathbf{x}$ from the bipartite graph.

4.2 Content Encoder

The content encoder should serve two purposes. First, it should implement an information bottleneck to squeeze out the style information. Second, it should provide interpretability so that meaningful feature editing is enabled.

Information bottleneck. Vector quantization (VQ) is a good candidate for the information bottleneck (Wu et al., 2020), though many other choices also fit easily into our framework. VQ maps data from a continuous space into a set of discrete representations with restricted information capacity, and has shown strong generalization ability across data formats, such as image (van den Oord et al., 2017; Razavi et al., 2019; Esser et al., 2021), audio (Baevski et al., 2020a, b) and video (Yan et al., 2021).

We use product quantization (Baevski et al., 2020a, b), which represents the input with a concatenation of $G$ groups of codes, where each group of codes comes from an independent codebook of $V$ entries. To encourage all the VQ codes to be equally used, we use a batch-level VQ perplexity loss, denoted as $\mathcal{L}_{vq}$. Please refer to Appendix A for more details. When the bottleneck capacity is appropriate, the content can be extracted as $\mathbf{c} = \mathrm{VQ}(\mathbf{z})$, where $\mathrm{VQ}(\cdot)$ denotes the quantization operation.
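Product quantization amounts to splitting each token feature into G chunks and quantizing each chunk with its own V-entry codebook. The following is a minimal numpy sketch of that idea (ours, for illustration; the paper's version operates on learned features with a straight-through gradient), together with the batch-level code-usage perplexity that the diversity loss encourages to be high:

```python
import numpy as np

def product_quantize(z, codebooks):
    """Quantize token features with G independent codebooks (product VQ).
    z: (N, d) features; codebooks: list of G arrays of shape (V, d // G).
    Returns code indices (N, G) and quantized features (N, d)."""
    groups = np.split(z, len(codebooks), axis=-1)
    idx, quant = [], []
    for zg, cb in zip(groups, codebooks):
        d2 = ((zg[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (N, V)
        j = d2.argmin(-1)                    # nearest codebook entry per token
        idx.append(j)
        quant.append(cb[j])
    return np.stack(idx, -1), np.concatenate(quant, -1)

def code_perplexity(codes, V):
    """exp(entropy) of batch-level code usage; maximal (= V) when all
    codes are used equally, which the diversity loss encourages."""
    p = np.bincount(codes, minlength=V) / len(codes)
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(2)
z = rng.normal(size=(10, 8))                          # 10 tokens, 8-dim
books = [rng.normal(size=(16, 4)) for _ in range(2)]  # G=2 groups, V=16
codes, zq = product_quantize(z, books)
assert codes.shape == (10, 2) and zq.shape == z.shape
assert 1.0 <= code_perplexity(codes[:, 0], 16) <= 16.0
```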

Interpretability. Prior knowledge can be imposed on the content path of Retriever to increase the interpretability of feature representations. In this work, we introduce a structural constraint to demonstrate this capability. Other prior knowledge, including modal-specific knowledge, can be added to the framework as long as it can be translated into differentiable loss functions.

The structural constraint we have implemented in Retriever is a quite general one, as it reflects the locality bias that widely exists in natural signals. In the image domain, we force spatially adjacent tokens to share the same VQ code, so that a single VQ code may represent a meaningful object part. In the speech domain, we discourage the VQ code from changing too frequently along the temporal axis. As such, adjacent speech tokens can share the same code, which may represent a phoneme. The visualization in the next section will demonstrate the interpretability of Retriever features.
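The paper's exact constraint losses are given in its appendix; as an illustration of the speech-domain idea (discouraging frequent code switches along time), one could penalize disagreement between the soft code assignments of adjacent frames. This is a hypothetical smoothness penalty of our own, not the paper's formulation:

```python
import numpy as np

def temporal_smoothness_loss(p):
    """p: (T, V) soft code-assignment probabilities per frame.
    Cross-entropy between neighbouring frames; low when adjacent
    frames agree on the same VQ code (the locality bias)."""
    eps = 1e-9
    return -np.mean(np.sum(p[:-1] * np.log(p[1:] + eps), axis=-1))

# A sequence that holds one code and then switches once is cheaper
# than one that flips its code at every frame.
steady = np.eye(4)[[0, 0, 0, 1, 1, 1]] * 0.94 + 0.02   # near one-hot rows
flippy = np.eye(4)[[0, 1, 0, 1, 0, 1]] * 0.94 + 0.02
assert temporal_smoothness_loss(steady) < temporal_smoothness_loss(flippy)
```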

Figure 4: Implementation of (a) the style encoder and (b) the decoder in Retriever.

4.3 Style Encoder with Cross-Attention

We have defined style as the P.I. information that can be extracted from structured data by a P.I. function. Cross-attention, which is widely used in set prediction tasks (Carion et al., 2020) and for set data such as 3D point clouds (Lee et al., 2019), is known to be a P.I. function. It is also a powerful operation that can project a large input to a smaller or arbitrary target shape. Previously, Perceiver (Jaegle et al., 2021b) and Perceiver IO (Jaegle et al., 2021a) used this operation as a replacement for the self-attention operation in Transformers to reduce complexity. However, the cross-attention operation does not preserve data structure. To compensate for this, additional positional embeddings are used to associate position information with each input element.

Conversely, our work takes advantage of cross-attention’s non-structure-preserving nature to retrieve the P.I. style information. A key implementation detail is that position information should not be associated with the input tokens; otherwise, content will leak into the style path. To implement the style encoder, we follow previous work (Carion et al., 2020; Lee et al., 2019) and learn dataset-shared prototypes as seed vectors. Figure 4 (a) shows the implementation details of the style encoder, which is a stack of cross-attention blocks, token mixing layers, and feed-forward network (FFN) layers. The final output is the retrieved style $\mathbf{s}$.
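The permutation invariance of this retrieval is easy to verify in a single-head, single-block sketch (ours, with hypothetical weight names; the actual encoder stacks several blocks with token mixing and FFN layers): learned prototypes act as queries, the input tokens, carrying no positional embedding, supply keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis, keepdims=True))
    return e / e.sum(axis, keepdims=True)

def cross_attention_style(x, prototypes, Wq, Wk, Wv):
    """Retrieve a fixed-size style set from N input tokens.
    prototypes: (M, d) dataset-shared seed queries; x: (N, d) tokens
    WITHOUT positional embedding, so the output is permutation invariant."""
    q, k, v = prototypes @ Wq, x @ Wk, x @ Wv
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (M, N) attention
    return a @ v                                   # (M, d) style tokens

rng = np.random.default_rng(3)
N, M, d = 7, 3, 8
x = rng.normal(size=(N, d))
proto = rng.normal(size=(M, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]

s = cross_attention_style(x, proto, Wq, Wk, Wv)
s_perm = cross_attention_style(x[::-1], proto, Wq, Wk, Wv)  # shuffled tokens
assert np.allclose(s, s_perm)      # style is P.I., as required
```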

4.4 Decoder with Link Attention

In an autoencoder architecture, the decoder is responsible for reconstructing data from latent features. In learning a content-style representation, the ability of the decoder limits the way content and style can be modeled. Therefore, decoder design becomes the key to the entire framework design.

In order to support the bipartite graph representation of content and style, we design link attention by unleashing the full potential of multi-head attention. Unlike self-attention and cross-attention, link attention has distinct inputs for the query, key, and value. In the Retriever decoder, content tokens are the queries and style tokens are the values. The keys are a set of learnable linking keys, which are paired with the style tokens to represent the links in the content-style bipartite graph. Such a design allows us to retrieve content-specific style for data reconstruction, enabling fine-grained feature editing and interpretable style transfer.

Figure 4 (b) shows the implementation details of the Retriever decoder. We first use a convolutional layer directly after VQ to aggregate the contextual information. Then the style is injected into the content over several rounds. In each round of style injection, we add a self-attention layer before the link attention layer and append an FFN layer after it. The element order of the linking keys and style tokens does not matter, as long as each key stays paired with its style token. The output of the decoder is then detokenized to reconstruct the raw data, whose difference from the input raw data is used as the reconstruction loss.

4.5 Loss Functions

Our final loss function consists of three components: the reconstruction loss $\mathcal{L}_{rec}$, the VQ diversity loss $\mathcal{L}_{vq}$, and the structural constraint loss $\mathcal{L}_{str}$ (see implementation details in Appendix D):

$$\mathcal{L} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{vq}\mathcal{L}_{vq} + \lambda_{str}\mathcal{L}_{str},$$

where $\lambda_{rec}$, $\lambda_{vq}$, and $\lambda_{str}$ are hyper-parameters controlling the weights of the three losses.

5 Experiments

We evaluate the Retriever framework in both the speech and image domains. Due to the differences in tasks, datasets, and evaluation metrics, we organize the experiments in these two domains in two subsections. We use visualizations and quantitative results to: i) demonstrate the effect of content-style separation and the interpretability of content; ii) illustrate how the bipartite graph representation conveys the interpretability of content to style and supports fine-grained style transfer.

5.1 Speech Domain

5.1.1 Training Retriever for Speech Signals

Retriever for speech signals is trained with the entire 44-hour CSTR VCTK Corpus (Veaux et al., 2017) containing 109 speakers. We use the pre-trained CPC network (Rivière et al., 2020) followed by a depth-wise convolution to perform tokenization. In Retriever, the content encoder is implemented by VQ with two groups. A truncated neighborhood cross-entropy loss is used as $\mathcal{L}_{str}$ to enforce the structural constraint, and it is only applied to group #0. For the style encoder, we set the number of style tokens to 60. At the decoder, the content is first processed by a 1D depth-wise convolution before being passed to link attention. In the detokenization step, we resample the output log-Mel spectrum to 80Hz and feed it into the Parallel WaveGAN vocoder (Yamamoto et al., 2020) to generate the waveform. We apply an L1 reconstruction loss on the log-Mel spectrum as $\mathcal{L}_{rec}$. More details can be found in Appendix F.1.

5.1.2 Visualization of Content-Style Representation

To understand how Retriever models the content, the style, and the bipartite graph between them, we visualize the VQ codes and the decoder link attention map for a 2.5s-long test utterance in Figure 5 (a). Each dot represents one token and the token hop is 10ms. Retriever encodes content with VQ, and we further impose the structural constraint on VQ group #0. We find that this group of codes demonstrates good interpretability as desired. Codes for adjacent tokens are highly consistent, and the changes of codes align well with ground-truth phoneme changes. The VQ codes of group #1 exhibit a different pattern. They switch more frequently, as they capture more detailed residual variations within each phoneme. We further study the interpretability of content by phoneme information probing (implementation described in Appendix F.3), which is used to show the information captured by the discovered content units (Chorowski et al., 2019). We measure the frame-level phoneme classification accuracy with or without context and compare with the baseline (CPC + k-means). Results in Table 1 show that the two groups of VQ codes discovered by our unsupervised framework align surprisingly well with the annotated phoneme labels. Notably, group #0 codes capture most of the phoneme information while group #1 codes are less interpretable. This demonstrates how the structural constraint helps to increase the interpretability of content.

In decoder link attention, all the content tokens within the same phoneme tend to attend to the same style vector, while tokens from different phonemes always attend to different style vectors. We further visualize the co-occurrence map between the ground-truth phonemes and the style vectors in Figure 5 (b) (calculation described in Appendix B). We see that the phonetic content and style vectors have a strong correlation. Interestingly, each style vector encodes the style of a group of similarly pronounced phonemes, such as the group of ‘M’, ‘N’, and ‘NG’. This observation confirms that content-specific style is actually realized in Retriever features.

Figure 5: Visualization of (a) VQ codes, decoder link attention map, and the corresponding Mel spectrum; (b) co-occurrence map between ground-truth phonemes and style tokens. Strong co-occurrence is labeled with phoneme names. For higher resolution see Appendix B.

Settings | Frame | Frame w/ context
CPC + k-means | 9.6% | 9.5%
Group #0 | 48.1% | 62.7%
Group #1 | 21.0% | 45.9%
Group #0, #1 | 52.3% | 66.9%
Table 1: Phoneme information probing.

Methods | SV Accu | MOS | SMOS
AutoVC (Qian et al., 2019) | 17.9% | 1.65±0.11 | 1.92±0.13
AdaIN-VC (Chou and Lee, 2019) | 46.5% | 1.86±0.10 | 2.26±0.14
FragmentVC (Lin et al., 2021b) | 89.5% | 3.43±0.12 | 3.54±0.15
S2VC (Lin et al., 2021a) | 96.8% | 3.18±0.12 | 3.36±0.15
Retriever | 99.4% | 3.44±0.13 | 3.84±0.14
Table 2: Comparison with other methods.

5.1.3 Zero-Shot Voice Conversion Task

The zero-shot voice conversion task converts the source speaker’s voice into that of an arbitrary speaker unseen during training, while preserving the linguistic content. We follow the setting of previous work and randomly select 1,000 test utterance pairs of 18 unseen speakers from the CMU Arctic databases (Kominek and Black, 2004). Each conversion is given five target utterances to extract the style. Both objective and subjective metrics are used to evaluate conversion similarity (SV accuracy, SMOS) and speech quality (MOSNet score, MOS). More details of the four metrics are in Appendix F.2.

Existing approaches fall into two categories: content-style disentanglement approaches (Qian et al., 2019; Yuan et al., 2021; Chou and Lee, 2019; Wu et al., 2020) and deep concatenative approaches (Lin et al., 2021b, a). Methods in the first category use a shared style embedding for all the content features, while methods in the second category try to find the matching style for each content frame, resulting in high computational complexity. Our method provides a similar level of style granularity as the second class of approaches, but our bipartite graph representation is more compact and effective, as it retrieves content-specific style with a lightweight link attention operation. Table 2 shows the comparison between our Retriever-based method and four SOTA methods. The superior performance demonstrates the power of the content-style bipartite graph representation.

5.1.4 Ablation Study on Retriever

Based on the zero-shot voice conversion task, we carry out ablation studies on the VCTK + CMU Arctic dataset to better understand Retriever.

Style Encoder. Unlike most previous work, which is limited to learning a single speaker style vector, our method can learn an arbitrary number of style tokens. When the number of style tokens increases from 1 to 10 and then 60, the SV accuracy increases from 81.3% to 94.3% and 99.4%, showing the benefits of fine-grained style modeling.

Settings SV Accu MOSNet
AdaIN-decoder 72.2% 2.89
Too narrow B.N. 99.8% 2.96
Too wide B.N. 90.5% 3.10
Proper B.N. 99.4% 3.12
Table 3: Voice conversion quality under different content encoder and decoder settings.

Content Encoder. We study how the capacity of the information bottleneck affects the separation of content and style. Compared to the proper bottleneck setting, speech quality significantly drops when the bottleneck is “too narrow”, and conversion similarity drops when the bottleneck is “too wide”, as shown in Table 3. This indicates content loss and style leakage when the bottleneck is too narrow and too wide, respectively.

Decoder. To illustrate the advantages of our link attention-based decoder, we replace it with the traditional AdaIN module in our experiment. As the AdaIN module only takes a single vector as input, we flatten the style tokens produced by Retriever into one long vector. Results in Table 3 show that, although the amount of information provided to the decoder is the same, the AdaIN-based decoder suffers a significant performance drop. This confirms the value of link attention.

5.2 Image Domain

The image domain is one of the first in which content-style disentanglement was studied. Conventionally, shape is treated as content and appearance as style. Interestingly, our unified definition of content and style aligns with this intuitive understanding of images.

5.2.1 Training Retriever for Images

To tokenize an image, we downsample it using a stack of convolutions. For the VQ module, we set the number of groups to 1, while the codebook size depends on the dataset. A convolution is applied after the VQ operation. We detokenize the output tokens to images with convolutions and PixelShuffle layers (Shi et al., 2016). For the reconstruction loss $\mathcal{L}_{rec}$, we apply a typical perceptual loss (Chen and Koltun, 2017). For the structural constraint $\mathcal{L}_{str}$, we adopt a geometric concentration loss (Hung et al., 2019). We choose two commonly used datasets: CelebA-Wild (Liu et al., 2015) and DeepFashion (Liu et al., 2016). Images in CelebA-Wild are not aligned, and each face has five landmark coordinates. The full-body images from DeepFashion are used. See Appendix E for more details and experiments.

5.2.2 Unsupervised Co-part Segmentation

Methods | K=4 | K=8
SCOPS (w/o saliency) | 46.62 | 22.11
SCOPS (with saliency) | 21.76 | 15.01
Liu et al. (2021) | 15.39 | 12.26
Retriever | 13.54 | 12.14
Table 4: Landmark regression results on CelebA-Wild. K indicates the number of foreground parts.

Unsupervised co-part segmentation aims to discover and segment semantic parts for an object. It indicates whether the content representations are semantically meaningful. Existing methods (Hung et al., 2019; Liu et al., 2021) rely heavily on hand-crafted dataset-specific priors, such as background assumption and transformations. We choose SCOPS (Hung et al., 2019) and Liu et al. (2021) as baselines and follow the same setting to use landmark regression on Celeba-Wild dataset as a proxy task for evaluation. Please refer to the Appendix E.2 for details.

The content code of Retriever is visualized in Figure 6 (a) for the CelebA-Wild dataset and in Figure 8 for the DeepFashion dataset. By setting a small codebook size in Retriever, the encoded content preserves only the basic shape information, in the form of segmentation maps. Each content token is consistently aligned with a meaningful part across different images. When compared to the state-of-the-art methods, as Table 4 shows, Retriever is one of the top performers on this task, even without any hand-crafted transformations or additional saliency maps.

(a) Co-part (b) Full (c) Eye (d) Jaw
Figure 6: Co-part segmentation and style transfer results on Celeba-Wild. Our method achieves desired appearance transfer even on the unaligned dataset.
(a) Upper body (b) Lower body (c) Head
Figure 7: Part-level appearance manipulation on DeepFashion. Our method can retrieve the correct appearance even with occlusion and large deformation.
Figure 8: Shape and appearance transferring on DeepFashion.

5.2.3 Unsupervised Part-level Style Transfer

Part-level style transfer requires disentangling shape and appearance at the part level, which is particularly challenging in an unsupervised setting. Previous works (Lorenz et al., 2019; Liu et al., 2021) use a one-to-one matching between shape and appearance, which is inflexible and leads to unrealistic results. Our content-style bipartite graph separates content and style at the part level and enables more flexible, content-specific style modeling. We can perform part-level style transfer by calculating the co-occurrence map between content and style tokens. See Appendix B for more details.

Both image-level and part-level transfer results are shown in Figures 6, 7, and 8. Retriever enables explicit control of local appearance. Even without any supervision for the highly unaligned image pairs, our method transfers appearance to a target shape with high visual quality. See Appendix E.4 for more results.

6 Conclusion

In this paper, we have designed and implemented an unsupervised framework for learning separable and interpretable content-style representations. At the core of the proposed Retriever framework are two retrieval operations powered by the innovative use of multi-head attention: cross-attention serves as the style encoder to retrieve style from the input structured data, and link attention serves as the decoder to retrieve content-specific style for data reconstruction. We have demonstrated that structural constraints can be integrated into our framework to improve the interpretability of the content, and this interpretability propagates to the style through the innovative bipartite graph representation. As a result, the proposed Retriever enables several fine-grained downstream tasks and achieves superior performance. As for limitations, we have found in experiments that different tasks on different datasets require different model settings. We currently lack theoretical guidance for determining these settings and plan to address this in future work.


  • Y. Alharbi and P. Wonka (2020) Disentangled image generation through structured noise injection. In CVPR, Cited by: §2.
  • A. Baevski, S. Schneider, and M. Auli (2020a) vq-wav2vec: Self-supervised learning of discrete speech representations. In ICLR, Cited by: Appendix A, §4.2, §4.2.
  • A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020b) Wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS, Cited by: Appendix A, §4.2, §4.2.
  • Y. Bengio, A. C. Courville, and P. Vincent (2013) Representation learning: A review and new perspectives. TPAMI. Cited by: §1.
  • D. Bouchacourt, R. Tomioka, and S. Nowozin (2018) Multi-level variational autoencoder: learning disentangled representations from grouped observations. In AAAI, Cited by: §2.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In ECCV, Cited by: §1, §4.3, §4.3.
  • Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In ICCV, Cited by: §5.2.1.
  • J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord (2019) Unsupervised speech representation learning using wavenet autoencoders. IEEE ACM Trans. Audio Speech Lang. Process.. Cited by: §5.1.2.
  • J. Chou and H. Lee (2019) One-shot voice conversion by separating speaker and content representations with instance normalization. In Interspeech, Cited by: §F.4, §1, §2, §2, §5.1.3, Table 2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §E.3.
  • J. Ebbers, M. Kuhlmann, T. Cord-Landwehr, and R. Haeb-Umbach (2021) Contrastive predictive coding supported factorized variational autoencoder for unsupervised learning of disentangled speech representations. In ICASSP, Cited by: §2.
  • P. Esser, R. Rombach, and B. Ommer (2021) Taming transformers for high-resolution image synthesis. In CVPR, Cited by: §4.2.
  • A. Gabbay and Y. Hoshen (2020) Demystifying inter-class disentanglement. In ICLR, Cited by: §2.
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, Cited by: Figure 20, §2.
  • D. A. Hudson and L. Zitnick (2021) Generative adversarial transformers. In ICML, Cited by: §2.
  • W. Hung, V. Jampani, S. Liu, P. Molchanov, M. Yang, and J. Kautz (2019) SCOPS: self-supervised co-part segmentation. In CVPR, Cited by: §D.1, §E.2, §5.2.1, §5.2.2.
  • A. Jaegle, S. Borgeaud, J. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, O. J. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals, and J. Carreira (2021a) Perceiver IO: A general architecture for structured inputs & outputs. CoRR abs/2107.14795. Cited by: §4.3.
  • A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira (2021b) Perceiver: general perception with iterative attention. In ICML, Cited by: §4.3.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR, Cited by: Appendix A.
  • A. H. Jha, S. Anand, M. Singh, and V. S. R. Veeravasarapu (2018) Disentangling factors of variation with cycle-consistent variational auto-encoders. In ECCV, Cited by: §2.
  • H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo (2018) StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks. In Spoken Language Technology Workshop, Cited by: §2.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, Cited by: §2.
  • J. Kominek and A. W. Black (2004) The cmu arctic speech databases. In Fifth ISCA workshop on speech synthesis, Cited by: §5.1.3.
  • G. Kwon and J. C. Ye (2021) Diagonal attention and style-based GAN for content-style disentanglement in image generation and translation. In ICCV, Cited by: §2.
  • J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh (2019) Set Transformer: A framework for attention-based permutation-invariant neural networks. In ICML, Cited by: §1, §4.3, §4.3.
  • J. Lin, Y. Y. Lin, C. Chien, and H. Lee (2021a) S2VC: A framework for any-to-any voice conversion with self-supervised pretrained representations. In Interspeech, Cited by: §F.1, §F.4, §5.1.3, Table 2.
  • Y. Y. Lin, C. Chien, J. Lin, H. Lee, and L. Lee (2021b) Fragmentvc: any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention. In ICASSP, Cited by: §F.2, §F.4, §5.1.3, Table 2.
  • S. Liu, L. Zhang, X. Yang, H. Su, and J. Zhu (2021) Unsupervised part segmentation through disentangling appearance and shape. In CVPR, Cited by: §D.1, §E.2, §E.4, §2, §5.2.2, §5.2.3, Table 4.
  • Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In CVPR, Cited by: §5.2.1.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In ICCV, Cited by: Figure 20, §5.2.1.
  • C. Lo, S. Fu, W. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H. Wang (2019) MOSNet: deep learning-based objective assessment for voice conversion. In Interspeech, Cited by: §F.2.
  • D. Lorenz, L. Bereska, T. Milbich, and B. Ommer (2019) Unsupervised part-based disentangling of object shape and appearance. In CVPR, Cited by: Figure 21, Figure 22, §E.4, §1, §2, §5.2.3.
  • M. Mathieu, J. J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In NeurIPS, Cited by: §2.
  • V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In ICASSP, Cited by: §F.6.
  • K. Qian, Y. Zhang, S. Chang, M. Hasegawa-Johnson, and D. D. Cox (2020) Unsupervised speech decomposition via triple information bottleneck. In ICML, Cited by: §F.1.
  • K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson (2019) Autovc: zero-shot voice style transfer with only autoencoder loss. In ICML, Cited by: §F.4, §2, §5.1.3, Table 2.
  • A. Razavi, A. van den Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, Cited by: §4.2.
  • X. Ren, T. Yang, Y. Wang, and W. Zeng (2021) Rethinking content and style: exploring bias for unsupervised disentanglement. In ICCVW, Cited by: §1, §2.
  • M. Rivière, A. Joulin, P. Mazaré, and E. Dupoux (2020) Unsupervised pretraining transfers well across languages. In ICASSP, Cited by: §5.1.1.
  • B. Saleh and A. M. Elgammal (2015) Large-scale classification of fine-art paintings: learning the right metric on the right feature. CoRR abs/1505.00855. Cited by: Figure 20.
  • W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, Cited by: §5.2.1.
  • A. Szabó, Q. Hu, T. Portenier, M. Zwicker, and P. Favaro (2018) Challenges in disentangling independent factors of variation. In ICLRW, Cited by: §2.
  • J. B. Tenenbaum and W. T. Freeman (2000) Separating style and content with bilinear models. Neural computation. Cited by: §1, §2.
  • A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. In NeurIPS, Cited by: §1, §4.2.
  • L. Van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research. Cited by: Appendix C.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: Appendix A, §4.1.
  • C. Veaux, J. Yamagishi, K. MacDonald, et al. (2017) CSTR vctk corpus: english multi-speaker corpus for cstr voice cloning toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR). External Links: Document Cited by: §5.1.1.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §E.4.
  • D. Wu, Y. Chen, and H. Lee (2020) VQVC+: one-shot voice conversion by vector quantization and u-net architecture. In Interspeech, Cited by: §4.2, §5.1.3.
  • W. Wu, K. Cao, C. Li, C. Qian, and C. C. Loy (2019) Disentangling content and style via unsupervised geometry distillation. In ICLRW, Cited by: §1, §2.
  • T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. B. Girshick (2021) Early convolutions help transformers see better. CoRR abs/2106.14881. Cited by: §E.3.
  • R. Yamamoto, E. Song, and J. Kim (2020) Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP, Cited by: §5.1.1.
  • W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas (2021) VideoGPT: video generation using VQ-VAE and transformers. CoRR abs/2104.10157. External Links: Link, 2104.10157 Cited by: §4.2.
  • S. Yuan, P. Cheng, R. Zhang, W. Hao, Z. Gan, and L. Carin (2021) Improving zero-shot voice style transfer via disentangled representation learning. In ICLR, Cited by: §2, §5.1.3.

Appendix A Background

Attention. Given $n$ query vectors, each with dimension $d_k$, stacked as $Q \in \mathbb{R}^{n \times d_k}$, an attention function maps the queries to outputs using $m$ key-value pairs $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where the dot product $QK^{\top}$ measures the similarity between queries and keys, and $\mathrm{softmax}$ normalizes this similarity matrix to serve as the weights for combining the elements in $V$.
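As a concrete reference, the attention function can be sketched in a few lines of NumPy (an illustrative single-head sketch with random inputs, not the paper's implementation):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                              # (n, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # n = 4 queries, d_k = 8
K = rng.normal(size=(6, 8))    # m = 6 keys
V = rng.normal(size=(6, 16))   # m = 6 values, d_v = 16
out = attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Because the softmax rows sum to 1, each output row is a convex combination of the rows of $V$.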

Multi-Head Attention. Vaswani et al. (2017) propose multi-head attention, denoted as $\mathrm{MHA}(\cdot)$. It projects $Q$, $K$, and $V$ into $h$ different groups of query, key, and value vectors, and applies the attention function to each group. The output is a linear transformation of the concatenation of all attention outputs:

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the projection matrices for query, key, and value in the $i$-th head, and $W^{O}$ is the linear transformation that fuses all the heads' outputs.

Self-Attention. Given an input array $X$, a self-attention or multi-head self-attention operation takes $X$ as query, key, and value: $\mathrm{SelfAttn}(X) = \mathrm{MHA}(X, X, X)$.

Cross-Attention. Given input arrays $X$ and $Z$, a cross-attention or multi-head cross-attention operation takes $X$ as query and takes $Z$ as key and value: $\mathrm{CrossAttn}(X, Z) = \mathrm{MHA}(X, Z, Z)$.

Property: the (multi-head) cross-attention operation is P.I. to $Z$, that is,

$$\mathrm{MHA}(X, \pi(Z), \pi(Z)) = \mathrm{MHA}(X, Z, Z), \quad \forall \pi \in \Pi,$$

where $\Pi$ is the set of all permutations of the row indices of $Z$.
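This permutation-invariance property is easy to check numerically. Below is a minimal single-head sketch without learned projections (all shapes are illustrative):

```python
import numpy as np

def cross_attention(X, Z):
    """Single-head cross-attention: X provides queries, Z provides keys/values."""
    d_k = X.shape[-1]
    scores = X @ Z.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ Z

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))       # queries (e.g., learned style prototypes)
Z = rng.normal(size=(7, 8))       # structured input tokens
perm = rng.permutation(len(Z))    # an arbitrary reordering of Z's rows

out = cross_attention(X, Z)
out_perm = cross_attention(X, Z[perm])
print(np.allclose(out, out_perm))  # True: the output is P.I. to Z
```

Permuting the rows of $Z$ only permutes the softmax weights and the value rows consistently, so the weighted sum is unchanged.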

Permutation Invariant Property of the Style Encoder. Let $X$ denote the input tokens and $S^{(l)}$ the input of the $l$-th style encoder block; assume $S^{(l)}$ is P.I. to $X$. The output of the multi-head cross-attention layer is $\mathrm{MHA}(S^{(l)}, X, X)$, which is P.I. to $X$ by the property of multi-head cross-attention. The following residual connection combines $\mathrm{MHA}(S^{(l)}, X, X)$ with $S^{(l)}$, both of which are P.I. to $X$; thus the combination is P.I. to $X$. The subsequent token mixing layer and feed-forward layer do not take $X$ as input; because their inputs are P.I. to $X$, their outputs are also P.I. to $X$. To summarize, if $S^{(l)}$ is P.I. to $X$, then the block output $S^{(l+1)}$ is also P.I. to $X$. The initial input $S^{(1)}$ is trivially P.I. to $X$, since it is fixed during the forward path. Therefore, the output of the whole style encoder is P.I. to $X$.

Vector Quantization. For efficiency, we use product quantization (Baevski et al., 2020a;b). The quantized feature $q$ is the concatenation of $G$ embeddings $\{q_g\}$, which are looked up from the corresponding codebooks $\{C_g\}$, each containing $V$ entries. To make the VQ operation differentiable, Gumbel-softmax (Jang et al., 2017) is adopted. First, each input token is mapped to logits $l \in \mathbb{R}^{G \times V}$. Then $q_g$ is calculated as a weighted sum of the codewords in codebook $C_g$, where the weight for the $v$-th code in group $g$ is given by:

$$w_{g,v} = \frac{\exp\big((l_{g,v} + n_{g,v}) / \tau\big)}{\sum_{v'=1}^{V} \exp\big((l_{g,v'} + n_{g,v'}) / \tau\big)},$$

where $\tau$ is the temperature and $n_{g,v} = -\log(-\log(u_{g,v}))$ with $u_{g,v} \sim \mathcal{U}(0, 1)$ sampled independently for all subscripts. We use the straight-through estimator (Jang et al., 2017), treating the forward and backward paths differently: during the forward path, the one-hot version of the above weight is used, while in the backward path, the gradient of the original soft weight $w$ is used. To encourage all the VQ codes to be used equally, we apply a batch-level VQ perplexity loss that encourages a high entropy $H(\bar{w}_g)$ for every group, where $\bar{w}_g$ is the probability $w_g$ averaged across all the tokens in a batch and $H(\cdot)$ calculates the entropy of a discrete distribution.
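A minimal NumPy sketch of the Gumbel-softmax code selection for a single group (forward pass only; the straight-through backward pass requires an autograd framework, and all names and shapes here are illustrative):

```python
import numpy as np

def gumbel_softmax_vq(logits, codebook, tau, rng):
    """Select a codeword differentiably (forward pass only, single group).

    logits:   (V,) unnormalized scores over V codebook entries
    codebook: (V, d) codeword embeddings
    The forward pass uses the hard one-hot weights; in an autograd framework,
    the backward pass would reuse the soft weights (straight-through).
    """
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                          # Gumbel noise
    z = (logits + g) / tau
    w_soft = np.exp(z - z.max())
    w_soft /= w_soft.sum()
    w_hard = np.eye(len(logits))[np.argmax(w_soft)]  # one-hot forward weights
    return w_hard @ codebook, w_soft

rng = np.random.default_rng(0)
V, d = 5, 4
codebook = rng.normal(size=(V, d))
q, w = gumbel_softmax_vq(rng.normal(size=V), codebook, tau=0.5, rng=rng)
print(any(np.allclose(q, c) for c in codebook))  # True: q is exactly one codeword
```

With the hard weights in the forward pass, the selected feature is an exact codebook entry, while the soft weights keep the operation differentiable.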

Appendix B Visualization of learned Content-Style Bipartite Graph

Retriever models the content-style bipartite graph by introducing the novel link attention module. The graph's edges are represented by the link attention map. To understand which content is linked to which style, we gather the statistics of the link attention map as the content-style co-occurrence map. To show content information in a 2-d map, the content is categorized by either the ground-truth labels or the discovered parts or units. Specifically, the co-occurrence statistic of content category $i$ and style token $j$, denoted as $M_{i,j}$, is calculated as follows:

$$M_{i,j} = \sum_{c \in \mathcal{T}} \mathbb{1}\big[\mathrm{cat}(c) = i\big] \cdot A(c, j),$$

where $\mathcal{T}$ is the set of all the content tokens in the target dataset, $A(c, j)$ is the link attention amplitude between content token $c$ and style token $j$, $\mathbb{1}[\cdot]$ is the characteristic function that outputs 1 if its input is true and 0 otherwise, and $\mathrm{cat}(c)$ gives the content category of token $c$.

For each style token, we consider the content category that shows the strongest co-occurrence with it to be the major content category it serves. To make the pattern inside the co-occurrence map easier to find, we normalize over both the content category axis and the style token axis, and then sort the style tokens so that their major content category indices are in ascending order.
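The computation of the co-occurrence map can be sketched as follows (random stand-ins for the link attention amplitudes; the normalization order is one reasonable reading of the procedure above):

```python
import numpy as np

# Illustrative stand-ins: 100 content tokens, 6 categories, 8 style tokens.
rng = np.random.default_rng(0)
n_tokens, n_cats, n_styles = 100, 6, 8
attn = rng.random((n_tokens, n_styles))        # link attention amplitudes A(c, j)
cats = np.arange(n_tokens) % n_cats            # content category of each token

# Accumulate M[i, j]: total amplitude from tokens of category i to style token j.
M = np.zeros((n_cats, n_styles))
np.add.at(M, cats, attn)

# Normalize over both axes, then sort style tokens by their major category.
M /= M.sum(axis=1, keepdims=True)
M /= M.sum(axis=0, keepdims=True)
M = M[:, np.argsort(M.argmax(axis=0))]
print(M.shape)  # (6, 8)
```

`np.add.at` performs the unbuffered indexed accumulation, so repeated category indices all contribute to the same row.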

In the audio domain, we categorize content tokens using ground-truth phoneme labels and calculate the co-occurrence map using the first link attention. The resulting content-style co-occurrence map is shown in Figure 9.

Figure 9: Phoneme-style co-occurrence map of higher resolution.

In the image domain, we categorize content tokens using the discovered parts and calculate the co-occurrence map using the third link attention. The part indices, together with the content-style co-occurrence map, are shown in Figure 10.

Figure 10: Visualization of part-style co-occurrence map on Celeba-Wild and DeepFashion dataset.

Appendix C Style Space Visualization

To provide an intuitive impression of the styles decomposed by our framework, we present t-SNE visualizations (Van der Maaten and Hinton, 2008) of some style spaces. We observe clustered patterns in many style spaces that correspond to meaningful attributes such as gender and clothes color.

C.1 Speech Domain

We visualize the style vectors extracted from 500 randomly selected utterances of ten unseen speakers from the LibriSpeech test-clean split. In Figure 11, we show the latent space of four typical style vectors, which correspond to the styles of a nasal consonant (NG), a voiceless consonant (T), a voiced consonant (Z), and a vowel (IH). Different speakers are labeled in different colors, and utterances from different speakers fall into distinct clusters.

Figure 11: Style space t-SNE visualization in speech domain: utterances of the same speaker cluster together.

C.2 Image Domain

We examine the style vector spaces on the DeepFashion dataset and find that some of them are, separately or collectively, correlated with human-defined attributes. For example, style vector #43 is highly correlated with the color of the clothes. Figure 12 shows the t-SNE visualization of this vector space, together with some images from each cluster. Images are clearly clustered solely by clothes color, despite varying model identities, standing poses, and dress styles.

As this dataset also has gender attribute labels, we further look for a 'gender vector' among the decomposed style vector spaces. Interestingly, we find 14 vector spaces that are highly correlated with the gender attribute. We concatenate these 14 vectors and show the t-SNE visualization in Figure 13. A clear boundary separates male and female models, regardless of clothes color or standing pose.

Figure 12: Style space t-SNE visualization in image domain: style vector #43 is a ‘clothes color vector’.
Figure 13: Style space t-SNE visualization in image domain: an ensemble of 14 style vectors identifies gender.

Appendix D Structural Constraint

D.1 Image Domain

To make the content representation interpretable, we can apply an additional structural constraint. Motivated by the observation that pixels belonging to the same object part are usually spatially concentrated and form a connected region, we follow Hung et al. (2019) and use a geometric concentration loss as the structural constraint to shape the content representation into part segments. In practice, we treat the pixels assigned to the first entry in the VQ codebook as background pixels and all other pixels as foreground pixels.

Given the logits $p_k(u, v)$ that assign each pixel $(u, v)$ to the parts (entries in the VQ codebook), for foreground pixels:

$$\mathcal{L}_{\mathrm{con}} = \sum_{k} \sum_{u,v} \big\| (u, v) - (c_u^k, c_v^k) \big\|^2 \cdot \frac{p_k(u, v)}{z_k},$$

where $c_u^k$ and $c_v^k$ are the coordinates of the $k$-th part center along the two spatial axes, and $z_k$ is a normalization term. They can be calculated as follows:

$$c_u^k = \sum_{u,v} u \cdot \frac{p_k(u, v)}{z_k}, \qquad c_v^k = \sum_{u,v} v \cdot \frac{p_k(u, v)}{z_k}, \qquad z_k = \sum_{u,v} p_k(u, v).$$
For the background pixels, unlike previous methods that use a background loss (Liu et al., 2021) or a saliency map to bound the region of the object of interest, we do not add any constraint.
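A sketch of the geometric concentration loss in the SCOPS formulation (the function name, shapes, and toy assignment maps are all illustrative):

```python
import numpy as np

def concentration_loss(p, eps=1e-8):
    """Geometric concentration loss for part assignment maps.

    p: (K, H, W) soft assignment of each pixel to K foreground parts.
    Penalizes the spatial spread of each part around its soft center.
    """
    K, H, W = p.shape
    u, v = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    loss = 0.0
    for k in range(K):
        z = p[k].sum() + eps                 # normalization term z_k
        cu = (u * p[k]).sum() / z            # part center along u
        cv = (v * p[k]).sum() / z            # part center along v
        loss += (((u - cu) ** 2 + (v - cv) ** 2) * p[k] / z).sum()
    return loss

# A concentrated blob should score lower than a scattered one.
p_tight = np.zeros((1, 16, 16)); p_tight[0, 7:9, 7:9] = 1.0
p_scatter = np.zeros((1, 16, 16)); p_scatter[0, ::8, ::8] = 1.0
print(concentration_loss(p_tight) < concentration_loss(p_scatter))  # True
```

The loss is the assignment-weighted variance of pixel coordinates around each part center, so compact parts are preferred.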

D.2 Audio Domain

To discover phoneme-like units in speech, the discrete representation should not switch too frequently along the time axis. Therefore, we penalize code switches using a neighborhood cross-entropy:

$$\mathcal{L}_{\mathrm{CE}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathrm{CE}(p_t, \hat{c}_{t+1}),$$

where $T$ is the length of the input sequence, $\mathrm{CE}(\cdot, \cdot)$ calculates the cross-entropy between a discrete probability and a given label, $p_t$ denotes the predicted discrete probability before Gumbel-softmax, and $\hat{c}_t$ is the highest-probability code at position $t$.

However, $\mathcal{L}_{\mathrm{CE}}$ also penalizes true phoneme switches, where the input acoustic feature really does change dramatically. To solve this problem, we use a truncated version of $\mathcal{L}_{\mathrm{CE}}$ as our structural constraint:

$$\mathcal{L}_{\mathrm{struct}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \min\big(\mathrm{CE}(p_t, \hat{c}_{t+1}), \tau\big),$$

where $\tau$ is a hyper-parameter indicating the truncation threshold. At a true phoneme switch, a high cross-entropy is expected, so the loss is truncated; within the same phoneme, a low cross-entropy is expected, and the loss behaves like the original CE loss. As such, the truncated loss is more suitable for discovering phoneme-like units. In our experiment, $\tau$ is set to a fixed, empirically chosen value.
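The truncated neighborhood cross-entropy can be sketched as follows (an illustrative NumPy version; `truncated_switch_loss` and the toy logits are our own naming, not the released code):

```python
import numpy as np

def truncated_switch_loss(logits, tau):
    """Truncated neighborhood cross-entropy over a code sequence.

    logits: (T, V) per-frame scores over V codes (before Gumbel-softmax).
    For each neighboring pair, the cross-entropy between frame t's predicted
    distribution and frame t+1's highest-probability code is computed, then
    truncated at tau so that true phoneme switches are not over-penalized.
    """
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    labels = p.argmax(axis=1)                  # highest-probability code per frame
    ce = -np.log(p[np.arange(len(p) - 1), labels[1:]] + 1e-9)
    return np.minimum(ce, tau).mean()

T, V = 10, 5
rng = np.random.default_rng(0)
steady = np.tile(np.array([[10.0, 0.0, 0.0, 0.0, 0.0]]), (T, 1))  # no switches
noisy = rng.normal(size=(T, V))                                   # frequent switches
print(truncated_switch_loss(steady, tau=4.0) < truncated_switch_loss(noisy, tau=4.0))  # True
```

A steady code sequence incurs almost no penalty, while frequent switches are penalized up to the truncation threshold.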

Appendix E Image: Additional Details

E.1 Implementation Details

We first show the tokenization module in Table 5 and the detokenization module in Table 6. For the layer numbers, we set to , respectively. The hyper-parameter settings of each style-encoder layer and decoder layer are shown in Table 7 and Table 8; the listed values include the hidden dimension of the feed-forward network and the number of heads used in the multi-head attention modules. Specifically, for the decoder, we replace the FFN with a Mix-FFN to improve the image quality. The Mix-FFN can be formulated as:

$$x_{\mathrm{out}} = \mathrm{MLP}\big(\mathrm{GELU}(\mathrm{Conv}_{3\times3}(\mathrm{MLP}(x_{\mathrm{in}})))\big) + x_{\mathrm{in}},$$

where $x_{\mathrm{in}}$ is the input feature. The Gumbel-softmax temperature in VQ is annealed from 2 to a minimum of 0.01 by a factor of 0.9996 at every update.
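The stated annealing schedule amounts to a one-liner (a sketch; `vq_temperature` is an illustrative name):

```python
# Start at 2.0, multiply by 0.9996 at every update, floor at 0.01.
def vq_temperature(step, tau0=2.0, factor=0.9996, tau_min=0.01):
    return max(tau_min, tau0 * factor ** step)

print(vq_temperature(0))  # 2.0
print(vq_temperature(1000) < 2.0)  # True: temperature decays with training
```

The floor keeps the Gumbel-softmax from becoming fully deterministic too early while still sharpening the code selection over training.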

For the CelebA-Wild dataset, we resize the input to . For the DeepFashion dataset, we resize the input to . For the number of style tokens, we use . We summarize the training setting in Table 9. Our model is implemented with PyTorch and trained on 4 Nvidia V100 GPUs.

Tokenization: Conv (stride) + BatchNorm + ReLU, repeated four times, followed by Conv (stride). Detokenization: Conv (stride) + PixelShuffle(2) + Conv (stride) + ReLU + PixelShuffle(2) + Conv (stride) + ReLU + Conv (stride).
Table 5: Tokenization module for images.
Table 6: Detokenization module for images.
Style encoder: model dimension 192, FFN hidden dimension 768, attention heads 4. Decoder: model dimension 192, FFN hidden dimension 768, attention heads 4.
Table 7: Style encoder hyper-parameters.
Table 8: Decoder hyper-parameters.
Hyper-parameter Value
4 or 7
optimizer Adam ()
Learning rate 0.001
Batchsize 16
Iteration 150,000
Table 9: Training setting.

E.2 Details about Evaluation on Co-part Segmentation

For the evaluation of co-part segmentation, we follow the setting in Liu et al. (2021) and Hung et al. (2019). The proxy task for CelebA-Wild is landmark regression. We first convert part segmentations into landmarks by taking the part centers, as in Eq. 5. Then we fit a linear regressor that maps the converted landmarks to the ground-truth landmarks and evaluate the regression error on the test data.
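The proxy evaluation can be sketched with synthetic data (purely illustrative: random part centers and a random linear relation stand in for real model outputs and ground-truth landmarks):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_parts, n_landmarks = 200, 50, 4, 5

def fit_linear_regressor(X, Y):
    """Least-squares map (with bias) from part centers to landmark coordinates."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def predict(W, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ W

centers_train = rng.random((n_train, n_parts * 2))   # (u, v) per discovered part
centers_test = rng.random((n_test, n_parts * 2))
A = rng.normal(size=(n_parts * 2, n_landmarks * 2))  # synthetic centers->landmarks map
lm_train, lm_test = centers_train @ A, centers_test @ A

W = fit_linear_regressor(centers_train, lm_train)
err = np.abs(predict(W, centers_test) - lm_test).mean()
print(err < 1e-6)  # True: the exact linear relation is recovered
```

In the real evaluation, the regression error on held-out test images measures how well the discovered parts align with annotated landmarks.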

E.3 Ablation Study

Method Landmark regression error
Ours w/ bipartite graph 12.14
Ours w/ AdaIN 13.07
Ours w/o style branch 15.62
Ours w/ downsampling rate 44.89
Ours w/ downsampling rate 30.62
Ours w/ downsampling rate 12.14
Ours w/ 19.32
Ours w/ 12.14
Ours w/ 16.75
Ours w/ 15.98
Ours w/ 13.58
Ours w/ 12.84
Ours w/ 12.14
Ours w/ 21.60

Ours w/ Conv
Ours w/ PatchMerge 17.16
Table 10: Ablation study on CelebA-Wild.

Do the style encoder and decoder impact co-part segmentation? In this work, we find the VQ module suitable for unsupervised co-part segmentation, as each VQ code acts as a cluster center over the whole dataset. Here we evaluate whether the style branch further helps co-part segmentation. As shown in Table 10, with a style branch, the discovered part segmentation is more consistent. Moreover, we compare our link attention decoder with the AdaIN operation and find that our design is better for the co-part segmentation task. This further verifies that our content-style bipartite graph is a powerful representation with high interpretability.

On using different structural constraint weights. We further carry out experiments with different loss weights for the structural constraint. Without the structural constraint, the learned parts are scattered all over the image. As the weight of the structural constraint gets larger, the part segmentation becomes tighter. With a weight that is too large, the regions lose flexibility and fail to cover the parts.

On using different VQ diversity loss weights. We also test our model's sensitivity to the VQ diversity loss weight. As long as it is not set to zero or made too large (e.g., 1), the model shows a fairly stable landmark regression error as the weight is adjusted from 0.01 to 0.3.

Figure 14: Visual ablation study on weight of structural constraint .

On using different tokenization modules. Different tokenization modules introduce different biases. Besides a stack of convolution layers to tokenize the input image, we also experiment with the PatchMerge operation of ViT (Dosovitskiy et al., 2021), which in our case is a convolution layer with stride 4. As shown in Table 10, using convolution layers as the tokenization module helps the preprocessing transformer blocks inside Retriever work better, which is also observed in concurrent work (Xiao et al., 2021).

Ours w/ PatchMerge
Ours w/ Conv
Figure 15: Visual ablation study on tokenization module. Tokenizing the images using convolutions can help our model extract the content and style better, which leads to better transfer results.

On using different downsampling rates. In this work, we mainly use a single downsampling rate, tokenizing the input image to a fixed token-map resolution. Here we experiment with two additional, larger downsampling rates by adding extra convolution layers for further downsampling. We do not try a smaller rate due to limited computational resources. As shown in Table 10, among the selected downsampling rates, larger ones result in worse performance. We also provide a visual comparison in Figure 16.

Figure 16: Visual ablation study on downsampling rate. Larger downsampling rate leads to coarser co-part segmentation results.

E.4 More Qualitative Results

Co-part segmentation. We first provide more visualizations on the CelebA-Wild dataset in Figure 17. We also apply our method to the CUB dataset (Wah et al., 2011), which contains 11,788 photos of 200 bird species in total. This dataset is challenging because the poses of the birds are highly diverse. Following Liu et al. (2021), we select the photos from the first 3 categories, obtaining about 200 photos. The results are shown in Figure 18.

Figure 17: More co-part segmentation results on Celeba-Wild dataset.
Figure 18: Co-part segmentation results on the CUB dataset. We set the number of parts to 4. Similar to previous work, due to the geometric concentration loss, the model struggles to distinguish between orientations of near-symmetric objects, which is a common limitation.

Part-level style transfer. Besides the results in the main paper, we also apply our method to a high-resolution dataset, CelebA-HQ, with the images resized accordingly. Part-level style transfer results for the mouth, nose, and eye are shown in Figure 19.

Zero-shot image style transfer. We find our model capable of generalizing to unseen image datasets. In Figure 20, we use our model trained on CelebA-Wild to transfer artistic style from an artistic image. The transfer results keep the pose and shape of the content images (the first row) while adopting the tone and appearance of the style image (the first column on the left).

Visual comparison. Our method enables more natural style transfer than previous work. Figures 21 and 22 show side-by-side visual comparisons on the DeepFashion dataset. In image-level style transfer, the results of Lorenz et al. (2019) suffer from significant deformation artifacts; such artifacts are not observed in our results. In part-level style transfer, we show the results of head transfer. Again, our results are more natural and contain more image details.

Figure 19: Part-level style transfer results on CelebA-HQ dataset for mouth, nose and eye. The resolution is .
Figure 20: Zero-shot image style transfer. Our model is trained on CelebA-Wild but can be generalized to artistic images. The artist image (left) is obtained by AdaIN (Huang and Belongie, 2017) from CelebA-Wild (Liu et al., 2015) and WikiArt (Saleh and Elgammal, 2015) dataset.
(a) Ours (b) Lorenz et al. (2019)
Figure 21: Visual comparison on the DeepFashion dataset: image-level style transfer. Our results are more natural than the baseline's. Results of Lorenz et al. (2019) are cropped from their official website.
(a) Ours (b) Lorenz et al. (2019)
Figure 22: Visual comparison on Deepfashion dataset: head style transfer. Our results are more natural than the baseline and contain more image details.

Appendix F Audio: Additional Details

F.1 Implementation Details

For the tokenization module, we follow Lin et al. (2021a), using the s3prl toolkit to extract the CPC feature, followed by a depth-wise convolutional layer with kernel size 15. The depth-wise convolutional layer is trainable, while the CPC model is fixed during training. For the detokenization module, we use a publicly available implementation of Parallel WaveGAN pretrained on LibriTTS. The layer numbers are set to 0, 3, 4, respectively. The dimension of the content representation is . Following Qian et al. (2020), we add an auxiliary quantized normalized feature into the content for better voice quality. The Gumbel-softmax temperature is annealed from 2 to a minimum of 0.01 by a factor of 0.9996 at every update. The hyper-parameter settings of each style-encoder layer and decoder layer are shown in Table 11 and Table 12; the listed values include the hidden dimension of the feed-forward network and the number of heads used in the multi-head attention modules. The decoder output is converted to an 80-d log-Mel spectrum with an MLP whose hidden dimension is 4096. The training hyper-parameter setting is listed in Table 13. Each training sample is a random 4-second audio clip sampled from the training dataset; utterances shorter than 4 seconds are zero-padded at the end. The code is implemented with PyTorch, and training takes 5 hours on 4 Nvidia V100 GPUs.

Style encoder: model dimension 192, FFN hidden dimension 512, attention heads 4, dropout 0.1. Decoder: model dimension 512, FFN hidden dimension 2048, attention heads 8, dropout 0.1.
Table 11: Style encoder hyper-parameters.
Table 12: Decoder hyper-parameters.
Hyper-parameter Value
Optimizer Adam ()
Learning rate 0.004
Learning rate schedule power decay (warmup steps = 625)
Batch size 120
Epochs 50
Table 13: Training setting.

F.2 Evaluation Metrics of the Zero-shot VC Task

For objective metrics, conversion similarity is measured with the Resemblyzer speaker verification system by calculating the speaker verification accuracy (SV Accu) between the converted speech and the corresponding target utterance, as done in previous work (Lin et al., 2021b). A conversion is considered successful if the cosine similarity between the Resemblyzer speaker embeddings of the converted speech and the target utterance exceeds a pre-defined threshold. The threshold is decided based on the equal error rate (EER) of the SV system over the whole testing dataset. The SV accuracy is the percentage of successful conversions. The objective speech quality metric is estimated by MOSNet (Lo et al., 2019), which takes the converted speech as input and outputs a number ranging from 1 to 5 as a measure of speech naturalness. Both metrics are higher the better.
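The SV accuracy computation reduces to thresholded cosine similarity between speaker embeddings. A sketch with random vectors standing in for Resemblyzer embeddings (the threshold 0.7 is illustrative, not the EER-derived value):

```python
import numpy as np

def sv_accuracy(converted, targets, threshold):
    """Fraction of conversion pairs whose speaker-embedding cosine similarity
    to the target utterance exceeds the pre-defined threshold."""
    sims = [
        c @ t / (np.linalg.norm(c) * np.linalg.norm(t))
        for c, t in zip(converted, targets)
    ]
    return float(np.mean([s > threshold for s in sims]))

rng = np.random.default_rng(0)
targets = rng.normal(size=(10, 256))                # target speaker embeddings
good = targets + 0.1 * rng.normal(size=(10, 256))   # embeddings close to targets
bad = rng.normal(size=(10, 256))                    # unrelated embeddings

print(sv_accuracy(good, targets, threshold=0.7))  # 1.0
print(sv_accuracy(bad, targets, threshold=0.7))   # 0.0
```

Embeddings near their targets clear the threshold, while unrelated high-dimensional random vectors have near-zero cosine similarity.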

For subjective metrics, we conduct two tests to calculate the mean opinion scores of conversion similarity (denoted SMOS) and speech quality (denoted MOS). For conversion similarity, each subject listens to the target utterance and the converted utterance and judges how confident they are that the two are spoken by the same speaker. For speech quality, each subject listens to utterances randomly chosen from the converted and real speech and rates how natural they sound. Both tests are scored from 1 to 5, the higher the better. We randomly sample 40 utterances from the test set for both tests, and each utterance is evaluated by at least 5 subjects. The scores are averaged and reported with 95% confidence intervals.

f.3 Phoneme information probing: training and testing

We conduct phoneme information probing to test how much phoneme information is encoded in the learned discrete content tokens, so as to demonstrate their interpretability. We experiment with two settings: one considers contextual information (denoted as “Frame w/ context”), and the other considers only single-frame information (denoted as “Frame”). For these two settings, a 1d convolutional layer with kernel size 17 or a linear layer is used as the probing network, respectively, to predict the ground-truth phoneme label. As input to the probing network, each frame is represented by the one-hot vector of its corresponding code. For the experiment involving two groups of codes, the one-hot vectors of both groups are concatenated along the channel axis. For the k-means clustering of CPC features, the cluster number is set to 100, the same as the number of codebook entries per VQ group. We train the probing network on the LibriSpeech train-clean-100 split and test it on the LibriSpeech test-clean split. The Adam optimizer is used with . The learning rate and batch size are set to 0.00005 and 30, respectively. Each training and testing sample is a random 2-second segment from the dataset; utterances shorter than 2 seconds are dropped. In Table 2, we report the test accuracy after training converges.
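The probe input construction described above can be sketched as follows. `probe_inputs` is a hypothetical helper name; it builds the per-frame one-hot vectors for two VQ code groups (100 codebook entries each, per the text) and concatenates them along the channel axis.

```python
import numpy as np

def probe_inputs(codes_g1, codes_g2, codebook_size=100):
    """Concatenate per-frame one-hot vectors of two VQ code groups
    along the channel axis, giving a (T, 2 * codebook_size) array."""
    codes_g1 = np.asarray(codes_g1)
    codes_g2 = np.asarray(codes_g2)
    T = len(codes_g1)
    x = np.zeros((T, 2 * codebook_size), dtype=np.float32)
    x[np.arange(T), codes_g1] = 1.0                  # group 1 slots
    x[np.arange(T), codebook_size + codes_g2] = 1.0  # group 2 slots
    return x
```

Each frame thus carries exactly two active channels, one per group; the linear or conv1d probe then maps this 200-dim representation to phoneme labels.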

f.4 Details in system comparison

We choose four SOTA systems for comparison. Two content-style disentanglement approaches are AutoVC (Qian et al., 2019) and AdaIN-VC (Chou and Lee, 2019). Two deep concatenative approaches are FragmentVC (Lin et al., 2021b) and S2VC (Lin et al., 2021a). All of these methods have officially released code together with models pretrained on the VCTK dataset. We adopt these models and test them on our 1000 randomly sampled conversion pairs from the CMU Arctic dataset.

f.5 Inference scalability on zero-shot VC task

# target       1     2     3     5     10
SV Accu (%)    95.1  98.7  99.4  99.4  99.6
MOSNet         3.12  3.13  3.12  3.12  3.13
Table 14: Inference scalability.

Table 14 shows the inference scalability of our method on the VCTK + CMU Arctic dataset. “# target” indicates the number of target utterances available at inference. We see that Retriever performs quite well even when only one target utterance is given. As the number of available target utterances increases, the conversion similarity keeps rising, up to 99.6%, with no negative effect on speech quality, indicating that the extracted style becomes more accurate when more target samples are seen.
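Because the style tokens are defined as permutation-invariant information extracted by cross-attention over input frames, frames from multiple target utterances can simply be pooled along the time axis before style extraction. The sketch below illustrates this pooling step only; the helper name is hypothetical and the actual style-encoder call is framework-specific.

```python
import numpy as np

def pooled_style_input(target_mels):
    """Concatenate mel-spectrogram frames from several target
    utterances along the time axis. Since style extraction is
    permutation invariant over frames, the pooled sequence serves
    as a single, larger sample of the target style."""
    return np.concatenate(target_mels, axis=0)  # shape: (sum of T_i, n_mels)
```

Feeding the pooled frames to the style encoder is one plausible way to realize the multi-target inference of Table 14; more frames give a lower-variance estimate of the target style.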

f.6 Performance on LibriSpeech dataset

The LibriSpeech dataset (Panayotov et al., 2015) is more diverse than VCTK + CMU Arctic in terms of emotion, vocabulary, and speaker number, and is thus closer to real-world applications. In this experiment, the whole train-clean-100 split, containing 251 speakers, is used for training, and the test-clean split, containing 40 speakers, is used for testing. At test time, the conversion source-target pairs are built as follows: each utterance in the test set is treated as the source utterance and assigned one target utterance randomly sampled from the test set. In this way, each utterance in the dataset serves as the source exactly once.
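The pairing procedure above can be sketched as follows. Excluding the source utterance itself from the target candidates is an assumption; the text only says the target is randomly sampled from the test set.

```python
import random

def build_pairs(utterances, seed=0):
    """Each utterance serves as the source exactly once; its target
    is drawn at random from the rest of the test set (excluding the
    source itself, which is an assumption)."""
    rng = random.Random(seed)
    pairs = []
    for src in utterances:
        tgt = rng.choice([u for u in utterances if u is not src])
        pairs.append((src, tgt))
    return pairs
```

With a fixed seed the pairing is reproducible across evaluation runs.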

Figure 23: Performance on LibriSpeech dataset.

The objective similarity metric is measured for different numbers of style tokens, as shown in Figure 23. The conversion similarity increases as more style tokens are used, and the performance saturates at nearly 100%. Specifically, we achieve 98.4% SV accuracy and a 3.13 MOSNet score when using 60 style tokens, demonstrating that our method generalizes to more complex scenarios and can potentially be used in real-world applications.

f.7 Loss Weight Ablation Study

Method SV Accu (%) MOSNet
Ours w/ 98.9 3.01
Ours w/ 98.5 3.10
Ours w/ 99.0 3.11
Ours w/ 98.5 3.12
Ours w/ 99.0 3.10
Ours w/ 99.4 3.12
Ours w/ 99.7 3.07
Ours w/ 99.3 3.12
Ours w/ 99.4 2.94
Ours w/ 99.5 3.08
Ours w/ 99.4 3.12
Ours w/ 99.5 3.08
Table 15: Loss weight ablation study on VCTK + CMU Arctic dataset.

To test the effectiveness of the structural constraint and the VQ diversity loss, we conduct an ablation study on the corresponding loss weights, and . The SV accuracy and MOSNet scores achieved at different parameter settings are shown in Table 15. We observe that the absence of either loss term leads to a noticeable drop in speech quality (MOSNet score). We further test the model’s sensitivity to these two loss weights and empirically conclude that our framework is not sensitive to their selection. When increases from 0.002 to 0.1 (by 50 times), the SV accuracy only fluctuates by and the MOSNet score only fluctuates by . When increases from 0.01 to 2.0 (by 200 times), the SV accuracy only fluctuates by and the MOSNet score only fluctuates by .

In a nutshell, both the structural constraint and the VQ diversity loss are necessary, and our model is not sensitive to the choice of the corresponding loss weights within a reasonably large range.