[CVPR'20] PyTorch code for our paper "Collaborative Distillation for Ultra-Resolution Universal Style Transfer"
Universal style transfer methods typically leverage rich representations from deep Convolutional Neural Network (CNN) models (e.g., VGG-19) pre-trained on large collections of images. Despite their effectiveness, their application is heavily constrained by the large model size when handling ultra-resolution images given limited memory. In this work, we present a new knowledge distillation method (named Collaborative Distillation) for encoder-decoder based neural style transfer to reduce the number of convolutional filters. The main idea is underpinned by a finding that the encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for style transfer models. Moreover, to overcome the feature size mismatch when applying collaborative distillation, a linear embedding loss is introduced to drive the student network to learn a linear embedding of the teacher's features. Extensive experiments show the effectiveness of our method when applied to different universal style transfer approaches (WCT and AdaIN), even though the model size is reduced by 15.5 times. Especially, on WCT with the compressed models, we achieve ultra-resolution (over 40 megapixels) universal style transfer on a 12GB GPU for the first time. Further experiments on the optimization-based stylization scheme show the generality of our algorithm across different stylization paradigms. Our code and trained models are available at https://github.com/mingsun-tse/collaborative-distillation.
Universal neural style transfer (NST) focuses on composing a content image with new styles from any reference image. This often requires a model with considerable capacity to extract effective representations for capturing the statistics of arbitrary styles. Recent universal style transfer methods based on neural networks [13, 4, 24, 38] consistently show that employing the representations extracted by a pre-trained deep neural network like VGG-19 achieves both visually pleasing transferred results and generalization ability on arbitrary style images. However, given limited memory on hardware, the large model size of VGG-19 greatly constrains the resolution of input images. Up to now, current universal style transfer approaches only report results around one megapixel on a single GPU with 12GB memory. Although it is possible to achieve higher-resolution style transfer through multiple GPUs, the fundamental problem of the massive model size of VGG-19 still remains, hindering neural style transfer from practical applications, especially on mobile devices.
Meanwhile, recent years have witnessed rapid development in the area of model compression [17, 21, 36, 19], which aims at reducing the parameters of a large CNN model without considerable performance loss. Despite the progress, most model compression methods only focus on high-level tasks, e.g., classification [18, 55, 53, 46] and detection [60, 19]. Compressing models for low-level vision tasks is still less explored. Knowledge distillation (KD) [2, 1, 21]
is a promising model compression method that transfers the knowledge of a large network (called the teacher) to a small network (called the student), where the knowledge can be softened probabilities (which reflect the inherent class similarity structure known as dark knowledge) or sample relations (which reflect the similarity structure among different samples) [42, 44, 57, 45]. This knowledge works as extra information on top of the one-hot labels, and hence can boost the student's performance. However, this extra information is mainly label-dependent, thus hardly applicable to low-level tasks. What the dark knowledge is in low-level vision tasks (e.g., neural style transfer) remains an open question. Meanwhile, encoder-decoder based models are extensively employed in neural style transfer, where the decoder is typically trained via the knowledge of the encoder. Notably, they together construct an exclusive collaborative relationship in the stylization process, as shown in Fig. 2. Since the decoder D is trained to work exclusively with the encoder E, if another encoder E' can also work with D, it means E' can functionally play the role of E. Based on this idea, we propose a new kind of knowledge to distill the deep models in neural style transfer: the collaborative relationship between the encoder and decoder.
Given a redundant large encoder (e.g., VGG-19), we propose a two-step compression scheme: First, train a collaborator network for the encoder, namely, the decoder in our context; second, replace the large encoder with a small encoder, then train the small encoder with the collaborator fixed. Since the small encoder typically has fewer channels, its output feature has a smaller dimension than that of the large encoder. Therefore, the small network cannot directly work with the collaborator. To resolve this, we propose to restrict the student to learn a linear embedding of the teacher's output, so that the teacher's output can be reconstructed through a simple linear combination of the student's output before being fed into the collaborator.
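As a concrete illustration of the second step, the dimensionality gap between the small encoder's output and the decoder's expected input can be bridged by a learned linear map. Below is a minimal PyTorch sketch; the module name and channel counts are our own illustrations, not the paper's released code:

```python
import torch
import torch.nn as nn

class LinearBridge(nn.Module):
    """Linearly maps the small encoder's C'-channel feature to the
    teacher's C channels (no non-linearity), so that the frozen decoder,
    trained for the large encoder, can consume it."""
    def __init__(self, c_student, c_teacher):
        super().__init__()
        # a 1x1 conv acts as a per-position fully-connected layer
        self.transform = nn.Conv2d(c_student, c_teacher, kernel_size=1, bias=False)

    def forward(self, feat_student):
        return self.transform(feat_student)

# toy shapes: suppose a teacher stage outputs 512 channels, the student 128
bridge = LinearBridge(128, 512)
f_student = torch.randn(1, 128, 16, 16)
f_teacher_hat = bridge(f_student)  # teacher-sized feature for the decoder
assert f_teacher_hat.shape == (1, 512, 16, 16)
```

Because the transform is linear and has few parameters, it adds negligible cost at training time and can be discarded at inference once the compressed encoder-decoder pair is trained.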
Notably, we do not restrict the specific collaboration form in our approach. In this paper, we will show it can be applied to two different state-of-the-art universal style transfer schemes: WCT  (where the collaboration is image reconstruction) and AdaIN  (where the collaboration is style transfer). The main contributions of this work are:
We propose a new knowledge distillation method for universal neural style transfer. The exclusive collaborative relationship between the encoder and its decoder is identified as a new kind of knowledge, which can be applied to different collaborative relationships.
To resolve the feature dimension mismatch problem between the student and teacher networks in our algorithm, we propose to restrict the student to learn a linear embedding of the teacher's output, which also acts as a regularizer to fuse more supervision into the middle layers of the student so as to boost its learning.
Extensive experiments show the merits of our method with different stylization frameworks (WCT, AdaIN, and Gatys), achieving a large parameter reduction with comparable or even better visual effects. Especially, on WCT, the compressed models enable us to conduct ultra-resolution (over 40 megapixels) universal style transfer for the first time on a single 12GB GPU.
Prior to the deep learning era, image style transfer was mainly tackled by non-parametric sampling, non-photorealistic rendering [15, 50], or image analogy. However, these methods are designed for specific styles and rely on low-level statistics. Recently, Gatys et al.
proposed neural style transfer, which employs deep features from the pre-trained VGG-19 model and achieves stylization by matching the second-order statistics between the generated image and the given style image. Numerous methods have been developed to improve the visual quality [34, 54, 47, 61], speed [35, 52, 29, 11], user controls [39, 56, 14], style diversity [8, 3, 37, 24, 38], etc. However, one common limitation of all these neural network based approaches is that they cannot handle ultra-resolution content and style images given limited memory. Some approaches [52, 29, 47] achieve high-resolution stylization results (up to about nine megapixels, e.g., 3000×3000 pixels) by learning a small feedforward network for a specific style example or category, but they do not generalize to other unseen styles. In contrast, our goal is to realize ultra-resolution image style transfer for universal styles with one model only.
Model compression. CNN compression and acceleration have attracted much attention recently, aiming to obtain a smaller and faster model without a considerable compromise in performance. Existing methods broadly fall into five categories, i.e., low-rank decomposition [7, 27, 32, 60], pruning [33, 18, 17, 36, 55, 19, 53], quantization [5, 46, 62, 25], knowledge distillation [2, 1, 21, 58], and compact architecture redesign or search [26, 23, 48, 59, 43, 51, 10]. However, these methods are mainly explored in high-level vision tasks, typically classification and detection. Few approaches have paid attention to low-level vision tasks such as style transfer, where many methods are also limited by the massive model size of CNNs. Unlike CNN compression for high-level vision, which only needs to maintain the global semantic information of features to retain accuracy, the extra challenge of model compression for low-level vision may be how to maintain the local structures, e.g., local textures and color diversity in style transfer.
In this work, we develop a deeply-supervised knowledge distillation method to learn a much smaller model from the pre-trained redundant VGG-19. The compressed model enjoys substantial reductions in both parameters and computation (the model size is reduced by 15.5 times). More importantly, the decrease of model size enables universal style transfer on ultra-resolution images. To the best of our knowledge, only one recent work employs GAN to learn an unpaired style transfer network on ultra-resolution images. However, they achieve this by working on image subsamples and then merging them back into a whole image. In contrast, our method fundamentally reduces the model complexity, so it can directly process the whole image.
Style-agnostic stylization methods typically adopt an encoder-decoder scheme to learn deep representations for style rendering and then invert them back into the stylized images. Since the style information is not directly encoded in the model, the encoder part needs to be expressive enough to extract informative representations for universal styles. Existing methods [13, 4, 24, 38] commonly choose VGG-19  as the encoder considering its massive capacity and hierarchical architecture. As for the decoder, depending on different stylization schemes, it can have different collaborative relationships with the encoder. Two state-of-the-art arbitrary style transfer approaches, WCT  and AdaIN , are discussed here. (i) For WCT, the stylization process is to apply Whitening and Coloring Transform 
to the content feature using the second-order statistics of the style feature. Then the transformed content feature is inverted to an image by the decoder. Hence, the decoder training is not directly involved with stylization. The collaborative relationship in WCT is essentially image reconstruction. (ii) For AdaIN, unlike WCT, the decoder training is directly involved in the stylization. Two images (content and style) are fed into the encoder; then, in the feature space, the content feature is rendered by the statistics (mean and variance) of the style feature. Finally, the decoder inverts the rendered content feature back to a stylized image. The stylized image is supposed to be close to the content (or style) image in terms of the content (or style) distance. Therefore, the collaborative relationship for AdaIN is style transfer.
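For concreteness, the whitening-coloring step used by WCT in (i) can be sketched in a few lines of numpy. This is a simplified, single-stage version on flattened (C × HW) features; WCT itself additionally blends the transformed feature with the content feature and cascades the transform over five VGG stages:

```python
import numpy as np

def wct(fc, fs, eps=1e-5):
    """Whitening-Coloring Transform on flattened (C x HW) features:
    whiten the content feature, then color it with the style feature's
    covariance, so the output matches the style's second-order statistics."""
    fc = fc - fc.mean(axis=1, keepdims=True)
    mu_s = fs.mean(axis=1, keepdims=True)
    fs = fs - mu_s
    # whitening: remove the content feature's channel correlations
    w_c, V_c = np.linalg.eigh(np.cov(fc) + eps * np.eye(fc.shape[0]))
    whitened = V_c @ np.diag(w_c ** -0.5) @ V_c.T @ fc
    # coloring: impose the style feature's channel correlations
    w_s, V_s = np.linalg.eigh(np.cov(fs) + eps * np.eye(fs.shape[0]))
    colored = V_s @ np.diag(w_s ** 0.5) @ V_s.T @ whitened
    return colored + mu_s

rng = np.random.default_rng(0)
fc = rng.standard_normal((8, 100))
fs = rng.standard_normal((8, 100))
out = wct(fc, fs)
# the transformed feature now matches the style's second-order statistics
assert np.allclose(np.cov(out), np.cov(fs), atol=1e-3)
```

Note that the transform itself has no trainable parameters, which is why, in WCT, the only component that needs training (and hence distillation) is the encoder-decoder pair.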
Despite the paradigm difference between the above two schemes, they are both encoder-decoder based, and the decoder is trained through the knowledge of the encoder. This means that, during the training of the decoder, the knowledge of the encoder is leaked into the decoder. Presumably, and confirmed empirically, the decoder can only work with its matching encoder, like a nut with its bolt. For another encoder E', even if it has the same architecture as E, E' and the decoder D cannot work together (see Fig. 2). This exclusivity shows that the decoder holds some inherent information specific to its encoder. If we can find a way to make the decoder D compatible with E' too, it means E' can functionally replace the original encoder E. If E' is meanwhile much smaller than E, then we achieve the model compression goal. Based on this idea, we propose a new distillation method specific to NST, named Collaborative Distillation, consisting of two steps.
For the first step, based on the task at hand, we train a collaborator network for the large encoder. As shown in Fig. 3(a), for WCT, the decoder is trained to invert the feature to be as faithful to the input image as possible (i.e., image reconstruction), where both the pixel reconstruction loss and the perceptual loss are employed:

$$\mathcal{L}_{\text{rec}} = \lVert I_r - I_o \rVert_2^2 + \lambda \lVert \Phi_i(I_r) - \Phi_i(I_o) \rVert_2^2, \qquad (1)$$

where $i$ indexes the $i$-th stage of VGG-19; $\Phi_i$ denotes the feature maps of the ReLU_i_1 layer; $\lambda$ is the weight to balance the perceptual loss and the pixel reconstruction loss; $I_o$ and $I_r$ denote the original image and the reconstructed image, respectively. For AdaIN, the decoder is involved in style transfer directly. Thus, its decoder loss is made up of both the content loss and the style loss:

$$\mathcal{L}_{\text{adain}} = \lVert \Phi(I_{\text{st}}) - \Phi(I_c) \rVert_2^2 + \beta \sum_{k} \lVert G(\Phi_k(I_{\text{st}})) - G(\Phi_k(I_s)) \rVert_2^2, \qquad (2)$$

where $G$ is the Gram matrix to describe style [12, 13]; $\beta$ is the weight to balance the style loss and the content loss; subscripts "st", "c", and "s" represent the stylized image, content image, and style image, respectively.
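The AdaIN feature renormalization itself is simple. Below is a numpy sketch on flattened (C × HW) features; the real operator works per-channel on 4-D batched tensors:

```python
import numpy as np

def adain(fc, fs, eps=1e-5):
    """AdaIN renormalization: align the content feature's channel-wise
    mean/std with those of the style feature before decoding."""
    mu_c = fc.mean(axis=1, keepdims=True)
    std_c = fc.std(axis=1, keepdims=True) + eps   # eps avoids division by zero
    mu_s = fs.mean(axis=1, keepdims=True)
    std_s = fs.std(axis=1, keepdims=True)
    return (fc - mu_c) / std_c * std_s + mu_s

rng = np.random.default_rng(1)
fc = rng.standard_normal((4, 64))
fs = 2.0 * rng.standard_normal((4, 64)) + 3.0
out = adain(fc, fs)
# per-channel statistics of the output now match the style feature
assert np.allclose(out.mean(axis=1), fs.mean(axis=1), atol=1e-6)
assert np.allclose(out.std(axis=1), fs.std(axis=1), atol=1e-3)
```

Unlike WCT, this rendered feature is then passed through a decoder trained with the content and style losses above, which is why the decoder is directly involved in stylization.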
After we have the collaborator, the second step of our algorithm is to replace the original encoder E with a small encoder E'. For simplicity, in this work we use an E' with the same architecture as E but fewer filters in each layer. We expect the small encoder E' to be functionally equivalent to the original encoder E, as shown in (a) and (b) of Fig. 3. Similar to the first step, the collaborator network loss, denoted as $\mathcal{L}_{\text{col}}$, can take different forms depending on the specific collaboration task: in the context of this work, $\mathcal{L}_{\text{col}} = \mathcal{L}_{\text{rec}}$ (Eq. 1) for WCT and $\mathcal{L}_{\text{col}} = \mathcal{L}_{\text{adain}}$ (Eq. 2) for AdaIN.
In the proposed collaborative distillation method, the small encoder is connected with the decoder network, and at their interface arises a feature size mismatch problem. Specifically, the original encoder outputs a feature of size $C \times H \times W$, so the input of the decoder is also supposed to be of size $C \times H \times W$. However, because the small encoder has fewer filters, it outputs a feature of size $C' \times H \times W$ ($C' < C$), which cannot be accepted by the decoder. To resolve this, we first look at how the channel number plays a role in the stylization process. As pioneered by Gatys [12, 13]
, the style of an image is described by the Gram matrix of the deep features extracted from VGG-19:

$$G = F F^{\top}, \qquad (3)$$

where $F$ is the deep feature of size $C \times HW$ extracted from a certain convolutional layer of VGG-19; $G$ denotes the Gram matrix of size $C \times C$; $\top$ stands for matrix transpose. Since we aim to compress these features, i.e., they are regarded as redundant, this can be formulated as $F$ being a linear combination of some feature basis vectors in a lower dimension:

$$F = Q F', \qquad (4)$$

where $Q$ is a transform matrix of size $C \times C'$ and $F'$ is the feature basis matrix of size $C' \times HW$, which can be viewed as the linear embedding of the original deep feature $F$. Then it is easy to see that the Gram matrix of the new feature, $G' = F' F'^{\top}$, has the same number of non-zero eigenvalues as the original Gram matrix $G$. In other words, the style description power is maintained if we adopt $F'$ in place of the original redundant $F$. In the context of our method, $F$ is the output of the original encoder and $F'$ is the output of the small encoder. The transform matrix $Q$ is learned through a fully-connected layer without a non-linear activation function to realize the linearity assumption. Hence, the linear embedding loss can be formulated as

$$\mathcal{L}_{\text{embed}} = \lVert Q F' - F \rVert_2^2. \qquad (5)$$
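The rank argument above can be checked numerically: if $F = QF'$, the Gram matrices $FF^{\top}$ and $F'F'^{\top}$ share the same number of non-zero eigenvalues. A quick numpy check with illustrative sizes (the channel counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
C, Cp, HW = 64, 16, 256
Fp = rng.standard_normal((Cp, HW))   # compact feature basis F' (C' x HW)
Q = rng.standard_normal((C, Cp))     # linear transform (C x C')
F = Q @ Fp                           # redundant feature as a linear embedding

G = F @ F.T     # C x C Gram matrix of the redundant feature
Gp = Fp @ Fp.T  # C' x C' Gram matrix of the compact feature

# both Gram matrices have C' non-zero eigenvalues (rank C'),
# so no style-description capacity is lost by keeping only F'
assert np.linalg.matrix_rank(G) == np.linalg.matrix_rank(Gp) == Cp
```

With random (hence almost surely full-rank) $Q$ and $F'$, the rank of both Gram matrices equals $C'$, matching the claim that the compact feature preserves the style statistics up to a linear transform.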
One step further, the proposed solution is not limited to the final output layer; it can also be applied to the middle layers of the small encoder. Concretely, we apply the linear embedding to the other four middle layers (the ReLU_k_1 layers) between the original encoder and the small encoder, as shown in Fig. 3(c). We have two motivations for this. First, when the small encoder is trained with the SGD algorithm, its only gradient source would be the decoder, passed through the fully-connected layer $Q$. However, $Q$ does not have many parameters, so it actually forms an information bottleneck, slowing down the learning of the student. With these branches plugged into the middle layers of the network, they infuse more gradients into the student and thus boost its learning, especially for deep networks, which are prone to gradient vanishing. Second, in neural style transfer, the style of an image is typically described by the features of many middle layers [13, 24, 38]. Therefore, adding more supervision to these layers is necessary to ensure that they do not lose much of their style description power for subsequent use in style transfer.
To this end, the total loss to train the small encoder in our proposed algorithm can be summarized as

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{col}} + \gamma \mathcal{L}_{\text{embed}}, \qquad (6)$$

where $\gamma$ is the weight factor to balance the two loss terms.
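Putting the pieces together, one training step for the small encoder might look like the following PyTorch sketch. Toy single-layer convolutions stand in for the VGG stages, and the loss weight 10.0 is an arbitrary placeholder, not the paper's value:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# frozen teacher encoder and frozen collaborator (decoder); toy architectures
teacher = nn.Conv2d(3, 32, 3, padding=1).eval()
decoder = nn.Conv2d(32, 3, 3, padding=1).eval()
for p in list(teacher.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)

# trainable slim student and linear embedding transform Q (no activation)
student = nn.Conv2d(3, 8, 3, padding=1)
Q = nn.Conv2d(8, 32, 1, bias=False)

img = torch.randn(2, 3, 32, 32)      # a toy input batch
f_t = teacher(img)                   # teacher feature F
f_hat = Q(student(img))              # linear embedding Q F' of student feature

embed_loss = F.mse_loss(f_hat, f_t)          # L_embed: match teacher feature
col_loss = F.mse_loss(decoder(f_hat), img)   # L_col: reconstruction collaborator
total = col_loss + 10.0 * embed_loss         # weight is illustrative only

total.backward()  # gradients reach both the student and Q; teacher/decoder stay fixed
```

In the full method, the embedding loss is additionally applied at the four intermediate ReLU_k_1 stages, each with its own transform, providing the deep supervision discussed above.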
In this section, we first demonstrate the effectiveness of our compressed VGG-19 compared with the original VGG-19 within the universal style transfer framework WCT. Then we show that the proposed collaborative distillation is not limited to one kind of collaborative relationship, using AdaIN. Finally, to show the generality of the proposed approach, we also evaluate on the optimization-based style transfer of Gatys, where the collaborative relationship is the same as in WCT, i.e., image reconstruction. We first conduct comparisons at the maximal image resolution that the original VGG-19 can possibly handle (3000×3000 pixels), then show some stylized samples at larger resolutions (i.e., ultra-resolutions) with the small models. All experiments are conducted on one Tesla P100 12GB GPU, namely, given the same limited memory.
Evaluated compression method. Since there are few model compression methods specifically designed for low-level image synthesis tasks, we compare our method with a typical compression algorithm from the high-level image classification task, i.e., filter pruning (FP). Specifically, we first apply FP to VGG-19 in classification to obtain a compressed model with the same architecture as ours. It is then fine-tuned on ImageNet to regain performance. Finally, its decoder is obtained by optimizing the loss in Eq. (1).
Fig. 4 panels: (a) Content, (b) Style, (c) Original, (d) FP, (e) Ours.
Since we need the original decoder as the collaborator, we first train a decoder with the mirrored architecture of VGG-19 on the MS-COCO dataset for image reconstruction.
During the training, the encoder is fixed.
We randomly crop patches from images as input.
Adam is used as the optimization solver with a fixed learning rate and batch size.
In WCT, a cascaded coarse-to-fine stylization procedure is employed for the best results, so the decoders of the 5-stage VGG-19 (up to ReLU_k_1, k = 1, ..., 5) are all trained.
Then an encoder-decoder network is constructed and trained with the total loss (6).
The compressed encoder can be randomly initialized, but we empirically find that using the largest filters (based on their norms) from the original VGG-19 as initialization helps the compressed model converge faster, so we use this initialization scheme in all our experiments.
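This initialization can be sketched in a few lines of numpy. The choice of the l1 norm below is our assumption (it is the criterion used by the FP baseline); the paper only says "largest filters" by norm:

```python
import numpy as np

def init_from_teacher(w_teacher, n_keep):
    """Initialize a slim conv layer with the teacher's n_keep largest-norm
    filters. Uses the l1 norm per filter (an assumption, as in filter
    pruning); keeps the filters in their original order."""
    # w_teacher shape: (out_channels, in_channels, k, k)
    norms = np.abs(w_teacher).sum(axis=(1, 2, 3))   # l1 norm per output filter
    keep = np.argsort(norms)[::-1][:n_keep]         # indices of the largest filters
    return w_teacher[np.sort(keep)]

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 3, 3, 3))   # a toy teacher layer
w_small = init_from_teacher(w, 16)       # slim layer keeps the 16 largest filters
assert w_small.shape == (16, 3, 3, 3)
```

In deeper layers the input channels must be subselected consistently with the previous layer's kept filters; the sketch above only shows the output-channel selection for a first layer.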
After training, we obtain the compressed encoders. Their mirrored decoders are trained by the same rule as in the first step, using the loss (1).
Fig. 4 shows the comparison of the stylized results. Generally, our model achieves comparable or even more visually pleasing stylized images with much fewer parameters. Stylized images produced by the original VGG-19 model and our compressed model are clearly better-looking (more colorful and sharper) than those by the FP-slimmed model. Compared with the original VGG-19 model, our compressed model tends to produce results with fewer messy textures, while the original model often highlights too many textures in a stylized image. For example, sky and water usually look smooth in an image, yet there are no absolutely smooth regions in natural images, so nuances remain in the sky and water areas. In Fig. 5, the original VGG-19 model tends to highlight these nuances to such an obvious extent that the whole image looks messy, while our model only emphasizes the most semantically salient parts. This phenomenon can be explained by the fact that a model with fewer parameters has limited capacity and is thus less prone to overfitting: an over-parameterized model will naturally spend its extra capacity fitting the noise in the data, e.g., the nuanced textures of the sky and water in this context. Meanwhile, even though a compressed model tends to overfit less, the model pruned by FP has the problem of losing description power, which manifests in two aspects in Fig. 4. First, the FP-slimmed model tends to produce stylized images with less color diversity. Second, when examined closely, the stylized images by the FP-slimmed model show serious checkerboard artifacts.
Fig. 5 panels: (a) Content, (b) Original, (c) Ours.
User study. Style transfer has an open problem of lacking broadly-accepted comparison criteria, mainly because stylization is quite subjective. For a more objective comparison, previous works conducted user studies to investigate user preference over different stylized results. Here we adopt this idea by investigating which of the stylized images produced by the three models is the most visually pleasing. We generate pairs of stylized images using the three models; for each subject, a random subset of pairs is selected, and the subject chooses the most visually pleasing result in each pair. The user study results are shown in Tab. 1, where the stylized images produced by our compressed model are top-rated on average, in line with the qualitative comparison in Fig. 4.
Style distance. To further quantitatively evaluate the three models, we explore the style similarity between the stylized image and the style image. Intuitively, a more successfully stylized image should be closer to the style image in the style space, so we define the style distance as

$$d_{\text{style}} = \frac{1}{K} \sum_{k=1}^{K} \lVert G(\Phi_k(I_{\text{st}})) - G(\Phi_k(I_s)) \rVert_2, \qquad (7)$$

where $G$ is the Gram matrix and $\Phi_k$ denotes the ReLU_k_1 features.
Particularly, we feed the pairs of stylized images from the user study into the original VGG-19 to extract features at the 5-stage layers (ReLU_k_1, k = 1, ..., 5); the style distances are then calculated based on these features.
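The metric is straightforward to compute given the stage features. A numpy sketch follows; the Gram normalization by HW and the simple averaging over stages are our choices, and the exact normalization in the paper may differ:

```python
import numpy as np

def gram(f):
    """Gram matrix of a flattened feature map f of shape (C, HW),
    normalized by the number of spatial positions (our convention)."""
    return f @ f.T / f.shape[1]

def style_distance(feats_stylized, feats_style):
    """Average Gram-matrix (Frobenius) distance over the K stage features
    of the stylized and style images."""
    dists = [np.linalg.norm(gram(a) - gram(b))
             for a, b in zip(feats_stylized, feats_style)]
    return float(np.mean(dists))

rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 50)) for _ in range(5)]  # 5 toy stage features
# identical features give zero distance; perturbed features give a positive one
assert style_distance(feats, feats) == 0.0
assert style_distance(feats, [f + 1.0 for f in feats]) > 0.0
```

A lower style distance means the stylized image's feature statistics sit closer to the style image's, which is how Tab. 2 ranks the three models.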
Tab. 2 shows that the stylized images generated by the original and our compressed models are significantly closer to the style images than those by the FP-slimmed model.
Our model is fairly comparable with the original one, which agrees with the user preference (Tab. 1).
|Model||# Params||Storage (MB)||# GFLOPs||Inference time (GPU/CPU, s)||Max resolution|
Tab. 3: Summary of the original models and our compressed models. Storage is measured for the PyTorch model. GFLOPs and inference time are measured when the content and style are both RGB images of the same fixed size. Max input resolution is measured on GPU using the PyTorch implementation, with content and style being square RGB images of the same size.
Aside from the high-resolution stylized images shown above, we show an ultra-resolution stylized image in Fig. 1. The shapes and textures are still very clear even when zoomed in at a large scale. To the best of our knowledge, this is the first time that ultra-resolution universal style transfer results can be obtained using a single 12GB GPU. We also report statistics on model size and speedup in Tab. 3. Our compressed model is substantially smaller and faster than the original VGG-19 on GPU. Note that the total size of all the 5-stage models is small enough to fit in mobile devices.
Ablation study. Here we explore the effect of the two proposed sub-schemes, collaborative distillation ($\mathcal{L}_{\text{col}}$) and linear embedding ($\mathcal{L}_{\text{embed}}$). The ablation results in Fig. 6 show that both the linear embedding and collaboration losses can transfer considerable knowledge from the teacher to the student. Generally, both schemes can independently produce results fairly comparable to the original models, even though the student is much smaller. Between the two losses, $\mathcal{L}_{\text{embed}}$ only transfers information in the feature domain rather than the image domain, so its results have more distortion and also some checkerboard effects (please zoom in to see the details of Fig. 6(c)). Meanwhile, $\mathcal{L}_{\text{col}}$ mainly focuses on image reconstruction, so it produces sharper results with less distortion, which confirms our intuition that the decoder holds considerable knowledge that we can leverage to train a small encoder. When both losses are employed, we obtain the sweet spot of the two: the results do not have checkerboard artifacts and also properly maintain the artistic distortion.
Fig. 6 panels: (a) C/S, (b) Original, (c) $\mathcal{L}_{\text{embed}}$ only, (d) $\mathcal{L}_{\text{col}}$ only, (e) Both.
We further evaluate the proposed method on AdaIN, where the encoder-decoder collaboration task is style transfer. The training process and network architectures are set in the same way as in Sec. 4.1, except that the style-transfer loss (Eq. 2) is now utilized as the collaborator loss. The results are shown in Fig. 7, where our compressed encoder produces results visually comparable with the original one, while the FP-slimmed model degrades the visual effect significantly. This visual evaluation is further supported by the user study in Tab. 1, where ours and the original model receive similar numbers of votes, significantly more than the model compressed by the FP algorithm.
Although the work of Gatys et al. is not encoder-decoder based, it is worthwhile to check whether the small encoder compressed by our method can still perform well in this case.
Layers Conv_k_1 are selected for style matching and layer Conv4_2 is selected for content matching.
The L-BFGS  is used as the optimization solver.
Fig. 8 shows that the compressed model can achieve fairly comparable results with those generated by the original VGG-19.
Similar to the experiments of artistic style transfer in Sec. 4.1, we also conduct a user study for a more objective comparison.
Results in Tab. 1 show that our compressed model is slightly less popular than the original model (which is reasonable considering that the compressed model has fewer parameters), while still more preferred than the model compressed by the FP method.
We briefly explain why existing knowledge distillation methods [42, 44, 57, 45] generally work less effectively for style transfer. The FP-pruned VGG-19 (called A) in our paper achieves a certain top-5 accuracy on ImageNet. We had an intermediate model (called B) from the middle of the fine-tuning process of A, with a lower top-5 accuracy. We compare the stylization quality of A and B with WCT and find that the results of A do not show any advantage over those of B, as the unpleasant messy textures still remain (Fig. 9, Row 1). This implies that a small accuracy gain in classification (typically small, which is what the state-of-the-art distillation methods [42, 44, 57, 45] can achieve at most) cannot really translate to a perceptual improvement in neural style transfer.
In addition, we only apply the proposed method to distilling the encoder, not the decoder, because doing otherwise simply degrades the visual quality, as shown in Fig. 9 (Row 2). The reason is that the decoder is responsible for image reconstruction and is already trained with proper supervision, i.e., the pixel and perceptual losses in Eq. (1). When distillation is applied to the small decoder, the extra supervision from the original decoder does not help but instead undermines the effect of loss (1), thus deteriorating the visual quality of the stylized results.
Fig. 9 (Row 1) panels: (a) C/S, (b) FP (A), (c) FP (B), (d) Ours.
Fig. 9 (Row 2) panels: (a) C/S, (b) Original, (c) Ours, (d) Ours.
Input resolution is an important limitation for universal neural style transfer due to the large model size of CNNs. In this work, we present a new knowledge distillation method (i.e., Collaborative Distillation) to reduce the model size of VGG-19, which exploits the phenomenon that encoder-decoder pairs in universal style transfer construct an exclusive collaborative relationship. To resolve the feature size mismatch problem, a linear embedding scheme is further proposed. Extensive experiments show the merits of our method on two universal stylization approaches (WCT and AdaIN). Further experiments within the Gatys stylization framework demonstrate the generality of our approach on the optimization-based style transfer paradigm. Although we mainly focus on neural style transfer in this work, the encoder-decoder scheme is also widely used in other low-level vision tasks like super-resolution and image inpainting. The performance of our method on these tasks is worth exploring, which we leave as future work.
We thank Wei Gao and Lixin Liu for helpful discussions. This work is supported in part by the National Key R&D Program of China under Grant No. 2017YFB1002400 and US National Science Foundation CAREER Grant No. 149783.
Whitening and coloring transforms for multivariate Gaussian random variables. Project Rhea, 2016.