“I can’t change the direction of the wind, but I can adjust my sails to always reach my destination”
In this paper we propose an alternative approach, called Dynamic-Net, that resolves this for some scenarios. Rather than training a single fixed network, we split the training into two phases. In the first phase, we train the blocks of a “main network” using a certain objective. In the second phase, we train additional residual “tuning-blocks” using a different objective. Then, at inference time, we can decide whether to incorporate the tuning-blocks and even control their contribution. This gives us a dynamic network that can be assembled at inference time from the main blocks and the tuning-blocks. Our underlying assumption is that the tuning-blocks can capture the variation between the two objectives, thus allowing traversal of the objective space. The Dynamic-Net can thus be easily geared towards the first or second objective by tuning scalar parameters at test-time.
The key idea behind our approach is inspired by the Jimmy Dean quotation that opens the introduction. We acknowledge that we cannot directly modify the objective at test-time. What we can do, however, is modify the latent space representation. Therefore, our approach relies on manipulation of deep features in order to emulate a manipulation in objective space.
The main advantages of the Dynamic-Net are three-fold. First, using a single training session, the Dynamic-Net can emulate networks trained with a variety of different objectives, for example, networks which produce stronger or weaker stylization effects, as illustrated in Figure 1. Second, it facilitates image-specific and user-specific adaptation, without re-training. Via a simple interface, a user can interactively choose, in real time, the level of stylization or a preferred inpainting result. Last, the ability to traverse the objective space at test-time shrinks the search-space during training. More specifically, we show that even when the choice of objective for training is sub-optimal, the Dynamic-Net can reach a better working point.
We show these benefits through a broad range of applications in image generation, manipulation and reconstruction. We explore a variety of objectives and architectures, and present both qualitative and quantitative evaluation.
2 Related Work
Many state-of-the-art solutions for image manipulation, generation and reconstruction utilize a multi-loss objective. For example, Isola et al. [8] combine an L1 loss and an adversarial loss [7] for image-to-image transformation. Johnson et al. [9] trade off between a style loss (i.e., Gram loss) and a content loss (i.e., the perceptual loss) for fast style transfer and super-resolution, SRGAN [11] balances between a content loss and an adversarial loss for perceptual super-resolution, and [25] combines an L1 loss, a style loss and an adversarial loss for texture synthesis. In all of these cases, the weighting between the loss terms is fixed during training, producing a trained network that operates at a specific working point.
The impact of the trade-off between multiple objectives has been discussed before in several contexts. Blau and Michaeli [3] show that in image restoration algorithms there is an inherent trade-off between distortion and perceptual quality. They analyze the trade-off and show the benefits of different working points. In [6] it is shown empirically that a different balance between the style loss and content loss leads to different stylization effects.
The importance and difficulty of choosing the optimal balance between different loss terms is tackled by methods for multi-task learning [5, 18, 12, 21, 20]. In these works a variety of solutions have been proposed for learning the weights for different tasks. Common to all of these methods is that their outcome is a network trained with a certain fixed balance between the objectives.
Deep feature manipulation
The approach we propose is based on training “tuning-blocks” that learn how to manipulate deep features in order to achieve a certain balance between the multiple objectives. It is thus related to methods that employ manipulation of deep features in latent space. These methods are based on the basic hypothesis of [1] that “CNNs linearize the manifold of natural images into a Euclidean subspace of deep features”, suggesting that linear interpolation of deep features makes sense. Inspired by this hypothesis, Upchurch et al. [23] learn the linear “direction” that modifies facial attributes, such as adding glasses or a mustache. In [4] a more sophisticated manipulation approach is proposed. They introduce blocks to be added to an auto-encoder network in order to learn the manipulation required to modify a facial attribute. While producing great results on manipulation of face images, their approach implicitly assumes that the training images are similar and roughly aligned.
Figure 2: (a) Single-block framework. (b) Multi-block framework.
3 Proposed Approach: Dynamic-Net
To date, one has to re-train the network for each objective. In this section we propose Dynamic-Net that allows changing the objective at inference time, without re-training. Dynamic-Nets can emulate a plethora of “intermediate working points” between two given objectives and , by simply tuning a single parameter. One can think of this as implicit interpolation between the objectives. Our solution is relevant in cases where such an interpolation between the objectives and is meaningful.
To provide some intuition we begin with an example. A common scenario where an “intermediate working point” is intuitive, is when the objectives consist of a super-position of two loss terms: and , where , are loss terms, and are scalars. Assuming, without loss of generality, that , an intermediate working point corresponds to an objective , such that . Our goal is to approximate at inference time the results of a network trained with any objective , while using only and during training.
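For concreteness, the superposition setup described above can be written out as follows. The symbols used here ($\mathcal{O}_0$, $\mathcal{O}_1$, $\mathcal{O}_m$ for the objectives, $\mathcal{L}_a$, $\mathcal{L}_b$ for the loss terms, and $\lambda_0$, $\lambda_1$, $\lambda_m$ for the scalars) are notation assumed for illustration only.

```latex
\begin{align}
\mathcal{O}_0 &= \mathcal{L}_a + \lambda_0 \, \mathcal{L}_b, \\
\mathcal{O}_1 &= \mathcal{L}_a + \lambda_1 \, \mathcal{L}_b, \qquad \lambda_0 < \lambda_1, \\
\mathcal{O}_m &= \mathcal{L}_a + \lambda_m \, \mathcal{L}_b, \qquad \lambda_0 < \lambda_m < \lambda_1 .
\end{align}
```

Under this notation, only $\mathcal{O}_0$ and $\mathcal{O}_1$ are used during training, and the goal is to approximate the result of training with any $\mathcal{O}_m$ at inference time.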
The key idea behind the approach we propose is to use interpolation in latent space in order to approximate the intermediate objectives. For simplicity of presentation we start with a simple setup that uses linear interpolation at a single layer of the network. Later on we extend to non-linear interpolation.
3.1 Single-block Dynamic-Net
Our single-block framework is illustrated in Figure 2(a). It first trains a CNN, to which we refer as the “main network blocks”, with objective . We then add an additional block to the network, to which we refer as the “tuning-block” that learns the “direction of change” in latent space , that corresponds to shifting the objective from to another working point . Our hypothesis is that walking along the “direction of change” in latent space can emulate a plethora of “intermediate” working points between and .
In further detail, our pipeline is as follows:
1. Train the main network blocks by setting the objective to .
2. Fix the values of the main network, add a tuning-block between layers and , and post-train only by setting the objective to . The block will capture the variation between the latent representations and , that correspond to and , respectively.
3. Fix both the main blocks as well as the tuning block , and do as follows:
   (a) Propagate the input until layer of the main network to get .
   (b) Generate an “intermediate” point in latent space, , by tuning the scalar parameter .
   (c) Propagate through the rest of the main network to obtain outcome that corresponds to objective .
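The inference steps of the pipeline above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `head`, `tail`, `tuning_block` and `alpha` are hypothetical, the residual form `z + alpha * psi(z)` is assumed from the description of the tuning-blocks as residual blocks, and the toy lambdas merely stand in for trained, frozen network parts.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def single_block_forward(x: Vector, head: Layer, tail: Layer,
                         tuning_block: Layer, alpha: float) -> Vector:
    """Run a frozen main network with one residual tuning-block.

    Assumed form: z_m = z + alpha * psi(z), where z is the latent
    representation produced by the first part of the main network.
    """
    z = head(x)                  # propagate the input up to the chosen layer
    delta = tuning_block(z)      # assumed: the learned "direction of change"
    z_m = [zi + alpha * di for zi, di in zip(z, delta)]
    return tail(z_m)             # propagate through the rest of the main network


# Toy stand-ins for trained, frozen components (illustrative only).
head = lambda x: [2.0 * v for v in x]
tail = lambda z: [v + 1.0 for v in z]
tuning_block = lambda z: [1.0 for _ in z]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, single_block_forward([1.0, 2.0], head, tail, tuning_block, alpha))
```

With `alpha = 0.0` the tuning-block has no effect and the output is that of the main network alone; increasing `alpha` moves the latent representation further along the learned direction.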
The justification for our approach stems from the following two assumptions:
1. We adopt the hypothesis of [1] that “CNNs linearize the manifold of natural images into a Euclidean subspace of deep features”. This assumption implies that the latent representation of an intermediate point can be written as where . Setting yields working point 0 while setting yields working point 1.
2. For any pair of working points ,, with corresponding latent representations , it is possible to train a block such that .
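One way to write the two assumptions together is given below. The notation is assumed for illustration: $z_0$ and $z_1$ denote the latent representations at the two working points, $z_m$ an intermediate representation, $\psi$ the tuning-block and $\alpha$ the scalar tuned at test-time.

```latex
\begin{align}
z_m &= z_0 + \alpha \, (z_1 - z_0), \qquad \alpha \in [0, 1], \\
\psi(z_0) &\approx z_1 - z_0, \\
z_m &\approx z_0 + \alpha \, \psi(z_0).
\end{align}
```

In this form, $\alpha = 0$ gives $z_m = z_0$ (working point 0) and $\alpha = 1$ gives $z_m \approx z_1$ (working point 1), so the residual tuning-block only needs to be evaluated once and scaled.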
To provide further intuition we revisit the example where the objectives are of the form . Here the parameter controls the balance between the two loss terms and . To interpolate in objective space we would like to modify but this is not possible to do directly at test-time. Instead, our scheme enables interpolation in latent space by modifying the parameter , which controls . Our main hypothesis is that the suggested training scheme will lead to a proportional relation between and . That is, increasing will correspond to a monotonic increase in , thus implicitly achieving the desired interpolation in objective space.
In the more general case, when the objectives and are of different forms, the interpolation we propose in objective space cannot be formulated mathematically so intuitively. Nonetheless, the conceptual meaning of such interpolation could be sensible. For example, we could train an image generation network with two different adversarial objectives, one that prefers blond hair and another that prefers dark hair. Interpolating between the two objectives should correspond to generating images with varying hair shades. Therefore, to prove broad applicability of the proposed approach to a variety of objectives, we present in Section 4 several applications and corresponding results.
3.2 Multi-block Dynamic-Net
In practice, adding a single tuning-block at a specific layer might be insufficient. It limits the manipulation to linear transformations in a single layer of the latent space. Therefore, we propose adding multiple blocks at different layers of the network, as illustrated in Figure 2(b).
The training framework is similar to that of single-block, except that now we have multiple tuning blocks , each associated with a corresponding weight . When training the tuning-blocks we fix all the weights to . Then at inference-time, we can tune each of the weights independently to yield a plethora of networks and results.
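A minimal sketch of the multi-block forward pass follows. It assumes the same residual form as in the single-block case and uses hypothetical names (`main_layers`, `tuning_blocks`, `alphas`); the toy layers are placeholders for trained, frozen components, and each tuning-block gets its own independently tunable weight.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def multi_block_forward(x: Vector,
                        main_layers: Sequence[Layer],
                        tuning_blocks: Dict[int, Layer],
                        alphas: Dict[int, float]) -> Vector:
    """Apply the main layers in order; after layer i, add the output of
    tuning block i (if one is attached there) scaled by its own weight."""
    z = x
    for i, layer in enumerate(main_layers):
        z = layer(z)
        if i in tuning_blocks:
            delta = tuning_blocks[i](z)
            z = [zi + alphas[i] * di for zi, di in zip(z, delta)]
    return z


# Toy stand-ins for frozen main layers and trained tuning blocks.
main_layers = [lambda z: [v + 1.0 for v in z], lambda z: [2.0 * v for v in z]]
tuning_blocks = {0: lambda z: [10.0 for _ in z], 1: lambda z: [100.0 for _ in z]}

# Each weight can be tuned independently at inference time.
print(multi_block_forward([0.0], main_layers, tuning_blocks, {0: 0.0, 1: 0.0}))
print(multi_block_forward([0.0], main_layers, tuning_blocks, {0: 1.0, 1: 0.0}))
print(multi_block_forward([0.0], main_layers, tuning_blocks, {0: 1.0, 1: 1.0}))
```

Setting every weight to zero recovers the main network, while different combinations of weights give different assembled networks.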
4 Experiments
In this section we present experiments with several applications that demonstrate the utility of the proposed Dynamic-Net and support the validity of our hypotheses. In order to emphasize broad applicability we selected applications of varying nature, with a variety of loss functions and network architectures, as summarized in Table 1. Tuning-blocks were implemented as . Further implementation details, architectures and parameter values are listed in the supplementary.
The motivation behind Dynamic-Net was three-fold: (i) provide ability to modify the working point at test-time, (ii) allow image-specific adaptation, and (iii) reduce the dependence on optimal objective selection at training time. In what follows we explore these contributions one by one, through various applications.
In the next subsections, if not stated otherwise, we used the multi-block framework while setting all to be equal, i.e., , and we tune .
4.1 Tuning the objective at test-time: Style Transfer
Our first step is to show that the proposed approach can indeed traverse the objective space, and emulate multiple meaningful working points at test-time, without re-training. We chose to show this via experiments in Style Transfer.
Super-position of objectives:
We begin with the common scenario where the objective-space is a super-position of two loss terms. We followed the setup of fast style transfer , where the goal is to transfer the style of a specific style image to any input image. This is done by training a CNN to optimize the objective: , where is the Perceptual loss  between the output image and input image, and is the Gram loss  between the output image and style image. The hyper-parameter balances between preserving the content image and transferring the texture and appearance of the style image. Our goal here is to show that tuning of the Dynamic-Net at test-time can replace tuning of at training-time.
Following the training procedure suggested in Section 3 we first train the main-blocks with objective , then freeze their weights and train the tuning-blocks with . Similar to [9] we use the MS-COCO [13] dataset for training.
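The two-phase procedure can be illustrated on a toy scalar model. The sketch below is an assumption-laden illustration, not the style-transfer setup: the "main network" is a single weight `w`, the "tuning-block" is a single weight `v`, and the two objectives are squared errors towards two hypothetical target slopes (1.0 and 3.0) that stand in for the two training objectives.

```python
def gradient_descent(grad_fn, init, lr=0.1, steps=200):
    """Minimise a scalar parameter with plain gradient descent."""
    p = init
    for _ in range(steps):
        p -= lr * grad_fn(p)
    return p


XS = [0.5, 1.0, 1.5]  # toy training inputs


def dynamic_net(x, w, v, alpha):
    z = w * x                 # main network (frozen after phase 1)
    return z + alpha * v * z  # residual tuning-block scaled by alpha


# Phase 1: train the main weight w alone (alpha = 0) on toy objective 0,
# a squared error towards the target slope 1.0.
def grad_w(w):
    return sum(2 * (dynamic_net(x, w, 0.0, 0.0) - 1.0 * x) * x for x in XS) / len(XS)


w = gradient_descent(grad_w, init=0.0)


# Phase 2: freeze w, set alpha = 1 and train only v on toy objective 1,
# a squared error towards the target slope 3.0.
def grad_v(v):
    return sum(2 * (dynamic_net(x, w, v, 1.0) - 3.0 * x) * (w * x) for x in XS) / len(XS)


v = gradient_descent(grad_v, init=0.0)

# At test time both w and v stay fixed and only alpha is tuned.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, round(dynamic_net(1.0, w, v, alpha), 4))
```

In this linear toy the intermediate value of `alpha` lands exactly between the two trained behaviours; for real networks this is the hypothesis being tested rather than a guarantee.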
Figure 3 shows a few example results together with the corresponding working points in the objective-space, which trades off the content and style loss terms. We successfully control the level of stylization, at test-time, by tuning . Ideally, we would like to achieve a monotonic change in both content and style loss. It can be seen in Figure 3 (plot) that tuning each separately (blue curve) keeps the style loss lower than when all are equal to each other (green curve). Not surprisingly, the ability to achieve low loss terms that change monotonically improves when the training objectives are closer to each other (purple curves). An important result is that the working points emulated by the Dynamic-Net correspond to those of fixed networks trained for that specific working point (marked by ). We achieve this when the training objectives are not too far apart (purple curves).
The figure also compares to interpolation in image space, i.e., blending images directly. We tried blending the stylized images obtained by fixed networks trained for working points , , (red, orange and yellow curves). It can be seen that the results are inferior qualitatively and quantitatively, since the style loss does not change monotonically.
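For reference, the image-space baseline amounts to a pixel-wise convex combination of two output images. The sketch below assumes images stored as nested lists of floats and uses a hypothetical blending weight `beta`; it illustrates the baseline only, not the Dynamic-Net itself.

```python
from typing import List

Image = List[List[float]]


def blend_images(img_a: Image, img_b: Image, beta: float) -> Image:
    """Pixel-wise blend: (1 - beta) * img_a + beta * img_b."""
    if len(img_a) != len(img_b) or any(len(ra) != len(rb) for ra, rb in zip(img_a, img_b)):
        raise ValueError("images must have the same shape")
    return [[(1.0 - beta) * a + beta * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


# Two tiny single-row "images" standing in for the outputs of two fixed networks.
weak = [[0.0, 1.0]]
strong = [[1.0, 0.0]]
print(blend_images(weak, strong, 0.25))
```

Unlike tuning in latent space, this baseline requires running two separately trained networks and mixes their outputs only after the fact.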
To further explore the generality of our approach we next experiment with disjoint objectives. As a case study we chose to traverse between stylization with two different style images. That is, was trained with one style image, while was trained with a different style image. At test time we tune to traverse between the two objectives. Figure 4 presents results when the style images are completely different, while in Figure 5 the style images are two versions of the same style image, albeit at different resolutions. It can be seen that Dynamic-Net provides a smooth transition between the objectives.
We also examine the ability to extrapolate in objective-space as shown in Figure 5. Specifically, we wanted to see if we can emulate working points that are not intermediate to those used during training. Interestingly, setting or also leads to meaningful results, corresponding to extrapolation in scale space of the style.
Figure 6 (panel labels: GT, Best LPIPS (vgg)): Tuning (at test time) produces different images. The results of the original pix2pix () are not always the best ones, and better LPIPS scores (lower is better, marked in green) can be obtained with different for different images. The original network hallucinates texture details that sometimes produce artifacts, such as the white stain on the handle of the bottom bag. Tuning reduces these artifacts. Bottom: The mean LPIPS per shows that the best result is obtained not with the original pix2pix network, but rather with (different values when using VGG or AlexNet). The horizontal lines at the bottom of the plot represent the mean score when for each image we select the best , in terms of LPIPS. This allows the user to fine-tune the hyper-parameter without retraining the network.
Figure 7 (panel labels: Dark Hair, Blond Hair).
4.2 The objective is image specific
The second motivation behind our approach is to allow image-specific tuning. This is important because the optimal working point is oftentimes image specific. A similar observation was made in super-resolution, where viewers preferred a different working point per image. This suggests that fixing the objective at training-time is sub-optimal, as for each image a different objective might be preferable. We next demonstrate that this is indeed the case.
We chose the seminal work of Isola et al. [8], pix2pix, where the objective is a super-position of the L1 norm between the output image and the target image and the conditional adversarial loss. The adversarial term pushes the images towards the real manifold, producing more diverse, sharp and realistic images. In our experiments we trained the main network with the objective , then added three tuning-blocks and trained them with . At test-time we generated results for a range of values for and computed the LPIPS [24] measure of perceptual similarity to the ground-truth.
Figure 6 presents a few sample results as well as statistics over the entire test set. We draw several observations from this figure. First, we observe that tuning modifies the resulting images, suggesting that indeed the working point does change with . Second, we note that for different images the best perceptual distance, measured by LPIPS [24], is obtained for different values of . Furthermore, if we automatically select for each image the best , then the overall LPIPS score is improved. This suggests that a single value for at training time could not provide the optimal result for each and every input. Finally, it is notable that for each image there exists a value of for which the LPIPS score is better than that of the results of the original pix2pix network. This shows the importance of allowing the user to fine-tune the results at test-time.
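The per-image selection described above can be sketched as follows. The score table is a made-up placeholder (not measured LPIPS values), and the function names are hypothetical; the point is only to show how a per-image choice of the tuning parameter compares with a single fixed value.

```python
from typing import Dict

Scores = Dict[str, Dict[float, float]]  # scores[image_id][alpha] -> distance, lower is better


def best_alpha_per_image(scores: Scores) -> Dict[str, float]:
    """Pick, for every image, the alpha with the lowest score."""
    return {img: min(by_alpha, key=by_alpha.get) for img, by_alpha in scores.items()}


def mean_score_fixed(scores: Scores, alpha: float) -> float:
    """Mean score when one alpha is used for all images."""
    return sum(by_alpha[alpha] for by_alpha in scores.values()) / len(scores)


def mean_score_per_image_best(scores: Scores) -> float:
    """Mean score when the best alpha is selected separately for each image."""
    return sum(min(by_alpha.values()) for by_alpha in scores.values()) / len(scores)


# Placeholder numbers for illustration only.
scores = {
    "img_a": {0.0: 0.30, 0.5: 0.20, 1.0: 0.25},
    "img_b": {0.0: 0.40, 0.5: 0.45, 1.0: 0.35},
}

print(best_alpha_per_image(scores))
print(mean_score_fixed(scores, 1.0))
print(mean_score_per_image_best(scores))
```

By construction, the per-image mean can never be worse than the mean for any single fixed value, which is why image-specific tuning can only help under this measure.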
4.3 The objective is user specific: Face Generation
In some applications the desired output is not only image dependent but further depends on the user’s preference. This could be observed previously in the style transfer experiments, where every user could prefer different stylization options. As another example of such a case we chose the task of face generation, where our approach endows the user with fine control over certain facial attributes, such as hair color or gender.
We adopted the architecture of DCGAN [19] that is trained with a single adversarial loss over the CelebA [15] dataset. To provide control over an attribute, such as hair color, we split the dataset into two sub-sets, e.g., dark hair vs. blond. Both the main network and the tuning-blocks were trained with an adversarial loss, but on different data sub-sets. The two objectives are thus disjoint in this case.
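The data split can be sketched as a simple filter over per-image attribute flags. The record layout, file names and the assignment of subsets to the two training phases are assumptions made for illustration; the attribute keys follow the style of CelebA's binary annotations.

```python
from typing import Dict, List


def select_subset(records: List[Dict], attribute: str) -> List[Dict]:
    """Return the records whose binary attribute flag is set."""
    return [r for r in records if r["attrs"].get(attribute, False)]


# Hypothetical annotation records (not real dataset entries).
records = [
    {"file": "a.jpg", "attrs": {"Black_Hair": True, "Blond_Hair": False}},
    {"file": "b.jpg", "attrs": {"Black_Hair": False, "Blond_Hair": True}},
    {"file": "c.jpg", "attrs": {"Black_Hair": False, "Blond_Hair": False}},
    {"file": "d.jpg", "attrs": {"Black_Hair": True, "Blond_Hair": False}},
]

# Assumed assignment: one subset for the main network, the other for the tuning-blocks.
main_subset = select_subset(records, "Black_Hair")
tuning_subset = select_subset(records, "Blond_Hair")

print([r["file"] for r in main_subset])
print([r["file"] for r in tuning_subset])
```

Images that carry neither flag are simply left out of both subsets in this sketch.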
At test-time, the user can tune to generate a face with desired properties. For example, the user can tune the hair color or the masculinity of the generated face. Qualitative results are presented in Figure 7 for two attributes: male-to-female and dark-to-blond. It can be seen that our Dynamic-Net smoothly traverses between the two objectives, generating plausible images, with a smooth attribute control.
4.4 Robustness to hyper-parameter: Image Completion
Our last goal is to show that our approach shrinks the required search space over the objective at training time. This is evident to some extent in all of the applications, for example, in edges2handbags we can refine the network without re-training according to the best LPIPS on the validation set. We further strengthen our claims via the task of image completion. We intentionally chose objectives that are sub-optimal, and lead to poor quality completion and artifacts, and used them to train the main network and the tuning blocks. Then, at test time, we tune and show that high-quality results can still be achieved. This was in order to show that even when the main network is of poor quality, adding the tuning blocks with an appropriate could result in a better overall network, getting rid of the artifacts. This is possible because we can traverse the objective space and thus identify good working points, even when those used for training were sub-optimal.
In our experimental setup the input is a face image with a large hole at the center, and the goal is to complete the missing details in a faithful and realistic manner. As architecture we adopted a version of pix2pix [8] (see supplementary for details). The objective for training the main blocks was and three tuning-blocks were trained with . We present results for three values .
Figure 8 shows some of our results. It can be seen that good quality completion is obtained with the main network, when trained with . Conversely, the completion suffers from artifacts when using either or . In both cases, however, adding the tuning blocks can fix this and lead to high-quality results, similar to those obtained when training with .
This suggests that during training, rather than trying multiple values for , one can just select a single value, and then at test-time adapt . The training of the tuning-blocks demonstrates robustness and implies that our Dynamic-Net forms a good alternative to the traditional greedy search. Setting is fast and provides an interesting alternative to hyper-parameter search at training time, both in terms of computing efficiency and because it enables image- and user-specific tuning.
5 Method Analysis
We next present two additional experiments on the task of style transfer: the first compares the two suggested frameworks and the second discusses the method's limitations.
Single-block vs. Multi-block: Figure 9 compares our two frameworks. As the plot shows, the single-block framework (blue and purple curves) produces a curve in the objective space, while the multi-block framework produces an entire surface over the objective space (green dots). Furthermore, the single-block framework emulates the fixed nets slightly better; however, the multi-block framework better minimizes objective (point B).
Limitations: Figure 10 presents the limitation of the proposed method when using extreme objectives. Specifically, we trained the tuning-blocks without a style loss term, i.e. . We observe that simple image interpolation (blue curve) achieves better results than our method (red curve) when approaching point C, that is, near the objective . The main reason is that the main network was trained for style transfer, and the ability of the tuning blocks to “turn the table upside down” and produce an image with very little style is limited. Last, we show that using Dynamic-Net with a smaller range between the objectives, , (green curve) outperforms both methods and approximates the fixed nets accurately.
We propose Dynamic-Net, a novel two-phase training framework that allows traversing the objective space at inference time without re-training the model. We have shown its broad applicability on a variety of vision tasks: style transfer, image-to-image transformation, face generation and image completion. In all applications we showed that our method allows easy and intuitive control of the objective trade-off. This work is a first step in providing a model that is not limited to a specific static working point, that is, a dynamic model. Future work includes bringing the dynamic concept to other applications and extending it to other objective spaces.
In the supplementary we present additional results and provide implementation details.
This research was supported by the Israel Science Foundation under Grant 1089/16 and by the Ollendorf foundation.
- [1] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai. Better mixing via deep representations. In ICML, 2013.
- [2] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor. 2018 PIRM challenge on perceptual image super-resolution. In ECCVW, 2018.
- [3] Y. Blau and T. Michaeli. The perception-distortion tradeoff. In CVPR, 2018.
- [4] Y.-C. Chen, H. Lin, M. Shu, R. Li, X. Tao, X. Shen, Y. Ye, and J. Jia. Facelet-bank for fast portrait manipulation. In CVPR, 2018.
- [5] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018.
- [6] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
- [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
- [8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- [9] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
- [10] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
- [11] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
- [12] Z. Li and D. Hoiem. Learning without forgetting. IEEE TPAMI, 2017.
- [13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [14] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. In ECCV, 2018.
- [15] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
- [16] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor. Maintaining natural image statistics with the contextual loss. In ACCV, 2018.
- [17] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In ECCV, 2018.
- [18] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
- [19] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- [20] C. Rosenbaum, T. Klinger, and M. Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In ICLR, 2018.
- [21] S. Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
- [22] W. Shen and R. Liu. Learning residual images for face attribute manipulation. In CVPR, 2017.
- [23] P. Upchurch, J. R. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Q. Weinberger. Deep feature interpolation for image content changes. In CVPR, 2017.
- [24] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
- [25] Y. Zhou, Z. Zhu, X. Bai, D. Lischinski, D. Cohen-Or, and H. Huang. Non-stationary texture synthesis by adversarial expansion. arXiv preprint arXiv:1805.04487, 2018.