Humans are usually the ones designing and training neural networks. Since neural networks are capable of outperforming humans on various learning tasks, a natural question arises: can they also be better at creating new neural networks? This question has already been asked a few times, but we do not have a clear answer yet.
There are two common approaches to the above question. The first direction focuses on generating the whole network architecture. It was attacked by employing reinforcement learning [Zoph and Le2016], [Maziarz et al.2018]. Although this makes it possible to create novel architectures from scratch, it is very time- and resource-consuming. In the second approach, we assume a predefined architecture and focus on generating correct weights. For this purpose, one can design a hypernetwork, which acts as a generator returning weights for other networks (target networks) [Ha et al.2016]. Hypernetworks have been successfully applied to reducing the number of trainable parameters [Ha et al.2016], generative models for neural networks, maximum likelihood estimation [Sheikh et al.2017], Bayesian learning [Krueger et al.2017], [Louizos and Welling2017], architecture search [Brock et al.2017], [Zhang et al.2018a], hyperparameter optimization [Lorraine and Duvenaud2018], etc. The basic problem is that a single hypernetwork generates weights which can solve only one specific problem. In this paper, we focus on designing a hypernetwork which can return multiple target networks customized to serve different purposes.
We present two applications of this model. In the first one, we use hypernetworks to create a functional continuous representation of images. More precisely, the hypernetwork takes an image and produces weights for a target network, which approximates the RGB intensities at each coordinate pair. Due to the continuity of the representation, we can look at the image at different scales, which is experimentally verified in the case of super-resolution, Figure 1. In contrast to typical super-resolution approaches, we have a single model responsible for upscaling the image to any size. Moreover, we can create non-standard sizes at test time, Figure 2. Since we have one neural network responsible for creating the individual image models, similar images are described by similar target networks. In consequence, it is possible to interpolate between weights of target networks and produce natural images, Figure 3. We also demonstrate that hypernetworks can create generative models for new data sets at test time. To show its usefulness, we design a hypernetwork describing distributions of 2D point clouds.
2 Hypernetwork model
We start by recalling the basic hypernetwork model. Next, we present how to adapt it to create multiple target networks for different tasks, on two practical examples.
Let T(·; θ): X → Y be a target neural network, where θ is a set of trainable parameters and X, Y are input and output domains, respectively. Our objective is to find a weight vector θ which solves a given learning problem. As an alternative to the typical backpropagation procedure, the hypernetwork mechanism can be used. In this framework, an additional neural network (hypernetwork) H: Z → Θ is employed, where Z is some input domain and Θ is the space of weight vectors. Given an instance z ∈ Z, the hypernetwork returns weights θ = H(z) for the corresponding target network T(·; θ). Thus the solution is given by the network T(·; H(z)). In the following subsections, we show how to train a hypernetwork which returns weights for multiple target networks used for different purposes.
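To make the mechanism concrete, the following minimal numpy sketch instantiates a hypernetwork H as a single affine map producing a flat weight vector, which parameterizes a one-layer target network T. All sizes and both architectures here are hypothetical illustrations, not the architectures used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

IN_DIM, OUT_DIM = 4, 3                      # hypothetical target-network sizes
N_THETA = IN_DIM * OUT_DIM + OUT_DIM        # weights + biases of T

W_h = rng.normal(scale=0.1, size=(N_THETA, 5))  # the hypernetwork's own parameters
b_h = np.zeros(N_THETA)

def hypernetwork(z):
    """H(z): map an input instance z to a flat weight vector theta."""
    return W_h @ z + b_h

def target_network(x, theta):
    """T(x; theta): a single affine layer whose parameters come from theta."""
    W = theta[: IN_DIM * OUT_DIM].reshape(OUT_DIM, IN_DIM)
    b = theta[IN_DIM * OUT_DIM:]
    return W @ x + b

z = rng.normal(size=5)          # instance that conditions the weights
theta = hypernetwork(z)         # weights are *produced*, not trained directly
y = target_network(rng.normal(size=IN_DIM), theta)
```

Only the hypernetwork's parameters (W_h, b_h) would be trained; the target network's weights are an output of H, so gradients flow through theta back into the hypernetwork.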
2.1 Functional image representation
An image is usually represented as a two-dimensional pixel matrix. This representation is discrete and, in consequence, it is difficult to look at the image at different scales. Moreover, it is impossible to perform typical mathematical operations such as differentiation, applying continuous filters, etc. As a remedy, one can create a functional continuous representation of images. More precisely, we aim at creating a function
f: (x, y) → (R, G, B),
which approximates the RGB intensities at each coordinate pair (x, y).
In the simplest case, this function can be obtained by linear or quadratic interpolation. However, restricting to a given class of functions may be insufficient. On the other hand, one could represent an image with the use of a neural network, but there is no point in training a separate model for each image. We approach this task by introducing a hypernetwork which, for an image I, returns weights θ_I to the corresponding target network. In consequence, an image is represented as a function T(·; θ_I), which for any coordinates (x, y) returns the corresponding RGB intensities of the image I.
The above model can be trained by minimizing the classical MSE loss. More precisely, we take an input image I, generate weights θ_I = H(I) for the target network, and compare the obtained representation with the input image pixel by pixel. We minimize the expected mean square error over the training set of images:
min_H E_I Σ_{(x,y)} ‖T((x, y); H(I)) − I(x, y)‖².
Observe that we train only a single neural model (the hypernetwork), which can produce a great variety of functions at test time. We expect that target networks for similar images will be similar. In consequence, interpolation between weights of target networks should lead to reasonable-looking images. In contrast, if we created an individual network for every image, such an interpolation would be misleading (see the experimental section for details).
2.2 Generative model
Generative models learn the underlying data distribution and create new objects, e.g. images, texts, etc. However, an individual model has to be trained from scratch for every new data set.
We present how to train a hypernetwork to produce a generative model for a new data set at test time. The hypernetwork operates on a family of data sets and, for each data set X, returns weights θ_X to the corresponding generative model T(·; θ_X). As a proof of concept, we describe how to realize this idea in low-dimensional spaces.
Let the empirical probability distribution of a given data set X be described by P_X. We would like to model P_X as a transformation of some simple probability distribution P_Z defined on a latent space Z. Thus we look for weights θ to a neural network T(·; θ): Z → R². Generative properties are verified by comparing P_X with the distribution of the transformed sample, where the sample is generated from P_Z. In low-dimensional spaces, one can use typical kernel density estimation to approximate the transformed density. More precisely, we generate a sample z_1, …, z_n from P_Z, transform it by the target network, and obtain y_i = T(z_i; θ), for i = 1, …, n. Next, we create the kernel density estimator
p̂(x) = (1/n) Σ_{i=1}^n N(x; y_i, γ),
where the bandwidth γ is a model parameter. Finally, we compare p̂ with P_X by minimizing the cross-entropy
−Σ_{x ∈ X} log p̂(x),
where the sum is taken over the elements of the data set X.
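The density-estimation step above can be sketched directly in numpy. This is a minimal illustration with an isotropic Gaussian kernel; the bandwidth value and the numerical floor inside the logarithm are assumptions for the sketch, not values from the paper.

```python
import numpy as np

def gaussian_kde(samples, points, gamma=0.05):
    """Kernel density estimate at `points` built from `samples` (both (n, 2))."""
    # pairwise squared distances between evaluation points and KDE samples
    d2 = ((points[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    k = np.exp(-d2 / (2 * gamma**2)) / (2 * np.pi * gamma**2)
    return k.mean(axis=1)

def cross_entropy(data, samples, gamma=0.05):
    """Negative log-density of the data under the KDE built from `samples`."""
    dens = gaussian_kde(samples, data, gamma)
    return -np.log(dens + 1e-12).sum()   # small floor avoids log(0)
```

Minimizing this cross-entropy over the hypernetwork's parameters pushes the transformed sample toward the empirical distribution of the data set.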
The advantage of our approach is that, at test time, the hypernetwork returns a complete generative model for a given data set. Clearly, this approach can be successful only in low dimensions, because we use kernel density estimation and compare probability distributions in the original space. To adapt this model to higher dimensions, one should use an autoencoder structure and compare distributions using e.g. the MMD distance [Li et al.2017], see [Tolstikhin et al.2017]. In the experiments, we demonstrate that the model in this form has the potential to be used for modeling object shapes.
3 Experiments
We present potential applications of the proposed methodology. First, we apply the model to functional image representation and examine its performance in super-resolution. Next, we use it to describe 2D shapes. In both cases, we show some interesting geometrical properties of the space of target networks. We start by presenting the architecture used in the experiments.
3.1 Architecture
Designing correct architectures of the target network and hypernetwork for a given task is very important. In particular, optimal architectures for super-resolution differ from those used in generative models. Since this paper gives a proof of concept, we use a single architecture in all experiments.
| Scale | Bicubic | SRCNN [Dong et al.2016a] | RDN [Zhang et al.2018b] | Ours |
The architecture of the target network should be simple and small. Since the target network is not trained directly, this keeps the training phase as efficient as possible. Moreover, small networks can be easily reused in other applications.
The target network consists of five fully-connected layers, see Figure 4. The layers' dimensions gradually increase up to the middle layer and then decrease, because steep transitions between layer dimensions negatively affect the learning ability of a neural network. Additionally, batch normalization is applied between layers [Ioffe and Szegedy2015]. We chose an activation function between consecutive layers [Goodfellow et al.2016] which worked much better than ReLU for our purpose. The activation function of the last layer is a sigmoid. Since the size of the hypernetwork's output depends on the number of trainable parameters in the target network, we used residual connections in the target network. No convolutions were used, because the input of the target network is too simple.
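A numpy sketch of such a target network is given below. The layer widths, the tanh stand-in activation, and the omission of batch normalization and residual connections are all simplifications made for illustration; only the overall shape (five fully-connected layers widening to the middle, a sigmoid on the output) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical widths: gradually increased to the middle layer, then decreased.
DIMS = [2, 16, 32, 16, 8, 3]   # (x, y) coordinates in -> RGB out

def init_weights():
    """Random parameters for the five fully-connected layers."""
    return [(rng.normal(scale=0.3, size=(o, i)), np.zeros(o))
            for i, o in zip(DIMS[:-1], DIMS[1:])]

def forward(xy, params):
    """Evaluate the target network at a single (x, y) coordinate pair."""
    h = xy
    for idx, (W, b) in enumerate(params):
        h = W @ h + b
        if idx < len(params) - 1:
            h = np.tanh(h)               # stand-in smooth activation
    return 1.0 / (1.0 + np.exp(-h))      # sigmoid keeps RGB in [0, 1]
```

In the full model these parameters would not be random: they are exactly the vector emitted by the hypernetwork for a given image.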
The hypernetwork is a convolutional neural network with some modifications, see Figure 5. We created an eight-layer network with one residual connection. To reduce the number of trainable parameters, we adapt an approach used in the inception network [Szegedy et al.2015]: instead of a single large convolution, we use a factorized pair of smaller convolutions. To generate the weights of each layer of the target network, we designed the following process. The first few layers of the hypernetwork are common and take part in generating weights for every layer of the target network. Next, the network splits into separate branches, one for each layer of the target network. The purpose of the initial layers is to extract features from an input image; the following layers find weights for the target network based on these features. This process led to faster training than creating a separate hypernetwork for each layer of the target network. ReLU is the activation function of every layer in the hypernetwork, and batch normalization is applied after each layer.
3.2 Super-resolution
Since the target network gives a functional continuous representation of the input image, we can upscale the image to any size and, in consequence, apply our approach to super-resolution.
To make this approach successful, we feed the hypernetwork with low-resolution images and evaluate the MSE loss on high-resolution ones. More precisely, we take the original image, downscale it using bicubic interpolation, and input it to the hypernetwork. The hypernetwork produces the weights of the target network, which defines the functional representation of the input image. To evaluate its quality, we take a grid of the original image's resolution on the image returned by the target network and compare the values with the pixel intensities of the original image using the MSE loss.
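The evaluation step can be sketched as follows. The block-average downscaling stands in for the bicubic interpolation used in the paper, and the [0, 1] coordinate normalization is an assumption of this sketch; the point is only the shape of the pipeline: downscale, query the target network on the high-resolution grid, compare pixel by pixel.

```python
import numpy as np

def downscale(img, factor):
    """Naive block-average downscaling (the paper uses bicubic interpolation)."""
    h, w = img.shape[:2]
    return img[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor, -1).mean(axis=(1, 3))

def coordinate_grid(h, w):
    """All (x, y) pairs of an h x w grid, normalized to [0, 1]."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([xs / max(w - 1, 1), ys / max(h - 1, 1)], axis=-1).reshape(-1, 2)

def sr_mse(target_net, original):
    """MSE between the target network on the full-resolution grid and the
    pixel intensities of the original high-resolution image."""
    h, w = original.shape[:2]
    pred = np.array([target_net(xy) for xy in coordinate_grid(h, w)])
    return ((pred - original.reshape(-1, original.shape[-1])) ** 2).mean()
```

During training, `downscale(original, s)` is what the hypernetwork sees, while `sr_mse` is evaluated against the original image.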
Since input images can have different resolutions, we split them into overlapping parts of fixed size. In consequence, the value at each coordinate is described by multiple target networks. To produce a single output for every coordinate at test time, we take the (weighted) average of the values returned by all target networks covering this coordinate. This also smooths the output function.
To test our approach, we trained the model on examples from the DIV2K data set [Agustsson and Timofte2017]. Its performance was evaluated on Set5 [Bevilacqua et al.2012], Set14 [Zeyde et al.2010], B100 [Martin et al.2001], and Urban100 [Huang et al.2015]. As quality measures we used PSNR and SSIM [Wang et al.2004], the two measures typically applied in super-resolution tasks; higher values indicate better performance. We considered scale factors ×2, ×3, and ×4.
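For reference, PSNR is a simple function of the MSE between the reconstruction and the ground truth; the sketch below assumes intensities in [0, 1] (SSIM is structurally more involved and is omitted here).

```python
import numpy as np

def psnr(reference, estimate, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    mse = ((reference - estimate) ** 2).mean()
    if mse == 0:
        return float("inf")            # identical images
    return 10 * np.log10(max_val**2 / mse)
```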
As a baseline, we used bicubic interpolation. Moreover, we compared our approach with SRCNN [Dong et al.2016b], which was a state-of-the-art method in 2016, and with the recent state of the art, RDN [Zhang et al.2018c]. Our goal is to train a single hypernetwork model to generate images at various scales. This is a more general solution than typical super-resolution approaches, where every model is responsible for upscaling the image to only one resolution. In consequence, both SRCNN and RDN are expected to perform better than our method. We trained our network once on images downscaled 2, 3 and 4 times, and supplied the target network with an additional parameter indicating the scale factor.
The results presented in Tables 1 and 2 demonstrate that our model performs significantly better than the bicubic-interpolation baseline (see also Figure 1 for a sample result). Surprisingly, a single hypernetwork trained on all scales achieved performance comparable to SRCNN, which creates a separate model for each scale factor. This shows the high potential of our model as a method of image representation. Nevertheless, it was not able to reach the scores of the recent super-resolution state of the art, which might be caused by an insufficient hypernetwork architecture. In our opinion, designing a hypernetwork architecture similar to RDN should lead to comparable performance. The main advantage of our approach is its generality: we trained a single model for various scale factors.
3.3 Target networks geometry
It is believed that high-dimensional data, e.g. images, are embedded in low-dimensional manifolds [Goodfellow et al.2016]. In consequence, direct linear interpolation between images does not produce natural-looking pictures.
In this experiment, we inspect the space of target network weights. In particular, we verify whether linear interpolation between the weights of two target networks produces realistic images. For this purpose, we train the hypernetwork model presented in the previous subsection on images at a single scale. In other words, we are not interested in rescaling images, but only in creating their functional representations. We use the CelebA data set [Liu et al.2015] with a trivial preprocessing: cropping the central 128x128 pixels of the image and resizing them to 64x64 pixels.
At test time, we generate target networks for two images and take the linear interpolation between the weights of these networks. Figure 3 presents the images returned by the interpolated target networks. It is evident that the interpolation produces images from the true data distribution. It means that we transformed the manifold of images into a more compact structure (a set of weights) where linear transformations can be applied. This suggests that similar images have similar target-network weights. For comparison, we generated a classical pixel-wise interpolation between the analogous examples. As can be seen in Figure 6, the results are much worse, because the interpolation gives a superimposition of the images.
Going further, we verified a layer-wise interpolation. Namely, we took the weights of one target network and gradually changed the weights of its first layers toward the corresponding weights of the second target network. As can be seen in Figure 7, each layer may be understood as having a different functionality, e.g. the third layer is responsible for the general shape, while the last layer corrects the colors in the image. In future work, it may be interesting to use the hypernetwork mechanism to obtain disentangled representations.
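Both interpolation schemes above operate purely on weight vectors and can be sketched in a few lines; the flat-vector and per-layer representations below are illustrative conventions, not the paper's storage format.

```python
import numpy as np

def interpolate_weights(theta_a, theta_b, t):
    """Linear interpolation between two flat target-network weight vectors."""
    return (1 - t) * theta_a + t * theta_b

def layerwise_interpolation(layers_a, layers_b, k):
    """Layer-wise variant: take the first k layers from network B and the
    remaining layers from network A."""
    return layers_b[:k] + layers_a[k:]
```

Sweeping t from 0 to 1 (or k from 0 to the layer count) and decoding each intermediate weight set with the target network yields the image sequences of Figures 3 and 7.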
3.4 Generating models of objects
Generative models give an opportunity to create new examples resembling the underlying data set. We demonstrate that hypernetworks can be trained to return an individual generative model for a new data set at test time.
As a proof of concept, we tested our method on 2-dimensional data sets. We used the MNIST database [LeCun et al.1998] and interpreted each image as an individual probability distribution on 2D space. Namely, for a given image, the probability of generating a point (x, y) is proportional to the brightness of the pixel at (x, y). We would like to create an individual generative model for every image (understood as a 2D data set). For this purpose, we need to find a transformation of some simple probability distribution which reproduces a given image. In this example we look for a transformation of the uniform distribution on a unit circle in 2D space. Intuitively, we would like to transform the circle into the shape of a given image. In every batch, we take a sample from the circle, transform it by the target network, and use kernel density estimation to model the resulting distribution. The similarity between this distribution and the empirical one is verified by computing the cross-entropy, as described in Section 2.2.
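Sampling the base distribution is the only new ingredient relative to Section 2.2; a sketch, assuming the circle is centered at the origin with radius 1:

```python
import numpy as np

def sample_unit_circle(n, rng=None):
    """Draw n points uniformly from the unit circle (the curve, not the disk)."""
    rng = rng or np.random.default_rng()
    angles = rng.uniform(0.0, 2 * np.pi, size=n)
    return np.stack([np.cos(angles), np.sin(angles)], axis=-1)
```

Each batch of such points is pushed through the target network, and the resulting cloud is compared to the image's empirical distribution via the KDE cross-entropy.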
Sample results are presented in Figure 8. Blue points show the values produced by a target network; the line is traced between subsequent points mapped from the circle. As can be seen, using the target network we obtain an outline of a given digit if we iterate over the circle in small steps. Despite the simplicity of this experiment, its extensions may find applications in object modeling. For example, we may use this approach in 3D printing, where finding intermediate steps between sampled points is a crucial task [Zarzar Gandler2017]. Since it is essential that the running time of such algorithms be low [Morse et al.2005], methods that work in constant time (such as a neural network) may be beneficial. In the future, we plan to use reversible generative networks [Kingma and Dhariwal2018] to remove the redundant loops visible in the image of the circle.
Analogously to the previous experiment, we also examined the interpolation between two generative models (the weights of two target networks). Sample results shown in Figure 9 demonstrate that the obtained shapes change gradually. It means that target networks with similar weights produce similar images. This effect could not be achieved if we trained a separate network for every image.
4 Conclusion
We presented an extension of the hypernetwork mechanism which allows creating target networks for various purposes using a single hypernetwork model. This approach was applied to creating functional image representations and to constructing generative models for new data sets at test time. Due to the continuity of the representation, we were able to upscale an image to any resolution. We also observed that the constructed hypernetwork transforms the manifold of images into a more compact space, where linear interpolation between images can be performed: we can traverse linearly from one image to another without falling out of the true data distribution. Our experiments suggest that hypernetworks can be used to produce generative models. In the future, we plan to use our approach to create generative models for higher-dimensional data.
- [Agustsson and Timofte2017] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 126–135, 2017.
- [Bevilacqua et al.2012] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
- [Brock et al.2017] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. SMASH: one-shot model architecture search through hypernetworks. CoRR, abs/1708.05344, 2017.
- [Dong et al.2016a] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
- [Dong et al.2016b] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
- [Goodfellow et al.2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
- [Ha et al.2016] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
- [Huang et al.2015] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
- [Ioffe and Szegedy2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- [Kingma and Dhariwal2018] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
- [Krueger et al.2017] David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
- [LeCun et al.1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [Li et al.2017] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
- [Liu et al.2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
- [Lorraine and Duvenaud2018] Jonathan Lorraine and David Duvenaud. Stochastic hyperparameter optimization through hypernetworks. CoRR, abs/1802.09419, 2018.
- [Louizos and Welling2017] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2218–2227. JMLR.org, 2017.
- [Martin et al.2001] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision, page 416. IEEE, 2001.
- [Maziarz et al.2018] Krzysztof Maziarz, Andrey Khorlin, Quentin de Laroussilhe, and Andrea Gesmundo. Evolutionary-neural hybrid agents for architecture search. arXiv preprint arXiv:1811.09828, 2018.
- [Morse et al.2005] Bryan S Morse, Terry S Yoo, Penny Rheingans, David T Chen, and Kalpathi R Subramanian. Interpolating implicit surfaces from scattered surface data using compactly supported radial basis functions. In ACM SIGGRAPH 2005 Courses, page 78. ACM, 2005.
- [Sheikh et al.2017] Abdul-Saboor Sheikh, Kashif Rasul, Andreas Merentitis, and Urs Bergmann. Stochastic maximum likelihood optimization via hypernetworks. arXiv preprint arXiv:1712.01141, 2017.
- [Szegedy et al.2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
- [Tolstikhin et al.2017] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
- [Wang et al.2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
- [Zarzar Gandler2017] Gabriela Zarzar Gandler. Evaluation of probabilistic representations for modeling and understanding shape based on synthetic and real sensory data, 2017.
- [Zeyde et al.2010] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International conference on curves and surfaces, pages 711–730. Springer, 2010.
- [Zhang et al.2018a] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. CoRR, abs/1810.05749, 2018.
- [Zhang et al.2018b] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [Zhang et al.2018c] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
- [Zoph and Le2016] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.