Multi-task hypernetworks

02/27/2019 ∙ by Sylwester Klocek, et al. ∙ Jagiellonian University

The hypernetwork mechanism allows one to generate and train neural networks (target networks) by means of another neural network (a hypernetwork). In this paper, we extend this idea and show that hypernetworks are able to generate target networks which can be customized to serve different purposes. In particular, we apply this mechanism to create a continuous functional representation of images: the hypernetwork takes an image and at test time produces the weights of a target network which approximates its RGB pixel intensities. Due to the continuity of the representation, we may view the image at different scales or fill in missing regions. Second, we demonstrate how to design a hypernetwork which produces a generative model for a new data set at test time. Experimental results demonstrate that the proposed mechanism can be successfully used in super-resolution and 2D object modeling.


1 Introduction

Humans usually design and train neural networks themselves. Since neural networks are capable of outperforming humans on various learning tasks, a natural question arises: can they also be better at creating new neural networks? This question has already been asked a few times, but we do not yet have a clear answer.

There are two common approaches to the above question. The first direction focuses on generating the whole network architecture. It was attacked by employing reinforcement learning [Zoph and Le2016] or genetic algorithms [Maziarz et al.2018]. Although this allows creating novel architectures from scratch, it is very time- and resource-consuming. In the second approach, we assume a predefined architecture and focus on generating correct weights. For this purpose, one can design a hypernetwork which acts as a generator returning weights to other networks (target networks) [Ha et al.2016]. Hypernetworks have been successfully applied to reducing the number of trainable parameters [Ha et al.2016], to generative models for neural networks and maximum likelihood estimation [Sheikh et al.2017], in a Bayesian context [Krueger et al.2017, Louizos and Welling2017], and elsewhere [Brock et al.2017, Zhang et al.2018a, Lorraine and Duvenaud2018]. The basic problem is that a single hypernetwork generates weights which can solve only one specific problem. In this paper, we focus on designing a hypernetwork which can return multiple target networks customized to serve different purposes.

Figure 1: A higher-resolution image obtained using bicubic interpolation (left) and our method (middle); the low-resolution input image is on the right.

Figure 2: The hypernetwork can resize an image to non-standard resolutions.
Figure 3: Interpolation between the weights of two target networks generating functional image representations. Observe that linear interpolation in the space of weights corresponds to natural images coming from the true data distribution (there is no superimposition of images, as in the case of pixel-wise interpolation; see Figure 6).

We present two applications of this model. In the first one, we use a hypernetwork to create a functional, continuous representation of images. More precisely, the hypernetwork takes an image and produces the weights of a target network, which approximates the RGB intensities at each coordinate pair. Due to the continuity of the representation, we can look at the image at different scales, which we verify experimentally in the case of super-resolution (Figure 1). In contrast to typical super-resolution approaches, a single model is responsible for upscaling the image to any size. Moreover, we can create non-standard sizes at test time (Figure 2). Since one neural network is responsible for creating all the individual image models, similar images are described by similar target networks. In consequence, it is possible to interpolate between the weights of target networks and produce natural images (Figure 3). We also demonstrate that hypernetworks can create generative models for new data sets at test time. To show the usefulness of this idea, we design a hypernetwork for describing distributions of 2D point clouds.

2 Hypernetwork model

We start by recalling the basic hypernetwork model. Next, we show on two practical examples how to adapt it to create multiple target networks for different tasks.

Let f_θ : X → Y be a target neural network, where θ denotes a set of trainable parameters and X, Y are the input and output domains, respectively. Our objective is to find a weight vector θ which solves a given learning problem. As an alternative to the typical backpropagation procedure, the hypernetwork mechanism can be used. In this framework, an additional neural network (the hypernetwork) H : Z → Θ is employed, where Z is some input domain and Θ is the space of weight vectors. Given an instance z ∈ Z, the hypernetwork returns the weights θ = H(z) of the corresponding target network f_θ. Thus the solution is given by the network f_{H(z)}. In the following subsections, we show how to train a hypernetwork which returns weights to multiple target networks used for different purposes.
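For concreteness, the mechanism can be sketched in a few lines of PyTorch (our choice of framework for illustration; nothing here is prescribed by the model). All layer sizes and names are hypothetical: the hypernetwork maps an instance z to a flat weight vector, which is then sliced into the weight matrices and biases of a small target network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IN_DIM, HID_DIM, OUT_DIM = 2, 16, 3  # a toy target network f: R^2 -> R^3
N_THETA = (IN_DIM * HID_DIM + HID_DIM) + (HID_DIM * OUT_DIM + OUT_DIM)

class HyperNetwork(nn.Module):
    """Maps an instance z from the input domain Z to the flat weight
    vector theta of the target network."""
    def __init__(self, z_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(),
            nn.Linear(128, N_THETA),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.body(z)  # theta = H(z)

def target_forward(theta: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Evaluates f_theta(x) for a two-layer target network by slicing the
    flat weight vector theta into weight matrices and biases."""
    i = 0
    W1 = theta[i:i + IN_DIM * HID_DIM].view(HID_DIM, IN_DIM); i += IN_DIM * HID_DIM
    b1 = theta[i:i + HID_DIM]; i += HID_DIM
    W2 = theta[i:i + HID_DIM * OUT_DIM].view(OUT_DIM, HID_DIM); i += HID_DIM * OUT_DIM
    b2 = theta[i:i + OUT_DIM]
    h = torch.tanh(F.linear(x, W1, b1))
    return F.linear(h, W2, b2)

z = torch.randn(64)                                 # an instance z in Z
theta = HyperNetwork()(z)                           # weights of the target network
y = target_forward(theta, torch.randn(5, IN_DIM))   # f_{H(z)} evaluated on inputs
```

Note that gradients flow through `target_forward` into the hypernetwork, so only the hypernetwork's parameters are trained.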

2.1 Functional image representation

An image is usually represented as a two-dimensional pixel matrix. This representation is discrete and, in consequence, it is difficult to look at the image at different scales. Moreover, it is impossible to perform typical mathematical operations such as differentiation, applying continuous filters, etc. As a remedy, one can create a functional, continuous representation of the image. More precisely, we aim at creating a function which approximates the RGB intensities at each coordinate pair (x, y).

In the simplest case, this function can be obtained by linear or quadratic interpolation. However, restricting ourselves to a given class of functions may be insufficient. On the other hand, one could represent an image with the use of a neural network, but there is no point in training a separate model for each image. We approach this task by introducing a hypernetwork H which, for an image I, returns the weights θ = H(I) of a corresponding target network T_θ : ℝ² → ℝ³. In consequence, the image is represented as the function T_{H(I)}, which for any coordinates (x, y) returns the corresponding RGB intensities of the image I.

The above model can be trained by minimizing the classical MSE loss. More precisely, we take an input image I, generate the weights H(I) of the target network and compare the obtained representation with the input image pixel by pixel. We minimize the expected mean squared error over a training set 𝒟 of images:

    min_H  Σ_{I ∈ 𝒟}  Σ_{(x, y)}  ‖ T_{H(I)}(x, y) − I(x, y) ‖²,

where I(x, y) denotes the RGB intensities of image I at pixel (x, y). Observe that we train only a single neural model (the hypernetwork), which is able to produce a great variety of functions at test time. We expect the target networks of similar images to be similar. In consequence, interpolation between the weights of target networks should lead to reasonably looking images. In contrast, if we created an individual network for every image, such an interpolation would be misleading (see the experimental section for details).
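A minimal sketch of one training step under this loss, reusing the hypothetical `target_forward` helper from the previous sketch and assuming a `hypernet` that maps a whole image to the flat weight vector of its target network:

```python
import torch

def coordinate_grid(h: int, w: int) -> torch.Tensor:
    """All (x, y) pixel coordinates of an h-by-w image, scaled to [0, 1]."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.stack([xs.flatten() / (w - 1), ys.flatten() / (h - 1)], dim=1)

def mse_step(hypernet, target_forward, image: torch.Tensor) -> torch.Tensor:
    """image: a (3, H, W) tensor with RGB values in [0, 1]."""
    _, h, w = image.shape
    theta = hypernet(image.unsqueeze(0)).squeeze(0)  # theta = H(I)
    coords = coordinate_grid(h, w)                   # (H*W, 2) pairs (x, y)
    pred = target_forward(theta, coords)             # T_{H(I)}(x, y), shape (H*W, 3)
    true = image.permute(1, 2, 0).reshape(-1, 3)     # pixel intensities I(x, y)
    return ((pred - true) ** 2).mean()               # pixel-wise MSE
```

The returned loss is backpropagated into the hypernetwork only; the target network has no parameters of its own.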

2.2 Generative model

Generative models allow one to learn the underlying data distribution and create new objects, e.g. images or texts. However, an individual model has to be trained from scratch for every new data set.

We show how to train a hypernetwork to produce a generative model for a new data set at test time. The hypernetwork operates on a family of data sets and, for each data set, returns the weights of a corresponding generative model. As a proof of concept, we describe how to realize this idea in low-dimensional spaces.

Figure 4: The architecture of the target network.
Figure 5: The architecture of the hypernetwork.

Let the empirical probability distribution of a given data set X ⊂ ℝ² be denoted by p_X. We would like to model p_X as a transformation T_θ of some simple probability distribution P defined on a space Z. Thus we look for the weights θ of a neural network T_θ : Z → ℝ². Generative properties are verified by comparing p_X with the distribution of T_θ(z), where z is sampled from P. In low-dimensional spaces, one can use typical kernel density estimation to approximate the transformed density. More precisely, we generate a sample z₁, …, zₙ from P, transform it by T_θ and obtain yⱼ = T_θ(zⱼ), for j = 1, …, n. Next, we create the kernel density estimate (here with Gaussian kernels)

    q(y) = (1/n) Σⱼ N(y; yⱼ, γ I),

where the bandwidth γ is a model parameter. Finally, we compare q with p_X by minimizing the cross-entropy:

    − Σ_{x ∈ X} log q(x),    (1)

where the sum is taken over the elements of the data set X.
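Under these definitions, the loss of Eq. (1) can be written down directly. The sketch below uses Gaussian kernels, the unit circle as the prior P (as in our 2D experiments), and hypothetical values of the sample size n and bandwidth γ:

```python
import math
import torch

def kde_cross_entropy(target_net, X: torch.Tensor,
                      n: int = 256, gamma: float = 1e-2) -> torch.Tensor:
    """X: (m, 2) points of the data set; target_net maps R^2 -> R^2."""
    # Sample z_1, ..., z_n uniformly from the unit circle (the prior P).
    angles = 2 * math.pi * torch.rand(n)
    z = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)
    y = target_net(z)                                # y_j = T_theta(z_j), (n, 2)
    # log q(x), where q(x) = (1/n) sum_j N(x; y_j, gamma * I) in 2D.
    sq_dist = torch.cdist(X, y) ** 2                 # (m, n) squared distances
    log_kernel = -sq_dist / (2 * gamma) - math.log(2 * math.pi * gamma)
    log_q = torch.logsumexp(log_kernel, dim=1) - math.log(n)
    return -log_q.sum()                              # cross-entropy over X
```

The `logsumexp` keeps the mixture density numerically stable for small bandwidths.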

The advantage of our approach is that, at test time, the hypernetwork returns a complete generative model for a given data set. Clearly, this approach can be successful only in low dimensions, because we use kernel density estimation and compare probability distributions in the original space. To adapt this model to higher dimensions, one should use an autoencoder structure and compare distributions using, e.g., the MMD distance [Li et al.2017], see [Tolstikhin et al.2017]. In the experiments, we demonstrate that the model in this form has the potential to be used for object modeling.

3 Experiments

We present potential applications of the proposed methodology. First, we apply the model to functional image representation and examine its performance on super-resolution. Next, we use it to describe 2D shapes. In both cases, we show some interesting geometrical properties of the space of target networks. We start by presenting the architecture used in the experiments.

3.1 Architecture

Designing correct architectures of the target network and the hypernetwork for a given task is very important. In particular, optimal architectures for super-resolution differ from those used in generative models. Since this paper is intended as a proof of concept, we use a single architecture in all experiments.

Dataset    Scale   Bicubic   SRCNN [Dong et al.2016a]   RDN [Zhang et al.2018b]   Ours
Set5       2x      33.64     36.66                      38.30                     36.09
Set5       3x      30.41     32.75                      34.78                     32.85
Set5       4x      28.42     30.49                      32.61                     30.69
Set14      2x      30.33     32.45                      34.10                     32.30
Set14      3x      27.63     29.30                      30.67                     29.37
Set14      4x      26.08     27.50                      28.92                     27.61
B100       2x      29.48     31.36                      32.40                     31.11
B100       3x      27.12     28.41                      29.33                     28.31
B100       4x      25.87     26.90                      27.80                     26.86
Urban100   2x      26.85     29.50                      33.09                     29.43
Urban100   3x      24.43     26.24                      29.00                     26.26
Urban100   4x      23.11     24.52                      26.82                     24.56
Table 1: The average PSNR values obtained for the super-resolution task.
Dataset    Scale   Bicubic   SRCNN [Dong et al.2016a]   RDN [Zhang et al.2018b]   Ours
Set5       2x      0.930     0.9542                     0.9616                    0.9505
Set5       3x      0.869     0.9090                     0.9300                    0.9095
Set5       4x      0.812     0.8628                     0.9003                    0.8691
Set14      2x      0.869     0.9067                     0.9218                    0.899
Set14      3x      0.775     0.8215                     0.8482                    0.8164
Set14      4x      0.703     0.7513                     0.7893                    0.7506
B100       2x      0.843     0.8879                     0.9022                    0.879
B100       3x      0.738     0.7863                     0.8105                    0.778
B100       4x      0.666     0.7101                     0.7434                    0.706
Urban100   2x      0.839     0.8946                     0.9368                    0.891
Urban100   3x      0.733     0.7989                     0.8683                    0.798
Urban100   4x      0.656     0.7221                     0.8069                    0.723
Table 2: The average SSIM values obtained for the super-resolution task.

Target network.

The architecture of the target network is supposed to be simple and small. This keeps the performance of the training phase at the highest possible level, as the target network is not trained directly. Moreover, small networks can easily be reused for other applications.

The target network consists of five fully-connected layers, see Figure 4. The layers' dimensions are gradually increased up to the middle layer and decreased afterwards, because steep transitions between layer dimensions negatively affect the learning ability of a neural network. Additionally, batch normalization is applied between consecutive layers [Ioffe and Szegedy2015]. Between two consecutive layers we chose an activation function [Goodfellow et al.2016] which worked much better than ReLU for our purpose; the activation function of the last layer is a sigmoid. Since the size of the hypernetwork's output depends on the number of trainable parameters in the target network, we used residual connections in the target network. No convolutions were used, because the input of the target network is too simple.
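A sketch of a target network in the spirit of this description follows: five fully-connected layers whose widths grow towards the middle layer and shrink afterwards, batch normalization, one residual connection, and a sigmoid output. The widths and the tanh activation are hypothetical stand-ins, and in the actual mechanism the weights are produced by the hypernetwork rather than owned by the module:

```python
import torch
import torch.nn as nn

class TargetNet(nn.Module):
    def __init__(self, widths=(2, 32, 64, 32, 16, 3)):
        super().__init__()
        self.fc = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(widths[:-1], widths[1:]))
        self.bn = nn.ModuleList(nn.BatchNorm1d(w) for w in widths[1:-1])
        self.act = nn.Tanh()  # hypothetical smooth activation in place of ReLU

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        h1 = self.act(self.bn[0](self.fc[0](xy)))       # 2 -> 32
        h2 = self.act(self.bn[1](self.fc[1](h1)))       # 32 -> 64 (widest, middle)
        h3 = self.act(self.bn[2](self.fc[2](h2)) + h1)  # 64 -> 32, residual from h1
        h4 = self.act(self.bn[3](self.fc[3](h3)))       # 32 -> 16
        return torch.sigmoid(self.fc[4](h4))            # RGB intensities in [0, 1]
```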

Hypernetwork.

The hypernetwork is a convolutional neural network with some modifications, see Figure 5. We created an eight-layer network with one residual connection. To reduce the number of trainable parameters, we adapted an approach used in the inception network [Szegedy et al.2015]: instead of a single k×k convolution, we used a k×1 convolution followed by a 1×k convolution. To generate the weights of each layer of the target network, we designed the following process. The first few layers of the hypernetwork are shared and take part in generating the weights of every layer of the target network. Then the network splits into separate branches, one for each layer of the target network. The purpose of the initial layers is to extract features from the input image; the following layers compute the weights of the target network from these features. This process led to faster training than creating a separate hypernetwork for each layer of the target network. ReLU was selected as the activation function for every layer of the hypernetwork, and batch normalization is applied after each layer.
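Both ingredients, the k×1 + 1×k factorization and the shared trunk splitting into per-layer branches, can be sketched as follows. All channel counts are hypothetical; the branch output sizes match the weight counts of the five linear layers of the `TargetNet` sketch above (batch-normalization parameters omitted for brevity):

```python
import torch
import torch.nn as nn

def factorized_conv(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """A k x 1 convolution followed by a 1 x k one, replacing a full k x k kernel."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, (k, 1), padding=(k // 2, 0)),
        nn.Conv2d(c_out, c_out, (1, k), padding=(0, k // 2)),
    )

class BranchedHyperNet(nn.Module):
    def __init__(self, layer_sizes=(96, 2112, 2080, 528, 51)):
        super().__init__()
        self.trunk = nn.Sequential(                  # shared feature extractor
            factorized_conv(3, 16), nn.BatchNorm2d(16), nn.ReLU(),
            factorized_conv(16, 32), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),   # (B, 32 * 4 * 4)
        )
        self.branches = nn.ModuleList(               # one head per target layer
            nn.Linear(32 * 4 * 4, s) for s in layer_sizes)

    def forward(self, image: torch.Tensor):
        feats = self.trunk(image)                    # image: (B, 3, H, W)
        return [branch(feats) for branch in self.branches]
```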

Figure 6: Pixel-wise interpolation on CelebA.
Figure 7: Layer-wise interpolation on CelebA. In the i-th row we interpolate only over the first i layers of the network. Each layer may be understood as having a different functionality, i.e. the third layer is responsible for the general shape of the image, while the last layer corrects the colors.
Figure 8: Modeling 2D shapes (probability distributions) by a transformation of a circle.

3.2 Super-resolution

Since the target network provides a functional, continuous representation of the input image, we can upscale the image to any size and, in consequence, apply our approach to super-resolution.

To make this approach successful, we feed the hypernetwork with low-resolution images and evaluate the MSE loss on high-resolution ones. More precisely, we take an original image, downscale it by the given scale factor using bicubic interpolation, and input the result to the hypernetwork. The hypernetwork produces the weights of a target network, which defines the functional representation of the input image. To evaluate its quality, we query the target network on a grid matching the resolution of the original image and compare the returned values with the pixel intensities of the original image using the MSE loss.

Since input images can have different resolutions, we split them into overlapping patches of a fixed size. In consequence, the value at each coordinate is described by multiple target networks. To produce a single output for every coordinate at test time, we take a (weighted) average of the values returned by all target networks covering this coordinate, which also smooths the output function.
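The blending itself is a weighted mean; the exact weighting scheme is not essential, so the one suggested in the sketch below (favoring patches whose centers lie closer to the queried coordinate) is only an assumption:

```python
import torch

def blend(preds: list, weights: list) -> torch.Tensor:
    """preds: per-patch RGB predictions for one coordinate, each a (3,) tensor;
    weights: e.g. inverse distances from the coordinate to the patch centers."""
    w = torch.tensor(weights, dtype=torch.float32)
    stacked = torch.stack(preds)                    # (num_patches, 3)
    return (w.unsqueeze(1) * stacked).sum(0) / w.sum()
```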

To test our approach, we trained the model on examples from the DIV2K data set [Agustsson and Timofte2017]. Its performance was evaluated on Set5 [Bevilacqua et al.2012], Set14 [Zeyde et al.2010], B100 [Martin et al.2001], and Urban100 [Huang et al.2015]. As quality measures we used PSNR and SSIM [Wang et al.2004], two measures typically applied to super-resolution tasks; higher values indicate better performance. We considered scale factors 2, 3 and 4.

As a baseline, we used bicubic interpolation. Moreover, we compared our approach with SRCNN [Dong et al.2016b], a state-of-the-art method in 2016, and with a recent state-of-the-art model, RDN [Zhang et al.2018c]. Our goal is to train a single hypernetwork model that generates images at various scales. This is a more general solution than typical super-resolution approaches, where each model is responsible for upscaling the image to only one resolution. In consequence, both SRCNN and RDN are expected to perform better than our method. We trained our network once on images downscaled 2, 3 and 4 times. Moreover, we supplied the target network with an additional input parameter indicating the scale factor.

The results presented in Tables 1 and 2 demonstrate that our model performs significantly better than the bicubic interpolation baseline (see also Figure 1 for a sample result). Surprisingly, a single hypernetwork trained on all scales achieved performance comparable to SRCNN, which uses a separate model for each scale factor. This shows the high potential of our model as a method of image representation. Nevertheless, it was not able to match the scores of the recent state-of-the-art in super-resolution, which may be caused by an insufficient hypernetwork architecture. In our opinion, designing a hypernetwork architecture similar to that of RDN should lead to comparable performance. The main advantage of our approach is its generality: we trained a single model for all scale factors.

3.3 Target networks geometry

It is believed that high-dimensional data, e.g. images, are embedded in low-dimensional manifolds [Goodfellow et al.2016]. In consequence, direct linear interpolation between images does not produce natural-looking pictures.

In this experiment, we inspect the space of target network weights. In particular, we verify whether linear interpolation between the weights of two target networks produces true images. For this purpose, we train the hypernetwork model presented in the previous subsection on images at a single scale. In other words, we are not interested in rescaling images, but only in creating their functional representation. We use the CelebA data set [Liu et al.2015] with a trivial preprocessing: cropping the central 128x128 pixels of each image and resizing them to 64x64 pixels.

At test time, we generate target networks for two images and linearly interpolate between the weights of these networks. Figure 3 presents the images returned by the interpolated target networks. It is evident that the interpolation produces images from the true data distribution. This means that we transformed the manifold of images into a more compact structure (a set of weights) in which linear operations can be applied, and it suggests that similar images have similar target network weights. For comparison, we generated a classical pixel-wise interpolation between analogous examples. As can be seen in Figure 6, the results are much worse, because the interpolation yields a superimposition of the images.
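The interpolation itself is just a convex combination of the two flat weight vectors returned by the hypernetwork; rendering each intermediate vector with the (hypothetical) `target_forward` helper from Section 2 yields frames like those in Figure 3:

```python
import torch

def interpolate_weights(theta_a: torch.Tensor, theta_b: torch.Tensor,
                        steps: int = 8) -> list:
    """Linear path (1 - t) * theta_a + t * theta_b between two target networks."""
    return [(1 - t) * theta_a + t * theta_b
            for t in torch.linspace(0, 1, steps).tolist()]
```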

Figure 9: Interpolation in the target weight space for the object modeling experiment.

Going further, we verified a layer-wise interpolation. Namely, we took the weights of one target network and gradually changed the weights of its first layers towards the corresponding weights of a second target network. As can be seen in Figure 7, each layer may be understood as having a different functionality, i.e. the third layer is responsible for the general shape, while the last layer corrects the colors in the image. In future work, it may be interesting to use the hypernetwork mechanism to obtain disentangled representations.

3.4 Generating models of objects

Generative models give the opportunity to create new examples from an underlying data set. We demonstrate that a hypernetwork can be trained to return an individual generative model for a new data set at test time.

As a proof of concept, we tested our method on 2-dimensional data sets. We used the MNIST database [LeCun et al.1998] and interpreted each image as an individual probability distribution on 2D space. Namely, the probability of generating an example (x, y) from a given image is proportional to the brightness of the pixel at (x, y). We would like to create an individual generative model for every image (understood as a 2D data set). For this purpose, we need to find a transformation T_θ of some simple probability distribution which reproduces a given image. In this example, we look for a transformation of the uniform distribution on the unit circle in 2D space. Intuitively, we would like to transform the circle into the shape of a given image. In every batch, we take a sample from the circle, transform it by T_θ and use kernel density estimation to model the resulting distribution q. The similarity between q and the empirical distribution is verified by computing the cross-entropy, as described in Section 2.2.
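The outline tracing can be sketched as follows; `target_net` stands for a target network already produced by the hypernetwork for one MNIST image:

```python
import math
import torch

def trace_outline(target_net, steps: int = 500) -> torch.Tensor:
    """Iterates over the unit circle in small steps and maps each point
    through the target network, yielding points along the digit's shape."""
    angles = torch.linspace(0, 2 * math.pi, steps)
    circle = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)
    with torch.no_grad():
        return target_net(circle)  # (steps, 2) points in the plane
```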

Sample results are presented in Figure 8. Blue points show the values produced by the target network, while the line is traced between subsequent points mapped from the circle. As can be seen, using the target network we obtain an outline of a given digit when we iterate over the circle in small steps. Despite the simplicity of this experiment, its extensions may find applications in object modeling. For example, this approach could be used in 3D printing, where finding intermediary steps between sampled points is a crucial task [Zarzar Gandler2017]. Since it is essential that the running time of such algorithms be low [Morse et al.2005], methods that work in constant time, such as a neural network, may be beneficial. In the future, we plan to use reversible generative networks [Kingma and Dhariwal2018] to remove the redundant loops visible in the traced image of the circle.

Analogously to the previous experiment, we also examined the interpolation between two generative models (the weights of two target networks). The sample results shown in Figure 9 demonstrate that the obtained shapes change gradually. This means that target networks with similar weights produce similar images. This effect could not be achieved if we trained a separate network for every image.

4 Conclusion

We presented an extension of the hypernetwork mechanism which allows creating target networks for various purposes using a single hypernetwork model. This approach was applied to creating functional image representations and to constructing generative models for new data sets at test time. Due to the continuity of the representation, we were able to upscale an image to any resolution. We also observed that the constructed hypernetwork transforms the manifold of images into a more compact space in which linear interpolation between images can be performed. Namely, we can traverse linearly from one image to another without falling out of the true data distribution. Our experiments suggest that hypernetworks can be used to produce generative models. In the future, we plan to use our approach to create generative models for higher-dimensional data.

References