Π-nets: Deep Polynomial Neural Networks

03/08/2020 · by Grigorios G. Chrysos, et al.

Deep Convolutional Neural Networks (DCNNs) are currently the method of choice for both generative and discriminative learning in computer vision and machine learning. The success of DCNNs can be attributed to the careful selection of their building blocks (e.g., residual blocks, rectifiers, sophisticated normalization schemes, to mention but a few). In this paper, we propose Π-Nets, a new class of DCNNs. Π-Nets are polynomial neural networks, i.e., the output is a high-order polynomial of the input. Π-Nets can be implemented using a special kind of skip connections, and their parameters can be represented via high-order tensors. We empirically demonstrate that Π-Nets have better representation power than standard DCNNs, and they even produce good results without the use of non-linear activation functions in a large battery of tasks and signals, i.e., images, graphs, and audio. When used in conjunction with activation functions, Π-Nets produce state-of-the-art results in challenging tasks, such as image generation. Lastly, our framework elucidates why recent generative models, such as StyleGAN, improve upon their predecessors, e.g., ProGAN.

1 Introduction

Figure 1: In this paper we introduce a class of networks called Π-nets, where the output is a polynomial of the input. The input, $\mathbf{z}$, can be either the latent space of a Generative Adversarial Network (for a generative task) or an image (for a discriminative task). Our polynomial networks can be easily implemented using a special kind of skip connections.

Representation learning via the use of (deep) multi-layered non-linear models has revolutionised the field of computer vision over the past decade [32, 17]. Deep Convolutional Neural Networks (DCNNs) [33, 32] have been the dominant class of models. Typically, a DCNN is a sequence of layers where the output of each layer is fed first to a convolutional operator (i.e., a set of shared weights applied via the convolution operator) and then to a non-linear activation function. Skip connections between various layers allow deeper representations and improve the gradient flow while training the network [17, 54].

In the aforementioned case, if the non-linear activation functions are removed, the output of a DCNN degenerates to a linear function of the input. In this paper, we propose a new class of DCNNs, which we coin Π-nets, where the output is a polynomial function of the input. We design Π-nets for generative tasks (e.g., where the input is a small-dimensional noise vector) as well as for discriminative tasks (e.g., where the input is an image and the output is a vector with dimensions equal to the number of labels). We demonstrate that these networks can produce good results without the use of non-linear activation functions. Furthermore, our extensive experiments show, empirically, that Π-nets can consistently improve the performance, in both generative and discriminative tasks, using, in many cases, significantly fewer parameters.

DCNNs have been used in computer vision for over 30 years [33, 50]. Arguably, what brought DCNNs back into mainstream research was the remarkable results achieved by the so-called AlexNet in the ImageNet challenge [32]. Even though only seven years have passed since this pioneering effort, the field has witnessed dramatic improvements in all data-dependent tasks, such as object detection [21] and image generation [37, 15], to name a few examples. The improvement is mainly attributed to carefully selected units in the architectural pipeline of DCNNs, such as blocks with skip connections [17] and sophisticated normalization schemes (e.g., batch normalisation [23]), as well as the use of efficient gradient-based optimization techniques [28].

Parallel to the development of DCNN architectures for discriminative tasks, such as classification, the notion of Generative Adversarial Networks (GANs) was introduced for training generative models. GANs instantly became a popular line of research, but it was only after the careful design of DCNN pipelines and training strategies that GANs were able to produce realistic images [26, 2]. ProGAN [25] was the first architecture to synthesize realistic facial images with a DCNN. StyleGAN [26] is a follow-up work that improved upon ProGAN. The main addition of StyleGAN was a type of skip connection, called AdaIN [22], which allows the latent representation to be infused into all layers of the generator. Similar infusions were introduced in [42] for conditional image generation.

Our work is motivated by the improvement of StyleGAN over ProGAN through such a simple infusion layer and the need to provide an explanation. (The authors argued that this infusion layer is a kind of style that allows a coarser-to-finer manipulation of the generation process; we instead attribute the improvement to gradually increasing the power of the polynomial.) We show that such infusion layers create a special non-linear structure, i.e., a higher-order polynomial, which empirically improves the representation power of DCNNs. We show that this infusion layer can be generalized (e.g., see Fig. 1) and applied in various ways in both generative and discriminative architectures. In particular, the paper makes the following contributions:

  • We propose a new family of neural networks (called Π-nets) where the output is a high-order polynomial of the input. To avoid the combinatorial explosion in the number of parameters of polynomial activation functions [27], our Π-nets use a special kind of skip connections to implement the polynomial expansion (please see Fig. 1 for a brief schematic representation). We theoretically demonstrate that such skip connections relate to special forms of tensor decompositions.

  • We show how the proposed architectures can be applied in generative models such as GANs, as well as discriminative networks. We showcase that the resulting architectures can be used to learn high-dimensional distributions without non-linear activation functions.

  • We convert state-of-the-art baselines into the proposed Π-nets and show how this can largely improve the expressivity of the baseline. We demonstrate it conclusively in a battery of tasks (i.e., generation and classification). Finally, we demonstrate that our architectures are applicable to many different signals, such as images, meshes, and audio.

2 Related work

Expressivity of (deep) neural networks: Over the last few years, (deep) neural networks have been applied to a wide range of applications with impressive results. The performance boost can be attributed to a host of factors, including: a) the availability of massive datasets [4, 35], b) machine learning libraries [57, 43] running on massively parallel hardware, and c) training improvements. The training improvements include a) better optimizers [28, 46], b) augmented network capacity [53], and c) regularization tricks [11, 49, 23, 58]. However, the paradigm for each layer has remained largely unchanged for several decades: each layer is composed of a linear transformation and an element-wise activation function. Despite the variety of linear transformations [9, 33, 32] and activation functions [44, 39] being used, the effort to extend this paradigm has not drawn much attention to date.

Recently, hierarchical models have exhibited stellar performance in learning expressive generative models [2, 26, 70]. For instance, the recent BigGAN [2] performs a hierarchical composition through skip connections from the noise to multiple resolutions of the generator. A similar idea emerged in StyleGAN [26], which is an improvement over the Progressive Growing of GANs (ProGAN) [25]. Like ProGAN, StyleGAN is a highly-engineered network that achieves compelling results on synthesized 2D images. To explain the improvements of StyleGAN over ProGAN, the authors adopt arguments from the style-transfer literature [22]. We believe these improvements can be better explained in the light of our proposed polynomial function approximation. Beyond the hierarchical composition proposed in these works, we present an intuitive and mathematically elaborate method to achieve a more precise approximation with a polynomial expansion. We also demonstrate that such a polynomial expansion can be used in image generation (as in [26, 2]), image classification, and graph representation learning.

Polynomial networks: Polynomial relationships have been investigated in two specific categories of networks: a) self-organizing networks with hard-coded feature selection, and b) pi-sigma networks.

The idea of learnable polynomial features can be traced back to the Group Method of Data Handling (GMDH) [24], often referred to as the first deep neural network [50]. GMDH learns partial descriptors that capture quadratic correlations between two predefined input elements. In [41], more input elements are allowed and higher-order polynomials are used. However, the input to each partial descriptor is predefined (a subset of the input elements), which does not allow the method to scale to high-dimensional data with complex correlations.

Shin et al. [51] introduce the pi-sigma network, a neural network with a single hidden layer. Multiple affine transformations of the data are learned, and a product unit multiplies all the features to obtain the output. Improvements of the pi-sigma network include regularization for training [66] and the use of multiple product units to obtain the output [61]. The pi-sigma network is extended into the sigma-pi-sigma neural network (SPSNN) [34], whose idea relies on summing different pi-sigma networks to obtain each output. SPSNN also uses a predefined basis (overlapping rectangular pulses) in each pi-sigma sub-network to filter the input features. Even though such networks use polynomial features or products, they do not scale well to high-dimensional signals. In addition, their experimental evaluation has been conducted only on signals with known ground-truth distributions (and with up to 3-dimensional input/output), unlike modern generative models, where only a finite number of samples from a high-dimensional ground-truth distribution is available.
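
To make the pi-sigma construction concrete, a single pi-sigma unit with $K$ affine ("sigma") units computes the product of their responses, yielding a degree-$K$ polynomial of the input. A minimal sketch (our own illustration, not code from [51]):

```python
import numpy as np

def pi_sigma(z, W, b):
    """Pi-sigma unit: product of K affine transformations of the input z.

    z: (d,) input vector
    W: (K, d) weights of the K affine ("sigma") units
    b: (K,) biases
    Returns the scalar output of the product ("pi") unit.
    """
    affine = W @ z + b          # K affine responses
    return np.prod(affine)      # product unit -> degree-K polynomial in z

rng = np.random.default_rng(0)
d, K = 4, 3
y = pi_sigma(rng.normal(size=d), rng.normal(size=(K, d)), rng.normal(size=K))
```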

3 Method

Symbol | Dimension(s) | Definition
$n$, $N$ | $\mathbb{N}$ | Polynomial term order, total approximation order.
$k$ | $\mathbb{N}$ | Rank of the decompositions.
$\mathbf{z}$ | $\mathbb{R}^d$ | Input to the polynomial approximator, i.e., generator.
$\mathbf{C}$, $\boldsymbol{\beta}$ | $\mathbb{R}^{o \times k}$, $\mathbb{R}^{o}$ | Parameters in both decompositions.
$\mathbf{A}_{[n]}$, $\mathbf{S}_{[n]}$, $\mathbf{B}_{[n]}$, $\mathbf{b}_{[n]}$ | $\mathbb{R}^{d \times k}$, $\mathbb{R}^{k \times k}$, $\mathbb{R}^{\omega \times k}$, $\mathbb{R}^{\omega}$ | Matrix parameters in the hierarchical decomposition.
$\odot$, $*$ | – | Khatri-Rao product, Hadamard product.
Table 1: Nomenclature

Notation: Tensors are symbolized by calligraphic letters, e.g., $\mathcal{X}$, while matrices (vectors) are denoted by uppercase (lowercase) boldface letters, e.g., $\mathbf{X}$ ($\mathbf{x}$). The mode-$m$ vector product of a tensor $\mathcal{X}$ with a vector $\mathbf{u}$ is denoted by $\mathcal{X} \times_m \mathbf{u}$. (A detailed tensor notation is provided in the supplementary.)
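
As a quick illustration of the mode-$m$ vector product used below, contracting a third-order tensor with a vector along mode 2 removes that mode (a toy numpy sketch under our own naming):

```python
import numpy as np

# Mode-m vector product: contract tensor W with vector u along axis m-1.
W = np.arange(24, dtype=float).reshape(2, 3, 4)   # W in R^{2x3x4}
u = np.array([1.0, -1.0, 2.0])                    # u in R^3

W_x2_u = np.tensordot(W, u, axes=([1], [0]))      # W ×_2 u
assert W_x2_u.shape == (2, 4)                     # mode 2 has been contracted
```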

We want to learn a function approximator $G: \mathbb{R}^d \to \mathbb{R}^o$, where each element $x_j$ of the output $\mathbf{x} = G(\mathbf{z})$, with $j \in [o]$, is expressed as a polynomial of all the input elements $z_i$, with $i \in [d]$. (The Stone-Weierstrass theorem [55] guarantees that any smooth function can be approximated by a polynomial; the approximation of multivariate functions is covered by an extension of the Weierstrass theorem, e.g., in [40], pg. 19.) That is, we want to learn a function $G$ of order $N \in \mathbb{N}$, such that:

$$x_j = G_j(\mathbf{z}) = \beta_j + \mathbf{w}_j^{[1]\,T} \mathbf{z} + \mathbf{z}^T \mathbf{W}_j^{[2]} \mathbf{z} + \mathcal{W}_j^{[3]} \times_1 \mathbf{z} \times_2 \mathbf{z} \times_3 \mathbf{z} + \cdots \quad (1)$$

where $\beta_j \in \mathbb{R}$ and $\{\mathcal{W}_j^{[n]} \in \mathbb{R}^{\prod_{m=1}^{n} \times_m d}\}_{n=1}^{N}$ are parameters for approximating the output $x_j$. The correlations (of the input elements $z_i$) up to $N$-th order emerge in (1). A more compact expression of (1) is obtained by vectorizing the outputs:

$$\mathbf{x} = G(\mathbf{z}) = \sum_{n=1}^{N} \left( \mathcal{W}^{[n]} \prod_{j=2}^{n+1} \times_j \mathbf{z} \right) + \boldsymbol{\beta} \quad (2)$$

where $\boldsymbol{\beta} \in \mathbb{R}^{o}$ and $\{\mathcal{W}^{[n]} \in \mathbb{R}^{o \times \prod_{m=1}^{n} \times_m d}\}_{n=1}^{N}$ are the learnable parameters. This form of (2) allows us to approximate any smooth function (for large $N$); however, the number of parameters grows as $\mathcal{O}(d^N)$.
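
To make the growth concrete, the sketch below evaluates a second-order ($N = 2$) instance of (2) with explicit coefficient tensors; the $n$-th order term alone carries $o \cdot d^n$ parameters, which motivates the decompositions that follow (illustrative code with our own variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
d, o = 8, 4                       # input / output dimensions

beta = rng.normal(size=o)         # bias
W1 = rng.normal(size=(o, d))      # first-order tensor  (o * d    params)
W2 = rng.normal(size=(o, d, d))   # second-order tensor (o * d**2 params)

z = rng.normal(size=d)
# x = beta + W1 ×_2 z + W2 ×_2 z ×_3 z   (order N = 2 instance of (2))
x = beta + W1 @ z + np.einsum('oij,i,j->o', W2, z, z)

# The N-th order term alone has o * d**N parameters: exponential in N.
```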

A variety of methods, such as pruning [8, 16], tensor decompositions [29, 52], special linear operators with reduced parameters [6], and parameter sharing/prediction [67, 5], can be employed to reduce the parameters. In contrast to the heuristic approaches of pruning or prediction, we describe below two principled ways which allow an efficient implementation. The first method relies on performing an off-the-shelf tensor decomposition on (2), while the second considers the final polynomial as a product of lower-degree polynomials.

The tensor decompositions are used in this paper to provide a theoretical understanding (i.e., of the order of the polynomial being used) of the proposed family of Π-nets. Implementation-wise, incorporating the different Π-net structures is as simple as adding a skip connection. Nevertheless, in Π-nets, different skip connections lead to different kinds of polynomial networks.

3.1 Single polynomial

A tensor decomposition on the parameters is a natural way to reduce the parameters and to implement (2) with a neural network. Below, we demonstrate how three such decompositions result in novel architectures for neural network training. The main symbols are summarized in Table 1, while the equivalence between each recursive relationship and the polynomial is analyzed in the supplementary.

Model 1: CCP: A coupled CP decomposition [29] is applied on the parameter tensors. That is, each parameter tensor, i.e., $\mathcal{W}^{[n]}$ for $n = 1, \dots, N$, is not factorized individually; rather, a coupled factorization of all the parameters is defined. The recursive relationship is:

$$\mathbf{x}_n = \left( \mathbf{U}_{[n]}^T \mathbf{z} \right) * \mathbf{x}_{n-1} + \mathbf{x}_{n-1} \quad (3)$$

for $n = 2, \dots, N$, with $\mathbf{x}_1 = \mathbf{U}_{[1]}^T \mathbf{z}$ and $\mathbf{x} = \mathbf{C} \mathbf{x}_N + \boldsymbol{\beta}$. The parameters $\mathbf{C} \in \mathbb{R}^{o \times k}$ and $\mathbf{U}_{[n]} \in \mathbb{R}^{d \times k}$ for $n = 1, \dots, N$ are learnable. To avoid overloading the diagram, a schematic assuming a third-order expansion ($N = 3$) is illustrated in Fig. 2.
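
Assuming the recursion above, a CCP polynomial of order $N$ can be sketched in a few lines of numpy (our own minimal illustration, not the paper's implementation; in practice the $\mathbf{U}_{[n]}$ would be convolutional or FC layers):

```python
import numpy as np

def ccp(z, U, C, beta):
    """CCP polynomial of order N = len(U), following the recursion in (3).

    z: (d,) input; U: list of N factor matrices of shape (d, k);
    C: (o, k) output matrix; beta: (o,) bias.
    """
    x = U[0].T @ z                      # x_1 = U_[1]^T z
    for U_n in U[1:]:
        x = (U_n.T @ z) * x + x         # x_n = (U_[n]^T z) * x_{n-1} + x_{n-1}
    return C @ x + beta                 # x = C x_N + beta

rng = np.random.default_rng(0)
d, k, o, N = 8, 16, 4, 3
U = [rng.normal(size=(d, k)) for _ in range(N)]
out = ccp(rng.normal(size=d), U, rng.normal(size=(o, k)), rng.normal(size=o))
assert out.shape == (o,)                # a third-order polynomial of z
```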

Figure 2: Schematic illustration of the CCP (for a third-order approximation). The symbol $*$ refers to the Hadamard product.

Model 2: NCP: Instead of defining a flat CP decomposition, we can utilize a joint hierarchical decomposition on the polynomial parameters. A nested coupled CP decomposition (NCP) is defined, which results in the following recursive relationship for an $N$-th order approximation:

$$\mathbf{x}_n = \left( \mathbf{A}_{[n]}^T \mathbf{z} \right) * \left( \mathbf{S}_{[n]}^T \mathbf{x}_{n-1} + \mathbf{B}_{[n]}^T \mathbf{b}_{[n]} \right) \quad (4)$$

for $n = 2, \dots, N$, with $\mathbf{x}_1 = \left( \mathbf{A}_{[1]}^T \mathbf{z} \right) * \left( \mathbf{B}_{[1]}^T \mathbf{b}_{[1]} \right)$ and $\mathbf{x} = \mathbf{C} \mathbf{x}_N + \boldsymbol{\beta}$. The parameters $\mathbf{A}_{[n]}, \mathbf{S}_{[n]}, \mathbf{B}_{[n]}, \mathbf{b}_{[n]}$ for $n = 1, \dots, N$, along with $\mathbf{C}$ and $\boldsymbol{\beta}$, are learnable. The explanation of each variable is elaborated in the supplementary, where the decomposition is derived.
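
A minimal sketch of the NCP recursion, mirroring (4) under the shapes of Table 1 (our own illustration; $\omega$ is written as `w` and the random matrices stand in for learnable layers):

```python
import numpy as np

def ncp(z, A, S, B, b, C, beta):
    """NCP polynomial of order N = len(A), following the recursion in (4).

    A[n]: (d, k), S[n]: (k, k), B[n]: (w, k), b[n]: (w,) for each order n;
    S[0] is unused because x_1 has no previous-order term.
    """
    x = (A[0].T @ z) * (B[0].T @ b[0])                     # x_1
    for n in range(1, len(A)):
        x = (A[n].T @ z) * (S[n].T @ x + B[n].T @ b[n])    # x_n
    return C @ x + beta                                    # x = C x_N + beta

rng = np.random.default_rng(0)
d, k, w, o, N = 8, 16, 16, 4, 3
A = [rng.normal(size=(d, k)) for _ in range(N)]
S = [rng.normal(size=(k, k)) for _ in range(N)]
B = [rng.normal(size=(w, k)) for _ in range(N)]
b = [rng.normal(size=w) for _ in range(N)]
out = ncp(rng.normal(size=d), A, S, B, b,
          rng.normal(size=(o, k)), rng.normal(size=o))
```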

Figure 3: Schematic illustration of the NCP (for a third-order approximation). The symbol $*$ refers to the Hadamard product.

Model 3: NCP-Skip: The expressiveness of NCP can be further extended using a skip connection (motivated by CCP). The new model uses a nested coupled decomposition and has the following recursive expression:

$$\mathbf{x}_n = \left( \mathbf{A}_{[n]}^T \mathbf{z} \right) * \left( \mathbf{S}_{[n]}^T \mathbf{x}_{n-1} + \mathbf{B}_{[n]}^T \mathbf{b}_{[n]} \right) + \mathbf{x}_{n-1} \quad (5)$$

for $n = 2, \dots, N$, with $\mathbf{x}_1$ and $\mathbf{x}$ defined as in NCP. The learnable parameters are the same as in NCP; however, the difference in the recursive form results in a different polynomial expansion and thus a different architecture.
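
The corresponding sketch differs from the NCP one above by a single added skip term (same parameter shapes as in the NCP sketch):

```python
import numpy as np

def ncp_skip(z, A, S, B, b, C, beta):
    """NCP-Skip: identical parameters to NCP, but the recursion in (5)
    adds the previous-order term back via a skip connection."""
    x = (A[0].T @ z) * (B[0].T @ b[0])
    for n in range(1, len(A)):
        x = (A[n].T @ z) * (S[n].T @ x + B[n].T @ b[n]) + x   # "+ x" is the skip
    return C @ x + beta
```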

Figure 4: Schematic illustration of the NCP-Skip (for a third-order approximation). The difference from Fig. 3 is the skip connections added in this model.

Comparison between the models: All three models are based on a polynomial expansion, but their recursive forms and the employed decompositions differ. CCP has a simpler expression, whereas NCP and NCP-Skip relate to standard architectures using hierarchical composition, which have recently yielded promising results in both generative and discriminative tasks. In the remainder of the paper, for comparison purposes, we use NCP by default for image generation and NCP-Skip for image classification. In our preliminary experiments, CCP and NCP share a similar performance in the setting of Sec. 4. In all cases, to mitigate stability issues that might emerge during training, we employ certain normalization schemes that constrain the magnitude of the gradients. An in-depth theoretical analysis of the architectures is deferred to a future version of our work.

3.2 Product of polynomials

Instead of using a single polynomial, we express the function approximation as a product of polynomials. The product is implemented as a sequence of polynomials, where the output of each polynomial is used as the input to the next one. The concept is visually depicted in Fig. 5; there, each polynomial expresses a second-order expansion, so stacking $N$ such polynomials results in an overall order of $2^N$. More generally, if each polynomial has order $B$ and we stack $N$ such polynomials, the total order is $B^N$. The product does not necessarily demand the same order in each polynomial; the expressivity and the expansion order of each polynomial can differ and depend on the task, e.g., for generative tasks where the resolution increases progressively, the expansion order could increase in the last polynomials. In all cases, the final order is the product of the orders of the individual polynomials.

There are two main benefits of the product over the single polynomial: a) it allows using a different decomposition (e.g., from Sec. 3.1) and a different expressive power for each polynomial; b) it requires far fewer parameters to achieve the same order of approximation. Given these benefits, we assume below that a product of polynomials is used, unless explicitly mentioned otherwise. The respective model of product polynomials is called ProdPoly.
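
A minimal sketch of ProdPoly as a chain of CCP blocks (our own illustration): chaining $M$ blocks of order $B$ yields one polynomial of total order $B^M$.

```python
import numpy as np

def make_ccp_block(U, C, beta):
    """Returns a callable CCP polynomial z -> C x_N + beta (cf. Sec. 3.1)."""
    def block(z):
        x = U[0].T @ z
        for U_n in U[1:]:
            x = (U_n.T @ z) * x + x
        return C @ x + beta
    return block

def prod_poly(z, blocks):
    """Chain the polynomials: each block's output feeds the next block.
    M blocks of order B compose into one polynomial of total order B**M."""
    x = z
    for block in blocks:
        x = block(x)
    return x

rng = np.random.default_rng(0)
d = 8  # keep input/output sizes equal so the blocks compose
blocks = [make_ccp_block([rng.normal(size=(d, d)) for _ in range(2)],
                         rng.normal(size=(d, d)), rng.normal(size=d))
          for _ in range(3)]
y = prod_poly(rng.normal(size=d), blocks)   # 3 blocks of order 2 -> order 8
```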

Figure 5: Abstract illustration of the ProdPoly. The input variable $\mathbf{z}$ on the left is the input to a second-order expansion; the output of this is used as the input to the next polynomial (also with a second-order expansion) and so on. If we use $N$ such polynomials, the final output expresses a $2^N$-order expansion. In addition to the high order of approximation, the benefit of using the product of polynomials is that the model is flexible, in the sense that each polynomial can be implemented as a different decomposition of Sec. 3.1.

3.3 Task-dependent input/output

The aforementioned polynomials define a function $G: \mathbb{R}^d \to \mathbb{R}^o$, where the input/output are task-dependent. For a generative task, e.g., learning a decoder, the input is typically a low-dimensional noise vector, while the output is a high-dimensional signal, e.g., an image. For a discriminative task the input is an image; for a domain adaptation task the input signal $\mathbf{z}$ denotes the source domain and the output $\mathbf{x}$ the target domain.

4 Proof of concept

In this Section, we conduct motivational experiments in both generative and discriminative tasks to demonstrate the expressivity of Π-nets. Specifically, the networks are implemented without activation functions, i.e., only linear operations (e.g., convolutions) and Hadamard products are used. In this setting, the output is linear or multi-linear with respect to the parameters.

4.1 Linear generation

One of the most popular generative models is the Generative Adversarial Net (GAN) [12]. We design a GAN where the generator is implemented as a product of polynomials (using the NCP decomposition), while the discriminator of [37] is used. No activation functions are used in the generator, apart from a single hyperbolic tangent ($\tanh$) in the image space (additional details are deferred to the supplementary material).

Two experiments are conducted with a polynomial generator (on Fashion-MNIST and YaleB). We perform a linear interpolation in the latent space when trained on Fashion-MNIST [64] and on YaleB [10] and visualize the results in Figs. 6 and 7, respectively. Note that the linear interpolation generates plausible images and navigates among different categories, e.g., trousers to sneakers or trousers to t-shirts. Equivalently, it can linearly traverse the latent space from a fully illuminated to a partly dark face.

Figure 6: Linear interpolation in the latent space of ProdPoly (when trained on fashion images [64]). Note that the generator does not include any activation functions in between the linear blocks (Sec. 4.1). All the images are synthesized; the image on the leftmost column is the source, while the one in the rightmost is the target synthesized image.
Figure 7: Linear interpolation in the latent space of ProdPoly (when trained on facial images [10]). As in Fig. 6, the generator includes only linear blocks; the image on the leftmost column is the source, while the one in the rightmost is the target image.

4.2 Linear classification

To empirically illustrate the power of the polynomial, we use a ResNet without activations for classification. The Residual Network (ResNet) [17, 54] and its variants [21, 62, 65, 69, 68] have been applied to diverse tasks, including object detection and image generation [14, 15, 37]. The core component of ResNet is the residual block; the $t$-th residual block is expressed as $\mathbf{x}_{t+1} = \mathbf{x}_t + C(\mathbf{x}_t)$ for input $\mathbf{x}_t$, where $C$ denotes the linear transformation of the block.

We modify each residual block to express a higher-order interaction, which can be achieved with NCP-Skip. The output of each residual block is the input to the next residual block, which makes our ResNet a product of polynomials. We conduct a classification experiment on CIFAR10 [31] (10 classes) and CIFAR100 [30] (100 classes). Each residual block is modified in two ways: a) all the activation functions are removed, and b) it is converted into an $N$-th order expansion. The second-order expansion (for the $t$-th residual block) is expressed as $\mathbf{x}_{t+1} = \mathbf{x}_t + C(\mathbf{x}_t) + C(\mathbf{x}_t) * \mathbf{x}_t$; higher orders are constructed similarly by performing a Hadamard product of the last term with $\mathbf{x}_t$ (e.g., for a third-order expansion it would be $\mathbf{x}_{t+1} = \mathbf{x}_t + C(\mathbf{x}_t) + C(\mathbf{x}_t) * \mathbf{x}_t + \left( C(\mathbf{x}_t) * \mathbf{x}_t \right) * \mathbf{x}_t$). The following two variations are evaluated: a) a single residual block is used in each 'group layer', and b) two blocks are used per 'group layer'. The latter variation is equivalent to ResNet18 without activations.
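
A sketch of the modified block, with a plain matrix standing in for the block's convolutional transformation $C$ (our simplification; the actual experiments use convolutional layers):

```python
import numpy as np

def poly_residual_block(x, C, order=2):
    """Activation-free higher-order residual block (sketch of Sec. 4.2).
    order=1 recovers the standard residual block x + C x."""
    out = x + C @ x
    term = C @ x
    for _ in range(order - 1):
        term = term * x          # Hadamard product with the input raises the order
        out = out + term
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=16)
C = rng.normal(size=(16, 16))
y2 = poly_residual_block(x, C, order=2)   # x + Cx + (Cx) * x
```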

Each experiment is conducted multiple times; the mean accuracy is reported in Fig. 8. We note that the same trends emerge in both datasets (we emphasize, though, that the original ResNet was not designed to work without activation functions). The performance remains similar irrespective of the number of residual blocks in the group layer. The performance is affected by the order of the expansion, i.e., higher orders cause a decrease in accuracy. Our conjecture is that this can be partially attributed to overfitting (note that an $N$-th order expansion per block, with 8 residual units in total, yields a polynomial of order $N^8$); we defer a detailed study of this to a future version of our work. Nevertheless, in all cases without activations, the accuracy is close to that of the original ResNet18 with activation functions.

Figure 8: Image classification accuracy with linear residual blocks. The plot on the left reports CIFAR10 classification, while the one on the right reports CIFAR100 classification.

5 Experiments

We conduct three experiments against state-of-the-art models in three diverse tasks: image generation, image classification, and graph representation learning. In each case, the baseline considered is converted into an instance of our family of Π-nets and the two models are compared.

5.1 Image generation

The robustness of ProdPoly in image generation is assessed in two different architectures/datasets below.

SNGAN on CIFAR10: In the first experiment, the architecture of SNGAN [37] is selected as a strong baseline on CIFAR10 [31]. The baseline includes residual blocks in the generator and the discriminator.

The generator is converted into a Π-net, where each residual block corresponds to a single order of the polynomial. We implement two versions: one with a single polynomial (NCP) and one with a product of polynomials (where each polynomial uses NCP). In our implementation, $\mathbf{A}_{[n]}$ is a thin FC layer, $\mathbf{B}_{[n]}^T \mathbf{b}_{[n]}$ is a bias vector, and $\mathbf{S}_{[n]}$ is the transformation of the residual block. Other than the aforementioned modifications, the hyper-parameters (e.g., discriminator, learning rate, optimization details) are kept the same as in [37].

Each network is run 10 times, and the mean and variance are reported. The popular Inception Score (IS) [48] and the Fréchet Inception Distance (FID) [18] are used for quantitative evaluation. Both scores extract feature representations from a pre-trained classifier (the Inception network [56]).

The quantitative results are summarized in Table 2. In addition to SNGAN and our two variations with polynomials, we include the scores of [14, 15, 7, 19, 36] as reported in the respective papers. Note that the single polynomial already outperforms the baseline, while ProdPoly boosts the performance further and achieves a substantial improvement over the original SNGAN.

Image generation on CIFAR10
Model | IS (↑) | FID (↓)
SNGAN | |
NCP (Sec. 3.1) | |
ProdPoly | |
CSGAN [14] | | –
WGAN-GP [15] | | –
CQFG [36] | |
EBM [7] | |
GLANN [19] | | –
Table 2: IS/FID scores on CIFAR10 [31] generation. The scores of [14, 15] are taken from the respective papers, which use similar residual-based generators; the scores of [7, 19, 36] represent alternative generative models. ProdPoly outperforms the compared methods in both metrics.

StyleGAN on FFHQ: StyleGAN [26] is the state-of-the-art architecture in image generation. The generator is composed of two parts, namely: (a) the mapping network, composed of 8 FC layers, and (b) the synthesis network, which is based on ProGAN [25] and progressively learns to synthesize high-quality images. The sampled noise is transformed by the mapping network, and the resulting vector is then used by the synthesis network. As discussed in the introduction, StyleGAN is already an instance of the Π-net family, due to AdaIN. Specifically, the AdaIN layer can be written as $\left( \mathbf{A}_{[n]}^T \mathbf{w} \right) * \mathcal{N}\!\left( \mathbf{S}_{[n]}^T \mathbf{h} \right)$, where $\mathcal{N}$ is a normalization, $\mathbf{S}_{[n]}^T$ is a convolution, and $\mathbf{A}_{[n]}^T \mathbf{w}$ is the transformed noise (from the mapping network). This is equivalent to our NCP model with $\mathbf{S}_{[n]}$ set to the convolution operator.
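
Under this reading, one AdaIN step can be sketched as a single NCP recursion (a toy illustration with random stand-ins; `normalize` plays the role of $\mathcal{N}$ and the matrix `S` stands in for the convolution):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16
h = rng.normal(size=k)            # feature map (flattened, for illustration)
w = rng.normal(size=8)            # transformed noise from the mapping network
A = rng.normal(size=(8, k))       # learnable style projection
S = rng.normal(size=(k, k))       # stands in for the convolution

normalize = lambda t: (t - t.mean()) / (t.std() + 1e-8)

# One AdaIN step read as an NCP recursion: h' = (A^T w) * N(S^T h)
h_next = (A.T @ w) * normalize(S.T @ h)
```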

In this experiment we illustrate how simple modifications, using our family of products of polynomials, can further improve the representation power. We make a minimal modification in the mapping network, while fixing the rest of the hyper-parameters. In particular, we convert the mapping network into a polynomial (specifically an NCP), which makes the generator a product of two polynomials.

The Flickr-Faces-HQ (FFHQ) dataset [26], which includes high-resolution facial images, is used; all images are resized to a fixed resolution for training. Our method improves the best FID score over the original StyleGAN at the same resolution. Synthesized samples of our approach are visualized in Fig. 9.

Figure 9: Samples synthesized from ProdPoly (trained on FFHQ).

5.2 Classification

We perform two experiments on classification: a) audio classification and b) image classification.

Audio classification: The goal of this experiment is twofold: a) to evaluate ResNet on a distribution that differs from that of natural images, and b) to validate whether higher-order blocks make the model more expressive. The core assumption is that we can increase the expressivity of our model or, equivalently, use fewer residual blocks of higher order to achieve performance similar to the baseline.

The performance of ResNet is evaluated on the Speech Commands dataset [63]. Each audio file in the dataset contains a single spoken word with a duration of one second; the distinct words form the classes, and each word has a large number of recordings. Every audio file is converted into a mel-spectrogram.

The baseline is a ResNet34 architecture; we build Prodpoly-ResNet with second-order residual blocks to match the performance of the baseline. The quantitative results are reported in Table 3. The two models share the same accuracy, but Prodpoly-ResNet includes fewer parameters. This result validates our assumption that our model is more expressive, achieving the same performance with fewer parameters.

Speech Commands classification with ResNet
Model | # blocks | # par (M) | Accuracy
ResNet34 | | |
Prodpoly-ResNet | | |
Table 3: Speech classification with ResNet. The accuracy of the compared methods is similar, but Prodpoly-ResNet has fewer parameters. '# par' abbreviates the number of parameters (in millions).
ImageNet classification with ResNet
Model | # Blocks | Top-1 error (%) (↓) | Top-5 error (%) (↓) | Speed (images/s) | Model Size
ResNet50 | | 23.570 | 6.838 | 8.5K | 50.26 MB
Prodpoly-ResNet50 | | 22.875 | 6.358 | 7.5K | 68.81 MB
Table 4: Image classification (ImageNet) with ResNet. 'Speed' refers to the inference speed (images/s) of each method.

Image classification: We perform a large-scale classification experiment on ImageNet [47]. We use float16 instead of float32 to achieve acceleration and reduce GPU memory consumption. To stabilize the training, the second-order term of each residual block is normalized with a hyperbolic tangent unit. SGD with momentum and weight decay is used; the learning rate is decreased in steps at fixed epoch milestones, and models are trained from scratch using a linear warm-up of the learning rate during the first five epochs, following [13]. For larger batch sizes, constrained by GPU memory, we linearly scale the learning rate.

The Top-1 error throughout the training is visualized in Fig. 10, while the validation results are given in Table 4. For a fair comparison, we report the results from our own training for both the original ResNet and Prodpoly-ResNet (the performance of the original ResNet [17] is inferior to the one reported here and in [20]). Prodpoly-ResNet consistently improves the performance with an extremely small increase in computational complexity and model size. Remarkably, Prodpoly-ResNet50 achieves a single-crop Top-5 validation error of 6.358%, exceeding ResNet50 (6.838%) by 0.48%.

Figure 10: Top-1 error on ResNet50 and Prodpoly-ResNet50. Note that Prodpoly-ResNet performs consistently better during the training; the improvement is also reflected in the validation performance.
Model | error (mm) (↓) | speed (ms) (↓)
GAT [59] | 0.732 | 11.04
FeastNet [60] | 0.623 | 6.64
MoNet [38] | 0.583 | 7.59
SpiralGNN [1] | 0.635 | 4.27
ProdPoly (simple) | 0.530 | 4.98
ProdPoly (simple, linear) | 0.529 | 4.79
ProdPoly (full) | 0.476 | 5.30
ProdPoly (full, linear) | 0.474 | 5.14
Table 5: ProdPoly vs first-order graph learnable operators for mesh autoencoding. Note that even without using activation functions, the proposed methods significantly improve upon the state-of-the-art.

5.3 3D Mesh representation learning

Below, we evaluate higher-order correlations in graph-related tasks. We experiment with 3D deformable meshes of fixed topology [45], i.e., the connectivity of the graph remains the same and each different shape is defined as a different signal on the vertices of the graph. As in the previous experiments, we extend a state-of-the-art operator, namely spiral convolutions [1], with the ProdPoly formulation and test our method on the task of autoencoding 3D shapes. We use the existing architecture and hyper-parameters of [1], thus showing that ProdPoly can be used as a plug-and-play operator on top of existing models, turning the aforementioned one into a Spiral Π-Net. Our implementation uses a product of polynomials, where each polynomial is a specific instantiation of (4), with the spiral convolution operator, written in matrix form, playing the role of the linear transformation. (Stability of the optimization is ensured by applying vertex-wise instance normalization on the second-order term.) We use this model (ProdPoly simple) to showcase how to increase the expressivity without adding new blocks to the architecture; it can also be re-interpreted as a learnable polynomial activation function as in [27]. We also report the results of our complete model (ProdPoly full), where each order uses a different spiral convolution.
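
A minimal sketch of the "ProdPoly simple" block under this description, with a generic linear graph operator standing in for the spiral convolution and vertex-wise instance normalization on the second-order term (our own illustration, not the released implementation):

```python
import numpy as np

def second_order_graph_block(X, G, normalize):
    """A 'ProdPoly simple' style block: a first-order graph operator G(X)
    plus a Hadamard second-order term, normalized per vertex.

    X: (V, F) vertex features; G: callable graph operator, e.g. X -> S @ X @ W.
    """
    first = G(X)
    second = normalize(first * X)      # vertex-wise second-order term
    return first + second

# Toy usage with a dense adjacency-like operator standing in for spiral conv:
rng = np.random.default_rng(0)
V, F = 6, 3
S = rng.random(size=(V, V))
W = rng.normal(size=(F, F))
inst_norm = lambda T: (T - T.mean(axis=1, keepdims=True)) / (T.std(axis=1, keepdims=True) + 1e-8)
Y = second_order_graph_block(rng.normal(size=(V, F)), lambda X: S @ X @ W, inst_norm)
```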

In Table 5 we compare the reconstruction error of the autoencoder and the inference time of our method with the baseline spiral convolutions, as well as with the best results reported in [1] for other (more computationally involved; see the inference times in Table 5) learnable graph operators. Interestingly, we manage to outperform all previously introduced models even when discarding the activation functions across the entire network. Thus, expressivity increases without having to increase the depth or the width of the architecture, as is usually done by ML practitioners, and with only a small sacrifice in inference time.

6 Discussion

In this work, we have introduced a new class of DCNNs, called Π-Nets, that perform function approximation using polynomial neural networks. Our Π-Nets can be efficiently implemented via a special kind of skip connections that lead to high-order polynomials, naturally expressed with tensorial factors. The proposed formulation extends the standard compositional paradigm of overlaying linear operations with activation functions. We motivate our method with a sequence of experiments without activation functions that showcase the expressive power of polynomials, and demonstrate that Π-Nets are effective in both discriminative and generative tasks. By trivially modifying state-of-the-art architectures in image generation, image and audio classification, and mesh representation learning, the performance consistently improves. In the future, we aim to explore the link between different decompositions and the resulting architectures and to theoretically analyse their expressive power.

7 Acknowledgements

We are thankful to Nvidia for the hardware donation and Amazon web services for the cloud credits. The work of GC, SM, and GB was partially funded by an Imperial College DTA. The work of JD was partially funded by Imperial President’s PhD Scholarship. The work of SZ was partially funded by the EPSRC Fellowship DEFORM: Large Scale Shape Analysis of Deformable Models of Humans (EP/S010203/1) and a Google Faculty Award. An early version with single polynomials for the generative settings can be found in [3].

References

  • [1] G. Bouritsas, S. Bokhnyak, S. Ploumpis, M. Bronstein, and S. Zafeiriou (2019) Neural 3d morphable models: spiral convolutional networks for 3d shape representation learning and generation. In International Conference on Computer Vision (ICCV), Cited by: §5.3, §5.3, Table 5.
  • [2] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • [3] G. Chrysos, S. Moschoglou, Y. Panagakis, and S. Zafeiriou (2019) PolyGAN: high-order polynomial generators. arXiv preprint arXiv:1908.06571. Cited by: §7.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §2.
  • [5] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. De Freitas (2013) Predicting parameters in deep learning. In Advances in neural information processing systems (NeurIPS), pp. 2148–2156. Cited by: §3.
  • [6] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, et al. (2017) CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408. Cited by: §3.
  • [7] Y. Du and I. Mordatch (2019) Implicit generation and generalization in energy-based models. In Advances in neural information processing systems (NeurIPS), Cited by: §5.1, Table 2.
  • [8] J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), Cited by: §3.
  • [9] K. Fukushima (1980) Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics 36 (4), pp. 193–202. Cited by: §2.
  • [10] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman (2001) From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI) (6), pp. 643–660. Cited by: Figure 7, §4.1.
  • [11] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256. Cited by: §2.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems (NeurIPS), Cited by: §4.1.
  • [13] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv:1706.02677. Cited by: §5.2.
  • [14] G. L. Grinblat, L. C. Uzal, and P. M. Granitto (2017) Class-splitting generative adversarial networks. arXiv preprint arXiv:1709.07359. Cited by: §4.2, §5.1, Table 2.
  • [15] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems (NeurIPS), pp. 5767–5777. Cited by: §1, §4.2, §5.1, Table 2.
  • [16] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems (NeurIPS), pp. 1135–1143. Cited by: §3.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, §1, §4.2, footnote 7.
  • [18] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems (NeurIPS), pp. 6626–6637. Cited by: §5.1.
  • [19] Y. Hoshen, K. Li, and J. Malik (2019) Non-adversarial image synthesis with generative latent nearest neighbors. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5811–5819. Cited by: §5.1, Table 2.
  • [20] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141. Cited by: footnote 7.
  • [21] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708. Cited by: §1, §4.2.
  • [22] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In International Conference on Computer Vision (ICCV), pp. 1501–1510. Cited by: §1, §2.
  • [23] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), Cited by: §1, §2.
  • [24] A. G. Ivakhnenko (1971) Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics (4), pp. 364–378. Cited by: §2.
  • [25] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), Cited by: §1, §2, §5.1.
  • [26] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §5.1, §5.1.
  • [27] J. Kileel, M. Trager, and J. Bruna (2019) On the expressive power of deep polynomial neural networks. In Advances in neural information processing systems (NeurIPS), Cited by: 1st item, §5.3.
  • [28] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • [29] T. G. Kolda and B. W. Bader (2009) Tensor decompositions and applications. SIAM review 51 (3), pp. 455–500. Cited by: §3.1, §3.
  • [30] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-100 (Canadian Institute for Advanced Research). Cited by: §4.2.
  • [31] A. Krizhevsky, V. Nair, and G. Hinton (2014) The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html. Cited by: §4.2, §5.1, Table 2.
  • [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS), pp. 1097–1105. Cited by: §1, §1, §2.
  • [33] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §1, §2.
  • [34] C. Li (2003) A sigma-pi-sigma neural network (spsnn). Neural Processing Letters 17 (1), pp. 1–19. Cited by: §2.
  • [35] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV), pp. 3730–3738. Cited by: §2.
  • [36] T. Lucas, K. Shmelkov, K. Alahari, C. Schmid, and J. Verbeek (2019) Adversarial training of partially invertible variational autoencoders. arXiv preprint arXiv:1901.01091. Cited by: §5.1, Table 2.
  • [37] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §4.1, §4.2, §5.1, §5.1.
  • [38] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 5.
  • [39] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §2.
  • [40] S.M. Nikol’skii (2013) Analysis iii: spaces of differentiable functions. Encyclopaedia of Mathematical Sciences, Springer Berlin Heidelberg. External Links: ISBN 9783662099612 Cited by: footnote 4.
  • [41] S. Oh, W. Pedrycz, and B. Park (2003) Polynomial neural networks architecture: analysis and design. Computers & Electrical Engineering 29 (6), pp. 703–725. Cited by: §2.
  • [42] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2337–2346. Cited by: §1.
  • [43] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NeurIPS Workshops, Cited by: §2.
  • [44] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §2.
  • [45] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black (2018) Generating 3d faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV), pp. 704–720. Cited by: §5.3.
  • [46] S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §5.2.
  • [48] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems (NeurIPS), pp. 2234–2242. Cited by: §5.1.
  • [49] A. M. Saxe, J. L. McClelland, and S. Ganguli (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [50] J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1, footnote 2.
  • [51] Y. Shin and J. Ghosh (1991) The pi-sigma network: an efficient higher-order neural network for pattern classification and function approximation. In International Joint Conference on Neural Networks, Vol. 1, pp. 13–18. Cited by: §2.
  • [52] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos (2017) Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing 65 (13), pp. 3551–3582. Cited by: §3.
  • [53] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [54] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §1, §4.2.
  • [55] M. H. Stone (1948) The generalized weierstrass approximation theorem. Mathematics Magazine 21 (5), pp. 237–254. Cited by: footnote 4.
  • [56] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9. Cited by: §5.1.
  • [57] S. Tokui, K. Oono, S. Hido, and J. Clayton (2015) Chainer: a next-generation open source framework for deep learning. In NeurIPS Workshops, Cited by: §2.
  • [58] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §2.
  • [59] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. International Conference on Learning Representations (ICLR). Cited by: Table 5.
  • [60] N. Verma, E. Boyer, and J. Verbeek (2018) Feastnet: feature-steered graph convolutions for 3d shape analysis. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 5.
  • [61] C. Voutriaridis, Y. S. Boutalis, and B. G. Mertzios (2003) Ridge polynomial networks in pattern recognition. In EURASIP Conference focused on Video/Image Processing and Multimedia Communications, Vol. 2, pp. 519–524. Cited by: §2.
  • [62] W. Wang, X. Li, J. Yang, and T. Lu (2018) Mixed link networks. In International Joint Conferences on Artificial Intelligence (IJCAI), Cited by: §4.2.
  • [63] P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: §5.2.
  • [64] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: Figure 6, §4.1.
  • [65] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §4.2.
  • [66] Y. Xiong, W. Wu, X. Kang, and C. Zhang (2007) Training pi-sigma network by online gradient algorithm with penalty for small weight update. Neural computation 19 (12), pp. 3356–3368. Cited by: §2.
  • [67] C. Yunpeng, J. Xiaojie, K. Bingyi, F. Jiashi, and Y. Shuicheng (2018) Sharing residual units through collective tensor factorization in deep neural networks. In International Joint Conferences on Artificial Intelligence (IJCAI), Cited by: §3.
  • [68] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.2.
  • [69] K. Zhang, M. Sun, T. X. Han, X. Yuan, L. Guo, and T. Liu (2017) Residual networks of residual networks: multilevel residual networks. IEEE Transactions on Circuits and Systems for Video Technology 28 (6), pp. 1303–1314. Cited by: §4.2.
  • [70] S. Zhao, J. Song, and S. Ermon (2017) Learning hierarchical features from deep generative models. In International Conference on Machine Learning (ICML), pp. 4091–4099. Cited by: §2.