1 Introduction
Representation learning via the use of (deep) multilayered nonlinear models has revolutionised the field of computer vision over the past decade [32, 17]. Deep Convolutional Neural Networks (DCNNs) [33, 32] have been the dominant class of models. Typically, a DCNN is a sequence of layers in which the output of each layer is fed first to a convolutional operator (i.e., a set of shared weights applied via the convolution operator) and then to a nonlinear activation function. Skip connections between various layers allow deeper representations and improve the gradient flow while training the network [17, 54].
In the aforementioned case, if the nonlinear activation functions are removed, the output of a DCNN degenerates to a linear function of the input. In this paper, we propose a new class of DCNNs, which we coin Π-nets, where the output is a polynomial function of the input. We design Π-nets for generative tasks (e.g., where the input is a small-dimensional noise vector) as well as for discriminative tasks (e.g., where the input is an image and the output is a vector with dimensions equal to the number of labels). We demonstrate that these networks can produce good results without the use of nonlinear activation functions. Furthermore, our extensive experiments show, empirically, that Π-nets can consistently improve the performance in both generative and discriminative tasks, in many cases with significantly fewer parameters.

DCNNs have been used in computer vision for over 30 years [33, 50]. Arguably, what brought DCNNs back into mainstream research was the remarkable results achieved by the so-called AlexNet in the ImageNet challenge [32]. Although only seven years have passed since this pioneering effort, the field has witnessed dramatic improvements in all data-dependent tasks, such as object detection [21] and image generation [37, 15], to name just a few examples. The improvement is mainly attributed to carefully selected units in the architectural pipeline of DCNNs, such as blocks with skip connections [17], sophisticated normalization schemes (e.g., batch normalisation [23]), and the use of efficient gradient-based optimization techniques [28].

Parallel to the development of DCNN architectures for discriminative tasks, such as classification, the notion of Generative Adversarial Networks (GANs) was introduced for training generative models. GANs instantly became a popular line of research, but it was only after the careful design of DCNN pipelines and training strategies that GANs were able to produce realistic images [26, 2]. ProGAN [25] was the first architecture to synthesize realistic facial images with a DCNN. StyleGAN [26] is a follow-up work that improved ProGAN. The main addition of StyleGAN was a type of skip connection, called AdaIN [22], which allowed the latent representation to be infused in all the different layers of the generator. Similar infusions were introduced in [42] for conditional image generation.
Our work is motivated by the improvement of StyleGAN over ProGAN through such a simple infusion layer and by the need to provide an explanation (the authors argued that this infusion layer is a kind of style that allows a coarse-to-fine manipulation of the generation process; we, instead, attribute the improvement to gradually increasing the power of the polynomial). We show that such infusion layers create a special nonlinear structure, i.e., a higher-order polynomial, which empirically improves the representation power of DCNNs. We show that this infusion layer can be generalized (e.g., see Fig. 1) and applied in various ways in both generative and discriminative architectures. In particular, the paper makes the following contributions:

We propose a new family of neural networks (called Π-nets) where the output is a high-order polynomial of the input. To avoid the combinatorial explosion in the number of parameters of polynomial activation functions [27], our Π-nets use a special kind of skip connection to implement the polynomial expansion (please see Fig. 1 for a brief schematic representation). We theoretically demonstrate that this kind of skip connection relates to special forms of tensor decompositions.

We show how the proposed architectures can be applied in generative models, such as GANs, as well as in discriminative networks. We showcase that the resulting architectures can be used to learn high-dimensional distributions without nonlinear activation functions.

We convert state-of-the-art baselines into instances of the proposed Π-nets and show how this can largely improve the expressivity of the baseline. We demonstrate this conclusively in a battery of tasks (i.e., generation and classification). Finally, we demonstrate that our architectures are applicable to many different signals, such as images, meshes, and audio.
2 Related work
Expressivity of (deep) neural networks: Over the last few years, (deep) neural networks have been applied to a wide range of applications with impressive results. The performance boost can be attributed to a host of factors, including: a) the availability of massive datasets [4, 35], b) machine learning libraries [57, 43] running on massively parallel hardware, and c) training improvements. The training improvements include a) optimizer improvements [28, 46], b) augmented capacity of the networks [53], and c) regularization tricks [11, 49, 23, 58]. However, the paradigm for each layer has remained largely unchanged for several decades: each layer is composed of a linear transformation and an element-wise activation function. Despite the variety of linear transformations [9, 33, 32] and activation functions [44, 39] being used, the effort to extend this paradigm has not drawn much attention to date.

Recently, hierarchical models have exhibited stellar performance in learning expressive generative models [2, 26, 70]. For instance, the recent BigGAN [2] performs a hierarchical composition through skip connections from the noise to multiple resolutions of the generator. A similar idea emerged in StyleGAN [26], which is an improvement over the Progressive Growing of GANs (ProGAN) [25]. Like ProGAN, StyleGAN is a highly engineered network that achieves compelling results on synthesized 2D images. In order to explain the improvements of StyleGAN over ProGAN, the authors adopt arguments from the style transfer literature [22]. We believe that these improvements can be better explained in the light of our proposed polynomial function approximation. Despite the hierarchical composition proposed in these works, we present an intuitive and mathematically elaborate method to achieve a more precise approximation with a polynomial expansion. We also demonstrate that such a polynomial expansion can be used in image generation (as in [26, 2]), image classification, and graph representation learning.
Polynomial networks: Polynomial relationships have been investigated in two specific categories of networks: a) self-organizing networks with hard-coded feature selection, and b) pi-sigma networks.
The idea of learnable polynomial features can be traced back to the Group Method of Data Handling (GMDH) [24], which is often referred to as the first deep neural network [50]. GMDH learns partial descriptors that capture quadratic correlations between two predefined input elements. In [41], more input elements are allowed and higher-order polynomials are used. The input to each partial descriptor is predefined (a subset of the input elements), which does not allow the method to scale to high-dimensional data with complex correlations.
Shin et al. [51] introduce the pi-sigma network, which is a neural network with a single hidden layer. Multiple affine transformations of the data are learned; a product unit multiplies all the features to obtain the output. Improvements to the pi-sigma network include regularization for training [66] and using multiple product units to obtain the output [61]. The pi-sigma network is extended into the sigma-pi-sigma neural network (SPSNN) [34]. The idea of SPSNN relies on summing different pi-sigma networks to obtain each output. SPSNN also uses a predefined basis (overlapping rectangular pulses) on each pi-sigma sub-network to filter the input features. Even though such networks use polynomial features or products, they do not scale well to high-dimensional signals. In addition, their experimental evaluation is conducted only on signals with known ground-truth distributions (and with up to 3-dimensional input/output), unlike modern generative models where only a finite number of samples from a high-dimensional ground-truth distribution is available.
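To make the pi-sigma construction concrete, the following minimal sketch implements a single pi-sigma unit: a layer of affine "sigma" units whose outputs are combined by one product unit, yielding a degree-K polynomial of the input. All names and shapes here are our own illustrative choices, not taken from [51].

```python
import numpy as np

def pi_sigma(x, W, b):
    """A single pi-sigma unit: the product of K affine transformations of x.

    x: (d,) input; W: (K, d) weights; b: (K,) biases.
    The output is a polynomial of degree K in the input elements.
    """
    h = W @ x + b        # K affine "sigma" (summing) units
    return np.prod(h)    # one "pi" (product) unit combines them

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
y = pi_sigma(x, W, b)    # scalar: a degree-3 polynomial of x
```

With zero biases the unit is homogeneous of degree K, so scaling the input by a factor c scales the output by c^K, which is the hallmark of a product-of-affine-forms polynomial.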
3 Method
Symbol | Dimension(s) | Definition
$n$, $N$ | $\mathbb{N}$ | Polynomial term order; total approximation order.
$k$ | $\mathbb{N}$ | Rank of the decompositions.
$\boldsymbol{z}$ | $\mathbb{R}^d$ | Input to the polynomial approximator, i.e., generator.
$\boldsymbol{C}$, $\boldsymbol{\beta}$ | $\mathbb{R}^{o \times k}$, $\mathbb{R}^o$ | Parameters in both decompositions.
$\boldsymbol{A}_{[n]}, \boldsymbol{S}_{[n]}, \boldsymbol{B}_{[n]}, \boldsymbol{b}_{[n]}$ | — | Matrix parameters in the hierarchical decomposition.
$\odot$, $*$ | — | Khatri-Rao product; Hadamard product.
Notation: Tensors are symbolized by calligraphic letters, e.g., $\mathcal{W}$, while matrices (vectors) are denoted by uppercase (lowercase) boldface letters, e.g., $\boldsymbol{W}$ ($\boldsymbol{w}$). The mode-$m$ vector product of a tensor $\mathcal{W}$ with a vector $\boldsymbol{z}$ is denoted by $\mathcal{W} \times_m \boldsymbol{z}$. A detailed tensor notation is provided in the supplementary.
We want to learn a function approximator where each element $x_j$ of the output $\boldsymbol{x}$, with $j \in [1, o]$, is expressed as a polynomial of all the input elements $z_i$, with $i \in [1, d]$. (The theorem of [55] guarantees that any smooth function can be approximated by a polynomial; the approximation of multivariate functions is covered by an extension of the Weierstrass theorem, e.g., in [40], pg. 19.) That is, we want to learn a function $G: \mathbb{R}^d \to \mathbb{R}^o$ of order $N \in \mathbb{N}$, such that:
$$x_j = \beta_j + \boldsymbol{w}_j^{[1]\top} \boldsymbol{z} + \boldsymbol{z}^\top \boldsymbol{W}_j^{[2]} \boldsymbol{z} + \mathcal{W}_j^{[3]} \times_1 \boldsymbol{z} \times_2 \boldsymbol{z} \times_3 \boldsymbol{z} + \cdots \quad (1)$$
where $\beta_j \in \mathbb{R}$, $\boldsymbol{w}_j^{[1]} \in \mathbb{R}^d$, and the higher-order tensors $\{\mathcal{W}_j^{[n]}\}_{n=2}^{N}$ are the parameters for approximating the output $x_j$. The correlations (of the input elements $z_i$) up to $N$-th order emerge in (1). A more compact expression of (1) is obtained by vectorizing the outputs:
$$\boldsymbol{x} = \sum_{n=1}^{N} \Bigg( \mathcal{W}^{[n]} \prod_{j=2}^{n+1} \times_j \boldsymbol{z} \Bigg) + \boldsymbol{\beta} \quad (2)$$
where $\boldsymbol{\beta} \in \mathbb{R}^o$ and $\{\mathcal{W}^{[n]}\}_{n=1}^{N}$ are the learnable parameters. This form of (2) allows us to approximate any smooth function (for large $N$); however, the number of parameters grows as $\mathcal{O}(d^N)$.
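For intuition, a direct (uncompressed) implementation of (2) for $N = 2$ can be sketched as follows; the function and variable names are illustrative. Note that the second-order tensor alone already holds $o \cdot d^2$ parameters, which is exactly the growth that the decompositions below avoid.

```python
import numpy as np

def poly2(z, beta, W1, W2):
    """Direct second-order polynomial approximator (N = 2 in Eq. (2)).

    z: (d,); beta: (o,); W1: (o, d); W2: (o, d, d).
    Returns beta + W1 z + W2 x_1 z x_2 z, capturing pairwise correlations.
    """
    return beta + W1 @ z + np.einsum('oij,i,j->o', W2, z, z)

# toy example with o = 1, d = 2
beta = np.array([1.0])
W1 = np.array([[1.0, 0.0]])
W2 = np.zeros((1, 2, 2))
W2[0, 0, 1] = 1.0            # picks up the z_0 * z_1 correlation
z = np.array([2.0, 3.0])
x = poly2(z, beta, W1, W2)   # 1 + 2 + 2*3 = 9
```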
A variety of methods can be employed to reduce the number of parameters, such as pruning [8, 16], tensor decompositions [29, 52], special linear operators with reduced parameters [6], and parameter sharing/prediction [67, 5]. In contrast to the heuristic approaches of pruning or prediction, we describe below two principled ways which allow an efficient implementation: the first relies on performing an off-the-shelf tensor decomposition on (2), while the second considers the final polynomial as a product of lower-degree polynomials.

The tensor decompositions are used in this paper to provide a theoretical understanding (i.e., the order of the polynomial used) of the proposed family of Π-nets. Implementation-wise, incorporating the different Π-net structures is as simple as incorporating a skip connection. Nevertheless, in a Π-net, different skip connections lead to different kinds of polynomial networks.
3.1 Single polynomial
A tensor decomposition on the parameters is a natural way to reduce their number and to implement (2) with a neural network. Below, we demonstrate how three such decompositions result in novel neural network architectures. The main symbols are summarized in Table 1, while the equivalence between each recursive relationship and the corresponding polynomial is analyzed in the supplementary.
Model 1: CCP: A coupled CP decomposition [29] is applied on the parameter tensors. That is, each parameter tensor, i.e., $\mathcal{W}^{[n]}$ for $n = 1, \ldots, N$, is not factorized individually; rather, a coupled factorization of the parameters is defined. The recursive relationship is:
$$\boldsymbol{x}_n = \Big( \boldsymbol{U}_{[n]}^\top \boldsymbol{z} \Big) * \boldsymbol{x}_{n-1} + \boldsymbol{x}_{n-1} \quad (3)$$
for $n = 2, \ldots, N$, with $\boldsymbol{x}_1 = \boldsymbol{U}_{[1]}^\top \boldsymbol{z}$ and $\boldsymbol{x} = \boldsymbol{C} \boldsymbol{x}_N + \boldsymbol{\beta}$. The parameters $\boldsymbol{C}$, $\boldsymbol{\beta}$, and $\boldsymbol{U}_{[n]}$ for $n = 1, \ldots, N$ are learnable. To avoid overloading the diagram, a schematic assuming a third-order expansion ($N = 3$) is illustrated in Fig. 2.
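The CCP recursion of (3) is straightforward to implement; the sketch below (the names are ours) makes explicit that each step adds one order through a Hadamard product with a linearly transformed copy of the input.

```python
import numpy as np

def ccp(z, U, C, beta):
    """CCP model, Eq. (3): x_n = (U_n^T z) * x_{n-1} + x_{n-1}.

    z: (d,); U: list of N factor matrices of shape (d, k);
    C: (o, k); beta: (o,). The output is a degree-N polynomial of z.
    """
    x = U[0].T @ z                    # x_1
    for Un in U[1:]:
        x = (Un.T @ z) * x + x        # Hadamard product raises the order
    return C @ x + beta               # final affine map

# scalar sanity check: with all parameters equal to 1, the model is z^2 + z
U = [np.ones((1, 1)), np.ones((1, 1))]
out = ccp(np.array([2.0]), U, np.ones((1, 1)), np.zeros(1))  # 2^2 + 2 = 6
```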
Model 2: NCP: Instead of defining a flat CP decomposition, we can utilize a joint hierarchical decomposition on the polynomial parameters. A nested coupled CP decomposition (NCP) is defined, which results in the following recursive relationship for an $N$-th order approximation:
$$\boldsymbol{x}_n = \Big( \boldsymbol{A}_{[n]}^\top \boldsymbol{z} \Big) * \Big( \boldsymbol{S}_{[n]}^\top \boldsymbol{x}_{n-1} + \boldsymbol{B}_{[n]}^\top \boldsymbol{b}_{[n]} \Big) \quad (4)$$
for $n = 2, \ldots, N$, with $\boldsymbol{x}_1 = \big( \boldsymbol{A}_{[1]}^\top \boldsymbol{z} \big) * \big( \boldsymbol{B}_{[1]}^\top \boldsymbol{b}_{[1]} \big)$ and $\boldsymbol{x} = \boldsymbol{C} \boldsymbol{x}_N + \boldsymbol{\beta}$. The parameters $\boldsymbol{A}_{[n]}, \boldsymbol{S}_{[n]}, \boldsymbol{B}_{[n]}, \boldsymbol{b}_{[n]}$ for $n = 1, \ldots, N$, along with $\boldsymbol{C}$ and $\boldsymbol{\beta}$, are learnable. The explanation of each variable is elaborated in the supplementary, where the decomposition is derived.
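A minimal sketch of the NCP recursion in (4) follows; the variable names are ours, and the first entry of S is unused since $\boldsymbol{x}_1$ has its own form.

```python
import numpy as np

def ncp(z, A, S, B, b, C, beta):
    """NCP model, Eq. (4): x_n = (A_n^T z) * (S_n^T x_{n-1} + B_n^T b_n).

    z: (d,); A[n]: (d, k); S[n]: (k, k); B[n]: (m, k); b[n]: (m,);
    C: (o, k); beta: (o,). S[0] is not used: x_1 = (A_1^T z) * (B_1^T b_1).
    """
    x = (A[0].T @ z) * (B[0].T @ b[0])                  # x_1
    for An, Sn, Bn, bn in zip(A[1:], S[1:], B[1:], b[1:]):
        x = (An.T @ z) * (Sn.T @ x + Bn.T @ bn)         # one more order
    return C @ x + beta

# scalar sanity check with unit parameters: x_1 = z, x_2 = z * (z + 1)
ones = np.ones((1, 1))
out = ncp(np.array([3.0]), [ones, ones], [ones, ones],
          [ones, ones], [np.ones(1), np.ones(1)], ones, np.zeros(1))
```

In the scalar example above, $x_1 = 3$ and $x_2 = 3 \cdot (3 + 1) = 12$.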
Model 3: NCP-Skip: The expressiveness of NCP can be further extended using a skip connection (motivated by CCP). The new model uses a nested coupled decomposition and has the following recursive expression:
$$\boldsymbol{x}_n = \Big( \boldsymbol{A}_{[n]}^\top \boldsymbol{z} \Big) * \Big( \boldsymbol{S}_{[n]}^\top \boldsymbol{x}_{n-1} + \boldsymbol{B}_{[n]}^\top \boldsymbol{b}_{[n]} \Big) + \boldsymbol{x}_{n-1} \quad (5)$$
for $n = 2, \ldots, N$, with $\boldsymbol{x}_1$ and $\boldsymbol{x}$ defined as in NCP. The learnable parameters are the same as in NCP; however, the difference in the recursive form results in a different polynomial expansion and, thus, a different architecture.
Comparison between the models: All three models are based on a polynomial expansion; however, their recursive forms and employed decompositions differ. The CCP has a simpler expression, whereas the NCP and the NCP-Skip relate to standard architectures using hierarchical composition that have recently yielded promising results in both generative and discriminative tasks. In the remainder of the paper, for comparison purposes, we use NCP by default for image generation and NCP-Skip for image classification. In our preliminary experiments, CCP and NCP share a similar performance under the setting of Sec. 4. In all cases, to mitigate stability issues that might emerge during training, we employ certain normalization schemes that constrain the magnitude of the gradients. An in-depth theoretical analysis of the architectures is deferred to a future version of our work.
3.2 Product of polynomials
Instead of using a single polynomial, we express the function approximation as a product of polynomials. The product is implemented as successive polynomials, where the output of the $i$-th polynomial is used as the input to the $(i+1)$-th polynomial. The concept is visually depicted in Fig. 5, where each polynomial expresses a second-order expansion; stacking $k$ such polynomials results in an overall order of $2^k$. Trivially, if the approximation order of each polynomial is $B$ and we stack $k$ such polynomials, the total order is $B^k$. The product does not necessarily demand the same order in each polynomial; the expressivity and the expansion order of each polynomial can differ and depend on the task, e.g., for generative tasks where the resolution increases progressively, the expansion order could increase in the last polynomials. In all cases, the final order is the product of the orders of the individual polynomials.
There are two main benefits of the product over a single polynomial: a) it allows using a different decomposition (e.g., as in Sec. 3.1) and different expressive power for each polynomial; b) it requires far fewer parameters to achieve the same order of approximation. Given these benefits, we assume below that a product of polynomials is used, unless explicitly mentioned otherwise. The respective model of product polynomials is called ProdPoly.
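The composition underlying ProdPoly can be sketched in a few lines (names ours). Since composing polynomials multiplies their degrees, chaining k second-order polynomials yields a total order of 2^k:

```python
import numpy as np

def prod_poly(z, polys):
    """Product of polynomials: the output of each polynomial feeds the next.

    Each element of `polys` is a callable implementing a low-order polynomial;
    the composition has order equal to the product of the individual orders.
    """
    x = z
    for p in polys:
        x = p(x)
    return x

# a toy elementwise second-order polynomial: x + x * x
square_plus = lambda x: x + x * x

# two stacked second-order polynomials -> an order-4 polynomial of z
out = prod_poly(np.array([1.0]), [square_plus, square_plus])  # (1+1) + (1+1)^2 = 6
```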
3.3 Taskdependent input/output
The aforementioned polynomials express a function $G: \mathbb{R}^d \to \mathbb{R}^o$, where the input/output are task-dependent. For a generative task, e.g., learning a decoder, the input is typically a low-dimensional noise vector, while the output is a high-dimensional signal, e.g., an image. For a discriminative task, the input is an image; for a domain adaptation task, the input signal denotes the source domain and the output the target domain.
4 Proof of concept
In this section, we conduct motivational experiments in both generative and discriminative tasks to demonstrate the expressivity of Π-nets. Specifically, the networks are implemented without activation functions, i.e., only linear operations (e.g., convolutions) and Hadamard products are used. In this setting, the output is linear or multilinear with respect to the parameters.
4.1 Linear generation
One of the most popular generative models is the Generative Adversarial Net (GAN) [12]. We design a GAN where the generator is implemented as a product of polynomials (using the NCP decomposition), while the discriminator of [37] is used. No activation functions are used in the generator, except for a single hyperbolic tangent (tanh) in the image space (additional details are deferred to the supplementary material).
Two experiments are conducted with a polynomial generator (Fashion-MNIST and YaleB). We perform a linear interpolation in the latent space when trained with Fashion-MNIST [64] and with YaleB [10] and visualize the results in Figs. 6 and 7, respectively. Note that the linear interpolation generates plausible images and navigates among different categories, e.g., trousers to sneakers or trousers to t-shirts. Equivalently, it can linearly traverse the latent space from a fully illuminated to a partly dark face.

4.2 Linear classification
To empirically illustrate the power of the polynomial, we use a ResNet without activations for classification. The Residual Network (ResNet) [17, 54] and its variants [21, 62, 65, 69, 68] have been applied to diverse tasks, including object detection and image generation [14, 15, 37]. The core component of ResNet is the residual block, which is expressed as $\boldsymbol{x} + F(\boldsymbol{x})$ for input $\boldsymbol{x}$.
We modify each residual block to express a higher-order interaction, which can be achieved with NCP-Skip. The output of each residual block is the input to the next residual block, which makes our ResNet a product of polynomials. We conduct a classification experiment on CIFAR10 [31] (10 classes) and CIFAR100 [30] (100 classes). Each residual block is modified in two ways: a) all the activation functions are removed, and b) it is converted into an $N$-th order expansion. The second-order expansion (for the residual block) is expressed as $\boldsymbol{x} + F(\boldsymbol{x}) + F(\boldsymbol{x}) * \boldsymbol{x}$; higher orders are constructed similarly by performing a Hadamard product of the last term with $\boldsymbol{x}$ (e.g., for a third-order expansion it would be $\boldsymbol{x} + F(\boldsymbol{x}) + F(\boldsymbol{x}) * \boldsymbol{x} + (F(\boldsymbol{x}) * \boldsymbol{x}) * \boldsymbol{x}$). The following two variations are evaluated: a) a single residual block is used in each 'group layer', and b) two blocks are used per 'group layer'. The latter variation is equivalent to ResNet18 without activations.
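The modified block can be sketched as follows; this is a sketch under our own naming, with F standing in for the usual (convolutional) transformation of the residual block.

```python
import numpy as np

def poly_residual_block(x, F, order=2):
    """Activation-free residual block with an N-th order expansion.

    order=1 recovers the standard block y = x + F(x); each additional order
    Hadamard-multiplies the last term with x:
    y = x + F(x) + F(x)*x + (F(x)*x)*x + ...
    """
    term = F(x)
    out = x + term
    for _ in range(order - 1):
        term = term * x          # Hadamard product: one order higher
        out = out + term
    return out

# sanity check with a linear stand-in for F
F = lambda v: 2.0 * v
y = poly_residual_block(np.array([1.0, 2.0]), F, order=2)  # [5., 14.]
```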
Each experiment is conducted several times; the mean accuracy is reported in Fig. 8. We note that the same trends emerge in both datasets (the baseline, i.e., ResNet18 without activation functions, underperforms its counterpart with activation functions; however, we emphasize that the original ResNet was not designed to work without activation functions). The performance remains similar irrespective of the number of residual blocks in the group layer. The performance is affected by the order of the expansion, i.e., higher orders cause a decrease in the accuracy. Our conjecture is that this can be partially attributed to overfitting (note that an $N$-th order expansion per block, with 8 residual units in total, yields a polynomial of total power $N^8$); however, we defer a detailed study of this to a future version of our work. Nevertheless, in all cases without activations, the accuracy is close to that of the original ResNet18 with activation functions.
5 Experiments
We conduct three experiments against state-of-the-art models in three diverse tasks: image generation, image classification, and graph representation learning. In each case, the considered baseline is converted into an instance of our family of Π-nets and the two models are compared.
5.1 Image generation
The robustness of ProdPoly in image generation is assessed in two different architectures/datasets below.
SNGAN on CIFAR10: In the first experiment, the architecture of SNGAN [37] is selected as a strong baseline on CIFAR10 [31]. The baseline includes residual blocks in the generator and the discriminator.
The generator is converted into a Π-net, where each residual block expresses a single order of the polynomial. We implement two versions: one with a single polynomial (NCP) and one with a product of polynomials (where each polynomial uses NCP). In our implementation, $\boldsymbol{A}_{[n]}$ is a thin FC layer, $\boldsymbol{b}_{[n]}$ is a bias vector, and $\boldsymbol{S}_{[n]}$ is the transformation of the residual block. Other than the aforementioned modifications, the hyperparameters (e.g., discriminator, learning rate, optimization details) are kept the same as in [37].

Each network is run 10 times and the mean and variance are reported. The popular Inception Score (IS) [48] and the Fréchet Inception Distance (FID) [18] are used for quantitative evaluation. Both scores extract feature representations from a pretrained classifier (the Inception network [56]).

The quantitative results are summarized in Table 2. In addition to SNGAN and our two variations with polynomials, we have added the scores of [14, 15, 7, 19, 36] as reported in the respective papers. Note that the single polynomial already outperforms the baseline, while ProdPoly boosts the performance further and achieves a substantial improvement over the original SNGAN.
Image generation on CIFAR10

Model | IS (↑) | FID (↓)
SNGAN | |
NCP (Sec. 3.1) | |
ProdPoly | |
CSGAN [14] | |
WGAN-GP [15] | |
CQFG [36] | |
EBM [7] | |
GLANN [19] | |
StyleGAN on FFHQ: StyleGAN [26] is the state-of-the-art architecture in image generation. The generator is composed of two parts, namely: (a) the mapping network, composed of 8 FC layers, and (b) the synthesis network, which is based on ProGAN [25] and progressively learns to synthesize high-quality images. The sampled noise is transformed by the mapping network and the resulting vector is then used by the synthesis network. As discussed in the introduction, StyleGAN is already an instance of the Π-net family due to AdaIN. Specifically, the AdaIN layer can be written as $\big( \boldsymbol{A}^\top \boldsymbol{w} \big) * n(c(\boldsymbol{x}))$, where $n$ is a normalization, $c$ is a convolution, and $\boldsymbol{w}$ is the transformed noise (mapping network). This is equivalent to our NCP model by setting $\boldsymbol{S}_{[n]}^\top \boldsymbol{x}_{n-1}$ as the (normalized) convolution operator.
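The reading of AdaIN as one NCP step can be sketched as follows; this is our simplified stand-in (per-feature normalization, a generic `conv` callable), not StyleGAN's exact implementation.

```python
import numpy as np

def adain_like(x, w, A, conv):
    """AdaIN-style infusion viewed as one step of the NCP recursion.

    x: current features; w: transformed noise from the mapping network;
    A: matrix mapping w to per-feature scales; conv: the convolution.
    Returns (A w) * n(c(x)): a Hadamard product that raises the polynomial
    order of the generator by one at every such layer.
    """
    h = conv(x)
    h = (h - h.mean()) / (h.std() + 1e-8)   # instance-norm-like normalization n
    return (A @ w) * h

# toy check with an identity stand-in for the convolution
out = adain_like(np.array([1.0, 2.0, 3.0]), np.ones(3), np.eye(3), lambda v: v)
```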
In this experiment, we illustrate how simple modifications, using our family of products of polynomials, can further improve the representation power. We make a minimal modification in the mapping network, while fixing the rest of the hyperparameters. In particular, we convert the mapping network into a polynomial (specifically an NCP), which makes the generator a product of two polynomials.
The Flickr-Faces-HQ (FFHQ) dataset [26], which includes high-resolution images of faces, is used; all images are resized to a lower, fixed resolution. Our method improves the best FID score over the original StyleGAN at that resolution. Synthesized samples of our approach are visualized in Fig. 9.
5.2 Classification
We perform two experiments on classification: a) audio classification, b) image classification.
Audio classification: The goal of this experiment is twofold: a) to evaluate the ResNet conversion on a distribution that differs from that of natural images, and b) to validate whether higher-order blocks make the model more expressive. The core assumption is that, by increasing the expressivity of our model, we can equivalently use fewer residual blocks of higher order to achieve a performance similar to the baseline.
The performance of ResNet is evaluated on the Speech Commands dataset [63]. The dataset includes short audio files, where each file contains a single word with a duration of one second. There are multiple different words (classes), with each word having numerous recordings. Every audio file is converted into a mel-spectrogram.
The baseline is a ResNet34 architecture; we use fewer, second-order residual blocks to build ProdPoly-ResNet and match the performance of the baseline. The quantitative results are reported in Table 3. The two models share the same accuracy; however, ProdPoly-ResNet includes fewer parameters. This result validates our assumption that our model is more expressive and can achieve the same performance with fewer parameters.
Speech Commands classification with ResNet

Model | # blocks | # params | Accuracy
ResNet34 | | |
ProdPoly-ResNet | | |
ImageNet classification with ResNet

Model | # blocks | Top-1 error (%) | Top-5 error (%) | Speed | Model size
ResNet50 | | 23.570 | 6.838 | 8.5K | 50.26 MB
ProdPoly-ResNet50 | | 22.875 | 6.358 | 7.5K | 68.81 MB
Image classification: We perform a large-scale classification experiment on ImageNet [47]. We choose float16 instead of float32 to achieve acceleration and to reduce the GPU memory consumption. To stabilize the training, the second-order term of each residual block is normalized with a hyperbolic tangent unit. SGD with momentum and weight decay is used. The learning rate follows a step-wise decay schedule, and models are trained from scratch using a linear warmup of the learning rate during the first five epochs, according to [13]. For other batch sizes, due to the limitation of GPU memory, we linearly scale the learning rate.
The Top-1 error throughout the training is visualized in Fig. 10, while the validation results are reported in Table 4. For a fair comparison, we report the results from our own training for both the original ResNet and ProdPoly-ResNet (the performance of the original ResNet [17] is inferior to the one reported here and in [20]). ProdPoly-ResNet consistently improves the performance with an extremely small increase in computational complexity and model size. Remarkably, ProdPoly-ResNet50 achieves a single-crop Top-5 validation error of 6.358%, exceeding ResNet50 (6.838%) by 0.48%.
Model | error (mm) (↓) | speed (ms) (↓)
GAT [59] | 0.732 | 11.04
FeastNet [60] | 0.623 | 6.64
MoNet [38] | 0.583 | 7.59
SpiralGNN [1] | 0.635 | 4.27
ProdPoly (simple) | 0.530 | 4.98
ProdPoly (simple, linear) | 0.529 | 4.79
ProdPoly (full) | 0.476 | 5.30
ProdPoly (full, linear) | 0.474 | 5.14

Table 5: ProdPoly vs. first-order graph learnable operators for mesh autoencoding. Note that, even without using activation functions, the proposed methods significantly improve upon the state-of-the-art.
5.3 3D Mesh representation learning
Below, we evaluate higher-order correlations in graph-related tasks. We experiment with 3D deformable meshes of fixed topology [45], i.e., the connectivity of the graph remains the same and each different shape is defined as a different signal on the vertices of the graph. As in the previous experiments, we extend a state-of-the-art operator, namely spiral convolutions [1], with the ProdPoly formulation and test our method on the task of autoencoding 3D shapes. We use the existing architecture and hyperparameters of [1], thus showing that ProdPoly can be used as a plug-and-play operator on existing models, turning the aforementioned one into a Spiral Π-net. Our implementation uses a product of polynomials, where each polynomial is a specific instantiation of (4) in which $\boldsymbol{S}_{[n]}^\top \boldsymbol{x}_{n-1}$ corresponds to the spiral convolution operator written in matrix form (stability of the optimization is ensured by applying vertex-wise instance normalization on the second-order term). We use this model (ProdPoly simple) to showcase how to increase the expressivity without adding new blocks to the architecture; this model can also be reinterpreted as a learnable polynomial activation function, as in [27]. We also show the results of our complete model (ProdPoly full), where each order uses a different spiral convolution.
In Table 5, we compare the reconstruction error of the autoencoder and the inference time of our method with the baseline spiral convolutions, as well as with the best results reported in [1] for other, more computationally involved, graph learnable operators (see the inference times in Table 5). Interestingly, we manage to outperform all previously introduced models even when discarding the activation functions across the entire network. Thus, expressivity increases without having to increase the depth or the width of the architecture, as is usually done by ML practitioners, and with small sacrifices in terms of inference time.
6 Discussion
In this work, we have introduced a new class of DCNNs, called Π-nets, that perform function approximation using a polynomial neural network. Our Π-nets can be efficiently implemented via a special kind of skip connection that leads to high-order polynomials, naturally expressed with tensorial factors. The proposed formulation extends the standard compositional paradigm of overlaying linear operations with activation functions. We motivate our method by a sequence of experiments without activation functions that showcase the expressive power of polynomials, and demonstrate that Π-nets are effective in both discriminative and generative tasks. By trivially modifying state-of-the-art architectures in image generation, image and audio classification, and mesh representation learning, the performance consistently improves. In the future, we aim to explore the link between different decompositions and the resulting architectures and to theoretically analyse their expressive power.
7 Acknowledgements
We are thankful to Nvidia for the hardware donation and Amazon web services for the cloud credits. The work of GC, SM, and GB was partially funded by an Imperial College DTA. The work of JD was partially funded by Imperial President’s PhD Scholarship. The work of SZ was partially funded by the EPSRC Fellowship DEFORM: Large Scale Shape Analysis of Deformable Models of Humans (EP/S010203/1) and a Google Faculty Award. An early version with single polynomials for the generative settings can be found in [3].
References
[1] (2019) Neural 3D morphable models: spiral convolutional networks for 3D shape representation learning and generation. In International Conference on Computer Vision (ICCV).
[2] (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR).
[3] (2019) PolyGAN: high-order polynomial generators. arXiv preprint arXiv:1908.06571.
[4] (2009) ImageNet: a large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
[5] (2013) Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2148–2156.
[6] (2017) CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408.
[7] (2019) Implicit generation and generalization in energy-based models. In Advances in Neural Information Processing Systems (NeurIPS).
[8] (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR).
[9] (1980) Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36 (4), pp. 193–202.
[10] (2001) From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (6), pp. 643–660.
[11] (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256.
[12] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS).
[13] (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
[14] (2017) Class-splitting generative adversarial networks. arXiv preprint arXiv:1709.07359.
[15] (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5767–5777.
[16] (2015) Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1135–1143.
[17] (2016) Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[18] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6626–6637.
[19] (2019) Non-adversarial image synthesis with generative latent nearest neighbors. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5811–5819.
[20] (2018) Squeeze-and-excitation networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141.
[21] (2017) Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708.
[22] (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In International Conference on Computer Vision (ICCV), pp. 1501–1510.
[23] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML).
[24] (1971) Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics (4), pp. 364–378.
[25] (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR).
[26] (2019) A style-based generator architecture for generative adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR).
[27] (2019) On the expressive power of deep polynomial neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
[28] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[29] (2009) Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500.
[30] CIFAR-100 (Canadian Institute for Advanced Research).
[31] (2014) The CIFAR-10 dataset. http://www.cs.toronto.edu/kriz/cifar.html.
[32] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
 [33] (1998) Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §1, §2.
 [34] (2003) A sigmapisigma neural network (spsnn). Neural Processing Letters 17 (1), pp. 1–19. Cited by: §2.
 [35] (2015) Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV), pp. 3730–3738. Cited by: §2.
 [36] (2019) Adversarial training of partially invertible variational autoencoders. arXiv preprint arXiv:1901.01091. Cited by: §5.1, Table 2.
 [37] (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §4.1, §4.2, §5.1, §5.1.
 [38] (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 5.
 [39] (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10), pp. 807–814. Cited by: §2.
 [40] (2013) Analysis iii: spaces of differentiable functions. Encyclopaedia of Mathematical Sciences, Springer Berlin Heidelberg. External Links: ISBN 9783662099612 Cited by: footnote 4.
 [41] (2003) Polynomial neural networks architecture: analysis and design. Computers & Electrical Engineering 29 (6), pp. 703–725. Cited by: §2.
 [42] (2019) Semantic image synthesis with spatiallyadaptive normalization. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2337–2346. Cited by: §1.

[43]
(2017)
Automatic differentiation in PyTorch
. In NeurIPS Workshops, Cited by: §2.  [44] (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §2.
 [45] (2018) Generating 3d faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV), pp. 704–720. Cited by: §5.3.
 [46] (2018) On the convergence of adam and beyond. In International Conference on Learning Representations (ICLR), Cited by: §2.
 [47] (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §5.2.
 [48] (2016) Improved techniques for training gans. In Advances in neural information processing systems (NeurIPS), pp. 2234–2242. Cited by: §5.1.
 [49] (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), Cited by: §2.
 [50] (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1, footnote 2.
 [51] (1991) The pisigma network: an efficient higherorder neural network for pattern classification and function approximation. In International Joint Conference on Neural Networks, Vol. 1, pp. 13–18. Cited by: §2.
 [52] (2017) Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing 65 (13), pp. 3551–3582. Cited by: §3.
 [53] (2015) Very deep convolutional networks for largescale image recognition. In International Conference on Learning Representations (ICLR), Cited by: §2.
 [54] (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §1, §4.2.
 [55] (1948) The generalized weierstrass approximation theorem. Mathematics Magazine 21 (5), pp. 237–254. Cited by: footnote 4.
 [56] (2015) Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9. Cited by: §5.1.

[57]
(2015)
Chainer: a nextgeneration open source framework for deep learning
. In NeurIPS Workshops, Cited by: §2.  [58] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §2.
 [59] (2018) Graph attention networks. International Conference on Learning Representations (ICLR). Cited by: Table 5.
 [60] (2018) Feastnet: featuresteered graph convolutions for 3d shape analysis. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 5.
 [61] (2003) Ridge polynomial networks in pattern recognition. In EURASIP Conference focused on Video/Image Processing and Multimedia Communications, Vol. 2, pp. 519–524. Cited by: §2.
 [62] (2018) Mixed link networks. In International Joint Conferences on Artificial Intelligence (IJCAI), Cited by: §4.2.
 [63] (2018) Speech commands: a dataset for limitedvocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: §5.2.
 [64] (2017) Fashionmnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: Figure 6, §4.1.
 [65] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §4.2.
 [66] (2007) Training pisigma network by online gradient algorithm with penalty for small weight update. Neural computation 19 (12), pp. 3356–3368. Cited by: §2.
 [67] (2018) Sharing residual units through collective tensor factorization in deep neural networks. In International Joint Conferences on Artificial Intelligence (IJCAI), Cited by: §3.
 [68] (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.2.
 [69] (2017) Residual networks of residual networks: multilevel residual networks. IEEE Transactions on Circuits and Systems for Video Technology 28 (6), pp. 1303–1314. Cited by: §4.2.
 [70] (2017) Learning hierarchical features from deep generative models. In International Conference on Machine Learning (ICML), pp. 4091–4099. Cited by: §2.