Expressivity and Trainability of Quadratic Networks

10/12/2021 · by Feng-Lei Fan, et al.

Inspired by the diversity of biological neurons, quadratic artificial neurons can play an important role in deep learning models. The type of quadratic neuron of our interest replaces the inner-product operation in the conventional neuron with a quadratic function. Despite the promising results so far achieved by networks of quadratic neurons, there are important issues not well addressed. Theoretically, the superior expressivity of a quadratic network over either a conventional network or a conventional network with quadratic activation has not been fully elucidated, which leaves the use of quadratic networks not well grounded. Practically, although a quadratic network can be trained via generic backpropagation, it is subject to a higher risk of collapse than its conventional counterpart. To address these issues, we first apply spline theory and a measure from algebraic geometry to give two theorems that demonstrate the better model expressivity of a quadratic network than the conventional counterpart with or without quadratic activation. Then, we propose an effective and efficient training strategy referred to as ReLinear to stabilize the training process of a quadratic network, thereby unleashing its full potential in associated machine learning tasks. Comprehensive experiments on popular datasets are performed to support our findings and evaluate the performance of quadratic deep learning.


1 Introduction

In recent years, a plethora of deep artificial neural networks have been developed with impressive successes in many mission-critical tasks [fuchs2021super, you2019ct]. However, to date, the design of these networks has focused on architectures, such as shortcut connections [he2016deep, fan2018sparse]. Indeed, neural architecture search [liu2018progressive] aims to find networks of similar topological types. Almost exclusively, the mainstream network models are constructed with neurons of the same type, which is composed of two parts: an inner product and a nonlinear activation (we refer to such a neuron as a conventional neuron and a network made of these neurons as a conventional network hereafter). Although a conventional network does simulate certain important aspects of a biological neural network, such as hierarchical representation [lecun2015deep] and attention mechanisms [air], a conventional network and a biological neural system are fundamentally different in terms of neuronal diversity and complexity. In particular, a biological neural system coordinates numerous types of neurons, which contribute to all kinds of intellectual behaviors [thivierge2008neural]. Considering that an artificial network is invented to mimic the biological neural system, the essential role of neuronal diversity therein should be taken into account in deep learning research.

Along this direction, the so-called quadratic neurons [fan2019quadratic] were recently proposed, which replace the inner product in a conventional neuron with a quadratic operation (hereafter, we call a network made of quadratic neurons a quadratic network). A single quadratic neuron can implement the XOR logic operation, which is not possible for an individual conventional neuron. The superior expressivity of quadratic networks over conventional networks is partially confirmed by a theorem stating that there exists a class of functions that can be expressed by a quadratic network with a polynomial number of neurons, while a conventional network of comparable depth needs an exponential number of neurons [fan2020universal]. In addition, a quadratic autoencoder was developed for CT image denoising and produced denoising performance better than or comparable to its competitors [fan2019quadratic].

In spite of the promising progress achieved by quadratic networks, there are still important theoretical and practical issues that have not been satisfactorily addressed. First of all, the superiority of quadratic networks in representation power should be analyzed in a generic way instead of only for a special class of functions. Particularly, we are interested in comparing a quadratic network with both a conventional network and a conventional network with quadratic activation [du2018power, mannelli2020optimization, xu2021robust]. Also, the training process of a quadratic network is subject to a higher risk of collapse than the conventional counterpart. Specifically, given a quadratic network with $L$ layers, its output function is a piecewise polynomial of degree $2^L$. Such a degree of freedom may lead to magnitude explosion in the training process. For example, when the network is not appropriately initialized, the output of a deep quadratic network can be extremely large, which may destroy the training outcome. Therefore, it is essential to derive an effective and efficient strategy to facilitate the training process of a quadratic network.

To address the above issues, here we first present two theorems to reveal the superiority of a quadratic network in terms of model expressivity over either the conventional network or the conventional network with quadratic activation. The first theorem utilizes spline theory to compare the model expressivity of a quadratic network with that of a conventional network. When ReLU activation is used, a conventional network outputs a piecewise linear function, whereas a quadratic network defines a piecewise polynomial function. According to spline theory, approximation with a piecewise polynomial function is substantially more accurate than approximation with a piecewise linear function. Correspondingly, a quadratic network enjoys better approximation accuracy. The other theorem is based on a measure from algebraic geometry to show that a quadratic network is more expressive than a conventional network with quadratic activation, which suggests that a conventional network with quadratic activation is not optimal for leveraging quadratic mappings in deep learning.

Fig. 1: The performance of a quadratic network trained using the proposed ReLinear method, with an observed improvement over the conventional network of the same structure, where the hyperparameters of ReLinear control the quadratic terms, and the quadratic training builds on the conventional training.

To unleash the full potential of a quadratic network, we propose a novel training strategy referred to as ReLinear (referenced linear initialization) to stabilize the training process of a quadratic network, with which each quadratic neuron gradually evolves from a conventional neuron into a quadratic one. Moreover, regularization is imposed to control the nonlinear terms of a quadratic neuron in favor of a low-order polynomial fit. As a result, not only is the training process stabilized, but a quadratic network so trained can also yield a performance gain compared to the conventional network of the same structure. Furthermore, encouraged by the success of ReZero (residual with zero initialization) in training residual networks [bachlechner2020rezero], we merge our ReLinear strategy with ReZero to train a quadratic residual network progressively. Finally, we evaluate the proposed training strategy in comprehensive experiments on well-known datasets.

Main Contributions. In this paper, we present two theorems to demonstrate the superiority of quadratic networks in functional expressivity. Our results show not only that a quadratic network is powerful in the deep learning armory but also that a network with quadratic activation is sub-optimal. Of great practical importance, we propose a novel training strategy for optimization of quadratic networks. Finally, we conduct comprehensive experiments to demonstrate that the quadratic network trained with the proposed training strategy can perform competitively on well-known data sets.

2 Related Work

Polynomial networks were investigated decades ago. The idea of polynomial networks can be traced back to the Group Method of Data Handling (GMDH [ivakhnenko1971polynomial]), which gradually learns a complicated feature extractor:

$f(x_1, x_2, \dots, x_n) = a_0 + \sum_{i} a_i x_i + \sum_{i,j} a_{ij} x_i x_j + \sum_{i,j,k} a_{ijk} x_i x_j x_k + \cdots$ (1)

where $x_i$ is the $i$-th input variable and $a_i, a_{ij}, a_{ijk}, \dots$ are coefficients. Usually, this model is truncated at the second-order terms to avoid an explosion in the number of terms for high-dimensional inputs. The GMDH is regarded as one of the first deep learning models in the survey paper [schmidhuber2015deep]. Furthermore, based on the GMDH, the so-called higher-order unit was defined in [poggio1975optimal, giles1987learning, lippmann1989pattern], whose output is given by

$y = \sigma\Big(\sum_{i} w_i x_i + \sum_{i,j} w_{ij} x_i x_j + \sum_{i,j,k} w_{ijk} x_i x_j x_k + \cdots\Big)$ (2)

where $\sigma(\cdot)$ is a nonlinear activation function. To maintain the power of high-order units while reducing the number of weights in the high-order terms, Shin et al. reported the pi-sigma network [shin1991pi], which is formulated as

$y_k = \sigma\Big(\prod_{j} h_{jk}\Big), \quad h_{jk} = \sum_{i} w_{ijk} x_i + b_{jk},$ (3)

where $h_{jk}$ is the output of the $j$-th sigma unit for the $k$-th output element $y_k$, and $w_{ijk}$ is the weight of the $j$-th sigma unit associated with the input element $x_i$. A pi-sigma network is intrinsically a shallow quadratic network. Along this direction, [milenkovic1996annealing] removed all cubic and higher-order terms and proposed to use the annealing technique to find optimal second-order terms.

Recently, higher-order units have been revisited [zoumpourlis2017non, chrysos2021deep, livni2014computational, krotov2018dense]. In [zoumpourlis2017non], a quadratic convolutional filter was proposed to replace the linear filter, while in the work by Chrysos et al. [chrysos2021deep] the higher-order units described by Eq. (3) were embedded into a deep network, with the complexity of the individual unit reduced via tensor decomposition and factor sharing. Such a network achieved cutting-edge performance on several tasks. Compared to [chrysos2021deep], our group proposed a simplified quadratic neuron whose number of parameters is linear in the input dimension, and argued that more complicated neurons are not necessary based on the fundamental theorem of algebra [remmert1991fundamental]. Interestingly, when only the first- and second-order terms are kept and the rank is set to two in the tensor decomposition, the network of [chrysos2021deep] becomes a special case of our quadratic model.

On the other hand, neurons with polynomial activation [livni2014computational, krotov2018dense] are also relevant. However, it should be underlined that polynomially activated neurons are essentially different from polynomial neurons. In the former case, the network uses a polynomial activation but its neurons are still characterized by piecewise linear decision boundaries, while in the latter case a polynomial decision boundary is implied, which can truly extract nonlinear features. Kileel et al. [kileel2019expressive] found that the functional space of a network with polynomial activation forms an algebraic variety and proposed the dimension of this algebraic variety as a measure of the representation power of such a network.

Development of quadratic networks. From a theoretical perspective, the representation capability of quadratic networks was partially addressed in an analysis of the role of multiplicative operations in a network [jayakumar2019multiplicative], where it was shown that the incorporation of multiplicative interactions can strictly enlarge the hypothesis space of a feedforward neural network. Fan et al. [fan2020universal] showed that a quadratic network can approximate a complicated radial function with a more compact structure than a conventional model. Increasingly more results are emerging on applications of quadratic networks. For example, Nguyen et al. [nguyen2019deep] applied quadratic networks to predict the compressive strength of foamed concrete. Bu et al. [bu2021quadratic] applied a quadratic network to solve forward and inverse problems in PDEs.

3 Expressivity of Quadratic Networks

Given an input $\mathbf{x} \in \mathbb{R}^n$, a quadratic neuron of our interest is characterized as

$q(\mathbf{x}) = \sigma\big((\mathbf{x}^\top \mathbf{w}_r + b_r)(\mathbf{x}^\top \mathbf{w}_g + b_g) + (\mathbf{x} \odot \mathbf{x})^\top \mathbf{w}_b + c\big),$ (4)

where $\sigma(\cdot)$ is a nonlinear activation function (hereafter, we use $\sigma$ to denote ReLU), $\odot$ denotes the Hadamard product, $\mathbf{w}_r, \mathbf{w}_g, \mathbf{w}_b \in \mathbb{R}^n$ are weight vectors, and $b_r, b_g, c$ are biases. For a univariate input $x$, we have

$q(x) = \sigma\big((w_r x + b_r)(w_g x + b_g) + w_b x^2 + c\big).$ (5)
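For concreteness, the following is a minimal PyTorch sketch of a fully connected layer built from the quadratic neurons in Eq. (4); the class name, the realization of $(\mathbf{w}_r, \mathbf{w}_g, \mathbf{w}_b)$ as three nn.Linear branches, and the absorption of $c$ into a bias are our own illustrative choices, not the authors' released implementation.

import torch
import torch.nn as nn


class QuadraticLinear(nn.Module):
    # y = sigma((x @ W_r + b_r) * (x @ W_g + b_g) + (x * x) @ W_b + c),
    # where c is absorbed into the bias of the linear_b branch.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear_r = nn.Linear(in_features, out_features)  # first inner-product branch
        self.linear_g = nn.Linear(in_features, out_features)  # second inner-product branch
        self.linear_b = nn.Linear(in_features, out_features)  # acts on the squared input
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.linear_r(x) * self.linear_g(x) + self.linear_b(x * x))


# Example: a batch of 8 inputs with 16 features mapped to 32 quadratic neurons.
if __name__ == "__main__":
    layer = QuadraticLinear(16, 32)
    out = layer(torch.randn(8, 16))
    print(out.shape)  # torch.Size([8, 32])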

A network with higher expressivity means that this network can either express more functions or express the same function more accurately. In this section, we show the enhanced expressivity of our quadratic network relative to either a conventional network or a conventional network with quadratic activation. For comparison with a conventional network, we note that in spline theory, a polynomial spline has a significantly more accurate approximation power than the linear spline. Since a quadratic network can express a polynomial spline and a conventional network corresponds to a linear spline, a quadratic network is naturally more powerful than a conventional network. As far as conventional networks with quadratic activation are concerned, we leverage the dimension of algebraic variety as the model expressivity measure defined in [kileel2019expressive] to demonstrate that our quadratic network has a higher dimension of algebraic variety, which suggests that a quadratic network is more expressive than a conventional network with quadratic activation.

3.1 Spline Theory

Let $f(x)$ be a function in $C^{n+1}[a, b]$ and $p_n(x)$ be a polynomial of degree $n$ that interpolates $f(x)$ at $n+1$ distinct points $x_0, x_1, \dots, x_n \in [a, b]$. Then, for any $x \in [a, b]$, there exists a point $\xi \in (a, b)$ such that

$f(x) - p_n(x) = \dfrac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{i=0}^{n} (x - x_i).$ (6)

For some functions such as the Runge function $f(x) = \frac{1}{1 + 25x^2}$, as the degree of the polynomial interpolating over equispaced points increases, the interpolation error goes to infinity, i.e.,

$\lim_{n \to \infty} \Big( \max_{x \in [-1, 1]} |f(x) - p_n(x)| \Big) = \infty.$ (7)

This bad behavior is referred to as the Runge phenomenon [boyd1992defeating], which is explained by two reasons: 1) as the degree $n$ grows, the magnitude of the $(n+1)$-th derivative increases for such functions; and 2) the large distances between equispaced points make the product $\prod_{i=0}^{n}(x - x_i)$ huge near the ends of the interval.
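As a quick illustration of this phenomenon, the following NumPy sketch (our own example, not from the paper) fits the Runge function $1/(1+25x^2)$ with equispaced polynomial interpolants of increasing degree and prints the growing maximum error:

import numpy as np


def runge(x):
    # The Runge function 1 / (1 + 25 x^2) on [-1, 1].
    return 1.0 / (1.0 + 25.0 * x ** 2)


x_test = np.linspace(-1.0, 1.0, 1001)
for n in (5, 10, 15, 20):
    nodes = np.linspace(-1.0, 1.0, n + 1)                             # equispaced interpolation points
    poly = np.polynomial.Polynomial.fit(nodes, runge(nodes), deg=n)   # degree-n interpolant
    err = np.max(np.abs(poly(x_test) - runge(x_test)))
    print(f"degree {n:2d}: max interpolation error = {err:.3f}")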

To overcome the Runge phenomenon when a high-order polynomial is involved in interpolation, the polynomial spline [birman1967piecewise] was invented to partition the domain into pieces and fit a low-order polynomial to a small subset of points in each piece. Given a set of instances from the function $f$, a polynomial spline is generically formulated as follows:

(8)

where the spline is a piecewise polynomial of odd degree (without loss of generality, we consider odd-degree polynomial splines), satisfying that (1) it interpolates the given instances at the knots; and (2) adjacent pieces join smoothly, i.e., the spline is continuously differentiable up to the order allowed by its degree. The simplest piecewise polynomial is a piecewise linear function. However, a piecewise linear interpolation is generically inferior to a piecewise polynomial interpolation in terms of accuracy. To illustrate this rigorously, we have the following lemma:

Lemma 1 ([hall1976optimal]).

Let $s_f$ be the spline of degree $2k-1$ interpolating $f$, as described by Eq. (8). If the partition is uniform with mesh size $h$, then

(9)

where $\|\cdot\|_\infty$ is the uniform norm and $E_{2k}$ denotes the $2k$-th Euler number. The approximation error is bounded by $\mathcal{O}(h^{2k})$. For example,

(10)

This lemma also suggests that, to achieve the same accuracy, high-degree splines require less data than linear splines do. To reveal the expressivity of quadratic networks, we have the following proposition showing that a quadratic network can exactly express any univariate polynomial spline, whereas a conventional network cannot. The method used in the proof is to re-express the spline as a summation of several continuous functions and use quadratic network modules to express these functions one by one. Finally, we aggregate these modules into a single network, as shown in Figure 2.

Fig. 2: Illustration of our constructive spline approximation by a quadratic network.
Proposition 1 (Universal Spline Approximation).

Suppose that the high-order terms of the polynomial spline do not degenerate. Given a univariate polynomial spline expressed as in Eq. (8), there exists a quadratic network with ReLU activation that exactly represents the spline, while there exists no conventional network with ReLU activation that can do so.

Proof.

Because a function defined by a quadratic network is a continuous function, we need to re-express the spline as a combination of continuous functions to allow the construction. Mathematically, we re-write the spline given in Eq. (8) as

(11)

where for any ,

(12)

For notation consistency, . It is straightforward to verify that Eq. (11) is equivalent to Eq. (8). For any ,

(13)

has the following favorable characteristics: 1) it is a truncated function that takes the value zero over part of the domain; and 2) it is also a continuous function. Thus, it can be succinctly expressed as

(14)

where maps into that has the function value of zero.

Because a quadratic network can represent any univariate polynomial [fan2020universal], we let ; then,

(15)

Substituting Eq. (15) into Eq. (11), we derive our construction:

(16)

which concludes the first part of this proposition. For the second part, because a conventional network with ReLU activation represents a piecewise linear function, it cannot perfectly represent the spline as long as the high-order terms are non-zero. ∎

Remark 1. Proposition 1 demonstrates that a polynomial spline is a solution in the space of quadratic networks, yet it will not appear in the space of conventional networks. Although our result is constructive, it does imply that the expressivity of quadratic networks is superior to that of conventional networks, since a piecewise polynomial spline is certainly a better fitting tool than a piecewise linear spline. Since high-degree splines require less data than linear splines do, the use of quadratic networks might also admit data efficiency when realizing the same approximation error.

3.2 Dimension of Algebraic Variety

To our best knowledge, there exist at least two ways to realize so-called polynomial networks. The first is to utilize a polynomial activation function, while the second is to take a polynomial function for the aggregation, as with our quadratic neurons. Despite the confusion due to the use of the same name, the two realizations are dramatically different. One natural idea to check the capacity of a network is to compute the dimension of the hypothesis space provided by the network architecture. In algebraic geometry, sets of polynomials, referred to as algebraic varieties, have been well studied. To put the superior expressivity of quadratic networks in perspective, we employ the dimension of the algebraic variety, which was proposed to gauge the expressive power of polynomial networks [kileel2019expressive], to compare the two realizations. We find that the dimension of the algebraic variety of our quadratic network is significantly higher than that of its competitor, which suggests that our quadratic network can represent a richer class of functions than the network using quadratic activation.

Two realizations. Assume that a network architecture consists of $L$ layers with their widths specified by a vector of layer sizes. A network with a quadratic activation is a function of the form

(17)

where the weights and biases of each layer are the trainable parameters and the activation squares its input element-wise. In contrast, our quadratic network is of the following form:

(18)

where each layer is made of the quadratic neurons in Eq. (4) (to simplify our analysis, we use a linear activation in our quadratic network). Given an architecture, each polynomial network with respect to its weights and biases defines a functional space; we denote the functional spaces of the two realizations separately and compare their dimensions below.

Dimension of algebraic variety. In [kileel2019expressive], the Zariski closure of the functional space is considered, which is an algebraic variety, and the dimension of this variety is a measure of the expressivity of the pertaining network. Although the closure is larger than the functional space itself, their dimensions agree with each other. Moreover, the closure is amenable to powerful analysis tools from algebraic geometry. In the following, based on the results in [kileel2019expressive], we provide an estimate of the upper bound of the dimension for our quadratic network.

Lemma 2.

Given an architecture , the following holds:

(19)
Proof.

For all diagonal matrices and permutation matrices , the function described in (18) returns the same output under the following replacements:

(20)

where represents any element in . As a result, the dimension of a generic fiber of is at least .

According to [eisenbud2013commutative], the dimension of the variety is equal to the dimension of the domain of the parameterization map minus the dimension of its generic fiber, which means

(21)

In addition, is at most the number of terms of , which means

(22)

Combining the above two formulas, we conclude this proof. ∎

For the same architecture, the upper bound provided for the network with quadratic activation in [kileel2019expressive] is

(23)

This bound is lower than the one we derived in Eq. (19). In addition to the upper bound comparison, we have the following proposition to directly compare the dimensions of the two varieties.

Proposition 2.

Given the same architecture , we have

(24)
Proof.

It can be shown that by the following substitutions:

(25)

turns into , which means .

However, getting from is difficult because we need to construct interaction terms from . Generally, representing a single quadratic neuron with neurons using quadratic activation requires a good number of neurons:

(26)

where are coefficients.

From another meaningful angle, the network with quadratic activation can only produce polynomials of one fixed degree determined by its depth, while the quadratic network is much more flexible and can produce polynomials of other degrees because of the product operation in the quadratic neuron. Therefore, the former can never represent the latter. As a result, given the same network structure, the containment of the functional spaces is strict, and we conclude the claimed relation between their dimensions. ∎

Remark 2. The functions defined by polynomial networks with respect to network parameters form functional varieties, and their Zariski closures are algebraic varieties. Using the dimension of the algebraic variety to measure capacity, we show that our quadratic network has higher expressivity than the conventional network with quadratic activation. The picture is even clearer when the network architecture is shallow but infinitely wide. According to the theoretical results in [siegel2020approximation], such a network is never dense in the set of continuous functions if the activation function is quadratic. However, a quadratic network using ReLU activation equipped with the same architecture is dense.

4 ReLinear

Fig. 3: Illustration of the proposed training strategy.
TABLE I: Proposed training strategy (columns: Initialization, Learning Rate, Updating Equation; rows: the gradient-shrinking version of ReLinear and the two weight-shrinking versions). The gradient-shrinking version shrinks the gradients of the quadratic terms, while the weight-shrinking versions shrink the quadratic weights themselves. The biases are updated similarly, in reference to the equations for updating the corresponding weights.

Despite its superior expressivity, the training of a quadratic network may collapse, which prevents a quadratic network from achieving its full potential. When the parameters of a quadratic network are randomly initialized, the training is typically unstable: sometimes the model yields an output of exploding magnitude, and in other cases the training curve oscillates. Likely, this is because a quadratic term is nonlinear, and the composition of quadratic operations layer by layer produces a function of an exponentially high degree, causing the instability of the training process. As such, although a quadratic operation is more powerful and promises superior performance, it is necessary to balance model scalability and training stability. In this context, controlling the quadratic terms is instrumental to the performance of the model, since how nonlinear a model should be depends on the specific task.

To control the quadratic terms, we propose ReLinear (referenced linear initialization), which encourages the model to learn suitable quadratic terms gradually and adaptively in reference to the corresponding linear terms. The ReLinear method has the following two steps. First, the quadratic weights in each neuron are initialized as $\mathbf{w}_g = \mathbf{0}, b_g = 1$ and $\mathbf{w}_b = \mathbf{0}, c = 0$. Such an initialization degenerates a quadratic neuron into a conventional neuron. Second, the quadratic terms are regularized in the training process. Intuitively, there are two ways of regularization: shrinking the gradients of the quadratic weights (the gradient-shrinking version of ReLinear) and shrinking the quadratic weights themselves (the weight-shrinking version). Separate learning rates are assigned to the linear parameters $(\mathbf{w}_r, b_r)$ and the quadratic parameters $(\mathbf{w}_g, b_g, \mathbf{w}_b, c)$, and weight factors and weight decay rates are introduced for the quadratic parameters. In Table I, we summarize the key points of the proposed training strategy.

Specifically, for the gradient-shrinking version in Table I, we set different learning rates for the linear parameters $(\mathbf{w}_r, b_r)$ and the quadratic parameters $(\mathbf{w}_g, b_g, \mathbf{w}_b, c)$: the learning rate for the former is kept the same as in the conventional network, while the learning rate for the latter adjusts the quadratic nonlinearity. If the learning rate of the quadratic parameters is zero, the quadratic terms are never updated, and the quadratic model actually reduces to the conventional model. As this learning rate increases, the quadratic terms become more significant. By tuning the learning rate of the quadratic parameters, such as starting with a small value, we can prevent magnitude explosion while still utilizing the quadratic nonlinearity. A sketch of this scheme is given below.
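The following PyTorch sketch illustrates one possible implementation of this scheme on top of the QuadraticLinear layer sketched in Section 3; the initialization values follow the description above, while the function names, the learning-rate values, and the use of SGD are illustrative assumptions rather than the authors' released code.

import torch


def relinear_init(module):
    # Degenerate each quadratic neuron into a conventional one:
    # w_g = 0, b_g = 1, w_b = 0, c = 0 (c lives in linear_b.bias here).
    if isinstance(module, QuadraticLinear):
        torch.nn.init.zeros_(module.linear_g.weight)
        torch.nn.init.ones_(module.linear_g.bias)
        torch.nn.init.zeros_(module.linear_b.weight)
        torch.nn.init.zeros_(module.linear_b.bias)


def relinear_param_groups(model, lr_linear=0.1, lr_quadratic=1e-3):
    # Separate learning rates: (w_r, b_r) keep the conventional rate,
    # while the quadratic parameters get a much smaller one.
    linear, quadratic = [], []
    for name, param in model.named_parameters():
        if "linear_g" in name or "linear_b" in name:
            quadratic.append(param)
        else:
            linear.append(param)
    return [{"params": linear, "lr": lr_linear},
            {"params": quadratic, "lr": lr_quadratic}]


model = torch.nn.Sequential(QuadraticLinear(16, 32), QuadraticLinear(32, 10))
model.apply(relinear_init)
optimizer = torch.optim.SGD(relinear_param_groups(model), lr=0.1, momentum=0.9)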

For the weight-shrinking version in Table I, a straightforward way is to impose an L1 or L2 penalty on the quadratic parameters and shrink the quadratic weights at each iteration, instead of shrinking their gradients, as sketched below.
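A minimal sketch of this alternative, continuing the example above (the decay value and the L2-style multiplicative shrinkage are illustrative; an L1 version would instead subtract a constant proportional to the sign of each weight):

import torch


@torch.no_grad()
def shrink_quadratic_weights(model, omega=1e-4):
    # Called after each optimizer.step(): directly decay the quadratic
    # parameters, w <- (1 - omega) * w, leaving the linear branch untouched.
    for name, param in model.named_parameters():
        if "linear_g" in name or "linear_b" in name:
            param.mul_(1.0 - omega)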

We argue that shrinking the gradients of the quadratic terms is better than shrinking the quadratic weights themselves. Shrinking the weights adjusts the quadratic terms along the directions of those parameters, which may not decrease the loss because these directions can deviate from the gradient directions. In contrast, shrinking the gradients respects the gradient directions and thereby keeps reducing the loss.

Furthermore, in regard to the linear parameters $(\mathbf{w}_r, b_r)$, we can use either random initialization or weight transfer, which is to train a conventional network sharing the same structure as the quadratic network and then transfer the learned parameters into the quadratic network. Specifically, in this case the linear parameters in each quadratic neuron are initialized with the parameters of the corresponding conventional neuron. In contrast to random initialization, weight transfer incurs the extra computational cost of training the conventional model. If the conventional model of the same structure needs to be trained from scratch, we estimate that the overall cost increases by roughly one third, because the number of multiplications in a conventional neuron is about one third of that in a quadratic neuron.
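A sketch of the weight-transfer initialization under the layer definitions above (the positional pairing of layers assumes the two models share exactly the same structure; the function name is ours):

import torch
import torch.nn as nn


@torch.no_grad()
def transfer_linear_weights(conventional: nn.Module, quadratic: nn.Module):
    # Copy each trained nn.Linear of the conventional model into the
    # linear branch (w_r, b_r) of the corresponding quadratic layer.
    conv_layers = [m for m in conventional.modules() if isinstance(m, nn.Linear)]
    quad_layers = [m for m in quadratic.modules() if isinstance(m, QuadraticLinear)]
    for src, dst in zip(conv_layers, quad_layers):
        dst.linear_r.weight.copy_(src.weight)
        dst.linear_r.bias.copy_(src.bias)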

Remark 3. Pre-trained models are widely used in many computationally intensive or data-insufficient scenarios [chen2021pre]. The representation and knowledge from a model pre-trained on a similar task can facilitate a deep learning solution to the current task. To train a quadratic network, it is often possible to use a pre-trained conventional model from a related task for weight transfer. In these circumstances, the proposed ReLinear method is an embodiment of transfer learning.

In our numerical experiments, the quadratic network trained via ReLinear always outperforms the conventional network of the same structure. This is because, when the learning rates or weight factors of the quadratic terms are zero, the quadratic network is exactly a conventional network and therefore delivers at least the same performance. As these hyperparameters gradually increase, the quadratic terms evolve to extract features and refine the workflow, making the model generally better than the corresponding conventional model. Of course, these hyperparameters should be neither too large nor too small; otherwise, the quadratic terms would be either too aggressive or insignificant, defeating the purpose of quadratic training.

5 Experiments

In this section, we first conduct analysis experiments (Runge function approximation and image recognition) to show the effectiveness of the proposed training strategy in controlling quadratic terms and to analyze the performance of different schemes for the proposed ReLinear method. Then, encouraged by our theoretical analyses, we compare quadratic networks with conventional networks and the networks with quadratic activation to show that a quadratic network is a competitive model.

5.1 Analysis Experiments

5.1.1 Runge Function Approximation

As mentioned earlier, a polynomial spline is used to replace a complete polynomial to overcome the Runge phenomenon. Since a quadratic network with ReLU activation represents a piecewise polynomial (spline) function, we implement a fully connected quadratic network to approximate the Runge function to verify the feasibility of our proposed training strategy in suppressing quadratic terms. This experiment approximates a univariate function, which enables us to conveniently compute the degree of the output function produced by a quadratic network and monitor its change.

In total, training points are sampled from the Runge function at equal distances. All layers of the fully connected quadratic network share the same width, and the depth is chosen such that the degree of the output function meets the minimum requirement to fit the training instances. We compare the proposed strategies (the two weight-shrinking versions and the gradient-shrinking version) with regular training. The weight-shrinking versions use the L1 and L2 penalties on the quadratic weights, respectively, and the gradient-shrinking version uses a reduced learning rate for the quadratic terms, whereas regular training uses a single learning rate for all parameters. The total number of iterations is large enough to guarantee convergence.

Fig. 4: (a) Fitting the Runge function by a QNN via different training strategies. (b)-(e) Coefficients of randomly selected piecewise polynomials from the QNN trained with regular training and the proposed strategies (the two weight-shrinking versions and the gradient-shrinking version).

The results are shown in Figure 4. As a spline can avoid the Runge phenomenon, regardless of how the QNN with ReLU activation is trained, it fits the Runge function desirably without oscillations at the edges, as shown in Figure 4(a). Furthermore, in Figure 4(b)-(e), we examine the coefficients of randomly selected polynomials contained in the QNN's output function over different pieces. It is observed from Figure 4(b) that the polynomials associated with regular training have unignorable high-order terms. This is counter-intuitive because the QNN partitions the interval into many pieces (24 pieces based on our computation). Since only a few samples lie in each piece, it suffices to use a low-degree polynomial in each piece. This might be because the space of low-degree polynomials is a measure-zero subset of the space of high-degree polynomials; thus, it is hard to obtain a low-degree polynomial fit straightforwardly. Next, we observe that the coefficients of high degrees are significantly more suppressed in Figure 4(c)-(e) than in Figure 4(b). At the same time, the magnitudes of the coefficients of low degrees in Figure 4(c)-(e) are also reduced. These observations imply that all the proposed strategies can effectively control the quadratic terms as expected.

Training: ReLinear (weight-shrinking, first variant) | ReLinear (weight-shrinking, second variant) | ReLinear (gradient-shrinking)
RMSE: 0.0656 | 0.0426 | 0.0205
TABLE II: RMSE values of different training strategies.

Next, we quantitatively compare the approximation errors of the different training strategies using the root mean squared error (RMSE). We evenly sample 100 test instances, none of which appears in the training set. Table II shows the RMSE values of the three realizations of ReLinear. It can be seen that the gradient-shrinking version achieves the lowest error, which suggests that it strikes a better balance between suppressing quadratic terms and maintaining approximation precision.

5.1.2 Image Classification

In the preceding part, we showed that the proposed strategies can suppress the high-order terms by explicitly examining the output function of a quadratic network. Here, we focus on an image recognition task to further confirm the effectiveness of the proposed strategy. We build a quadratic ResNet (QResNet for short) by replacing the conventional neurons in a ResNet with quadratic neurons and keeping everything else intact, as sketched below. We train the QResNet on the CIFAR10 dataset. Our motivation is to gauge the characteristics and performance of the different realizations of the proposed ReLinear method through experiments.
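The following PyTorch sketch shows one way such a drop-in replacement can be realized: each convolution is swapped for a quadratic convolution built from three ordinary convolutions following Eq. (4), while batch normalization, ReLU, and the residual wiring stay untouched. The class and helper names, and the recursive replacement utility, are our own illustration, not the authors' released code.

import torch
import torch.nn as nn


class QuadraticConv2d(nn.Module):
    # Quadratic counterpart of nn.Conv2d following Eq. (4); BN and ReLU remain
    # outside, exactly where they sit in the original ResNet block.
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True):
        super().__init__()
        kw = dict(kernel_size=kernel_size, stride=stride, padding=padding, bias=bias)
        self.conv_r = nn.Conv2d(in_channels, out_channels, **kw)
        self.conv_g = nn.Conv2d(in_channels, out_channels, **kw)
        self.conv_b = nn.Conv2d(in_channels, out_channels, **kw)

    def forward(self, x):
        return self.conv_r(x) * self.conv_g(x) + self.conv_b(x * x)


def quadratize(module: nn.Module):
    # Recursively replace every nn.Conv2d with a QuadraticConv2d of the same shape.
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(module, name, QuadraticConv2d(
                child.in_channels, child.out_channels, child.kernel_size,
                stride=child.stride, padding=child.padding,
                bias=child.bias is not None))
        else:
            quadratize(child)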

Following the configurations of [he2016deep], we use batch training with a batch size of 128. The optimization is stochastic gradient-based, using Adam [kingma2014adam]. The learning rate of the linear parameters follows the conventional setting. The total number of epochs is 200; at the 100-th and 150-th epochs, the learning rates decrease at the same time. Because the training curves share the same trend as the testing curves, we only show the testing curves here for conciseness.

Tuning the gradient-shrinking version. Here, with QResNet20, we study the impact of the quadratic learning rates on the effectiveness of ReLinear. Without loss of generality, the learning rates of the quadratic terms are tied together; the lower they are, the more the quadratic weights are constrained. We sweep them over a range of values for a comprehensive analysis. The resultant accuracy curves are shown in Figure 5. It can be seen that when the quadratic learning rate is large, the training is quite unstable and the accuracy jumps severely. However, as the quadratic learning rates go low, the training curves become stabilized, mirroring that the high-order terms are well controlled. The best performance is achieved at a suitably small learning rate, which is consistent with the trend shown in Figure 1.

Fig. 5: Accuracy curves obtained with different learning rates for the quadratic terms in ReLinear.

Tuning the weight-shrinking versions. Here, we investigate the impact of different parameters for the two weight-shrinking versions, respectively. For both versions, the learning rates are set as in the conventional network and decay at the same two epochs. The weight factors of the L1 and L2 penalties are swept over several values for a comprehensive analysis; if the weight decay is too aggressive, the quadratic weights merely oscillate around zero. The accuracy curves for the two versions are shown in Figure 6. It is observed that the norm-based shrinkage is not always good at stabilizing the training in this task, while an appropriately chosen norm and weight factor manage to eliminate the large oscillation. The lowest error of the weight-shrinking version is still worse than the lowest error of the gradient-shrinking version.

Fig. 6: Accuracy curves obtained with different parameters for the two weight-shrinking versions of ReLinear (left: first variant; right: second variant).
Fig. 7: Accuracy curves from different learning rates by transferring weights from different stages.

Weight Transfer. As mentioned earlier, weight transfer can also be used to train a quadratic network. Because the training of a conventional ResNet has three stages (epochs 1-100, 101-150, and 151-200), weight transfer also has three choices, corresponding to the stage from which the weights are transferred. We evaluate all three choices. After transferring, the quadratic network inherits the learning rate of the transferred model and then decays it when the training moves to the next stage. Still, we sweep the quadratic learning rates for a comprehensive analysis. The accuracy curves are shown in Figure 7, from which there are two observations. First, transferred parameters can stabilize the training provided appropriate quadratic learning rates. For the same learning rate, weights transferred from a later stage make the training more robust than those transferred from an earlier stage. This is because parameters transferred from the later stages are already good for the model, so there is less need to optimize the quadratic terms, thereby avoiding the risk of explosion. The second highlight is that the best performance comes from weights transferred from the first stage, which suggests that the model can be further improved when the quadratic terms play a significant role. The lowest errors obtained by transferring from the three stages confirm this trend.

ReLinear+ReZero. For training a quadratic residual network in particular, the proposed ReLinear method can be integrated with the recent proposal called ReZero (residual with zero initialization [bachlechner2020rezero]), which is dedicated to training a residual network by revising the residual propagation formula from $\mathbf{x}_{l+1} = \mathbf{x}_l + F(\mathbf{x}_l)$ to $\mathbf{x}_{l+1} = \mathbf{x}_l + \alpha_l F(\mathbf{x}_l)$, where $\alpha_l$ is initialized to zero such that the network is an identity map at the beginning. ReZero can not only speed up the training but also improve the performance of the model. Here, we evaluate the feasibility of training a quadratic residual network with ReLinear+ReZero, as sketched below. We adopt the gradient-shrinking version of ReLinear and sweep the quadratic learning rates as before. The results are shown in Figure 8. Comparing Figures 8 and 5, we find, to our pleasant surprise, that QResNet20 trained via ReLinear+ReZero is more stable: at learning rates for which ReLinear alone still suffers from oscillations, the curves from ReLinear+ReZero do not.
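A minimal sketch of how the two ideas can be combined in one residual block, reusing the QuadraticConv2d sketch above (the block layout, channel handling, and placement of batch normalization are illustrative assumptions):

import torch
import torch.nn as nn


class ReZeroQuadraticBlock(nn.Module):
    # Residual block x + alpha * F(x): F is built from quadratic convolutions
    # (ReLinear-initialized and trained with a small quadratic learning rate),
    # and alpha starts at zero so the block is an identity map at the beginning.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = QuadraticConv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = QuadraticConv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(x + self.alpha * out)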

Fig. 8: Accuracy curves obtained from different learning rates () for the ReLinear+ReZero method.

5.1.3 Training Stability

Here, we compare our quadratic model with a model using quadratic activation on the CIFAR100 dataset. VGG16 [simonyan2014very] is used as the test bed. We obtain the VGG16 with quadratic activation by directly revising the activation function, and we prototype the Quadratic-VGG16 by replacing the conventional neurons in VGG16 with quadratic neurons. All training protocols of the Quadratic-VGG16 and the VGG16 with quadratic activation are the same as those of the original VGG16. We test three learning rates for the VGG16 with quadratic activation: 0.05, 0.03, and 0.01. For the Quadratic-VGG16, we set the learning rates of the linear and quadratic parameters following the ReLinear strategy. The results are shown in Table III. We find that the VGG16 with quadratic activation does not converge regardless of the learning rate; therefore, no errors can be reported. Indeed, directly training a network with quadratic activation also suffers from magnitude explosion due to the exponentially high degree of its output. In contrast, this problem is overcome in our quadratic network with the aid of the proposed training strategy.

Network | Error (%)
VGG16 (quadratic activation), l.r. = 0.05 | did not converge
VGG16 (quadratic activation), l.r. = 0.03 | did not converge
VGG16 (quadratic activation), l.r. = 0.01 | did not converge
Quadratic-VGG16 | 28.33
TABLE III: Error (%) of the Quadratic-VGG16 and the VGG16 using quadratic activation on the CIFAR100 validation set.

5.2 Comparative Study

Here, we validate the superiority of a quadratic network over a conventional network with experiments on CIFAR10 and ImageNet. The quadratic network is implemented as a drop-in replacement for the conventional network, which means that the only difference is the neuron type. Despite the straightforward replacement, aided by the proposed strategy, the quadratic version of a network can be much better than its counterpart. Moreover, we implement a compact version of quadratic networks, which also demonstrates competitive performance.

CIFAR10. In this experiment, we systematically compare our QResNet with the ResNet. We follow the same protocol as the ResNet to train the QResNet, such as batch size, number of epochs, and so on. As our preceding experimental results imply that the gradient-shrinking version of ReLinear is generally better than the weight-shrinking version, we adopt the former. The quadratic learning rates are set individually for QResNet20, QResNet32, QResNet56, and QResNet110. For all quadratic models, we test both weight transfer (from the first stage) and random initialization for their linear parts. We also implement ReLinear+ReZero with corresponding settings for QResNet20, QResNet32, QResNet56, and QResNet110. Table IV summarizes the results. Regardless of the way of initialization, all quadratic models are better than their counterparts, which is consistent with our analysis that quadratic neurons can improve model expressivity. Again, the improvement by quadratic networks is warranted because, under the proposed strategy, the conventional model is a special case of the quadratic model; at worst, a quadratic model delivers the same performance as the conventional model. Furthermore, combined with weight transfer, a quadratic network trained via ReLinear can surpass a conventional network by a larger margin, which suggests that a well-trained conventional model can effectively guide the training of a quadratic network.

Network Params Error (%)
ResNet20 0.27M 8.75
ResNet32 0.46M 7.51
ResNet56 0.86M 6.97
ResNet110 1.7M 6.61
ResNet1202 19.4M 7.93
QResNet20 (r. i., ReLinear) 0.81M 7.78
QResNet20 (r. i., ReLinear+ReZero) 0.81M 7.97
QResNet20 (w. t., ReLinear) 0.81M 7.17
QResNet32 (r. i., ReLinear) 1.39M 7.18
QResNet32 (r. i., ReLinear+ReZero) 1.39M 6.90
QResNet32 (w. t., ReLinear) 1.39M 6.38
QResNet56 (r. i., ReLinear) 2.55M 6.43
QResNet56 (r. i., ReLinear+ReZero) 2.55M 6.34
QResNet56 (w. t., ReLinear) 2.55M 6.22
QResNet110 (r. i., ReLinear) 5.1M 6.36
QResNet110 (r. i., ReLinear+ReZero) 5.1M 6.12
QResNet110 (w. t., ReLinear) 5.1M 5.44

Note: r. i. refers to random initialization, and w. t. refers to weight transfer.

TABLE IV: Image classification by ResNet and QResNet on CIFAR10 validation set.

Currently, an individual quadratic neuron has three times as many parameters as an individual conventional neuron, which makes a quadratic model roughly three times larger than the conventional model. To reduce the model complexity, we simplify the quadratic neuron by eradicating the interaction term, leading to a compact quadratic neuron: $\sigma\big(\mathbf{x}^\top \mathbf{w}_r + b_r + (\mathbf{x} \odot \mathbf{x})^\top \mathbf{w}_b + c\big)$; a sketch is given after Table V. The number of parameters in a compact quadratic neuron is twice that in a conventional neuron. Similarly, as a drop-in replacement, we implement QResNet20, QResNet32, and QResNet56 with compact quadratic neurons, referred to as Compact-QResNets, and compare them with the conventional models. The linear parts are initialized with a conventional ResNet. We also use ReLinear to train the Compact-QResNets, with the quadratic learning rates chosen individually for Compact-QResNet20, Compact-QResNet32, and Compact-QResNet56. It is seen from Table V that the Compact-QResNets are overall inferior to the QResNets, but the margins between QResNet32/56 and Compact-QResNet32/56 are slight. It is noteworthy that both Compact-QResNet32 and QResNet32 are better than ResNet110 with smaller model sizes.

Network Params Error (%)
QResNet20 0.81M 7.17
Compact-QResNet20 0.54M 7.76
QResNet32 1.39M 6.38
Compact-QResNet32 0.92M 6.56
QResNet56 2.55M 6.22
Compact-QResNet56 1.92M 6.30
TABLE V: Image classification error (%) by Compact-QResNet on the CIFAR10 validation set.
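For reference, a minimal sketch of a fully connected layer of such compact quadratic neurons, in the same illustrative style as the earlier sketches (the class name and the absorption of $c$ into a bias are our own choices):

import torch
import torch.nn as nn


class CompactQuadraticLinear(nn.Module):
    # y = sigma(x @ W_r + b_r + (x * x) @ W_b + c): the interaction (product)
    # term is removed, leaving roughly twice the parameters of nn.Linear.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear_r = nn.Linear(in_features, out_features)
        self.linear_b = nn.Linear(in_features, out_features)  # c absorbed into this bias
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.linear_r(x) + self.linear_b(x * x))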

Furthermore, we would like to underscore that the reason for the improvements produced by quadratic networks is not increasing the number of parameters in a brute-force manner but facilitating an enhanced feature representation. The enhanced feature representation can in some circumstances also enjoy model efficiency. For example, purely stacking layers does not necessarily improve model performance, evidenced by the fact that ResNet1202 is worse than ResNet32, ResNet56, and ResNet110. Moreover, we implement ResNet32 and ResNet56 with 1.5, 2, and 2.5 times the original channel numbers. As Table VI shows, increasing the channel numbers of the ResNets does not produce as much gain as using quadratic models of approximately similar model size. We think this is because a widened network is just a combination of more linear operations in the same layer, which does not change the type of representation much.

Network Channel Number Params Error (%)
ResNet32 (16) 0.46M 7.51
ResNet32 (24) 1.04M 6.45
ResNet32 (32) 1.84M 6.75
ResNet32 (40) 2.88M 7.44
QResNet32 - 1.39M 6.38
ResNet56 (16) 0.86M 6.97
ResNet56 (24) 1.92M 6.84
ResNet56 (32) 3.40M 6.12
ResNet56 (40) 5.31M 6.58
QResNet56 - 2.55M 6.22
TABLE VI: Image classification error (%) by ResNets with increased channel numbers on the CIFAR10 validation set.

ImageNet. Here, we confirm the superior model expressivity of quadratic networks with experiments on ImageNet. The ImageNet dataset [deng2009imagenet] consists of 1.28 million images for training and 50,000 images for validation, drawn from 1,000 classes. For model configurations, we follow those in the ResNet paper [he2016deep]. We set the batch size to 256, the initial learning rate to 0.1, the weight decay to 0.0001, and the momentum to 0.9. For ReLinear, we set the quadratic learning rates to small values. We adopt the standard 10-crop validation. As seen in Table VII, similar to what we observed in the CIFAR10 experiments, direct replacement with quadratic neurons improves the performance, which confirms that the quadratic network is more expressive than the conventional network. For example, QResNet32 is better than ResNet34 by a considerable margin (>1%).

Network Error (%)
plain-18 27.94
ResNet18 27.88
QResNet18 27.67
plain-34 28.54
ResNet34 25.03
QResNet32 24.01

The errors of compared models are reported by the official implementation.

TABLE VII: Image classification by QResNet on the ImageNet validation set.

6 Conclusion

In this article, we have theoretically demonstrated the superior expressivity of a quadratic network over either the conventional deep learning model or a conventional model with quadratic activation. Then, we have proposed an effective and efficient strategy, ReLinear, for training a quadratic network, thereby improving its performance in various machine learning tasks. Finally, we have performed extensive experiments to corroborate our theoretical findings and confirm the practical gains with ReLinear. Our code is shared at https://github.com/FengleiFan/ReLinear. Future research directions include up-scaling quadratic networks to solve more real-world problems and characterizing our quadratic approach in terms of its robustness, generalizability, and other properties.

Acknowledgement

Dr. Fei Wang would like to acknowledge the support from NSF 1750326, the AWS machine learning for research award, and the Google faculty research award. Dr. Rongjie Lai’s research is supported in part by an NSF Career Award DMS–1752934. Dr. Ge Wang would like to acknowledge the funding support from R01EB026646, R01CA233888, R01CA237267, R01HL151561, R21CA264772, and R01EB031102.

References