# On Universal Equivariant Set Networks

Using deep neural networks that are either invariant or equivariant to permutations in order to learn functions on unordered sets has become prevalent. The most popular, basic models are DeepSets [Zaheer et al. 2017] and PointNet [Qi et al. 2017]. While known to be universal for approximating invariant functions, DeepSets and PointNet are not known to be universal when approximating equivariant set functions. On the other hand, several recent equivariant set architectures have been proven equivariant universal [Sannai et al. 2019; Keriven et al. 2019]; however, these models either use layers that are not permutation equivariant (in the standard sense) and/or use higher-order tensor variables, which are less practical. There is, therefore, a gap in understanding the universality of popular equivariant set models versus theoretical ones. In this paper we close this gap by proving that: (i) PointNet is not equivariant universal; and (ii) adding a single linear transmission layer makes PointNet universal. We call this architecture PointNetST and argue it is the simplest permutation equivariant universal model known to date. Another consequence is that DeepSets is universal, and also PointNetSeg, a popular point cloud segmentation network (used, e.g., in [Qi et al. 2017]), is universal. The key theoretical tool used to prove the above results is an explicit characterization of all permutation equivariant polynomial layers. Lastly, we provide numerical experiments validating the theoretical results and comparing different permutation equivariant models.


## 1 Introduction

Many interesting tasks in machine learning can be described by functions F that take as input a set of n elements, represented as a matrix X ∈ ℝ^{n×k}, and output some per-element features or values F(X) ∈ ℝ^{n×l}. Permutation equivariance is the property required of F so that it is well-defined on sets. Namely, it assures that reshuffling the elements in X and applying F results in the same output, reshuffled in the same manner. For example, if F(x_1, x_2, x_3) = (y_1, y_2, y_3) then F(x_2, x_1, x_3) = (y_2, y_1, y_3).

Building neural networks that are permutation equivariant by construction has proved extremely useful in practice, where the most popular models, DeepSets Zaheer et al. (2017) and PointNet Qi et al. (2017), enjoy a small number of parameters, a low memory footprint, and computational efficiency, along with high empirical expressiveness. Although both DeepSets and PointNet are known to be invariant universal (i.e., they can approximate arbitrary invariant continuous functions), they are not known to be equivariant universal (i.e., able to approximate arbitrary equivariant continuous functions).

On the other hand, several researchers have suggested theoretical permutation equivariant models and proved they are equivariant universal. Sannai et al. (2019) builds a universal equivariant network by taking n copies of S_{n−1}-invariant networks and combining them with a layer that is not permutation equivariant in the standard (above mentioned) sense. Keriven and Peyré (2019) solves a more general problem of building networks that are equivariant universal over arbitrary high-order input tensors (including graphs); their construction, however, uses higher-order tensors as hidden variables, which is of less practical value. Yarotsky (2018) proves that neural networks constructed using a finite set of invariant and equivariant polynomial layers are also equivariant universal; however, his network is not explicit (i.e., the polynomials are not characterized for the equivariant case) and is also of less practical interest due to the high-degree polynomial layers.

In this paper we close the gap between the practical and theoretical permutation equivariant constructions and prove:

###### Theorem 1.
1. PointNet is not equivariant universal.

2. Adding a single linear transmission layer (i.e., X ↦ 1 1^T X) to PointNet makes it equivariant universal.

3. Using the ReLU activation, the minimal width ω required for a universal permutation equivariant network satisfies ω ≤ k + t + max(k, l), where t = (n+k choose k) − 1 is the number of power-sum multi-symmetric polynomials (see Section 4).

This theorem suggests that, arguably, PointNet with the addition of a single linear transmission layer is the simplest universal equivariant network, able to learn arbitrary continuous equivariant functions of sets. An immediate corollary of this theorem is

###### Corollary 1.

DeepSets and PointNetSeg are universal.

PointNetSeg is a network used often for point cloud segmentation (e.g., in Qi et al. (2017)). One of the benefits of our result is that it provides a simple characterization of universal equivariant architectures that can be used in the network design process to guarantee universality.

The theoretical tool used for the proof of Theorem 1 is an explicit characterization of the permutation equivariant polynomials over sets of vectors in ℝ^k using power-sum multi-symmetric polynomials. We prove:

###### Theorem 2.

Let P: ℝ^{n×k} → ℝ^{n×l} be a permutation equivariant polynomial map. Then,

 P(X) = ∑_{|α|≤n} b_α q_α^T, (1)

where b_α = (x_1^α, …, x_n^α)^T ∈ ℝ^n, and q_α = q_α(s_1(X), …, s_t(X)) ∈ ℝ^l, whose entries are invariant polynomials; s_1, …, s_t are the power-sum multi-symmetric polynomials. On the other hand, every polynomial map satisfying Equation 1 is equivariant.

This theorem, which extends a result in Golubitsky and Stewart (2002) to sets of vectors using multivariate polynomials, lends itself to expressing arbitrary equivariant polynomials as a composition of entry-wise continuous functions and a single linear transmission layer, which in turn facilitates the proof of Theorem 1.

We conclude the paper by numerical experiments validating the theoretical results and testing several permutation equivariant networks for the tasks of set classification and regression.

## 2 Preliminaries

#### Equivariant maps.

Vectors are by default column vectors; 0, 1 are the all-zero and all-one vectors/tensors; e_i is the i-th standard basis vector; I is the identity matrix; all dimensions are inferred from context or mentioned explicitly. We represent a set of n vectors in ℝ^k as a matrix X ∈ ℝ^{n×k} and denote X = (x_1, …, x_n)^T, where x_i ∈ ℝ^k, i ∈ [n] = {1, …, n}, are the rows of X. We denote by S_n the permutation group of [n]; its action on X ∈ ℝ^{n×k} is defined by (σ ⋅ X)_{ij} = X_{σ^{−1}(i), j}, σ ∈ S_n. That is, σ is reshuffling the rows of X. The natural class of maps assigning a value or feature vector to every element in an input set is permutation equivariant maps:

###### Definition 1.

A map F: ℝ^{n×k} → ℝ^{n×l} satisfying F(σ ⋅ X) = σ ⋅ F(X) for all σ ∈ S_n and X ∈ ℝ^{n×k} is called permutation equivariant.
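As a sanity check, the definition can be verified numerically for the simplest equivariant map, a shared per-row affine function (the basic PointNet-style layer). This is a minimal numpy sketch; the layer, shapes, and random inputs are illustrative, not taken from the paper:

```python
import numpy as np

def pointnet_layer(X, W, b):
    # Apply the same affine map to every row of X: this is trivially equivariant.
    return X @ W + b

rng = np.random.default_rng(0)
n, k = 5, 3
X = rng.standard_normal((n, k))
W = rng.standard_normal((k, 4))
b = rng.standard_normal(4)

perm = rng.permutation(n)
# Equivariance: permuting rows first and then applying the layer
# equals applying the layer and then permuting rows.
lhs = pointnet_layer(X[perm], W, b)
rhs = pointnet_layer(X, W, b)[perm]
assert np.allclose(lhs, rhs)
```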

#### Power-sum multi-symmetric polynomials.

For a vector z ∈ ℝ^k and a multi-index vector α ∈ ℕ^k we define z^α = z_1^{α_1} z_2^{α_2} ⋯ z_k^{α_k}, and |α| = ∑_{j=1}^k α_j. Given a vector y ∈ ℝ^n, the power-sum symmetric polynomials p_j(y) = ∑_{i=1}^n y_i^j, with j ∈ [n], uniquely characterize y up to permuting its entries. In other words, for y, y′ ∈ ℝ^n we have y′ = σ ⋅ y for some σ ∈ S_n if and only if p_j(y) = p_j(y′) for all j ∈ [n]. An equivalent property is that every invariant polynomial p can be expressed as a polynomial in the power-sum symmetric polynomials, i.e., p(y) = q(p_1(y), …, p_n(y)); see Rydh (2007), Corollary 8.4, and Briand (2004), Theorem 3.
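The characterization of a vector up to permutation by its power sums can be checked numerically. The following sketch (with made-up test vectors) compares the power sums of a vector, a permutation of it, and a non-permutation:

```python
import numpy as np

def power_sums(x):
    # p_j(x) = sum_i x_i**j for j = 1..n; for x in R^n these
    # determine x up to a permutation of its entries.
    n = len(x)
    return np.array([np.sum(x ** j) for j in range(1, n + 1)])

x = np.array([2.0, -1.0, 3.0])
y = np.array([3.0, 2.0, -1.0])   # a permutation of x
z = np.array([2.0, -1.0, 2.5])   # not a permutation of x

assert np.allclose(power_sums(x), power_sums(y))
assert not np.allclose(power_sums(x), power_sums(z))
```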

A generalization of the power-sum symmetric polynomials to matrices X ∈ ℝ^{n×k} exists and is called the power-sum multi-symmetric polynomials, defined with a slight abuse of notation: s_α(X) = ∑_{i=1}^n x_i^α, where α ∈ ℕ^k is a multi-index satisfying |α| ≤ n. Note that the number of power-sum multi-symmetric polynomials acting on X ∈ ℝ^{n×k} is t = (n+k choose k) − 1. For notational simplicity let α_1, …, α_t be a list of all α ∈ ℕ^k with 1 ≤ |α| ≤ n. Then we index the collection of power-sum multi-symmetric polynomials as s_j(X) = ∑_{i=1}^n x_i^{α_j}, j ∈ [t].

Similarly to the vector case, the numbers s_j(X), j ∈ [t], characterize X ∈ ℝ^{n×k} up to permutation of its rows. That is, X′ = σ ⋅ X for some σ ∈ S_n iff s_j(X) = s_j(X′) for all j ∈ [t]. Furthermore, every invariant polynomial p can be expressed as a polynomial in the power-sum multi-symmetric polynomials (see Rydh (2007), Corollary 8.4), i.e.,

 p(X) = q(s_1(X), …, s_t(X)). (2)

These polynomials were recently used to encode multi-sets in Maron et al. (2019).
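The count t = (n+k choose k) − 1 and the invariance of the s_j can likewise be verified with a small numpy sketch; the enumeration of multi-indices and the example matrix are illustrative:

```python
import itertools
from math import comb

import numpy as np

def multi_indices(k, n):
    # All multi-indices alpha in N^k with 1 <= |alpha| <= n.
    return [a for a in itertools.product(range(n + 1), repeat=k)
            if 1 <= sum(a) <= n]

def power_sum_multi(X):
    # s_alpha(X) = sum_i prod_j x_{ij}^{alpha_j}, one value per multi-index.
    n, k = X.shape
    return np.array([np.sum(np.prod(X ** np.array(a), axis=1))
                     for a in multi_indices(k, n)])

n, k = 3, 2
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
s = power_sum_multi(X)
assert len(s) == comb(n + k, k) - 1          # t = binom(n+k, k) - 1 polynomials
assert np.allclose(s, power_sum_multi(X[[2, 0, 1]]))  # invariance to row permutation
```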

## 3 Equivariant multi-symmetric polynomial layers

In this section we develop the main theoretical tool of this paper, namely, a characterization of all permutation equivariant polynomial layers. As far as we know, these layers were not fully characterized before.

Theorem 2 provides an explicit representation of arbitrary permutation equivariant polynomial maps using the basis of power-sum multi-symmetric polynomials s_j. The particular use of power-sum polynomials has the advantage that it can be encoded efficiently using a neural network: as we will show, it can be approximated using a PointNet with a single linear transmission layer. This allows approximating an arbitrary equivariant polynomial map using PointNet with a single linear transmission layer. In contrast, Yarotsky (2018) also provides a polynomial characterization of equivariant maps (see Lemma 2.1 and Proposition 2.4 in (Yarotsky, 2018)); however, no formula is given for the invariant/equivariant generating set of polynomials, and their efficient implementation/approximation is therefore questionable.

A version of this theorem for vectors instead of matrices (i.e., the case k = 1) appears in Golubitsky and Stewart (2002); we extend their proof to matrices, which is the relevant scenario for ML applications as it allows working with sets of vectors.

First, note that it is enough to prove Theorem 2 for l = 1 and apply it to every column of P. Hence, we deal with a vector of polynomials p: ℝ^{n×k} → ℝ^n and need to prove it can be expressed as p = ∑_{|α|≤n} b_α q_α, for invariant polynomials q_α.

Given a polynomial p: ℝ^{n×k} → ℝ and the cyclic permutation σ = (1 2 ⋯ n), the following operation, taking a polynomial to a vector of polynomials, is useful in characterizing equivariant polynomial maps:

 ⌈p⌉(X) = (p(X), p(σ ⋅ X), p(σ² ⋅ X), …, p(σ^{n−1} ⋅ X))^T (3)

Theorem 2 will be proved using the following two lemmas:

###### Lemma 1.

Let p: ℝ^{n×k} → ℝ^n be an equivariant polynomial map. Then, there exists a polynomial q: ℝ^{n×k} → ℝ, invariant to S_{n−1} (permuting the last n−1 rows of X), so that p = ⌈q⌉.

###### Proof.

Equivariance of p means that for all σ ∈ S_n it holds that

 σ ⋅ p(X) = p(σ ⋅ X). (4)

Choosing an arbitrary permutation fixing the first row, namely a permutation satisfying σ(1) = 1, and observing the first row in Equation 4, we get p_1(X) = p_1(σ ⋅ X). Since this is true for all such σ, p_1 is S_{n−1}-invariant. Next, applying powers of the cyclic permutation σ to Equation 4 and observing the first row again, we get p_i(X) = p_1(σ^{i−1} ⋅ X), i ∈ [n]. Using the invariance of p_1 to S_{n−1} we get p = ⌈p_1⌉. ∎

###### Lemma 2.

Let p: ℝ^{n×k} → ℝ be a polynomial invariant to S_{n−1} (permuting the last n−1 rows of X); then

 p(X) = ∑_{|α|≤n} x_1^α q_α(X), (5)

where q_α are invariant.

###### Proof.

Expanding p with respect to x_1 we get

 p(X) = ∑_{|α|≤m} x_1^α p_α(x_2, …, x_n), (6)

for some m ∈ ℕ. We first claim the p_α are invariant, i.e., invariant to permutations of x_2, …, x_n. Indeed, note that if p is invariant to permutations of x_2, …, x_n, then also its derivatives ∂^α p / ∂x_1^α are permutation invariant, for all α. Taking the derivative on both sides of Equation 6 we get that each p_α is invariant.

For brevity denote p_α = p_α(x_2, …, x_n). Since p_α is invariant, it can be expressed as a polynomial in the power-sum multi-symmetric polynomials of x_2, …, x_n. Note that ∑_{i=2}^n x_i^{α_j} = s_j(X) − x_1^{α_j} and therefore

 p_α(x_2, …, x_n) = r(s_1(X) − x_1^{α_1}, …, s_t(X) − x_1^{α_t}).

Since r is a polynomial, expanding its monomials in x_1 shows p_α can be expressed as ∑_β x_1^β r_β(X), where β ∈ ℕ^k and the r_β are invariant (as products of the invariant polynomials s_j(X)). Plugging this into Equation 6 we get Equation 5, possibly with the sum over some larger degree. It remains to show the degree can be taken to be at most n. This is proved in Corollary 5 in Briand (2004). ∎

###### Proof.

(Theorem 2) Given an equivariant p as above, use Lemma 1 to write p = ⌈q⌉, where q is invariant to permuting the last n−1 rows of X. Use Lemma 2 to write q(X) = ∑_{|α|≤n} x_1^α q_α(X), where q_α are invariant. We get,

 p = ⌈q⌉ = ∑_{|α|≤n} b_α q_α.

The converse direction is immediate after noting that the b_α are equivariant and the q_α are invariant. ∎

## 4 Universality of set equivariant neural networks

We consider equivariant deep neural networks F: ℝ^{n×k} → ℝ^{n×l},

 F(X) = L_m ∘ ν ∘ ⋯ ∘ ν ∘ L_1(X), (7)

where L_i: ℝ^{n×k_i} → ℝ^{n×k_{i+1}} are affine equivariant transformations, and ν is an entry-wise non-linearity (e.g., ReLU). We define the width of the network to be ω = max_i k_i; note that this definition is different from the one used for a standard MLP, where the width of the corresponding layer would be n k_i, see e.g., Hanin and Sellke (2017). Zaheer et al. (2017) proved that affine equivariant L_i are of the form

 L_i(X) = X A + (1/n) 1 1^T X B + 1 c^T, (8)

where A, B ∈ ℝ^{k_i × k_{i+1}} and c ∈ ℝ^{k_{i+1}} are the layer's trainable parameters; we call the linear transformation X ↦ (1/n) 1 1^T X B a linear transmission layer. Equation 7 with the choice of layers as in Equation 8 is the DeepSets architecture (Zaheer et al., 2017). Taking B = 0 in all layers gives the PointNet architecture (Qi et al., 2017). Another type of architecture of interest is PointNetSeg, appearing in (Qi et al., 2017). Variations of this architecture are used for point cloud segmentation, see, e.g., Li et al. (2018) and Engelmann et al. (2017). PointNetSeg uses an invariant version of PointNet (i.e., PointNet composed with an invariant max layer) and concatenates its output as a constant feature to an intermediate layer, which is then input to another PointNet. Lastly, we will also consider a model we call PointNetST, that is, PointNet with a Single (linear) Transmission layer; in more detail, PointNetST is an equivariant model of the form Equation 7 with layers as in Equation 8 where only a single layer has a non-zero B (see Equation 8). We will prove PointNetST is permutation equivariant universal and therefore arguably the simplest permutation equivariant universal model known to date.
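The layer in Equation 8 and the PointNetST construction can be sketched directly in numpy; the shapes, depth, and random parameters below are illustrative, and the final assertion checks permutation equivariance of the whole stack:

```python
import numpy as np

def relu(X):
    return np.maximum(X, 0.0)

def equivariant_affine(X, A, B, c):
    # Equation 8: L(X) = X A + (1/n) 1 1^T X B + 1 c^T.
    # B = 0 recovers a PointNet (per-point) layer; B != 0 adds the transmission term.
    n = X.shape[0]
    ones = np.ones((n, 1))
    return X @ A + (ones @ ones.T @ X) @ B / n + ones @ c.reshape(1, -1)

def pointnet_st(X, params):
    # PointNetST sketch: per-point layers everywhere except one transmission layer.
    for A, B, c in params[:-1]:
        X = relu(equivariant_affine(X, A, B, c))
    A, B, c = params[-1]
    return equivariant_affine(X, A, B, c)

rng = np.random.default_rng(1)
n, k = 6, 3
X = rng.standard_normal((n, k))
# Two per-point layers (B = 0) and one final layer with a transmission term.
params = [
    (rng.standard_normal((3, 8)), np.zeros((3, 8)), rng.standard_normal(8)),
    (rng.standard_normal((8, 8)), np.zeros((8, 8)), rng.standard_normal(8)),
    (rng.standard_normal((8, 2)), rng.standard_normal((8, 2)), rng.standard_normal(2)),
]
perm = rng.permutation(n)
assert np.allclose(pointnet_st(X[perm], params), pointnet_st(X, params)[perm])
```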

Universality of equivariant deep networks is defined next.

###### Definition 2.

Permutation equivariant universality (or just equivariant universality, for short) of a model F means that for every permutation equivariant continuous function H defined over the cube K = [0,1]^{n×k}, and every ε > 0, there exists a choice of m (i.e., network depth), k_i (i.e., network width) and the trainable parameters of F so that ‖F(X) − H(X)‖_∞ < ε for all X ∈ K.

We prove our main theorem next.

Proof. (Theorem 1) Fact (i), namely that PointNet is not equivariant universal, is a consequence of the following simple lemma:

###### Lemma 3.

Let f: ℝ^n → ℝ^n, n ≥ 2, be the equivariant linear function defined by f(X) = 1 1^T X. There is no g of PointNet form, g(X) = (h(x_1), …, h(x_n))^T, so that ‖f(X) − g(X)‖_∞ < 1/2 for all X ∈ [0,1]^n.

###### Proof.

Assume such a g exists. Let e_1 = (1, 0, …, 0)^T ∈ ℝ^n and note that the second coordinates satisfy f(e_1)_2 = 1, f(0)_2 = 0, while g(e_1)_2 = h(0) = g(0)_2. Then,

 1 = |f(e_1)_2 − f(0)_2| ≤ |f(e_1)_2 − g(e_1)_2| + |g(0)_2 − f(0)_2| < 1,

a contradiction. ∎

To prove (ii) we first reduce the problem from the class of all continuous equivariant functions to the class of equivariant polynomials. This is justified by the following lemma.

###### Lemma 4.

Equivariant polynomials are dense in the space of continuous equivariant functions over the cube [0,1]^{n×k}.

###### Proof.

Take an arbitrary ε > 0. Consider the functions F_{ij}, where F_{ij}(X) denotes the (i, j)-th output entry of F. By the Stone-Weierstrass Theorem there exists a polynomial p_{ij} such that |F_{ij}(X) − p_{ij}(X)| < ε for all X in the cube. Consider the polynomial map P defined by P(X)_{ij} = p_{ij}(X). P is in general not equivariant. To finish the proof we will symmetrize P:

 ‖F(X) − (1/n!) ∑_{σ∈S_n} σ ⋅ P(σ^{−1} ⋅ X)‖_∞
 = ‖(1/n!) ∑_{σ∈S_n} σ ⋅ F(σ^{−1} ⋅ X) − (1/n!) ∑_{σ∈S_n} σ ⋅ P(σ^{−1} ⋅ X)‖_∞
 = ‖(1/n!) ∑_{σ∈S_n} σ ⋅ (F(σ^{−1} ⋅ X) − P(σ^{−1} ⋅ X))‖_∞
 ≤ (1/n!) ∑_{σ∈S_n} ε = ε,

where in the first equality we used the fact that F is equivariant. This concludes the proof since (1/n!) ∑_{σ∈S_n} σ ⋅ P(σ^{−1} ⋅ X) is an equivariant polynomial map. ∎

Now, according to Theorem 2 an arbitrary equivariant polynomial P can be written as P(X) = ∑_{|α|≤n} b_α q_α^T, where b_α = (x_1^α, …, x_n^α)^T and the entries of q_α are invariant polynomials. Remember that every invariant polynomial can be expressed as a polynomial in the power-sum multi-symmetric polynomials s_j(X) = (1/n) ∑_{i=1}^n x_i^{α_j}, j ∈ [t] (we use the normalized version for a bit more simplicity later on). We can therefore write P as a composition of three maps:

 P=Q∘L∘B, (9)

where B: ℝ^{n×k} → ℝ^{n×(k+t)} is defined by

 B(X) = (b(x_1), …, b(x_n))^T,

with b(x) = (x, x^{α_1}, …, x^{α_t}); L is defined as in Equation 8 with A = ∑_{j=1}^k e_j e_j^T and B = ∑_{j=k+1}^{k+t} e_j e_j^T, where e_j represents the standard basis (as usual) and c = 0. Note that the output of L ∘ B is of the form

 L(B(X)) = (X, 1 s_1(X), 1 s_2(X), …, 1 s_t(X)).

Finally, Q: ℝ^{n×(k+t)} → ℝ^{n×l} is defined by

 Q(X, 1 s_1, …, 1 s_t) = (q(x_1, s_1, …, s_t), …, q(x_n, s_1, …, s_t))^T,

where q(x, s_1, …, s_t) = ∑_{|α|≤n} x^α q_α(s_1, …, s_t)^T ∈ ℝ^l.

The decomposition in Equation 9 of P suggests that replacing B and Q with Multi-Layer Perceptrons (MLPs) would lead to a universal permutation equivariant network consisting of PointNet with a single linear transmission layer, namely PointNetST.

The approximating F will be defined as

 F = Ψ ∘ L ∘ Φ, (10)

where Φ and Ψ are both of PointNet architecture; namely, there exist MLPs φ and ψ so that Φ(X) = (φ(x_1), …, φ(x_n))^T and Ψ(Y) = (ψ(y_1), …, ψ(y_n))^T. See Figure 1 for an illustration of F.

To build the MLPs φ, ψ, we will first construct ψ to approximate q; that is, we use the universality of MLPs (see (Hornik, 1991; Sonoda and Murata, 2017; Hanin and Sellke, 2017)) to construct ψ so that |ψ(y) − q(y)| < ε/2 for all y in the relevant cube. Furthermore, as ψ over the cube is uniformly continuous, let δ > 0 be such that if ‖y − y′‖_∞ < δ, then |ψ(y) − ψ(y′)| < ε/2. Now, we use universality again to construct φ approximating b; that is, we take φ so that ‖φ(x) − b(x)‖_∞ < δ for all x ∈ [0,1]^k.

 ‖F(X) − P(X)‖_∞ ≤ ‖Ψ(L(Φ(X))) − Ψ(L(B(X)))‖_∞ + ‖Ψ(L(B(X))) − Q(L(B(X)))‖_∞ = err_1 + err_2

First, ‖φ(x_i) − b(x_i)‖_∞ < δ for all i ∈ [n] and therefore ‖L(Φ(X)) − L(B(X))‖_∞ < δ, since every entry of L(Y) is either an entry of Y or an average of entries of Y. Second, as the rows of L(Φ(X)) and L(B(X)) are within δ of each other, the choice of δ gives err_1 < ε/2. Therefore, by construction of ψ we also have err_2 < ε/2, and ‖F(X) − P(X)‖_∞ < ε.

To prove (iii) we use the result in Hanin and Sellke (2017) (see Theorem 1 therein) bounding the width of an MLP approximating a continuous function f: ℝ^{d_in} → ℝ^{d_out} by d_in + d_out. Therefore, the width of the MLP φ: ℝ^k → ℝ^{k+t} is bounded by 2k + t, while the width of the MLP ψ: ℝ^{k+t} → ℝ^l is bounded by k + t + l, proving the bound.                                     ∎

Before we prove Corollary 1, let us recall the PointNetSeg model from Qi et al. (2017) in detail. We write the model as follows: first, a PointNet network is applied to X. Let Y be the output of the first part of the network and let v be the vector whose j-th coordinate is the maximal value over the j-th column of Y. Then the constant matrix 1 v^T is concatenated to an intermediate layer, and the result is fed into a second PointNet network.

We can now prove Corollary 1.

###### Proof.

(Corollary 1)

The fact that the DeepSets model is equivariant universal is immediate. Indeed, the PointNetST model can be obtained from the DeepSets model by setting B = 0 in all but one layer, with B as in Equation 8.

For the PointNetSeg model, note that by Theorem 1 in Qi et al. (2017) every invariant function can be approximated by a network of the form

 φ̄ ∘ max ∘ Ψ̄,

where the max is taken column-wise, with φ̄, Ψ̄ multi-layer perceptrons. In particular, for every ε > 0 there exists such a network approximating the power-sum multi-symmetric polynomials s_1, …, s_t to within ε for every X in the cube. The rest of the proof follows closely the proof of Theorem 1.

## 5 Experiments

We conducted experiments in order to validate our theoretical observations. We compared the results of several equivariant models, as well as a baseline (full) MLP, on three equivariant learning tasks: a classification task (knapsack) and two regression tasks (squared norm and Fiedler eigenvector). For all tasks we compare the results of different models: DeepSets, PointNet, PointNetSeg, PointNetST, and PointNetQT. PointNetQT is PointNet with a single quadratic equivariant transmission layer, as defined in the appendix. In all experiments we used a network of the form Equation 7 with fixed depth and varying width, constant across all layers.

#### Equivariant classification.

For classification, we chose to learn the multidimensional knapsack problem, which is known to be NP-hard. We are given a set of n 4-vectors, represented by X ∈ ℝ^{n×4}, and our goal is to learn the equivariant classification function f defined by the following optimization problem:

 f(X) = argmax_z ∑_{i=1}^n x_{i1} z_i
 s.t. ∑_{i=1}^n x_{ij} z_i ≤ w_j, j = 2, 3, 4
 z_i ∈ {0, 1}, i ∈ [n]

Intuitively, given a set of vectors x_i ∈ ℝ^4, i ∈ [n], where each row of X represents an element in the set, our goal is to find a subset maximizing the total value while satisfying the budget constraints. The first column of X defines the value of each element, and the three other columns the costs. In Subsection 5.1 we detail how we generated this dataset.
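For small n, the labeling function f can be computed by brute force over all 2^n subsets (the paper uses dynamic programming for the actual dataset; this exhaustive version is only a sketch with a single cost column and made-up numbers):

```python
import itertools

import numpy as np

def knapsack_labels(X, w):
    # Brute-force label for the equivariant knapsack problem (small n only):
    # choose z in {0,1}^n maximizing sum_i X[i,0] * z_i subject to the
    # per-column budget constraints on the remaining (cost) columns.
    n = X.shape[0]
    best_val, best_z = -np.inf, np.zeros(n, dtype=int)
    for bits in itertools.product([0, 1], repeat=n):
        z = np.array(bits)
        if np.all(X[:, 1:].T @ z <= w) and X[:, 0] @ z > best_val:
            best_val, best_z = X[:, 0] @ z, z
    return best_z

X = np.array([[10.0, 5.0], [7.0, 4.0], [6.0, 3.0]])  # value column + one cost column
w = np.array([7.0])
z = knapsack_labels(X, w)
assert z.tolist() == [0, 1, 1]   # items 2 and 3: value 13, cost 7 <= 7
```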

#### Equivariant regression.

The first equivariant function we considered for regression is the squared-norm function. Hanin and Sellke (2017) showed this function cannot be approximated by MLPs of small width. We drew training and test examples i.i.d., entry-wise, from a fixed distribution.

The second equivariant function we considered is defined on point clouds X ∈ ℝ^{n×3}. For each point cloud we computed a graph by connecting every point to its nearest neighbors. We then computed the absolute value of the first non-trivial eigenvector of the graph Laplacian. We used the ModelNet dataset (Wu et al., 2015), which contains training and test meshes. The point clouds are generated by randomly sampling points from each mesh.
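The Fiedler-vector target can be sketched as follows; the nearest-neighbor count and the toy collinear point cloud are illustrative (the paper samples point clouds from ModelNet), and the assertion checks the permutation equivariance of the target, with the eigenvector's sign ambiguity removed by the absolute value:

```python
import numpy as np

def fiedler_abs(P, num_nn=3):
    # Build a symmetric num_nn-nearest-neighbor graph on the points (rows of P),
    # form the unweighted graph Laplacian L = D - A, and return the absolute
    # value of the eigenvector of the first non-trivial (second smallest) eigenvalue.
    n = P.shape[0]
    D2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    A = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D2[i])[1:num_nn + 1]  # skip the point itself
        A[i, nn] = 1.0
    A = np.maximum(A, A.T)                    # symmetrize the adjacency
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    return np.abs(eigvecs[:, 1])

# Six points on a line with irregular gaps (distinct distances, so no k-NN ties).
P = np.array([[0.0, 0, 0], [1.1, 0, 0], [2.5, 0, 0],
              [3.2, 0, 0], [4.9, 0, 0], [6.0, 0, 0]])
y = fiedler_abs(P)
perm = np.array([3, 0, 5, 1, 4, 2])
assert np.allclose(fiedler_abs(P[perm]), y[perm])
```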

Figure 2 summarizes the train and test accuracy of the 6 models after training (training details in Subsection 5.1) as a function of the network width. We tested 15 equidistant width values.

As can be seen in the graphs, on all three datasets the equivariant universal models (PointNetST, PointNetQT, DeepSets, PointNetSeg) achieved comparable accuracy. PointNet, which is not equivariant universal, consistently achieved inferior performance compared to the universal models, as expected by the theory developed in this paper. The non-equivariant MLP, although universal, used the same width (i.e., the same number of parameters) as the equivariant models and was able to over-fit only one train set (the quadratic function); its performance on the test sets was inferior by a large margin to the equivariant models.

An interesting point is that although the width used in the experiments is much smaller than the bound established by Theorem 1, the universal models are still able to learn the functions we tested on well. This raises the question of the tightness of this bound, which we leave to future work.

Lastly, we see that adding a quadratic transmission layer (PointNetQT) has a marginal effect on model performance, while adding a single linear transmission layer to a PointNet network gives results similar to the DeepSets architecture with a lower memory footprint.

### 5.1 Implementation details

#### Knapsack data generation.

We constructed a dataset of training and test examples consisting of n×4 matrices. To generate each example X, we drew an integer bound uniformly at random and then randomly chose integers up to that bound as the first (value) column of X; the three cost columns and the budgets w_j were generated similarly. The labels for each input were computed by a standard dynamic programming approach, see Martello and Toth (1990).

#### Optimization.

We implemented the experiments in PyTorch (Paszke et al., 2017) with the Adam optimizer (Kingma and Ba, 2014). For classification we used the cross entropy loss and trained for 150 epochs with learning rate 0.001, learning rate decay of 0.5 every 100 epochs, and batch size 32. For the quadratic function regression we trained for 150 epochs with learning rate 0.001, learning rate decay of 0.1 every 50 epochs, and batch size 64; for the regression to the leading eigenvector we trained for 50 epochs with learning rate 0.001 and batch size 32.

## References

• E. Briand (2004) When is the algebra of multisymmetric polynomials generated by the elementary multisymmetric polynomials. Beiträge zur Algebra und Geometrie 45. Cited by: §2, §3.
• F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe (2017) Exploring spatial context for 3d semantic segmentation of point clouds. In IEEE International Conference on Computer Vision, 3DRMS Workshop, ICCV. Cited by: §4.
• M. Golubitsky and I. Stewart (2002) The symmetry perspective. Cited by: §1, §3.
• B. Hanin and M. Sellke (2017) Approximating continuous functions by relu nets of minimal width. CoRR abs/1710.11278. External Links: Link, 1710.11278 Cited by: §4, §4, §4, §5.
• K. Hornik (1991) Approximation capabilities of multilayer feedforward networks. Neural networks 4 (2), pp. 251–257. Cited by: §4.
• N. Keriven and G. Peyré (2019) Universal invariant and equivariant graph neural networks. arXiv preprint arXiv:1905.04943. Cited by: On Universal Equivariant Set Networks, §1.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
• J. Li, B. M. Chen, and G. H. Lee (2018) SO-net: self-organizing network for point cloud analysis. arXiv preprint arXiv:1803.04249. Cited by: §4.
• H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman (2019) Provably powerful graph networks. CoRR abs/1905.11136. External Links: Link, 1905.11136 Cited by: §2.
• S. Martello and P. Toth (1990) Knapsack problems: algorithms and computer implementations. John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 0-471-92420-2 Cited by: §5.1.
• A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §5.1.
• C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: On Universal Equivariant Set Networks, §1, §1, §4, §4, §4.
• D. Rydh (2007) A minimal set of generators for the ring of multisymmetric functions. In Annales de l’institut Fourier, Vol. 57, pp. 1741–1769. Cited by: §2, §2.
• A. Sannai, Y. Takai, and M. Cordonnier (2019) Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939. Cited by: On Universal Equivariant Set Networks, §1.
• S. Sonoda and N. Murata (2017) Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis 43 (2), pp. 233–268. Cited by: §4.
• Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D shapenets: a deep representation for volumetric shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
• D. Yarotsky (2018) Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306. Cited by: §1, §3.
• M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in neural information processing systems, pp. 3391–3401. Cited by: On Universal Equivariant Set Networks, §1, §4.

## Appendix A Appendix

One potential application of Theorem 2 is augmenting an equivariant neural network (Equation 7) with equivariant polynomial layers of some maximal degree d. This can be done in the following way: look for all multi-indices α and invariant polynomials q_α so that the total degree of b_α q_α^T is at most d. Any such solution will give a basis element of the form b_α q_α^T.

In the paper we tested PointNetQT, an architecture that adds to PointNet a single quadratic equivariant layer. We opted to use only the quadratic transmission operators: for a matrix X ∈ ℝ^{n×k} we define L as follows:

 L(X) = X W_1 + 1 1^T X W_2 + ((1 1^T X) ⊙ (1 1^T X)) W_3 + (X ⊙ X) W_4 + ((1 1^T X) ⊙ X) W_5,

where ⊙ is pointwise (Hadamard) multiplication and W_1, …, W_5 are the learnable parameters.
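A direct numpy sketch of this quadratic layer (matching the formula above; shapes and random parameters are illustrative) confirms that each term, and hence the whole layer, is permutation equivariant:

```python
import numpy as np

def quadratic_transmission(X, W1, W2, W3, W4, W5):
    # The quadratic equivariant layer from the appendix:
    # L(X) = X W1 + 1 1^T X W2 + ((1 1^T X) ⊙ (1 1^T X)) W3
    #        + (X ⊙ X) W4 + ((1 1^T X) ⊙ X) W5,
    # where ⊙ is entry-wise multiplication.
    n = X.shape[0]
    S = np.ones((n, n)) @ X    # 1 1^T X: every row holds the column sums
    return X @ W1 + S @ W2 + (S * S) @ W3 + (X * X) @ W4 + (S * X) @ W5

rng = np.random.default_rng(3)
n, k, d = 5, 3, 4
X = rng.standard_normal((n, k))
Ws = [rng.standard_normal((k, d)) for _ in range(5)]
perm = rng.permutation(n)
# Column sums are invariant to row permutations, so every term is equivariant.
assert np.allclose(quadratic_transmission(X[perm], *Ws),
                   quadratic_transmission(X, *Ws)[perm])
```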