Kronecker Attention Networks

07/16/2020 ∙ by Hongyang Gao, et al. ∙ Texas A&M University

Attention operators have been applied to both 1-D data such as texts and higher-order data such as images and videos. The use of attention operators on high-order data requires flattening the spatial or spatial-temporal dimensions into a vector, which is assumed to follow a multivariate normal distribution. This not only incurs excessive requirements on computational resources, but also fails to preserve structures in the data. In this work, we propose to avoid flattening by assuming the data follow matrix-variate normal distributions. Based on this new view, we develop Kronecker attention operators (KAOs) that operate on high-order tensor data directly. More importantly, the proposed KAOs lead to dramatic reductions in computational resources. Experimental results show that our methods reduce the amount of required computational resources by a factor of hundreds, with larger factors for higher-dimensional and higher-order data. Results also show that networks with KAOs outperform models without attention, while achieving performance competitive with models using the original attention operators.


1. Introduction

Deep neural networks with attention operators have shown great capability in solving challenging tasks in various fields, such as natural language processing (Bahdanau et al., 2015; Vaswani et al., 2017; Johnson and Zhang, 2015), computer vision (Xu et al., 2015; Lu et al., 2016), and network embedding (Veličković et al., 2017; Gao et al., 2018b). Attention operators are able to capture long-range dependencies, resulting in significant performance boosts (Li et al., 2018; Malinowski et al., 2018). While attention operators were originally proposed for 1-D data, recent studies (Wang et al., 2018; Zhao et al., 2018; Gao and Ji, 2019) have attempted to apply them to high-order data, such as images and videos. However, a practical challenge of using attention operators on high-order data is the excessive requirement of computational resources, including computational cost and memory usage. For example, for 2-D image tasks, the time and space complexities are both quadratic in the product of the height and width of the input feature maps. This bottleneck becomes increasingly severe as the spatial or spatial-temporal dimensions and the order of the input data increase. Prior methods address this problem by either down-sampling the data before attention operators (Wang et al., 2018) or limiting the paths along which attention is computed (Huang et al., 2018).

In this work, we propose novel and efficient attention operators, known as Kronecker attention operators (KAOs), for high-order data. We investigate the above problem from a probabilistic perspective. Specifically, regular attention operators flatten the data and assume the flattened data follow multivariate normal distributions. This assumption not only results in high computational cost and memory usage, but also fails to preserve the spatial or spatial-temporal structures of data. We instead propose to use matrix-variate normal distributions to model the data, where the Kronecker covariance structure is able to capture relationships among spatial or spatial-temporal dimensions. Based on this new view, we propose our KAOs, which avoid flattening and operate on high-order data directly. Experimental results show that KAOs are as effective as original attention operators, while dramatically reducing the amount of required computational resources. In particular, we employ KAOs to design a family of efficient modules, leading to our compact deep models known as Kronecker attention networks (KANets). KANets significantly outperform prior compact models on the image classification task, with fewer parameters and less computational cost. Additionally, we perform experiments on image segmentation tasks to demonstrate the effectiveness of our methods in general application scenarios.

2. Background and Related Work

Figure 1. An illustration of the attention operator. Here, × denotes matrix multiplication, and softmax(·) is the column-wise softmax operator. Q, K, and V are the input matrices. A similarity score is computed between each query vector (a column of Q) and each key vector (a column of K). The column-wise softmax normalizes these scores so that each column sums to 1. Multiplying V by the normalized scores yields the corresponding output vectors.

In this section, we describe the attention and related non-local operators, which have been applied on various types of data such as texts, images and videos.

2.1. Attention Operator

The inputs to an attention operator include a query matrix Q = [q_1, …, q_n] ∈ R^{d×n} with each q_i ∈ R^d, a key matrix K = [k_1, …, k_m] ∈ R^{d×m} with each k_j ∈ R^d, and a value matrix V = [v_1, …, v_m] ∈ R^{c×m} with each v_j ∈ R^c. The attention operation computes the responses of a query vector by attending it to all key vectors in K and uses the results to take a weighted sum over the value vectors in V. The layer-wise forward-propagation operation of an attention operator can be expressed as

(1)   O = V softmax(K^T Q).

Matrix multiplication between K^T and Q results in a coefficient matrix E = K^T Q ∈ R^{m×n}, in which each element is calculated by the inner product between a key vector and a query vector. This coefficient matrix contains similarity scores between every query vector and every key vector, and it is normalized by a column-wise softmax operator so that every column sums to 1. The output O is obtained by multiplying V with the normalized coefficient matrix. In self-attention operators (Vaswani et al., 2017), we have Q = K = V. Figure 1 provides an illustration of the attention operator. The computational cost of Eq. (1) is O(dmn). The memory required for storing the intermediate coefficient matrix is O(mn). If m = n = N and c = d, the time and space complexities become O(dN^2) and O(N^2), respectively.
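To make the computation concrete, the following minimal NumPy sketch implements Eq. (1) under the column-vector convention reconstructed above; the function and variable names are illustrative and not the authors' code.

import numpy as np

def softmax_columns(x):
    # Column-wise softmax: each column of the result sums to 1.
    x = x - x.max(axis=0, keepdims=True)  # subtract the column max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=0, keepdims=True)

def attention(Q, K, V):
    """Attention of Eq. (1): O = V @ softmax(K^T Q).

    Q: (d, n) query matrix, K: (d, m) key matrix, V: (c, m) value matrix.
    Returns O: (c, n), one output column per query column.
    """
    E = Q if False else K.T @ Q     # (m, n) coefficient matrix of similarity scores
    A = softmax_columns(E)          # normalize each column to sum to 1
    return V @ A                    # (c, n) weighted sums of the value vectors

# Tiny usage example (self-attention: Q = K = V).
d, n = 4, 6
X = np.random.randn(d, n)
print(attention(X, X, X).shape)     # (4, 6)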

There are several other ways to compute the coefficient matrix from K and Q, including the Gaussian function, dot product, concatenation, and the embedded Gaussian function. It has been shown that the dot product is the simplest yet most effective one (Wang et al., 2018). Therefore, we focus on the dot-product similarity function in this work.

In practice, we can first perform separate linear transformations on each input matrix, resulting in the following attention operator:

O = (W_V V) softmax((W_K K)^T (W_Q Q)),

where W_Q, W_K, and W_V are learnable weight matrices applied to the query, key, and value matrices, respectively. For notational simplicity, we omit linear transformations in the following discussion.

2.2. Non-Local Operator

Figure 2. Conversion of a third-order tensor into a matrix by unfolding along mode-3. In this example, an h × w × c tensor is unfolded into a c × hw matrix.

Non-local operators, proposed in (Wang et al., 2018), apply self-attention operators to higher-order data such as images and videos. Taking 2-D data as an example, the input to the non-local operator is a third-order tensor X ∈ R^{h×w×c}, where h, w, and c denote the height, width, and number of channels, respectively. The tensor is first converted into a matrix X_(3) ∈ R^{c×hw} by unfolding along mode-3 (Kolda and Bader, 2009), as illustrated in Figure 2. We then perform the operation in Eq. (1) by setting Q = K = V = X_(3). The output of the attention operator is converted back into a third-order tensor as the final output.

One practical challenge of the non-local operator is that it consumes excessive computational resources. The computational cost of a 2-D non-local operator is O(c h^2 w^2), and storing the intermediate hw × hw coefficient matrix incurs O(h^2 w^2) space complexity. The time and space complexities are prohibitively high for high-dimensional and high-order data.
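The unfold-attend-fold pipeline and its cost can be illustrated with a short, self-contained NumPy sketch; the c × hw unfolding orientation is our assumption.

import numpy as np

def col_softmax(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def non_local_2d(X):
    """Self-attention on an (h, w, c) tensor via mode-3 unfolding, as in Eq. (1)."""
    h, w, c = X.shape
    X3 = X.reshape(h * w, c).T            # mode-3 unfolding: (c, h*w)
    E = col_softmax(X3.T @ X3)            # (h*w, h*w) coefficient matrix: the O(h^2 w^2) bottleneck
    O = X3 @ E                            # (c, h*w)
    return O.T.reshape(h, w, c)           # fold the output back into a tensor

print(non_local_2d(np.random.randn(14, 14, 8)).shape)  # (14, 14, 8)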

3. Kronecker Attention Networks

In this section, we describe our proposed Kronecker attention operators, which are efficient and effective attention operators on high-order data. We also describe how to use these operators to build Kronecker attention networks.

3.1. From Multivariate to Matrix-Variate Distributions

We analyze the problem of attention operators on high-order data and propose solutions from a probabilistic perspective. To illustrate the idea, we take the non-local operator on 2-D data in Section 2.2 as an example. Formally, consider a self-attention operator with Q = K = V = X_(3), where X_(3) ∈ R^{c×hw} is the mode-3 unfolding of a third-order input tensor X ∈ R^{h×w×c}, as illustrated in Figure 2. The k-th row of X_(3) corresponds to vec(X_{::k})^T, where X_{::k} ∈ R^{h×w} denotes the k-th frontal slice of X (Kolda and Bader, 2009), and vec(·) denotes the vectorization of a matrix by concatenating its columns (Gupta and Nagar, 2018).

The frontal slices of X are usually known as feature maps. In this view, the mode-3 unfolding is equivalent to vectorizing each feature map independently. It is worth noting that, in addition to vec(·), any other operation that transforms each feature map into a vector leads to the same output from the non-local operator, as long as a corresponding reverse operation is performed to fold the output back into a tensor. This fact indicates that the unfolding of X in non-local operators ignores the structural information within each feature map, i.e., the relationships among rows and columns. In addition, such unfolding results in excessive requirements on computational resources, as explained in Section 2.2.
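A quick NumPy check of the observation that mode-3 unfolding simply flattens each feature map independently (under the unfolding orientation assumed above):

import numpy as np

X = np.random.randn(5, 4, 3)                 # an (h, w, c) tensor
X3 = X.reshape(-1, X.shape[2]).T             # mode-3 unfolding: (c, h*w)
slice0 = X[:, :, 0].reshape(-1)              # feature map 0, flattened on its own
print(np.allclose(X3[0], slice0))            # True: row k of X3 is feature map k, flattened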

In the following discussion, we focus on one feature map C ∈ R^{h×w} by assuming that feature maps are conditionally independent of each other, given the feature maps of previous layers. This assumption is shared by many deep learning techniques that process each feature map independently, including the unfolding mentioned above, batch normalization (Ioffe and Szegedy, 2015), instance normalization (Ulyanov et al., 2016), and pooling operations (LeCun et al., 1998). Viewed from a probabilistic perspective (Ioffe and Szegedy, 2015; Ulyanov et al., 2016), the unfolding corresponds to assuming that vec(C) follows a multivariate normal distribution with an unstructured mean vector of length hw and an unstructured hw × hw covariance matrix. Apparently, the multivariate normal distribution does not model relationships among rows and columns in C. To address this limitation, we propose to model C using a matrix-variate normal distribution (Gupta and Nagar, 2018), defined below.

Definition 1.

A random matrix D ∈ R^{h×w} is said to follow a matrix-variate normal distribution with mean matrix M ∈ R^{h×w} and covariance matrix Ψ ⊗ Σ, where Σ ∈ R^{h×h} and Ψ ∈ R^{w×w}, if vec(D) ~ N(vec(M), Ψ ⊗ Σ). Here, ⊗ denotes the Kronecker product (Van Loan, 2000; Graham, 2018).

The matrix-variate normal distribution has separate covariance matrices for rows and columns, which interact through the Kronecker product to produce the covariance matrix of the flattened distribution. Specifically, for two elements d_{ij} and d_{kl} from different rows and columns of D, the relationship between them is modeled by the interactions between the i-th and k-th rows and the j-th and l-th columns. Therefore, the matrix-variate normal distribution is able to incorporate relationships among rows and columns.
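As a small numerical illustration of Definition 1, the sketch below draws a random matrix by sampling its vectorization from a normal distribution whose covariance is the Kronecker product of a column covariance Ψ and a row covariance Σ; the orientation convention (row covariance Σ ∈ R^{h×h}, column covariance Ψ ∈ R^{w×w}, column-stacking vec) follows our reconstruction of the definition, and all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
h, w = 3, 4

# Row covariance Sigma (h x h) and column covariance Psi (w x w), both positive definite.
A = rng.standard_normal((h, h)); Sigma = A @ A.T + h * np.eye(h)
B = rng.standard_normal((w, w)); Psi = B @ B.T + w * np.eye(w)
M = rng.standard_normal((h, w))                    # mean matrix

cov = np.kron(Psi, Sigma)                          # (h*w, h*w) Kronecker-product covariance
vec_D = rng.multivariate_normal(M.flatten(order="F"), cov)
D = vec_D.reshape((h, w), order="F")               # fold the column-stacked vec(D) back into a matrix
print(D.shape)                                     # (3, 4); rows couple through Sigma, columns through Psi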

3.2. The Proposed Mean and Covariance Structures

In machine learning, Kalaitzis et al. (Kalaitzis et al., 2013) proposed to use the Kronecker sum, instead of the Kronecker product, to form covariance matrices. Based on the above observations and studies, we propose to model C with a Kronecker-sum covariance Ψ ⊕ Σ, where Σ ∈ R^{h×h}, Ψ ∈ R^{w×w}, ⊕ denotes the Kronecker sum (Kalaitzis et al., 2013), defined as Ψ ⊕ Σ = Ψ ⊗ I_h + I_w ⊗ Σ, and I denotes an identity matrix. Covariance matrices following the Kronecker-sum structure can still capture the relationships among rows and columns (Kalaitzis et al., 2013). It also follows from (Allen and Tibshirani, 2010; Wang et al., 2019) that constraining the mean matrix allows a more direct modeling of the structural information within a feature map. Following these studies, we assume C follows a variant of the matrix-variate normal distribution as

(2)   vec(C) ~ N(vec(M), Ψ ⊕ Σ),

where the mean matrix M is restricted to be the outer sum of two vectors, defined as

(3)   M = μ 1_w^T + 1_h ν^T,

where μ ∈ R^h, ν ∈ R^w, and 1_k denotes a vector of all ones of size k.
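The covariance and mean structures of Eqs. (2)-(3), as reconstructed here, can be built explicitly as follows; the symbol names are our own.

import numpy as np

rng = np.random.default_rng(1)
h, w = 3, 4

# Diagonal row and column covariances, as assumed later in this section.
Sigma = np.diag(rng.uniform(0.5, 1.5, size=h))     # (h, h)
Psi = np.diag(rng.uniform(0.5, 1.5, size=w))       # (w, w)

# Kronecker sum: Psi ⊕ Sigma = Psi ⊗ I_h + I_w ⊗ Sigma, acting on vec(C) of length h*w.
kron_sum = np.kron(Psi, np.eye(h)) + np.kron(np.eye(w), Sigma)

# Mean matrix restricted to an outer sum of two vectors, as in Eq. (3).
mu = rng.standard_normal(h)                        # per-row offsets
nu = rng.standard_normal(w)                        # per-column offsets
M = np.outer(mu, np.ones(w)) + np.outer(np.ones(h), nu)

vec_C = rng.multivariate_normal(M.flatten(order="F"), kron_sum)
C = vec_C.reshape((h, w), order="F")
print(C.shape)                                     # (3, 4)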

Figure 3. Illustrations of the regular attention operator (a), KAO-1 (b), and KAO-2 (c) on 2-D data. In the regular attention operator (a), the input tensor is unfolded into a mode-3 matrix and fed into the attention operator; the output of the attention operator is folded back into a tensor as the final output. In KAO-1 (b), we juxtapose the horizontal and lateral average matrices derived from the input tensor as the key and value matrices, and keep the mode-3 unfolding of the input tensor as the query matrix. In KAO-2 (c), all three input matrices use the juxtaposition of the two average matrices. In contrast to KAO-1, an outer-sum operation is used to generate the third-order output tensor from the output of the attention operator.

Under this model, the marginal distributions of rows and columns are both multivariate normal (Allen and Tibshirani, 2010). Specifically, the i-th row vector of C follows N(μ_i 1_w + ν, Ψ + σ_{ii} I_w), and the j-th column vector follows N(μ + ν_j 1_h, Σ + ψ_{jj} I_h), where σ_{ii} and ψ_{jj} are the corresponding diagonal entries of Σ and Ψ. In the following discussion, we assume that Σ and Ψ are diagonal, implying that any pair of variables in C are uncorrelated. Note that, although the variables in C are then independent, their covariance matrix still follows the Kronecker covariance structure, thus capturing the relationships among rows and columns (Allen and Tibshirani, 2010; Wang et al., 2019).

3.3. Main Technical Results

Let x̄ ∈ R^w and ȳ ∈ R^h be the average of the row and column vectors of C, respectively. Under the assumptions above, x̄ and ȳ follow multivariate normal distributions as

(4)   x̄ ~ N( μ̄ 1_w + ν, (1/h) Ψ + (tr(Σ)/h^2) I_w ),
(5)   ȳ ~ N( μ + ν̄ 1_h, (1/w) Σ + (tr(Ψ)/w^2) I_h ),

where μ̄ = (1/h) 1_h^T μ, ν̄ = (1/w) 1_w^T ν, and tr(·) denotes the matrix trace. Our main technical results can be summarized in the following theorem.

Theorem 1. Given the multivariate normal distributions in Eqs. (4) and (5) with diagonal covariance matrices, if (a) a set of random vectors are independent and identically distributed (i.i.d.) and follow the distribution in Eq. (4), (b) a second set of random vectors are i.i.d. and follow the distribution in Eq. (5), and (c) the two sets are independent of each other, we have

(6)

where , . In particular, if , the covariance matrix satisfies

(7)

where tr(·) denotes the matrix trace.

Proof.

The fact that the covariance matrices in Eqs. (4) and (5) are diagonal implies independence of their components in the case of multivariate normal distributions. Therefore, it follows from assumptions (a) and (b) that

(8)

where , and

(9)

where

Given assumption (c) and , we have

(10)

where .

If , we have

(11)

and

(12)

This completes the proof of the theorem. ∎

With certain normalization on , we can have , resulting in

(13)

As the trace of a covariance matrix measures the total variation, Theorem 1 implies that the matrix reconstructed from the row and column averages follows a matrix-variate normal distribution with the same mean and a scaled covariance as the distribution of C in Eq. (2). Given this conclusion and the above construction, we propose our Kronecker attention operators in the following section.

Figure 4. Architectures of the BaseModule (a), BaseSkipModule (b), AttnModule (c), and AttnSkipModule (d), as described in Section 3.5. The skip connections indicated by single dashed paths are not used when the stride is 2 or the numbers of input and output channels differ. Those indicated by double dashed paths are not used when the stride is 2.

3.4. Kronecker Attention Operators

We describe the Kronecker attention operators (KAOs) in the context of self-attention on 2-D data, but they can be easily generalized to generic attention. In this case, the input to a layer is a third-order tensor X ∈ R^{h×w×c}. Motivated by the theoretical results of Sections 3.2 and 3.3, we propose to use horizontal and lateral average matrices to represent the original mode-3 unfolding without much information loss. Based on Eq. (4) and Eq. (5), the horizontal average matrix X̄_H and the lateral average matrix X̄_L are computed as

(14)   X̄_H = (1/h) Σ_{i=1}^{h} X_{i::} ∈ R^{w×c},   X̄_L = (1/w) Σ_{j=1}^{w} X_{:j:} ∈ R^{h×c},

where X_{i::} and X_{:j:} are the horizontal and lateral slices (Kolda and Bader, 2009) of tensor X, respectively. We then form a matrix A by juxtaposing X̄_H^T and X̄_L^T as

(15)   A = [ X̄_H^T   X̄_L^T ] ∈ R^{c×(w+h)}.

Based on the horizontal and lateral average matrices contained in A, we propose two Kronecker attention operators (KAOs), namely KAO-1 and KAO-2. In KAO-1, as shown in Figure 3(b), we use the mode-3 unfolding X_(3) as the query matrix and A as the key and value matrices:

(16)   O = A softmax(A^T X_(3)).

Note that the number of columns in O equals the number of query vectors. Thus, we obtain hw output vectors from the attention operation in Eq. (16). Similar to the regular attention operator, O is folded back into a third-order tensor Y ∈ R^{h×w×c} by treating the column vectors of O as mode-3 fibers of Y. KAO-1 uses Y as the output of the layer.

If h = w, the time and space complexities of KAO-1 are O(ch^3) and O(h^3), respectively. Compared to the original non-local operator on 2-D data, KAO-1 reduces the time and space complexities by a factor of O(h).
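Putting the pieces together, a minimal NumPy sketch of KAO-1 as reconstructed in this subsection is given below; the averaging and concatenation orientation are our assumptions rather than the authors' implementation.

import numpy as np

def col_softmax(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def kao1(X):
    """KAO-1 sketch on X of shape (h, w, c): hw queries attend to h+w averaged keys."""
    h, w, c = X.shape
    X3 = X.reshape(h * w, c).T                  # query matrix: mode-3 unfolding, (c, h*w)
    bar_H = X.mean(axis=0).T                    # (c, w): averages over the height dimension
    bar_L = X.mean(axis=1).T                    # (c, h): averages over the width dimension
    A = np.concatenate([bar_H, bar_L], axis=1)  # key/value matrix: (c, h+w)
    O = A @ col_softmax(A.T @ X3)               # (c, h*w); coefficient matrix is only (h+w, h*w)
    return O.T.reshape(h, w, c)                 # fold the hw output columns back into a tensor

print(kao1(np.random.randn(14, 14, 8)).shape)   # (14, 14, 8)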

In order to reduce the time and space complexities further, we propose another operator known as KAO-2. In KAO-2, as shown in Figure 3(c), we use A as the query, key, and value matrices:

(17)   O = A softmax(A^T A).

The final output tensor Y ∈ R^{h×w×c} is obtained by partitioning O = [ O^H   O^L ] with O^H ∈ R^{c×w} and O^L ∈ R^{c×h}, and computing

(18)   Y_{::k} = o_k^L 1_w^T + 1_h (o_k^H)^T,   k = 1, …, c,

where o_k^H and o_k^L are the k-th rows of O^H and O^L (written as column vectors), respectively. That is, the k-th frontal slice of Y is obtained by computing the outer sum of the k-th rows of O^L and O^H.

If h = w, the time and space complexities of KAO-2 are O(ch^2) and O(h^2), respectively. Thus, the time and space complexities have been reduced by a factor of O(h^2) as compared to the original non-local operator, and by a factor of O(h) as compared to KAO-1.
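A corresponding sketch of KAO-2 under the same reconstructed conventions, including the outer-sum reconstruction of Eq. (18):

import numpy as np

def col_softmax(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def kao2(X):
    """KAO-2 sketch on X of shape (h, w, c): attend over only h+w averaged positions."""
    h, w, c = X.shape
    bar_H = X.mean(axis=0).T                    # (c, w): averages over the height dimension
    bar_L = X.mean(axis=1).T                    # (c, h): averages over the width dimension
    A = np.concatenate([bar_H, bar_L], axis=1)  # (c, h+w): queries, keys, and values
    O = A @ col_softmax(A.T @ A)                # (c, h+w); coefficient matrix is only (h+w, h+w)
    O_H, O_L = O[:, :w], O[:, w:]               # width-indexed and height-indexed parts
    # Outer-sum reconstruction of Eq. (18): slice k = o_k^L 1_w^T + 1_h (o_k^H)^T.
    return O_L.T[:, None, :] + O_H.T[None, :, :]   # (h, w, c) via broadcasting

print(kao2(np.random.randn(14, 14, 8)).shape)   # (14, 14, 8)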

Note that we do not consider linear transformations in our description, but such transformations can be applied to all three input matrices in KAO-1 and KAO-2, as shown in Figure 3.

3.5. Kronecker Attention Modules and Networks

Input size | Operator | Expansion | Output channels | n | Stride
224² × 3 | Conv2D | – | 32 | 1 | 2
112² × 32 | BaseSkipModule | 1 | 16 | 1 | 1
112² × 16 | BaseSkipModule | 6 | 24 | 2 | 2
56² × 24 | BaseSkipModule | 6 | 32 | 2 | 2
28² × 32 | AttnSkipModule | 6 | 32 | 1 | 1
28² × 32 | BaseSkipModule | 6 | 64 | 1 | 2
14² × 64 | AttnSkipModule | 6 | 64 | 3 | 1
14² × 64 | AttnSkipModule | 6 | 96 | 3 | 1
14² × 96 | BaseSkipModule | 6 | 160 | 1 | 2
7² × 160 | AttnSkipModule | 6 | 160 | 2 | 1
7² × 160 | AttnSkipModule | 6 | 320 | 1 | 1
7² × 320 | Conv2D | – | 1280 | 1 | 1
7² × 1280 | AvgPool + FC | – | k | 1 | –
Table 1. Details of the KANets architecture. Each line describes a sequence of operators in the format of “input size / operator name / expansion rate / number of output channels / number of operators in the sequence / stride”. “Conv2D” denotes the regular 2-D convolutional layer. “AvgPool” and “FC” denote the global average pooling layer and the fully-connected layer, respectively. All depth-wise convolutions use a 3 × 3 kernel. For multiple operators in a sequence denoted in the same line, all operators produce the same number of output channels; the first operator applies the listed stride, while the following operators apply a stride of 1. k denotes the number of classes in the task.

Attention models have not been used in compact deep models to date, primarily due to their high computational cost. Our efficient KAOs make it possible to use attention operators in compact convolutional neural networks (CNNs) like MobileNet (Sandler et al., 2018). In this section, we design a family of efficient Kronecker attention modules based on MobileNetV2 that can be used in compact CNNs.

BaseModule: MobileNetV2 (Sandler et al., 2018) is mainly composed of bottleneck blocks with inverted residuals. Each bottleneck block consists of three convolutional layers: a 1 × 1 convolutional layer, a depth-wise convolutional layer, and another 1 × 1 convolutional layer. Suppose the expansion factor is t and the stride is s. Given an input with c channels for a block, the first 1 × 1 convolutional layer outputs tc feature maps. The depth-wise convolutional layer uses a stride of s and outputs tc feature maps. The last 1 × 1 convolutional layer produces the output feature maps. When s = 1 and the numbers of input and output channels are equal, a skip connection is added between the input and the output. The BaseModule is illustrated in Figure 4(a).

BaseSkipModule: To facilitate feature reuse and gradient back-propagation in deep models, we improve the BaseModule by adding a skip connection. Given an input with c channels, we use an expansion factor of t − 1 for the first convolutional layer, instead of t as in the BaseModule. We then concatenate the output with the original input, which again yields tc feature maps. The other parts of the BaseSkipModule are the same as those of the BaseModule, as illustrated in Figure 4(b). Compared to the BaseModule, the BaseSkipModule reduces the number of parameters and the computational cost, while achieving better feature reuse and gradient back-propagation.

AttnModule: We propose to add an attention operator to the BaseModule to enable the capture of global features. We reduce the expansion factor of the BaseModule and add a new parallel path with an attention operator that outputs a small number of feature maps. Concretely, after the depth-wise convolutional layer, the original path outputs its expanded feature maps, while the attention operator, optionally followed by an average pooling with stride s when s > 1, produces the additional feature maps. Concatenating the two paths gives the input to the final convolutional layer, which remains the same. Within the attention operator, we only apply the linear transformation to the value matrix to limit the number of parameters and the required computational resources. We denote this module as the AttnModule, as shown in Figure 4(c). In this module, the original path acts as a locality-based feature extractor, while the new parallel path with an attention operator computes global features. This enables the module to incorporate both local and global information. Note that we can use any attention operator in this module, including the regular attention operator and our KAOs.

AttnSkipModule: We propose to add an additional skip connection to the AttnModule, as shown in Figure 4(d). This skip connection can always be added unless the stride is 2. The AttnSkipModule has the same number of parameters and computational cost as the AttnModule.
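To make the module layout concrete, below is a PyTorch-style sketch of an AttnModule-like block: a depth-wise local path runs in parallel with a KAO-2-style global path, and the two are concatenated before the final 1 × 1 convolution. The exact channel split, kernel sizes, activation choices, and the use of PyTorch are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def kao2_torch(x):
    """KAO-2-style global path on x of shape (B, C, H, W), per our reconstruction."""
    b, c, h, w = x.shape
    bar_h = x.mean(dim=2)                              # (B, C, W): average over height
    bar_l = x.mean(dim=3)                              # (B, C, H): average over width
    a = torch.cat([bar_h, bar_l], dim=2)               # (B, C, H+W)
    coef = F.softmax(a.transpose(1, 2) @ a, dim=1)     # (B, H+W, H+W), column-normalized
    o = a @ coef                                       # (B, C, H+W)
    o_h, o_l = o[:, :, :w], o[:, :, w:]                # width- and height-indexed parts
    return o_l.unsqueeze(3) + o_h.unsqueeze(2)         # (B, C, H, W): outer sum per channel

class AttnModuleSketch(nn.Module):
    """Illustrative AttnModule-like block: local depth-wise path + parallel attention path."""
    def __init__(self, c_in, c_out, expand=6, attn_channels=8, stride=1):
        super().__init__()
        mid = expand * c_in
        self.expand = nn.Conv2d(c_in, mid, 1, bias=False)                # 1x1 expansion
        self.dw = nn.Conv2d(mid, mid, 3, stride=stride,
                            padding=1, groups=mid, bias=False)           # 3x3 depth-wise
        self.attn_value = nn.Conv2d(c_in, attn_channels, 1, bias=False)  # value transform only
        self.stride = stride
        self.project = nn.Conv2d(mid + attn_channels, c_out, 1, bias=False)

    def forward(self, x):
        local = F.relu6(self.dw(F.relu6(self.expand(x))))        # local feature path
        glob = kao2_torch(self.attn_value(x))                    # global attention path
        if self.stride > 1:
            glob = F.avg_pool2d(glob, self.stride)               # match spatial size when strided
        return self.project(torch.cat([local, glob], dim=1))     # concatenate and project

y = AttnModuleSketch(16, 24, stride=2)(torch.randn(2, 16, 28, 28))
print(y.shape)  # torch.Size([2, 24, 14, 14])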

4. Experimental Studies

In this section, we evaluate our proposed operators and networks on image classification and segmentation tasks. We first compare our proposed KAOs with regular attention operators in terms of computational cost and memory usage. Next, we design novel compact CNNs known as Kronecker attention networks (KANets) using our proposed operators and modules. We compare KANets with other compact CNNs on the ImageNet ILSVRC 2012 dataset (Deng et al., 2009). Ablation studies are conducted to investigate how our KAOs benefit the entire networks. We also perform experiments on the PASCAL VOC 2012 dataset (Everingham et al., 2010) to show the effectiveness of our KAOs in general application scenarios.

Input | Operator | MAdd | Cost Saving | Memory | Memory Saving | Time | Speedup
8 × 14 × 14 × 8 | Attn | 0.63m | 0.00% | 5.2MB | 0.00% | 5.8ms | 1.0
8 × 14 × 14 × 8 | Attn+Pool | 0.16m | 75.00% | 1.5MB | 71.65% | 2.0ms | 3.0
8 × 14 × 14 × 8 | KAO-1 | 0.09m | 85.71% | 0.9MB | 82.03% | 1.7ms | 3.5
8 × 14 × 14 × 8 | KAO-2 | 0.01m | 97.71% | 0.3MB | 95.06% | 0.8ms | 6.8
8 × 28 × 28 × 8 | Attn | 9.88m | 0.00% | 79.9MB | 0.00% | 72.4ms | 1.0
8 × 28 × 28 × 8 | Attn+Pool | 2.47m | 75.00% | 20.7MB | 74.13% | 20.9ms | 3.5
8 × 28 × 28 × 8 | KAO-1 | 0.71m | 92.86% | 6.5MB | 91.88% | 7.1ms | 10.1
8 × 28 × 28 × 8 | KAO-2 | 0.05m | 99.46% | 0.9MB | 98.85% | 1.7ms | 40.9
8 × 56 × 56 × 8 | Attn | 157.55m | 0.00% | 1,262.6MB | 0.00% | 1,541.1ms | 1.0
8 × 56 × 56 × 8 | Attn+Pool | 39.39m | 75.00% | 318.7MB | 74.76% | 396.9ms | 3.9
8 × 56 × 56 × 8 | KAO-1 | 5.62m | 96.43% | 48.2MB | 96.18% | 49.6ms | 31.1
8 × 56 × 56 × 8 | KAO-2 | 0.21m | 99.87% | 3.4MB | 99.73% | 5.1ms | 305.8
Table 2. Comparisons between the regular attention operator, the regular attention operator with a pooling operation (Wang et al., 2018), and our proposed KAO-1 and KAO-2 in terms of the number of multiply-adds (MAdd), memory usage, and CPU inference time on simulated data of different sizes. The input sizes are given in the format of “batch size × spatial sizes × number of input channels”. “Attn” denotes the regular attention operator. “Attn+Pool” denotes the regular attention operator that employs a pooling operation on the key and value matrices to reduce the required computational resources.

4.1. Experimental Setup

In this section, we describe the experimental setups for both image classification tasks and image segmentation tasks.

Experimental Setup for Image Classification As a common practice on this dataset, we use the same data augmentation scheme as in He et al. (2016). Specifically, during training, we scale each image and then randomly crop a 224 × 224 patch. During inference, the center-cropped 224 × 224 patches are used. We train our KANets using the same settings as MobileNetV2 (Sandler et al., 2018) with minor changes. We perform batch normalization (Ioffe and Szegedy, 2015) on the coefficient matrices in KAOs to stabilize training. All trainable parameters are initialized with the Xavier initialization (Glorot and Bengio, 2010). We use the standard stochastic gradient descent optimizer with a momentum of 0.9 (Sutskever et al., 2013) to train models for 150 epochs in total. The initial learning rate is 0.1, and it decays by a factor of 0.1 at three scheduled epochs. Dropout (Srivastava et al., 2014) is applied after the global average pooling layer. We use 8 TITAN Xp GPUs for training. Since labels of the test dataset are not available, we train our networks on the training dataset and report accuracies on the validation dataset.

Experimental Setup for Image Segmentation We train all the models with randomly cropped patches and a batch size of 8. Data augmentation by randomly scaling the inputs is employed during training. We adopt the “poly” learning rate policy (Liu et al., 2015) and set the initial learning rate to 0.00025. Following DeepLabV2, we use the ResNet-101 model pre-trained on ImageNet (Deng et al., 2009) and MS-COCO (Lin et al., 2014) for initialization. The models are then trained for 25,000 iterations with a momentum of 0.9 and a weight decay of 0.0005. We perform no post-processing such as conditional random fields and do not use multi-scale inputs due to limited GPU memory. All the models are trained on the training set and evaluated on the validation set.

4.2. Comparison of Computational Efficiency

According to the theoretical analysis in Section 3.4, our KAOs have efficiency advantages over regular attention operators on high-order data, especially for inputs with large spatial sizes. We conduct simulated experiments to evaluate the theoretical results. To reduce the influence of external factors, we build networks composed of a single attention operator and apply the TensorFlow profiling tool (Abadi et al., 2016) to report the number of multiply-adds (MAdd), the required memory, and the time consumed on 2-D simulated data. For the simulated input data, we set the batch size and the number of channels both to 8, and test three spatial sizes: 14 × 14, 28 × 28, and 56 × 56. The number of output channels is also set to 8.

Table 2 summarizes the comparison results. On simulated data of spatial size 56 × 56, our KAO-1 and KAO-2 achieve 31.1 and 305.8 times speedup and 96.18% and 99.73% memory saving, respectively, compared to the regular attention operator. Our proposed KAOs show significant improvements over regular attention operators in terms of computational resources, which is consistent with the theoretical analysis. In particular, the amount of improvement increases as the spatial sizes increase. These results show that the proposed KAOs are efficient attention operators on high-dimensional and high-order data.
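The MAdd column of Table 2 can be roughly reproduced from the complexity analysis in Section 3.4. The back-of-envelope model below is our own reconstruction: it counts one multiply-add per scalar product term for the two matrix products and the value transformation, ignores the softmax and the batch dimension, and assumes the spatial sizes listed above.

# Rough per-sample MAdd model (our assumption): two matrix products plus the
# 1x1 value transformation, ignoring the softmax and the batch dimension.
d = 8                                       # number of channels
for h in (14, 28, 56):                      # spatial sizes assumed from the analysis above
    N, s = h * h, 2 * h                     # hw query positions, h+w averaged positions
    attn = 2 * d * N * N + d * d * N                  # regular attention
    pool = 2 * d * N * (N // 4) + d * d * (N // 4)    # keys/values pooled 2x2
    kao1 = 2 * d * s * N + d * d * s                  # KAO-1
    kao2 = 2 * d * s * s + d * d * s                  # KAO-2
    print(h, [round(v / 1e6, 2) for v in (attn, pool, kao1, kao2)])
# Prints values close to the Attn / Attn+Pool / KAO-1 / KAO-2 rows of Table 2.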

Model Top-1 Params MAdd
GoogleNet 0.698 6.8m 1550m
VGG16 0.715 128m 15300m
AlexNet 0.572 60m 720m
SqueezeNet 0.575 1.3m 833m
MobileNetV1 0.706 4.2m 569m
ShuffleNet 1.5x 0.715 3.4m 292m
ChannelNet-v1 0.705 3.7m 407m
MobileNetV2 0.720 3.47m 300m
KANet-1 (ours) 0.729 3.44m 288m
KANet-2 (ours) 0.728 3.44m 281m
Table 3. Comparisons between KANets and other CNNs in terms of the top-1 accuracy on the ImageNet validation set, the number of total parameters, and MAdd. We use KANet-1 and KANet-2 to denote KANets using KAO-1 and KAO-2, respectively.

4.3. Results on Image Classification

With the high efficiency of our KAOs, we have proposed several efficient Kronecker attention modules for compact CNNs in Section 3.5. To further show the effectiveness of KAOs and the modules, we build novel compact CNNs known as Kronecker attention networks (KANets). Following the practices in (Wang et al., 2018), we apply these modules on inputs of spatial sizes 28 × 28, 14 × 14, and 7 × 7. The detailed network architecture is described in Table 1.

We compare KANets with other CNNs on the ImageNet ILSVRC 2012 image classification dataset, which serves as the benchmark for compact CNNs (Howard et al., 2017; Zhang et al., 2017; Gao et al., 2018a; Sandler et al., 2018). The dataset contains 1.2 million training, 50 thousand validation, and 50 thousand testing images. Each image is labeled with one of 1,000 classes. Details of the experimental setups are provided in Section 4.1.

The comparison results between our KANets and other CNNs in terms of the top-1 accuracy, number of parameters, and MAdd are reported in Table 3. SqueezeNet (Iandola et al., 2016) has the fewest parameters, but it uses the most MAdd among the compact models and does not obtain competitive performance compared to other compact CNNs. Among compact CNNs, MobileNetV2 (Sandler et al., 2018) is the previous state-of-the-art model, which achieves the best trade-off between effectiveness and efficiency. According to the results, our KANets significantly outperform MobileNetV2 with 0.03 million fewer parameters. Specifically, our KANet-1 and KANet-2 outperform MobileNetV2 by margins of 0.9% and 0.8%, respectively. More importantly, our KANets have the least computational cost. These results demonstrate the effectiveness and efficiency of our proposed KAOs.

The performance of KANets indicates that our proposed methods are promising, since we only make small modifications to the architecture of MobileNetV2 to include KAOs. Compared to modules with the regular convolutional layers only, our proposed modules with KAOs achieve better performance without using excessive computational resources. Thus, our methods can be used widely for designing compact deep models. Our KAOs successfully address the practical challenge of applying regular attention operators on high-order data. In the next experiments, we show that our proposed KAOs are as effective as regular attention operators.

Model Top-1 Params MAdd
AttnNet 0.730 3.44m 365m
AttnNet+Pool 0.729 3.44m 300m
KANet-1 0.729 3.44m 288m
KANet-2 0.728 3.44m 281m
Table 4. Comparisons between KANets with regular attention operators (denoted as AttnNet), KANets with regular attention operators plus a pooling operation (denoted as AttnNet+Pool), and KANets with KAOs (KANet-1 and KANet-2) in terms of the top-1 accuracy on the ImageNet validation set, the number of total parameters, and MAdd.

4.4. Comparison with Regular Attention Operators

We perform experiments to compare our proposed KAOs with regular attention operators. We consider the regular attention operator and the one with a pooling operation in (Wang et al., 2018). For the attention operator with pooling operation, the spatial sizes of the key matrix and value matrix are reduced by pooling operations to save computation cost. To compare these operators in fair settings, we replace all KAOs in KANets with regular attention operators and regular attention operators with a pooling operation, denoted as AttnNet and AttnNet+Pool, respectively.

The comparison results are summarized in Table 4. Note that all these models have the same number of parameters. We can see that KANet-1 and KANet-2 achieve performance similar to AttnNet and AttnNet+Pool with dramatic reductions in computational cost. The results indicate that our proposed KAOs are as effective as regular attention operators while being much more efficient. In addition, our KAOs compare favorably with regular attention operators that use a pooling operation to increase efficiency (Wang et al., 2018).

4.5. Ablation Studies

To show how our KAOs benefit entire networks in different settings, we conduct ablation studies on MobileNetV2 and KANet-1. For MobileNetV2, we replace BaseModules with AttnModules as described in Section 3.5, resulting in a new model denoted as MobileNetV2+KAO. Conversely, based on KANet-1, we replace all AttnSkipModules with BaseModules. The resulting model is denoted as KANet-1 w/o KAO.

Table 5 reports the comparison results. By employing KAOs, MobileNetV2+KAO gains a performance boost of 0.6% with fewer parameters than MobileNetV2. On the other hand, KANet-1 outperforms KANet-1 w/o KAO by a margin of 0.8%, even though KANet-1 w/o KAO has more parameters. KANet-1 achieves the best performance while costing the least computational resources. These results indicate that our proposed KAOs are effective and efficient, independently of the specific network architecture.

Model Top-1 Params MAdd
MobileNetV2 0.720 3.47m 300m
MobileNetV2+KAO 0.726 3.46m 298m
KANet-1 0.729 3.44m 288m
KANet-1 w/o KAO 0.721 3.46m 298m
Table 5. Comparisons between MobileNetV2, MobileNetV2 with KAOs (denoted as MobileNetV2+KAO), KANet-1, and KANet-1 without KAOs (denoted as KANet-1 w/o KAO) in terms of the top-1 accuracy on the ImageNet validation set, the number of total parameters, and MAdd.

4.6. Results on Image Segmentation

Model Accuracy Mean IOU
DeepLabV2 0.944 75.1
DeepLabV2+Attn 0.947 76.3
DeepLabV2+KAO-1 0.946 75.9
DeepLabV2+KAO-2 0.946 75.8
Table 6. Comparisons of DeepLabV2, DeepLabV2 with the regular attention operator (DeepLabV2+Attn), and DeepLabV2 with our KAOs (DeepLabV2+KAO-1 and DeepLabV2+KAO-2) in terms of the pixel-wise accuracy and mean IoU on the PASCAL VOC 2012 validation dataset.

In order to show the efficiency and effectiveness of our KAOs in broader application scenarios, we perform additional experiments on image segmentation tasks using the PASCAL VOC 2012 dataset (Everingham et al., 2010). With the extra annotations provided by (Hariharan et al., 2011), the augmented dataset contains 10,582 training, 1,449 validation, and 1,456 testing images. Each pixel is labeled with one of 21 classes (20 foreground classes and 1 background class).

We re-implement the DeepLabV2 model (Chen et al., 2018) as our baseline. Following (Wang and Ji, 2018), we use attention operators as the output layer instead of atrous spatial pyramid pooling (ASPP), which results in a significant performance improvement. In our experiments, we replace ASPP with the regular attention operator and with our proposed KAOs, respectively, and compare the results. For all attention operators, linear transformations are applied to the query, key, and value matrices. Details of the experimental setups are provided in Section 4.1.

Table 6 shows the evaluation results in terms of pixel-wise accuracy and mean intersection over union (IoU) on the PASCAL VOC 2012 validation set. Clearly, the models with attention operators outperform the baseline model with ASPP. Compared with the regular attention operator, KAOs result in similar pixel-wise accuracy but slightly lower mean IoU. In terms of pixel-wise accuracy, the results indicate that KAOs are as effective as the regular attention operator. The decrease in mean IoU may be caused by the strong structural assumption behind KAOs. Overall, the experimental results demonstrate the efficiency and effectiveness of our KAOs in broader application scenarios.

5. Conclusions

In this work, we propose Kronecker attention operators to address the practical challenge of applying attention operators on high-order data. We investigate the problem from a probabilistic perspective and use matrix-variate normal distributions with Kronecker covariance structure. Experimental results show that our KAOs reduce the amount of required computational resources by a factor of hundreds, with larger factors for higher-dimensional and higher-order data. We employ KAOs to design a family of efficient modules, leading to our KANets. KANets significantly outperform the previous state-of-the-art compact models on image classification tasks, with fewer parameters and less computational cost. Additionally, we perform experiments on the image segmentation task to show the effectiveness of our KAOs on general application scenarios.

Acknowledgements.
This work was supported in part by National Science Foundation grants IIS-1908220 and DBI-1922969.

References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: a system for large-scale machine learning.. In OSDI, Vol. 16. 265–283.
  • Allen and Tibshirani (2010) Genevera I Allen and Robert Tibshirani. 2010. Transposable regularized covariance models with an application to missing data imputation. The Annals of Applied Statistics 4, 2 (2010), 764.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (2015).
  • Chen et al. (2018) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2018. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2018), 834–848.
  • Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 2 (2010), 303–338.
  • Gao and Ji (2019) Hongyang Gao and Shuiwang Ji. 2019. Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 741–749.
  • Gao et al. (2018a) Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. 2018a. ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions. In Advances in Neural Information Processing Systems. 5203–5211.
  • Gao et al. (2018b) Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. 2018b. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1416–1424.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 249–256.
  • Graham (2018) Alexander Graham. 2018. Kronecker products and matrix calculus with applications. Courier Dover Publications.
  • Gupta and Nagar (2018) Arjun K Gupta and Daya K Nagar. 2018. Matrix variate distributions. Chapman and Hall/CRC.
  • Hariharan et al. (2011) Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. 2011. Semantic contours from inverse detectors. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 991–998.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
  • Huang et al. (2018) Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2018. CCNet: Criss-Cross Attention for Semantic Segmentation. arXiv preprint arXiv:1811.11721 (2018).
  • Iandola et al. (2016) Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning. 448–456.
  • Johnson and Zhang (2015) Rie Johnson and Tong Zhang. 2015. Semi-supervised convolutional neural networks for text categorization via region embedding. In Advances in neural information processing systems. 919–927.
  • Kalaitzis et al. (2013) Alfredo Kalaitzis, John Lafferty, Neil Lawrence, and Shuheng Zhou. 2013. The bigraphical lasso. In International Conference on Machine Learning. 1229–1237.
  • Kolda and Bader (2009) Tamara G. Kolda and Brett W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3 (2009), 455–500.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • Li et al. (2018) Guanbin Li, Xiang He, Wei Zhang, Huiyou Chang, Le Dong, and Liang Lin. 2018. Non-locally Enhanced Encoder-Decoder Network for Single Image De-raining. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 1056–1064.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755.
  • Liu et al. (2015) Wei Liu, Andrew Rabinovich, and Alexander C Berg. 2015. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579 (2015).
  • Lu et al. (2016) Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems. 289–297.
  • Malinowski et al. (2018) Mateusz Malinowski, Carl Doersch, Adam Santoro, and Peter Battaglia. 2018. Learning Visual Question Answering by Bootstrapping Hard Attention. In Proceedings of the European Conference on Computer Vision (ECCV). 3–20.
  • Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 4510–4520.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
  • Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International conference on machine learning. 1139–1147.
  • Ulyanov et al. (2016) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2016. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022 (2016).
  • Van Loan (2000) Charles F Van Loan. 2000. The ubiquitous Kronecker product. Journal of computational and applied mathematics 123, 1-2 (2000), 85–100.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. arXiv preprint arXiv:1710.10903 (2017).
  • Wang et al. (2018) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1. 4.
  • Wang and Ji (2018) Zhengyang Wang and Shuiwang Ji. 2018. Smoothed Dilated Convolutions for Improved Dense Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2486–2495.
  • Wang et al. (2019) Zhengyang Wang, Hao Yuan, and Shuiwang Ji. 2019. Spatial Variational Auto-Encoding via Matrix-Variate Normal Distributions. In Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM, 648–656.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning. 2048–2057.
  • Zhang et al. (2017) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2017. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083 (2017).
  • Zhao et al. (2018) Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. 2018. PSANet: Point-wise Spatial Attention Network for Scene Parsing. In Proceedings of the European Conference on Computer Vision (ECCV). 267–283.