1 Introduction
Capturing long-range spatial-temporal dependencies is crucial for deep convolutional neural networks (CNNs) to extract discriminative features in vision tasks such as image and video classification. However, the traditional convolution operator only processes a local neighborhood at a time. This forces CNNs to go deeper with convolutional operations to enlarge the receptive fields, which leads to higher computation and memory cost. Moreover, going deeper cannot always increase the effective receptive field due to the Gaussian distribution of the kernel weights (widelimit). To overcome this limitation, some recent works design network architectures with wider and well-designed modules to capture long-range dependencies (largeconv, deeplab, pspnet). Although these modules have larger receptive fields, they still need to be applied recursively to capture the dependencies between pairs of distant positions. Inspired by the classical non-local means method in image denoising, nl proposes the nonlocal neural network, which uses the nonlocal (NL) block to capture "full-range" dependencies in a single module by exploring the correlations between each position and all other positions. In the NL block, an affinity matrix is first computed to represent the correlations between each position pair. Then weighted means of the features are calculated based on the affinity matrix to refine the feature representation. Finally, a residual connection adds the refined feature map back to the input. Due to its simplicity and effectiveness, the nonlocal block has been widely used in image and video classification (
nl; cgnl; ns; a2net), image segmentation (ccnet; cgnl; nl), and person re-identification (nlreid; scan). However, due to the complexity of the affinity matrix, the nonlocal block (composed of a nonlocal operator and a residual connection) requires much more computational effort and is sensitive to its number and position in the neural network (ns). Some works address the first problem by simplifying the calculation of the affinity matrix (ccnet, cgdnet, cgnl, a2net). Only a few works try to solve the second problem, which limits the robustness of the nonlocal network (composed of nonlocal blocks and a backbone network). ns proposes the nonlocal stage (NS) block, which respects the diffusion nature and maintains the same affinity matrix for all the nonlocal units in the NS block. Compared with the NL block, the NS block is insensitive to the number of blocks and allows deeper nonlocal structures. However, the deeper nonlocal structure of the NS block increases complexity without a remarkable improvement. Work from recent dynamical systems utilizing efficient low-rank approximations for learning (she2018reduced; she2018stochastic), or considering complex layer-wise dynamics with recurrent neural networks (she2019neural; shesupplementary), can enlighten this research with its well-explored mathematical formulations.

In this work, we focus on elaborating a robust nonlocal block that is more flexible to use in a neural network. We prove that the nonlocal operator in the nonlocal block is equivalent to a Chebyshev-approximated fully-connected graph filter with irrational constraints that limit its freedom for parameter learning. To remove these irrational constraints, we propose the spectral nonlocal (SNL) block, which is more robust and can degrade into the NL and NS blocks under specific assumptions. We also prove that the deeper nonlocal structure satisfies a stable hypothesis with the help of steady-state analysis. Based on this hypothesis, we give the full-order approximated spectral nonlocal (gSNL) block, which performs well in deeper nonlocal structures. Finally, we add our proposed nonlocal blocks into deep networks and evaluate them on image and video classification tasks. Experiments show that networks with our proposed blocks are more robust and achieve higher accuracy than those using other types of nonlocal blocks. To summarize, our contributions are threefold:

We propose a spectral nonlocal (SNL) block as an efficient, simple, and generic component for capturing long-range spatial-temporal dependencies with deep neural networks; it is a generalization of the classical nonlocal blocks.

We propose the stable hypothesis, which enables deeper nonlocal structures without elaborate preparation of either the number or the position of the building blocks. We further extend SNL into the generalized SNL (gSNL), which enables multiple nonlocal blocks to be plugged into existing computer vision architectures with stable learning dynamics.

Both SNL and gSNL outperform other nonlocal blocks across several image and video classification tasks with a clear-cut improvement.
2 Preliminary
Nonlocal block The NL block consists of an NL operator with a residual connection and is expressed as:

(1) $\mathbf{Y} = \mathbf{X} + \mathcal{F}(\mathbf{A}, \mathbf{Z})$

where $\mathbf{X} \in \mathbb{R}^{N \times C_1}$ is the input feature map, $\mathcal{F}(\mathbf{A}, \mathbf{Z})$ is the NL operator, and $\mathbf{Z} = \mathbf{X}\mathbf{W}_Z \in \mathbb{R}^{N \times C_s}$ is the transferred feature map that compresses the channels of $\mathbf{X}$ by a linear transformation with kernel $\mathbf{W}_Z \in \mathbb{R}^{C_1 \times C_s}$. Here $N$ is the number of positions. The affinity matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ is composed of the pairwise correlations between positions. In the NL block, the NL operator explores the "full-range" dependencies by considering the relationships between all position pairs:

(2) $\mathcal{F}(\mathbf{A}, \mathbf{Z}) = \mathbf{A}\mathbf{Z}\mathbf{W}$

where $\mathbf{W} \in \mathbb{R}^{C_s \times C_1}$ is the weight matrix of a linear transformation and $\mathbf{A} = f(\mathbf{X}, \mathbf{X})$, where the affinity kernel $f(\cdot, \cdot)$ can adopt the "Dot Product", "Traditional Gaussian", "Embedded Gaussian", or another kernel matrix with a finite Frobenius norm.
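The NL operator above can be sketched in a few lines of numpy. This is an illustrative sketch with an embedded-Gaussian affinity kernel; the function and weight names are ours, not the authors' code:

```python
import numpy as np

def nl_operator(X, W_theta, W_phi, W_g, W_out):
    """Sketch of the NL operator F(A, Z) = A Z W.
    X: (N, C) feature map flattened over the N positions."""
    theta, phi = X @ W_theta, X @ W_phi      # embeddings for the affinity kernel
    Z = X @ W_g                              # transferred feature map
    S = theta @ phi.T                        # dot-product similarities
    S = S - S.max(axis=1, keepdims=True)     # stabilize the softmax
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)     # row-wise softmax -> affinity matrix
    return (A @ Z) @ W_out                   # weighted mean, then linear transform

# The NL block of Eq. (1) adds a residual connection: Y = X + nl_operator(...)
```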
Nonlocal stage To make the NL operator follow the diffusion nature that allows a deeper nonlocal structure (ns), the nonlocal stage (NS) operator uses the graph Laplacian to replace the affinity matrix in the NL operator:

(3) $\mathcal{F}_s(\mathbf{A}, \mathbf{Z}) = (\mathbf{A} - \mathbf{D}_{\mathbf{A}})\mathbf{Z}\mathbf{W}$

where $\mathcal{F}_s(\mathbf{A}, \mathbf{Z})$ is the NS operator and $\mathbf{D}_{\mathbf{A}} = \mathrm{diag}(d_1, \dots, d_N)$ with $d_i = \sum_j \mathbf{A}_{ij}$ the degree of node $i$. Moreover, when multiple blocks with the same affinity matrix are added and the NL operator is replaced by the NS operator, these consecutively-connected blocks become the NS block. We call the nonlocal blocks inside an NS block the NS units.
3 Method
The nonlocal operator can be divided into two steps: calculating the affinity matrix $\mathbf{A}$ to represent the correlations between each pair of positions, and refining the feature map by calculating the weighted means based on $\mathbf{A}$. In this section, a fully-connected graph filter is utilized to explain the nonlocal operator. With the Chebyshev approximation, we propose the SNL operator, which is proved to be a generalized form of the NL and NS operators and is more robust, with higher performance in computer vision tasks. Furthermore, based on the stable hypothesis that a deeper nonlocal structure tends to learn a stable affinity matrix, we extend our SNL operator into a full-order Chebyshev approximation version, i.e. the gSNL.
3.1 The Proposed Spectral Nonlocal Operator
Nonlocal operator in the graph view The nonlocal operator is a filter that computes a weighted mean of all the positions in the feature map $\mathbf{X}$ based on the affinity matrix $\mathbf{A}$ and then conducts the feature transformation with the kernel $\mathbf{W}$. This is the same as filtering the signal $\mathbf{X}$ with a graph filter in the fully-connected graph domain determined by the affinity matrix $\mathbf{A}$ (gsp). From this perspective, we further illustrate the nonlocal operator as:
Theorem 1.
Given an affinity matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ and the signal $\mathbf{Z}$, the nonlocal operator is the same as filtering the signal $\mathbf{Z}$ in the graph domain of a fully-connected weighted graph $\mathbb{G}$:

(4) $\mathcal{F}(\mathbf{A}, \mathbf{Z}) = \mathbf{Z} *_{\mathbb{G}} \mathbf{g}_{\theta} = \mathbf{U}\, \mathrm{diag}(\theta_1, \dots, \theta_N)\, \mathbf{U}^{\top} \mathbf{Z}$

where the graph filter $\mathbf{g}_{\theta}$ is a diagonal parameter matrix, i.e. $\mathbf{g}_{\theta} = \mathrm{diag}(\boldsymbol{\theta})$ with $\boldsymbol{\theta} = (\theta_1, \dots, \theta_N)$. $\mathbb{G} = (\mathbf{V}, \mathbf{A})$ is a fully-connected graph with vertex set $\mathbf{V}$ and affinity matrix $\mathbf{A}$. $\mathbf{U}$ and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \dots, \lambda_N)$ are the eigenvectors and eigenvalues of the graph Laplacian $\mathbf{L} = \mathbf{D}_{\mathbf{A}} - \mathbf{A} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{\top}$.

This definition requires that the graph Laplacian has non-singular eigenvalues and eigenvectors, so the affinity matrix should be a symmetric, nonnegative, row-normalized matrix. To meet this requirement, the affinity matrix can be obtained by the following steps. First, the affinity kernel is used to calculate the matrix $\mathbf{M}$ (we use the dot product with embedded weight matrices $\mathbf{W}_{\phi}$ and $\mathbf{W}_{\psi}$ as the affinity kernel, i.e. $\mathbf{M} = (\mathbf{X}\mathbf{W}_{\phi})(\mathbf{X}\mathbf{W}_{\psi})^{\top}$). Then we make the matrix symmetric: $\bar{\mathbf{M}} = (\mathbf{M} + \mathbf{M}^{\top})/2$. Finally, we normalize the rows of $\bar{\mathbf{M}}$ so that $\mathbf{A}\mathbf{1} = \mathbf{1}$, i.e. $\mathbf{A} = \mathbf{D}_{\bar{\mathbf{M}}}^{-1}\bar{\mathbf{M}}$. In the following sections, this symmetric, nonnegative, row-normalized matrix is denoted as $\mathbf{A}$.
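The construction of the affinity matrix and the graph filtering of Eq. (4) can be sketched as follows. This is a minimal numpy sketch; taking the absolute value to enforce nonnegativity is our assumption (the text only requires a nonnegative matrix), and the Laplacian is numerically symmetrized before the eigendecomposition:

```python
import numpy as np

def affinity(X, W_phi, W_psi, eps=1e-8):
    """Symmetric, nonnegative, row-normalized affinity matrix (Sec. 3.1 steps)."""
    M = np.abs((X @ W_phi) @ (X @ W_psi).T)           # nonnegative kernel values (assumed abs)
    M = 0.5 * (M + M.T)                               # symmetrize
    return M / (M.sum(axis=1, keepdims=True) + eps)   # row-normalize: rows sum to 1

def graph_filter(A, Z, theta):
    """Eq. (4): filter Z in the spectral domain of L = I - A (row-normalized A)
    with the diagonal graph filter g = diag(theta)."""
    L = np.eye(A.shape[0]) - A
    lam, U = np.linalg.eigh(0.5 * (L + L.T))          # symmetrize for a real eigenbasis
    return U @ (theta[:, None] * (U.T @ Z))
```

A quick sanity check of the spectral view: an all-ones filter `theta` leaves the signal unchanged, since `U @ U.T` is the identity.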
The proposed spectral nonlocal operator The graph filter $\mathbf{g}_{\theta}$ in Eq. (4) contains $N$ parameters. To simplify it, we use Chebyshev polynomials, which reduce the number of parameters to $K$ ($K \ll N$). For simplicity, we first assume that the input $\mathbf{Z}$ and the output $\mathcal{F}(\mathbf{A}, \mathbf{Z})$ have only one channel.

Following a method similar to gcn, the $K$th-order Chebyshev polynomials are used to approximate the graph filter function:

(5) $\mathcal{F}(\mathbf{A}, \mathbf{Z}) \approx \sum_{k=0}^{K-1} \theta'_k T_k(\tilde{\mathbf{L}})\mathbf{Z}$, with $\tilde{\mathbf{L}} = \frac{2\mathbf{L}}{\lambda_{\max}} - \mathbf{I}_N$, $T_0(\tilde{\mathbf{L}}) = \mathbf{I}_N$, $T_1(\tilde{\mathbf{L}}) = \tilde{\mathbf{L}}$, and $T_{k+1}(\tilde{\mathbf{L}}) = 2\tilde{\mathbf{L}}T_k(\tilde{\mathbf{L}}) - T_{k-1}(\tilde{\mathbf{L}})$.

Because $\mathbf{L}$ is a random-walk Laplacian, the maximum eigenvalue satisfies $\lambda_{\max} \le 2$, which makes $\tilde{\mathbf{L}} = \mathbf{L} - \mathbf{I}_N = -\mathbf{A}$ (gsp). Then Eq. (5) becomes:

(6) $\mathcal{F}(\mathbf{A}, \mathbf{Z}) \approx \sum_{k=0}^{K-1} \theta'_k T_k(-\mathbf{A})\mathbf{Z}$

If $K = 2$, the first-order Chebyshev approximation of Eq. (6) becomes:

(7) $\mathcal{F}(\mathbf{A}, \mathbf{Z}) = \theta_0 \mathbf{Z} + \theta_1 \mathbf{A}\mathbf{Z}$

where $\theta_0 = \theta'_0$ and $\theta_1 = -\theta'_1$ are the coefficients of the first and second terms, which are approximated by learning with SGD. Then, extending Eq. (7) to multi-channel conditions, we get the formation of our SNL operator:

(8) $\mathcal{F}_s(\mathbf{A}, \mathbf{Z}) = \mathbf{Z}\mathbf{W}_1 + \mathbf{A}\mathbf{Z}\mathbf{W}_2$

where $\mathcal{F}_s(\mathbf{A}, \mathbf{Z})$ is the SNL operator and $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{C_s \times C_1}$. Finally, a residual connection is added to the SNL operator to form the SNL block:

(9) $\mathbf{Y} = \mathbf{X} + \mathbf{Z}\mathbf{W}_1 + \mathbf{A}\mathbf{Z}\mathbf{W}_2$
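The SNL block of Eq. (9) amounts to two extra matrix products. A minimal numpy sketch (function and argument names are ours):

```python
import numpy as np

def snl_block(X, A, W_z, W_1, W_2):
    """SNL block, Eq. (9): Y = X + Z W_1 + A Z W_2 with Z = X W_z.
    W_1 and W_2 absorb the two Chebyshev coefficients theta_0, theta_1."""
    Z = X @ W_z
    return X + Z @ W_1 + (A @ Z) @ W_2
```

Setting `W_1 = 0` recovers an NL-style block, and `W_1 = -W_2` recovers an NS-style block, which is the degradation argument made in the next paragraph.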
Relation with other nonlocal operators As shown in fig. 1, our SNL operator can degrade into the NL operator by setting $\mathbf{W}_1 = \mathbf{0}$, i.e. $\theta_0 = 0$. However, the analytic solution of $\theta_0$ (Appendix A) controls the total filtering intensity, which cannot be guaranteed to be zero. This setting limits the search space when training the network and reduces the robustness of the NL block. Thus, the NL operator cannot magnify features over a large range and may damp discriminative features such as the beak of the waterfowl in fig. 1. Our SNL operator can also degrade into the NS operator by setting $\mathbf{W}_1 = -\mathbf{W}_2$, i.e. $\theta_0 = -\theta_1$. However, forcing the analytic solution of $\theta_0 + \theta_1$ to zero suppresses the filter strength of high-frequency signals (with large eigenvalues $\lambda$) such as small parts or twigs. Thus, it still cannot magnify discriminative parts such as the beak of the waterfowl shown in fig. 1. Compared with NL and NS, our SNL does not have these irrational constraints and gives these two parameters a liberal learning space: $\mathbf{W}_1$ can control the preserving strength of the discriminative features, while $\mathbf{W}_2$ can pay more attention to the low-frequency signal to diminish the noise.
3.2 The Proposed Generalized Spectral Nonlocal Operator
To fully exploit the "full-range" dependencies, the nonlocal block should have the ability to be consecutively stacked into the network to form a deeper nonlocal structure. However, some types of nonlocal blocks such as the NL and CGNL blocks cannot achieve this (ns). To show the robustness of our SNL block in deeper nonlocal structures, we first study the steady-state of the deeper nonlocal structure formed by consecutively adding our SNL blocks. We also prove the stable hypothesis: the deeper nonlocal structure tends to learn a stable affinity. Based on this hypothesis, we extend our SNL block into a full-order Chebyshev approximation, i.e. the gSNL block, which is more applicable to deeper nonlocal structures.
The stable hypothesis Steady-state analysis can be used to analyze the stable dynamics of the nonlocal block. Here we give the steady-state analysis of our SNL block when it is consecutively added into the network structure, and obtain the stable hypothesis:
Lemma 1.
The Stable Hypothesis: when more than two consecutively-connected SNL blocks with the same affinity matrix $\mathbf{A}$ are added into the network structure, these SNL blocks are stable when the affinity matrix satisfies $\mathbf{A}^2\mathbf{Z} = \mathbf{A}\mathbf{Z}$.
Proof.
The stability holds when the weight parameters in $\mathbf{W}_1$ and $\mathbf{W}_2$ are small enough that the CFL condition is satisfied (ns). Ignoring them for simplicity, the discrete nonlinear operator of our SNL has a formulation similar to the NS operator:

$\mathbf{Z}_{n+1} = \mathbf{Z}_n + h(\mathbf{A} - \mathbf{I})\mathbf{Z}_n$

where $h$ is the discretization parameter and $\mathbf{Z}_n$ is the input of the $n$th block in the deeper nonlocal structure, with $\mathbf{Z}_0 = \mathbf{Z}$. The stable assumption demands that $\mathbf{Z}_{n+1} = \mathbf{Z}_n = \mathbf{Z}^*$, so the steady-state equation of the last SNL block can be written as:

$\mathbf{Z}^* = \mathbf{Z}^* + h(\mathbf{A} - \mathbf{I})\mathbf{Z}^* \implies \mathbf{A}\mathbf{Z}^* = \mathbf{Z}^*$

The deeper nonlocal structure has more than one SNL block, so $\mathbf{Z}$ and $\mathbf{A}$ can be used to express $\mathbf{Z}^*$ via the previous block: $\mathbf{Z}^* = \mathbf{Z} + h(\mathbf{A} - \mathbf{I})\mathbf{Z}$. Finally, the steady-state equation becomes:

$(1-h)\mathbf{A}\mathbf{Z} + h\mathbf{A}^2\mathbf{Z} = (1-h)\mathbf{Z} + h\mathbf{A}\mathbf{Z}$

whose stable solution satisfies $\mathbf{A}^2\mathbf{Z} = \mathbf{A}\mathbf{Z}$. This equation naturally extends to the k-hop affinity matrix $\mathbf{A}^k$, i.e. $\mathbf{A}^k\mathbf{Z} = \mathbf{A}\mathbf{Z}$. ∎
To verify the stable hypothesis, we add five consecutively-connected SNL blocks (and NS blocks) into PreResNet56 (preresnet) and train this model on the training set of the CIFAR-100 dataset with a step-decayed initial learning rate, weight decay, and momentum. Then we test the trained model on the test set and record the affinity matrix for each image. Figure 2 shows the statistics that reflect the strength of the affinity matrix and its 2-hop, 3-hop, and 4-hop powers: $\mathbf{A}$, $\mathbf{A}^2$, $\mathbf{A}^3$, $\mathbf{A}^4$. We can see that the number of elements in each histogram bin is nearly the same, i.e. $\mathbf{A}$, $\mathbf{A}^2$, $\mathbf{A}^3$, $\mathbf{A}^4$ have similar distributions over all their elements, which empirically verifies the stable-state equation $\mathbf{A}^k\mathbf{Z} = \mathbf{A}\mathbf{Z}$.
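The contraction behind the stable hypothesis can also be checked on synthetic data. The sketch below builds a positive, symmetric, row-normalized affinity matrix (made approximately doubly stochastic by a Sinkhorn-style iteration, an assumption used here so that symmetry survives the normalization) and shows that successive k-hop refinements shrink toward a fixed point with $\mathbf{A}\mathbf{Z}^* = \mathbf{Z}^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 32, 8

# Positive symmetric row-normalized affinity via Sinkhorn-style iteration.
M = rng.random((N, N))
for _ in range(200):
    M = 0.5 * (M + M.T)                    # symmetrize
    M = M / M.sum(axis=1, keepdims=True)   # row-normalize
A = 0.5 * (M + M.T)
Z = rng.standard_normal((N, C))

# The gap between A^{k+1} Z and A^k Z shrinks with k: an empirical face of
# the stable-state equation A^k Z = A Z for deeper stacks.
d1 = np.linalg.norm(A @ (A @ Z) - A @ Z)
d2 = np.linalg.norm(A @ (A @ (A @ Z)) - A @ (A @ Z))
print(d2 < d1)
```

Since $\mathbf{A}$ is symmetric with positive entries, all non-leading eigenvalues have magnitude below one, so each extra application of $\mathbf{A}$ strictly shrinks the residual.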
Full-order spectral nonlocal operator With the stable hypothesis, the Chebyshev polynomials can be simplified into a piecewise function (details in Appendix B). Taking this piecewise function into Eq. (6), we get the full-order approximation of the SNL operator:

(10) $\mathcal{F}^*(\mathbf{A}, \mathbf{Z}) = \theta^*_0 \mathbf{Z} + \theta^*_1 \mathbf{A}\mathbf{Z} + \theta^*_2 (2\mathbf{A} - \mathbf{I})\mathbf{Z}$

where $\theta^*_0 = \sum_{k \bmod 4 = 0} \theta'_k$, $\theta^*_1 = -\sum_{k \bmod 2 = 1} \theta'_k$, and $\theta^*_2 = \sum_{k \bmod 4 = 2} \theta'_k$. Then, extending it into multi-channel input and output with the residual connection, we get our gSNL block:

(11) $\mathbf{Y} = \mathbf{X} + \mathbf{Z}\mathbf{W}_1 + \mathbf{A}\mathbf{Z}\mathbf{W}_2 + (2\mathbf{A} - \mathbf{I})\mathbf{Z}\mathbf{W}_3$

The gSNL block performs well when the stable affinity hypothesis is satisfied, i.e. when more than two nonlocal blocks with the same affinity matrix are added, as shown in Table 4.
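A minimal sketch of the gSNL block of Eq. (11); the grouping of the three terms follows our reading of the full-order collapse under the stable hypothesis, and all names are illustrative:

```python
import numpy as np

def gsnl_block(X, A, W_z, W_1, W_2, W_3):
    """gSNL block sketch: Y = X + Z W_1 + A Z W_2 + (2A - I) Z W_3, Z = X W_z.
    The (2A - I) term collects the even-order Chebyshev pieces that survive
    under the stable hypothesis."""
    Z = X @ W_z
    two_hop = (2.0 * A - np.eye(A.shape[0])) @ Z
    return X + Z @ W_1 + (A @ Z) @ W_2 + two_hop @ W_3
```

With `W_3 = 0`, the block reduces exactly to the SNL block of Eq. (9), which matches the claim that gSNL generalizes SNL.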
3.3 Implementation Details
The implementation of the gSNL block is shown in fig. 3. The input feature map $\mathbf{X}$ is first fed into three 1x1 convolutions with weight kernels $\mathbf{W}_Z$, $\mathbf{W}_{\phi}$, $\mathbf{W}_{\psi}$ to reduce the number of channels. One of the outputs is used as the transferred feature map $\mathbf{Z}$ to reduce the computational complexity, while the other two outputs $\boldsymbol{\Phi}$, $\boldsymbol{\Psi}$ are used to obtain the affinity matrix $\mathbf{A}$. The number of transferred channels $C_s$ is usually set to half the number of input channels $C_1$. The affinity matrix is calculated by the affinity kernel function and then made nonnegative, symmetric, and row-normalized with the operations in Sec. 3.1. Finally, with the affinity matrix $\mathbf{A}$ and the transferred feature map $\mathbf{Z}$, the output of the nonlocal block is obtained by Eq. (11). Specifically, the three weight matrices $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{W}_3$ are implemented as three 1x1 convolutions.
4 Experiment
4.1 Setting
Datasets Our proposed SNL and gSNL blocks are evaluated across several computer vision tasks, including image classification and video-based action recognition. For image classification, both the CIFAR-10 and CIFAR-100 datasets (cifar) are tested; each contains 60,000 images, of 10 and 100 classes respectively, split into 50,000 training and 10,000 test images. We also conduct experiments on fine-grained classification with the Birds-200-2011 (CUB-200) dataset (cub), which contains 11,788 images of 200 bird categories. For action recognition, the experiments are conducted on the UCF101 dataset (ucf), which contains 101 different actions.
Backbones For image classification, ResNet50 and the PreResNet variants (including both PreResNet20 and PreResNet56) are used as the backbone networks. For the video classification task, we follow the I3D structure (I3d), which uses inflated 3D kernels to replace the convolution operators in the residual blocks.
Setting for the network In the main experiments, we set $C_s = C_1/2$. Without loss of generality, we use the "Dot Product" as the affinity kernel in the experiments. We add one SNL (or gSNL) block into these backbone networks to construct the SNL (or gSNL) network. For ResNet and I3D (I3d), following nl, we add the SNL block right before the last residual block. For the PreResNet series, we add the SNL block right after the second residual block in the early stage. For the other nonlocal-based blocks, such as the original nonlocal block (nl), the nonlocal stage (ns), and the compact generalized nonlocal block (cgnl), the settings are all the same as ours to make a fair comparison.
Setting for the training For image classification on the CIFAR-10 and CIFAR-100 datasets, we train the models end-to-end without using a pretrained model, with a step-decayed initial learning rate, weight decay, and momentum.
For fine-grained classification on the CUB-200 dataset, we use models pretrained on ImageNet (imagenet) to initialize the weights and train with a step-decayed learning rate; the weight decay and momentum are the same as in the CIFAR-10 and CIFAR-100 setting. For video classification on the UCF101 dataset, the weights are initialized with an I3D model pretrained on the Kinetics dataset (kinetics) and again trained with a step-decayed learning rate; the weight decay and momentum are the same as in the CIFAR-10 and CIFAR-100 setting.
4.2 Ablation Experiment
model  Self-Preserving  Self-Attention  Approximate Conditions  Channel-Wise
NL    ✓  $\mathbf{W}_1 = \mathbf{0}$  
A2    ✓    
CGNL    ✓    ✓
NS  ✓  ✓  $\mathbf{W}_1 = -\mathbf{W}_2$ and $K = 2$  
*SNL  ✓  ✓  $K = 2$  
*gSNL  ✓  ✓    
The number of channels in transferred feature space The nonlocal-based block first reduces the $C_1$ channels of the original feature map into the transferred feature space with $C_s$ channels by a 1x1 convolution to lower the computational complexity. When $C_s$ is too large, the feature map contains redundant information, which introduces noise when calculating the affinity matrix $\mathbf{A}$. However, if $C_s$ is too small, it is hard to reconstruct the output feature map due to inadequate features. To test the robustness to the number of transferred channels, we build three types of models with different numbers of transferred channels: "Sub 1" ($C_s = C_1$), "Sub 2" ($C_s = C_1/2$), and "Sub 4" ($C_s = C_1/4$), as shown in Table 3. Other parameters of the models and the training steps are the same as in Sec. 4.1. Table 3 shows the experimental results of the three types of models with different nonlocal blocks. Our SNL and gSNL blocks outperform the other models, profiting from their flexibility in learning. Moreover, from Table 3, we can see that the performance of CGNL drops steeply when the number of transferred channels increases. This is because the CGNL block models the relationships between channels; when the number of sub-channels increases, the relationships involving the redundant channels seriously interfere with its effect. Overall, our proposed nonlocal block is the most robust to a large number of transferred channels (our model rises in Top-1 accuracy compared with the baseline, while the best of the others rises less).
model  top1  top5  

  PR56  75.33%  93.97% 
Sub 1  + NL  75.29%  94.07% 
+ NS  75.39%  93.00%  
+ A2  75.51%  92.90%  
+ CGNL  74.71%  93.60%  
+ *SNL  76.34%  94.48%  
+ *gSNL  76.21%  94.42%  
Sub 2  + NL  75.31%  92.84% 
+ NS  75.83%  93.87%  
+ A2  75.58%  94.27%  
+ CGNL  75.75%  93.47%  
+ *SNL  76.41%  94.38%  
+ *gSNL  76.07%  94.16%  
Sub 4  + NL  75.50%  93.75% 
+ NS  75.61%  93.66%  
+ A2  75.61%  93.61%  
+ CGNL  75.27%  93.05%  
+ *SNL  76.02%  94.08%  
+ *gSNL  76.05%  94.21% 
model  top1  top5  

  PR56  75.33%  93.97% 
Stage 1  + NL  75.31%  92.84% 
+ NS  75.83%  93.87%  
+ A2  75.58%  94.27%  
+ CGNL  75.75%  93.47%  
+ *SNL  76.41%  94.38%  
+ *gSNL  76.07%  94.16%  
Stage 2  + NL  75.64%  93.79% 
+ NS  75.74%  94.02%  
+ A2  75.60%  93.82%  
+ CGNL  74.64%  92.65%  
+ *SNL  76.29%  94.27%  
+ *gSNL  76.02%  93.98%  
Stage 3  + NL  75.28%  93.93% 
+ NS  75.44%  93.86%  
+ A2  75.21%  93.65%  
+ CGNL  74.90%  92.46%  
+ *SNL  75.68%  93.90%  
+ *gSNL  75.74%  93.78% 
The stage for adding the nonlocal blocks The nonlocal-based blocks can be added into different stages of the PreResNet (or ResNet) to form the nonlocal net. In ns, the nonlocal-based blocks are added into the early stage of the PreResNet to catch the long-range correlations. Here we test the performance of adding the different types of nonlocal blocks into the three stages (the first, second, and third stage of the PreResNet) and train the models on the CIFAR-100 dataset with the settings discussed in Sec. 4.1. The experimental results are shown in Table 3. We can see that the performance of the NL block is lower than the backbone when it is added into the early stage. However, our proposed SNL block improves over the backbone when added into any of the three stages, and by a larger margin than the other types of nonlocal blocks.
To intuitively show the stability and robustness of our SNL, we give a spectrum analysis of the estimated weight matrices (ns). We extract the self-attention weight matrix $\mathbf{W}$ of the NL block and the NS block, and $\mathbf{W}_2$ of our proposed SNL block. To make all the eigenvalues real, we symmetrize each weight matrix before the eigendecomposition, i.e. $\mathbf{W} \leftarrow (\mathbf{W} + \mathbf{W}^{\top})/2$. Figure 5 shows the top eigenvalues of the weight matrices of the models in Table 3. We can see that the density of the negative eigenvalues is higher than that of the positive eigenvalues for the NL block in all three stages. This phenomenon drives the NL operator in Eq. (1) below zero, so the output feature map is weakened relative to the input feature map (more details of this phenomenon can be found in ns). The NS block can avoid this "damping effect" to some extent by following the diffusion nature. However, when it is added into the early stage, only six eigenvalues of the nonlocal stage are nonzero, so the nonlocal stage cannot effectively magnify the discriminative features. Compared with these two models, our proposed SNL block has more positive eigenvalues, which helps enhance the discriminative features while also avoiding the damping effect.

The number of the nonlocal blocks We test the robustness of adding multiple nonlocal blocks into the backbone network, forming the three model types "Different Position 3 (DP 3)", "Same Position 3 (SP 3)", and "Same Position 5 (SP 5)" shown in Table 4. For the model "DP 3", one block is added into each of the three stages (right after the second residual block). We can see that adding three of our proposed nonlocal operators into different stages of the backbone generates a larger improvement than the NS operator and the NL operator. This is because when NS and NL are added into the early stage, these two models cannot aggregate the low-level features well and interfere with the following blocks. For the models "SP 3" ("SP 5"), we add three (five) consecutively-connected nonlocal blocks into the same stage. Note that, different from the experiments in ns and nl, these consecutively-connected nonlocal blocks share the same affinity matrix. From Table 4, we can see that, profiting from the stable hypothesis discussed in Sec. 3.2, our gSNL outperforms all the other models when adding consecutively-connected nonlocal blocks and has a relatively stable performance. However, one drawback is that our gSNL may interfere with learning when only one nonlocal block is added (the stable hypothesis is not satisfied).
model  top1  top5  model  top1  top5  
  PR56  75.33%  93.97%    PR56  75.33%  93.97% 
1  + NL  75.31%  92.84%  SP 3  + NL  75.43%  93.67% 
+ NS  75.83%  93.87%  + NS  75.30 %  93.74%  
+ A2  75.58%  94.27%  + A2  75.23%  94.03%  
+ CGNL  75.75%  93.47%  + CGNL  75.64%  93.05%  
+ *SNL  76.41%  94.38%  + *SNL  75.70%  94.10%  
+ *gSNL  76.07%  94.16%  + *gSNL  76.16%  94.32%  
DP 3  + NL  74.34%  93.31%  SP 5  + NL  75.13%  93.53% 
+ NS  75.00%  93.57%  + NS  75.25%  94.00%  
+ A2  75.63%  94.12%  + A2  75.61%  93.81%  
+ CGNL  75.96%  93.10%  + CGNL  75.15%  92.93%  
+ *SNL  76.70%  93.94%  + SNL  76.04%  94.19%  
+ *gSNL  76.45%  94.53%  + gSNL  76.04%  94.35% 
4.3 Main Results
We test the networks with the NL, NS, CGNL, A2, and our SNL (gSNL) blocks on different visual learning tasks. The experimental settings are discussed in Sec. 4.1. Our models outperform the other types of nonlocal blocks across several standard benchmarks. Table 7 shows the experimental results on the CIFAR-10 dataset: by adding one proposed block, the Top-1 accuracy rises more than with the other types of nonlocal blocks. For the experiments on the CIFAR-100 dataset shown in Table 7, using our proposed block brings a clear improvement with ResNet50. With the simpler backbone PreResNet56, our model still generates an improvement, as shown in Table 7.

Table 9 shows the experimental results for the fine-grained image classification task on the CUB-200 dataset. Our model outperforms the other non-channel-concerning blocks. Compared with the channel-wise concerning CGNL block, our model is only slightly lower in Top-1 accuracy. Visual examples are given in fig. 4. We can see that the feature maps of our proposed block cover more of the critical areas of the birds, such as the wings (red square) and webs (green square). Table 9 also shows the experimental results on the action recognition task. The network with our proposed block improves over the I3D model and outperforms all the other nonlocal models on the UCF101 dataset.
model  top1  top5 

PR20  94.94%  99.87% 
+ NL  94.01%  99.82% 
+ NS  95.15%  99.88% 
+ A2  92.44%  99.86% 
+ CGNL  94.49%  99.92% 
+ *SNL  94.69%  99.84% 
+ *gSNL  95.59%  99.92% 
model  top1  top5 

PR56  75.33%  93.97% 
+ NL  75.31%  92.84% 
+ NS  75.83%  93.87% 
+ A2  75.58%  94.27% 
+ CGNL  75.75%  93.47% 
+ *SNL  76.41%  94.38% 
+ *gSNL  76.07%  94.16% 
model  top1  top5 

R50  76.50%  93.14% 
+ NL  76.77%  93.55% 
+ NS  77.90%  94.34% 
+ A2  77.30%  93.40% 
+ CGNL  74.88%  92.56% 
+ *SNL  78.17%  94.17% 
+ *gSNL  77.28%  93.63% 
model  top1  top5 

I3D  81.57%  95.40% 
+ NL  81.37%  95.76% 
+ NS  82.50%  95.84% 
+ A2  82.68%  95.85% 
+ CGNL  83.16%  96.16 % 
+ *SNL  82.30%  95.56% 
+ *gSNL  83.21%  96.53% 
model  top1  top5 

R50  85.43%  96.70% 
+ NL  85.34%  96.77% 
+ NS  85.54%  96.56% 
+ A2  86.02%  96.56% 
+ CGNL  86.14%  96.34% 
+ *SNL  85.91%  96.65% 
+ *gSNL  85.95%  96.79% 
5 Conclusion
In this paper, we explain the nonlocal block in the graph view and propose the spectral nonlocal (SNL) block, which is more robust and well-behaved. Our SNL block is a generalized version of the NL and NS blocks and has more liberty for parameter learning. We also give the stable hypothesis for deeper nonlocal structures and extend the SNL into the gSNL, which can be applied to deeper nonlocal structures. Experiments on multiple computer vision tasks show the high robustness and performance of our proposed nonlocal blocks. Beyond the classification tasks explored in this work, we expect the SNL and gSNL to be applied to more complex tasks, e.g. trajectory prediction in video (zhang2019stochastic).
References
Appendix A Analytic Solution of the Chebyshev Approximate
Here we give the analytic solution for the coefficients of the Chebyshev polynomials (chebApr):
Theorem 2.
Given a function $f(x)$ with $x \in [-1, 1]$, it can be optimally approximated by Chebyshev polynomials $f(x) \approx \sum_{k=0}^{K-1} \hat{\theta}_k T_k(x)$ only when $\hat{\theta}_k$ satisfies:

$\hat{\theta}_k = \frac{2 - \delta_{k0}}{\pi} \int_{-1}^{1} \frac{f(x)\,T_k(x)}{\sqrt{1 - x^2}}\,dx$

where $\delta_{k0} = 1$ if $k = 0$ and $0$ otherwise. We call $\hat{\theta}_k$ the analytic solution of the Chebyshev coefficients.
Based on this theorem, we can get the analytic solution of the parameters of Eq. (7):
Lemma 2.
The spectral nonlocal operator is best approximated when the graph filter function $g(\lambda)$ is best approximated by the Chebyshev polynomials, i.e. the analytic solutions of the Chebyshev coefficients satisfy:

(12) $\hat{\theta}_k = \frac{2 - \delta_{k0}}{\pi} \int_{-1}^{1} \frac{g(\lambda)\,T_k(\lambda)}{\sqrt{1 - \lambda^2}}\,d\lambda$
Appendix B The Piecewise Chebyshev Polynomials
Taking the stable-state equation $\mathbf{A}^k\mathbf{Z} = \mathbf{A}\mathbf{Z}$ into the Chebyshev polynomials of the affinity matrix (with $\tilde{\mathbf{L}} = -\mathbf{A}$), the Chebyshev polynomials become cyclic when applied to $\mathbf{Z}$:

(13) $T_0\mathbf{Z} = \mathbf{Z}$, $T_1\mathbf{Z} = -\mathbf{A}\mathbf{Z}$, $T_2\mathbf{Z} = (2\mathbf{A} - \mathbf{I})\mathbf{Z}$, $T_3\mathbf{Z} = -\mathbf{A}\mathbf{Z}$, $T_4\mathbf{Z} = \mathbf{Z}$, $\dots$

This cyclic form of the Chebyshev polynomials can be reformulated as a piecewise function:

(14) $T_k(-\mathbf{A})\mathbf{Z} = \begin{cases} \mathbf{Z} & k \bmod 4 = 0 \\ -\mathbf{A}\mathbf{Z} & k \bmod 2 = 1 \\ (2\mathbf{A} - \mathbf{I})\mathbf{Z} & k \bmod 4 = 2 \end{cases}$
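The period-4 cycle in Eq. (14) can be checked numerically with an idempotent affinity matrix (uniform averaging, so $\mathbf{A}^2 = \mathbf{A}$ and the stable-state equation holds exactly); a minimal sketch:

```python
import numpy as np

N = 8
A = np.full((N, N), 1.0 / N)        # idempotent averaging matrix: A @ A == A
rng = np.random.default_rng(0)
Z = rng.standard_normal((N, 3))

# Chebyshev recurrence T_{k+1} = 2 Lt T_k - T_{k-1} applied to Z, with Lt = -A.
Lt = -A
T = [Z, Lt @ Z]
for k in range(2, 6):
    T.append(2.0 * Lt @ T[-1] - T[-2])

# Under A^k Z = A Z the sequence cycles with period 4:
# T0 Z = Z, T1 Z = -A Z, T2 Z = (2A - I) Z, T3 Z = -A Z, T4 Z = Z, ...
print(np.allclose(T[4], T[0]), np.allclose(T[3], T[1]))
```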
Appendix C Experiment of Semantic Segmentation on VOC2012 Dataset
For the semantic segmentation task, we conduct experiments on the VOC2012 dataset with the model proposed by deeplab. We add the different types of nonlocal blocks right before the last residual block of the ResNet50. The models are trained for 50 epochs with the SGD optimization algorithm, using a step-decayed learning rate, weight decay, and momentum. Experimental results show that the model with our proposed block achieves the best results.
model  mIoU  fwIoU  acc 

R50  0.713  0.868  0.926 
+ NL  0.722  0.872  0.927 
+ NS  0.722  0.873  0.927 
+ A2  0.723  0.874  0.928 
+ CGNL  0.722  0.872  0.928 
+ SNL  0.726  0.875  0.930 
+ gSNL  0.727  0.875  0.929 
Appendix D Examples of the Affinity Matrix on the CUB Dataset
Experiments to verify the stable hypothesis are also conducted on the CUB dataset. We add three consecutively-connected SNL blocks (and NS blocks) into ResNet50 (right before its last residual block) and train this model on the CUB training set with a step-decayed learning rate, weight decay, and momentum. Figure 6 shows the histogram of the strength statistics of the affinity matrices. We can see that, although a different backbone and dataset are used, the distributions of the k-hop affinity matrices correspond with the experiments on CIFAR-100.