Capturing long-range spatial-temporal dependencies is crucial for deep Convolutional Neural Networks (CNNs) to extract discriminative features in vision tasks such as image and video classification. However, the traditional convolution operator processes only a local neighborhood at a time. CNNs therefore have to go deeper to enlarge their receptive fields, which leads to higher computation and memory costs. Moreover, going deeper does not always increase the effective receptive field, due to the Gaussian distribution of the kernel weights (widelimit). To overcome this limitation, some recent works design network architectures with wider, carefully designed modules to capture long-range dependencies (largeconv, deeplab, pspnet). Although they have larger receptive fields, these modules still need to be applied recursively to capture the dependencies between pairs of positions that are far apart.
Inspired by the classical non-local means method in image denoising, nl proposes the nonlocal neural network, which uses the nonlocal (NL) block to capture the “full-range” dependencies in a single module by exploring the correlations between each position and all other positions. In the NL block, the affinity matrix is first computed to represent the correlations between each pair of positions. Then the weighted means of the features are calculated based on the affinity matrix to refine the feature representation. Finally, a residual connection is added to the refined feature map. Due to its simplicity and effectiveness, the nonlocal block has recently been widely used in image and video classification (nl; cgnl; ns; a2net), image segmentation (ccnet; cgnl; nl) and person re-identification (nlreid; scan).
However, due to the complexity of the affinity matrix, the nonlocal block (composed of a nonlocal operator and a residual connection) requires much more computation and is sensitive to its number and position in the neural network (ns). Some works address the first problem by simplifying the calculation of the affinity matrix, such as ccnet, cgdnet, cgnl and a2net. Only a few works address the second problem, which limits the robustness of the nonlocal network (composed of nonlocal blocks and a backbone network). ns proposes the nonlocal stage (NS) block, which follows the diffusion nature and maintains the same affinity matrix for all the nonlocal units in the NS block. Compared with the NL block, the NS block is insensitive to the number of blocks and allows a deeper nonlocal structure. However, the deeper nonlocal structure of the NS block increases the complexity and does not bring a remarkable improvement. Recent work on dynamical systems that uses efficient low-rank approximation for learning (she2018reduced; she2018stochastic), or that considers complex layer-wise dynamics with Recurrent Neural Networks (she2019neural; shesupplementary), can inform this research through its well-explored mathematical formulations.
In this work, we focus on designing a robust nonlocal block that is more flexible to use in neural networks. We prove that the nonlocal operator in the nonlocal block is equivalent to a Chebyshev-approximated fully-connected graph filter with irrational constraints that limit its freedom in parameter learning. To remove these irrational constraints, we propose the Spectral Nonlocal (SNL) block, which is more robust and degrades into the NL and NS blocks under specific assumptions. With the help of steady-state analysis, we also prove that the deeper nonlocal structure satisfies the stable hypothesis. Based on this hypothesis, we give the full-order approximated spectral nonlocal (gSNL) block, which performs well in deeper nonlocal structures. Finally, we add our proposed nonlocal blocks into deep networks and evaluate them on image and video classification tasks. Experiments show that the networks with our proposed blocks are more robust and achieve higher accuracy than those using other types of nonlocal blocks. To summarize, our contributions are threefold:
We propose a spectral nonlocal (SNL) block as an efficient, simple, and generic component for capturing long-range spatial-temporal dependencies with deep neural networks, which is a generalization of the classical nonlocal blocks.
We propose the stable hypothesis, which can enable the deeper nonlocal structure without an elaborate preparation for both the number and position of the building blocks. We further extend SNL into generalized SNL (gSNL), which can enable multiple nonlocal blocks to be plugged into the existing computer vision architectures with stable learning dynamics.
Both SNL and gSNL outperform other nonlocal blocks on several image and video classification tasks with a clear-cut improvement.
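Nonlocal block. The NL block consists of an NL operator with a residual connection and is expressed as:
$$Y = X + F(A, Z), \quad Z = X W_Z, \tag{1}$$
where $X \in \mathbb{R}^{N \times C_1}$ is the input feature map, $F(A, Z)$ is the NL operator, and $Z \in \mathbb{R}^{N \times C_s}$ is the transferred feature map that compresses the channels of $X$ by a linear transformation with kernel $W_Z \in \mathbb{R}^{C_1 \times C_s}$. Here $N$ is the number of positions. The affinity matrix $A \in \mathbb{R}^{N \times N}$ is composed of the pairwise correlations between pixels.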
In the NL block, the NL operator explores the “full-range” dependencies by considering the relationships between all pairs of positions:
$$F(A, Z) = A Z W, \quad A = (a_{ij})_{N \times N}, \quad a_{ij} = f(X_i, X_j), \tag{2}$$
where $W \in \mathbb{R}^{C_s \times C_1}$ is the weight matrix of a linear transformation and $f(\cdot)$ is the affinity kernel, which can be the “Dot Product”, “Traditional Gaussian”, “Embedded Gaussian” or any other kernel whose matrix has a finite Frobenius norm.
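As an illustration of the NL block described above, the following NumPy sketch computes a residual update of the form Y = X + A Z W with a dot-product affinity. The function name, the toy shapes, and the scaling of the affinity by the number of positions are assumptions made for this example and are not taken from the text.

```python
import numpy as np

def nl_block(X, W_z, W):
    """Sketch of an NL block: Y = X + A Z W, with Z = X W_z.

    X: (N, C1) flattened feature map, W_z: (C1, Cs), W: (Cs, C1).
    """
    N = X.shape[0]
    Z = X @ W_z              # transferred feature map, (N, Cs)
    A = (X @ X.T) / N        # dot-product affinity, scaled by N (assumption)
    return X + A @ Z @ W     # NL operator plus residual connection

rng = np.random.default_rng(0)
N, C1, Cs = 6, 8, 4          # toy sizes, chosen only for illustration
X = rng.standard_normal((N, C1))
W_z = 0.1 * rng.standard_normal((C1, Cs))
W = 0.1 * rng.standard_normal((Cs, C1))

Y = nl_block(X, W_z, W)      # same shape as X: (N, C1)
```

With a zero output kernel W the block reduces to the identity, which is why such blocks can be inserted into a pretrained backbone without changing its initial behavior.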
Nonlocal stage. To make the NL operator follow the diffusion nature, which allows a deeper nonlocal structure (ns), the nonlocal stage (NS) operator replaces the affinity matrix in the NL operator with the graph Laplacian:
$$\bar{F}(A, Z) = (A - D) Z W = -L Z W, \quad D = \mathrm{diag}(d_1, \dots, d_N), \quad d_i = \sum_j a_{ij}, \tag{3}$$
where $\bar{F}(A, Z)$ is the NS operator, $L = D - A$ is the graph Laplacian and $d_i$ is the degree of node $i$. Moreover, when multiple blocks share the same affinity matrix and the NL operator is replaced by the NS operator, these consecutively-connected blocks form the NS block. We call the nonlocal blocks inside the NS block the NS units.
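To make the diffusion reading of the NS operator concrete, the sketch below assumes the operator has the form (A - D) Z W, with D the diagonal matrix of row sums of A, and checks that this equals the pairwise form in which row i aggregates a_ij (Z_j - Z_i). The names and toy sizes are illustrative assumptions.

```python
import numpy as np

def ns_operator(A, Z, W):
    """Sketch of the NS operator (A - D) Z W, where D holds the row sums of A."""
    D = np.diag(A.sum(axis=1))
    return (A - D) @ Z @ W

rng = np.random.default_rng(0)
N, Cs, C1 = 5, 3, 6
A = rng.random((N, N))
A = 0.5 * (A + A.T)                      # symmetric, non-negative toy affinity
Z = rng.standard_normal((N, Cs))
W = rng.standard_normal((Cs, C1))

# Equivalent "diffusion" form: row i is sum_j a_ij (Z_j - Z_i), then times W.
diffusion = np.stack(
    [(A[i][:, None] * (Z - Z[i])).sum(axis=0) for i in range(N)]
) @ W
matches_diffusion = np.allclose(ns_operator(A, Z, W), diffusion)

# A constant signal lies in the null space of the Laplacian: the output is zero.
constant_is_fixed = np.allclose(ns_operator(A, np.ones((N, Cs)), W), 0.0)
```

Because only differences between positions enter the update, a feature map that is already constant across positions is left unchanged by the residual block.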
The nonlocal operator can be divided into two steps: calculating the affinity matrix $A$ to represent the correlations between each pair of positions, and refining the feature map by calculating the weighted means based on $A$. In this section, a fully-connected graph filter is used to explain the nonlocal operator. With the Chebyshev approximation, we propose the SNL operator, which is proved to be a generalized form of the NL and NS operators and is more robust, with higher performance in computer vision tasks. Furthermore, based on the stable hypothesis that a deeper nonlocal structure tends to learn a stable affinity matrix, we extend our SNL operator into a full-order Chebyshev approximation version, i.e. the gSNL.
3.1 The Proposed Spectral Nonlocal Operator
Nonlocal operator in the graph view. The nonlocal operator is a filter that computes a weighted mean of all the positions in the feature map based on the affinity matrix $A$ and then conducts the feature transformation with the kernel $W$. This is the same as filtering the signal $Z$ by a graph filter in the fully-connected graph domain determined by the affinity matrix $A$ (gsp). From this perspective, we further define the nonlocal operator as follows.
Given an affinity matrix $A \in \mathbb{R}^{N \times N}$ and the signal $Z$, the nonlocal operator is the same as filtering the signal $Z$ in the graph domain of a fully-connected weighted graph $G$:
$$F(A, Z) = Z *_G g_\theta = U g_\theta(\Lambda) U^\top Z = U \, \mathrm{diag}(\theta_1, \dots, \theta_N) \, U^\top Z, \tag{4}$$
where $L = U \Lambda U^\top$ is the eigendecomposition of the graph Laplacian $L$ of $G$ and $\theta \in \mathbb{R}^N$ are the filter parameters.
This definition requires that the graph Laplacian has non-singular eigenvalues and eigenvectors, so the affinity matrix $A$ should be a symmetric, non-negative, row-normalized matrix. To meet this requirement, the affinity matrix can be obtained by the following steps. First, the affinity kernel is used to calculate the matrix $M$ (we use the dot product with embedded weight matrices as the affinity kernel). Then we make the matrix symmetric: $\hat{M} = (M + M^\top)/2$. Finally, we normalize the rows of $\hat{M}$ so that every degree satisfies $d_i = 1$, which gives $A = D_{\hat{M}}^{-1} \hat{M}$. In the following sections the symmetric, non-negative, row-normalized matrix is denoted as $A$.
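The three steps for obtaining the affinity matrix (kernel, symmetrization, row normalization) can be sketched as follows. The embedding names `Phi` and `Psi`, the clipping of negative entries to enforce non-negativity, and the small `eps` guard are assumptions introduced for this example; the text does not state how non-negativity is enforced.

```python
import numpy as np

def build_affinity(Phi, Psi, eps=1e-12):
    """Sketch: dot-product kernel -> symmetrize -> row-normalize."""
    M = Phi @ Psi.T                          # raw affinity, (N, N)
    M = np.maximum(M, 0.0)                   # assumption: clip negatives to keep A non-negative
    M_hat = 0.5 * (M + M.T)                  # make the matrix symmetric
    degree = M_hat.sum(axis=1, keepdims=True)
    return M_hat / np.maximum(degree, eps)   # rows with positive degree now sum to 1

rng = np.random.default_rng(0)
N, Cs = 6, 4
Phi = rng.random((N, Cs))                    # non-negative toy embeddings
Psi = rng.random((N, Cs))
A = build_affinity(Phi, Psi)
```

Note that after the row normalization the result equals the inverse degree matrix times the symmetric matrix, so it is in general no longer exactly symmetric unless all degrees are equal.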
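The proposed spectral nonlocal operator. The graph filter in Eq. (4) contains $N$ parameters. To simplify it, we use the Chebyshev polynomials, which reduce the number of parameters to $K$ ($K \ll N$). For simplicity, we first assume that the input $Z$ and the output $F(A, Z)$ have only one channel.
Following a method similar to gcn, the $K$-order Chebyshev polynomials are used to approximate the graph filter function $g_\theta(\Lambda)$:
$$F(A, Z) = \sum_{k=0}^{K-1} \hat{\theta}_k T_k(\tilde{L}) Z, \quad \tilde{L} = \frac{2L}{\lambda_{\max}} - I, \quad T_k(\tilde{L}) = 2 \tilde{L} T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}), \tag{5}$$
with $T_0(\tilde{L}) = I$ and $T_1(\tilde{L}) = \tilde{L}$.
Since $L$ is a random-walk Laplacian, i.e. $L = I - A$ for the row-normalized $A$, its maximum eigenvalue satisfies $\lambda_{\max} \le 2$; taking $\lambda_{\max} = 2$ gives $\tilde{L} = L - I = -A$ (gsp). Because $T_k(-A) = (-1)^k T_k(A)$, the signs can be absorbed into the coefficients, $\theta_k = (-1)^k \hat{\theta}_k$, and Eq. (5) becomes:
$$F(A, Z) = \sum_{k=0}^{K-1} \theta_k T_k(A) Z = \theta_0 Z + \theta_1 A Z + \sum_{k=2}^{K-1} \theta_k T_k(A) Z. \tag{6}$$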
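The equivalence between a Chebyshev polynomial filter and a spectral graph filter can be checked numerically. The sketch below evaluates the sum of theta_k T_k(L~) Z with the three-term recurrence and compares it with the spectral form U g(Lambda) U^T Z. To keep the eigendecomposition real and orthogonal it uses a symmetrically normalized toy affinity, which is an assumption of this example rather than the row normalization used in the text; the coefficient values are arbitrary.

```python
import numpy as np
from numpy.polynomial import chebyshev

def cheb_filter(L_tilde, Z, theta):
    """Compute sum_k theta[k] * T_k(L_tilde) @ Z with the three-term recurrence."""
    T_prev = Z                                # T_0(L) Z = Z
    out = theta[0] * T_prev
    if len(theta) == 1:
        return out
    T_curr = L_tilde @ Z                      # T_1(L) Z = L Z
    out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * (L_tilde @ T_curr) - T_prev
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out

rng = np.random.default_rng(0)
N, C = 7, 3
M = rng.random((N, N))
M = 0.5 * (M + M.T)                           # symmetric, non-negative toy affinity
d_inv_sqrt = 1.0 / np.sqrt(M.sum(axis=1))
S = d_inv_sqrt[:, None] * M * d_inv_sqrt[None, :]
L_tilde = -S                                  # L - I with L = I - S (lambda_max taken as 2)
Z = rng.standard_normal((N, C))
theta = np.array([0.5, -0.3, 0.2, 0.1])       # K = 4 illustrative coefficients

# Spectral form: U g(Lambda) U^T Z with g(lambda) = sum_k theta_k T_k(lambda).
lam, U = np.linalg.eigh(L_tilde)
spectral = U @ (chebyshev.chebval(lam, theta)[:, None] * (U.T @ Z))
polynomial = cheb_filter(L_tilde, Z, theta)
```

The recurrence never forms an eigendecomposition, which is the practical reason for preferring the polynomial form inside a network.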
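Keeping only the terms with $k \le 1$, the first-order Chebyshev approximation of Eq. (6) becomes:
$$F(A, Z) = \theta_0 Z + \theta_1 A Z, \tag{7}$$
where $\theta_0$ and $\theta_1$ are the coefficients of the first and second terms, which are learned with SGD. Then, extending Eq. (7) to the multi-channel case, we obtain our SNL operator:
$$F_s(A, Z) = Z W_1 + A Z W_2, \tag{8}$$
where $F_s(A, Z)$ is the SNL operator and $W_1, W_2 \in \mathbb{R}^{C_s \times C_1}$. Finally, a residual connection is added to the SNL operator to form the SNL block:
$$Y = X + F_s(A, Z) = X + Z W_1 + A Z W_2. \tag{9}$$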
Relation with other nonlocal operators. As shown in fig. 1, our SNL operator degrades into the NL operator by setting $W_1 = 0$, i.e. $\theta_0 = 0$. However, the analytic solution of $\theta_0$ (Appendix A) controls the total filtering intensity and cannot be guaranteed to be 0. This setting limits the search space when training the network and reduces the robustness of the NL block. Thus, the NL operator cannot magnify features over a large range and damps some discriminative features, such as the beak of the waterfowl in fig. 1. Our SNL operator also degrades into the NS operator by setting $W_1 = -W_2$, i.e. $\theta_0 + \theta_1 = 0$. However, forcing the analytic solution of $\theta_0 + \theta_1$ (Appendix A) to zero suppresses the filter strength of the high-frequency signal (with high $\lambda$), such as small parts or twigs. Thus, the NS operator still cannot magnify discriminative parts such as the beak of the waterfowl, as shown in fig. 1. Compared with NL and NS, our SNL does not have these irrational constraints and gives these two parameters a free learning space. Thus, $W_1$ can control the strength with which the discriminative features are preserved, while $W_2$ can pay more attention to the low-frequency signal to diminish the noise.
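The two special cases discussed above can be verified with a few lines of NumPy. The sketch assumes an SNL operator of the form Z W1 + A Z W2 and a row-normalized affinity, so that the degree matrix is the identity and the NS operator reads (A - I) Z W. All names and sizes are illustrative.

```python
import numpy as np

def snl_operator(A, Z, W1, W2):
    """SNL operator sketch: Z W1 + A Z W2."""
    return Z @ W1 + A @ Z @ W2

rng = np.random.default_rng(0)
N, Cs, C1 = 5, 3, 6
M = rng.random((N, N))
M = 0.5 * (M + M.T)
A = M / M.sum(axis=1, keepdims=True)              # row-normalized, so every degree is 1
Z = rng.standard_normal((N, Cs))
W = rng.standard_normal((Cs, C1))

nl_out = A @ Z @ W                                # NL operator: A Z W
ns_out = (A - np.eye(N)) @ Z @ W                  # NS operator with D = I: (A - I) Z W

as_nl = snl_operator(A, Z, np.zeros_like(W), W)   # constraint W1 = 0
as_ns = snl_operator(A, Z, -W, W)                 # constraint W1 = -W2
```

Both special cases tie the identity term to a fixed value, whereas the general operator lets the two kernels be learned independently.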
3.2 The proposed generalized Spectral Nonlocal Operator
To fully exploit the “full-range” dependencies, a nonlocal block should be able to be stacked consecutively in the network to form a deeper nonlocal structure. However, some types of nonlocal blocks, such as the NL and CGNL blocks, cannot achieve this (ns). To show the robustness of our SNL block in a deeper nonlocal structure, we first study the steady state of the deeper nonlocal structure when SNL blocks are added consecutively. We also prove the stable hypothesis that the deeper nonlocal structure tends to learn a stable affinity matrix. Based on this hypothesis, we extend our SNL block into a full-order Chebyshev approximation, i.e. the gSNL block, which is more suitable for deeper nonlocal structures.
The stable hypothesis. Steady-state analysis can be used to analyze the stable dynamics of the nonlocal block. Here we give the steady-state analysis of our SNL block when it is added consecutively into the network structure and obtain the Stable Hypothesis:
The Stable Hypothesis: when more than two consecutively-connected SNL blocks with the same affinity matrix $A$ are added into the network structure, these SNL blocks are stable when the affinity matrix $A$ satisfies $A^k = A$.
The stability holds when the weight parameters in and are small enough such that the CFL condition is satisfied (ns). Ignoring them for simplicity, the discrete nonlinear operator of our SNL has a formulation similar to that of the NS operator:
where is the discretization parameter. is the input of the block in the deeper nonlocal structure with . The stable assumption demands that , so the steady-state equation of the last SNL block can be written as:
The deeper nonlocal structure has more than one SNL block, so the and can be used to express :
Finally, the steady-state equation becomes:
This equation naturally extends to the k-hop affinity matrix $A^k$, i.e. $A^k = A$. ∎
To verify the stable hypothesis, we add five consecutively-connected SNL blocks (and NS blocks) into the PreResNet56 (preresnet) and train this model on the train set of the CIFAR100 dataset with the initial learning rate which is subsequently divided by at and epochs (total epochs). A weight decay and momentum are also used. Then we test the trained model on the test set and output the affinity matrix of each image. Figure 2 shows the statistics that reflect the strength of the affinity matrix and of the 2-hop, 3-hop, and 4-hop affinity matrices: $A$, $A^2$, $A^3$, $A^4$. We can see that the number of elements in each histogram bin is nearly the same for all four matrices. This means that $A$, $A^2$, $A^3$ and $A^4$ have similar distributions of elements, which also empirically verifies the stable-state equation $A^k = A$.
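A check of this kind can be sketched by histogramming the entries of the k-hop affinity matrices and comparing the bin counts. The exact "strength" statistic used for the figure is not specified in the text, so the sketch simply histograms the raw entries; the function name, the bin count and the toy affinity are assumptions.

```python
import numpy as np

def khop_histograms(A, max_hop=4, bins=10):
    """Histogram the entries of A, A^2, ..., A^max_hop over [0, 1]."""
    counts = []
    A_k = np.eye(A.shape[0])
    for _ in range(max_hop):
        A_k = A_k @ A                         # next power of the affinity matrix
        hist, _ = np.histogram(A_k, bins=bins, range=(0.0, 1.0))
        counts.append(hist)
    return np.stack(counts)                   # shape (max_hop, bins)

# Toy affinity that satisfies A^k = A exactly: uniform averaging over N positions.
N = 4
A = np.full((N, N), 1.0 / N)
hists = khop_histograms(A)
```

For a learned affinity matrix the rows of `hists` would only be approximately equal; how close they are gives an empirical indication of how well the stable-state equation holds.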
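Full-order spectral nonlocal operator. With the stable hypothesis, the Chebyshev polynomials can be simplified into a piecewise function (details in Appendix B). Substituting this piecewise function into Eq. (7), we get the full-order approximation of the SNL operator:
$$F^*_s(A, Z) = \sum_k \theta_k T_k(A) Z = \tilde{\theta}_1 Z + \tilde{\theta}_2 A Z + \tilde{\theta}_3 (2A - I) Z, \tag{10}$$
where $\tilde{\theta}_1 = \sum_{k \bmod 4 = 0} \theta_k$, $\tilde{\theta}_2 = \sum_{k \bmod 4 \in \{1, 3\}} \theta_k$, $\tilde{\theta}_3 = \sum_{k \bmod 4 = 2} \theta_k$. Then, extending it to multi-channel input and output and adding the residual connection, we get our gSNL block:
$$Y = X + F^*_s(A, Z) = X + Z W_1 + A Z W_2 + (2A - I) Z W_3. \tag{11}$$
The gSNL block performs well when the stable hypothesis is satisfied, i.e. when more than two nonlocal blocks with the same affinity matrix are added, as shown in Table 4.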
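The collapse of a full-order Chebyshev sum into three terms can be checked numerically under the stable hypothesis A^k = A. The sketch below groups the coefficients by k mod 4, following the recurrence T_k = 2 A T_{k-1} - T_{k-2} with A A = A, and compares the three-term form with the full sum. The grouping is derived here from the recurrence and, like the function names and the idempotent toy affinity, should be read as an assumption of this example.

```python
import numpy as np

def collapse_coefficients(theta):
    """Group Chebyshev coefficients by k mod 4 (valid when A @ A == A)."""
    theta = np.asarray(theta, dtype=float)
    k = np.arange(theta.size)
    t1 = theta[k % 4 == 0].sum()                       # multiplies I
    t2 = theta[(k % 4 == 1) | (k % 4 == 3)].sum()      # multiplies A
    t3 = theta[k % 4 == 2].sum()                       # multiplies 2A - I
    return t1, t2, t3

def full_order_sum(A, Z, theta):
    """Reference: sum_k theta[k] T_k(A) Z via T_k = 2 A T_{k-1} - T_{k-2}."""
    T_prev, T_curr = np.eye(A.shape[0]), A
    out = theta[0] * (T_prev @ Z)
    if len(theta) > 1:
        out = out + theta[1] * (T_curr @ Z)
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * A @ T_curr - T_prev
        out = out + theta[k] * (T_curr @ Z)
    return out

rng = np.random.default_rng(0)
N, C = 4, 3
A = np.full((N, N), 1.0 / N)           # idempotent, row-normalized toy affinity
Z = rng.standard_normal((N, C))
theta = rng.standard_normal(9)         # nine arbitrary coefficients

t1, t2, t3 = collapse_coefficients(theta)
collapsed = t1 * Z + t2 * (A @ Z) + t3 * ((2.0 * A - np.eye(N)) @ Z)
reference = full_order_sum(A, Z, theta)
```

However many Chebyshev terms are used, only three effective coefficients remain, which is why three weight matrices suffice in the multi-channel block.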
3.3 Implementation Details
The implementation details of the gSNL block are shown in fig. 3. The input feature map $X$ is first fed into three 1x1 convolutions, with the weight kernel $W_Z$ and two embedding kernels, to reduce the number of channels. One of the outputs, $Z$, is used as the transferred feature map to reduce the computational complexity, while the other two outputs are used to compute the affinity matrix $A$. The number of sub-channels $C_s$ is usually half the number of input channels $C_1$. The affinity matrix is calculated by the affinity kernel function and then made non-negative, symmetric and normalized with the operations in Sec. 3.1. Finally, with the affinity matrix $A$ and the transferred feature map $Z$, the output of the nonlocal block is obtained by Eq. (11). Specifically, the three weight matrices $W_1$, $W_2$, $W_3$ are implemented as three 1x1 convolutions.
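The pipeline described above can be sketched end to end on a single (C, H, W) feature map. The sketch assumes that the block output has the form X + Z W1 + A Z W2 + (2A - I) Z W3, treats each 1x1 convolution as a per-position matrix product, and clips negative affinities to zero; these choices, together with all names, are assumptions for illustration and not the authors' implementation.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution on a (C_in, H, W) map with a (C_in, C_out) kernel."""
    return np.einsum("chw,cd->dhw", x, weight)

def to_positions(t):
    """Flatten (C, H, W) into (N, C) with N = H * W positions."""
    return t.reshape(t.shape[0], -1).T

def gsnl_forward(x, W_z, W_phi, W_psi, W1, W2, W3):
    """Sketch of a gSNL-style block on a single (C1, H, W) feature map."""
    C1, H, Wd = x.shape
    Z = to_positions(conv1x1(x, W_z))                  # transferred features, (N, Cs)
    Phi = to_positions(conv1x1(x, W_phi))              # embeddings used for the affinity
    Psi = to_positions(conv1x1(x, W_psi))
    M = np.maximum(Phi @ Psi.T, 0.0)                   # assumption: clip to non-negative
    M_hat = 0.5 * (M + M.T)                            # symmetrize
    A = M_hat / np.maximum(M_hat.sum(axis=1, keepdims=True), 1e-12)
    I = np.eye(A.shape[0])
    out = Z @ W1 + A @ Z @ W2 + (2.0 * A - I) @ Z @ W3 # (N, C1)
    return x + out.T.reshape(C1, H, Wd)                # residual connection

rng = np.random.default_rng(0)
C1, Cs, H, Wd = 8, 4, 3, 3                             # Cs is half of C1 in this toy setup
x = rng.random((C1, H, Wd))
W_z, W_phi, W_psi = (0.1 * rng.standard_normal((C1, Cs)) for _ in range(3))
W1, W2, W3 = (0.1 * rng.standard_normal((Cs, C1)) for _ in range(3))
y = gsnl_forward(x, W_z, W_phi, W_psi, W1, W2, W3)
```

The affinity matrix has N x N entries with N = H * W, so reducing the channels before computing it lowers the cost of the embeddings but not the quadratic dependence on the number of positions.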
Datasets. Our proposed SNL and gSNL blocks are evaluated on several computer vision tasks, including image classification and video-based action recognition. For image classification, both the CIFAR-10 and CIFAR-100 datasets (cifar) are tested. The CIFAR-10 dataset contains images of classes, and the CIFAR-100 dataset contains images of classes. For these two datasets, we use images as the train set and images as the test set. We also conduct experiments on fine-grained classification with the Birds-200-2011 (CUB-200) dataset (cub), which contains images of bird categories. For action recognition, the experiments are conducted on the UCF-101 dataset (ucf), which contains different actions.
Backbones For the image classification, the ResNet-50 and the PreResNet variations (including both PreResNet-20 and PreResNet-56) are used as the backbone networks. For the video classification task, we follow the I3D structure (I3d) which uses kernels to replace the convolution operator in the residual block.
Setting for the network. In the main experiments, we set . Without loss of generality, we use the “Dot Product” as the affinity kernel in the experiments. We add one SNL (or gSNL) block into these backbone networks to construct the SNL (or gSNL) network. For the ResNet and the I3D (I3d), following nl, we add the SNL block right before the last residual block of . For the PreResNet series, we add the SNL block right after the second residual block in the early stage (). For the other nonlocal-based blocks, such as the original nonlocal block (nl), the nonlocal stage (ns) and the compact generalized nonlocal block (cgnl), the settings are all the same as ours to make a fair comparison.
Setting for the training For the image classification on CIFAR-10 dataset and CIFAR-100 dataset, we train the models end-to-end without using pretrained model. The initial learning rate is used for these two datasets with the weight decay and momentum . The learning rate is divided by at and epochs. The models are trained for total epochs.
For the fine-grained classification on CUB-200 dataset, we use the models pretrained on ImageNet (imagenet) to initialize the weights. We train the models for total epochs with the initial learning rate which is subsequently divided by at , , epochs. The weight decay and momentum are the same as the setting of CIFAR-10 and CIFAR-100.
For the video classification on the UCF-101 dataset, the weights are initialized by the pretrained I3D model on Kinetics dataset (kinetics). We train the models with the initial learning rate which is subsequently divided by each epochs. The training stops at the epochs. The weight decay and momentum are the same as the setting of CIFAR-10 and CIFAR-100.
4.2 Ablation Experiment
The number of channels in transferred feature space. The nonlocal-based block first reduces the channels of the original feature map into the transferred feature space by the convolution to reduce the computational complexity. When $C_s$ is too large, the feature map contains redundant information, which introduces noise when calculating the affinity matrix $A$. However, if $C_s$ is too small, it is hard to reconstruct the output feature map due to inadequate features. To test the robustness to the value of $C_s$, we build three types of models with different numbers of transferred channels: “Sub 1” (), “Sub 2” (), “Sub 4” (), as shown in Table 3. Other parameters of the models and the training steps are the same as the setting in Sec. 4.1. Table 3 shows the experimental results of the three types of models with different nonlocal blocks. Our SNL and gSNL blocks outperform the other models thanks to their flexibility in learning. Moreover, from Table 3, we can see that the performance of the CGNL drops steeply when the number of transferred channels increases. This is because the CGNL block considers the relationships between channels: when the number of sub-channels increases, the relationships between redundant channels seriously interfere with its effect. Overall, our proposed nonlocal block is the most robust to a large number of transferred channels (our model rise in Top1 while the best of others only rise compared to the baseline).
|Sub 1|+ NL|75.29%|94.07%|
|Sub 2|+ NL|75.31%|92.84%|
|Sub 4|+ NL|75.50%|93.75%|
|Stage 1|+ NL|75.31%|92.84%|
|Stage 2|+ NL|75.64%|93.79%|
|Stage 3|+ NL|75.28%|93.93%|
The stage for adding the nonlocal blocks. The nonlocal-based blocks can be added into different stages of the PreResNet (or the ResNet) to form the Nonlocal Net. In ns, the nonlocal-based blocks are added into the early stage of the PreResNet to capture the long-range correlations. Here we evaluate the performance of adding different types of nonlocal blocks into each of the three stages (the first, the second and the third stage of the PreResNet) and train the models on the CIFAR100 dataset with the same setting discussed in Sec. 4.1. The experimental results are shown in Table 3. We can see that the performance of the NL block is lower than that of the backbone when it is added into the early stage. However, our proposed SNL block improves on the backbone when added into each of the three stages, which is much higher than the other types of nonlocal blocks (only for the best case).
To intuitively show the stability and robustness of our SNL, we give the spectrum analysis for the estimated weight matrices (ns). We extract the self-attention weight matrix: of the NL block and the NS block, of our proposed SNL block. The dimension of the weight matrix satisfies: , . To make all the eigenvalues real, we let: . We do the same to the . Figure 5 shows the top eigenvalues of the weight matrix of on the models in Table 3. We can see that, for the NL block, the density of negative eigenvalues is higher than that of positive eigenvalues in all three stages. This makes the NL operator in Eq. (1) less than zero, so the output feature map is smaller than the input feature map (more detail on this phenomenon can be found in ns). The NS block can avoid this “damping effect” to some extent by following the diffusion nature. However, when it is added into the early stage, only six eigenvalues of the nonlocal stage are nonzero, which prevents the nonlocal stage from effectively magnifying the discriminative features. Compared with these two models, our proposed SNL block has more positive eigenvalues, which helps to enhance the discriminative features and also avoids the “damping effect”.
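A spectrum analysis of this kind can be sketched by symmetrizing a square weight matrix, which guarantees real eigenvalues, and counting the positive and negative ones. Which learned matrices are combined is not recoverable from the text, so the product of two random kernels below is a stand-in, and the symmetrization by (W + W^T) / 2 as well as the function name are assumptions of this example.

```python
import numpy as np

def symmetrized_spectrum(W, top_k=None):
    """Eigenvalues of (W + W^T) / 2, sorted from largest to smallest."""
    W_sym = 0.5 * (W + W.T)                  # the symmetric part has real eigenvalues
    eigvals = np.linalg.eigvalsh(W_sym)[::-1]
    return eigvals if top_k is None else eigvals[:top_k]

rng = np.random.default_rng(0)
C1, Cs = 8, 4
W_z = rng.standard_normal((C1, Cs))          # toy stand-ins for learned kernels
W = rng.standard_normal((Cs, C1))
eigvals = symmetrized_spectrum(W_z @ W)      # square (C1, C1) product, an assumption
num_positive = int((eigvals > 1e-8).sum())
num_negative = int((eigvals < -1e-8).sum())
```

Comparing `num_positive` and `num_negative` across blocks gives a rough numerical counterpart to the eigenvalue densities discussed above.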
The number of the nonlocal blocks. We test the robustness of adding multiple nonlocal blocks into the backbone network, which forms three types of networks, “Different Position 3 (DP 3)”, “Same Position 3 (SP 3)” and “Same Position 5 (SP 5)”, with the results shown in Table 4. For the model “DP 3”, three blocks are added into the stage , stage , and stage (right after the second residual block). We can see that adding three proposed nonlocal operators into different stages of the backbone generates a larger improvement than the NS operator and NL operator ( improvement). This is because, when added into the early stage, the NS and NL blocks cannot aggregate the low-level features well and interfere with the following blocks. For the model “SP 3” (“SP 5”), we add three (five) consecutively-connected nonlocal blocks into the stage . Note that, different from the experiments in ns and nl, these consecutively-connected nonlocal blocks share the same affinity matrix. From Table 4, we can see that, by exploiting the stable hypothesis discussed in Sec. 3.2, our gSNL outperforms all other models when consecutively-connected nonlocal blocks are added (rises average % to the backbone and % higher than the best performance of other type nonlocal blocks) and has a relatively stable performance. However, one drawback is that our gSNL may interfere with the learning when only one nonlocal block is added (the stable hypothesis is not satisfied).
|1|+ NL|75.31%|92.84%|SP 3|+ NL|75.43%|93.67%|
| |+ NS|75.83%|93.87%| |+ NS|75.30%|93.74%|
| |+ A2|75.58%|94.27%| |+ A2|75.23%|94.03%|
| |+ CGNL|75.75%|93.47%| |+ CGNL|75.64%|93.05%|
| |+ *SNL|76.41%|94.38%| |+ *SNL|75.70%|94.10%|
| |+ *gSNL|76.07%|94.16%| |+ *gSNL|76.16%|94.32%|
|DP 3|+ NL|74.34%|93.31%|SP 5|+ NL|75.13%|93.53%|
| |+ NS|75.00%|93.57%| |+ NS|75.25%|94.00%|
| |+ A2|75.63%|94.12%| |+ A2|75.61%|93.81%|
| |+ CGNL|75.96%|93.10%| |+ CGNL|75.15%|92.93%|
| |+ *SNL|76.70%|93.94%| |+ SNL|76.04%|94.19%|
| |+ *gSNL|76.45%|94.53%| |+ gSNL|76.04%|94.35%|
4.3 Main Results
We test the networks with the NL, NS, CGNL, A2 and our SNL (gSNL) blocks on different visual learning tasks. The experiment settings are discussed in Sec. 4.1. Our models outperform the other types of nonlocal blocks across several standard benchmarks. Table 7 shows the experimental results on the CIFAR10 dataset. We can see that by adding one proposed block, the Top1 rises about , which is higher than adding other types of nonlocal blocks (0.3%). For the experiments on the CIFAR100 dataset shown in Table 7, using our proposed block brings an improvement of about with ResNet50. With a simpler backbone, PreResNet56, our model can still generate improvement, as shown in Table 7.
Table 9 shows the experimental results on the fine-grained image classification task on the CUB-200 dataset. Our model outperforms the other non-channel-concerning blocks and generates () improvement. Compared with the channel-wise concerning CGNL block, our model is only slightly lower in Top1. Visual examples are also given in fig. 4. We can see that the feature maps of our proposed block cover more critical areas of the birds, such as the wings (red square) and webs (green square). Table 9 also shows the experimental results on the action recognition task. The network with our proposed block generates improvement over the I3D model and outperforms all other nonlocal models on the UCF-101 dataset.
|+ CGNL|83.16%|96.16%|
In this paper, we explain the nonlocal block in the graph view and propose the spectral nonlocal (SNL) block, which is more robust and well-behaved. Our SNL block is a generalized version of the NL and NS blocks and has more freedom in parameter learning. We also give the stable hypothesis for deeper nonlocal structures and extend the SNL to the gSNL, which can be applied to deeper nonlocal structures. The experiments on multiple computer vision tasks show the high robustness and performance of our proposed nonlocal block. Although only classification tasks are explored in this work, we expect that the SNL and gSNL can be applied to more complex tasks, e.g. trajectory prediction in video (zhang2019stochastic).
Appendix A Analytic Solution of the Chebyshev Approximate
Here we give the analytic solution for the coefficients in Chebyshev polynomials (chebApr):
Given a function , , it can be optimally approximated by Chebyshev polynomials: , only when satisfies: . We call this the analytic solution of the Chebyshev coefficients.
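For reference, the standard textbook Chebyshev coefficients of a function f on [-1, 1] are theta_k = (2/pi) * integral of f(x) T_k(x) / sqrt(1 - x^2), with the k = 0 coefficient halved. The sketch below evaluates them with Gauss-Chebyshev quadrature. This is the classical form and may differ in normalization from the statement intended above; the function name and the node count are assumptions.

```python
import numpy as np
from numpy.polynomial import chebyshev

def chebyshev_coefficients(f, num_coeffs, num_nodes=64):
    """Approximate theta_k = (2/pi) * int_{-1}^{1} f(x) T_k(x) / sqrt(1 - x^2) dx
    (halved for k = 0) with Gauss-Chebyshev quadrature."""
    angles = np.pi * (np.arange(num_nodes) + 0.5) / num_nodes
    fx = f(np.cos(angles))                                   # f at the Chebyshev nodes
    k = np.arange(num_coeffs)[:, None]
    # T_k(cos a) = cos(k a), so the weighted integral becomes a plain average.
    theta = (2.0 / num_nodes) * (fx[None, :] * np.cos(k * angles[None, :])).sum(axis=1)
    theta[0] *= 0.5
    return theta

# f(x) = 1 + 3 T_2(x) = 6 x^2 - 2, so the coefficients should be [1, 0, 3].
theta_poly = chebyshev_coefficients(lambda x: 6.0 * x**2 - 2.0, 3)

# A smooth non-polynomial function is approximated closely by a few terms.
theta_exp = chebyshev_coefficients(np.exp, 10)
xs = np.linspace(-1.0, 1.0, 201)
max_error = np.max(np.abs(chebyshev.chebval(xs, theta_exp) - np.exp(xs)))
```

In this reading, the k = 0 coefficient is the weighted average of f over the interval, which matches the interpretation of the first coefficient as controlling the total filtering intensity.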
Based on this theorem, we can get the analytic solution of the parameter for Eq. (7):
The spectral nonlocal operator can be best approximated when the function can be best approximated by the Chebyshev polynomials, i.e. the analytic solutions of the Chebyshev coefficients satisfy:
Appendix B The Piecewise Chebyshev Polynomials
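Taking $A^k = A$ into the Chebyshev polynomials of the affinity matrix $A$, the Chebyshev polynomials become:
$$T_0(A) = I, \quad T_1(A) = A, \quad T_2(A) = 2A\,T_1(A) - T_0(A) = 2A - I, \quad T_3(A) = 2A\,T_2(A) - T_1(A) = A, \quad T_4(A) = 2A\,T_3(A) - T_2(A) = I, \quad \dots$$
This cyclic form of the Chebyshev polynomials can be reformulated as a piecewise function:
$$T_k(A) = \begin{cases} I, & k \bmod 4 = 0, \\ A, & k \bmod 4 \in \{1, 3\}, \\ 2A - I, & k \bmod 4 = 2. \end{cases}$$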
Appendix C Experiment of Semantic Segmentation on VOC2012 Dataset
For the semantic segmentation task, we conduct experiments on the VOC2012 dataset with the model proposed by deeplab. We add different types of nonlocal blocks right before the last residual block in of the ResNet50. The models are trained for 50 epochs with the SGD optimization algorithm. The learning rate is set with the weight decay and momentum . Experimental results show that the model with our proposed block achieves the best results.
Appendix D The Example of the affinity matrix on CUB datasets
Experiments to verify the stable hypothesis are also conducted on the CUB dataset. We add three consecutively-connected SNL blocks (and NS blocks) into the ResNet50 (right before the last residual block of ) and train this model on the train set of the CUB dataset with the initial learning rate which is subsequently divided by at , and epochs (total epochs). A weight decay and momentum are also used. Figure 6 shows the histogram of the strength statistics of the affinity matrix . We can see that, although a different backbone and dataset are used, the distributions of the k-hop affinity matrices are consistent with the experiments on CIFAR100.