
Beyond Fixation: Dynamic Window Visual Transformer

Recently, a surge of interest in vision transformers has focused on reducing computational cost by limiting the calculation of self-attention to a local window. Most current work uses a fixed single-scale window by default, ignoring the impact of window size on model performance; this may limit the modeling potential of these window-based models for multi-scale information. In this paper, we propose a novel method named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy proposed by DW-ViT goes beyond models that employ a fixed single window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention. The information is then fused dynamically by assigning different weights to the multi-scale window branches. We conduct a detailed performance evaluation on three datasets: ImageNet-1K, ADE20K, and COCO. Compared with related state-of-the-art (SoTA) methods, DW-ViT obtains the best performance. Specifically, compared with the current SoTA Swin Transformer <cit.>, DW-ViT achieves consistent and substantial improvements on all three datasets with similar parameters and computational cost. In addition, DW-ViT exhibits good scalability and can be easily inserted into any window-based vision transformer.



1 Introduction

In computer vision (CV) tasks, the visual transformer represented by Vision Transformer (ViT) [dosovitskiy2021an] has shown great potential. These methods have achieved impressive performance on tasks such as image classification [srinivas2021bottleneck, wang2021pyramid], semantic segmentation [wu2021p2t, liu2021polarized], and object detection [liu2021swin, yang2021focal, yuan2021tokens].

In ViT, the complexity of the self-attention operation is proportional to the square of the number of image patches, which is prohibitive for most tasks in the CV field. Swin [liu2021swin] thus proposed limiting the calculation of self-attention to a local window to reduce the computational complexity, and achieved promising results. This local window self-attention quickly attracted a significant amount of attention [lin2021cat, chu2021twins, wang2021crossformer]. However, most of these methods [lin2021cat, chu2021twins, wang2021crossformer] use a fixed single-scale window (e.g., ) by default. The following questions accordingly arise: Is this window size optimal? Does a bigger window entail better performance? Is a multi-scale window more advantageous than a single-scale window? Furthermore, will dynamic multi-scale windows yield better results? To answer these questions, we evaluate the impact of window size on model performance. In Fig. 1, we report the change curves () of top-1 accuracy and FLOPs (G) of Swin-T [liu2021swin] under four single-scale windows () on ImageNet-1K [Deng2009ImageNetAL]. In Swin [liu2021swin], the window size has a very small effect on the number of model parameters.

As shown in Fig. 1, as the window size increases, the performance of the model improves significantly, but the improvement is not strictly monotonic. For example, when the window size is increased from 21 to 23, the performance of the model hardly improves and can even drop. Therefore, simply enlarging the window is not a viable way to improve model performance. In addition, it is difficult to choose the best window size from multiple candidates, and the optimal window settings may differ across layers. A natural idea is to mix information from windows of different scales for prediction tasks. Based on this idea, we design a multi-scale window multi-head self-attention (MSW-MSA) mechanism for the window-based ViT. In Fig. 1, as the comparison between Swin-T with MSW (MSW-Swin) and Swin-T with a single-scale window shows, simply introducing the MSW mechanism into the W-MSA of the transformer cannot further improve the performance of the model. For example, the performance of MSW-Swin () is lower than that of Swin-T with single-scale windows when . This may be caused by suboptimal window settings that impair the performance of the model, which suggests that more effort is required to protect ViT with MSW from suboptimal window settings while retaining the advantages of multi-scale windows. On the other hand, the dynamic neural network [han2021dynamic] has been favored by a large number of researchers because of its ability to adjust the structure and parameters of the model adaptively according to the input. Moreover, dynamic networks have been successfully applied in CNNs [szegedy2016rethinking, szegedy2017inception, xie2017aggregated, zoph2018learning, tan2019mixconv, li2019selective] and ViTs [wang2021crossformer, yang2021focal, chen2021crossvit].

[DW-ViT (ours)]    [Swin Transformer]

Figure 2: Comparison of DW-ViT’s multi-scale window (e.g., and ) and Swin-based single-scale window (e.g., ). The number of patches in the local window is . A dynamic multi-scale window (DMSW) is a dynamic adaptive window module we designed for multi-scale window multi-head self-attention (MSW-MSA). is a learnable parameter of the DMSW module. and are a possible weight distribution scheme of DMSW.

Based on the above observations, in this paper we propose a novel method, named Dynamic Window Vision Transformer (DW-ViT). To the best of our knowledge, it is the first method to use dynamic multi-scale windows to explore the upper limit of the impact of window settings on model performance. In DW-ViT, we first obtain multi-scale information by assigning windows of different scales to different head groups of multi-head self-attention in the transformer. Then, we realize the dynamic fusion of this information by assigning weights to the multi-scale window branches. In Fig. 2, we present a comparison of DW-ViT's multi-scale window and the single-scale window of Swin-based [liu2021swin] methods. More specifically, in DW-ViT, MSW-MSA is responsible for the extraction of multi-scale window information, while DMSW is responsible for the dynamic enhancement of this information. Through these two parts, DW-ViT dynamically improves the model's multi-scale modeling capability while maintaining relatively low computational complexity. As shown in Fig. 1, the performance of DW-T with a dynamic window is significantly better than that of Swin-T with a single fixed-scale window, which we call "beyond fixation". Our main contributions can be summarized as follows:

  • Most recently popular window-based ViTs ignore the influence of window size on model performance, which severely limits the upper bound of the model's performance. To the best of our knowledge, we are the first to address this problem.

  • We propose a novel plug-and-play module with a dynamic multi-scale window for multi-head self-attention in the transformer. DW-ViT is superior to all other ViTs that use the same single-scale window and can be easily embedded into any window-based ViT.

  • Compared with the state-of-the-art methods, DW-ViT achieves the best performance on multiple CV tasks with similar parameters and FLOPs.

2 Related Works

Figure 3: A schematic diagram of the window self-attention calculation process in the visual transformer. Assume that the number of pixels in the input image is (e.g. ). The image is first split into fixed-size patches (e.g. ), and then the self-attention calculation is limited to a fixed-size window (i.e. each window has patches, e.g. ). For simplicity, patch and position embeddings are omitted here.
Figure 4: The architecture of the Dynamic Window Vision Transformer (DW-ViT).

Window self-attention. In the ViT context, standard self-attention splits each image into fixed-size patches [dosovitskiy2021an, wang2021pyramid, touvron2021training]. These patches are expanded into a sequence of tokens, which are encoded and then fed to the transformer encoder. The computational cost of this standard self-attention is still huge. Subsequent work [wang2021pyramid, wu2021p2t, Hu2019LocalRN] has continued to try to reduce its complexity. In particular, Swin [liu2021swin] proposes to limit the calculation of self-attention to a local window. This window self-attention strategy reduces the computational complexity of MSA from quadratic to linear in the number of image patches. A schematic diagram of the window-based self-attention calculation process in ViT is shown in Fig. 3. This window self-attention mechanism quickly attracted the attention of a large number of researchers [chu2021twins, yang2021focal, wang2021crossformer]. However, these works all use a fixed single-scale window, ignoring the impact of window size on model performance; this may cap the benefit that the window configuration can bring. In Fig. 1, the performance comparison of Swin [liu2021swin] under different single-scale windows verifies this idea. Based on the above observations, we fill this gap and explore in detail the effect of window size on model performance, complementing the above work.

Multi-scale information in ViT. Multi-scale information has been successfully applied in the field of convolution. To obtain more comprehensive information, a model needs not only small-scale information but also large-scale information. For example, Inception [szegedy2015going, Szegedy2017Inceptionv4IA], Timeception [Hussein2019TimeceptionFC], MixConv [Tan2019MixConvMD], and SKNet [li2019selective], among others, obtain multi-scale information by using convolution kernels of different sizes. In addition, some works [graham2021levit, wang2021crossformer] also use the output of a CNN as the input of ViT to improve the ability of ViT to model local information. In particular, CrossFormer [wang2021crossformer] uses multi-scale convolution to provide multi-scale information for the ViT input. Recently, due to the popularity of ViT in the CV field, many researchers have attempted to introduce multi-scale information into ViT. The pyramid structure in CNN is a widely borrowed idea. For example, T2T [yuan2021tokens] reduces the length of the token sequence stage by stage by aggregating adjacent patches, while PVT [wang2021pyramid] reduces the feature dimension by modifying self-attention. BossNAS [li2021bossnas] searches for the downsampling positions of multi-stage transformers. Further, P2T [wu2021p2t] introduces pyramid pooling into the self-attention of the transformer. Similarly, Focal self-attention [yang2021focal] also incorporates multi-scale information into the calculation of each self-attention. More directly, CrossViT [chen2021crossvit] designs a two-branch transformer encoder with image tokens of different sizes. All of these improve the model's ability to model multi-scale information to varying degrees. However, the above-mentioned methods either incur heavy computation due to global self-attention or are difficult to extend due to their complex designs.
In our work, we design a multi-scale window mechanism for MSA to enhance its ability to model multi-scale information. This MSW-MSA strategy applies to most types of W-MSA computation and exhibits good extensibility.

Dynamic multi-branch network. Recently, dynamic networks [han2021dynamic, li2021dynamic, li2021ds] are popular because they can flexibly adjust the structure and parameters of the network according to the input and have better adaptive capabilities. In a dynamic multi-branch network, a common strategy is to assign corresponding weights to different branches according to their importance to achieve a large-capacity, more versatile, and flexible network structure. For example, early works on this topic [jacobs1991adaptive, Eigen2014LearningFR] used real-valued weights to dynamically rescale the representations obtained from different experts. In addition, SKNet [li2019selective], ACNet [wang2019adaptively], TreeConv [wang2020grammatically], and ResNeSt [Zhang2020ResNeStSN] propose a simple split-attention mechanism that dynamically adjusts the weight of the information obtained by different convolution kernels or branches. This strategy can obtain dynamic feature representations for different samples with a small computational cost, thereby improving the model’s expressive ability. In our work, the proposed multi-scale window self-attention module has a natural affinity with the above-mentioned dynamic multi-branch network. Accordingly, we propose a dynamic multi-scale window (DMSW) module for MSW-MSA. This DMSW strategy enables DW-ViT to integrate information from windows of different scales in a dynamic manner so that the model can obtain better expressive capabilities.

3 Method

3.1 Overall Architecture

To facilitate proper comparison while maintaining its high-resolution task processing capabilities, DW-ViT follows the architectural design outlined in [liu2021swin, wang2021pyramid, Zhang2021MultiScaleVL]. Fig. 4 presents the overall architecture of DW-ViT. The model comprises four stages. To generate a hierarchical feature representation, the -th stage consists of a feature compression layer and Dynamic Window Module (DWM) transformer layers. More specifically, in Stage 1, similar to ViT [dosovitskiy2021an, liu2021swin], the RGB image is split into non-overlapping patches (the patch size is set to ; that is, the compression ratio in the spatial dimension is 4). The original RGB pixel values of each patch are concatenated (i.e. after patch concatenation, the dimension is ) and projected to an arbitrary dimension (denoted as ) through a linear embedding layer. The feature dimension of the corresponding patch embedding layer output is . These generated patch tokens are then used as the input of the DWM transformer layers, and the number (i.e. ) of tokens remains unchanged during this process. Stages 2–4 use a similar structure. The difference is that the feature compression ratio of the patch merging layer in each stage is 2, while the number of channels is doubled. That is, the resolutions of the output features for Stages 2–4 are , , and , and the corresponding channel dimensions are , , and , respectively. The combination of output features at different stages can be used as the input of task networks such as classification, segmentation, and detection.
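As a quick sanity check on the hierarchy described above, the per-stage feature shapes can be derived from the compression ratios. The sketch below assumes a 224x224 input and an embedding dimension of 96 (the Swin-T value); these concrete numbers are illustrative assumptions, not fixed by this section.

```python
# Per-stage output resolutions and channel dims of the 4-stage hierarchy.
# Assumptions: 224x224 input, patch size 4, embedding dim C = 96 (Swin-T-like).
H = 224
C = 96
res, dims = [], []
r, c = H // 4, C          # Stage 1: 4x spatial compression by patch embedding
for stage in range(4):
    res.append(r)
    dims.append(c)
    r //= 2               # each patch merging layer halves the resolution...
    c *= 2                # ...and doubles the channel dimension
print(res, dims)          # [56, 28, 14, 7] [96, 192, 384, 768]
```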

Figure 5: Dynamic Window Module (DWM). DWM has two main parts: Multi-Scale Window Multi-Head Self-Attention Module (MSW-MSA) and Dynamic Multi-Scale Window Module (DMSW).

3.2 Dynamic Window Module

As shown in Fig. 5, the DWM we designed comprises two main parts: a multi-scale window multi-head self-attention module (MSW-MSA) and a dynamic multi-scale window module (DMSW). The former is responsible for the capture of multi-scale window information, while the latter is responsible for the dynamic adaptive weighting of this information.

3.2.1 Multi-Scale Window Multi-head Self-Attention

Fig. 5 (left) presents an architecture diagram of MSW-MSA with heads and scale windows. Here we take and as an example. The heads of MSA are evenly divided into groups, which perform multi-head self-attention with windows of different scales to capture multi-scale window information. A group of windows here can be set to . Specifically, assume the input feature map ; we thus have the following output of MSW-MSA:


where the -th branch is divided into windows in the spatial dimension. Each window is expanded into a token sequence of length and used as the input of the -th branch W-MSA of MSW-MSA. The structure of W-MSA is illustrated in Fig. 3. The output of W-MSA is reconstructed as in the spatial dimension, and the final output dimension is . The outputs of these branches are concatenated in the channel dimension and used as the output of the entire MSW-MSA module.
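To make the branch mechanics concrete, below is a minimal NumPy sketch of the MSW-MSA idea: the channel dimension (standing in for the head groups) is split into k groups, each group runs window self-attention with its own window size, and the branch outputs are concatenated. Projections, multi-head bookkeeping, and position bias are omitted, and the feature shape and window sizes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, w):
    # (H, W, C) -> (num_windows, w*w, C): each window becomes a token sequence
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def window_reverse(win, w, H, W, C):
    # inverse of window_partition: (num_windows, w*w, C) -> (H, W, C)
    x = win.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def w_msa(x, w):
    # toy single-head window self-attention with identity projections
    win = window_partition(x, w)
    d = win.shape[-1]
    attn = softmax(win @ win.transpose(0, 2, 1) / np.sqrt(d))
    return window_reverse(attn @ win, w, *x.shape)

def msw_msa(x, window_sizes):
    # split channels into len(window_sizes) groups; one window scale per group
    groups = np.split(x, len(window_sizes), axis=-1)
    return np.concatenate([w_msa(g, s) for g, s in zip(groups, window_sizes)], axis=-1)

x = np.random.randn(28, 28, 96)
y = msw_msa(x, window_sizes=[7, 14, 28])   # three head groups, three window scales
print(y.shape)                             # (28, 28, 96)
```

The output keeps the input's spatial and channel shape, so the module can drop in wherever W-MSA is used.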

3.2.2 Dynamic Multi-Scale Window

The output of the multi-branch structure MSW-MSA can naturally be used as the input of DMSW. It retains the multi-scale information of window groups of different scales in the channel dimension. To exploit this, we design a dynamic multi-scale window information weighting module, DMSW, for MSW-MSA.

In more detail, DMSW uses the integrated information of all branches to generate corresponding weights for each branch, then integrates the information of different branches via weighting. The DMSW structure diagram is presented on the right of Fig. 5. This process is divided into two main steps: Fuse and Select. The former is responsible for integrating the information of all branches, while the latter generates corresponding weights for each branch based on the global information and completes the fusion of branch information. Specifically, the details of these two parts are as follows:

Fuse: It mainly consists of a pooling layer and two pairs of fully connected layers and activation layers . The calculation process is as follows:


where is the GELU [Hendrycks2016BridgingNA] function. The specific dimension setting is presented in Fig. 5 (right), where and is set to .

Select: It consists of two parts. The first part is composed of a set of fully connected layers and a softmax layer to generate corresponding weights for each branch, while the second contains two linear mapping layers to restore the channel dimension of the fused features. The specific calculation process is as follows:


where . The DMSW module output is as follows:


Moreover, is also the output of the entire DWM.
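A loose NumPy sketch of the Fuse-and-Select flow: spatial pooling summarizes all branches, a small bottleneck produces one softmax weight per branch, and each branch is rescaled by its weight before concatenation. The random projections, tanh nonlinearity, and reduction ratio are placeholders (the paper uses learned FC layers with GELU, and additionally restores the channel dimension with two linear layers, which this sketch omits).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dmsw(branches, r=4, seed=0):
    # branches: list of k arrays (H, W, C_g), one per window scale
    rng = np.random.default_rng(seed)
    k = len(branches)
    u = np.concatenate(branches, axis=-1)          # MSW-MSA output, (H, W, C)
    s = u.mean(axis=(0, 1))                        # Fuse: global average pooling -> (C,)
    C = s.shape[0]
    W1 = 0.02 * rng.standard_normal((C, C // r))   # bottleneck FC (placeholder weights)
    W2 = 0.02 * rng.standard_normal((C // r, k))   # one logit per branch
    a = softmax(np.tanh(s @ W1) @ W2)              # Select: branch weights, sum to 1
    out = np.concatenate([a[i] * b for i, b in enumerate(branches)], axis=-1)
    return out, a

branches = [np.random.randn(28, 28, 32) for _ in range(3)]
out, a = dmsw(branches)
print(out.shape, a.shape)                          # (28, 28, 96) (3,)
```

Because the weights are computed from the pooled input itself, different samples receive different branch weightings, which is the "dynamic" part of DMSW.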

3.3 Dynamic Window Block

The DW block is constructed by replacing the standard MSA module in the Transformer block with DWM. Because DWM is designed for multi-scale information, it is not specifically designed for cross-window information exchange. For simplicity, following the design in [liu2021swin], we retain Swin's shifted window strategy. DWM with the shifted window strategy is defined as a dynamic shifted window (DSW) block. Each DWM (or DSW) block consists of two LayerNorm (LN) layers and a two-layer MLP with GELU nonlinearity. DSW achieves cross-window information exchange by moving the feature patches to the upper left in the spatial dimension; when the feature is reconstructed, it moves the patches back to the lower right to restore their spatial positions. DWM and DSW blocks are stacked alternately to maintain cross-window information exchange. Specifically, two successive blocks are calculated as follows:

\hat{z}^{l} = \mathrm{DWM}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \quad z^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l},
\hat{z}^{l+1} = \mathrm{DSW}(\mathrm{LN}(z^{l})) + z^{l}, \quad z^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1},

where \hat{z}^{l} and z^{l} respectively denote the output of the DWM (DSW) module and the MLP module in the l-th block.
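The shift-and-restore step of DSW can be illustrated with a cyclic roll, mirroring the usual Swin-style implementation (the grid and shift size here are arbitrary toy values):

```python
import numpy as np

x = np.arange(16).reshape(4, 4)                         # toy 4x4 grid of patch indices
s = 2                                                   # toy shift size (Swin uses half the window size)
shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))       # move patches toward the upper left
restored = np.roll(shifted, shift=(s, s), axis=(0, 1))  # move back toward the lower right
print((restored == x).all())                            # True: spatial positions are restored
```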

Output Size Layer Name DW-T DW-B
Stage 1 Patch Embedding
Stage 2 Patch Merging
Stage 3 Patch Merging
Stage 4 Patch Merging
Table 1: Configuration details of DW-ViT. Here, is the size of the patch in the -th stage, and is also the downsampling ratio of the feature in the spatial dimension. is the number of feature channels, while and are the window combination used by the MSW-MSA module and the number of heads used by the MSA in transformer respectively.

Position encoding. For a local window with patches, following [Raffel2020ExploringTL, Bao2020UniLMv2PL, liu2021swin], we add a set of relative position biases to the similarity calculation of each head of DWM self-attention. For the W-MSA of the -th scale local window, we have the following window self-attention calculation:

\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(QK^{T}/\sqrt{d} + B_{i}\right)V,

where Q, K, V \in \mathbb{R}^{M_{i}^{2} \times d} are the query, key, and value matrices, M_{i}^{2} is the number of patches in the i-th scale window, and d is the head dimension. In addition, we parameterize a set of bias matrices \{\hat{B}_{i}\}. Specifically, for B_{i} \in \mathbb{R}^{M_{i}^{2} \times M_{i}^{2}}, because the relative position on each axis lies in the range [-M_{i}+1, M_{i}-1], a small-sized bias matrix \hat{B}_{i} \in \mathbb{R}^{(2M_{i}-1) \times (2M_{i}-1)} is parameterized, and the values in B_{i} are taken from \hat{B}_{i}.
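The bias lookup can be sketched with the standard Swin-style index construction, which maps each pair of patches in an M×M window to one of the (2M-1)^2 entries of the bias table (a generic reimplementation, not the paper's code):

```python
import numpy as np

def relative_position_index(M):
    # pairwise relative coordinates of the M*M patches inside one window
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
    flat = coords.reshape(2, -1)                 # (2, M*M)
    rel = flat[:, :, None] - flat[:, None, :]    # each axis offset lies in [-(M-1), M-1]
    rel = rel + (M - 1)                          # shift offsets to [0, 2M-2]
    return rel[0] * (2 * M - 1) + rel[1]         # flat index into the (2M-1)^2 bias table

idx = relative_position_index(7)                 # 7x7 window -> 49 patches
print(idx.shape, idx.max() + 1)                  # (49, 49) 169, i.e. (2*7-1)**2 entries
```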

3.4 Model Configuration

To facilitate fair comparison, following [liu2021swin], we define two model configurations, DW-T and DW-B. Their configuration details are summarized in Tab. 1. In particular, according to the results in Fig. 1 and the size of the output features in each stage on ImageNet [Deng2009ImageNetAL], for DW-T with three heads in the first stage, we set . For Stages 2–4, we adjust the window according to the size of the output feature of each stage (when the size of the window equals that of the output feature, standard self-attention is calculated). Similarly, for DW-B, . For all experiments, the query dimension of each head is , while the expansion layer of each MLP is .

3.5 Complexity Analysis

The computational complexity of the DWM block is composed of two main parts: and . For an image with patches, their computational complexity is as follows (the calculation of SoftMax is ignored here):


The total computational complexity of DWM is as follows:


Since both and are constants, the total computational complexity of DWM does not increase significantly and remains .
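As a rough numeric illustration of this linearity, with hypothetical stage-1 numbers (a 56x56 patch grid and a 7x7 window), the dominant attended-pair count drops sharply once attention is windowed:

```python
# attended token pairs: the dominant term in self-attention cost
h = w = 56                           # hypothetical patch-grid size (stage 1)
M = 7                                # hypothetical window size
global_pairs = (h * w) ** 2          # full MSA: every patch attends to every patch
window_pairs = (M * M) * (h * w)     # W-MSA: each patch attends only within its window
print(global_pairs // window_pairs)  # 64: windowing cuts attended pairs by 64x here
```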

4 Experiments

We conduct a performance comparison with the state-of-the-art (SoTA) methods on an upstream task, ImageNet-1K image classification [Deng2009ImageNetAL], and two downstream tasks: semantic segmentation on ADE20K [Zhou2018SemanticUO], and object detection and instance segmentation on COCO 2017 [Lin2014MicrosoftCC]. Finally, we ablate the important modules of DW-ViT.

4.1 Image Classification on ImageNet-1K

Experimental Settings. We benchmark DW-ViT on ImageNet-1K [Deng2009ImageNetAL], which contains 1.28M training images and 50K test images from 1000 categories. To test the effectiveness of DW-ViT and conduct a fair comparison with similar methods [liu2021swin, chen2021crossvit, chu2021twins], we carefully avoid using any tricks that provide an unfair advantage [Touvron2021GoingDW, Jiang2021TokenLT]. Specifically, following the settings in [liu2021swin, chu2021twins], DW-ViT was trained for 300 epochs with a batch size of 1024 using the AdamW optimizer [Loshchilov2019DecoupledWD]. A cosine decay learning rate scheduler and 20 epochs of linear warm-up are used. The initial learning rate and weight decay are set to 0.001 and 0.05, respectively. In training, the augmentation and regularization strategies of [Touvron2021TrainingDI] are used. Following the settings in [liu2021swin], the repeated augmentation [Hoffer2020AugmentYB] and EMA [Polyak1992AccelerationOS] strategies are not used.

Method #param. (M) FLOPs (G) Top-1 (%)
ResNet50[he2016deep] 26 4.1 76.6
ResNet101[he2016deep] 45 7.9 78.2
X50-32x4d[xie2017aggregated] 25 4.3 77.9
X101-32x4d[xie2017aggregated] 44 8.0 78.7
RegNetY-4G [radosavovic2020designing] 21 4.0 80.0
RegNetY-8G [radosavovic2020designing] 39 8.0 81.7
RegNetY-16G [radosavovic2020designing] 84 16 82.9
DeiT-Small/16 [touvron2021training] 22 4.6 79.9
CrossViT-S [chen2021crossvit] 27 5.6 81.0
T2T-ViT-14 [yuan2021tokens] 22 5.2 81.5
TNT-S [han2021transformer] 24 5.2 81.3
CoaT Mini [xu2021co] 10 6.8 80.8
PVT-Small [wang2021pyramid] 25 3.8 79.8
CPVT-GAP [yuan2021tokens] 23 4.6 81.5
CrossFormer-S [wang2021crossformer] 28 4.5 81.5
Swin-T [liu2021swin] 28 4.5 81.3
DW-T 30 5.2 82.0
ViT-Base/16 [dosovitskiy2020image] 87 17.6 77.9
DeiT-Base/16 [touvron2021training] 87 17.6 81.8
T2T-ViT-24 [yuan2021tokens] 64 14.1 82.3
CrossViT-B [chen2021crossvit] 105 21.2 82.2
TNT-B [han2021transformer] 66 14.1 82.8
CPVT-B [chu2021conditional] 88 17.6 82.3
PVT-Large [wang2021pyramid] 61 9.8 81.7
Swin-B [liu2021swin] 88 15.4 83.3
DW-B 91 17.0 83.8
Table 2: Performance comparison on ImageNet-1K. All models are trained and evaluated at resolution. shows the performance in the case of single-scale embedding.

Results. Tab. 2 reports the performance comparison of DW-ViT and state-of-the-art methods on ImageNet-1K. The compared methods include classic and recent ConvNet-based [he2016deep, xie2017aggregated, radosavovic2020designing] and Transformer-based [liu2021swin, wang2021crossformer, chen2021crossvit] models. All models are trained and evaluated at resolution. As shown in Tab. 2, with similar parameters and FLOPs, DW-ViT has clear advantages over other current state-of-the-art methods. Specifically, compared with the Transformer baseline DeiT [touvron2021training], the performance of DW-T and DW-B is improved by 2.1% and 2.0%, respectively. At the same time, under the same settings, compared with Swin [liu2021swin], DW-T and DW-B achieve performance gains of 0.7 and 0.5 points, respectively, with the help of dynamic windows. This shows that DW-ViT, as a general visual feature extractor, can obtain better feature representations. In addition, it is worth mentioning that, as an independent module, DWM can be flexibly embedded into any window-based ViT model [wang2021crossformer, chu2021twins, lin2021cat] like Swin [liu2021swin] to improve the model's dynamic modeling capability for multi-scale information. Compared with these ViTs [wang2021crossformer, chu2021twins, lin2021cat] that use a fixed single-scale window, DWM gives DW-ViT a larger model capacity and better adaptability and scalability.

4.2 Semantic Segmentation on ADE20K

Backbone Method #param. (M) FLOPs (G) mIoU +MS
ResNet-101 [he2016deep] DANet [nam2017dual] 69 1119 45.3 -
ResNet-101 OCRNet [yuan2020object] 56 923 44.1 -
ResNet-101 DLab.v3+ [chen2018encoder] 63 1021 44.1 -
ResNet-101 ACNet[fu2019adaptive] - - 45.9 -
ResNet-101 DNL[yin2020disentangled] 69 1249 46.0 -
ResNet-101 UperNet [xiao2018unified] 86 1029 44.9 -
HRNet-w48 [sun2019deep] DLab.v3+ [chen2018encoder] 71 664 45.7
ResNeSt-101[Zhang2020ResNeStSN] DLab.v3+[chen2018encoder] 66 1051 46.9 -
ResNeSt-200[Zhang2020ResNeStSN] DLab.v3+[chen2018encoder] 88 1381 48.4 -
PVT-S [wang2021pyramid] S-FPN [kirillov2019panoptic] 28 - 39.8
PVT-M S-FPN 48 219 41.6 -
PVT-L S-FPN 65 283 42.1 -
CAT-S [lin2021cat] S-FPN 41 214 42.8 -
CAT-B S-FPN 55 276 44.9 -
Swin-T[liu2021swin] UperNet[xiao2018unified] 60 945 44.5 45.8
Swin-B[liu2021swin] UperNet[xiao2018unified] 121 1188 48.1 49.7
DW-T UperNet[xiao2018unified] 61 953 45.7 46.9
DW-B UperNet[xiao2018unified] 125 1200 48.7 50.3
Table 3: Performance comparison on the ADE20K [Zhou2018SemanticUO] val. The single-scale and multi-scale evaluation results are presented in the last two columns. The FLOPs (G) are calculated at an input resolution of .

ADE20K [Zhou2018SemanticUO] is a widely used semantic segmentation dataset. It contains 20K training images, 2K validation images, and 3K test images, covering a total of 150 semantic categories. DW-ViT and UperNet [xiao2018unified] in mmsegmentation [mmseg2020] are used as the backbone and segmentation method, respectively. The pre-trained backbone is DW-ViT trained on ImageNet-1K. Following the settings in [liu2021swin], the input image size is , AdamW [Loshchilov2019DecoupledWD] is used as the optimizer (the initial learning rate is , the weight decay is 0.01, and a linear learning rate decay is used), and the model is trained with a batch size of 16 for 160K iterations. For multi-scale evaluation (+MS), the scaling ratio ranges from 0.5 to 1.75.

The performance comparison between DW-ViT and other methods on ADE20K val is shown in Tab. 3. As shown in Tab. 3, DW-ViT achieves the best performance compared to many state-of-the-art methods. Specifically, under similar FLOPs and parameters, compared with Swin [liu2021swin], DW-ViT improves the single-scale evaluation by 1.2 and 0.6 points, respectively. Compared with other methods, DW-ViT also obtains competitive results. Notably, DW-ViT's advantage over Swin is more pronounced (e.g. ) on ADE20K than on ImageNet, suggesting that the dynamic window mechanism of DW-ViT is even more advantageous on downstream tasks with more complex images.

4.3 Object Detection on COCO

Method #param. (M) FLOPs (G) AP AP AP AP AP AP
Mask R-CNN [he2017mask]
ResNet50 [he2016deep] 44 260 41.0 61.7 44.9 37.1 58.4 40.1
PVT-Small [wang2021pyramid] 44 245 43.0 65.3 46.9 39.9 62.5 42.8
ViL-Small [Zhang2021MultiScaleVL] 45 174 43.4 64.9 47.0 39.6 62.1 42.4
Swin-T [liu2021swin] 48 264 46.0 68.2 50.2 41.6 65.1 44.8
DW-T 49 275 46.7 69.1 51.4 42.4 66.2 45.6
ResNeXt101-64x4d [xie2017aggregated] 102 493 44.4 64.9 48.8 39.7 61.9 42.6
PVT-Large [wang2021pyramid] 81 364 44.5 66.0 48.3 40.7 63.4 43.7
ViL-Base [Zhang2021MultiScaleVL] 76.1 365 45.7 67.2 49.9 41.3 64.4 44.5
Swin-Base [liu2021swin] 107 496 48.5 69.8 53.2 43.4 66.8 46.9
DW-B 111 505 49.2 70.6 54.0 44.0 68.0 47.7
Cascade Mask R-CNN [cai2018cascade, he2017mask]
DeiT-S[touvron2021training] 80 889 48.0 67.2 51.7 41.4 64.2 44.3
ResNet50[he2016deep] 82 739 46.3 64.3 50.5 40.1 61.7 43.4
Swin-T[liu2021swin] 86 745 50.5 69.3 54.9 43.7 66.6 47.1
DW-T 87 754 51.5 70.5 55.9 44.7 67.8 48.5
X101-64 [xie2017aggregated] 140 972 48.3 66.4 52.3 41.7 64.0 45.1
Swin-B [liu2021swin] 145 982 51.9 70.9 56.5 45.0 68.4 48.7
DW-B 149 992 52.9 71.6 57.5 45.7 69.0 50.0
Table 4: Performance comparison of object detection and instance segmentation on the COCO2017 val dataset. Two object detection frameworks are used: Mask R-CNN [he2017mask] and Cascade Mask R-CNN [cai2018cascade]. The FLOPs (G) are calculated at an input resolution of . indicates that additional deconvolution layers are used to generate hierarchical features.

Further, we benchmark DW-ViT on object detection and instance segmentation with COCO 2017 [Lin2014MicrosoftCC]. COCO contains 118K training, 5K validation, and 20K test images. The pre-trained model used is DW-ViT trained on ImageNet-1K. DW-ViT is used as the visual backbone and is then plugged into a representative object detection framework. We consider two representative object detection frameworks: Mask R-CNN [he2017mask] and Cascade Mask R-CNN [cai2018cascade]. All models are trained on the training images, and results are reported on the validation set. The same settings are used for all frameworks. Specifically, we use multi-scale training [carion2020end, sun2021sparse], the AdamW [Loshchilov2019DecoupledWD] optimizer (the initial learning rate, weight decay, and batch size are 0.0001, 0.05, and 16), and a 3x schedule (36 epochs, with the learning rate decayed by 10x at epochs 27 and 33). The implementation is based on MMDetection [chen2019mmdetection].

The performance comparison of object detection and instance segmentation on the COCO 2017 val dataset is shown in Tab. 4. Compared with other state-of-the-art methods, DW-ViT achieves the best performance under both object detection frameworks. Specifically, compared with the Transformer baseline DeiT-S [touvron2021training], DW-T is improved by 3.5 points. Compared with Swin [liu2021swin], DW-ViT achieves an improvement of more than 0.7 points in object detection and instance segmentation under both frameworks. At the same time, compared with Swin, the parameters and FLOPs of DW-ViT do not increase significantly, which again demonstrates the superiority of the dynamic window mechanism. In addition, the results under the two detection frameworks show that DW-ViT can be easily embedded into different frameworks like other backbones.

4.4 Ablation Study

Method      Window  DMSW  #param. (M)  FLOPs (G)  Top-1 (%)
Swin-T      7       -     28.29        4.49       74.31
Swin-T      11      -     28.31        4.69       75.18
Swin-T      14      -     28.34        4.89       75.83
Swin-T      17      -     28.35        5.06       76.31
Swin-T      21      -     28.36        5.34       76.28
Swin-T      23      -     28.36        5.49       76.24
MSW-MSA ()          '1'   29.05        5.18       73.43
MSW-MSA ()          '-'   28.33        5.07       76.10
MSW-MSA ()          '✓'   29.77        5.18       76.68
Table 5: Performance comparison of Swin and DW-ViT on ImageNet-1K [Deng2009ImageNetAL] under different window and module settings.

To explore the effects of each component of DW-ViT, we compare the performance of Swin-T with a single-scale window, MSW-Swin, and DW-ViT with and without the DMSW mechanism. Specifically, we set ; for all other settings, we adopt the default settings presented in Swin [liu2021swin]. Single-scale windows are taken from , and multi-scale windows are set to (we adopt the original settings in Swin [liu2021swin] and modify only the window size; when the window size is larger than the input feature map, global self-attention is performed instead). Their performance on ImageNet-1K [Deng2009ImageNetAL] is shown in Tab. 5.

In Tab. 5, the DMSW column shows three states ('1', '-', '✓'). MSW-MSA + '1' refers to removing the dynamic weight generation and directly assigning the same weight () to all branches. MSW-MSA + '-' (i.e., MSW-Swin) denotes removing the entire DMSW module, while MSW-MSA + '✓' denotes the normal DW-T. The performance of MSW-Swin is lower than that of Swin-T with . This may be because the sub-optimal window setting impairs the performance of the model to a certain extent. The comparison between DW-T and MSW-MSA + '1' further shows that the dynamic window mechanism achieves a very significant improvement (i.e., 3.3%). In addition, with the help of the dynamic window mechanism, the performance of DW-ViT is better than that of all ViTs using the same single-scale window. This shows that the dynamic window weighting mechanism indeed plays a very important role in DW-ViT.
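The contrast between the '1' and '✓' settings can be sketched as follows. This is only an illustration of the fusion step, not the exact DW-ViT weight-generation network: the branch outputs, the softmax-based weighting, and the equal-weight fallback are our assumptions about the general form of such a mechanism.

```python
import numpy as np

def _softmax(x):
    # Numerically stable softmax over branch scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_branches(branch_feats, logits=None):
    """Fuse the outputs of multi-scale window branches.

    branch_feats: list of same-shape arrays, one per window-size branch
                  (each produced by a different head group of W-MSA).
    logits:       per-branch scores from a weight-generation head.
                  None reproduces the static equal-weight baseline
                  (the MSW-MSA + '1' row in Tab. 5); passing logits
                  corresponds to the dynamic setting ('✓').
    """
    k = len(branch_feats)
    if logits is None:
        w = np.full(k, 1.0 / k)            # static: identical weights
    else:
        w = _softmax(np.asarray(logits, dtype=float))  # dynamic weights
    return sum(wi * f for wi, f in zip(w, branch_feats))
```

In the dynamic case, branches whose window size suits the current input receive larger weights, which is the behavior the ablation credits for the 3.3% gap between the two settings.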

5 Conclusion

The size of the window has an important impact on the performance of the model, yet there is currently very little systematic study of window size in window-based ViT works. In this paper, we addressed this problem for the first time. Based on our observations of the above issues, we proposed a novel dynamic multi-scale window mechanism for W-MSA to obtain the optimal window configuration, thereby enhancing the model's dynamic modeling capability for multi-scale information. With the help of the dynamic window mechanism, the performance of DW-ViT is better than that of all ViTs using the same single-scale window, and the proposed approach achieves good results on multiple CV tasks. At the same time, DWM has good scalability and can thus be easily inserted as a module into any window-based ViT.

6 Discussion

Potential negative societal impact: As a general visual feature extractor, DW-ViT has shown good performance on multiple CV tasks. However, due to the domain gap between different tasks, some fine-tuning may still be needed when the model is transferred to other tasks.

Limitation: There are a few issues that we need to improve in the future: (1) Although DW-ViT has shown good performance on multiple vision tasks, DWM still introduces a small number of additional parameters and computations compared with the single-scale window self-attention mechanism [liu2021swin]. (2) As far as DWM's dynamic window mechanism is concerned, part of the computational budget is still allocated to suboptimal candidate windows, whereas an ideal strategy would allocate the entire computational budget to the most promising windows at each layer of the network.


Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0109700, NSFC under Grants No. 61972315 and No. 61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) under DE190100626, Shaanxi Province International Science and Technology Cooperation Program (Key Projects) No. 2022KWZ-14, Ministry of Science and Technology Foundation Project 2020AAA0106900, Key Realm R&D Program of Guangzhou 202007030007, and an Open Fund from Alibaba.