
FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization

by Kecheng Zheng, et al.

MLP-like models built entirely upon multi-layer perceptrons have recently been revisited, exhibiting performance comparable to that of transformers. They are among the most promising architectures due to the excellent trade-off between network capability and efficiency in large-scale recognition tasks. However, their generalization performance on heterogeneous tasks is inferior to that of other architectures (e.g., CNNs and transformers) due to the extensive retention of domain information. To address this problem, we propose a novel frequency-aware MLP architecture, in which the domain-specific features are filtered out in the transformed frequency domain, augmenting the invariant descriptor for label prediction. Specifically, we design an adaptive Fourier filter layer, in which a learnable frequency filter is utilized to adjust the amplitude distribution by optimizing both the real and imaginary parts. A low-rank enhancement module is further proposed to rectify the filtered features by adding the low-frequency components from SVD decomposition. Finally, a momentum update strategy is utilized to stabilize the optimization against fluctuations of model parameters and inputs through output distillation with weighted historical states. To the best of our knowledge, we are the first to propose an MLP-like backbone for domain generalization. Extensive experiments on three benchmarks demonstrate significant generalization performance, outperforming the state-of-the-art methods by a margin of 3



1 Introduction

Convolutional neural networks (CNNs) have brought huge performance breakthroughs to various vision tasks and have long dominated the corresponding backbone networks (e.g., VGG, ResNet). Recently, the transformer with its self-attention mechanism has replaced the local relation learning of CNNs with long-range modeling, raising the upper bound of the performance of deep networks. More recently, some MLP-like works have further replaced self-attention operations with only fully connected and skip-connected layers, achieving a better trade-off between network capability and efficiency.

Although MLP-like models show promising results in large-scale homogeneous recognition tasks (e.g., ImageNet classification), their transfer performance on various heterogeneous tasks is lower than that of CNNs and transformers with the same number of parameters. To bridge this gap, this paper explores how MLP-like models trained on a collection of multiple training sources can be better generalized to unknown heterogeneous data domains, which is known as the domain generalization (DG) problem.

Existing DG works are mainly built upon CNNs [17, 2, 42, 3] to learn domain-invariant representations after conditioning on the class label from known multi-source domains. They introduce adversarial training [20], meta learning [2], self-supervised learning [42] or domain augmentation techniques [42] and have shown promising results. Orthogonally, some recent works extract generalized CNN features by augmenting the frequency domain, and find that manipulating the amplitude components can directly affect the domain information.

Motivated by this, we analyze the SOTA MLP-like models in the frequency domain, revealing the following challenges. Firstly, as shown in Fig. 1 (a), we calculate the degree of filtering for different frequency components before and after the MLP layer, which illustrates that the MLP layer cannot suppress the high-frequency components of the input features. When no extra frequency operations (e.g., the Fourier filter in Fig. 1 (b)) are implemented, most high-frequency information is retained after the pure MLP layer, making it hard to resist the interference of heterogeneous data in domain generalization. Secondly, the frequency response is domain-specific. It can be seen in Fig. 1 (c) that the frequency responses are inconsistent between different domains, so it is impractical to set a fixed cutoff frequency. Meanwhile, the parameters of MLP-like models are data-independent, making it impossible to adjust the frequency response adaptively according to the input domain characteristics as shown in Fig. 1 (d). This makes MLP-like architectures unsuitable for the DG problem, which requires class prediction under different frequency distributions.

Figure 1: Fourier analysis on the PACS dataset. (a) The amplitudes before and after the last MLP layer; (b) the amplitudes before and after the proposed adaptive Fourier filter; (c) the difference between the amplitudes before and after the last MLP layer on different domains; (d) the difference between the amplitudes before and after the proposed Fourier filter on different domains.

To address this problem, we propose a frequency-aware MLP framework (FAMLP), explicitly promoting the extraction of domain-invariant frequency features. The core of the framework is the adaptive Fourier filter layer, which enhances the rectification of low-frequency features block by block, mitigating the interference of domain shifts. Specifically, we first utilize the fast Fourier transform to map the features to the frequency domain before each MLP layer. Then the domain-specific features are filtered out by a learnable frequency filter, which corresponds to the real and imaginary parts of the frequency features. To ensure the integrity of important features, the filtered features are further strengthened by fusing the low-frequency components from SVD decomposition. Finally, the domain-invariant features are mapped back to the spatial domain through the inverse Fourier transform for the subsequent MLP layer. Furthermore, to improve the overall generalization of the model from an optimization perspective, we propose a momentum update strategy, distilling the invariant features from an updated teacher model. We calculate the teacher model based on the weighted historical states of our FAMLP model, guaranteeing consistent outputs under minor network changes. The input images obtained by data augmentation are fed into the teacher network to guide the optimization process in terms of robustness to different domain shifts.

The main contributions of this paper are three-fold:

  • We propose a frequency-aware MLP framework (FAMLP) for the domain generalization task, in which the low-frequency features are adaptively enhanced by a learnable frequency kernel, resulting in a domain-invariant representation.

  • We propose a momentum update strategy for the FAMLP model, in which the weighted historical states form the updated teacher model that constrains the features to be consistent.

  • We propose a strong baseline that exploits the MLP-like model for DG tasks for the first time, achieving state-of-the-art performance on three benchmarks: PACS, Office-Home and Digits-DG.

Figure 2: Illustration of the proposed frequency-aware MLP architecture. Specifically, an adaptive Fourier filter (AFF) module is proposed to plug into the MLP-like model. Within this module, a learnable frequency filter (LFF) is utilized to adjust the amplitude distribution by optimizing both the real and imaginary parts. Meanwhile, a low-rank enhancement (LRE) module is further proposed to rectify the filtered features by adding the low-frequency components from SVD decomposition. In addition, input images are transformed by the data transformations to prepare data for the distillation loss.

2 Related Work

Domain Generalization. Domain generalization (DG) aims to generalize the model to unseen domains, with multiple disjoint domains provided during training. Many approaches focus on extracting domain-invariant features and aligning the distributions of different domains to address the DG problem. For example, [20] proposes a conditional invariant adversarial network to guarantee the domain-invariant property, and the Siamese network is introduced in [24] to learn a discriminative embedding space. Later, some meta-learning-based methods [17, 2] are proposed to introduce a type of regularization into domain generalization. This type of method synthesizes virtual testing domains to simulate the train/test domain shift within each mini-batch. Data augmentation is also a popular idea to address this problem. Adversarial-based [39] and Fourier-based [42] examples are generated to improve the generalization of the models. There are also other methods employing low-rank decomposition [16] or a self-supervised jigsaw task [3] to train the models. Convolutional neural networks dominate the task among all of the above methods, while we aim to investigate the performance of the MLP-like model for DG in this paper.

MLP-Like Backbones. Recently, some works [36, 37, 22, 30, 11] try to replace the self-attention layer with the fully connected layer for a better trade-off between performance and efficiency on large-scale datasets. MLP-mixer [36] first proposes a technically simple architecture solely based on multi-layer perceptrons, which mix the per-location features and spatial features. The experimental results show that MLP-like models are as good as existing SOTA methods including CNNs and transformers [6, 27]. Following this work, gMLP [22] enhances the spatial interaction with multiplicative gating. ResMLP [37] replaces the batch or channel normalization with a simple affine transformation for a better trade-off. Vip [11] separately encodes the feature representations along the height and width dimensions for precise positional information. Furthermore, MLP-like models have also been explored in other vision tasks such as dense prediction [5] and video recognition [45]. Orthogonally, this paper is designed to explore the transfer capability and optimization strategies of MLP-like models, especially in domain generalization. We believe this is a must for the MLP-like model to act as a universal backbone. To the best of our knowledge, FAMLP is the first MLP-like method designed for the domain generalization task.

Matrix Decomposition. Matrix decomposition has been widely adopted in deep networks for different purposes. Most researchers focus on network compression by factorizing the low-rank components, including the softmax layer [33], the convolution layer [8, 41] and the embedding layer [15]. Recently, some researchers have also explored certain properties of the decomposed signals and introduced them to different tasks. [14] decomposes each convolution into a shared part for the subsequent incremental tasks. [9] factorizes the representation to recover a clean signal subspace as the global context, modeling the long-range dependencies. [21] revisits the dynamic convolution via matrix decomposition, mitigating the joint optimization difficulty. In contrast, this paper is designed to explore the low-rank components of frequency features for the augmentation of domain-invariant information.

3 Frequency-Aware MLP

We detail the frequency-aware MLP architecture and its important components in this section. First, we describe the problem setting and the standard MLP-like model, which is adopted as our baseline. Then the two proposed core components, the adaptive Fourier filter layer and the momentum update strategy, are introduced. Finally, we analyze the optimization flow of the overall pipeline.

3.1 Problem Description

Given multiple source domains $\{\mathcal{D}_1, \dots, \mathcal{D}_K\}$ with labelled samples $\{(x_i^k, y_i^k)\}_{i=1}^{N_k}$ in the $k$-th domain $\mathcal{D}_k$, where $N_k$ denotes the number of sampled data, the goal of DG methods is to utilize these data to train a model that performs well on the unseen target domain. Although most existing DG works are built upon convolutional neural networks to learn domain-invariant representations after conditioning on the class label from known multi-source domains, this work turns to the pure MLP architecture to comprehensively investigate the performance of MLP-like models on the domain generalization task.

3.2 Standard MLP-Like Model

Following the architecture of MLP-mixer [36], the standard MLP-like model consists of a per-patch transform layer, MLP layers and a classification head. Specifically, the input image $X \in \mathbb{R}^{H \times W \times 3}$ is first split into a grid of $S \times S$ non-overlapping patches, where $H$ and $W$ represent the initial spatial size. Then, each patch is independently projected to the embedding space by a linear layer $f_{\mathrm{emb}}$,

$$Z_0 = f_{\mathrm{emb}}(X). \tag{1}$$

The resulting latent features are fed to a sequence of MLP layers $\{f_i\}_{i=1}^{N}$, which fuse the per-patch and per-channel information in turn,

$$Z_i = f_i(Z_{i-1}), \quad i = 1, \dots, N, \tag{2}$$

where $N$ represents the number of MLP layers and $i$ indexes the $i$-th MLP layer in the sequence. Finally, the output features are averaged into a $d$-dimensional vector, which is fed to a linear classifier $g$ for the predicted label,

$$\hat{Y} = g\big(\mathrm{AvgPool}(Z_N)\big). \tag{3}$$

MLP Layer. To facilitate the feature interaction during the optimization process, each MLP layer contains two MLP blocks along different dimensions. The input features are first projected along the patch dimension (i.e., $\mathbb{R}^{S^2} \mapsto \mathbb{R}^{S^2}$) in the first block ($\mathrm{MLP}_1$). To reduce the difficulty of optimization, the initial input is added through the skip connection. Similarly, the intermediate features are then projected along the channel dimension (i.e., $\mathbb{R}^{d} \mapsto \mathbb{R}^{d}$) in the second block ($\mathrm{MLP}_2$). Each MLP block consists of two fully connected layers and an element-wise nonlinearity (GELU [10]):

$$U = Z_{i-1} + \mathrm{MLP}_1\big(\mathrm{LayerNorm}(Z_{i-1})\big), \qquad Z_i = U + \mathrm{MLP}_2\big(\mathrm{LayerNorm}(U)\big), \tag{4}$$

where LayerNorm represents the layer normalization [1].
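The MLP layer described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the omission of biases and of the learnable LayerNorm parameters, the tanh approximation of GELU, and the toy sizes `P`, `d` and `hidden` are all assumptions made for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed here for brevity)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize over the channel (last) axis; learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp_block(x, w1, w2):
    # Two fully connected layers with a GELU in between (biases omitted).
    return gelu(x @ w1) @ w2

def mlp_layer(z, token_w1, token_w2, chan_w1, chan_w2):
    # z has shape (P, d): P patches, d channels.
    # First block: mix along the patch dimension (transpose, mix, transpose back).
    u = z + mlp_block(layer_norm(z).T, token_w1, token_w2).T
    # Second block: mix along the channel dimension.
    return u + mlp_block(layer_norm(u), chan_w1, chan_w2)

rng = np.random.default_rng(0)
P, d, hidden = 16, 8, 32  # hypothetical toy sizes
z = rng.standard_normal((P, d))
token_w1 = 0.02 * rng.standard_normal((P, hidden))
token_w2 = 0.02 * rng.standard_normal((hidden, P))
chan_w1 = 0.02 * rng.standard_normal((d, hidden))
chan_w2 = 0.02 * rng.standard_normal((hidden, d))
out = mlp_layer(z, token_w1, token_w2, chan_w1, chan_w2)
```

Both blocks are residual, so the layer preserves the `(P, d)` shape and reduces to the identity when the first weight matrix of each block is zero.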

3.3 Adaptive Fourier Filter Layer

As the receptive field of the fully connected layer spans a long range and covers global interactions, the extracted features contain extensive domain information, which is reflected in the high-frequency component. To eliminate its negative effect on DG, we add an adaptive Fourier filter (AFF) layer before each MLP layer. The input features are first fed to the adaptive Fourier filter layer, eliminating the high-frequency interference,

$$\tilde{Z}_{i-1} = \mathrm{AFF}(Z_{i-1}). \tag{5}$$

In this case, Equation 4 can be rewritten as:

$$U = \tilde{Z}_{i-1} + \mathrm{MLP}_1\big(\mathrm{LayerNorm}(\tilde{Z}_{i-1})\big), \qquad Z_i = U + \mathrm{MLP}_2\big(\mathrm{LayerNorm}(U)\big). \tag{6}$$


Learnable Frequency Filter. To explicitly filter the high-frequency interference in the latent features, we directly transform the spatial feature into the frequency domain. For an input embedding $Z$ on the $S \times S$ patch grid, its Fourier transformation can be formulated as:

$$\mathcal{F}(Z)(u, v) = \sum_{h=0}^{S-1} \sum_{w=0}^{S-1} Z(h, w)\, e^{-j 2\pi \frac{uh + vw}{S}} = R(u, v) + j\, I(u, v),$$

$$A(u, v) = \sqrt{R^2(u, v) + I^2(u, v)}, \qquad P(u, v) = \arctan\frac{I(u, v)}{R(u, v)},$$

where $R$ and $I$ represent the real and imaginary parts of $\mathcal{F}(Z)$, and $A$ and $P$ represent the amplitude and phase components in the frequency domain. Existing works [42, 44] have shown that the amplitude component is highly related to the domain information, and it is influenced by both the real and imaginary parts. To adaptively refine the domain-invariant features, we maintain a learnable frequency filter $K$, which is the same size as $R$ and $I$. Different from the small-size filter (e.g., $3 \times 3$) in the spatial domain, the frequency filter contains all the sampled frequency values. The frequency features are directly element-wise multiplied by the filter, which is optimized to adjust the useful amplitude distribution from multiple domains,

$$\hat{Z}_f = K \odot \mathcal{F}(Z).$$

The filtered features are transformed to the spatial domain through the inverse Fourier transformation for the subsequent operation,

$$\hat{Z} = \mathcal{F}^{-1}(\hat{Z}_f).$$

Both the Fourier transformation and its inverse can be implemented by the FFT algorithm [26].
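The transform, filter, inverse-transform pipeline above can be illustrated with NumPy's FFT routines. This is a hedged sketch, not the paper's code: the function name, the `(S, S, d)` feature layout, and the two hand-built filters are assumptions for illustration, and in a real model the complex filter would be a trainable parameter rather than a fixed array.

```python
import numpy as np

def learnable_frequency_filter(z, filt):
    """Filter features in the frequency domain.

    z:    real features on an (S, S, d) patch grid (assumed layout).
    filt: complex filter of the same shape; its real and imaginary
          parts would be the learnable quantities in training.
    """
    z_freq = np.fft.fft2(z, axes=(0, 1))   # spatial -> frequency domain
    z_freq = z_freq * filt                 # element-wise multiplication
    # Back to the spatial domain; the real part is kept because an
    # arbitrary complex filter need not preserve conjugate symmetry.
    return np.fft.ifft2(z_freq, axes=(0, 1)).real

rng = np.random.default_rng(0)
S, d = 8, 4  # hypothetical toy sizes
z = rng.standard_normal((S, S, d))

# An all-ones filter passes every frequency unchanged.
identity = np.ones((S, S, d), dtype=complex)
same = learnable_frequency_filter(z, identity)

# A filter that keeps only the zero-frequency (DC) bin is an extreme
# low-pass filter: every position collapses to the per-channel mean.
dc_only = np.zeros((S, S, d), dtype=complex)
dc_only[0, 0, :] = 1.0
smooth = learnable_frequency_filter(z, dc_only)
```

The two hand-built filters mark the extremes between which a learned filter operates: passing everything, and keeping only the lowest frequency.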

Low-Rank Enhancement Module. To further enhance the maintenance of domain-invariant features, we extract low-frequency components from the perspective of matrix decomposition. An input embedding $Z$ can be seen as the combination of a static kernel and some noise information $E$, the latter being sensitive to variations such as domain shifts,

$$Z = DC + E,$$

where $D$ and $C$ represent the decomposed matrices. We utilize the SVD decomposition for the low-rank component in the frequency feature $Z_f = \mathcal{F}(Z)$,

$$\min_{D, C}\; \mathcal{L}(Z_f, DC) + \mathcal{R}_1(D) + \mathcal{R}_2(C),$$

where $\mathcal{L}$ represents the reconstruction loss, and $\mathcal{R}_1$ and $\mathcal{R}_2$ are the regularization terms. Note that the whole process is non-parametric, so we denote it as $\mathcal{M}$ to distinguish it from the learnable operator $K$. To reduce the complexity of $\mathcal{M}$, we utilize two linear transformation layers (i.e., $W_{\mathrm{in}}$ and $W_{\mathrm{out}}$) to map the features to different embedding spaces. Finally, the compact features are added to the filtered features for further augmentation,

$$\hat{Z}_f = K \odot Z_f + W_{\mathrm{out}}\, \mathcal{M}(W_{\mathrm{in}} Z_f).$$
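One simple way to obtain a low-rank component is a truncated SVD, sketched below. The paper only states that SVD decomposition is used, so the truncation rule, the function name, and the toy "static part plus noise" matrix are assumptions for illustration; the two linear maps around the decomposition are omitted.

```python
import numpy as np

def low_rank_component(x, rank):
    # Keep the `rank` largest singular triplets of x (truncated SVD).
    u, s, vh = np.linalg.svd(x, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vh[:rank, :]

rng = np.random.default_rng(0)
# Toy matrix: a rank-1 "static" part plus small noise, mirroring the
# static-kernel-plus-noise view in the text (sizes are hypothetical).
static = np.outer(rng.standard_normal(16), rng.standard_normal(8))
noise = 0.01 * rng.standard_normal((16, 8))
x = static + noise

recovered = low_rank_component(x, rank=1)
full = low_rank_component(x, rank=8)  # all singular values kept

residual_truncated = np.linalg.norm(x - recovered)
residual_static = np.linalg.norm(x - static)
```

By the Eckart-Young theorem the truncated SVD is the best approximation of its rank in Frobenius norm, so `residual_truncated` can never exceed `residual_static` here. `np.linalg.svd` also accepts complex input, so the same routine applies to frequency features.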


3.4 Momentum Update Strategy

To enhance the generalization of the MLP from the perspective of overall optimization, we add the momentum update strategy to the standard fully supervised paradigm. Here we denote $\theta_t^{\mathrm{stu}}$ and $\theta_t^{\mathrm{tea}}$ as all the optimized parameters of the student and teacher models at time step $t$. We update the teacher model based on the historical state of the teacher model and the current state of the student model for distillation. That is,

$$\theta_t^{\mathrm{tea}} = m\, \theta_{t-1}^{\mathrm{tea}} + (1 - m)\, \theta_t^{\mathrm{stu}},$$

where $m$ represents the momentum weight. Since the teacher model can be seen as a weighted summation of the student models, their outputs should be similar, so we constrain the optimized model to be consistent with the teacher. To further improve the generalization, we adopt data augmentation for the input of the teacher model. It can be seen in the experimental part that these augmentation strategies are also beneficial for the DG problem,

$$\mathcal{L}_{\mathrm{dis}} = \mathrm{KL}\Big(\sigma\big(g(Z_N) / T\big)\, \Big\|\, \sigma\big(g_{\mathrm{tea}}(Z_N^{\mathrm{tea}}) / T\big)\Big), \qquad Z_N^{\mathrm{tea}} \text{ computed from } \mathrm{DataAug}(X),$$

where DataAug represents the Fourier-based data augmentation [42] together with the standard augmentation protocols, KL represents the Kullback-Leibler divergence, $g_{\mathrm{tea}}$ denotes the classification head of the teacher model, $T$ is the temperature and $\sigma$ refers to the softmax operation.
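The momentum update and the temperature-scaled distillation term can be sketched with the standard library alone. The function names, the flat-list parameter representation, and the direction of the KL divergence (student relative to teacher) are assumptions for illustration; the defaults for `m` and `temperature` follow the values given in the implementation details.

```python
import math

def ema_update(teacher, student, m=0.9995):
    # Teacher parameters become a momentum-weighted average of their own
    # history and the current student parameters.
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, temperature=10.0):
    # Assumed direction: KL(student || teacher) on softened predictions.
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    return kl_divergence(p_student, p_teacher)

teacher = ema_update(teacher=[1.0, 1.0], student=[0.0, 2.0], m=0.9)
loss_same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_diff = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

With a momentum as large as 0.9995 the teacher changes very slowly, which is what makes its predictions a stable target for the student.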

3.5 Optimization

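Combining all of the above loss functions, we obtain the full objective for a given input image-target pair $(X, Y)$:

$$\mathcal{L} = \mathcal{L}_{\mathrm{ce}}(\hat{Y}, Y) + \beta\, \mathcal{L}_{\mathrm{dis}},$$

where $\mathcal{L}_{\mathrm{ce}}$ represents the cross-entropy loss, and $\beta$ controls the trade-off between the classification and the distillation loss.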

Methods Art Cartoon Photo Sketch Avg.
DeepAll 77.63 76.77 95.85 69.50 79.94
MetaReg [2] 83.70 77.20 95.50 70.30 81.70
JiGen [3] 79.42 75.25 96.03 71.35 80.51
Epi-FCR [18] 82.10 77.00 93.90 73.00 81.50
MMLD [23] 81.28 77.16 96.09 72.29 81.83
DDAIG [46] 84.20 78.10 95.30 74.70 83.10
CSD [28] 78.90 75.80 94.10 76.70 81.40
InfoDrop [35] 80.27 76.54 96.11 76.38 82.33
MASF [7] 80.29 77.17 94.99 71.69 81.04
L2A-OT [47] 83.30 78.20 96.20 73.60 82.80
EISNet [40] 81.89 76.44 95.93 74.33 82.15
RSC [12] 83.43 80.31 95.99 80.85 85.15
FACT [42] 85.37 78.38 95.15 79.15 84.51
ATSRL [43] 85.80 80.70 97.30 77.30 85.30
DIRT-GAN [25] 82.56 76.37 95.65 79.89 83.62
FSDCL [13] 85.30 81.31 95.63 81.19 85.86
Our FAMLP-S 92.06 82.49 98.10 84.09 89.19
DeepAll 84.94 76.98 97.64 76.75 84.08
MetaReg [2] 87.20 79.20 97.60 70.30 83.60
MASF [7] 82.89 80.49 95.01 72.29 82.67
EISNet [40] 86.64 81.53 97.11 78.07 85.84
RSC [12] 87.89 82.16 97.92 83.35 87.83
FACT [42] 89.63 81.77 96.75 84.46 88.15
ATSRL [43] 90.00 83.50 98.90 80.00 88.10
MBDG [31] 80.60 79.30 97.00 85.20 85.60
FSDCL [13] 88.48 83.83 96.59 82.92 87.96
SWAD [4] 89.30 83.40 97.30 82.50 88.10
Our FAMLP-B 92.63 87.03 98.14 82.69 90.12
Table 1: Leave-one-domain-out results on PACS. The best and second-best results are bolded and underlined respectively.
Methods Art Clipart Product Real Avg.
DeepAll 57.88 52.72 73.50 74.80 64.72
CCSA [24] 59.90 49.90 74.10 75.70 64.90
MMD [19] 56.50 47.30 72.10 74.80 62.70
CG [34] 58.40 49.40 73.90 75.80 64.40
DDAIG [46] 59.20 52.30 74.60 76.00 65.50
L2A-OT [47] 60.60 50.10 74.80 77.00 65.60
Jigen [3] 53.04 47.51 71.47 72.79 61.20
RSC [12] 58.42 47.90 71.63 74.54 63.12
FACT [42] 60.34 54.85 74.48 76.55 66.56
ATSRL [43] 60.70 52.90 75.80 77.20 66.70
FSDCL [13] 60.24 53.54 74.36 76.66 66.20
Our FAMLP-S 69.34 62.61 79.82 82.00 73.44
Fishr [29] 63.40 54.20 76.40 78.50 68.20
SWAD [4] 66.10 57.70 78.40 80.20 70.60
ATSRL [43] 69.30 60.10 81.50 82.10 73.30
Our FAMLP-B 70.53 64.63 81.32 82.79 74.82
Table 2: Leave-one-domain-out results on OfficeHome. The best and second-best results are bolded and underlined respectively.
Methods MNIST MNIST-M SVHN SYN Avg.
DeepAll [46] 95.8 58.8 61.7 78.6 73.7
CCSA [24] 95.2 58.2 65.5 79.1 74.5
MMD-AAE [19] 96.5 58.4 65.0 78.4 74.6
CrossGrad [34] 96.7 61.1 65.3 80.2 75.8
DDAIG [46] 96.6 64.1 68.6 81.0 77.6
Jigen [3] 96.5 61.4 63.7 74.0 73.9
L2A-OT [47] 96.7 63.9 68.6 83.2 78.1
FACT [42] 97.9 65.6 72.4 90.3 81.5
Our FAMLP-S 98.0 83.3 84.1 96.9 90.6
Table 3: Leave-one-domain-out results on Digits-DG.

4 Experiments

In this section, we demonstrate the superiority of our method on three conventional DG benchmarks and conduct several ablation studies to show the effectiveness of each component.

4.1 Setup

Datasets. We conduct the experiments on three benchmark datasets: (1) PACS [16] consists of four domains, i.e., Art Painting, Cartoon, Photo and Sketch. It contains 9991 images of 7 classes in total. (2) Office-Home [38] is also composed of four domains, i.e., Art, Clipart, Product and Real World, with 15500 images of 65 classes. The model is trained on three domains and tested on the remaining one during experiments. (3) Digits-DG [46] is a digit recognition benchmark consisting of four classical datasets: MNIST, MNIST-M, SVHN and SYN. The four datasets mainly differ in font style, background and image quality. We use the original train-validation split in [46] with 600 images per class per dataset.
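The leave-one-domain-out protocol (train on three domains, test on the remaining one) can be enumerated as follows. The helper name is hypothetical and the snippet only lists the splits; it does not load any data.

```python
def leave_one_domain_out(domains):
    # Yield (source_domains, target_domain) pairs, holding out each domain once.
    for target in domains:
        sources = [name for name in domains if name != target]
        yield sources, target

pacs = ["Art Painting", "Cartoon", "Photo", "Sketch"]
splits = list(leave_one_domain_out(pacs))

for sources, target in splits:
    print(f"train on {sources}, test on {target}")
```

Each per-domain column in the result tables corresponds to one such split, with the column name giving the held-out target domain.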

Implementation Details. The backbone is detailed in Section 3.2, which is pretrained on the ImageNet [32] with input patch size in all of our experiments. For fair comparison, we adjust the depth and width of FAMLP to ensure comparable model capacity with different CNNs. Finally, we scale the depth by a factor of (i.e., MLP-S and MLP-B), corresponding to ResNet-18 and ResNet-50, respectively. The network is trained for 50 epochs with a batch size of 16 and a weight decay of 5e-4. We use SGD as the optimizer and set the initial learning rate to 0.001, which is decayed by 0.1 at 40 epochs. The Fourier-based data augmentation [42] together with the standard augmentation protocol, i.e., random resized cropping, horizontal flipping and color jittering, is applied in our experiments. The momentum m for the teacher model is set as 0.9995 and the value of the temperature is 10 and the is 1.5. The first weight parameter is set to 2 for PACS/Digits-DG, and 200 for OfficeHome and the second one is set to 1 for both datasets. We also use a sigmoid ramp-up [42] for the two weights with a length of 5 epochs. The strength of Fourier-based data augmentation is chosen as 1.0 for PACS/Digits-DG, and 0.2 for OfficeHome.
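A sigmoid ramp-up gradually increases a loss weight over the first epochs. The sketch below uses the exponential form exp(-5(1 - t)^2) that is common in teacher-student training; the text does not spell out the formula, so this form, the function names, and the way the ramp multiplies the distillation weight are assumptions.

```python
import math

def sigmoid_rampup(epoch, length=5):
    # Rises smoothly from exp(-5) at epoch 0 to 1.0 at epoch >= length.
    if length == 0:
        return 1.0
    t = min(max(epoch / length, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - t) ** 2)

def total_loss(ce_loss, distill_loss, epoch, weight=2.0, length=5):
    # Classification loss plus the ramped, weighted distillation loss.
    return ce_loss + sigmoid_rampup(epoch, length) * weight * distill_loss

schedule = [sigmoid_rampup(epoch) for epoch in range(7)]
early = total_loss(ce_loss=1.0, distill_loss=0.5, epoch=0)
late = total_loss(ce_loss=1.0, distill_loss=0.5, epoch=5)
```

Early in training the teacher is little more than a copy of the pretrained student, so keeping the consistency term small at first avoids fitting to uninformative targets.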

4.2 Comparison with State-of-the-Art Methods

Domain Generalization. To better assess the overall performance of our scheme, we compare it with the SOTA methods of domain generalization. As shown in Tables 1, 2 and 3, our method achieves average accuracy improvements of 3%, 4% and 9% on the PACS, OfficeHome and Digits-DG datasets, respectively. It is worth noting that our model (i.e., FAMLP-S) maintains good generalization even when the number of parameters decreases, achieving 3.33% and 7.24% improvement. On PACS, our method improves on FACT [42] with ResNet-50 as the backbone by 1.98% and achieves the best results on the art, cartoon and photo domains, but not on the sketch domain. A possible reason is that the content of sketches is simpler than that of other domains, so the global interaction is not very beneficial. For the larger Office-Home dataset, FAMLP outperforms other ResNet-18 and ResNet-50 based methods by a large margin on all the held-out domains, which further illustrates the superiority of our method.

MLP-Like Architecture. To demonstrate the generalization performance of our FAMLP architecture, we compare it with the SOTA MLP-like models, including MLP-mixer, gMLP, ResMLP and Vip. As shown in Table 5, our method achieves one point improvement in the smaller model and 6 points improvement in the larger one. This demonstrates the effectiveness of our scheme in assisting the MLP-like models to resist the disturbances caused by domain shifts.

PACS
Backbone LFF LRE MUS Art Cartoon Photo Sketch Avg.
ResNet-50 - - - 85.45 79.44 96.77 79.33 85.25
MLP-B - - - 85.00 77.86 94.43 65.72 80.75
ResNet-50 ✓ - - 86.28 82.77 96.71 78.80 86.14
MLP-B ✓ - - 89.75 81.83 97.66 81.93 87.79
MLP-B ✓ ✓ - 93.36 85.24 98.62 82.03 89.81
MLP-B ✓ - ✓ 90.45 82.96 98.41 82.49 88.58
MLP-B ✓ ✓ ✓ 92.63 87.03 98.14 82.69 90.12
OfficeHome
Backbone LFF LRE MUS Art Clipart Product Real Avg.
ResNet-50 - - - 64.77 60.02 78.80 78.82 70.60
MLP-B - - - 63.45 56.31 77.81 79.76 69.33
ResNet-50 ✓ - - 66.63 57.78 80.15 80.81 71.34
MLP-B ✓ - - 68.31 63.00 81.60 82.65 73.89
MLP-B ✓ ✓ - 69.39 64.16 81.50 82.95 74.50
MLP-B ✓ - ✓ 68.81 64.63 81.08 81.23 73.93
MLP-B ✓ ✓ ✓ 70.53 64.63 81.32 82.79 74.82
Table 4: Effectiveness of each proposed component on PACS and OfficeHome. ‘AFF’ refers to the adaptive Fourier filter layer, which comprises LFF and LRE; ‘MUS’ refers to the momentum update strategy; ‘LFF’ refers to the learnable frequency filter; ‘LRE’ refers to the low-rank enhancement module.
Method Para. Art Cartoon Photo Sketch Avg.
gMLP-S [22] 20 86.72 80.80 97.54 72.13 84.23
Vip-S [11] 25 87.35 85.96 98.68 80.20 88.05
ResMLP-S [37] 40 85.50 78.63 97.07 72.64 83.46
MLP-B [36] 59 85.00 77.86 94.43 65.72 80.75
Our FAMLP-S 25 92.06 82.49 98.10 84.09 89.19
Our FAMLP-B 44 92.63 87.03 98.14 82.69 90.12
Method Para. Art Clipart Product Real Avg.
gMLP-S [22] 20 64.81 58.33 75.78 79.3 69.56
Vip-S [11] 25 69.55 61.51 79.34 83.11 73.38
ResMLP-S [37] 40 62.42 51.94 75.40 77.21 66.74
MLP-B [36] 59 63.45 56.31 77.81 79.76 69.33
Our FAMLP-S 25 69.34 62.61 79.82 82.00 73.44
Our FAMLP-B 44 70.53 64.63 81.32 82.79 74.82
Table 5: Comparison between different MLP-like models on PACS and OfficeHome.

4.3 Ablation Study

We conduct ablation studies to show the effectiveness of each component in our FAMLP architecture in Table 4. The performance of our scheme is mainly attributed to three prominent components: the LFF layer, the LRE module and MUS. To clarify the function of the learnable frequency filter in the MLP-like model, we add the LFF layer to both the ResNet-50 and MLP-B models for comparison. It can be seen that the generalization performance of ResNet is initially better than that of MLP, but MLP overtakes ResNet after adding the LFF layer. As analyzed earlier, since the MLP-like model covers global interactions, it contains a large amount of domain information. Although the LFF layer brings gains to the CNN as well, the MLP-like model benefits more from the frequency operation, which supports the effectiveness of frequency filtering for MLP generalization. Then we add the LRE module and MUS to the Fourier-based MLP-like model separately. These two components improve the baseline by 1.32% and 0.42% on average, respectively, which demonstrates the effectiveness of the two components. The model performs best when both components are combined, which further shows that the two components act in different ways and can assist each other.
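The quoted average gains can be cross-checked against the Avg. columns of Table 4. The short calculation below assumes that "on average" means the mean over the two datasets and that the reference point is MLP-B with only the LFF layer; both are an interpretation that happens to match the reported figures, not something the text states explicitly.

```python
# Avg. accuracies taken from Table 4 (MLP-B backbone).
pacs = {"lff": 87.79, "lff_lre": 89.81, "lff_mus": 88.58}
office_home = {"lff": 73.89, "lff_lre": 74.50, "lff_mus": 73.93}

def mean_gain(variant):
    # Mean improvement over the LFF-only model across the two datasets.
    gains = [table[variant] - table["lff"] for table in (pacs, office_home)]
    return sum(gains) / len(gains)

lre_gain = mean_gain("lff_lre")  # about 1.32 points
mus_gain = mean_gain("lff_mus")  # about 0.42 points
print(f"LRE: +{lre_gain:.3f}, MUS: +{mus_gain:.3f}")
```

Under this reading most of the LRE gain comes from PACS (about 2.0 points against about 0.6 on OfficeHome), and the MUS gain is likewise concentrated on PACS.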

4.4 Analysis

Effectiveness of Learnable Frequency Filter. To better demonstrate the role of the learnable frequency filter during optimization, we show the visualization results in the frequency domain. As shown in Fig. 3 (a), the high-frequency components are clearly suppressed by our Fourier filter. Due to this property, the domain-specific features are largely filtered out, which improves the generalization of the optimized features. In Fig. 3 (b) and (c), the vertical coordinate represents the amplitude attenuation of different frequency components before and after adopting the frequency filter. It can be seen that the suppression frequency characteristics are consistent within a domain (i.e., 8 different samples in Fig. 3 (c)) and differ between domains (i.e., 4 different domains in Fig. 3 (b)) owing to our learnable frequency filter. It is the learnability of the frequency filtering kernel that allows the network to adjust adaptively to the domain characteristics of the input, thus enhancing the overall generalization performance of our MLP-like model.

Figure 3: Fourier analysis of our FAMLP on the PACS dataset. (a) The amplitudes before and after the learnable frequency filter; (b) the amplitudes of different samples on the sketch domain; (c) the difference between the amplitudes before and after the learnable frequency filter on different domains; (d) the difference between the amplitudes before and after the different components (i.e., the Learnable Frequency Filter (LFF) and the Low-Rank Enhancement module (LRE)).

Effectiveness of Low-Rank Enhancement Module. To further demonstrate the specific role of the low-rank enhancement module, we decompose the visualization results for different layers. As shown in Fig. 3 (d), the learnable frequency filter alone tends to oversuppress high frequencies, so that even some important low-frequency information is lost. To ensure the integrity of the features, the low-rank enhancement module is introduced to augment the low-frequency components. The resulting adaptive Fourier filter layer significantly facilitates the preservation of low-frequency information, thus ensuring the discriminability of the extracted features.

Effectiveness of Hyper-Parameters. In this subsection, we conduct a series of analysis studies to show how the average accuracy varies as a function of the hyperparameters. The base values of the weighting factor, the temperature and the momentum weight are set to {2, 10, 0.9995}. We vary the value of each hyper-parameter and keep the remaining ones fixed. As shown in Figure 4, the performance does change with the parameters. However, the margin of change is relatively small, which means that our method is insensitive to the hyperparameters.

Figure 4: Ablation studies of the hyper-parameters on the PACS dataset. Comparison of average performance when varying the momentum weight, the weighting factor and the temperature.

5 Conclusion

In this paper, a novel frequency-aware MLP architecture (FAMLP) is presented for the domain generalization task. An adaptive Fourier filter layer is specially designed to be embedded before each MLP layer, augmenting the domain-invariant feature descriptor for label prediction. Specifically, a learnable frequency filter is first utilized to adaptively filter out the high-frequency components by considering both the real and imaginary parts of the transformed frequency features. Then, a low-rank enhancement module is further proposed to rectify the filtered features by fusing the low-frequency components from SVD decomposition. In addition, a momentum update strategy is proposed to stabilize the optimization against parameter and input fluctuations by output distillation with the weighted historical model. Experimental results show that our architecture is superior in both performance and adaptability to the state-of-the-art methods, especially in the smaller model.


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.2.
  • [2] Y. Balaji, S. Sankaranarayanan, and R. Chellappa (2018) Metareg: towards domain generalization using meta-regularization. NeurIPS 31, pp. 998–1008. Cited by: §1, §2, Table 1.
  • [3] F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tommasi (2019) Domain generalization by solving jigsaw puzzles. In CVPR, pp. 2229–2238. Cited by: §1, §2, Table 1, Table 2, Table 3.
  • [4] J. Cha, S. Chun, K. Lee, H. Cho, S. Park, Y. Lee, and S. Park (2021) SWAD: domain generalization by seeking flat minima. arXiv. Cited by: Table 1, Table 2.
  • [5] S. Chen, E. Xie, C. Ge, D. Liang, and P. Luo (2021) Cyclemlp: a mlp-like architecture for dense prediction. arXiv preprint arXiv:2107.10224. Cited by: §2.
  • [6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.
  • [7] Q. Dou, D. Coelho de Castro, K. Kamnitsas, and B. Glocker (2019) Domain generalization via model-agnostic learning of semantic features. NeurIPS 32, pp. 6450–6461. Cited by: Table 1.
  • [8] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov (2016)

    Ultimate tensorization: compressing convolutional and fc layers alike

    arXiv preprint arXiv:1611.03214. Cited by: §2.
  • [9] Z. Geng, M. Guo, H. Chen, X. Li, K. Wei, and Z. Lin (2021) Is attention better than matrix decomposition?. arXiv preprint arXiv:2109.04553. Cited by: §2.
  • [10] D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §3.2.
  • [11] Q. Hou, Z. Jiang, L. Yuan, M. Cheng, S. Yan, and J. Feng (2022) Vision permutator: a permutable mlp-like architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2, Table 5.
  • [12] Z. Huang, H. Wang, E. P. Xing, and D. Huang (2020) Self-challenging improves cross-domain generalization. In ECCV, pp. 124–140. Cited by: Table 1, Table 2.
  • [13] S. Jeon, K. Hong, P. Lee, J. Lee, and H. Byun (2021) Feature stylization and domain-aware contrastive learning for domain generalization. In ACM MM, pp. 22–31. Cited by: Table 1, Table 2.
  • [14] M. Kanakis, D. Bruggemann, S. Saha, S. Georgoulis, A. Obukhov, and L. V. Gool (2020) Reparameterizing convolutions for incremental multi-task learning without task interference. In

    European Conference on Computer Vision

    pp. 689–707. Cited by: §2.
  • [15] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §2.
  • [16] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017) Deeper, broader and artier domain generalization. In ICCV, pp. 5542–5550. Cited by: §2, §4.1.
  • [17] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In AAAI, Cited by: §1, §2.
  • [18] D. Li, J. Zhang, Y. Yang, C. Liu, Y. Song, and T. M. Hospedales (2019) Episodic training for domain generalization. In CVPR, pp. 1446–1455. Cited by: Table 1.
  • [19] H. Li, S. J. Pan, S. Wang, and A. C. Kot (2018) Domain generalization with adversarial feature learning. In CVPR, pp. 5400–5409. Cited by: Table 2, Table 3.
  • [20] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao (2018) Deep domain generalization via conditional invariant adversarial networks. In ECCV, pp. 624–639. Cited by: §1, §2.
  • [21] Y. Li, Y. Chen, X. Dai, M. Liu, D. Chen, Y. Yu, L. Yuan, Z. Liu, M. Chen, and N. Vasconcelos (2021) Revisiting dynamic convolution via matrix decomposition. arXiv preprint arXiv:2103.08756. Cited by: §2.
  • [22] H. Liu, Z. Dai, D. So, and Q. Le (2021) Pay attention to mlps. Advances in Neural Information Processing Systems 34. Cited by: §2, Table 5.
  • [23] T. Matsuura and T. Harada (2020) Domain generalization using a mixture of multiple latent domains. In AAAI, Vol. 34, pp. 11749–11756. Cited by: Table 1.
  • [24] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto (2017) Unified deep supervised domain adaptation and generalization. In ICCV, pp. 5715–5725. Cited by: §2, Table 2, Table 3.
  • [25] A. T. Nguyen, T. Tran, Y. Gal, and A. G. Baydin (2021) Domain invariant representation learning with domain density transformations. arXiv. Cited by: Table 1.
  • [26] H. J. Nussbaumer (1981) The fast fourier transform. In Fast Fourier Transform and Convolution Algorithms, pp. 80–111. Cited by: §3.3.
  • [27] N. Park and S. Kim (2022) How do vision transformers work?. arXiv preprint arXiv:2202.06709. Cited by: §2.
  • [28] V. Piratla, P. Netrapalli, and S. Sarawagi (2020) Efficient domain generalization via common-specific low-rank decomposition. In ICML, pp. 7728–7738. Cited by: Table 1.
  • [29] A. Rame, C. Dancette, and M. Cord (2021)

    Fishr: invariant gradient variances for out-of-distribution generalization

    arXiv. Cited by: Table 2.
  • [30] Y. Rao, W. Zhao, Z. Zhu, J. Lu, and J. Zhou (2021) Global filter networks for image classification. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • [31] A. Robey, G. J. Pappas, and H. Hassani (2021) Model-based domain generalization. arXiv. Cited by: Table 1.
  • [32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §4.1.
  • [33] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran (2013) Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6655–6659. Cited by: §2.
  • [34] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi (2018) Generalizing across domains via cross-gradient training. arXiv. Cited by: Table 2, Table 3.
  • [35] B. Shi, D. Zhang, Q. Dai, Z. Zhu, Y. Mu, and J. Wang (2020) Informative dropout for robust representation learning: a shape-bias perspective. In ICML, pp. 8828–8839. Cited by: Table 1.
  • [36] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, et al. (2021) Mlp-mixer: an all-mlp architecture for vision. Advances in Neural Information Processing Systems 34. Cited by: §2, §3.2, Table 5.
  • [37] H. Touvron, P. Bojanowski, M. Caron, M. Cord, A. El-Nouby, E. Grave, G. Izacard, A. Joulin, G. Synnaeve, J. Verbeek, et al. (2021) Resmlp: feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404. Cited by: §2, Table 5.
  • [38] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In CVPR, pp. 5018–5027. Cited by: §4.1.
  • [39] R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, and S. Savarese (2018) Generalizing to unseen domains via adversarial data augmentation. arXiv. Cited by: §2.
  • [40] S. Wang, L. Yu, C. Li, C. Fu, and P. Heng (2020) Learning from extrinsic and intrinsic supervisions for domain generalization. In ECCV, pp. 159–176. Cited by: Table 1.
  • [41] W. Wang, Y. Sun, B. Eriksson, W. Wang, and V. Aggarwal (2018) Wide compression: tensor ring nets. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 9329–9338. Cited by: §2.
  • [42] Q. Xu, R. Zhang, Y. Zhang, Y. Wang, and Q. Tian (2021) A fourier-based framework for domain generalization. In CVPR, Cited by: §1, §2, §3.3, §3.4, Table 1, Table 2, Table 3, §4.1, §4.2.
  • [43] F. Yang, Y. Cheng, Z. Shiau, and Y. F. Wang (2021) Adversarial teacher-student representation learning for domain generalization. NeurIPS 34. Cited by: Table 1, Table 2.
  • [44] Y. Yang and S. Soatto (2020) Fda: fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4085–4095. Cited by: §3.3.
  • [45] D. J. Zhang, K. Li, Y. Chen, Y. Wang, S. Chandra, Y. Qiao, L. Liu, and M. Z. Shou (2021) MorphMLP: a self-attention free, mlp-like backbone for image and video. arXiv preprint arXiv:2111.12527. Cited by: §2.
  • [46] K. Zhou, Y. Yang, T. Hospedales, and T. Xiang (2020) Deep domain-adversarial image generation for domain generalisation. In AAAI, Vol. 34, pp. 13025–13032. Cited by: Table 1, Table 2, Table 3, §4.1.
  • [47] K. Zhou, Y. Yang, T. Hospedales, and T. Xiang (2020) Learning to generate novel domains for domain generalization. In ECCV, pp. 561–578. Cited by: Table 1, Table 2, Table 3.