1 Introduction
Convolutional neural networks (CNNs) have brought huge performance breakthroughs to various vision tasks and have long dominated the corresponding backbone networks (e.g., VGG, ResNet). Recently, transformers with the self-attention mechanism replace the local relation learning of CNNs with long-range modeling, raising the performance upper bound of deep networks. More recently, some MLP-like works further replace self-attention operations with only fully connected and skip-connected layers, achieving a better trade-off between network capability and efficiency.
Although MLP-like models show promising results in large-scale homogeneous recognition tasks (e.g., ImageNet classification), their transfer performance to various heterogeneous tasks is lower than that of CNNs and transformers with the same number of parameters. To bridge this gap, this paper explores how MLP-like models trained on a collection of multiple training sources can better generalize to unknown heterogeneous data domains, which is also known as the domain generalization (DG) problem.
Existing DG works are mainly built upon CNNs [17, 2, 42, 3] to learn domain-invariant representations after conditioning on the class label from known multi-source domains. They introduce adversarial training [20], meta-learning [2], self-supervised learning [42] or domain augmentation techniques [42] and have shown promising results. Orthogonally, some recent works extract generalized CNN features by augmenting the frequency domain, finding that manipulation of the amplitude components directly affects the domain information.

Motivated by this, we analyze the SOTA MLP-like models in the frequency domain, revealing the following challenges. First, as shown in Fig. 1 (a), we calculate the degree of filtering for different frequency components before/after an MLP layer, which illustrates that the MLP layer cannot suppress the high-frequency components of the input features. When no extra frequency operations (e.g., the Fourier filter in Fig. 1 (b)) are applied, most high-frequency information is retained after the pure MLP layer, making it hard to resist the interference of heterogeneous data in domain generalization. Second, the frequency response is domain-specific. It can be seen in Fig. 1 (c) that the frequency responses are inconsistent between different domains, so it is impractical to set a fixed cut-off frequency. Meanwhile, the parameters of MLP-like models are data-independent, making it impossible to adjust the frequency response adaptively according to the input domain characteristics, as shown in Fig. 1 (d). This makes MLP-like architectures ill-suited for the DG problem, which covers class prediction under different frequency distributions.
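The per-band attenuation analysis of Fig. 1 can be reproduced in spirit with a short numpy sketch. This is our own illustration, not the authors' analysis code: it compares the FFT amplitude spectra of a feature map before and after a layer, averaged over radial frequency bands.

```python
import numpy as np

def radial_attenuation(feat_in, feat_out, n_bins=8):
    """Per-band ratio of output to input FFT amplitude, low to high frequency."""
    amp_in = np.abs(np.fft.fftshift(np.fft.fft2(feat_in)))
    amp_out = np.abs(np.fft.fftshift(np.fft.fft2(feat_out)))
    h, w = feat_in.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h // 2, xx - w // 2)          # distance from the DC bin
    edges = np.linspace(0, r.max() + 1e-6, n_bins + 1)
    return np.array([
        amp_out[(r >= lo) & (r < hi)].sum() / (amp_in[(r >= lo) & (r < hi)].sum() + 1e-8)
        for lo, hi in zip(edges[:-1], edges[1:])
    ])

# Sanity check: an ideal low-pass operation keeps low bands (ratio near 1)
# and suppresses high bands (ratio near 0).
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
spec = np.fft.fftshift(np.fft.fft2(x))
yy, xx = np.mgrid[0:32, 0:32]
lowpass = np.hypot(yy - 16, xx - 16) < 8            # keep only low frequencies
y = np.real(np.fft.ifft2(np.fft.ifftshift(spec * lowpass)))
ratios = radial_attenuation(x, y)
assert ratios[0] > 0.9 and ratios[-1] < 0.1
```

An attenuation curve that stays near 1 at all bands, as for a pure MLP layer in Fig. 1 (a), means high-frequency (domain-specific) content passes through unsuppressed.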
To address this problem, we propose a frequency-aware MLP framework (FAMLP), explicitly promoting the extraction of domain-invariant frequency features. The core of the framework is the adaptive Fourier filter layer, which enhances the rectification of low-frequency features block by block, mitigating the interference of domain shifts. Specifically, we first utilize the fast Fourier transform to map the features to the frequency domain before each MLP layer. Then the domain-specific features are filtered out by a learnable frequency filter, which corresponds to the real and imaginary parts of the frequency features. To ensure the integrity of important features, the filtered features are further strengthened by fusing the low-frequency components from SVD decomposition. Finally, the domain-invariant features are mapped back to the spatial domain through the inverse Fourier transform for the subsequent MLP layer. Furthermore, to improve the overall generalization of the model from an optimization perspective, we propose a momentum update strategy, distilling the invariant features from an updated teacher model. We compute the teacher model from the weighted historical states of our FAMLP model, guaranteeing consistency of output under minor network changes. Input images obtained by data augmentation are fed into the teacher network to guide the optimization process towards robustness to different domain shifts.
The main contributions of this paper are threefold:

We propose a frequency-aware MLP framework (FAMLP) for the domain generalization task, in which the low-frequency features are adaptively enhanced by a learnable frequency kernel, resulting in a domain-invariant representation.

We propose a momentum update strategy for the FAMLP model, in which the historical states are weighted as an updated teacher model to constrain the consistency of features.

We propose a strong baseline that exploits the MLP-like model for DG tasks for the first time, achieving state-of-the-art performance on three benchmarks including PACS, Office-Home and Digits-DG.
2 Related Work
Domain Generalization. Domain generalization (DG) aims to generalize a model to unseen domains given multiple disjoint domains during training. Many approaches focus on extracting domain-invariant features and aligning the distributions of different domains to address the DG problem. For example, [20] proposes a conditional invariant adversarial network to guarantee the domain-invariant property, and a Siamese network is introduced in [24] to learn a discriminative embedding space. Later, some meta-learning-based [17, 2] methods are proposed to introduce a type of regularization into domain generalization. This type of method synthesizes virtual testing domains to simulate train/test domain shift within each mini-batch. Data augmentation is also a popular idea to address this problem: adversarial-based [39] and Fourier-based [42] examples are generated to improve the generalization of the models. There are also other methods employing low-rank decomposition [16] or a self-supervised jigsaw task [3] to train the models. Convolutional neural networks dominate the task among all of the above methods, whereas we investigate the performance of the MLP-like model for DG in this paper.
MLP-Like Backbones. Recently, some works [36, 37, 22, 30, 11] try to replace the self-attention layer with fully connected layers for a better trade-off between performance and efficiency on large-scale datasets. MLP-Mixer [36] first proposes a technically simple architecture based solely on multi-layer perceptrons, which mixes the per-location and spatial features. The experimental results show that MLP-like models are as good as existing SOTA methods including CNNs and transformers [6, 27]. Following this work, gMLP [22] enhances spatial interaction with multiplicative gating. ResMLP [37] replaces batch or channel normalization with a simple affine transformation for a better trade-off. ViP [11] separately encodes the feature representations along the height and width dimensions for precise positional information. Furthermore, MLP-like models have also been explored in other vision tasks such as dense prediction [5] and video recognition [45]. Orthogonally, this paper explores the transfer capability and optimization strategies of MLP-like models, especially in domain generalization. We believe this is a must for the MLP-like model to act as a universal backbone. To the best of our knowledge, FAMLP is the first MLP-like method designed for the domain generalization task.
Matrix Decomposition. Matrix decomposition has been widely adopted in deep networks for different purposes. Most researchers focus on network compression by factorizing low-rank components, including the softmax layer [33], the convolution layer [8, 41] and the embedding layer [15]. Recently, some researchers also exploit certain properties of the decomposed signals for different tasks. [14] decomposes each convolution into a shared part for subsequent incremental tasks. [9] factorizes the representation to recover a clean signal subspace as the global context, modeling long-range dependencies. [21] revisits dynamic convolution via matrix decomposition, mitigating the joint optimization difficulty. In contrast, this paper explores the low-rank components of frequency features to augment domain-invariant information.
3 Frequency-Aware MLP
We detail the frequency-aware MLP architecture and its important components in this section. First, we present the problem setting and the standard MLP-like model, which is adopted as our baseline. Then the two proposed core components, the adaptive Fourier filter layer and the momentum update strategy, are introduced. Finally, we analyze the optimization flow of the overall pipeline.
3.1 Problem Description
Given multiple source domains $\{\mathcal{D}_k\}_{k=1}^{K}$ with labelled samples $\{(x_j^k, y_j^k)\}_{j=1}^{N_k}$ in the $k$-th domain, where $N_k$ denotes the number of sampled data, the goal of DG methods is to utilize these data to train a model that performs well on the unseen target domain. While most existing DG works are built upon convolutional neural networks to learn domain-invariant representations after conditioning on the class label from known multi-source domains, this work instead fully explores the pure MLP architecture to comprehensively investigate the performance of MLP-like models on the domain generalization task.
3.2 Standard MLP-Like Model
Following the architecture of MLP-Mixer [36], the standard MLP-like model consists of a per-patch transform layer, $N$ MLP layers and a classification head. Specifically, the input image $X$ is first split into a grid of $S \times S$ non-overlapping patches $\{x_p\}_{p=1}^{P}$ ($P = HW/S^2$), where $H$ and $W$ represent the initial spatial size. Then, each patch is independently projected to the embedding space by a linear layer $f_{emb}$,

$z_p^0 = f_{emb}(x_p), \quad p = 1, \dots, P. \qquad (1)$
The resulting latent features $Z_0 = [z_1^0; \dots; z_P^0]$ are fed to a sequence of MLP layers $\{f_{mlp}^i\}_{i=1}^{N}$, which fuse the per-patch and per-channel information in turn,

$Z_i = f_{mlp}^i(Z_{i-1}), \quad i = 1, \dots, N, \qquad (2)$

where $N$ represents the number of MLP layers and $i$ indexes the $i$-th MLP layer in the sequence. Finally, the output features are averaged into a $d$-dimensional vector, which is fed to a linear classifier $f_{cls}$ for the predicted label,

$\hat{Y} = f_{cls}\Big(\frac{1}{P}\sum_{p=1}^{P} z_p^N\Big). \qquad (3)$
MLP Layer. To facilitate feature interaction during the optimization process, each MLP layer contains two MLP blocks along different dimensions. The input features $Z$ are first projected along the patch dimension (i.e., $P$) in the first block. To reduce the difficulty of optimization, the initial input is added through a skip connection. Similarly, the intermediate features are then projected along the channel dimension (i.e., $d$) in the second block. Each MLP block consists of two fully connected layers and an element-wise non-linearity (GELU [10]) $\sigma$:

$U = Z + W_2\,\sigma\big(W_1\,\mathrm{LayerNorm}(Z)\big), \qquad Z' = U + \sigma\big(\mathrm{LayerNorm}(U)\,W_3\big)\,W_4, \qquad (4)$

where LayerNorm represents layer normalization [1].
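As a concrete reference, the two mixing blocks of Equation 4 can be sketched in numpy. This is a minimal single-layer illustration with randomly initialized weights (the weight names and hidden width are our own choices, not the trained model):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm [1] over the channel (last) dimension.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU [10].
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_layer(z, params):
    """One MLP layer on z of shape (P, d): token mixing, then channel mixing."""
    # Block 1: project along the patch dimension P (with a skip connection).
    u = z + params["W2"] @ gelu(params["W1"] @ layer_norm(z))
    # Block 2: project along the channel dimension d (with a skip connection).
    return u + gelu(layer_norm(u) @ params["W3"]) @ params["W4"]

P, d, h = 16, 8, 32
rng = np.random.default_rng(0)
params = {"W1": 0.02 * rng.standard_normal((h, P)),
          "W2": 0.02 * rng.standard_normal((P, h)),
          "W3": 0.02 * rng.standard_normal((d, h)),
          "W4": 0.02 * rng.standard_normal((h, d))}
z = rng.standard_normal((P, d))
out = mlp_layer(z, params)
assert out.shape == (P, d)   # shape is preserved through both blocks
```

Note that the first block mixes information across patches (every token sees every other token), which is exactly the global interaction that, as analyzed above, lets domain-specific high-frequency content pass through.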
3.3 Adaptive Fourier Filter Layer
As the receptive field of a fully connected layer spans a long range and covers global interactions, the extracted features contain extensive domain information, which is reflected in the high-frequency component. To eliminate its negative effect on DG, we add an adaptive Fourier filter layer before each MLP layer. The input features are first fed to the adaptive Fourier filter layer $f_{aff}$, eliminating the high-frequency interference,

$\tilde{Z} = f_{aff}(Z). \qquad (5)$

In this case, Equation 4 can be rewritten as:

$U = \tilde{Z} + W_2\,\sigma\big(W_1\,\mathrm{LayerNorm}(\tilde{Z})\big), \qquad Z' = U + \sigma\big(\mathrm{LayerNorm}(U)\,W_3\big)\,W_4. \qquad (6)$
Learnable Frequency Filter. To explicitly filter out the high-frequency interference in the latent features, we directly transform the spatial feature into the frequency domain. For an input embedding $z$ on the $H' \times W'$ patch grid, its Fourier transformation can be formulated as:

$\mathcal{F}(z)(u,v) = \sum_{h=0}^{H'-1}\sum_{w=0}^{W'-1} z(h,w)\, e^{-j2\pi\left(\frac{uh}{H'} + \frac{vw}{W'}\right)} = \mathcal{R}(z)(u,v) + j\,\mathcal{I}(z)(u,v), \qquad (7)$

$\mathcal{A}(z) = \sqrt{\mathcal{R}(z)^2 + \mathcal{I}(z)^2}, \qquad \mathcal{P}(z) = \arctan\frac{\mathcal{I}(z)}{\mathcal{R}(z)}. \qquad (8)$

$\mathcal{R}(z)$ and $\mathcal{I}(z)$ represent the real and imaginary parts of $\mathcal{F}(z)$, and $\mathcal{A}(z)$ and $\mathcal{P}(z)$ represent the amplitude and phase components in the frequency domain. Existing works [42, 44] have proven that the amplitude component is highly related to the domain information, and it is influenced by both the real and imaginary parts. To adaptively refine the domain-invariant features, we maintain a learnable frequency filter $K$, which is the same size as $\mathcal{R}(z)$ and $\mathcal{I}(z)$. Different from small-size filters (e.g., 3×3) in the spatial domain, the frequency filter covers all the sampled frequency values. The frequency features are directly element-wise multiplied by the filter, which is optimized to adjust the useful amplitude distribution from multiple domains,
$\hat{\mathcal{F}}(z) = K \odot \mathcal{F}(z). \qquad (9)$

The filtered features are transformed back to the spatial domain through the inverse Fourier transformation for the subsequent operation,

$\hat{z} = \mathcal{F}^{-1}\big(\hat{\mathcal{F}}(z)\big). \qquad (10)$

Both the Fourier transformation and its inverse can be implemented by the FFT algorithm [26].
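A minimal numpy sketch of the filter's forward pass follows. The filter shape and the per-channel broadcasting are our assumptions; in the actual layer the filter is a learned parameter updated by backpropagation, whereas here it is supplied by hand.

```python
import numpy as np

def fourier_filter(z, K):
    """Filter features on the patch grid in the frequency domain.
    z: real features of shape (H', W', d); K: complex filter of shape (H', W'),
    whose real and imaginary parts play the role of the learned weights."""
    freq = np.fft.fft2(z, axes=(0, 1))               # forward FFT over the grid
    freq = freq * K[..., None]                       # element-wise filtering
    return np.real(np.fft.ifft2(freq, axes=(0, 1)))  # back to the spatial domain

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 8, 4))
identity = np.ones((8, 8), dtype=complex)            # all-pass filter
out = fourier_filter(z, identity)
assert np.allclose(out, z)                           # all-pass recovers the input
```

Because the filter has one weight per frequency bin, suppressing a given frequency band only requires shrinking the corresponding entries of K, which gradient descent can do independently per bin.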
Low-Rank Enhancement Module. To further enhance the maintenance of domain-invariant features, we extract low-frequency components from the perspective of matrix decomposition. An input embedding $Z$ can be seen as a static kernel $\bar{Z}$ plus some noise information $E$, and the latter is sensitive to variations such as domain shifts,

$Z = \bar{Z} + E = DC + E, \qquad (11)$

where $D$ and $C$ represent the decomposed matrices, respectively. We utilize the SVD decomposition to obtain the low-rank component of the frequency feature,

$\min_{D,C}\; \mathcal{L}\big(Z, DC\big) + \mathcal{R}_1(D) + \mathcal{R}_2(C), \qquad (12)$
where $\mathcal{L}$ represents the reconstruction loss, and $\mathcal{R}_1$ and $\mathcal{R}_2$ are the regularization terms. Note that the whole process is parameter-free, so we denote it as $\mathcal{M}(\cdot)$, distinguished from the learnable operator $K$. To reduce the complexity of $\mathcal{M}$, we utilize two linear transformation layers (i.e., $W_{in}$ and $W_{out}$) to map the features to different embedding spaces. Finally, the compact features are added to the filtered features for further augmentation,

$z_{lr} = W_{out}\,\mathcal{M}(W_{in}\,z), \qquad (13)$

$\tilde{z} = \hat{z} + z_{lr}. \qquad (14)$
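The non-parametric core of the module amounts to a truncated SVD, which can be sketched in a few lines of numpy (the rank and matrix sizes are illustrative only, not the paper's configuration):

```python
import numpy as np

def low_rank(z, rank):
    """Best rank-`rank` approximation of z via truncated SVD (non-parametric)."""
    U, s, Vt = np.linalg.svd(z, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

# A rank-1 "static kernel" plus noise, as in Eq. (11): truncation recovers
# the clean low-rank part much better than the noisy observation itself.
rng = np.random.default_rng(0)
clean = np.outer(rng.standard_normal(32), rng.standard_normal(16))
noisy = clean + 0.01 * rng.standard_normal((32, 16))
recon = low_rank(noisy, rank=1)
assert np.linalg.norm(recon - clean) < np.linalg.norm(noisy - clean)
```

The truncation discards the small singular values carrying the noise term E, so what is re-added to the filtered features is dominated by the stable, domain-invariant structure.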
3.4 Momentum Update Strategy
To enhance the generalization of the MLP from the perspective of overall optimization, we apply a momentum update strategy to the standard fully supervised paradigm. Here we denote $\theta_t^S$ and $\theta_t^T$ as all the optimized parameters of the student and teacher models at time step $t$. We update the teacher model based on the historical state of the teacher model and the current state of the student model for distillation. That is,

$\theta_t^T = m\,\theta_{t-1}^T + (1 - m)\,\theta_t^S, \qquad (15)$
where $m$ represents the momentum weight. Since the teacher model can be seen as a weighted summation of past student models, their outputs should be similar, so we constrain the optimized model to be consistent with the teacher one. To further improve generalization, we adopt data augmentation for the input of the teacher model. As shown in the experimental part, these augmentation strategies are also beneficial for the DG problem,

$X' = \mathrm{DataAug}(X), \qquad (16)$

$\mathcal{L}_{dis} = \mathrm{KL}\Big(\sigma\big(f^T(X')/\tau\big)\,\Big\|\,\sigma\big(f^S(X)/\tau\big)\Big), \qquad (17)$

where DataAug represents the Fourier-based data augmentation [42] together with the standard augmentation protocols, KL represents the Kullback-Leibler divergence, $f^T$ and $f^S$ denote the teacher and student networks with their classification heads, $\tau$ is the temperature and $\sigma$ refers to the softmax operation.

3.5 Optimization
Combining all the above loss functions, we obtain the full objective for an input image-target pair $(X, Y)$:

$\mathcal{L} = \mathcal{L}_{ce}(\hat{Y}, Y) + \lambda\,\mathcal{L}_{dis}, \qquad (18)$

where $\mathcal{L}_{ce}$ represents the cross-entropy loss, and $\lambda$ controls the trade-off between the classification and the distillation loss.
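The momentum update and the distillation term can be sketched together in numpy. This is a schematic with toy parameter dicts and logits; the exact loss composition follows our reading of the text, and the function and variable names are our own.

```python
import numpy as np

def update_teacher(teacher, student, m=0.9995):
    """Momentum (EMA) update: theta_T <- m * theta_T + (1 - m) * theta_S."""
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_distill(student_logits, teacher_logits, tau=10.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p = softmax(teacher_logits / tau)
    q = softmax(student_logits / tau)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# One EMA step with m = 0.9 for readability (the paper uses m = 0.9995).
teacher = {"w": np.zeros(4)}
student = {"w": np.ones(4)}
teacher = update_teacher(teacher, student, m=0.9)
assert np.allclose(teacher["w"], 0.1)

# Identical teacher/student outputs give zero distillation loss.
logits = np.array([[2.0, 1.0, 0.5]])
assert abs(kl_distill(logits, logits)) < 1e-12
```

Because the teacher is an exponential moving average of student states, its outputs drift slowly, which is what makes the consistency constraint a stabilizer rather than a moving target.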
Table 1: Classification accuracy (%) on PACS.

Methods  Art  Cartoon  Photo  Sketch  Avg.

ResNet-18
DeepAll  77.63  76.77  95.85  69.50  79.94
MetaReg [2]  83.70  77.20  95.50  70.30  81.70
JiGen [3]  79.42  75.25  96.03  71.35  80.51
Epi-FCR [18]  82.10  77.00  93.90  73.00  81.50
MMLD [23]  81.28  77.16  96.09  72.29  81.83 
DDAIG [46]  84.20  78.10  95.30  74.70  83.10 
CSD [28]  78.90  75.80  94.10  76.70  81.40 
InfoDrop [35]  80.27  76.54  96.11  76.38  82.33 
MASF [7]  80.29  77.17  94.99  71.69  81.04 
L2A-OT [47]  83.30  78.20  96.20  73.60  82.80
EISNet [40]  81.89  76.44  95.93  74.33  82.15 
RSC [12]  83.43  80.31  95.99  80.85  85.15 
FACT [42]  85.37  78.38  95.15  79.15  84.51 
ATSRL [43]  85.80  80.70  97.30  77.30  85.30 
DIRT-GAN [25]  82.56  76.37  95.65  79.89  83.62
FSDCL [13]  85.30  81.31  95.63  81.19  85.86 
MLP-S
Our FAMLP-S  92.06  82.49  98.10  84.09  89.19
ResNet-50
DeepAll  84.94  76.98  97.64  76.75  84.08 
MetaReg [2]  87.20  79.20  97.60  70.30  83.60 
MASF [7]  82.89  80.49  95.01  72.29  82.67 
EISNet [40]  86.64  81.53  97.11  78.07  85.84 
RSC [12]  87.89  82.16  97.92  83.35  87.83 
FACT [42]  89.63  81.77  96.75  84.46  88.15 
ATSRL [43]  90.00  83.50  98.90  80.00  88.10 
MBDG [31]  80.60  79.30  97.00  85.20  85.60 
FSDCL [13]  88.48  83.83  96.59  82.92  87.96 
SWAD [4]  89.30  83.40  97.30  82.50  88.10 
MLP-B
Our FAMLP-B  92.63  87.03  98.14  82.69  90.12
Table 2: Classification accuracy (%) on Office-Home.

Methods  Art  Clipart  Product  Real  Avg.

ResNet-18
DeepAll  57.88  52.72  73.50  74.80  64.72 
CCSA [24]  59.90  49.90  74.10  75.70  64.90 
MMD [19]  56.50  47.30  72.10  74.80  62.70 
CG [34]  58.40  49.40  73.90  75.80  64.40 
DDAIG [46]  59.20  52.30  74.60  76.00  65.50 
L2A-OT [47]  60.60  50.10  74.80  77.00  65.60
Jigen [3]  53.04  47.51  71.47  72.79  61.20 
RSC [12]  58.42  47.90  71.63  74.54  63.12 
FACT [42]  60.34  54.85  74.48  76.55  66.56 
ATSRL [43]  60.70  52.90  75.80  77.20  66.70 
FSDCL [13]  60.24  53.54  74.36  76.66  66.20 
MLP-S
Our FAMLP-S  69.34  62.61  79.82  82.00  73.44
ResNet-50
Fishr [29]  63.40  54.20  76.40  78.50  68.20 
SWAD [4]  66.10  57.70  78.40  80.20  70.60 
ATSRL [43]  69.30  60.10  81.50  82.10  73.30 
MLP-B
Our FAMLP-B  70.53  64.63  81.32  82.79  74.82
Table 3: Classification accuracy (%) on Digits-DG.

Methods  MNIST  MNIST-M  SVHN  SYN  Avg.

DeepAll [46]  95.8  58.8  61.7  78.6  73.7 
CCSA [24]  95.2  58.2  65.5  79.1  74.5 
MMD-AAE [19]  96.5  58.4  65.0  78.4  74.6
CrossGrad [34]  96.7  61.1  65.3  80.2  75.8 
DDAIG [46]  96.6  64.1  68.6  81.0  77.6 
Jigen [3]  96.5  61.4  63.7  74.0  73.9 
L2A-OT [47]  96.7  63.9  68.6  83.2  78.1
FACT [42]  97.9  65.6  72.4  90.3  81.5 
Our FAMLP-S  98.0  83.3  84.1  96.9  90.6
4 Experiments
In this section, we demonstrate the superiority of our method on three conventional DG benchmarks and conduct several ablation studies to show the effectiveness of each component.
4.1 Setup
Datasets. We conduct the experiments on three benchmark datasets: (1) PACS [16] consists of four domains, i.e., Art Painting, Cartoon, Photo and Sketch. It contains 9991 images of 7 classes in total. (2) Office-Home [38] is also composed of four domains, i.e., Art, Clipart, Product and Real World, with 15500 images of 65 classes. The model is trained on three domains and tested on the remaining one during experiments. (3) Digits-DG [46] is a digit recognition benchmark consisting of four classical datasets: MNIST, MNIST-M, SVHN and SYN. The four datasets mainly differ in font style, background and image quality. We use the original train-validation split in [46] with 600 images per class per dataset.
Implementation Details. The backbone is detailed in Section 3.2 and is pretrained on ImageNet [32] with the same input patch size in all of our experiments. For a fair comparison, we adjust the depth and width of FAMLP to ensure comparable model capacity with different CNNs. Specifically, we scale the depth to obtain two variants (i.e., MLP-S and MLP-B), corresponding to ResNet-18 and ResNet-50, respectively. The network is trained for 50 epochs with a batch size of 16 and a weight decay of 5e-4. We use SGD as the optimizer and set the initial learning rate to 0.001, which is decayed by 0.1 at 40 epochs. The Fourier-based data augmentation [42] together with the standard augmentation protocol, i.e., random resized cropping, horizontal flipping and color jittering, is applied in our experiments. The momentum m for the teacher model is set to 0.9995; the value of the temperature is 10 and a further coefficient is set to 1.5. The first weight parameter is set to 2 for PACS/Digits-DG and 200 for Office-Home, and the second one is set to 1 for both datasets. We also use a sigmoid ramp-up [42] for the two weights with a length of 5 epochs. The strength of the Fourier-based data augmentation is chosen as 1.0 for PACS/Digits-DG and 0.2 for Office-Home.

4.2 Comparison with State-of-the-Art Methods
Domain Generalization. To better assess the overall performance of our scheme, we compare it with the SOTA methods of domain generalization. As shown in Tables 1, 2 and 3, our method achieves average improvements of 3%, 4% and 9% in accuracy on the PACS, Office-Home and Digits-DG datasets, respectively. It is worth noting that our model (i.e., FAMLP-S) maintains good generalization even when the number of parameters decreases, achieving 3.33% and 7.24% improvements. On PACS, our method improves over FACT [42] with ResNet-50 as the backbone by 1.98% and achieves the best results on the art, cartoon and photo domains, though not on the sketch domain. A possible reason is that the content of sketches is simpler than that of other domains, where global interaction is not very beneficial. On the larger dataset Office-Home, FAMLP outperforms other ResNet-18- and ResNet-50-based methods by a large margin on all the held-out domains, which further illustrates the superiority of our method.
MLP-Like Architecture. To demonstrate the generalization performance of our FAMLP architecture, we compare it with the SOTA MLP-like models, including MLP-Mixer, gMLP, ResMLP and ViP. As shown in Table 5, our method achieves a one-point improvement with the smaller model and a six-point improvement with the larger one. This demonstrates the effectiveness of our scheme in helping MLP-like models resist the disturbances caused by domain shifts.
Table 4 (PACS): Ablation study. LFF (learnable frequency filter) and LRE (low-rank enhancement) are the two components of the adaptive Fourier filter (AFF) layer; MUS is the momentum update strategy.

Backbone  LFF  LRE  MUS  Art  Cartoon  Photo  Sketch  Avg.
ResNet-50  -  -  -  85.45  79.44  96.77  79.33  85.25
MLP-B  -  -  -  85.00  77.86  94.43  65.72  80.75
ResNet-50  ✓  -  -  86.28  82.77  96.71  78.80  86.14
MLP-B  ✓  -  -  89.75  81.83  97.66  81.93  87.79
MLP-B  ✓  ✓  -  93.36  85.24  98.62  82.03  89.81
MLP-B  ✓  -  ✓  90.45  82.96  98.41  82.49  88.58
MLP-B  ✓  ✓  ✓  92.63  87.03  98.14  82.69  90.12
Table 4 (Office-Home): Ablation study with the same components.

Backbone  LFF  LRE  MUS  Art  Clipart  Product  Real  Avg.
ResNet-50  -  -  -  64.77  60.02  78.80  78.82  70.60
MLP-B  -  -  -  63.45  56.31  77.81  79.76  69.33
ResNet-50  ✓  -  -  66.63  57.78  80.15  80.81  71.34
MLP-B  ✓  -  -  68.31  63.00  81.60  82.65  73.89
MLP-B  ✓  ✓  -  69.39  64.16  81.50  82.95  74.50
MLP-B  ✓  -  ✓  68.81  64.63  81.08  81.23  73.93
MLP-B  ✓  ✓  ✓  70.53  64.63  81.32  82.79  74.82
Table 5 (PACS): Comparison with MLP-like backbones. Para.: number of parameters (M).

Method  Para.  Art  Cartoon  Photo  Sketch  Avg.
gMLP-S [22]  20  86.72  80.80  97.54  72.13  84.23
ViP-S [11]  25  87.35  85.96  98.68  80.20  88.05
ResMLP-S [37]  40  85.50  78.63  97.07  72.64  83.46
MLP-B [36]  59  85.00  77.86  94.43  65.72  80.75
Our FAMLP-S  25  92.06  82.49  98.10  84.09  89.19
Our FAMLP-B  44  92.63  87.03  98.14  82.69  90.12
Table 5 (Office-Home): Comparison with MLP-like backbones. Para.: number of parameters (M).

Method  Para.  Art  Clipart  Product  Real  Avg.
gMLP-S [22]  20  64.81  58.33  75.78  79.3  69.56
ViP-S [11]  25  69.55  61.51  79.34  83.11  73.38
ResMLP-S [37]  40  62.42  51.94  75.40  77.21  66.74
MLP-B [36]  59  63.45  56.31  77.81  79.76  69.33
Our FAMLP-S  25  69.34  62.61  79.82  82.00  73.44
Our FAMLP-B  44  70.53  64.63  81.32  82.79  74.82
4.3 Ablation Study
We conduct ablation studies in Table 4 to show the effectiveness of each component in our FAMLP architecture. The performance of our scheme is mainly attributed to three prominent components: the LFF layer, the LRE module and the MUS. To clarify the function of the learnable frequency filter in the MLP-like model, we add the LFF layer to both the ResNet-50 and MLP-B models for comparison. It can be seen that the generalization performance of the ResNet is initially better than that of the MLP, but the MLP overtakes the ResNet after adding the LFF layer. As analyzed earlier, since the MLP-like model covers global interactions, it contains a large amount of domain information. Although the LFF layer brings gains to the CNNs as well, the MLP-like model benefits more from the frequency operation, which proves the effectiveness of frequency filtering for MLP generalization. We then add the LRE module and the MUS to the Fourier-based MLP-like model separately. These two components improve the baseline by 1.32% and 0.42% on average, respectively, which demonstrates their effectiveness. The model performs best with the combination of both components, which further shows that the two components act in different ways and can assist each other.
4.4 Analysis
Effectiveness of Learnable Frequency Filter. To better demonstrate the role of the learnable frequency filter during optimization, we show the visualization results in the frequency domain. As shown in Fig. 3 (a), the high-frequency components are obviously suppressed by our Fourier filter. Due to this property, the domain-specific features are largely filtered out, which improves the generalization of the optimized features. In Fig. 3 (b) and (c), the vertical coordinate represents the amplitude attenuation of different frequency components before and after adopting the frequency filter. It can be seen that the suppression characteristics are consistent within a domain (i.e., 8 different samples in Fig. 3 (c)) and differ between domains (i.e., 4 different domains in Fig. 3 (b)) owing to our learnable frequency filter. It is the learnability of the frequency filtering kernel that allows the network to adjust adaptively to the domain characteristics of the input, thus enhancing the overall generalization performance of our MLP-like model.
Effectiveness of Low-Rank Enhancement Module. To further demonstrate the specific role of the low-rank enhancement module, we decompose the visualization results for different layers. As shown in Fig. 3 (d), the learnable frequency filter alone tends to over-suppress high frequencies, so that even some important low-frequency information is lost. To ensure the integrity of the features, the low-rank enhancement module is introduced to augment the low-frequency components. The resulting adaptive Fourier filter layer significantly facilitates the preservation of low-frequency information, thus ensuring the discrimination of the extracted features.
Effectiveness of Hyper-Parameters. In this subsection, we conduct a series of analysis studies to show how the average accuracy varies as a function of the hyper-parameters. The base values of the loss weight, the temperature and the momentum are set to {2, 10, 0.9995}. We vary the value of each hyper-parameter while keeping the others fixed. As shown in Figure 4, the performance indeed changes with the parameters. However, the margin of change is relatively small, which means that our method is insensitive to the hyper-parameters.

5 Conclusion
In this paper, a novel frequency-aware MLP architecture (FAMLP) is presented for the domain generalization task. An adaptive Fourier filter layer is specially designed to be embedded before each MLP layer, augmenting the domain-invariant feature descriptor for label prediction. Specifically, a learnable frequency filter is first utilized to adaptively filter out the high-frequency components by considering both the real and imaginary parts of the transformed frequency features. Then, a low-rank enhancement module is further proposed to rectify the filtered features by fusing the low-frequency components from SVD decomposition. In addition, a momentum update strategy is proposed to stabilize the optimization against parameter and input fluctuations by output distillation with a weighted historical model. Experimental results show that our architecture is superior in both performance and adaptability to the state-of-the-art methods, especially in the smaller model.
References
 [1] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.2.
 [2] (2018) Metareg: towards domain generalization using metaregularization. NeurIPS 31, pp. 998–1008. Cited by: §1, §2, Table 1.
 [3] (2019) Domain generalization by solving jigsaw puzzles. In CVPR, pp. 2229–2238. Cited by: §1, §2, Table 1, Table 2, Table 3.
 [4] (2021) SWAD: domain generalization by seeking flat minima. arXiv. Cited by: Table 1, Table 2.
 [5] (2021) Cyclemlp: a mlplike architecture for dense prediction. arXiv preprint arXiv:2107.10224. Cited by: §2.
 [6] (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.
 [7] (2019) Domain generalization via modelagnostic learning of semantic features. NeurIPS 32, pp. 6450–6461. Cited by: Table 1.

[8]
(2016)
Ultimate tensorization: compressing convolutional and fc layers alike
. arXiv preprint arXiv:1611.03214. Cited by: §2.  [9] (2021) Is attention better than matrix decomposition?. arXiv preprint arXiv:2109.04553. Cited by: §2.
 [10] (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §3.2.
 [11] (2022) Vision permutator: a permutable mlplike architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2, Table 5.
 [12] (2020) Selfchallenging improves crossdomain generalization. In ECCV, pp. 124–140. Cited by: Table 1, Table 2.
 [13] (2021) Feature stylization and domainaware contrastive learning for domain generalization. In ACM MM, pp. 22–31. Cited by: Table 1, Table 2.

[14]
(2020)
Reparameterizing convolutions for incremental multitask learning without task interference.
In
European Conference on Computer Vision
, pp. 689–707. Cited by: §2.  [15] (2019) Albert: a lite bert for selfsupervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §2.
 [16] (2017) Deeper, broader and artier domain generalization. In ICCV, pp. 5542–5550. Cited by: §2, §4.1.
 [17] (2018) Learning to generalize: metalearning for domain generalization. In AAAI, Cited by: §1, §2.
 [18] (2019) Episodic training for domain generalization. In CVPR, pp. 1446–1455. Cited by: Table 1.
 [19] (2018) Domain generalization with adversarial feature learning. In CVPR, pp. 5400–5409. Cited by: Table 2, Table 3.
 [20] (2018) Deep domain generalization via conditional invariant adversarial networks. In ECCV, pp. 624–639. Cited by: §1, §2.
 [21] (2021) Revisiting dynamic convolution via matrix decomposition. arXiv preprint arXiv:2103.08756. Cited by: §2.
 [22] (2021) Pay attention to mlps. Advances in Neural Information Processing Systems 34. Cited by: §2, Table 5.
 [23] (2020) Domain generalization using a mixture of multiple latent domains. In AAAI, Vol. 34, pp. 11749–11756. Cited by: Table 1.
 [24] (2017) Unified deep supervised domain adaptation and generalization. In ICCV, pp. 5715–5725. Cited by: §2, Table 2, Table 3.
 [25] (2021) Domain invariant representation learning with domain density transformations. arXiv. Cited by: Table 1.
 [26] (1981) The fast fourier transform. In Fast Fourier Transform and Convolution Algorithms, pp. 80–111. Cited by: §3.3.
 [27] (2022) How do vision transformers work?. arXiv preprint arXiv:2202.06709. Cited by: §2.
 [28] (2020) Efficient domain generalization via commonspecific lowrank decomposition. In ICML, pp. 7728–7738. Cited by: Table 1.

[29]
(2021)
Fishr: invariant gradient variances for outofdistribution generalization
. arXiv. Cited by: Table 2.  [30] (2021) Global filter networks for image classification. Advances in Neural Information Processing Systems 34. Cited by: §2.
 [31] (2021) Modelbased domain generalization. arXiv. Cited by: Table 1.
 [32] (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §4.1.
 [33] (2013) Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In ICASSP, pp. 6655–6659. Cited by: §2.
 [34] (2018) Generalizing across domains via crossgradient training. arXiv. Cited by: Table 2, Table 3.
 [35] (2020) Informative dropout for robust representation learning: a shapebias perspective. In ICML, pp. 8828–8839. Cited by: Table 1.
 [36] (2021) Mlpmixer: an allmlp architecture for vision. Advances in Neural Information Processing Systems 34. Cited by: §2, §3.2, Table 5.
 [37] (2021) Resmlp: feedforward networks for image classification with dataefficient training. arXiv preprint arXiv:2105.03404. Cited by: §2, Table 5.
 [38] (2017) Deep hashing network for unsupervised domain adaptation. In CVPR, pp. 5018–5027. Cited by: §4.1.
 [39] (2018) Generalizing to unseen domains via adversarial data augmentation. arXiv. Cited by: §2.
 [40] (2020) Learning from extrinsic and intrinsic supervisions for domain generalization. In ECCV, pp. 159–176. Cited by: Table 1.

[41]
(2018)
Wide compression: tensor ring nets.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 9329–9338. Cited by: §2.  [42] (2021) A fourierbased framework for domain generalization. In CVPR, Cited by: §1, §2, §3.3, §3.4, Table 1, Table 2, Table 3, §4.1, §4.2.
 [43] (2021) Adversarial teacherstudent representation learning for domain generalization. NeurIPS 34. Cited by: Table 1, Table 2.
 [44] (2020) FDA: fourier domain adaptation for semantic segmentation. In CVPR, pp. 4085–4095. Cited by: §3.3.
 [45] (2021) MorphMLP: a selfattention free, mlplike backbone for image and video. arXiv preprint arXiv:2111.12527. Cited by: §2.
 [46] (2020) Deep domainadversarial image generation for domain generalisation. In AAAI, Vol. 34, pp. 13025–13032. Cited by: Table 1, Table 2, Table 3, §4.1.
 [47] (2020) Learning to generate novel domains for domain generalization. In ECCV, pp. 561–578. Cited by: Table 1, Table 2, Table 3.