
ModSelect: Automatic Modality Selection for Synthetic-to-Real Domain Generalization

by   Zdravko Marinov, et al.

Modality selection is an important step when designing multimodal systems, especially in the case of cross-domain activity recognition as certain modalities are more robust to domain shift than others. However, selecting only the modalities which have a positive contribution requires a systematic approach. We tackle this problem by proposing an unsupervised modality selection method (ModSelect), which does not require any ground-truth labels. We determine the correlation between the predictions of multiple unimodal classifiers and the domain discrepancy between their embeddings. Then, we systematically compute modality selection thresholds, which select only modalities with a high correlation and low domain discrepancy. We show in our experiments that our method ModSelect chooses only modalities with positive contributions and consistently improves the performance on a Synthetic-to-Real domain adaptation benchmark, narrowing the domain gap.





1 Introduction

Human activity analysis is vital for intuitive human-machine interaction, with applications ranging from driver assistance [54] to smart homes and assistive robotics [70]. Domain shifts, such as appearance changes, constitute a significant bottleneck for deploying such models in real life. For example, while simulations are an excellent way of economical data collection, a Synthetic→Real domain shift leads to a drop in accuracy when recognizing daily living activities [69]. Multimodality is a way of mitigating this effect, since different types of data, such as RGB videos, optical flow and body poses, exhibit individual strengths and weaknesses. For example, models operating on body poses are less affected by appearance changes, as the relations between different joints are more stable given a good skeleton detector [23, 22]. RGB videos, in contrast, are more sensitive to domain shifts [71, 56] but also convenient since they cover the complete scene and video is the most ubiquitous modality [14, 26, 72, 74].

Given the complementary nature of different data types (see Figure 2), we believe that multimodality has a strong potential for improving domain generalization of activity recognition models, but which modalities to select and how to fuse the information become important questions. Despite its high relevance for applications, the question of modality selection has often been overlooked in this field. The main goal of our work is to develop a systematic framework for studying the contribution of individual modalities in cross-domain human activity recognition. We specifically focus on the Synthetic→Real distributional shift [69], which opens new doors for economical data acquisition but comes with an especially large domain gap. We study five different modalities and examine how the prediction outcomes of multiple unimodal classifiers correlate, as well as the domain discrepancy between their embeddings. We hope that our study will provide guidance for a better modality selection process in the future.

Contributions and Summary. We aim to make a step towards effective use of multimodality in the context of cross-domain activity recognition, which has been studied mostly for RGB videos in the past [14, 26, 72, 74]. This work develops the modality selection framework ModSelect for quantifying the importance of individual data streams and can be summarized in two major contributions. (1) We propose a metric for quantifying the contribution of each modality for the specific task by calculating how the performance changes when the modality is included in the late fusion. Our new metric can be used by future research to justify decisions in modality selection. However, to estimate these performance changes, we use the ground-truth labels from the test data. (2) To detach ourselves from supervised labels, we propose to study the domain discrepancy between the embeddings and the correlation between the predictions of the unimodal classifiers of each modality. We use the discrepancy and the correlation to compute modality selection thresholds and show that these thresholds can be used to select only modalities with positive contributions w.r.t. our proposed metric in (1). Our unsupervised modality selection ModSelect can be applied in settings where no labels are present, e.g., in a multi-sensor setup deployed in unseen environments, where ModSelect would identify which sensors to trust.

2 Related Work

2.1 Multimodal Action Recognition

The usage of multimodal data is a common technique in the field of action recognition and is applied both for increasing performance in supervised learning and for unsupervised representation learning. Multimodal methods for action recognition include approaches which make use of video and audio [49, 4, 63, 61, 3], optical flow [38, 63], text [51] or pose information [23, 22, 64]. Such methods can be divided into lower-level early / feature fusion, which is based on merging latent space information from multiple modality streams [42, 94, 55, 1, 2, 62, 33, 84], and late / score fusion, which combines the predictions of individual classifiers or representation encoders either with learned fusion modules [77, 76, 1, 93, 46, 60] or with rule-based algorithms.

For this work, we focus on the latter, since the variety of early fusion techniques and learned late fusion impedes a systematic comparison, while rule-based late fusion builds on a few basic but successful techniques such as averaging single-modal scores [89, 41, 7, 5, 24, 13, 27, 12, 28], the max rule [44, 5, 27, 66], the product rule [41, 78, 44, 47, 25, 91, 81, 27, 66] or the median rule. Ranking-based solutions [30, 75, 65, 20] like Borda count are less commonly used for action recognition but are recognised in other fields of computer vision.

2.2 Modality Contribution Quantification

While the performance contribution of modalities has been analyzed in multiple previous works, e.g., by measuring the signal-to-noise ratio between modalities [83], determining class-wise modality contribution by learning an optimal linear modality combination [45, 6] or extracting modality relations with threshold-based rules [80], in-depth analysis of modality contributions in the field of action recognition remains sparse and mostly limited to small ablation studies. Metrics to measure data distribution distances like Maximum Mean Discrepancy (MMD) or Mean Pairwise Distance (MPD) have been applied in fields like domain adaptation. MMD is commonly used to estimate and to reduce domain shift [59, 53, 34, 79] and can be adapted to be robust against class bias, e.g., in the form of weighted MMD [87]. MPD was applied to analyze semantic similarities of word embeddings, e.g., in [29]. In this work, we introduce a systematic approach for analyzing modality contributions in the context of cross-domain activity recognition, which, to the best of our knowledge, has not been addressed in the past.

2.3 Domain Generalization and Adaptation

Both domain generalization and domain adaptation present strategies to learn knowledge from a source domain which is transferable to a given target domain. While domain adaptation allows access to data from the target domain to fulfill this task, either paired with labels [68, 16, 58] or in the form of unsupervised domain adaptation [10, 15, 68, 16, 58, 19, 18, 73], domain generalization assumes an unknown target domain and builds upon methods which condition a neural network to make use of features which are found to be more generalizable [88], apply heavy augmentations to increase robustness [90] or explore different methods of leveraging temporal data [90].

3 Approach

Figure 1: ModSelect: our approach for unsupervised modality selection, which uses prediction correlations and domain discrepancy.

Our approach consists of three main steps. (1) We extract multiple modalities and train a unimodal action recognition classifier on each modality. Afterwards, we evaluate all possible combinations of the modalities with different late fusion methods. We define the action recognition task in Section 3.1, the datasets we use in Section 3.2, and the modality extraction and training in Section 3.3. (2) In Section 3.4, we determine which modalities lead to a performance gain based on our evaluation results from (1). This establishes a baseline for the (3) third step (Section 3.5), where we show how to systematically select these beneficial modalities in an unsupervised way with our framework ModSelect, without the need for labels or evaluation results. We offer an optional notation table in the Supplementary for a better understanding of all of our equations.

We intentionally do not make use of learned late fusion techniques, such as [77, 76, 1, 93, 46], since such methods do not allow for comparing the contribution of individual modalities. Instead, a specific learned late fusion architecture could be better suited to some modalities in contrast to others, overshadowing a neutral evaluation. However, our work can be used to select modalities upon which such learned late fusion techniques can be designed.

3.1 Action Recognition Task

Our goal is to produce a systematic method for unsupervised modality selection in multimodal action recognition. More specifically, we focus on Synthetic→Real domain generalization to show the need for a modality selection approach when a large domain gap is present. In this scenario, an action classifier is trained only on labeled samples from a Synthetic source domain. In domain generalization, the goal is to generalize to an unseen target domain without using any samples from it during training. In our case, the target domain consists of Real data, and the source and target data originate from distinct probability distributions. The goal is to classify each instance from the Real target test domain, which shares its action label set with the training set. To achieve this, we use the synthetic Sims4Action dataset [69] for training and the real Toyota Smarthome (Toyota) [21] and ETRI-Activity3D-LivingLab (ETRI) [43] as two separate target test sets. We also evaluate our models on the Sims4Action official test split [69] in our additional Synthetic→Synthetic experiments.

3.2 Datasets

We focus on Synthetic→Real domain generalization between the synthetic Sims4Action [69] as a training dataset and the real Toyota Smarthome [21] and ETRI [43] as test datasets. Sims4Action consists of ten hours of video material recorded from the computer game Sims 4, covering activities of daily living which have direct correspondences in the two Real datasets. Toyota Smarthome [21] contains videos of 18 subjects performing 31 different everyday actions within a single apartment, and ETRI [43] consists of 50 subjects performing 55 actions recorded from perspectives of home service robots in various residential spaces. However, we use only the 10 action correspondences to Sims4Action from the Real datasets for our evaluation.

Figure 2: Examples of all extracted modalities. Note: the YOLO modality is represented as a vector v, which encodes distances to the person’s detection (see Section 3.3).

3.3 Modality Extraction and Training

We leverage the multimodal nature of actions to extract additional modalities for our training data, such as body pose, movement dynamics, and object detections. To this end, we utilize the RGB videos from the synthetic Sims4Action [69] to produce four new modalities: heatmaps, limbs, optical flow, and object detections. An overview of all modalities can be seen in Figure 2.

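Heatmaps and Limbs. The heatmaps and limbs (H and L) are extracted via AlphaPose [31, 50, 86], which infers 17 joint locations of the human body. The heatmaps modality at pixel $(x, y)$ is obtained by stacking 2D Gaussian maps $h_k(x, y)$, each centered at a joint location $(x_k, y_k)$ and weighted by that joint's detection confidence $c_k$, as shown in Equation 1, where $\sigma$ is the standard deviation of the Gaussians:

$$h_k(x, y) = c_k \cdot \exp\left(-\frac{(x - x_k)^2 + (y - y_k)^2}{2\sigma^2}\right), \quad k = 1, \dots, 17 \tag{1}$$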


The limbs modality is produced by connecting the joints with white lines and weighting each line by the smaller confidence of its endpoints. We weight both modalities by the detection confidences so that uncertain and occluded body parts are dimmer and have a smaller contribution.
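As an illustration of the confidence-weighted heatmaps described above, the following minimal sketch builds one 2D Gaussian map per joint in plain Python. The function names, the tiny image size, and the value of `sigma` are assumptions made for this example; in practice the joint coordinates and confidences would come from the pose estimator.

```python
import math

def joint_heatmap(width, height, joint_xy, confidence, sigma):
    """Return a height x width grid with one confidence-weighted 2D Gaussian."""
    jx, jy = joint_xy
    return [
        [
            confidence * math.exp(-((x - jx) ** 2 + (y - jy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]

def stack_heatmaps(width, height, joints, confidences, sigma):
    """Stack one map per joint; uncertain joints yield dimmer maps."""
    return [
        joint_heatmap(width, height, xy, conf, sigma)
        for xy, conf in zip(joints, confidences)
    ]

# Two hypothetical joints on an 8x6 grid (sigma chosen only for illustration).
maps = stack_heatmaps(8, 6, joints=[(2, 3), (5, 1)], confidences=[0.9, 0.4], sigma=1.5)
peak_first = maps[0][3][2]   # value at the first joint's own location
peak_second = maps[1][1][5]  # value at the second joint's own location
```

Each map peaks at its joint location with a height equal to the detection confidence, so an occluded joint with low confidence contributes less to the stacked input.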

Optical Flow. The optical flow modality (OF) is estimated via the Gunnar-Farneback method [32]. The optical flow at each pixel encodes the magnitude and angle of the pixel intensity changes between two frames in the value and hue components of the HSV color space, respectively. The saturation is used to adjust the visibility and we set it to its maximum value. The heatmaps, limbs, and optical flow are all image-based and are used as an input to models which usually utilize RGB images.
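The HSV encoding of a flow vector can be sketched for a single pixel using only the Python standard library. The flow estimation itself is not shown; `flow_to_rgb` and the normalization by `max_magnitude` are assumptions made for this illustration.

```python
import colorsys
import math

def flow_to_rgb(dx, dy, max_magnitude):
    """Encode one flow vector as a color: angle -> hue, magnitude -> value."""
    angle = math.atan2(dy, dx) % (2.0 * math.pi)           # direction in [0, 2*pi)
    hue = angle / (2.0 * math.pi)                          # scaled to [0, 1)
    value = min(math.hypot(dx, dy) / max_magnitude, 1.0)   # brightness from magnitude
    saturation = 1.0                                       # fixed to its maximum, as in the text
    return colorsys.hsv_to_rgb(hue, saturation, value)

still_pixel = flow_to_rgb(0.0, 0.0, max_magnitude=10.0)    # no motion: black
moving_right = flow_to_rgb(10.0, 0.0, max_magnitude=10.0)  # angle 0: pure red
```

Pixels without motion stay black, while the color of a moving pixel indicates its direction and its brightness indicates its speed.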

Object Detections. Our last modality (YOLO) consists of object detections obtained by YOLOv3 [67], which detects 80 different objects. Unlike the other modalities, we represent the detections as a vector instead of an image. We show that such a simple representation achieves good domain generalization in our experiments. The YOLO modality for an image sample consists of a $K$-dimensional vector v, where $v_i$ corresponds to the reciprocal Euclidean distance between the person’s and the $i$-th object’s bounding box centers, and $K$ is the number of detection classes. This way, objects closer to the person have a larger weight in v than ones which are further away. After computing the distances, v is normalized by its norm: $v \leftarrow v / \lVert v \rVert$. We denote the set of all modalities as $M$ = {H, L, OF, RGB, YOLO} and use the term $m \in M$ in our equations.
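A minimal sketch of this detection vector is shown below. Several details are assumptions that the text does not specify: the L2 norm, keeping only the closest instance when a class is detected more than once, a zero entry for undetected classes, and the name `yolo_vector`.

```python
import math

def yolo_vector(person_center, detections, num_classes):
    """Build the normalized reciprocal-distance vector for one image.

    detections: list of (class_id, (center_x, center_y)) for detected objects.
    """
    v = [0.0] * num_classes  # assumption: undetected classes stay at zero
    for class_id, (cx, cy) in detections:
        dist = math.hypot(cx - person_center[0], cy - person_center[1])
        if dist == 0.0:
            continue  # assumption: skip degenerate detections to avoid division by zero
        v[class_id] = max(v[class_id], 1.0 / dist)  # assumption: closest instance wins
    norm = math.sqrt(sum(x * x for x in v))  # assumption: L2 norm
    return [x / norm for x in v] if norm > 0.0 else v

# Hypothetical detections: class 1 is 5 px away, class 2 is 10 px away.
v = yolo_vector((0.0, 0.0), [(1, (3.0, 4.0)), (2, (0.0, 10.0))], num_classes=4)
```

The closer object (class 1) receives the larger entry, which matches the intended weighting of nearby objects.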

Training. We train unimodal classifiers on each modality and evaluate all possible modality combinations with different late fusion methods. We utilize 3D-CNN models with the S3D backbone [85] for each one of the RGB, H, L, and OF modalities. The YOLO modality utilizes an MLP model as it is not image-based. We train all action recognition models end-to-end on Sims4Action [69].

Evaluation. For our late fusion experiments, we combine the predictions of all unimodal classifiers at the class score level and obtain results for all modality combinations. We investigate six late fusion strategies: Sum, Squared Sum, Product, Maximum, Median, and Borda Count [39, 8], which all operate on the class probability scores. Borda Count also uses the ranking of the class scores. For brevity, we refer to a late fusion of unimodal classifiers as a multimodal classifier and present our late fusion results in Section 4.1.
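The six rule-based fusion strategies can be sketched as follows. This is an illustrative reading rather than the exact implementation: in particular, Squared Sum is assumed to sum the squared class scores, and Borda Count is assumed to award each class one point per class it outranks within a classifier. The score vectors are hypothetical.

```python
import math
import statistics

def fuse(scores, rule):
    """Fuse per-classifier probability vectors into one score per class."""
    per_class = list(zip(*scores))  # per_class[c] holds class c's score from each classifier
    if rule == "sum":
        return [sum(col) for col in per_class]
    if rule == "squared_sum":
        return [sum(s * s for s in col) for col in per_class]
    if rule == "product":
        return [math.prod(col) for col in per_class]
    if rule == "maximum":
        return [max(col) for col in per_class]
    if rule == "median":
        return [statistics.median(col) for col in per_class]
    if rule == "borda":
        points = [0] * len(per_class)
        for vec in scores:
            order = sorted(range(len(vec)), key=lambda c: vec[c])  # worst class first
            for rank, c in enumerate(order):
                points[c] += rank
        return points
    raise ValueError(f"unknown fusion rule: {rule}")

def predict(scores, rule):
    """Return the index of the class with the highest fused score."""
    fused = fuse(scores, rule)
    return max(range(len(fused)), key=lambda c: fused[c])

RULES = ["sum", "squared_sum", "product", "maximum", "median", "borda"]
scores = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.5, 0.4, 0.1]]  # three classifiers, three classes
predictions = {rule: predict(scores, rule) for rule in RULES}
```

Because every rule only needs the class scores of the unimodal classifiers, any subset of modalities can be fused without retraining, which is what makes the exhaustive evaluation of all modality combinations feasible.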

3.4 Quantification Study: Modality Contributions

In this section, we propose how to quantify the contributions of each modality based on the performance of the models on the target test sets. To this end, we propose a with-without metric, which computes the average difference between the performance of a multimodal classifier with a modality and its performance without it. Formally, the contribution $C_m$ of a modality $m \in M$ is defined as:

$$C_m = \frac{1}{|K|} \sum_{k \in K} \big( acc(k \cup \{m\}) - acc(k) \big) \tag{2}$$

where $K$ is the set of all modality combinations that do not contain $m$, $acc(k)$ is the test accuracy of the multimodal classifier with the modality combination $k$, and $K = \mathcal{P}(M \setminus \{m\}) \setminus \{\emptyset\}$. We compute $C_m$ for all modalities $m \in M$ based on the late fusion results listed in Section 4.1. The contribution of each modality can be used to determine the modalities which positively influence the performance on the test dataset.
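One reading of the with-without metric is sketched below: for a modality m, average acc(k ∪ {m}) − acc(k) over every non-empty combination k of the remaining modalities. The modality names and accuracy values are hypothetical toy data, not results from the paper.

```python
from itertools import combinations

def contribution(modality, modalities, acc):
    """Average accuracy change from adding `modality` to every other combination.

    acc maps a frozenset of modality names to the accuracy of that combination.
    """
    others = [m for m in modalities if m != modality]
    diffs = []
    for size in range(1, len(others) + 1):
        for combo in combinations(others, size):
            k = frozenset(combo)
            diffs.append(acc[k | {modality}] - acc[k])  # with minus without
    return sum(diffs) / len(diffs)

def fs(*names):
    return frozenset(names)

# Hypothetical accuracies [%] for three toy modalities A, B, C.
toy_acc = {
    fs("A"): 50.0, fs("B"): 40.0, fs("C"): 30.0,
    fs("A", "B"): 55.0, fs("A", "C"): 45.0, fs("B", "C"): 42.0,
    fs("A", "B", "C"): 52.0,
}
c_a = contribution("A", ["A", "B", "C"], toy_acc)
c_c = contribution("C", ["A", "B", "C"], toy_acc)
```

In this toy example, adding C lowers the accuracy on average, so its contribution is negative and it would be flagged as a harmful modality.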

3.5 ModSelect: Unsupervised Modality Selection

In this section, we introduce our method ModSelect for unsupervised modality selection. In this setting, we assume that we do not have any labels in the target test domain. This is exactly the case for Synthetic→Real domain generalization, where a model trained on simulated data is deployed in real-world conditions. In this case, the contribution of each modality cannot be estimated with Equation 2, as the test accuracy cannot be computed without ground-truth labels. Note that we do have labels in our test sets, but we only use them in our quantification study and ignore them for our unsupervised experiments.

We propose ModSelect - a method for unsupervised modality selection based on the consensus of two metrics: (1) the correlation between the unimodal classifiers’ predictions and (2) the Maximum Mean Discrepancy (MMD) [36] between the classifiers’ embeddings. We compute both metrics with our unimodal classifiers and propose how to systematically estimate modality selection thresholds. We show that the thresholds select the same modalities with positive contributions as our quantification study in Section 3.4.

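Correlation Metric. We define the predictions correlation vector $r(m, n)$ between modalities $m$ and $n$ as:

$$r(m, n) = \frac{\mathbb{E}\big[(z_m - \mu_m) \odot (z_n - \mu_n)\big]}{\sigma_m \odot \sigma_n} \tag{3}$$

where $z_m, z_n$ are the softmax class scores of the action classifiers trained on modalities $m$ and $n$ respectively, $(\mu_m, \sigma_m)$ and $(\mu_n, \sigma_n)$ are the mean and standard deviation vectors of $z_m$ and $z_n$, and $\odot$ is the element-wise multiplication operator. The expectation is taken over the test samples and the division is element-wise. We define the predictions correlation $\rho(m, n)$ between two modalities $m$, $n$ as:

$$\rho(m, n) = \frac{1}{N_c} \sum_{c=1}^{N_c} r_c(m, n) \tag{4}$$

where $N_c$ is the number of action classes.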
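The predictions correlation can be sketched as a per-class Pearson correlation over the test samples, followed by an average over the classes. The function name, the handling of constant score columns, and the toy score vectors are assumptions for this illustration.

```python
import math

def prediction_correlation(scores_m, scores_n):
    """Return (per-class correlation vector, mean correlation over classes).

    scores_m, scores_n: lists of softmax vectors, one per test sample,
    produced by the classifiers of two different modalities.
    """
    n_samples = len(scores_m)
    n_classes = len(scores_m[0])
    r = []
    for c in range(n_classes):
        a = [s[c] for s in scores_m]
        b = [s[c] for s in scores_n]
        mu_a, mu_b = sum(a) / n_samples, sum(b) / n_samples
        cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n_samples
        sd_a = math.sqrt(sum((x - mu_a) ** 2 for x in a) / n_samples)
        sd_b = math.sqrt(sum((y - mu_b) ** 2 for y in b) / n_samples)
        # Assumption: a class with constant scores contributes zero correlation.
        r.append(cov / (sd_a * sd_b) if sd_a > 0 and sd_b > 0 else 0.0)
    return r, sum(r) / n_classes

scores_a = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
scores_flipped = [[0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
_, rho_same = prediction_correlation(scores_a, scores_a)
_, rho_opposite = prediction_correlation(scores_a, scores_flipped)
```

Two classifiers that always agree reach a correlation of 1, while classifiers that systematically disagree reach -1. No labels are needed at any point.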

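MMD Metric. We show that the distance between the distributions of the embeddings of two unimodal classifiers can also be used to compute a modality selection threshold. The MMD metric [36] between two distributions $p$ and $q$ over a set $\mathcal{X}$ is formally defined as:

$$MMD(p, q) = \big\lVert \mathbb{E}_{x \sim p}[\varphi(x)] - \mathbb{E}_{y \sim q}[\varphi(y)] \big\rVert_{\mathcal{H}} \tag{5}$$

where $\varphi: \mathcal{X} \to \mathcal{H}$ is a feature map, and $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) [35, 9, 36]. For our empirical calculation of MMD between the embeddings of two modalities we set $\mathcal{X} = \mathcal{H} = \mathbb{R}^d$ and $\varphi(x) = x$:

$$MMD(m, n) = \big\lVert \bar{h}_m - \bar{h}_n \big\rVert_2 \tag{6}$$

where $\bar{h}_m, \bar{h}_n \in \mathbb{R}^d$ are the sample means of the embeddings $h_m, h_n$ from the second-to-last linear layer of the action classifiers for modalities $m$ and $n$ respectively, and $d$ is the embedding size. Note that using a linear feature map lets us determine only the discrepancy between the distributions’ means. A linear mapping is sufficient to produce a good modality selection threshold, but one could also consider more complex alternatives, such as a nonlinear feature map or a Gaussian kernel [37].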
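With a linear feature map, the empirical MMD reduces to the Euclidean distance between the mean embeddings of the two classifiers. A minimal sketch with hypothetical two-dimensional embeddings:

```python
import math

def linear_mmd(emb_m, emb_n):
    """Distance between the mean embeddings of two unimodal classifiers.

    emb_m, emb_n: lists of embedding vectors with the same dimensionality.
    """
    d = len(emb_m[0])
    mean_m = [sum(e[j] for e in emb_m) / len(emb_m) for j in range(d)]
    mean_n = [sum(e[j] for e in emb_n) / len(emb_n) for j in range(d)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_m, mean_n)))

emb_a = [[0.0, 0.0], [2.0, 2.0]]   # mean (1, 1)
emb_b = [[4.0, 5.0], [4.0, 5.0]]   # mean (4, 5)
mmd_ab = linear_mmd(emb_a, emb_b)
mmd_aa = linear_mmd(emb_a, emb_a)
```

This also makes the requirement of equal embedding sizes explicit: the two mean vectors must live in the same space for the distance to be defined.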

We make the following observations regarding both metrics for modality selection. Firstly, a high correlation between correct predictions is statistically more likely than a high correlation between wrong predictions, since there is only one correct class and multiple possibilities for error. We believe that a stronger correlation between the predictions results in a higher performance. Secondly, unimodal classifiers should have a high agreement on easy samples and a disagreement on difficult cases [39]. A higher domain discrepancy between the classifiers’ embeddings has been shown to indicate a lower agreement on their predictions [52, 92], and hence, a decline in performance when fused. We therefore believe that good modalities are characterized by a low discrepancy and high correlation.

Modality Selection Thresholds. After computing $\rho(m, n)$ and $MMD(m, n)$ for all pairs $(m, n)$, we systematically calculate modality selection thresholds for each metric, $\rho$ and MMD. We consider two types of thresholds: (1) an aggregated threshold $\delta_{agg}$, which selects a set of individual modalities $\hat{M}_{agg}$, and (2) a pairs-threshold $\delta_{pairs}$, which selects a set of modality pairs $\hat{M}_{pairs}$.

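Aggregated Threshold $\delta_{agg}$. For the first threshold, we aggregate the $\rho(m, n)$ and $MMD(m, n)$ values for a modality $m$ by averaging over all of its pairs:

$$\rho(m) = \frac{1}{|M| - 1} \sum_{n \in M \setminus \{m\}} \rho(m, n), \qquad MMD(m) = \frac{1}{|M| - 1} \sum_{n \in M \setminus \{m\}} MMD(m, n) \tag{7}$$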


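Thus, we produce the sets $V_\rho = \{\rho(m) : m \in M\}$ and $V_{MMD} = \{MMD(m) : m \in M\}$. A simple approach would be to use the mean or median as a threshold for $V_\rho$ and $V_{MMD}$. However, such thresholds are sensitive to outliers (mean) or do not use all the information from the values (median). Additionally, one cannot tune the threshold with prior knowledge. To mitigate these issues, we propose to use the Winsorized Mean [40, 82] for both sets, which for a set of values $V$ is defined as:

$$W_\lambda(V) = \lambda V_\lambda + (1 - 2\lambda)\, TM_\lambda(V) + \lambda V_{1-\lambda} \tag{8}$$

where $V_\lambda$ is the $\lambda$-percentile of $V$, $TM_\lambda(V)$ is the $\lambda$-trimmed mean of $V$, and $\lambda$ is a “trust” hyperparameter. A higher $\lambda$ results in a lower contribution of edge values in $V$ and a bigger trust in values near the center. We therefore set $\lambda$ with respect to the number of modalities (five in our study) and the number of modalities we expect to trust. We compute two separate thresholds $\delta^{\rho}_{agg} = W_\lambda(V_\rho)$ and $\delta^{MMD}_{agg} = W_\lambda(V_{MMD})$ and select the modalities as a consensus between the two metrics as:

$$\hat{M}_{agg} = \{\, m \in M : \rho(m) \geq \delta^{\rho}_{agg} \,\wedge\, MMD(m) \leq \delta^{MMD}_{agg} \,\} \tag{9}$$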


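Pairs-Threshold $\delta_{pairs}$. The second type of selection threshold skips the aggregation step of Equation 7 and directly computes the Winsorized Means over the sets of all $\rho(m, n)$ and $MMD(m, n)$ values to obtain $\delta^{\rho}_{pairs}$ and $\delta^{MMD}_{pairs}$ respectively. This results in a selection of modality pairs, rather than individual modalities as in Equation 9. In other words, the thresholds $\delta_{pairs}$ are suitable when one is searching for the best pairs of modalities, and $\delta_{agg}$ for the best individual modalities. The selected modality pairs with this method are:

$$\hat{M}_{pairs} = \{\, (m, n) : \rho(m, n) \geq \delta^{\rho}_{pairs} \,\wedge\, MMD(m, n) \leq \delta^{MMD}_{pairs} \,\} \tag{10}$$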


Summary of our Approach. A summary of our unsupervised modality selection method ModSelect is illustrated in Figure 1. We use the embeddings of unimodal action classifiers to compute the Maximum Mean Discrepancy (MMD) between all pairs of modalities. We also compute the correlation $\rho$ between the predictions of all pairs of classifiers. We systematically estimate thresholds for MMD and $\rho$, which discard certain modalities, and select only modalities on which both metrics have a consensus. In Section 4, we show that the modalities selected with our method ModSelect are exactly the modalities with a positive contribution according to Equation 2, although our unsupervised selection does not utilize any ground-truth labels.
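The thresholding step relies on a Winsorized Mean. The sketch below uses the common textbook definition, in which the lowest and highest fraction of values are clipped to the nearest remaining value before averaging. The rounding of the clip count and the name `winsorized_mean` are assumptions; a different percentile convention would give slightly different thresholds, so this sketch is not guaranteed to reproduce the thresholds reported in Section 4.

```python
def winsorized_mean(values, lam):
    """Mean after clipping the lowest and highest `lam` fraction of the values."""
    xs = sorted(values)
    k = int(lam * len(xs))  # assumption: number of values clipped on each side (floor)
    if 2 * k >= len(xs):
        raise ValueError("lam is too large for the number of values")
    low, high = xs[k], xs[len(xs) - k - 1]
    clipped = [min(max(x, low), high) for x in xs]
    return sum(clipped) / len(clipped)

toy_values = [1.0, 2.0, 3.0, 4.0, 100.0]              # one extreme outlier
plain_mean = winsorized_mean(toy_values, lam=0.0)     # no clipping: ordinary mean
robust_mean = winsorized_mean(toy_values, lam=0.2)    # clip one value per side
```

With `lam=0.0` the outlier dominates the threshold, whereas `lam=0.2` limits its influence while still using all values, which is the motivation given above for preferring the Winsorized Mean over the plain mean or the median.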

4 Experiments

4.1 Late Fusion: Results

We evaluate our late fusion multimodal classifiers following the cross-subject protocol from [21] for Toyota, the inter-dataset protocol from [48] for ETRI, and the official test split for Sims4Action from [69]. We follow the original Sims4Action→Toyota evaluation protocol of [69] and utilize the mean-per-class accuracy (mPCA), as the number of samples per class is imbalanced in the Real test sets. The mPCA metric avoids bias towards overrepresented classes and is often used in unbalanced activity recognition datasets [21, 11, 54, 69].
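The difference between mPCA and plain accuracy can be seen in a small sketch; the label lists are hypothetical.

```python
from collections import defaultdict

def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Imbalanced toy example: the classifier always predicts the majority class 0.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
mpca = mean_per_class_accuracy(y_true, y_pred)
plain_accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Always predicting the majority class yields a plain accuracy of 0.8 but an mPCA of only 0.5, which is why mPCA is preferred on imbalanced test sets.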

The results from our evaluation are displayed in Table 1. The domain gap of transferring to Real data is apparent in the drastically lower performance, especially on the ETRI dataset. Combinations including the H, L, or RGB modalities exhibit the best performance on the Sims4Action dataset, whereas OF and YOLO are weaker. However, combinations including the RGB modality seem to have an overall lower performance on the Real datasets, perhaps due to the large appearance change. Combinations with the YOLO modality show the best performance for both Real test sets. Inspecting the results in Table 1 is tedious and prone to misinterpretation or confirmation bias [57]. It is also possible to overlook important tendencies. Hence, we show in Section 4.2 how our quantification study tackles these problems by systematically disentangling the modalities with a positive contribution from the rest.

Table 1: Results for the action classifiers trained on Sims4Action [69] in the mPCA metric [%], reported per modality combination on the Synthetic test set (Sims4Action [69]) and the Real test sets (Toyota [21], ETRI [43]). The late fusion results are averaged over the fusion strategies discussed in Section 3.3. H: Heatmaps, L: Limbs, OF: Optical Flow.

4.2 Quantification Study: Results

We use the results from Table 1 for the accuracy term in Equation 2 and compute the contribution $C_m$ of each modality $m$. We do this for all late fusion strategies and all three test datasets and plot the results in Figure 3. The Synthetic→Real domain gap is clearly seen in the substantial difference in the height of the bars in the test split of Sims4Action [69] compared to the Real test datasets. The limbs and RGB modalities have the largest contribution on Sims4Action [69], followed by the heatmaps. The only modalities with negative contributions are the optical flow and YOLO, where YOLO shows a drastic drop for the Squared Sum and Maximum late fusion methods. We conclude that YOLO and optical flow have a negative contribution on Sims4Action [69].

Figure 3: Quantification Study: Quantification of the contribution of each modality for late fusion methods and on three different test sets. The height of each bar corresponds to the contribution value which is computed with Equation 2.

The results on the Real test datasets Toyota [21] and ETRI [43] show different tendencies. The contributions are smaller due to the domain shift, especially on the ETRI dataset. The RGB modality has explicitly negative contributions on both Real datasets. We hypothesize that this is due to the appearance changes when transitioning from synthetic to real data. Apart from RGB, optical flow also has a consistently negative contribution on the ETRI dataset. An interesting observation is that the domain gap is much larger on ETRI than on Toyota. A reason for this might be that Roitberg et al. [69] design Sims4Action specifically as a Synthetic→Real domain adaptation benchmark to Toyota Smarthome [21], e.g., in Sims4Action the rooms are furnished the same way as in Toyota Smarthome. Our results indicate that optical flow has a negative contribution in ETRI, whereas RGB is negative in both Real datasets. The average contribution of each modality over the 6 fusion methods is in Table 2(a).

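(a) Contribution $C_m$

| Test Dataset | H | L | OF | RGB | YOLO |
|---|---|---|---|---|---|
| Sims4Action [69] | 4.37 | 8.86 | -0.74* | 6.98 | -2.57* |
| Toyota [21] | 2.14 | 2.46 | 2.90 | -1.86* | 2.13 |
| ETRI [43] | 0.76 | 1.60 | -0.13* | -1.17* | 2.02 |

(b) Aggregated $\rho(m)$

| Test Dataset | H | L | OF | RGB | YOLO | Threshold |
|---|---|---|---|---|---|---|
| Sims4Action [69] | 0.57 | 0.55 | 0.38* | 0.50 | 0.37* | 0.40 |
| Toyota [21] | 0.23 | 0.21 | 0.14 | 0.08* | 0.14 | 0.10 |
| ETRI [43] | 0.14 | 0.14 | 0.06* | 0.05* | 0.13 | 0.08 |

(c) Aggregated $MMD(m)$

| Test Dataset | H | L | OF | RGB | Threshold |
|---|---|---|---|---|---|
| Sims4Action [69] | 9.49 | 8.12 | 13.07* | 9.92 | 10.15 |
| Toyota [21] | 11.93 | 11.47 | 13.34 | 20.79* | 14.38 |
| ETRI [43] | 17.84 | 17.76 | 22.04* | 24.91* | 20.64 |

Table 2: (a) Average contribution over the late fusion methods of each modality. Negative contributions are marked with *. (b) Aggregated prediction correlation values for each modality on the three test datasets and the aggregated thresholds. Values below the threshold are marked with *. (c) Aggregated MMD values for each modality on the three test datasets and the aggregated thresholds. Values above the threshold are marked with *.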

4.3 Results from ModSelect: Unsupervised Modality Selection

Table 2(a) shows which modalities have a negative contribution on each target test dataset. However, to estimate these values we needed the performance, and hence, the labels for the target test sets. In this section, we show how to select only the modalities with a positive contribution without using any labels.

Predictions Correlation. We utilize the predictions correlation metric $\rho(m, n)$ and compute it for all modality pairs $(m, n)$ using Equation 4. The results for all datasets are illustrated in the chord diagrams in Figure 4. The chord diagrams allow us to identify the same tendencies which we observed in our quantification study. Each arc connects two modalities $(m, n)$ and its thickness corresponds to the value $\rho(m, n)$. The YOLO modality has the weakest correlations on Sims4Action [69], depicted in the thinner green arcs. Optical flow also exhibits weaker correlations compared to the heatmaps, limbs, and RGB. We see significantly thinner arcs for the RGB modality in both Real test datasets, and for optical flow in ETRI, which matches our results in Table 2(a).

Figure 4: Chord plots of the prediction correlations $\rho(m, n)$ for all modality pairs $(m, n)$. The thickness of each arc corresponds to the correlation between its two endpoint modalities $m$ and $n$. Each value is computed according to Equation 4.

However, simply inspecting the chord plots is not a systematic method for modality selection. Hence, we first compute the aggregated correlations $\rho(m)$ with Equation 7, which constitute the set $V_\rho$. Then, we compute the aggregated threshold $\delta^{\rho}_{agg}$ using the $\lambda$-Winsorized Mean [82] from Equation 8. We compute these terms for each test dataset and obtain three thresholds. The aggregated correlations and their thresholds for each test dataset are listed in Table 2(b). The modalities underneath the thresholds are exactly the ones with negative contributions from Table 2(a).

We also show that applying the pairs-threshold $\delta^{\rho}_{pairs}$ leads to the same results. We skip the aggregation step of Equation 7 and compose the set $V_\rho$ out of the $\rho(m, n)$ values, i.e. we focus on modality pairs instead of individual modalities. We compute the threshold again with the $\lambda$-Winsorized Mean [82] from Equation 8. The correlation values as well as the accuracies of all bi-modal action classifiers from Table 1 are shown in Figure 5. The pairs-thresholds for each test set are drawn as dashed lines and divide the modality pairs into two groups. The modality pairs below the thresholds are marked with an X so that it is possible to identify which pairs are selected.

For Sims4Action [69], optical flow (OF) and YOLO show a negative contribution in Table 2(a) and are also discarded by $\delta^{\rho}_{agg}$. Figure 5 shows that all modality pairs containing either OF or YOLO are below the threshold, i.e. $\delta^{\rho}_{pairs}$ discards the same modalities as $\delta^{\rho}_{agg}$ for Sims4Action. The same is true for the Real test datasets, where all models containing RGB are discarded in Toyota and ETRI, as well as OF for ETRI. Another observation is that the majority of "peaks" in the yellow accuracy lines coincide with the peaks in the blue correlation lines. This result is in agreement with our theory that a high correlation between correct predictions is statistically more likely than a high correlation between wrong predictions, since there is only one correct class and multiple incorrect ones. Moreover, while we do achieve the same results with both thresholds, we recommend using $\delta_{agg}$ when discarding an entire input modality, e.g. a faulty sensor in a multi-sensor setup, and $\delta_{pairs}$ when searching for the best synergies from all modality combinations.

Domain Discrepancy. The second metric we use to discern the contributing modalities from the rest is the Maximum Mean Discrepancy (MMD) [36] between the embeddings of the action classifiers. Note that the YOLO modality is not included in this experiment, as its MLP model’s embedding size differs from that of the image-based modalities. While MMD is widely used as a loss term for minimizing the domain gap between source and target domains [79, 87, 17], a large MMD is also associated with a decline in performance in fusion methods [52, 92]. To utilize the MMD metric to separate the modalities, we first compute $MMD(m, n)$ for all modality pairs and the aggregated discrepancies $MMD(m)$ with Equations 6 and 7. We then compute the pairs- and aggregated thresholds with the $\lambda$-Winsorized Mean [82] from Equation 8, the same way as we did for the predictions’ correlations.

Figure 5: Prediction correlations $\rho(m, n)$ between all modality pairs and the late fusion accuracy of the bi-modal action classifiers. The pairs-thresholds $\delta^{\rho}_{pairs}$ are depicted as dashed lines. Pairs under the thresholds are crossed out with an X in the yellow line.

Figure 6: Maximum Mean Discrepancy values computed with Equation 6 for all modality pairs $(m, n)$. Warmer colors correspond to a higher discrepancy.

The $MMD(m, n)$ values for all modality pairs in the three test datasets are illustrated in Figure 6. The higher discrepancy values are clearly apparent by their bright colors and contrast to the rest of the values. Optical flow has the largest values on Sims4Action, and RGB has the highest discrepancy on Toyota and ETRI. Optical flow also exhibits a high discrepancy on ETRI. Once again, we can see that the domain gap to the ETRI dataset is much larger, which is manifested in drastically higher MMD values. The pairs-thresholds $\delta^{MMD}_{pairs}$ for the three datasets separate exactly the same modality pairs as our quantification study and the correlation thresholds, with the exception of the (H, OF) pair in ETRI. The aggregated discrepancies for each modality and the aggregated thresholds are listed in Table 2(c), where the values above the thresholds are highlighted. The highlighted values coincide exactly with the negative contributions from our quantification study in Table 2(a).

ModSelect: Unsupervised Modality Selection. Finally, we select the modalities with either the aggregated or the pairs-thresholds by constructing the consensus between our two metrics, the prediction correlation and the MMD (see Equations 9 and 10). The modalities selected with our aggregated thresholds and the modality pairs selected with our pairs-thresholds are listed in Table 3. The modalities selected by the aggregated thresholds are exactly the ones with a positive contribution in Table 2(a) from our quantification study in Section 4.2. The pairs-thresholds select only pairs composed of modalities that the aggregated thresholds also select, which means that the selected pairs contain only modalities with positive contributions. In other words, our proposed unsupervised modality selection is able to select only the modalities with positive contributions by using the prediction correlation and the MMD between the embeddings of the unimodal action classifiers, without needing any ground-truth labels on the test datasets.
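A minimal sketch of the consensus idea is given below, assuming that a modality is kept when its aggregated correlation lies above the correlation threshold and its aggregated MMD lies below the MMD threshold. The function name `select_modalities`, the strict inequalities, the handling of modalities without an MMD value, and all numbers are assumptions for illustration; the precise rule is defined by Equations 9 and 10.

```python
def select_modalities(corr, corr_thr, mmd, mmd_thr):
    """Keep modalities with correlation above and MMD below their thresholds."""
    selected = set()
    for modality, c in corr.items():
        high_corr = c > corr_thr
        # Assumption: a modality without an MMD value is judged on correlation alone.
        low_mmd = mmd.get(modality, float("-inf")) < mmd_thr
        if high_corr and low_mmd:
            selected.add(modality)
    return selected

# Hypothetical aggregated values (not taken from the paper).
corr = {"H": 0.60, "L": 0.55, "OF": 0.20, "RGB": 0.58, "YOLO": 0.50}
mmd = {"H": 0.10, "L": 0.12, "OF": 0.15, "RGB": 0.90}
chosen = select_modalities(corr, corr_thr=0.40, mmd=mmd, mmd_thr=0.30)
```

Here OF is rejected by the correlation criterion and RGB by the discrepancy criterion, so only modalities that pass both checks remain.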

Impact on the multimodal accuracy. The impact of the unsupervised modality selection on the mean multimodal accuracy can be seen in Table 3. Selecting the modalities with our proposed thresholds leads to an average improvement of +5.2%, +3.6%, and +4.3% for Sims4Action [69], Toyota [21], and ETRI [43], respectively. This is a substantial improvement, given the low accuracies on the Real test datasets due to the synthetic-to-real domain gap. These results confirm that our modality selection approach is able to discern between good and bad sources of information, even in the case of a large distributional shift.

Test Dataset | Selected modalities | All Modalities (average multimodal accuracy) | Ours (average multimodal accuracy)
Sims4Action [69] | {H, L, RGB} | 85.7% | 90.9% (+5.2%)
Toyota [21] | {H, L, OF, YOLO} | 22.9% | 26.5% (+3.6%)
ETRI [43] | {H, L, YOLO} | 17.7% | 22.0% (+4.3%)
Table 3: Results from ModSelect: modalities selected with the aggregated thresholds and modality pairs selected with the pairs-thresholds, computed with Equations 9 and 10.
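The gains reported in Table 3 can be re-derived with a few lines of arithmetic. The snippet below uses only the accuracies from the table; the relative gain is an additional quantity computed here for context and is not reported in the paper.

```python
# Accuracies in percent from Table 3: (all modalities, selected modalities).
table3 = {
    "Sims4Action": (85.7, 90.9),
    "Toyota": (22.9, 26.5),
    "ETRI": (17.7, 22.0),
}

# Absolute gain in percentage points, as listed in the table.
gains = {name: round(ours - base, 1) for name, (base, ours) in table3.items()}

# Relative gain with respect to the all-modalities baseline (derived here).
relative = {name: round(100.0 * (ours - base) / base, 1)
            for name, (base, ours) in table3.items()}
```

The absolute gains are of similar size on all three datasets, but because the baselines on the real datasets are low, the same number of percentage points corresponds to a much larger relative change there.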

5 Limitations and Conclusion

Limitations. A limitation of our work is that the contributions quantification metric relies on the evaluation results on the test datasets, i.e., the ground-truth labels are needed to calculate the metric. Moreover, the with-without metric in Equation 2 requires a number of computations that grows with the number of modalities. However, one should note that in practice the number of modalities is not too large. Additionally, since our method is novel, it has only been tested on the task of cross-domain action recognition. To safely apply our method to other multimodal tasks, e.g., object recognition, future investigations are needed. Moreover, the overall accuracy is still relatively low for cross-domain activity recognition and more research is needed for deployment-ready systems.
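To make the cost argument concrete, the sketch below shows one plausible reading of a with-without comparison: the mean accuracy of all modality subsets that contain a modality minus the mean accuracy of those that do not. Equation 2 is not reproduced in this section, so the function `with_without_contribution`, the toy accuracy function, and the subset enumeration are assumptions rather than the exact metric. In this sketch, M modalities lead to 2^M - 1 non-empty subsets that each need one labeled evaluation.

```python
from itertools import combinations

def with_without_contribution(modalities, accuracy):
    """Mean accuracy of subsets with a modality minus mean accuracy without it.

    Assumes at least two modalities, so both groups are non-empty.
    """
    subsets = [frozenset(c) for r in range(1, len(modalities) + 1)
               for c in combinations(modalities, r)]
    # Evaluate every subset once; this is the step that needs ground-truth labels.
    acc = {s: accuracy(s) for s in subsets}
    contribution = {}
    for m in modalities:
        with_m = [a for s, a in acc.items() if m in s]
        without_m = [a for s, a in acc.items() if m not in s]
        contribution[m] = sum(with_m) / len(with_m) - sum(without_m) / len(without_m)
    return contribution, len(subsets)

# Hypothetical accuracy function: "A" and "B" help, "C" hurts.
def toy_accuracy(subset):
    return 0.5 + 0.1 * ("A" in subset) + 0.05 * ("B" in subset) - 0.2 * ("C" in subset)

contrib, n_evaluations = with_without_contribution(["A", "B", "C"], toy_accuracy)
```

With three toy modalities the sketch evaluates seven subsets and assigns a negative contribution only to the modality that lowers the accuracy.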

Conclusion. This is the first systematic study of modality selection in the context of cross-domain activity recognition, aimed at providing guidance for future work in multimodal domain generalization. Our experiments validate our assumption that cross-domain activity recognition clearly benefits from multimodality, but that not all modalities improve the recognition, so a systematic modality selection is vital for achieving good results. We proposed a way to measure the contribution of each modality when it is included in a late fusion workflow. The contribution can be used to quantify the importance of each modality and to justify which sources of information are included in a multimodal framework. Our experiments indicate that the correlation between the predictions of unimodal classifiers and the Maximum Mean Discrepancy between their embeddings are both suitable metrics for unsupervised modality selection. The metrics allow us to compute thresholds which select only modalities with positive contributions, which opens the possibility of automatically discarding bad or uncertain sources of information and improving the performance on unseen domains. We hope that our findings will provide guidance for a better modality selection process in the future, one that is based on more structured and justified decisions.

Acknowledgements. This work was supported by the JuBot project sponsored by the Carl Zeiss Stiftung and Competence Center Karlsruhe for AI Systems Engineering (CC-KING) sponsored by the Ministry of Economic Affairs, Labour and Housing Baden-Württemberg.

References


  • [1] Z. Ahmad and N. Khan (2019) Human action recognition using deep multilevel multimodal (M2) fusion of depth and inertial sensors. IEEE Sensors Journal 20 (3), pp. 1445–1455. Cited by: §2.1, §3.
  • [2] Z. Ahmad and N. Khan (2020) CNN-based multistage gated average fusion (mgaf) for human action recognition using depth and inertial sensors. IEEE Sensors Journal 21 (3), pp. 3623–3634. Cited by: §2.1.
  • [3] J. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, and A. Zisserman (2020) Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems 33, pp. 25–37. Cited by: §2.1.
  • [4] H. Alwassel, D. Mahajan, B. Korbar, L. Torresani, B. Ghanem, and D. Tran (2020) Self-supervised learning by cross-modal audio-video clustering. Advances in Neural Information Processing Systems 33. Cited by: §2.1.
  • [5] S. Ardianto and H. Hang (2018) Multi-view and multi-modal action recognition with learned fusion. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1601–1604. Cited by: §2.1.
  • [6] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia systems 16 (6), pp. 345–379. Cited by: §2.2.
  • [7] F. Baradel, C. Wolf, and J. Mille (2017) Human action recognition: pose-based attention draws focus to hands. In IEEE International Conference on Computer Vision Workshops, pp. 604–613. Cited by: §2.1.
  • [8] D. Black et al. (1958) The theory of committees and elections. Cited by: §3.3.
  • [9] K. M. Borgwardt, A. Gretton, M. J. Rasch, H. Kriegel, B. Schölkopf, and A. J. Smola (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22 (14), pp. e49–e57. Cited by: §3.5.
  • [10] P. P. Busto, A. Iqbal, and J. Gall (2018) Open set domain adaptation for image and action recognition. IEEE transactions on pattern analysis and machine intelligence 42 (2), pp. 413–429. Cited by: §2.3.
  • [11] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles (2015) Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970. Cited by: §4.1.
  • [12] J. Cai, N. Jiang, X. Han, K. Jia, and J. Lu (2021) JOLO-gcn: mining joint-centered light-weight information for skeleton-based action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2735–2744. Cited by: §2.1.
  • [13] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2.1.
  • [14] J. M. Chaquet, E. J. Carmona, and A. Fernández-Caballero (2013) A survey of video datasets for human action and activity recognition. Computer Vision and Image Understanding 117 (6), pp. 633–659. Cited by: §1, §1.
  • [15] M. Chen, Z. Kira, G. AlRegib, J. Yoo, R. Chen, and J. Zheng (2019) Temporal attentive alignment for large-scale video domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6321–6330. Cited by: §2.3.
  • [16] M. Chen, B. Li, Y. Bao, and G. AlRegib (2020) Action segmentation with mixed temporal domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614. Cited by: §2.3.
  • [17] Y. Chen, S. Song, S. Li, and C. Wu (2019) A graph embedding framework for maximum mean discrepancy-based domain adaptation algorithms. IEEE Transactions on Image Processing 29, pp. 199–213. Cited by: §4.3.
  • [18] J. Choi, G. Sharma, M. Chandraker, and J. Huang (2020) Unsupervised and semi-supervised domain adaptation for action recognition from drones. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1717–1726. Cited by: §2.3.
  • [19] J. Choi, G. Sharma, S. Schulter, and J. Huang (2020) Shuffle and attend: video domain adaptation. In European Conference on Computer Vision, pp. 678–695. Cited by: §2.3.
  • [20] G. V. Cormack, C. L. Clarke, and S. Buettcher (2009) Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In International ACM SIGIR conference on Research and development in information retrieval, pp. 758–759. Cited by: §2.1.
  • [21] S. Das, R. Dai, M. Koperski, L. Minciullo, L. Garattoni, F. Bremond, and G. Francesca (2019) Toyota smarthome: real-world activities of daily living. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 833–842. Cited by: §3.1, §3.2, §4.1, §4.2, §4.3, Table 1, Table 2, Table 3.
  • [22] S. Das, R. Dai, D. Yang, and F. Bremond (2021) VPN++: rethinking video-pose embeddings for understanding activities of daily living. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.1.
  • [23] S. Das, S. Sharma, R. Dai, F. Bremond, and M. Thonnat (2020) Vpn: learning video-pose embedding for activities of daily living. In European Conference on Computer Vision, pp. 72–90. Cited by: §1, §2.1.
  • [24] N. Dawar and N. Kehtarnavaz (2018) A convolutional neural network-based sensor fusion system for monitoring transition movements in healthcare applications. In 2018 IEEE 14th International Conference on Control and Automation (ICCA), pp. 482–485. Cited by: §2.1.
  • [25] N. Dawar, S. Ostadabbas, and N. Kehtarnavaz (2018) Data augmentation in deep learning-based fusion of depth and inertial sensing for action recognition. IEEE Sensors Letters 3 (1), pp. 1–4. Cited by: §2.1.
  • [26] V. Delaitre, I. Laptev, and J. Sivic (2010) Recognizing human actions in still images: a study of bag-of-features and part-based representations. In BMVC 2010-21st British Machine Vision Conference, Cited by: §1, §1.
  • [27] C. Dhiman and D. K. Vishwakarma (2020) View-invariant deep architecture for human action recognition using two-stream motion and shape temporal dynamics. IEEE Transactions on Image Processing 29, pp. 3835–3844. Cited by: §2.1.
  • [28] H. Duan, Y. Zhao, K. Chen, D. Shao, D. Lin, and B. Dai (2021) Revisiting skeleton-based action recognition. arXiv preprint arXiv:2104.13586. Cited by: §2.1.
  • [29] Á. Elekes, M. Schäler, and K. Böhm (2017) On the various semantics of similarity in word embedding models. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 1–10. Cited by: §2.2.
  • [30] P. Emerson (2013) The original borda count and partial voting. Social Choice and Welfare 40 (2), pp. 353–358. Cited by: §2.1.
  • [31] H. Fang, S. Xie, Y. Tai, and C. Lu (2017) RMPE: regional multi-person pose estimation. In ICCV, Cited by: §3.3.
  • [32] G. Farnebäck (2003) Two-frame motion estimation based on polynomial expansion. In Scandinavian conference on Image analysis, pp. 363–370. Cited by: §3.3.
  • [33] R. Gao, T. Oh, K. Grauman, and L. Torresani (2020) Listen to look: action recognition by previewing audio. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10457–10467. Cited by: §2.1.
  • [34] M. Ghifary, W. B. Kleijn, and M. Zhang (2014) Domain adaptive neural networks for object recognition. In Pacific Rim International Conference on Artificial Intelligence, pp. 898–904. Cited by: §2.2.
  • [35] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. The Journal of Machine Learning Research 13 (1), pp. 723–773. Cited by: §3.5.
  • [36] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola (2006) A kernel method for the two-sample-problem. Advances in neural information processing systems 19. Cited by: §3.5, §3.5, §4.3.
  • [37] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur (2012) Optimal kernel choice for large-scale two-sample tests. Advances in neural information processing systems 25. Cited by: §3.5.
  • [38] T. Han, W. Xie, and A. Zisserman (2020) Self-supervised co-training for video representation learning. NeurIPS. arXiv:2010.09709. Cited by: §2.1.
  • [39] T. K. Ho, J. J. Hull, and S. N. Srihari (1994) Decision combination in multiple classifier systems. IEEE transactions on pattern analysis and machine intelligence 16 (1), pp. 66–75. Cited by: §3.3, §3.5.
  • [40] P. J. Huber (1992) Robust estimation of a location parameter. In Breakthroughs in statistics, pp. 492–518. Cited by: §3.5.
  • [41] J. Imran and P. Kumar (2016) Human action recognition using rgb-d sensor and deep convolutional neural networks. In 2016 international conference on advances in computing, communications and informatics (ICACCI), pp. 144–148. Cited by: §2.1.
  • [42] J. Imran and B. Raman (2020) Evaluating fusion of rgb-d and inertial sensors for multimodal human action recognition. Journal of Ambient Intelligence and Humanized Computing 11 (1), pp. 189–208. Cited by: §2.1.
  • [43] J. Jang, D. Kim, C. Park, M. Jang, J. Lee, and J. Kim (2020) ETRI-activity3d: a large-scale rgb-d dataset for robots to recognize daily activities of the elderly. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10990–10997. Cited by: §3.1, §3.2, §4.2, §4.3, Table 1, Table 2, Table 3.
  • [44] A. Kamel, B. Sheng, P. Yang, P. Li, R. Shen, and D. D. Feng (2018) Deep convolutional neural networks for human action recognition using depth maps and postures. IEEE Transactions on Systems, Man, and Cybernetics: Systems 49 (9), pp. 1806–1819. Cited by: §2.1.
  • [45] O. Kampman, E. J. Barezi, D. Bertero, and P. Fung (2018) Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction. arXiv preprint arXiv:1805.00705. Cited by: §2.2.
  • [46] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen (2019) Epic-fusion: audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5492–5501. Cited by: §2.1, §3.
  • [47] P. Khaire, J. Imran, and P. Kumar (2018) Human activity recognition by fusion of rgb, depth, and skeletal data. In Proceedings of 2nd International Conference on Computer Vision & Image Processing, B. B. Chaudhuri, M. S. Kankanhalli, and B. Raman (Eds.), Singapore, pp. 409–421. External Links: ISBN 978-981-10-7895-8 Cited by: §2.1.
  • [48] D. Kim, I. Lee, D. Kim, and S. Lee (2021) Action recognition using close-up of maximum activation and etri-activity3d livinglab dataset. Sensors 21 (20), pp. 6774. Cited by: §4.1.
  • [49] B. Korbar, D. Tran, and L. Torresani (2018) Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7763–7774. Cited by: §2.1.
  • [50] J. Li, C. Wang, H. Zhu, Y. Mao, H. Fang, and C. Lu (2018) CrowdPose: efficient crowded scenes pose estimation and a new benchmark. arXiv preprint arXiv:1812.00324. Cited by: §3.3.
  • [51] T. Li and L. Wang (2020) Learning spatiotemporal features via video and text pair discrimination. External Links: 2001.05691 Cited by: §2.1.
  • [52] T. Liang, G. Lin, L. Feng, Y. Zhang, and F. Lv (2021) Attention is not enough: mitigating the distribution discrepancy in asynchronous multimodal sequence fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8148–8156. Cited by: §3.5, §4.3.
  • [53] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu (2013) Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2200–2207. Cited by: §2.2.
  • [54] M. Martin, A. Roitberg, M. Haurilet, M. Horne, S. Reiß, M. Voit, and R. Stiefelhagen (2019) Drive&act: a multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2801–2810. Cited by: §1, §4.1.
  • [55] R. Memmesheimer, N. Theisen, and D. Paulus (2020) Gimme signals: discriminative signal encoding for multimodal activity recognition. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10394–10401. Cited by: §2.1.
  • [56] J. Munro and D. Damen (2020) Multi-modal domain adaptation for fine-grained action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 122–132. Cited by: §1.
  • [57] R. S. Nickerson (1998) Confirmation bias: a ubiquitous phenomenon in many guises. Review of general psychology 2 (2), pp. 175–220. Cited by: §4.1.
  • [58] B. Pan, Z. Cao, E. Adeli, and J. C. Niebles (2020) Adversarial cross-domain action recognition with co-attention. In AAAI, Vol. 34, pp. 11815–11822. Cited by: §2.3.
  • [59] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang (2010) Domain adaptation via transfer component analysis. IEEE transactions on neural networks 22 (2), pp. 199–210. Cited by: §2.2.
  • [60] R. Panda, C. Chen, Q. Fan, X. Sun, K. Saenko, A. Oliva, and R. Feris (2021) AdaMML: adaptive multi-modal learning for efficient video recognition. pp. 7576–7585. Cited by: §2.1.
  • [61] M. Patrick, Y. Asano, R. Fong, J. F. Henriques, G. Zweig, and A. Vedaldi (2020) Multi-modal self-supervision from generalized data transformations. ArXiv abs/2003.04298. Cited by: §2.1.
  • [62] C. Pham, L. Nguyen, A. Nguyen, N. Nguyen, and V. Nguyen (2021) Combining skeleton and accelerometer data for human fine-grained activity recognition and abnormal behaviour detection with deep temporal convolutional networks. Multimedia Tools and Applications 80 (19), pp. 28919–28940. Cited by: §2.1.
  • [63] A. Piergiovanni, A. Angelova, and M. S. Ryoo (2020) Evolving losses for unsupervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 133–142. Cited by: §2.1.
  • [64] N. Rai, E. Adeli, K. Lee, A. Gaidon, and J. C. Niebles (2021) CoCon: cooperative-contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3384–3393. Cited by: §2.1.
  • [65] M. Ramanathan, J. Kochanowicz, and N. M. Thalmann (2019) Combining pose-invariant kinematic features and object context features for rgb-d action recognition. International Journal of Machine Learning and Computing 9 (1), pp. 44–50. Cited by: §2.1.
  • [66] S. S. Rani, G. A. Naidu, and V. U. Shree (2021) Kinematic joint descriptor and depth motion descriptor with convolutional neural networks for human action recognition. Materials Today: Proceedings. Cited by: §2.1.
  • [67] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §3.3.
  • [68] S. Reiß, A. Roitberg, M. Haurilet, and R. Stiefelhagen (2020) Deep classification-driven domain adaptation for cross-modal driver behavior recognition. In 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1042–1047. Cited by: §2.3.
  • [69] A. Roitberg, D. Schneider, A. Djamal, C. Seibold, S. Reiß, and R. Stiefelhagen (2021) Let’s play for action: recognizing activities of daily living by learning from life simulation video games. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8563–8569. Cited by: §1, §1, §3.1, §3.2, §3.3, §3.3, §4.1, §4.2, §4.2, §4.3, §4.3, §4.3, Table 1, Table 2, Table 3.
  • [70] A. Roitberg, N. Somani, A. Perzylo, M. Rickert, and A. Knoll (2015) Multimodal human activity recognition for industrial manufacturing processes in robotic workcells. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 259–266. Cited by: §1.
  • [71] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa (2018) Learning from synthetic data: addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3752–3761. Cited by: §1.
  • [72] G. Sharma, F. Jurie, and C. Schmid (2012) Discriminative spatial saliency for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3506–3513. Cited by: §1, §1.
  • [73] X. Song, S. Zhao, J. Yang, H. Yue, P. Xu, R. Hu, and H. Chai (2021) Spatio-temporal contrastive domain adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9787–9795. Cited by: §2.3.
  • [74] Z. Sun, Q. Ke, H. Rahmani, M. Bennamoun, G. Wang, and J. Liu (2020) Human action recognition from various data modalities: a review. arXiv preprint arXiv:2012.11866. Cited by: §1, §1.
  • [75] M. van Erp, L. Vuurpijl, and L. Schomaker (2002) An overview and comparison of voting methods for pattern recognition. In Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition, pp. 195–200. Cited by: §2.1.
  • [76] C. Wang, H. Yang, and C. Meinel (2016) Exploring multimodal video representation for action recognition. In International Joint Conference on Neural Networks, pp. 1924–1931. Cited by: §2.1, §3.
  • [77] L. Wang, Z. Ding, Z. Tao, Y. Liu, and Y. Fu (2019) Generative multi-view human action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6212–6221. Cited by: §2.1, §3.
  • [78] P. Wang, W. Li, Z. Gao, Y. Zhang, C. Tang, and P. Ogunbona (2017) Scene flow to action map: a new representation for rgb-d based action recognition with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 595–604. Cited by: §2.1.
  • [79] W. Wang, H. Li, Z. Ding, and Z. Wang (2020) Rethink maximum mean discrepancy for domain adaptation. arXiv preprint arXiv:2007.00689. Cited by: §2.2, §4.3.
  • [80] X. Wang, J. He, Z. Jin, M. Yang, Y. Wang, and H. Qu (2021) M2Lens: visualizing and explaining multimodal models for sentiment analysis. IEEE Transactions on Visualization and Computer Graphics 28 (1), pp. 802–812. Cited by: §2.2.
  • [81] H. Wei, R. Jafari, and N. Kehtarnavaz (2019) Fusion of video and inertial sensing for deep learning–based human action recognition. Sensors 19 (17), pp. 3680. Cited by: §2.1.
  • [82] R. R. Wilcox and H. Keselman (2003) Modern robust data analysis methods: measures of central tendency.. Psychological methods 8 (3), pp. 254. Cited by: §3.5, §4.3, §4.3, §4.3.
  • [83] P. Wu, H. Liu, X. Li, T. Fan, and X. Zhang (2016) A novel lip descriptor for audio-visual keyword spotting based on adaptive decision fusion. IEEE Transactions on Multimedia 18 (3), pp. 326–338. Cited by: §2.2.
  • [84] F. Xiao, Y. J. Lee, K. Grauman, J. Malik, and C. Feichtenhofer (2020) Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740. Cited by: §2.1.
  • [85] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In ECCV, pp. 305–321. Cited by: §3.3.
  • [86] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu (2018) Pose Flow: efficient online pose tracking. In BMVC, Cited by: §3.3.
  • [87] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo (2017) Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2272–2281. Cited by: §2.2, §4.3.
  • [88] Z. Yao, Y. Wang, J. Wang, P. Yu, and M. Long (2021) Videodg: generalizing temporal relations in videos to novel domains. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.3.
  • [89] J. Ye, K. Li, G. Qi, and K. A. Hua (2015) Temporal order-preserving dynamic quantization for human action recognition from multimodal sensor streams. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pp. 99–106. Cited by: §2.1.
  • [90] C. Yi, S. Yang, H. Li, Y. Tan, and A. Kot (2021) Benchmarking the robustness of spatial-temporal models against corruptions. arXiv preprint arXiv:2110.06513. Cited by: §2.3.
  • [91] C. Zhao, M. Chen, J. Zhao, Q. Wang, and Y. Shen (2019) 3D behavior recognition based on multi-modal deep space-time learning. Applied Sciences 9 (4), pp. 716. Cited by: §2.1.
  • [92] Y. Zheng (2015) Methodologies for cross-domain data fusion: an overview. IEEE transactions on big data 1 (1), pp. 16–34. Cited by: §3.5, §4.3.
  • [93] H. Zou, J. Yang, H. Prasanna Das, H. Liu, Y. Zhou, and C. J. Spanos (2019) Wifi and vision multimodal learning for accurate and robust device-free human activity recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §2.1, §3.
  • [94] Q. Zou, Y. Wang, Q. Wang, Y. Zhao, and Q. Li (2020) Deep learning-based gait recognition using smartphones in the wild. IEEE Transactions on Information Forensics and Security 15, pp. 3197–3212. Cited by: §2.1.