CALDA: Improving Multi-Source Time Series Domain Adaptation with Contrastive Adversarial Learning

09/30/2021 ∙ by Garrett Wilson, et al. ∙ Washington State University

Unsupervised domain adaptation (UDA) provides a strategy for improving machine learning performance in data-rich (target) domains where ground truth labels are inaccessible but can be found in related (source) domains. In cases where meta-domain information such as label distributions is available, weak supervision can further boost performance. We propose a novel framework, CALDA, to tackle these two problems. CALDA synergistically combines the principles of contrastive learning and adversarial learning to robustly support multi-source UDA (MS-UDA) for time series data. Similar to prior methods, CALDA utilizes adversarial learning to align source and target feature representations. Unlike prior approaches, CALDA additionally leverages cross-source label information across domains. CALDA pulls examples with the same label close to each other, while pushing apart examples with different labels, reshaping the space through contrastive learning. Unlike prior contrastive adaptation methods, CALDA requires neither data augmentation nor pseudo labeling, which may be more challenging for time series. We empirically validate our proposed approach. Based on results from human activity recognition, electromyography, and synthetic datasets, we find utilizing cross-source information improves performance over prior time series and contrastive methods. Weak supervision further improves performance, even in the presence of noise, allowing CALDA to offer generalizable strategies for MS-UDA. Code is available at:




1 Introduction

Unsupervised domain adaptation can leverage labeled data from past (source) machine learning tasks when only unlabeled data are available for a new related (target) task [1]. As an example, when learning a model to recognize a person’s activities from time-series sensor data, standard learning algorithms face an obstacle when the target person does not provide ground-truth activity labels for their data. If several other persons are willing to provide these labels, then Multi-Source Unsupervised Domain Adaptation (MS-UDA) can create a model for the target person based on labeled data from these source persons. When performing adaptation, MS-UDA must bridge a domain gap. In our example, such a gap exists because of human variability in how activities are performed. Meta-domain information that is easier to collect may exist for the target person and can improve the situation through weak supervision [2], such as self-reported frequencies for each activity (e.g., “I sleep 8 hours each night”).

In this article, we develop a framework that can construct a model for time-series MS-UDA. Our proposed approach leverages labeled data from one or more source domains, unlabeled data from a target domain, and optional target class distribution. Very few domain adaptation methods handle time series [3, 2, 1] and even fewer facilitate multiple source domains or weak supervision [2]. We postulate adapting multiple time-series domains is particularly critical because many time-series problems involve multiple domains (in our example, multiple people) [4, 5]. Furthermore, we posit that existing approaches to adaptation do not make effective use of meta-domain information about the target, yet additional gains may stem from leveraging this information via weak supervision. We propose a novel framework for time series MS-UDA. This framework, called CALDA (Contrastive Adversarial Learning for Multi-Source Time Series Domain Adaptation), improves unsupervised domain adaptation through adversarial training, contrastive learning, and weak supervision without relying heavily on data augmentation or pseudo labeling like prior image-based methods. CALDA utilizes adversarial training to align the feature-level distributions between domains and utilizes contrastive learning to leverage cross-source label information for improving accuracy on the target domain.

First, CALDA guides adaptation through multi-source domain-adversarial training [6, 2]. CALDA trains a multi-class domain classifier to correctly predict the original domain for an example’s feature representation while simultaneously training a feature extractor so that the domain classifier predicts the example’s domain incorrectly. Through this two-player game, the feature extractor produces domain-invariant features. A task classifier trained on this domain-invariant representation can potentially transfer its model to a new domain because the target features match those seen during training, thus bridging the domain gap.

Second, CALDA enhances MS-UDA through contrastive learning across source domains. Contrastive learning moves the representations of similar examples closer together and the representations of dissimilar examples further apart. While this technique yielded performance gains for self-supervised [7, 8] and traditional supervised learning [9], the method is unexplored for multi-source time series domain adaptation. We propose to introduce the contrastive learning principle within CALDA. In this context, we will investigate three design decisions. First, we will analyze methods to select pairs of examples from source domains. Second, we will determine whether it is beneficial to pseudo-label and include target data as a contrastive learning domain despite the likelihood of incorrect pseudo-labels due to the large domain shifts in time series data. Third, we will assess whether to randomly select contrastive learning examples or utilize the varying complexities of different domains to select the most challenging (i.e., hard) examples.

Fig. 1: An illustration, in the projected representation space, of the difference between cross-source (blue) and within-source (red) example pairs for contrastive learning. Any-source uses both cross-source and within-source pairs.

In the case of the first decision, we hypothesize that utilizing cross-source information to select example pairs can improve transfer. Prior MS-UDA methods [2] ignore cross-source label information that can provide vital insights into how classes vary among domains (i.e., which aspects of the data truly indicate a different class label versus the same label from different domains). Utilizing this information can potentially improve transfer to the target domain. If our activity recognition class labels include “walk” and “sleep”, we want the feature representations of two different walking people to be close, but the representations to be far apart for one walking person and one sleeping person. We use CALDA to investigate whether such cross-source information aids in transfer by explicitly making use of labeled data from each source domain as well as the differences in data distributions among the source domains. Furthermore, we compare this approach with a different instantiation of our framework that utilizes only labels within each source domain, thus explicitly ignoring cross-domain information. The differences between these approaches are illustrated in Figure 1.

For the second decision, we propose to utilize contrastive learning only across source domains rather than include pseudo-labeled target data that run the risk of being incorrectly labeled. Prior single-source domain adaptation methods have integrated contrastive losses [10, 11]. Because they utilize a single source domain with the target, they rely on pseudo-labeling the target domain data, creating difficult challenges when faced with large domain gaps [12]. Because CALDA employs more than one source domain, we can leverage a contrastive loss between source domains, thereby avoiding incorrectly pseudo-labeled target data.

For the third decision, we note that prior contrastive learning work has found selecting hard examples to yield improved performance [13, 14]. However, recent theory postulates that hard examples do not need to be singled out - such examples already intrinsically yield a greater contribution to the contrastive loss than less-challenging examples [9]. We hypothesize that both random and hard sampling may offer improvements for multi-source domain adaptation, thus we evaluate both within CALDA.

CALDA integrates all of these components. As in prior work [6, 2], we utilize a domain adversary that aligns feature representations across domains, yielding a domain-invariant feature extractor. Unlike prior approaches, we further utilize contrastive learning to pull examples from different source domains together in the feature space that have the same label while pushing apart examples from different domains that have different labels, though the choice of examples to pull and push depends on the three design decisions.

Contributions. The key contribution of this paper is the development and evaluation of the CALDA framework for time-series MS-UDA. Specific contributions include:

  • We improve time-series MS-UDA by leveraging cross-source domain labels via contrastive learning without requiring data augmentation or pseudo labeling.

  • We incorporate alternative contrastive learning strategies into CALDA, resulting in several instantiations of the framework that facilitate careful analysis of the design choices.

  • We demonstrate performance improvements of CALDA over prior work with and without weak supervision. Improvements are shown for synthetic time series data and a variety of real-world time-series human activity recognition and electromyography datasets. These experiments aid in identifying the most promising CALDA instantiations, validating the importance of the adversary and unlabeled target data, and measuring the sensitivity of CALDA to noise within the weak supervision information. Code and data are available at:

2 Related Work

In this section, we discuss the related work on domain adaptation and contrastive learning in the context of our time-series MS-UDA problem setting.

2.1 Domain Adaptation

Single-source domain adaptation methods abound [1], but little work studies multi-source domain adaptation. Zhao et al. [15] develop an adversarial method supporting multiple sources by including multiple binary domain classifiers, one for each source domain. Xie et al. [16] propose a more scalable method, only requiring one multi-class domain classifier. This adversarial approach is the most similar method to the adversarial learning component of our framework. Yet, without a model compatible with time-series, this approach cannot be used for time series MS-UDA.

Limited research has investigated time-series domain adaptation. Purushotham et al. [3] develop a variational domain-adversarial method for single-source domain adaptation using a variational recurrent neural network (RNN) as the feature extractor. However, in our prior work [2] we found that, for both single-source and multi-source domain adaptation, a 1D convolutional neural network outperforms RNNs on a variety of time-series datasets. Thus, we select this network architecture for our experiments.

Domain adaptation has also been studied specifically for electromyography (EMG)-based gesture recognition, though most of this work studies a different type of domain adaptation. Rather than unsupervised domain adaptation, several methods are proposed to improve supervised domain adaptation performance [17, 18, 19], where some labeled target data are required. One method is developed for unsupervised domain adaptation [20], but it converts the EMG data to images and then applies adaptive batch normalization for domain adaptation. This approach makes the assumption that the domain differences are contained primarily within the network’s normalization statistics and not the neural network layer weights. We do not require this assumption in our CALDA framework.

Another component of this paper involves incorporating weak supervision information into domain adaptation. Weak supervision is inspired by the posterior regularization problem that Ganchev et al. [22] propose, but we consider this in a domain adaptation context. A number of other methods have been developed for related problems. Jiang et al. [23] use a related regularizer for the problem where label proportions are available for the source domains but not the target domain. Hu et al. [24] propose using a different form of weak supervision for human activity recognition from video data, using multiple incomplete or uncertain labels. Pathak et al. [25] develop a method for semantic segmentation using a weaker form of labeling than pixel-level labels. While we study weak supervision in a different context, the benefit of weak supervision in these other contexts in addition to the performance gains observed in our experiments suggests the general applicability of this idea. Additionally, prior weak supervision work fails to address the sensitivity of the weak supervision to noise [2], which we may expect with self-reported data. We analyze weak supervision in the presence of noise.

2.2 Contrastive Learning

Our framework leverages the contrastive learning principle in addition to the adversarial learning from prior works. Early uses include clustering [26] and dimensionality reduction [27]. More recently, numerous research efforts have incorporated contrastive learning. These methods typically rely on data augmentation to generate positives and negatives but sometimes use labels instead [9]. While such methods yield large gains in other contexts [7, 8], little work has explored contrastive learning for time series domain adaptation.

In the related problem of single-source domain adaptation of image data, prior methods [10, 11] consider a contrastive loss in a different manner than our method. In addition to not utilizing an adversary, which we demonstrate is a vital component to our framework, these methods rely heavily on data augmentation and pseudo labeling which may be problematic for time series. First, data augmentation is standard in image contexts, but for time series, it is not as straightforward and is not yet a standard procedure [28]. Yet, data augmentation is sometimes key to the success of image-based methods for difficult domain adaptation problems, such as in the handwritten digits to street view house numbers adaptation problem [29]. Unlike prior methods, CALDA exhibits strong performance on time series data without requiring data augmentation. Second, prior methods perform contrastive learning on the combined (single) source domain and target domain. This critically depends on accurate target domain pseudo-labeling. However, pseudo-labeling remains a challenging problem that is particularly difficult when faced with large domain gaps [12], which we observe in time series data. Because we study MS-UDA and therefore consider multiple labeled source domains, in CALDA we found it best to avoid pseudo-labeling and instead leverage the contrastive loss across source domains.

Other contrastive domain adaptation work focuses on different problems: label-less transferable representation learning for image data [30] and image adaptation to a sequence of target domains [31]. As with the other prior work, these too rely on data augmentation [30, 31] and typically pseudo labeling [31].

CALDA’s final component is hard sampling, which has been found beneficial in some contrastive learning contexts. Schroff et al. [13] found it necessary for the triplet loss, a special case of contrastive learning [9]. Similarly, Cai et al. [14] found including the top five percent of hard negatives to be both necessary and sufficient. In contrast, Khosla et al. [9] state that hard sampling is not necessary since hard examples intrinsically have larger contributions to the loss. However, we note that this inherent impact depends on having a large number of positives and negatives, which may not be optimal on all datasets, as demonstrated in our experimental results.

3 Problem Setup

Here, we formalize Multi-Source Unsupervised Domain Adaptation (MS-UDA) without and with weak supervision.

3.1 Multi-Source Unsupervised Domain Adaptation

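MS-UDA assumes that labeled data are available from multiple sources and unlabeled data are available from the target [32], with the goal of creating a model that performs well on the target domain. Formally, given $n$ source domain distributions $\mathcal{D}_{S_i}$ for $i \in \{1, 2, \dots, n\}$ and a target domain distribution $\mathcal{D}_T$, we draw $s_i$ labeled training examples i.i.d. from each source distribution $\mathcal{D}_{S_i}$ and $t_{train}$ unlabeled training instances i.i.d. from the marginal distribution $\mathcal{D}_T^X$:

$$S_i = \{(x_j, y_j)\}_{j=1}^{s_i} \sim \mathcal{D}_{S_i} \quad \forall i \in \{1, 2, \dots, n\} \tag{1}$$

$$T_{train} = \{x_j\}_{j=1}^{t_{train}} \sim \mathcal{D}_T^X \tag{2}$$

Here, each domain is distributed over the space $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input data space and $\mathcal{Y} = \{1, 2, \dots, L\}$ is the label space for $L$ classification labels. After training a MS-UDA model using $S_i$ and $T_{train}$, we test the model using a holdout set of $t_{test}$ labeled testing examples (input and ground-truth label pairs) drawn i.i.d. from the target distribution $\mathcal{D}_T$:

$$T_{test} = \{(x_j, y_j)\}_{j=1}^{t_{test}} \sim \mathcal{D}_T \tag{3}$$

In the case of time series domain adaptation, $X = [X^1, X^2, \dots, X^K]$ for $K$ time series variables or channels. Each variable $X^k$ for $k \in \{1, 2, \dots, K\}$ consists of a time series $X^k = [x_1, x_2, \dots, x_H]$ containing a sequence of real values observed at equally-spaced time steps $1, 2, \dots, H$ [33].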

3.2 MS-UDA with Weak Supervision

When MS-UDA is guided by weak supervision [2], the target-domain label proportions are additionally available during training, which we can utilize to guide the neural network’s representation. Formally, these proportions represent $P(Y)$ for the target domain, i.e., the probability $p_y$ that each example will have label $y \in \{1, 2, \dots, L\}$:

$$Y_{true}(y) = P(Y = y) = p_y \tag{4}$$

Fig. 2: The CALDA framework incorporates adversarial learning via a domain classifier and contrastive learning via a contrastive loss on an additional contrastive head in the network. Through the various instantiations of CALDA, we determine how to best utilize contrastive learning for MS-UDA.

4 CALDA Framework

We introduce CALDA, a MS-UDA framework that blends adversarial learning with contrastive learning. First, we motivate CALDA from domain adaptation theory. Second, we describe the key components: source domain error minimization, adversarial learning, and contrastive learning. Finally, we describe framework alternatives to investigate how to best construct the example sets used in contrastive loss.

4.1 Theoretical Motivation

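Zhao et al. [15] offer an error bound for multi-source domain adaptation. Given a hypothesis space $\mathcal{H}$ with VC-dimension $v$, $n$ source domains, empirical risk $\hat{\epsilon}_{S_i}(h)$ of the hypothesis $h$ on source domain $S_i$ for $i \in \{1, 2, \dots, n\}$, empirical source distributions $\hat{\mathcal{D}}_{S_i}$ for $i \in \{1, 2, \dots, n\}$ generated by $m$ labeled samples from each source domain, empirical target distribution $\hat{\mathcal{D}}_T$ generated by $mn$ unlabeled samples from the target domain, optimal joint hypothesis error $\lambda_\alpha$ on a mixture of source domains $\sum_{i=1}^{n} \alpha_i S_i$ and the target domain $T$ (average case if $\alpha_i = 1/n$ for all $i$), the target classification error bound $\epsilon_T(h)$ with probability at least $1 - \delta$ for all $h \in \mathcal{H}$ can be expressed as:

$$\epsilon_T(h) \le \sum_{i=1}^{n} \alpha_i \Big( \underbrace{\hat{\epsilon}_{S_i}(h)}_{(1)} + \underbrace{\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}\big(\hat{\mathcal{D}}_T; \hat{\mathcal{D}}_{S_i}\big)}_{(2)} \Big) + \underbrace{\lambda_\alpha}_{(3)} + \underbrace{O\bigg( \sqrt{\tfrac{1}{nm} \Big( \log\tfrac{1}{\delta} + v \log\tfrac{nm}{v} \Big)} \bigg)}_{(4)} \tag{5}$$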

In Equation 5, term (1) is the sum of source domain errors, term (2) is the sum of the divergences between each source domain and the target domain, term (3) is the optimal joint hypothesis on the mixture of source domains and the target domain, and term (4) addresses the issue of finite sample sizes. Note that the first two terms are the most relevant for informing multi-source domain adaptation methods since they can be optimized. In contrast, given a hypothesis space (e.g., a neural network of a particular size and architecture), term (3) is fixed. Similarly, term (4) regards finite samples from the source and target domains, which depends on the number of samples and for a given dataset cannot increase.

We introduce CALDA to minimize this error bound as illustrated in Figure 2. First, we train a Task Classifier to correctly predict the labeled data from the source domains, thus minimizing (1). To minimize (2), we better align domains based on adversarial learning and contrastive learning. As in prior works, we use feature-level domain invariance via domain adversarial training to align the sources and unlabeled data from the target domain. We train a Domain Classifier to predict which domain originated a representation while simultaneously training the Feature Extractor to generate domain-invariant representations that fool the Domain Classifier. Additionally, we propose a supervised contrastive loss to align the representations of same-label examples among the multiple source domains. The new loss aids in determining which aspects of the data correspond to differences in the class label (the primary concern) versus differences in the domain (e.g., person) where the data originated (which can be ignored). The new loss definition leverages both the labeled source domain data and cross-source information. This contrastive loss is applied to an additional Contrastive Head in the model. To address term (3), we consider an adequately-large hypothesis space by using a neural network of sufficient size and incorporating an architecture previously shown to handle time-series data [2, 34].

4.2 Adaptation Components

We describe source domain errors and feature-level domain invariance before moving on to our novel contrastive loss for multi-source domain adaptation, optionally with weak supervision, and the corresponding design choices.

4.2.1 Minimize Source Domain Errors

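We minimize classification error on the source domains by feeding the outputs of feature extractor $F(\cdot; \theta_f)$ to a task classifier $C(\cdot; \theta_c)$ having a softmax output. Then, we update the parameters $\theta_f$ and $\theta_c$ to minimize a categorical cross-entropy loss $\mathcal{L}_y$ using one-hot encoded true label $y$ and softmax probabilities $C(F(x))$. To handle multiple sources, we compute this loss over a mini-batch of examples drawn from each of the source domain distributions $\mathcal{D}_{S_i}$ for $i \in \{1, 2, \dots, n\}$:

$$\arg\min_{\theta_f, \theta_c} \sum_{i=1}^{n} \mathbb{E}_{(x, y) \sim \mathcal{D}_{S_i}} \big[ \mathcal{L}_y\big(y, C(F(x))\big) \big] \tag{6}$$

We employ the categorical cross-entropy loss, where $y_i$ and $p_i$ represent the $i$th components of $y$’s one-hot encoding and the softmax probability output vector, respectively:

$$\mathcal{L}_y(y, p) = -\sum_{i=1}^{L} y_i \log p_i \tag{7}$$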
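
As a minimal sketch of the source-error term, the following plain-Python snippet averages a categorical cross-entropy over one mini-batch per source domain and sums the results across domains. The function names, the `batches` structure, and the hand-written logits are illustrative assumptions rather than part of the CALDA implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(label, probs):
    """Categorical cross-entropy for one example.

    With a one-hot target, only the true-class term of the sum survives.
    """
    return -math.log(probs[label])

def source_loss(batches):
    """Sum, over source domains, of the mean cross-entropy on each mini-batch.

    `batches` is a hypothetical structure: one list per source domain, each
    holding (logits, class_index) pairs from the task classifier.
    """
    total = 0.0
    for batch in batches:
        losses = [cross_entropy(y, softmax(logits)) for logits, y in batch]
        total += sum(losses) / len(losses)
    return total

# Two source domains, three classes, hand-written logits (illustrative only).
batches = [
    [([2.0, 0.1, -1.0], 0), ([0.0, 1.5, 0.2], 1)],
    [([0.3, 0.3, 0.3], 2)],
]
loss = source_loss(batches)
```

Uniform logits over three classes give a per-example loss of $\log 3$, which is a convenient sanity check for an untrained classifier.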


4.2.2 Adversarial Learning

If we rely on only minimizing source domain error, we will obtain a classifier that likely does not transfer well to the target domain. One reason for this is that the extractor’s selected feature representations may differ widely between source domains and the target domain. To remedy this problem, we invite a domain adversary to produce feature-level domain invariance. In other words, we align the feature extractor’s outputs across domains. We achieve alignment by training a domain classifier (the “adversary”) to correctly predict each example’s domain labels (i.e., predict which domain each example originated from) while simultaneously training the feature extractor to make the domain classifier predict that the example is from a different domain.

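We define a multi-class domain classifier $D(\cdot; \theta_d)$ with a softmax output as the adversary [2]. The domain classifier follows the feature extractor $F$ in the network. However, we place a gradient reversal layer $\mathcal{R}(\cdot)$ between $F$ and $D$, which multiplies the gradient during backpropagation by a negative constant $-\lambda_d$, yielding adversarial training [6]. Given domain labels $d_T = 0$ and $d_{S_i} = i$, which map target examples to label $0$ and source examples to label $i$ for $i \in \{1, 2, \dots, n\}$, we update the model parameters $\theta_f$ and $\theta_d$:

$$\arg\min_{\theta_f, \theta_d} \sum_{i=1}^{n} \mathbb{E}_{(x, y) \sim \mathcal{D}_{S_i}} \big[ \mathcal{L}_d\big(d_{S_i}, D(\mathcal{R}(F(x)))\big) \big] + \mathbb{E}_{x \sim \mathcal{D}_T^X} \big[ \mathcal{L}_d\big(d_T, D(\mathcal{R}(F(x)))\big) \big] \tag{8}$$

This objective incorporates a categorical cross-entropy loss $\mathcal{L}_d$ similar to $\mathcal{L}_y$ that uses domain labels instead of class labels. Given the one-hot encoded representation of the true domain label $d$ and the domain classifier’s softmax probability output vector $p$, we compute the loss:

$$\mathcal{L}_d(d, p) = -\sum_{i=0}^{n} d_i \log p_i \tag{9}$$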
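
The behavior of the gradient reversal layer can be illustrated with a small plain-Python sketch: the forward pass is the identity, and the backward pass multiplies the incoming gradient by a negative constant. The class name `GradientReversal`, the parameter `lam`, and the example values are assumptions made for illustration; a real model would rely on the custom-gradient mechanism of its automatic differentiation framework.

```python
class GradientReversal:
    """Identity in the forward pass; flips and scales gradients going backward."""

    def __init__(self, lam):
        # `lam` is the positive multiplier; gradients are scaled by -lam.
        self.lam = lam

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # The feature extractor receives the negated, scaled gradient, so a
        # step that lowers the domain loss for the classifier raises it for
        # the feature extractor.
        return [-self.lam * g for g in grad_output]

grl = GradientReversal(lam=0.5)
features = [0.2, -1.3, 0.7]
out = grl.forward(features)

# Hypothetical gradient of the domain loss with respect to the features.
grad_from_domain_classifier = [1.0, -2.0, 0.5]
grad_to_extractor = grl.backward(grad_from_domain_classifier)
```

Because the multiplier only acts in the backward pass, it can double as a weight on the adversarial loss without changing the domain classifier's own update.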


4.2.3 Contrastive Learning

The above domain invariance adversarial loss does not leverage labeled data from the source domains. Prior work [2] only indirectly leverages labeled source domain data through jointly training the adversarial loss and task classifier loss on the labeled source domains. To better utilize source labels, we propose employing a supervised contrastive loss [9] to pull same-labeled examples together in the embedding space and push apart different-labeled examples, thereby making use of both positive and negative cross-source label information.

While the exact details vary based on the design decisions we will discuss in the next section, in general, contrastive learning has two roles: (1) pull same-label examples together and (2) push different-label examples apart. This process operates on pairs of representations of examples $(z_1, z_2)$. We call the first the “query” or “anchor”, i.e., $z_1 = z_q$, where $z_q$ is drawn from the set $Q$ of all example representations. To pull examples together, we create a pair $(z_q, z_p)$, where the “positive” $z_p$ is drawn from the set $P$ of all example representations having the same label as $z_q$. To push examples apart, we create another pair $(z_q, z_n)$, where the “negative” $z_n$ is drawn from the set $N$ of example representations that have a different label than $z_q$. CALDA allows additional constraints to be placed on how positives and negatives are selected, such as selecting examples from the same domain, a different domain, or any domain. Figure 1 illustrates one query-positive pair and one query-negative pair for the cross-source and within-source cases. We may create additional positive and negative pairs for each query similarly.

We propose using a supervised contrastive loss based on a multiple-positive InfoNCE loss [35, 9]. Given the projected representation of a query $z_q$, the corresponding positive and negative sets $P$ and $N$, a temperature hyperparameter $\tau$, and cosine similarity $\mathrm{sim}(z_1, z_2) = \frac{z_1^\top z_2}{\lVert z_1 \rVert \lVert z_2 \rVert}$, we obtain:

$$\mathcal{L}_c(z_q, P, N) = \frac{1}{|P|} \sum_{z_p \in P} -\log \left[ \frac{\exp\big(\mathrm{sim}(z_q, z_p)/\tau\big)}{\exp\big(\mathrm{sim}(z_q, z_p)/\tau\big) + \sum_{z_n \in N} \exp\big(\mathrm{sim}(z_q, z_n)/\tau\big)} \right] \tag{10}$$

Conceptually, for a given query, Equation 10 sums over each positive and normalizes by the number of positives. Inside of this sum, we compute what is mathematically equivalent to a log loss for a softmax-based classifier that classifies the query as the positive [8]. The denominator sums over both the positive and all the negatives corresponding to the query. Note that, alternatively, the sum over the positive set could be moved inside the log, but keeping this sum outside has been found to perform better in prior works using InfoNCE [9].
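
The loss for a single query can be sketched in a few lines of plain Python, assuming each representation is a list of floats and using cosine similarity with a temperature. The function names, the default `tau`, and the toy vectors are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(query, positives, negatives, tau=0.1):
    """Multiple-positive InfoNCE loss for one query.

    The sum over positives stays outside the log, and each term's denominator
    holds that positive plus all negatives.
    """
    neg_sum = sum(math.exp(cosine(query, n) / tau) for n in negatives)
    total = 0.0
    for p in positives:
        pos = math.exp(cosine(query, p) / tau)
        total += -math.log(pos / (pos + neg_sum))
    return total / len(positives)

query = [1.0, 0.0]
# Positive close to the query, negatives far away: small loss.
aligned = contrastive_loss(query, [[0.9, 0.1]], [[0.0, 1.0], [-1.0, 0.0]])
# Positive far from the query, negative close to it: large loss.
misaligned = contrastive_loss(query, [[0.0, 1.0]], [[0.9, 0.1]])
```

The comparison between `aligned` and `misaligned` shows the intended behavior: the loss falls as same-label representations move toward the query and different-label representations move away.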

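Finally, we update weights by summing over the queries for each source domain and normalizing by the number of queries. In some framework instantiations, we similarly compute this loss over queries from the pseudo-labeled target domain data. Formally, given the source domain queries $Q_{S_i}$, the pseudo-labeled target domain queries $Q_T$, positive and negative sets $P_{d_q, y_q}$ and $N_{d_q, y_q}$ (construction depends on the instantiation of our method, discussed next), and the indicator function $\mathbb{1}_{PL}$, which is one when pseudo-labeled target data are included and zero otherwise, we can update the model parameters $\theta_f$ and $\theta_z$:

$$\arg\min_{\theta_f, \theta_z} \sum_{i=1}^{n} \Bigg[ \frac{1}{|Q_{S_i}|} \sum_{(z_q, y_q, d_q) \in Q_{S_i}} \mathcal{L}_c\big(z_q, P_{d_q, y_q}, N_{d_q, y_q}\big) \Bigg] + \mathbb{1}_{PL} \Bigg[ \frac{1}{|Q_T|} \sum_{(z_q, \hat{y}_q, d_q) \in Q_T} \mathcal{L}_c\big(z_q, P_{d_q, \hat{y}_q}, N_{d_q, \hat{y}_q}\big) \Bigg] \tag{11}$$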


4.2.4 Total Loss and Weak Supervision Regularizer

We jointly train each of these three adaptation components. Thus, the total loss that we minimize during training is a sum of each loss: source domain errors from Equation 6, adversarial learning from Equation 8, and contrastive learning from Equation 11. We further add a weighting parameter for the contrastive loss and note that the multiplier included in the gradient reversal layer can be used as a weighting parameter for the adversarial loss.

Additionally, for the problem of MS-UDA with weak supervision, we include the weak supervision regularization term described in our prior work [2]. While individual labels for the unlabeled target domain data are unknown, this KL-divergence regularization term guides training toward learning model parameters that produce a class label distribution approximately matching the given label proportions on the unlabeled target data. This allows us to leverage target-domain label distribution information, if available.
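
One way to picture this regularizer is the plain-Python sketch below, which averages the task classifier's softmax outputs over a batch of unlabeled target examples and measures the KL divergence from the given label proportions to that average. The direction of the divergence, the function names, and the toy probabilities are assumptions made for illustration; the exact form follows the prior work cited above.

```python
import math

def mean_prediction(prob_rows):
    """Average the softmax outputs over a batch of unlabeled target examples."""
    count = len(prob_rows)
    classes = len(prob_rows[0])
    return [sum(row[k] for row in prob_rows) / count for k in range(classes)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def weak_supervision_penalty(label_proportions, target_probs):
    """Penalty that shrinks as the mean prediction approaches the proportions."""
    return kl_divergence(label_proportions, mean_prediction(target_probs))

proportions = [0.5, 0.5]  # e.g., self-reported share of time in each activity
# Predictions whose batch average matches the proportions: no penalty.
matched = weak_supervision_penalty(proportions, [[0.9, 0.1], [0.1, 0.9]])
# Predictions skewed toward the first class: positive penalty.
skewed = weak_supervision_penalty(proportions, [[0.9, 0.1], [0.8, 0.2]])
```

Note that the penalty constrains only the batch-level label distribution, so individual target predictions remain free to be confident.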


4.3 Design Decisions for Contrastive Learning

The contrastive losses used in Equation 11 involve selecting positive and negative pairs for each query. The CALDA framework facilitates multiple options for selecting these pairs as we incorporate contrastive learning into multi-source domain adaptation. Here, we formalize the CALDA instantiations along the dimensions of 1) how to select example pairs across multiple domains, 2) whether to include a pseudo-labeled target domain in contrastive learning, and 3) whether to select examples randomly or based on difficulty.

4.3.1 Multiple Source Domains

When selecting pairs of examples for MS-UDA, we may choose to select two examples within a single domain, from two different domains, or a combination. We term these variations Within-Source, Cross-Source, and Any-Source Label-Contrastive learning. Note that similar terms apply if including the pseudo-labeled target domain. Recall that the motivation behind contrastive-learning MS-UDA is to leverage cross-source information. Because cross-source information is excluded in the Within-Source case, we hypothesize that Within-Source will perform poorly, whereas Any-Source and Cross-Source, which leverage the cross-source information, will yield improved results.

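Formally, we define the sets of queries, positives, and negatives for each of these cases using set-builder notation. To simplify constructing these sets, we first create the auxiliary set $\mathcal{Z}$, which contains the projected representation $z$ of each input, its class label $y$ (or in the case of the target domain, the pseudo-label $\hat{y}$), and its domain label $d$ for all examples. Given a set of labeled examples $S_i$ from each source domain, a set of unlabeled instances $T_{train}$ from the target domain, and a projected representation $z = Z(F(x))$ defined as the feature-level representation passed through an additional contrastive head $Z(\cdot; \theta_z)$ in the model (e.g., an additional fully-connected layer), we define a set $\mathcal{Z}_S$ including both the $(z, y)$ pair and the domain label $d_{S_i}$ (as defined in the previous section) of all source domains, a set $\mathcal{Z}_T$ including both the pseudo-labeled $(z, \hat{y})$ pair and the domain label $d_T$ of the target domain, and set $\mathcal{Z}$, which is the union of $\mathcal{Z}_S$ and $\mathcal{Z}_T$:

$$\mathcal{Z}_S = \{ (z, y, d_{S_i}) \mid z = Z(F(x)),\ (x, y) \in S_i,\ i \in \{1, 2, \dots, n\} \}$$

$$\mathcal{Z}_T = \{ (z, \hat{y}, d_T) \mid z = Z(F(x)),\ x \in T_{train} \text{ with pseudo-label } \hat{y} \}$$

$$\mathcal{Z} = \mathcal{Z}_S \cup \mathcal{Z}_T$$

Using $\mathcal{Z}$, we define the query set $Q_{S_i}$ for each source domain and the query set $Q_T$ for the target domain:

$$Q_{S_i} = \{ (z, y, d) \in \mathcal{Z} \mid d = d_{S_i} \}, \qquad Q_T = \{ (z, y, d) \in \mathcal{Z} \mid d = d_T \}$$

Next, we define the positive and negative sets for each framework instantiation.

(a) Within-Source Label-Contrastive learning: Positives for each query are selected from the same domain as the query with the same label. Negatives are selected from the same domain as the query with a different label. Formally, we define the positive and negative sets $P_{d_q, y_q}$ and $N_{d_q, y_q}$ for each query of domain $d_q$ and label $y_q$ as follows:

$$P_{d_q, y_q} = \{ z \mid (z, y, d) \in \mathcal{Z},\ d = d_q,\ y = y_q \}, \qquad N_{d_q, y_q} = \{ z \mid (z, y, d) \in \mathcal{Z},\ d = d_q,\ y \neq y_q \}$$

(b) Any-Source Label-Contrastive learning: Positives for each query are selected having the same label and coming from any domain. Negatives are selected with a different label and from any domain:

$$P_{d_q, y_q} = \{ z \mid (z, y, d) \in \mathcal{Z},\ y = y_q \}, \qquad N_{d_q, y_q} = \{ z \mid (z, y, d) \in \mathcal{Z},\ y \neq y_q \}$$

(c) Cross-Source Label-Contrastive learning: Positives for each query are selected from a different domain with the same label. Negatives are selected from a different domain with a different label:

$$P_{d_q, y_q} = \{ z \mid (z, y, d) \in \mathcal{Z},\ d \neq d_q,\ y = y_q \}, \qquad N_{d_q, y_q} = \{ z \mid (z, y, d) \in \mathcal{Z},\ d \neq d_q,\ y \neq y_q \}$$

Note that these cases are distinguished based on whether a candidate example’s domain $d$ equals the query’s domain $d_q$ (Within-Source), differs from it (Cross-Source), or is unconstrained (Any-Source).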
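
The three selection rules can be sketched as a single filter over labeled examples, where only the domain constraint changes between modes. The dictionary layout, the `select_pairs` name, and the choice to exclude the query itself from its own candidate sets are assumptions made for this illustration.

```python
def select_pairs(query, examples, mode):
    """Return (positives, negatives) for one query under a domain constraint.

    Each example is a dict with keys 'id', 'y' (class label), 'd' (domain).
    `mode` is one of "within", "cross", or "any".
    """
    def domain_ok(example):
        if mode == "within":
            return example["d"] == query["d"]
        if mode == "cross":
            return example["d"] != query["d"]
        if mode == "any":
            return True
        raise ValueError("unknown mode: " + mode)

    # Assumption: the query is not paired with itself.
    candidates = [e for e in examples if e is not query and domain_ok(e)]
    positives = [e for e in candidates if e["y"] == query["y"]]
    negatives = [e for e in candidates if e["y"] != query["y"]]
    return positives, negatives

# Two source domains (persons 1 and 2) with "walk" and "sleep" labels.
examples = [
    {"id": "a", "y": "walk", "d": 1},
    {"id": "b", "y": "walk", "d": 1},
    {"id": "c", "y": "sleep", "d": 1},
    {"id": "e", "y": "walk", "d": 2},
    {"id": "f", "y": "sleep", "d": 2},
]
query = examples[0]
within = select_pairs(query, examples, "within")
cross = select_pairs(query, examples, "cross")
any_source = select_pairs(query, examples, "any")
```

In this toy setup the Any-Source sets are the union of the Within-Source and Cross-Source sets, which matches the relationship described for Figure 1.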

4.3.2 Pseudo-Labeled Target Domain

Prior contrastive learning work for single-source domain adaptation utilizes a supervised contrastive loss on the combined single source domain and the target domain. However, since this loss depends on labels, such methods require pseudo-labeling the target domain data. The methods rely on the classifier producing correct class labels, which can then be used in the supervised contrastive loss. Unfortunately, pseudo-labeling the target domain is a challenging problem, and classification errors are likely [12]. If the pseudo-labels are incorrect, then this may hurt contrastive learning performance. Because MS-UDA utilizes multiple source domains, instead of performing contrastive learning between the source and target domains, we may perform contrastive learning among the source domains, which may improve performance. We include whether to utilize pseudo-labeled target domain data during training as an additional CALDA dimension. If pseudo-labeled target domain data are included during contrastive learning, the indicator in the contrastive learning objective (Equation 11) equals one; otherwise, it equals zero.

4.3.3 Pair Selection by Difficulty

Outside of domain adaptation, selecting hard examples has been found beneficial for contrastive learning [13, 14]. However, recent theoretical work suggests that hard examples implicitly contribute more to the loss, thus mitigating the need for explicitly selecting hard examples in contrastive learning [9]. To determine if explicitly selecting hard examples is beneficial in multi-source domain adaptation, we propose a method for hard sampling in CALDA and compare it with random sampling – the final dimension of our CALDA framework. For brevity, we give the equations for Cross-Source Label-Contrastive learning, but the other variations can be constructed by changing the domain constraint of each set.

For hard sampling, we select a subset of hard positive and negative examples. This necessitates that we define “hard examples.” In each case, the domain constraint is the same for both positives and negatives, thus the key difference is whether they have the same label as the query or not. The examples that would most help the model learn this decision boundary are those that are predicted to be on the wrong side. Thus, we select hard examples as examples that are currently predicted to be on the wrong side of this decision boundary. We define hard positives as examples with the query’s label but with a different predicted label (with respect to the current model predictions) and hard negatives as examples of a class other than the query’s label but that are predicted to have the query’s label. Both are from a different domain than the query (in the Cross-Source case). Focusing on the Cross-Source Label-Contrastive case, for a query with domain and true label and the current model prediction , we define hard positive and negative sets and :


However, there is no guarantee there will always be positives and negatives that are misclassified. For example, the task classifier likely makes accurate predictions later on during training. Instead, we propose using a relaxed version of hardness: take the top- hardest positives and top- hardest negatives in terms of a softmax cross-entropy loss. To obtain a relaxation of for the positives, we can find positive examples with a high task classification loss for that example via . This is because a positive is defined as having the same label as the query, so having a high task loss for the positive corresponds to being on the wrong side of the positive-negative decision boundary. To obtain a relaxation of for the negatives, we can find negative examples with a low task classification loss where we replace the true class label with the query’s class label . This finds the negatives that are most easily misclassified as having the query’s class label, i.e., those on the wrong side of the decision boundary.

Thus, we define the relaxed hard positive and negative sets and in terms of the softmax-based cross-entropy loss , with loss thresholds and chosen such that we have positives and negatives (i.e., and ):


The contrastive weight update in Equation 11 can now be adjusted to use the relaxed hard positive and negative sets and . As an alternative to hard sampling, we may instead use random sampling to select a random subset of positives and negatives that pair with each query.
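The relaxed hard sampling described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: each example is represented as a `(probs, true_label)` pair holding the task classifier's softmax output, the candidate lists are assumed to be pre-filtered by the domain constraint, and the names `relaxed_hard_sets`, `random_sets`, `k_pos`, and `k_neg` are hypothetical.

```python
import math
import random


def cross_entropy(probs, label):
    """Cross-entropy of softmax probabilities `probs` against class `label`."""
    return -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)


def relaxed_hard_sets(query_label, positives, negatives, k_pos, k_neg):
    """Return the k_pos hardest positives and the k_neg hardest negatives."""
    # Hardest positives: highest loss under their true label (the query's label).
    hard_pos = sorted(positives, key=lambda ex: cross_entropy(ex[0], ex[1]),
                      reverse=True)[:k_pos]
    # Hardest negatives: lowest loss when scored against the query's label,
    # i.e. the negatives most easily mistaken for the query's class.
    hard_neg = sorted(negatives,
                      key=lambda ex: cross_entropy(ex[0], query_label))[:k_neg]
    return hard_pos, hard_neg


def random_sets(positives, negatives, k_pos, k_neg, seed=0):
    """Random-sampling alternative: uniform subsets of the same sizes."""
    rng = random.Random(seed)
    return (rng.sample(positives, min(k_pos, len(positives))),
            rng.sample(negatives, min(k_neg, len(negatives))))


# Query label is class 0; probabilities are made-up illustrative values.
positives = [([0.9, 0.1], 0), ([0.3, 0.7], 0), ([0.6, 0.4], 0)]
negatives = [([0.2, 0.8], 1), ([0.7, 0.3], 1)]
hard_pos, hard_neg = relaxed_hard_sets(0, positives, negatives, k_pos=1, k_neg=1)
rand_pos, rand_neg = random_sets(positives, negatives, k_pos=1, k_neg=1)
```

In this toy example the selected hard positive is the one the classifier assigns only 0.3 probability to its true class, and the selected hard negative is the one assigned 0.7 probability of being the query's class. Taking the top-k by loss is equivalent to choosing loss thresholds that admit exactly k examples, provided there are no ties.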

5 Experimental Validation

We validate our hypothesis that CALDA will improve time-series MS-UDA through contrastive learning based on experimental analysis. We also apply CALDA to synthetic and real-world datasets to address the three design decisions, with and without weak supervision. Finally, we validate the impact of the adversary and unlabeled target domain data.

5.1 Datasets

Fig. 3: Synthetic time series data are generated by summing two sine waves of different frequencies for each domain and class label, drawn from 2D multivariate normal distributions. Each source domain is generated with inter-domain or intra-domain translation or rotation domain shifts (InterT, IntraT, InterR, and IntraR). Panels: (a) InterT 10, (b) IntraT 10, (c) InterR 1.0, (d) IntraR 1.0.

5.1.1 Synthetic Data

We construct synthetic time series data to aid in comparing the different instantiations of our framework. For each synthetic domain, we generate a two-dimensional multivariate normal distribution. After drawing a sample from the normal distribution for domain and label , we construct a time series signal by summing two sine waves of frequencies and (Hz), thereby obtaining a time series example , where is a vector. This synthetic scenario is loosely motivated by activity recognition, where different activities may contain different frequency components.
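A minimal sketch of this generation step is shown below. The signal length, sampling rate, unit amplitudes, and the use of independent frequency components with a shared standard deviation are all assumptions made for illustration; the text above specifies only that two frequencies are drawn from a 2D normal distribution and that the corresponding sine waves are summed. The function name `make_example` is hypothetical.

```python
import math
import random


def make_example(freq_mean, freq_std, rng, n_steps=50, sample_rate=50.0):
    """Draw (f1, f2) from a 2D normal and sum two sine waves at those frequencies."""
    # Assumption: independent components with a shared standard deviation.
    f1 = rng.gauss(freq_mean[0], freq_std)
    f2 = rng.gauss(freq_mean[1], freq_std)
    signal = []
    for step in range(n_steps):
        t = step / sample_rate  # time in seconds
        signal.append(math.sin(2 * math.pi * f1 * t)
                      + math.sin(2 * math.pi * f2 * t))
    return (f1, f2), signal


rng = random.Random(0)  # seeded for reproducibility
freqs, signal = make_example(freq_mean=(2.0, 5.0), freq_std=0.5, rng=rng)
```

Each class in each domain would use its own `freq_mean`, so that classes differ in their frequency content.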

Because MS-UDA consists of multiple domains, we must devise a method for constructing domain shifts, or controlled changes between domains. First, we construct a target domain using the above approach. Then, we construct multiple source domains with different domain shifts: inter-domain or intra-domain translation or rotation. Inter-domain changes shift all class distributions for a source domain in a similar manner, either translating or rotating by the same amount. Intra-domain changes shift each class distribution for a source domain differently, e.g., class 1 may be translated a different amount and direction than class 2. Example domain shifts are illustrated in Figure 3. Because we perform experiments with different numbers of source domains , these types of shifts also emulate domains that vary in size, e.g., inter-domain translation with larger values of results in the source domains covering a larger region of the space compared to the target domain.
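The distinction between inter-domain and intra-domain shifts can be illustrated by applying shifts to the per-class means of the 2D frequency distributions. This is a sketch under assumptions: rotation is taken about the origin, shift magnitudes are arbitrary, and the names `shift_domain` and `per_class` are hypothetical rather than taken from the authors' code.

```python
import math


def translate(point, offset):
    """Translate a 2D point by a 2D offset."""
    return (point[0] + offset[0], point[1] + offset[1])


def rotate(point, angle):
    """Rotate a 2D point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * point[0] - s * point[1], s * point[0] + c * point[1])


def shift_domain(class_means, kind, shift, per_class=False):
    """Build a source domain's class means from the target's class means.

    Inter-domain (per_class=False): `shift` is one amount applied to every class.
    Intra-domain (per_class=True): `shift` maps each label to its own amount.
    """
    if kind not in ("translate", "rotate"):
        raise ValueError(f"unknown shift kind: {kind}")
    apply = translate if kind == "translate" else rotate
    return {label: apply(mean, shift[label] if per_class else shift)
            for label, mean in class_means.items()}


target_means = {0: (1.0, 2.0), 1: (3.0, 4.0)}
inter_t = shift_domain(target_means, "translate", (10.0, 0.0))
intra_t = shift_domain(target_means, "translate",
                       {0: (10.0, 0.0), 1: (0.0, -10.0)}, per_class=True)
inter_r = shift_domain(target_means, "rotate", math.pi / 2)
```

In the inter-domain case both classes move together, so their relative positions are preserved. In the intra-domain case each class moves independently, which changes the relative positions of the classes.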

5.1.2 Real-World Datasets

We also evaluate the efficacy of the CALDA framework on real-world multivariate time series datasets. First, we include the UCI HAR and UCI HHAR human activity recognition (HAR) datasets [4, 36], both of which consist of motion data collected from participants performing scripted activities while carrying smartphones (in HHAR, participants hold the phones in multiple orientations). We also include the WISDM AR and WISDM AT activity recognition datasets [37, 38], consisting of motion data from lab-based (WISDM AR) and unscripted real-world (WISDM AT) situations. Next, we include two electromyography (EMG) datasets: Myo EMG [19] and NinaPro Myo [39], both consisting of multivariate EMG signals from a Myo armband while participants performed various hand gestures. See Supplemental Material for details about dataset pre-processing, model architecture, and hyperparameter tuning.

Dataset      CALDA-In,R,P   CALDA-In,H,P   CALDA-Any,R,P  CALDA-Any,H,P  CALDA-XS,R,P   CALDA-XS,H,P
UCI HAR      92.3 ± 2.5     92.0 ± 3.0     92.6 ± 2.6     91.9 ± 3.3     92.1 ± 3.0     92.4 ± 3.3
UCI HHAR     89.2 ± 4.2     89.0 ± 4.5     88.8 ± 4.7     86.2 ± 5.8     89.4 ± 4.3     87.4 ± 5.6
WISDM AR     72.5 ± 9.1     71.1 ± 8.0     74.7 ± 8.3     71.3 ± 9.1     74.5 ± 8.4     73.5 ± 8.3
WISDM AT     68.6 ± 8.7     68.5 ± 7.2     61.6 ± 12.9    63.3 ± 9.8     62.0 ± 10.9    60.2 ± 12.3
Myo EMG      83.4 ± 5.5     82.4 ± 5.4     83.7 ± 5.5     82.9 ± 6.2     84.2 ± 5.3     82.0 ± 6.4
NinaPro Myo  52.0 ± 5.3     52.6 ± 4.5     51.0 ± 4.6     52.4 ± 4.1     51.5 ± 5.3     52.3 ± 4.6
Average      77.2 ± 5.9     76.7 ± 5.5     76.2 ± 6.5     75.4 ± 6.5     76.5 ± 6.2     75.4 ± 6.8
TABLE I: Ablation study of CALDA instantiations that include the target-domain data via pseudo labeling.
Dataset                CALDA-In,R   CALDA-In,H   CALDA-Any,R  CALDA-Any,H  CALDA-XS,R   CALDA-XS,H
UCI HAR                93.5 ± 2.0   93.6 ± 2.2   93.1 ± 2.3   93.7 ± 2.1   93.4 ± 2.1   93.4 ± 2.5
UCI HHAR               89.3 ± 3.9   89.4 ± 3.9   89.8 ± 3.7   88.7 ± 4.6   89.8 ± 3.9   89.4 ± 4.0
WISDM AR               79.0 ± 6.9   79.4 ± 7.7   80.2 ± 7.1   80.3 ± 6.9   80.0 ± 6.3   81.4 ± 7.9
WISDM AT               71.4 ± 8.2   71.0 ± 8.3   72.1 ± 7.5   71.2 ± 8.4   70.7 ± 6.5   71.0 ± 8.6
Myo EMG                83.0 ± 5.3   82.7 ± 6.1   83.8 ± 5.6   83.6 ± 5.4   83.4 ± 5.6   83.3 ± 5.3
NinaPro Myo            57.3 ± 3.6   56.9 ± 3.6   57.7 ± 3.8   55.9 ± 4.2   58.0 ± 3.8   56.0 ± 3.9
Synth InterT 0         93.7 ± 0.2   93.7 ± 0.1   93.8 ± 0.2   93.7 ± 0.2   93.7 ± 0.3   93.9 ± 0.1
Synth InterR 0         94.2 ± 0.2   94.2 ± 0.1   94.1 ± 0.1   94.1 ± 0.2   94.1 ± 0.1   94.0 ± 0.2
Synth IntraT 0         93.7 ± 0.2   93.8 ± 0.2   93.8 ± 0.3   93.9 ± 0.2   93.7 ± 0.3   93.6 ± 0.3
Synth IntraR 0         94.1 ± 0.2   94.1 ± 0.2   94.1 ± 0.2   94.0 ± 0.1   94.1 ± 0.2   93.9 ± 0.2
Synth InterT 10        69.8 ± 17.0  70.8 ± 15.8  69.4 ± 17.0  70.7 ± 15.0  68.4 ± 14.2  70.5 ± 14.5
Synth InterR 1.0       67.5 ± 12.5  69.9 ± 12.6  63.4 ± 10.2  77.4 ± 8.1   63.8 ± 10.3  76.3 ± 8.9
Synth IntraT 10        74.9 ± 6.4   73.6 ± 6.6   75.6 ± 8.4   75.9 ± 8.9   77.9 ± 8.9   76.2 ± 8.3
Synth IntraR 1.0       74.9 ± 8.3   74.3 ± 7.6   77.5 ± 7.0   76.0 ± 6.7   75.0 ± 7.9   76.9 ± 6.1
Real-World Avg.        79.7 ± 5.0   79.6 ± 5.4   80.2 ± 5.0   79.7 ± 5.3   79.9 ± 4.7   79.9 ± 5.4
Synth (No Shift) Avg.  93.9 ± 0.2   93.9 ± 0.1   94.0 ± 0.2   93.9 ± 0.2   93.9 ± 0.2   93.9 ± 0.2
Synth Avg.             71.8 ± 11.0  72.2 ± 10.7  71.5 ± 10.7  75.0 ± 9.7   71.3 ± 10.3  75.0 ± 9.4
TABLE II: Ablation study comparing hard and random sampling for each CALDA instantiation: within-source, any-source, and cross-source. Bold denotes highest accuracy in each row.

5.2 Ablation Studies

Using a set of ablation studies, we identify the appropriate instantiations of our CALDA framework to compare with the baseline methods. We compare instantiations across each component of our framework: (1) including the pseudo-labeled target domain (P), (2) using within-source (CALDA-In), any-source (CALDA-Any), or cross-source (CALDA-XS) examples for each query, and (3) random sampling (R) versus hard sampling (H). A one-sided paired Student’s t-test indicates whether accuracy improvements are statistically significant.
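To illustrate the mechanics of a one-sided paired t-test, the sketch below computes the paired t statistic for CALDA-Any,R versus CoDATS using the six real-world dataset means reported in Table III. This is illustrative only: the pairing used in the paper's tests may be over individual runs or target domains rather than dataset means, and the 0.05 significance level is an assumption, since the threshold is not stated in this text. The function name `paired_t_statistic` is hypothetical.

```python
import math


def paired_t_statistic(baseline, method):
    """t statistic and degrees of freedom for paired samples (method minus baseline)."""
    diffs = [m - b for b, m in zip(baseline, method)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1


# Mean accuracies on the six real-world datasets from Table III.
codats = [92.8, 88.6, 78.0, 69.7, 82.2, 55.9]
calda_any_r = [93.1, 89.8, 80.2, 72.1, 83.8, 57.7]
t_stat, dof = paired_t_statistic(codats, calda_any_r)

# One-sided critical value of Student's t for 5 degrees of freedom at the
# assumed 0.05 level (standard table value).
T_CRIT_DOF5_ALPHA05 = 2.015
significant = t_stat > T_CRIT_DOF5_ALPHA05
```

The test is one-sided because the hypothesis is directional (the method improves over the baseline), so only large positive values of the statistic count as evidence.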

5.2.1 Pseudo-Labeled Target Domain

To determine whether to include pseudo-labeled target domain data, we compare experimental results with and without pseudo labeling. Comparing Tables I and II, we find that regardless of the choices for the other components in our framework, including pseudo labeling generally performs worse than excluding it on the real-world datasets, i.e., the corresponding values in Table I are significantly lower than those in Table II (). Interestingly, random sampling typically performs better than hard sampling when using pseudo labels. This is likely because incorrectly pseudo-labeled target data may often be selected during hard sampling and thereby degrade contrastive learning performance. Random sampling partially reduces this performance degradation by reducing the likelihood that the pseudo-labeled target domain is used in contrastive learning. However, we obtain even better performance by explicitly excluding the target domain from contrastive learning, as shown in Table II. Thus, in subsequent experiments, we exclude pseudo labeling. This does not mean that pseudo labeling cannot help domain adaptation, but future work is required to determine which pseudo labeling techniques perform well in a time series context. On the synthetic datasets, we do not observe significant differences between pseudo labeling and not pseudo labeling (see Supplemental Material), but due to the real-world dataset results, we exclude pseudo labeling here as well.

5.2.2 Multiple Source Domains

We next analyze the results in Table II to determine whether to use within-source, any-source, or cross-source examples. Starting with the results on real-world datasets, the best method is consistently among the CALDA-Any and CALDA-XS variants. In particular, CALDA-Any,R is the best-performing instantiation on average with the two CALDA-XS variants ranking second, though no variation offers statistically significant improvement. Thus, we construct and include results on the additional synthetic datasets. As expected, all methods perform almost identically for no domain shift. However, when we include various types of synthetic domain shifts, we see significant differences. CALDA-Any,H and CALDA-XS,H are the best methods on average and are both significantly better than CALDA-In,R and CALDA-In,H (). Thus, a similar trend emerges from both the real-world data and synthetic data: CALDA-Any and CALDA-XS instantiations outperform CALDA-In. Since CALDA-In ignores cross-source information by only utilizing within-source examples, the other instantiations performing better validates our hypothesis that leveraging cross-source information can yield improved transfer. Thus, we conclude that CALDA-Any and CALDA-XS, which both leverage cross-source information, are the two most promising methods of selecting source domain examples for our framework.

5.2.3 Pair Selection by Difficulty

The choice between selecting examples by hard or random sampling is more variable than the above design decisions. Comparing hard sampling with random sampling on the real datasets in Table II, we observe that random sampling for CALDA-Any performs significantly better than hard sampling (). However, random vs. hard sampling differences for the other methods are not statistically significant. On the synthetic datasets with domain shifts, we observe the opposite: hard sampling versions of CALDA-Any and CALDA-XS are significantly better than random sampling (). This may indicate that this design choice depends on the type of data and domain shift.

We further investigate hard vs. random sampling by running an additional set of experiments for CALDA-XS on WISDM AR, the dataset where CALDA-XS,H performed better than all other instantiations, and specifically, where it outperformed CALDA-XS,R. The results as we increase the number of positives and negatives (two of the hyperparameters) via a positive/negative multiplier are shown in Figure 4. On the left, we observe CALDA-XS,H outperforms CALDA-XS,R. Moving to the right, the performance gain from hard sampling reduces, which is expected since hard and random sampling are no different in the limit of using all positive and negative examples in each mini-batch. We conclude that hard sampling may yield an improvement over random sampling in some situations, particularly those for which the optimal hyperparameters for a dataset include relatively few positives and negatives (such as is the case on the WISDM AR dataset).

Finally, because CALDA-Any,R performed best on the real-world data and CALDA-XS,H was tied for second best on the real-world data and tied for best on the synthetic datasets, we select these two methods as the most-promising instantiations of our framework for subsequent experiments.

Fig. 4: Comparing hard vs. random sampling as and increase.
Dataset           No Adaptation  CAN            CoDATS         CALDA-Any,R    CALDA-XS,H     Train on Target
UCI HAR           88.8 ± 4.2     89.0 ± 3.7     92.8 ± 3.3     93.1 ± 2.3     93.4 ± 2.5     99.6 ± 0.1
UCI HHAR          76.9 ± 6.3     77.5 ± 5.3     88.6 ± 3.9     89.8 ± 3.7     89.4 ± 4.0     98.9 ± 0.2
WISDM AR          72.1 ± 8.0     61.3 ± 7.5     78.0 ± 8.4     80.2 ± 7.1     81.4 ± 7.9     96.5 ± 0.1
WISDM AT          69.9 ± 7.1     66.2 ± 9.6     69.7 ± 6.6     72.1 ± 7.5     71.0 ± 8.6     98.8 ± 0.1
Myo EMG           77.4 ± 5.2     74.1 ± 6.3     82.2 ± 5.4     83.8 ± 5.6     83.3 ± 5.3     97.7 ± 0.1
NinaPro Myo       54.8 ± 3.6     56.8 ± 3.2     55.9 ± 5.0     57.7 ± 3.8     56.0 ± 3.9     77.8 ± 1.3
Synth InterT 10   62.6 ± 18.8    74.4 ± 12.3    67.0 ± 16.2    69.4 ± 17.0    70.5 ± 14.5    93.4 ± 0.2
Synth InterR 1.0  52.4 ± 7.8     79.2 ± 6.8     56.2 ± 8.8     63.4 ± 10.2    76.3 ± 8.9     94.0 ± 0.0
Synth IntraT 10   70.6 ± 8.5     85.7 ± 5.3     70.1 ± 8.9     75.6 ± 8.4     76.2 ± 8.3     93.7 ± 0.2
Synth IntraR 1.0  63.3 ± 9.0     73.4 ± 6.4     61.9 ± 8.4     77.5 ± 7.0     76.9 ± 6.1     93.6 ± 0.2
Real-World Avg.   73.9 ± 5.8     71.3 ± 6.0     78.6 ± 5.4     80.2 ± 5.0     79.9 ± 5.4     94.9 ± 0.3
Synth Avg.        62.2 ± 11.0    78.2 ± 7.7     63.8 ± 10.6    71.5 ± 10.7    75.0 ± 9.4     93.7 ± 0.2
TABLE III: Comparing target domain accuracy of the most-promising CALDA instantiations with baselines. Bold denotes CALDA outperforming baselines. Underline denotes highest accuracy in each row.

5.3 MS-UDA: CALDA vs. Baselines

To measure the performance improvement of CALDA, we compare the two most promising instantiations, CALDA-Any,R and CALDA-XS,H, with several baselines and prior work. We include an approximate domain adaptation lower bound that performs no adaptation during training (No Adaptation). This allows us to see how much performance improvement results from utilizing domain adaptation. We include an approximate domain adaptation upper bound showing the performance achievable if we did have labeled target domain data available (Train on Target). For a contrastive domain adaptation baseline, we include Contrastive Adaptation Network (CAN) [11] modified to employ a time-series compatible feature extractor. Finally, we include CoDATS [2] to see if our CALDA framework improves over prior multi-source time series domain adaptation work. The results are presented in Table III.

We first examine the No Adaptation and Train on Target baselines, which train only on the source domain data or train directly on the target domain data respectively. The performance of the No Adaptation baseline can be viewed as an approximate measure of domain adaptation difficulty, where lower No Adaptation performance indicates a more challenging problem, i.e., with a larger domain gap between the source domains and the target domain. Accordingly, we can identify WISDM AR and WISDM AT as the most challenging activity recognition datasets and NinaPro Myo as the most challenging EMG dataset. In contrast, Train on Target performs well on all but one dataset. It does this well by “cheating”, i.e., looking at the target domain labels and thereby eliminating the domain gap. However, the NinaPro Myo dataset is challenging enough that even with no domain gap, we cannot obtain near-perfect accuracy.

Now we compare CALDA-Any,R and CALDA-XS,H with the No Adaptation and CoDATS baselines. Relative to these two baselines, one of the two CALDA instantiations is always the best method. Similarly, CALDA-Any,R and CALDA-XS,H significantly outperform both No Adaptation and CoDATS across all datasets (). On the real-world datasets, the largest improvement over CoDATS is 2.4% on WISDM AT for CALDA-Any,R and 3.4% on WISDM AR for CALDA-XS,H. The largest improvement over No Adaptation is 12.9% and 12.5%, respectively, on UCI HHAR. On average, we observe a 1.6% and 1.3% improvement of these two CALDA instantiations over CoDATS and a 6.3% and 6.0% improvement over No Adaptation, respectively. On the synthetic datasets, these improvements are even larger: 7.7% and 11.2% improvement over CoDATS and 9.3% and 12.8% improvement over No Adaptation. These experimental results across a variety of real-world and synthetic time-series datasets confirm the benefit of utilizing cross-source information through our proposed CALDA framework for time-series multi-source domain adaptation.
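The improvements quoted above are absolute differences in mean accuracy (percentage points) and can be recomputed directly from the Table III means. The short sketch below does so; the dictionary layout and the helper name `largest_gain` are illustrative choices, while the numbers are copied from Table III.

```python
# (No Adaptation, CoDATS, CALDA-Any,R, CALDA-XS,H) mean accuracies from Table III.
NO_ADAPT, CODATS, CALDA_ANY_R, CALDA_XS_H = 0, 1, 2, 3
real_world = {
    "UCI HAR": (88.8, 92.8, 93.1, 93.4),
    "UCI HHAR": (76.9, 88.6, 89.8, 89.4),
    "WISDM AR": (72.1, 78.0, 80.2, 81.4),
    "WISDM AT": (69.9, 69.7, 72.1, 71.0),
    "Myo EMG": (77.4, 82.2, 83.8, 83.3),
    "NinaPro Myo": (54.8, 55.9, 57.7, 56.0),
}
real_world_avg = (73.9, 78.6, 80.2, 79.9)
synth_avg = (62.2, 63.8, 71.5, 75.0)


def largest_gain(rows, method, baseline):
    """Dataset with the largest accuracy gain of `method` over `baseline`."""
    name = max(rows, key=lambda d: rows[d][method] - rows[d][baseline])
    return name, round(rows[name][method] - rows[name][baseline], 1)


def gain(row, method, baseline):
    """Accuracy gain in percentage points, rounded to one decimal place."""
    return round(row[method] - row[baseline], 1)


best_any_r = largest_gain(real_world, CALDA_ANY_R, CODATS)
best_xs_h = largest_gain(real_world, CALDA_XS_H, CODATS)
avg_gain_any_r = gain(real_world_avg, CALDA_ANY_R, CODATS)
synth_gain_xs_h = gain(synth_avg, CALDA_XS_H, NO_ADAPT)
```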

Finally, we compare CALDA-Any,R and CALDA-XS,H with the contrastive domain adaptation baseline CAN. Both CALDA instantiations significantly outperform CAN across the real-world time series datasets (). The largest improvements over CAN on the real-world datasets are 18.9% for CALDA-Any,R and 20.1% for CALDA-XS,H on WISDM AR. On average, we observe an 8.9% and 8.6% improvement of the two CALDA instantiations over CAN. In contrast, on the synthetic datasets, CALDA only significantly outperforms CAN on one out of the four domain shifts (). This is because CAN relies on clustering for pseudo-labeling target domain data, which works well on our synthetically-generated Gaussian domain shifts. These results indicate that while CAN may be successful with some types of domain shifts, such as those found in image datasets or clustered synthetic time series, CALDA better handles the domain shifts found in real-world time series datasets.

5.4 MS-UDA with Weak Supervision

We additionally study whether our framework yields improved results for domain adaptation with weak supervision. First, we simulate obtaining target domain label proportions by estimating these proportions on the target domain training set and incorporate our weak supervision regularizer into each method to leverage this additional information. Following this, we determine the sensitivity of each method to noise in the estimated label proportions since, for example, if these label proportions are acquired from participants’ self-reports, there will be some error in the reported proportions.

5.4.1 CALDA with Weak Supervision

We compare the two most promising instantiations of CALDA-WS (CALDA-XS,H,WS and CALDA-Any,R,WS) with CoDATS-WS. The results are shown in Table IV. As in the setting without weak supervision, CALDA-Any,R,WS improves over both No Adaptation and CoDATS-WS across all datasets, and CALDA-XS,H,WS does so in all except one case, as denoted by bold (). On average, we observe a 4.1% and 3.6% improvement of the two CALDA instantiations over CoDATS-WS and a 9.8% and 9.3% improvement over No Adaptation on the real-world datasets. On the synthetic datasets, we observe a 12.3% and 16.5% improvement of CALDA over both CoDATS-WS and No Adaptation. These results demonstrate the efficacy of CALDA for domain adaptation when incorporating weak supervision.

By comparing Tables III and IV, we can measure the benefit of weak supervision. In all cases, the CALDA instantiations with weak supervision significantly improve over CALDA without weak supervision (). On the real-world datasets, this is also the case with CoDATS: CoDATS-WS improves over CoDATS (). On the real-world datasets, we observe a 3.5% and 3.3% improvement for the CALDA instantiations by including weak supervision. On the synthetic datasets, these differences are 3.0% and 3.7%. We observe the largest performance gains from utilizing weak supervision on the two unbalanced datasets: 5.0% and 3.3% improvements of the two instantiations on WISDM AR and 10.8% and 12.3% improvements on WISDM AT. Because these datasets are unbalanced, larger differences on these datasets are expected since our weak supervision regularization term capitalizes on label distribution differences among the domains. These gains demonstrate: (1) the benefit of leveraging weak supervision for domain adaptation when available, and (2) that CALDA yields improvements over prior work, even for this related problem setting.

Dataset           No Adaptation  CoDATS-WS      CALDA-Any,R,WS  CALDA-XS,H,WS   Train on Target
UCI HAR           88.8 ± 4.2     92.9 ± 3.2     94.8 ± 1.8      95.5 ± 2.1      99.6 ± 0.1
UCI HHAR          76.9 ± 6.3     88.2 ± 4.6     90.2 ± 3.8      89.8 ± 4.1      98.9 ± 0.2
WISDM AR          72.1 ± 8.0     84.9 ± 7.2     85.2 ± 6.9      84.7 ± 7.0      96.5 ± 0.1
WISDM AT          69.9 ± 7.1     72.1 ± 10.3    82.9 ± 7.3      83.3 ± 7.1      98.8 ± 0.1
Myo EMG           77.4 ± 5.2     79.5 ± 5.7     85.1 ± 4.6      84.7 ± 4.6      97.7 ± 0.1
NinaPro Myo       54.8 ± 3.6     54.9 ± 4.2     58.8 ± 3.8      56.0 ± 4.5      77.8 ± 1.3
Synth InterT 10   62.6 ± 18.8    67.5 ± 19.5    75.4 ± 18.0     78.6 ± 13.8     93.4 ± 0.2
Synth InterR 1.0  52.4 ± 7.8     57.2 ± 8.7     65.1 ± 10.9     78.2 ± 9.5      94.0 ± 0.0
Synth IntraT 10   70.6 ± 8.5     63.5 ± 8.1     78.4 ± 8.0      78.3 ± 7.1      93.7 ± 0.2
Synth IntraR 1.0  63.3 ± 9.0     60.7 ± 10.2    79.1 ± 8.1      79.9 ± 6.3      93.6 ± 0.2
Real-World Avg.   73.9 ± 5.8     79.6 ± 5.9     83.7 ± 4.7      83.2 ± 4.9      94.9 ± 0.3
Synth Avg.        62.2 ± 11.0    62.2 ± 11.6    74.5 ± 11.3     78.7 ± 9.2      93.7 ± 0.2
TABLE IV: Comparing target domain accuracy for domain adaptation methods utilizing weak supervision. Bold denotes CALDA outperforming baselines. Underline denotes highest accuracy in each row.

5.4.2 Sensitivity of Weak Supervision to Noise

By leveraging weak supervision, we were able to improve performance. However, in the above experiments, we simulated obtaining target domain label proportions by estimating those proportions exactly on the target domain training dataset. Now we perform additional experiments to determine how robust these methods are to noise in the estimated label proportions. Since weak supervision has the greatest effect when label distributions differ among domains, we compare these methods with various noise budgets on the unbalanced WISDM datasets. A noise budget of 0.1 indicates that approximately 10% of the class labels can be redistributed. In the case of human activity recognition, if all hours of the day correspond with an activity, then this represents 10% of the day being attributed to an incorrect activity when self-reporting label proportions for weak supervision. The results are shown in Table V. Note that label proportions are redistributed according to the noise budget and then re-normalized so the proportions remain a valid distribution. In the table we provide the True Post-Norm. Noise column to validate that the true post-normalized noise on average is close to the desired noise budget.
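The procedure described above (estimate label proportions, redistribute mass according to a noise budget, then re-normalize) can be sketched as follows. The exact redistribution scheme is not specified in this text, so the random signed perturbation used here is one plausible choice rather than the authors' procedure, and measuring realized noise as the L1 distance between the original and noisy proportions is likewise an assumption. All function names are hypothetical.

```python
import random
from collections import Counter


def label_proportions(labels, classes):
    """Estimate class proportions by counting labels."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def add_noise(proportions, budget, rng):
    """Perturb proportions by a total absolute amount `budget`, then re-normalize.

    Assumption: the budget is split randomly across classes with random signs.
    """
    weights = [rng.random() for _ in proportions]
    total = sum(weights)
    noisy = [p + rng.choice((-1, 1)) * budget * w / total
             for p, w in zip(proportions, weights)]
    noisy = [max(x, 0.0) for x in noisy]  # proportions cannot be negative
    norm = sum(noisy)
    return [x / norm for x in noisy]      # sums to 1 again


def realized_noise(original, noisy):
    """Post-normalization noise, measured here as L1 distance (an assumption)."""
    return sum(abs(a - b) for a, b in zip(original, noisy))


classes = ["walk", "sit", "run"]
labels = ["walk"] * 5 + ["sit"] * 3 + ["run"] * 2
props = label_proportions(labels, classes)
noisy_props = add_noise(props, budget=0.1, rng=random.Random(0))
post_norm_noise = realized_noise(props, noisy_props)
```

Because clipping and re-normalization change the perturbation after it is applied, the realized noise generally differs somewhat from the requested budget, which is why a separate post-normalization measurement is useful.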

Denoted by bold in each row, CALDA typically outperforms CoDATS both with and without weak supervision on the WISDM AR dataset (the final row corresponds to methods without weak supervision). Similarly, the best method in each row is always one of the two CALDA instantiations, denoted by underlining. We observe that even with a noise budget of 0.1, CALDA-WS and CoDATS-WS perform better than CALDA and CoDATS without weak supervision. However, beyond this threshold, we find additional noise degrades performance on WISDM AR. From these results, we conclude the maximum acceptable noise level for weak supervision on WISDM AR is between 0.1 and 0.2. In the case of WISDM AT, the two CALDA instantiations outperform CoDATS both without weak supervision and with weak supervision regardless of the amount of noise. We find that it takes a noise budget of 0.1 before CoDATS-WS degrades to the performance of CoDATS without weak supervision. However, for CALDA-WS, the noise budget can be as large as 0.4 before it degrades to the performance of CALDA without weak supervision.

Overall, on these two datasets, we find that leveraging both weak supervision and cross-source label information can yield improved domain adaptation performance, even with some noise in the weak supervision label information. However, the acceptable amount of noise depends on the dataset. On both datasets, CoDATS-WS requires a noise level of no more than approximately 0.1, and both CALDA-WS instantiations have similar limits on WISDM AR. On WISDM AT, in contrast, the noise budget for either CALDA-WS instantiation can be as high as 0.4, four times that of CoDATS-WS. Thus, we conclude that our CALDA framework improves over CoDATS with and without weak supervision and that it can yield higher robustness to noise in the weak supervision label information on some datasets.

Dataset   Weak Supervision  Noise Budget  True Post-Norm. Noise  CoDATS       CALDA-XS,H   CALDA-Any,R
WISDM AR  Yes               0.0           0.0                    84.9 ± 7.2   84.7 ± 7.0   85.2 ± 6.9
WISDM AR  Yes               0.05          0.06                   82.9 ± 7.4   83.4 ± 6.2   84.0 ± 6.0
WISDM AR  Yes               0.1           0.12                   79.1 ± 8.4   83.1 ± 7.7   81.8 ± 6.8
WISDM AR  Yes               0.2           0.22                   74.6 ± 8.8   78.9 ± 6.7   78.4 ± 8.1
WISDM AR  Yes               0.4           0.38                   64.9 ± 9.7   68.9 ± 8.8   69.9 ± 9.2
WISDM AR  No                N/A           N/A                    78.0 ± 8.4   81.4 ± 7.9   80.2 ± 7.1
WISDM AT  Yes               0.0           0.0                    72.1 ± 10.3  83.3 ± 7.1   82.9 ± 7.3
WISDM AT  Yes               0.05          0.07                   71.0 ± 11.9  81.8 ± 6.6   82.4 ± 7.3
WISDM AT  Yes               0.1           0.13                   69.9 ± 14.3  81.1 ± 7.9   82.7 ± 7.4
WISDM AT  Yes               0.2           0.23                   65.6 ± 11.4  78.3 ± 8.3   79.2 ± 7.8
WISDM AT  Yes               0.4           0.40                   56.3 ± 12.8  72.0 ± 8.6   71.3 ± 8.3
WISDM AT  No                N/A           N/A                    69.7 ± 6.6   71.0 ± 8.6   72.1 ± 7.5
TABLE V: Weak supervision sensitivity to noise. Bold denotes higher accuracy than CoDATS. Underlining denotes best method in each row.

5.5 Validating Assumptions

In this section, we evaluate the validity of two key assumptions behind our work.

5.5.1 Importance of the Adversary

Using the CALDA framework, we investigated various design choices for how to use contrastive learning for domain adaptation. However, we made the assumption that adversarial learning is an important component for each of these instantiations. Here we illustrate why. For the two most-promising instantiations, we run experiments when excluding the adversary. The results on the real-world datasets are shown in Table VI. The methods with an adversary are far superior to those when we exclude the adversary (). This justifies our inclusion of the adversary.

Dataset      CALDA-Any,R (no adversary)  CALDA-XS,H (no adversary)  CALDA-Any,R  CALDA-XS,H
UCI HAR      89.9 ± 3.4                  89.8 ± 3.5                 93.1 ± 2.3   93.4 ± 2.5
UCI HHAR     74.0 ± 7.2                  74.7 ± 6.8                 89.8 ± 3.7   89.4 ± 4.0
WISDM AR     70.9 ± 6.8                  72.1 ± 6.2                 80.2 ± 7.1   81.4 ± 7.9
WISDM AT     70.9 ± 7.7                  70.2 ± 7.9                 72.1 ± 7.5   71.0 ± 8.6
Myo EMG      76.2 ± 5.2                  76.2 ± 5.0                 83.8 ± 5.6   83.3 ± 5.3
NinaPro Myo  53.4 ± 4.0                  51.4 ± 4.4                 57.7 ± 3.8   56.0 ± 3.9
Average      73.2 ± 5.8                  73.1 ± 5.7                 80.2 ± 5.0   79.9 ± 5.4
TABLE VI: Ablation study comparing CALDA with or without an adversary. Bold denotes highest accuracy in each row.
Dataset           No Adaptation  CoDATS-DG      Sleep-DG       AFLAC-DG       CALDG-Any,R    CALDG-XS,H     Train on Target
UCI HAR           88.8 ± 4.2     88.4 ± 3.7     87.0 ± 4.8     89.3 ± 4.6     89.5 ± 3.8     90.0 ± 3.7     99.6 ± 0.1
UCI HHAR          76.9 ± 6.3     76.0 ± 6.1     75.4 ± 6.7     76.6 ± 6.4     76.6 ± 6.4     76.2 ± 6.9     98.9 ± 0.2
WISDM AR          72.1 ± 8.0     66.9 ± 8.7     66.9 ± 8.9     70.9 ± 7.8     68.9 ± 8.8     70.2 ± 7.8     96.5 ± 0.1
WISDM AT          69.9 ± 7.1     69.7 ± 7.8     68.3 ± 8.9     69.7 ± 6.6     70.7 ± 7.4     70.7 ± 7.7     98.8 ± 0.1
Myo EMG           77.4 ± 5.2     73.0 ± 5.3     73.9 ± 5.9     74.3 ± 5.5     78.4 ± 4.8     76.8 ± 5.8     97.7 ± 0.1
NinaPro Myo       54.8 ± 3.6     50.6 ± 4.3     50.4 ± 4.7     51.1 ± 3.5     55.1 ± 3.7     49.8 ± 4.6     77.8 ± 1.3
Synth InterT 10   62.6 ± 18.8    68.4 ± 13.9    68.7 ± 13.9    67.6 ± 13.4    69.8 ± 16.6    71.2 ± 15.9    93.4 ± 0.2
Synth InterR 1.0  52.4 ± 7.8     53.3 ± 8.0     53.7 ± 7.0     66.6 ± 7.7     65.4 ± 10.5    75.5 ± 8.1     94.0 ± 0.0
Synth IntraT 10   70.6 ± 8.5     66.5 ± 10.8    68.2 ± 10.2    68.5 ± 7.5     73.2 ± 8.7     74.5 ± 8.7     93.7 ± 0.2
Synth IntraR 1.0  63.3 ± 9.0     59.9 ± 7.1     62.4 ± 6.3     64.0 ± 7.8     77.3 ± 8.0     76.6 ± 6.1     93.6 ± 0.2
Real-World Avg.   73.9 ± 5.8     71.4 ± 6.1     71.0 ± 6.7     72.7 ± 5.8     73.8 ± 5.9     73.1 ± 6.1     94.9 ± 0.3
Synth Avg.        62.2 ± 11.0    62.0 ± 9.9     63.3 ± 9.3     66.7 ± 9.1     71.4 ± 11.0    74.5 ± 9.7     93.7 ± 0.2
TABLE VII: Comparing domain adaptation performance excluding unlabeled target domain data during training. Bold denotes domain generalization methods outperforming CoDATS-DG and No Adaptation baselines. Underline denotes highest accuracy in each row.

5.5.2 Importance of Unlabeled Target Domain Data

Finally, in the problem of unsupervised domain adaptation, we have unlabeled target domain data available for use during training. Unsupervised domain adaptation methods typically make the assumption that these data are useful for improving target-domain performance. Here, both CoDATS and CALDA leverage these data via adversarial learning, which as observed above is vital to domain adaptation performance. However, an alternative is to only perform adversarial learning among the multiple source domains and exclude the target-domain unlabeled data, i.e., promote domain-invariant features among only the multiple source domains through the adversarial loss. This is related to the problem of domain generalization [40]. The results for the corresponding CoDATS-DG, CALDG-XS,H, and CALDG-Any,R methods are shown in Table VII. For comparison, we also include two domain generalization methods, Sleep-DG [5] and AFLAC-DG [41]. Comparing Tables III and VII, on the real-world datasets we observe that including the unlabeled data yields significantly higher accuracy for CoDATS, CALDA-Any,R, and CALDA-XS,H (). This is similarly true on the synthetic data, but the differences are not large enough to be significant. However, on both real and synthetic datasets, CALDG-Any,R and CALDG-XS,H are significantly better than CoDATS-DG, Sleep-DG, and AFLAC-DG (). On the synthetic datasets, they are similarly better than No Adaptation (). In contrast, on the real-world datasets, No Adaptation performs the best on average, though not significantly differently from CALDG-Any,R. From these experiments, we conclude that the unlabeled target domain data make a significant contribution to our proposed CALDA method. In addition, contrastive learning appears to benefit the problem of domain generalization as well as domain adaptation, though we leave a more detailed investigation to future work.

6 Conclusions and Future Work

We propose a novel time series MS-UDA framework, CALDA, drawing on the principles of both adversarial and contrastive learning. This approach seeks to improve transfer by leveraging cross-source information, which is ignored by prior time series work. We investigate design decisions for incorporating contrastive learning into multi-source domain adaptation, including how to select examples from multiple domains, whether to include the target domain, and whether to utilize example difficulty. We observe that CALDA improves performance over prior work on a variety of real-world and synthetic time-series datasets both with and without weak supervision. In the weak supervision case, we additionally find the method is robust to label proportion noise. We also validate that both the adversary and unlabeled target domain data contribute significantly to domain adaptation performance. Future work includes developing data augmentation compatible with time series domain adaptation and pseudo labeling techniques viable for the large domain gaps observed in time series to see if these yield further improvements in transfer.


This material is based upon work supported by the National Science Foundation under Grant No. 1543656 and by the National Institutes of Health under Grant No. R01EB009675. This research used resources from CIRC at WSU.


  • [1] G. Wilson and D. J. Cook, “A survey of unsupervised deep domain adaptation,” ACM Trans. Intell. Syst. Technol., vol. 11, no. 5, Jul. 2020.
  • [2] G. Wilson, J. R. Doppa, and D. J. Cook, “Multi-source deep domain adaptation with weak supervision for time-series sensor data,” in KDD, 2020, pp. 1768–1778.
  • [3] S. Purushotham, W. Carvalho, T. Nilanon, and Y. Liu, “Variational adversarial deep domain adaptation for health care time series analysis,” in ICLR, 2017.
  • [4] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “A public domain dataset for human activity recognition using smartphones,” in ESANN, 2013.
  • [5] M. Zhao et al., “Learning sleep stages from radio signals: A conditional adversarial architecture,” in ICML, vol. 70, 2017, pp. 4100–4109.
  • [6] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle et al., “Domain-adversarial training of neural networks,” JMLR, vol. 17, no. 59, pp. 1–35, 2016.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020, pp. 1597–1607.
  • [8] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020, pp. 9729–9738.
  • [9] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot et al., “Supervised contrastive learning,” arXiv:2004.11362, 2020.
  • [10] C. Park, J. Lee, J. Yoo, M. Hur, and S. Yoon, “Joint contrastive learning for unsupervised domain adaptation,” arXiv:2006.10297, 2020.
  • [11] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann, “Contrastive adaptation network for unsupervised domain adaptation,” in CVPR, 2019, pp. 4893–4902.
  • [12] J. Choi, M. Jeong, T. Kim, and C. Kim, “Pseudo-labeling curriculum for unsupervised domain adaptation,” arXiv:1908.00262, 2019.
  • [13] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
  • [14] T. Cai, J. Frankle, D. J. Schwab et al., “Are all negatives created equal in contrastive instance discrimination?” arXiv:2010.06682, 2020.
  • [15] H. Zhao, S. Zhang, G. Wu, J. M. F. Moura, J. P. Costeira, and G. J. Gordon, “Adversarial multiple source domain adaptation,” in NeurIPS, 2018, pp. 8559–8570.
  • [16] Q. Xie, Z. Dai, Y. Du, E. Hovy, and G. Neubig, “Controllable invariance through adversarial feature learning,” in NeurIPS, 2017, pp. 585–596.
  • [17] I. Ketykó, F. Kovács, and K. Varga, “Domain adaptation for sEMG-based gesture recognition with recurrent neural networks,” in IJCNN, 2019, pp. 1–7.
  • [18] A. Ameri, M. A. Akhaee, E. Scheme, and K. Englehart, “A deep transfer learning approach to reducing the effect of electrode shift in emg pattern recognition-based control,” IEEE NSRE, vol. 28, no. 2, pp. 370–379, 2019.
  • [19] U. Côté-Allard, C. L. Fall, A. Drouin, A. Campeau-Lecours, C. Gosselin, K. Glette, F. Laviolette, and B. Gosselin, “Deep learning for electromyographic hand gesture signal classification using transfer learning,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 4, pp. 760–771, 2019.
  • [20] Y. Du, W. Jin, W. Wei, Y. Hu, and W. Geng, “Surface emg-based inter-session gesture recognition enhanced by deep domain adaptation,” Sensors, vol. 17, no. 3, p. 458, 2017.
  • [21] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu, “Adaptive batch normalization for practical domain adaptation,” Pattern Recognition, vol. 80, pp. 109–117, 2018.
  • [22] K. Ganchev, J. Gillenwater, B. Taskar et al., “Posterior regularization for structured latent variable models,” JMLR, vol. 11, no. Jul, pp. 2001–2049, 2010.
  • [23] W. Jiang, C. Miao, F. Ma, S. Yao, Y. Wang, Y. Yuan et al., “Towards environment independent device free human activity recognition,” in MobiCom, 2018, pp. 289–304.
  • [24] N. Hu, G. Englebienne, Z. Lou, and B. Kröse, “Learning to recognize human activities using soft labels,” IEEE PAMI, vol. 39, no. 10, pp. 1973–1984, Oct 2017.
  • [25] D. Pathak, P. Krahenbuhl, and T. Darrell, “Constrained convolutional neural networks for weakly supervised segmentation,” in ICCV, Dec 2015.
  • [26] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning with application to clustering with side-information,” in NeurIPS, vol. 15, no. 505–512, 2002, p. 12.
  • [27] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in CVPR, vol. 2, 2006, pp. 1735–1742.
  • [28] B. K. Iwana and S. Uchida, “An empirical survey of data augmentation for time series classification with neural networks,” PloS one, vol. 16, no. 7, p. e0254841, 2021.
  • [29] G. French et al., “Self-ensembling for visual domain adaptation,” in ICLR, 2018.
  • [30] M. Thota and G. Leontidis, “Contrastive domain adaptation,” in CVPR, 2021, pp. 2209–2218.
  • [31] P. Su, S. Tang, P. Gao, D. Qiu, N. Zhao, and X. Wang, “Gradient regularized contrastive learning for continual domain adaptation,” arXiv preprint arXiv:2007.12942, 2020.
  • [32] J. Guo, D. Shah, and R. Barzilay, “Multi-source domain adaptation with mixture of experts,” in EMNLP, 2018, pp. 4694–4703.
  • [33] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller, “Deep learning for time series classification: a review,” DMKD, vol. 33, no. 4, pp. 917–963, 2019.
  • [34] Z. Wang, W. Yan, and T. Oates, “Time series classification from scratch with deep neural networks: A strong baseline,” in IJCNN, 2017, pp. 1578–1585.
  • [35] A. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv:1807.03748, 2018.
  • [36] A. Stisen et al., “Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition,” in SenSys, 2015, pp. 127–140.
  • [37] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, “Activity recognition using cell phone accelerometers,” SIGKDD Explor. Newsl., vol. 12, no. 2, pp. 74–82, 2011.
  • [38] J. W. Lockhart, G. M. Weiss, J. C. Xue et al., “Design considerations for the wisdm smart phone-based sensor mining architecture,” in SensorKDD, 2011, pp. 25–33.
  • [39] S. Pizzolato, L. Tagliapietra, M. Cognolato, M. Reggiani, H. Müller, and M. Atzori, “Comparison of six electromyography acquisition setups on hand movement classification tasks,” PloS one, vol. 12, no. 10, 2017.
  • [40] G. Blanchard, G. Lee, and C. Scott, “Generalizing from several related classification tasks to a new unlabeled sample,” in NeurIPS, 2011, pp. 2178–2186.
  • [41] K. Akuzawa, Y. Iwasawa, and Y. Matsuo, “Adversarial invariant feature learning with accuracy constraint for domain generalization,” arXiv:1904.12543, 2019.

Appendix A Experimental Setup

Here we further document the dataset pre-processing, model architecture, hyperparameter tuning, and training algorithm for the CALDA framework and experiments from the main paper. We also provide a few additional tables which could not be included in the main paper due to space constraints, further corroborating the conclusions given in the main paper.

A.1 Hyperparameter Tuning

For CALDA, CoDATS, and No Adaptation, we performed hyperparameter tuning with a random search over the following space: learning rate , max number of positives selected , max number of negatives selected with the negative to positive ratio , contrastive loss weight , and contrastive loss temperature . Tuning was performed on a limited number of adaptation problems from each dataset: 5 target users with 3 random sets of sources for each of two values of (the lowest and highest values of for each dataset). The best hyperparameters were selected based on highest accuracy on the validation set – no method ever saw the true test set during tuning or model selection. For each dataset, the CALDA variants each use the same hyperparameters to verify the differences in results are due to the design choices rather than tuning.
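The random search described above can be sketched as follows. This is a minimal illustration rather than the tuning code used for the experiments: the candidate values in `SEARCH_SPACE` are hypothetical placeholders, the function names are our own, and deriving the number of negatives as positives times the ratio is one plausible reading of the description. The `evaluate` callback is assumed to train a model and return its validation accuracy.

```python
import random

# Hypothetical candidate values; the actual search space is not reproduced here.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-3],
    "max_positives": [5, 10],
    "neg_to_pos_ratio": [2, 4],
    "contrastive_weight": [1.0, 10.0],
    "temperature": [0.05, 0.1, 0.5],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    # Assumption: the max number of negatives is tied to the max number of
    # positives through the negative-to-positive ratio.
    config["max_negatives"] = config["max_positives"] * config["neg_to_pos_ratio"]
    return config


def random_search(evaluate, num_trials, seed=0):
    """Return the sampled configuration with the highest validation accuracy."""
    rng = random.Random(seed)
    best_config, best_acc = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        acc = evaluate(config)  # validation accuracy only, never test accuracy
        if acc > best_acc:
            best_config, best_acc = config, acc
    return best_config, best_acc
```

Selecting on validation accuracy alone keeps the test set untouched during tuning, which matches the protocol stated above.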

For the CAN baseline, we follow the same procedure as for CALDA, but the set of hyperparameters differs. We perform a search over: base learning rate , source/target batch size , alpha , beta , and loss weight . These values are inclusive of the best parameters found in the original CAN experiments, but we extend the search space for these neural network hyperparameters since the time series feature extractor is sufficiently different from the image neural network. For CAN clustering, we retain the same hyperparameters as in the CAN paper: source/target clustering batch size of 600, clustering budget of 1000, and max loops of 50.

A.2 Neural Network Model and Training

We employ a neural network previously demonstrated to work well on time-series data [2, 34]. To add support for contrastive learning, we include an additional contrastive head [10, 14], consisting of a single 128-unit fully-connected layer added to the model following the feature extractor. We apply the contrastive loss to the representations output by this additional layer. The entire model architecture is illustrated in Figure 5.

Following the setup for CoDATS [2], each model was trained for 30,000 iterations with the Adam optimizer, a batch size of 128, and the domain-adversarial learning rate schedule from Ganin et al. [6]. Because weak supervision depends on a sufficient amount of unlabeled target domain data to estimate the predicted label proportions in each batch, for weak supervision we divide each batch equally between the source domains as a group (with that half further split equally among the individual source domains) and the target domain [2]. Without weak supervision, an evenly-split batch division between domains was used. For the additional hard vs. random experiment, we instead used a batch of size 64 because, while the same trend holds for a batch of size 128, it is less visible.
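To make the weak-supervision batch division and the learning rate annealing concrete, a minimal sketch is given below. The function names are our own. The annealing formula and its constants (10 and 0.75) are those reported by Ganin et al. [6]; that these exact constants carry over to our experiments is an assumption of this sketch.

```python
def split_batch_weak_supervision(batch_size, num_sources):
    """Give half of the batch to the target domain and divide the other half
    equally among the source domains (any remainder is simply dropped here)."""
    target_size = batch_size // 2
    per_source_size = (batch_size - target_size) // num_sources
    return target_size, per_source_size


def annealed_learning_rate(step, total_steps, base_lr, alpha=10.0, beta=0.75):
    """Learning rate annealing of Ganin et al. [6]: lr = base_lr / (1 + alpha * p) ** beta,
    where p is the training progress in [0, 1]."""
    progress = step / total_steps
    return base_lr / (1.0 + alpha * progress) ** beta


# Example: a batch of 128 with 4 source domains.
target_size, per_source_size = split_batch_weak_supervision(128, 4)
```

With a batch of 128 and 4 source domains, the target domain receives 64 examples and each source domain receives 16.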

For a contrastive domain adaptation baseline, we include the CAN method [11]. However, CAN was designed for use with image datasets. To make it compatible with time series datasets, we replace the image neural network with the same time series feature extractor used in CoDATS and CALDA. This allows comparing the CAN method with CALDA on time series data.

Fig. 5: CoDATS multi-source time series domain adaptation model architecture for source domains and class labels, with an additional contrastive head to support our contrastive loss.

A.3 Evaluation Metrics

To evaluate each method, we follow the MS-UDA evaluation protocol from prior work [2]. We select 5 different values of to determine how well each method works across various numbers of source domains. These numbers of source domains are as follows: for UCI HAR, for UCI HHAR, for WISDM AR, for WISDM AT, for Myo EMG, for NinaPro Myo, and for Synth. However, to better observe overall trends across various numbers of source domains and due to space constraints, we average over these multiple values of for each dataset.

For each value of , we select 10 random target domains and 3 random sets of source domains for each of those target domains. Finally, we compute the average classification accuracy of each method on the hold-out target domain test sets. Thus, each point in the results of the main paper is an average of approximately 150 experiments (30 for each value of ), though results for each individual value of are provided below in Tables X and XI. The error given for each experiment is the average of the standard deviation over each set of 3 random sets of source domains, i.e., this error indicates the sensitivity of each method to the 3 different source domain selections and also to the 3 different random initializations of the networks. This is in contrast to typical variances given outside a multi-source domain adaptation context, where the variances only indicate variation over several random initializations since they do not have multiple source domain choices available. Overall, this evaluation procedure allows us to compare each method across a wide array of MS-UDA adaptation problems.
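The aggregation described above can be sketched as follows. This is an illustrative sketch with our own function name, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
import statistics


def summarize_accuracies(accuracies_by_target):
    """Aggregate results for one method and one number of source domains.

    accuracies_by_target: one inner list per target domain, holding the test
    accuracy obtained with each of its 3 random sets of source domains.
    Returns the mean accuracy over all runs and the standard deviation
    averaged over target domains.
    """
    all_runs = [acc for runs in accuracies_by_target for acc in runs]
    mean_accuracy = statistics.mean(all_runs)
    # Assumption: sample standard deviation within each group of 3 runs.
    per_target_std = [statistics.stdev(runs) for runs in accuracies_by_target]
    return mean_accuracy, statistics.mean(per_target_std)
```

For instance, two target domains with accuracies `[90.0, 92.0, 94.0]` and `[80.0, 80.0, 80.0]` give a mean accuracy of 86.0 and an averaged standard deviation of 1.0.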

A.4 Dataset Preprocessing

The synthetic datasets were generated with 12 domains and 3 classes for each domain. The 12 domains allow for with exactly 3 random sets of source domains for the largest number of source domains . The time series signals were generated at 250 Hz with a window length of 0.2 seconds, yielding a window of 50 samples. Each sine wave had uniform amplitude and no phase shift.
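As a rough illustration of the window dimensions, the sketch below generates a single sine wave at the stated sampling rate and window length. The frequency argument is a hypothetical placeholder; how frequencies are assigned to classes and domains is not covered by this sketch.

```python
import numpy as np


def sine_window(freq_hz, sample_rate_hz=250, window_sec=0.2):
    """Generate one window of a unit-amplitude, zero-phase sine wave."""
    num_samples = int(round(sample_rate_hz * window_sec))  # 250 Hz * 0.2 s = 50
    t = np.arange(num_samples) / sample_rate_hz
    return np.sin(2.0 * np.pi * freq_hz * t)


# Example with a hypothetical 5 Hz component.
window = sine_window(5.0)
```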

The UCI HAR dataset contains accelerometer x, y, and z; gyroscope x, y, and z; and estimated body acceleration x, y, and z signals for 30 participants [4]. This data was collected at 50 Hz and is segmented into 2.56 second windows (i.e., each window contains 128 time steps). This dataset contains 6 activity labels: walking, walking upstairs, walking downstairs, sitting, standing, and laying.

The UCI HHAR dataset contains both accelerometer and gyroscope data [36]. However, the authors only utilize one of these sensor modalities at a time and find accelerometer data to be the superior choice for human activity recognition. Thus, on this dataset, we similarly use the three-axis accelerometer data in our experiments. We include the data from the 31 participants carrying smartphones, each of which was sampled at the highest-supported sampling rate, and segment this into windows of 128 time steps. This dataset includes data from the following activities: biking, sitting, standing, walking, walking upstairs, and walking downstairs.

The WISDM AR dataset contains accelerometer x, y, and z data collected from a large number of participants [37]. However, many of these participants have very little data. Thus, we only include data from the 33 participants with sufficient data. The accelerometer data is collected at 20 Hz, and we segment this into non-overlapping windows of 128 time steps. The activity labels for WISDM AR are: walking, jogging, sitting, standing, walking upstairs, and walking downstairs.

The WISDM AT dataset similarly contains accelerometer data [38]. Like WISDM AR, the amount of data collected from each participant varies widely, so we only include data from the 51 participants with a sufficient amount of labeled data. The accelerometer is sampled at 20 Hz. We segment this data into non-overlapping windows of 128 time steps. WISDM AT contains the following activity labels: walking, jogging, ascending/descending stairs, sitting, standing, and lying down.

The Myo EMG dataset [19] contains 8-channel EMG data from a Myo armband collected at 200 Hz while different people performed various hand gestures. This dataset consists of 7 hand gestures: neutral, radial deviation, wrist flexion, ulnar deviation, wrist extension, hand close, and hand open. Data was collected from 40 people, but the authors of the dataset proposed pre-training on the data from the first 22 participants, so they provide only training sets for these participants. Thus, we include all participants as potential source domains but only include the later 18 participants as target domains. As in their paper, we use 260 ms windows (i.e., 52 samples) with an overlap of 235 ms.
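The windowing used across these datasets can be sketched with a single helper, shown below as a minimal NumPy illustration with our own function name. At 200 Hz, a 260 ms window is 52 samples and a 235 ms overlap is 47 samples, so consecutive EMG windows start 5 samples apart; the non-overlapping activity recognition windows correspond to a stride equal to the window length of 128.

```python
import numpy as np


def sliding_windows(signal, window_len, stride):
    """Segment a (time, channels) array into (num_windows, window_len, channels).

    Trailing samples that do not fill a complete window are discarded.
    """
    num_steps, num_channels = signal.shape
    starts = range(0, num_steps - window_len + 1, stride)
    if len(starts) == 0:
        return np.empty((0, window_len, num_channels), dtype=signal.dtype)
    return np.stack([signal[s:s + window_len] for s in starts])


SAMPLE_RATE_HZ = 200
window_len = round(0.260 * SAMPLE_RATE_HZ)  # 52 samples
overlap = round(0.235 * SAMPLE_RATE_HZ)     # 47 samples
stride = window_len - overlap               # 5 samples between window starts
```

Calling `sliding_windows(x, 128, 128)` on a three-axis accelerometer recording gives the non-overlapping segmentation, while `sliding_windows(x, window_len, stride)` on an 8-channel EMG recording gives the overlapping one.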

The NinaPro Myo dataset consists of data from the NinaPro DB5 Exercise 2 dataset [39], processed similarly to the Myo EMG dataset to better align with the problem setup in the Myo EMG paper [19]. We use the same subset of classes as in Myo EMG and only the 8-channel EMG signals from the lower Myo armband. Additionally, while the authors proposed an electrode shifting/rotation mechanism to better align data across participants, we found this additional procedure to be unnecessary for our domain adaptation method. Thus, we perform no such electrode shifting. This dataset consists of data from 10 participants at 200 Hz, and we use the same window size and overlap as the Myo EMG dataset. To avoid overlap between data from gestures in the training and testing sets, we select the first 5 repetitions of each gesture as the training data and the final repetition as the test data, which gives approximately an 80%-20% train-test split.

For each HAR dataset, the data from each participant was split into an 80% training set and a 20% testing set, with the training set similarly split into training and validation sets. The train-test splits for the EMG datasets were described above, and the training set of each was further split into training and validation sets of 80% and 20% respectively. After the selection of source and target domains for each experiment, we select the corresponding training, validation, and test sets for each participant. Data are normalized to have zero mean and unit variance with statistics computed on only the training set. The model selected for evaluation is the checkpoint that performs best on the validation set.
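The split-then-normalize step can be sketched as below. This is a simplified illustration with our own function name: taking the first 80% of windows as the training portion and computing one mean and standard deviation per channel are both assumptions. The key point it shows is that the statistics come from the training portion only and are then reused for the held-out portion.

```python
import numpy as np


def split_and_normalize(windows, train_frac=0.8):
    """Split (num_windows, time, channels) data and normalize each channel
    using statistics computed on the training portion only."""
    num_train = int(len(windows) * train_frac)
    train, held_out = windows[:num_train], windows[num_train:]
    # Assumption: one mean/std per channel, pooled over windows and time steps.
    mean = train.mean(axis=(0, 1), keepdims=True)
    std = train.std(axis=(0, 1), keepdims=True)
    std = np.where(std == 0, 1.0, std)  # guard against constant channels
    return (train - mean) / std, (held_out - mean) / std
```

The same function can be applied a second time to the training portion to obtain the training and validation sets.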

Appendix B Additional Experimental Results

While the key results were presented in the main paper, here we provide a few additional tables that could not be included in the main paper due to space constraints. These further results corroborate our conclusions from the main paper. Table VIII provides the ablation results when including pseudo-labeled target domain data on the synthetic datasets. Table IX provides the results of the ablation study without the adversary on the synthetic datasets. Tables X and XI show the results comparing CALDA with baseline methods when varying the number of source domains for MS-UDA. Similarly, Tables XII and XIII show the results for MS-UDA with weak supervision.

Synth InterT 10 | 72.1 ± 12.8 | 72.7 ± 14.1 | 71.5 ± 13.6 | 71.1 ± 15.1 | 70.7 ± 13.8 | 72.4 ± 13.7
Synth InterR 1.0 | 69.7 ± 11.6 | 70.1 ± 10.1 | 65.4 ± 10.4 | 77.9 ± 9.2 | 65.2 ± 10.7 | 76.6 ± 9.4
Synth IntraT 10 | 66.7 ± 6.4 | 68.4 ± 7.0 | 71.8 ± 7.5 | 73.7 ± 11.2 | 69.4 ± 7.3 | 75.0 ± 8.9
Synth IntraR 1.0 | 77.7 ± 6.1 | 77.6 ± 7.5 | 77.9 ± 6.9 | 75.9 ± 3.3 | 78.0 ± 6.4 | 75.6 ± 4.3
Average | 71.6 ± 9.2 | 72.2 ± 9.7 | 71.7 ± 9.6 | 74.7 ± 9.7 | 70.8 ± 9.6 | 74.9 ± 9.1
TABLE VIII: Ablation study of CALDA instantiations that include the target-domain data via pseudo labeling on synthetic datasets.
Synth InterT 10 | 70.0 ± 16.4 | 69.3 ± 14.0 | 69.4 ± 17.0 | 70.5 ± 14.5
Synth InterR 1.0 | 68.8 ± 12.0 | 76.2 ± 9.3 | 63.4 ± 10.2 | 76.3 ± 8.9
Synth IntraT 10 | 74.7 ± 7.6 | 75.5 ± 8.6 | 75.6 ± 8.4 | 76.2 ± 8.3
Synth IntraR 1.0 | 73.8 ± 7.8 | 77.2 ± 5.5 | 77.5 ± 7.0 | 76.9 ± 6.1
Average | 71.8 ± 10.9 | 74.6 ± 9.4 | 71.5 ± 10.7 | 75.0 ± 9.4
TABLE IX: Ablation study comparing CALDA with or without an adversary on synthetic datasets. Bold denotes highest accuracy in each row.
Dataset | Number of Sources | No Adaptation | CAN | CoDATS | CALDA-Any,R | CALDA-XS,H | Train on Target
UCI HAR | 2 | 74.5 ± 12.2 | 84.1 ± 7.4 | 91.3 ± 3.6 | 92.1 ± 4.1 | 91.2 ± 4.8 | 99.6 ± 0.1
UCI HAR | 8 | 90.4 ± 3.0 | 88.3 ± 4.1 | 93.7 ± 2.8 | 92.6 ± 3.0 | 93.4 ± 2.4 | 99.6 ± 0.1
UCI HAR | 14 | 92.4 ± 2.8 | 90.5 ± 2.9 | 92.9 ± 3.9 | 93.5 ± 2.3 | 94.4 ± 2.6 | 99.6 ± 0.1
UCI HAR | 20 | 93.2 ± 1.6 | 90.5 ± 2.7 | 92.7 ± 3.4 | 93.6 ± 1.1 | 94.1 ± 1.5 | 99.6 ± 0.1
UCI HAR | 26 | 93.4 ± 1.5 | 91.7 ± 1.5 | 93.5 ± 2.8 | 94.0 ± 1.0 | 93.8 ± 1.4 | 99.6 ± 0.1
UCI HAR | Avg | 88.8 ± 4.2 | 89.0 ± 3.7 | 92.8 ± 3.3 | 93.1 ± 2.3 | 93.4 ± 2.5 | 99.6 ± 0.1
UCI HHAR | 2 | 68.4 ± 7.5 | 73.3 ± 7.1 | 87.6 ± 5.1 | 88.8 ± 4.4 | 88.7 ± 4.7 | 98.9 ± 0.2
UCI HHAR | 3 | 74.8 ± 6.8 | 77.5 ± 6.3 | 88.6 ± 4.3 | 90.2 ± 3.8 | 89.7 ± 3.5 | 98.9 ± 0.2
UCI HHAR | 4 | 77.9 ± 7.2 | 79.0 ± 5.0 | 89.4 ± 3.4 | 90.2 ± 3.6 | 89.9 ± 4.1 | 98.9 ± 0.2
UCI HHAR | 5 | 80.9 ± 5.2 | 79.8 ± 4.3 | 89.0 ± 3.2 | 90.0 ± 3.4 | 89.6 ± 3.9 | 98.9 ± 0.2
UCI HHAR | 6 | 82.4 ± 4.9 | 78.0 ± 3.6 | 88.6 ± 3.4 | 89.7 ± 3.4 | 89.3 ± 3.6 | 98.9 ± 0.2
UCI HHAR | Avg | 76.9 ± 6.3 | 77.5 ± 5.3 | 88.6 ± 3.9 | 89.8 ± 3.7 | 89.4 ± 4.0 | 98.9 ± 0.2
WISDM AR | 2 | 55.2 ± 13.0 | 48.8 ± 10.1 | 65.6 ± 13.9 | 68.6 ± 13.3 | 69.9 ± 12.3 | 96.5 ± 0.1
WISDM AR 8 69.6 8.2 57.8 7.8 76.1 8.8 77.9 8.2 79.9