Causal Mechanism Transfer Network for Time Series Domain Adaptation in Mechanical Systems

by Zijian Li, et al.

Data-driven models are becoming essential parts of modern mechanical systems, commonly used to capture the behavior of various equipment and varying environmental characteristics. Although these models adapt well to highly dynamic conditions and aging equipment, they are usually hungry for massive labels over historical data, mostly contributed by human engineers at an extremely high cost. The label demand is now the major limiting factor to modeling accuracy, hindering the fulfillment of visions for applications. Fortunately, domain adaptation enhances model generalization by utilizing the labelled source data together with the unlabelled target data, so that a model can be reused across domains. However, mainstream domain adaptation methods cannot achieve ideal performance on time series data, because most of them focus on static samples, and even the existing time series domain adaptation methods ignore properties of time series data such as the temporal causal mechanism. In this paper, we assume that the causal mechanism is invariant and present our Causal Mechanism Transfer Network (CMTN) for time series domain adaptation. By capturing and transferring the dynamic and temporal causal mechanisms of multivariate time series data and alleviating the time lags and different value ranges among different machines, CMTN allows data-driven models to exploit existing data and labels from similar systems, such that the resulting model on a new system is highly reliable even with very limited data. We report our empirical results and lessons learned from two real-world case studies, on chiller plant energy optimization and boiler fault detection, in which CMTN outperforms the existing state-of-the-art method.


I Introduction

Explosively growing data from the Internet of Things (IoT) is now flooding into data management systems for processing and analysis. The availability of massive historical data, powerful deep learning frameworks and abundant computation power together boosts the development of new data-driven models for complex mechanical systems, which are used to characterize the behavior of systems running under highly dynamic external conditions and with aging equipment.

While data-driven models have achieved significant performance on different systems with various equipment and objectives, such as chiller plants [34] and vacuum pumps [21], these successes may not be easily repeated in another setting, mostly due to the unaffordable cost of meeting the expected data quality and quantity. In particular, powerful machine learning models are usually hungry for large amounts of high-quality data, such as labels on equipment failure events, which require huge human effort in reading and annotating the historical data. To better address these demands on training data, we discuss two common limitations we meet in real-world applications.

Fig. 1: Illustration of three challenges of time series domain adaptation in boiler data. The purple, orange and green lines represent temperature ($T$) sensor values, operation status ($O$) sensor values and pressure ($P$) sensor values, respectively. (a) Inter-domain value range shift may appear across domains, e.g., the value range in the source domain is smaller than that in the target domain, which may lead to a larger generalization error if we train on source domain data and test on target domain data directly. (b) Inter-domain time lag shift: the time lag of the temporal causal mechanism differs across domains, so domain adaptation should capture the causal mechanism while ignoring the time lag. (c) Intra-domain causal mechanism shift also exists in time series data. For example, the pressure and temperature jointly affect the operation status. This relationship within multivariate time series data should be captured by the domain adaptation model. (Best viewed in color.)

The first limitation is the lack of configuration coverage. In real-world mechanical systems, there are usually a variety of control parameters. In a chiller plant, for example, the parameters include the variable speed drive (VSD) controls and the frequencies of the pumps and cooling towers in the plant. Due to the limited variety of conventional chiller plant control strategies, a chiller plant is operated under a small number of candidate configurations of the control parameters. This leads to a potential risk of overfitting of the data-driven model. The second limitation is the lack of label coverage. In real-world IoT systems, events of interest, e.g., failures of pumps [21], are usually very rare. Every individual failure event, on the other hand, may cause huge financial loss. Supervised learning, however, builds reliable and meaningful models only when there is sufficient labelled data linked to the detection/prediction target event.

Fortunately, domain adaptation, e.g., [2, 11, 12, 15, 22], can lift this restriction in our IoT applications. It enables the system to reuse existing data from similar systems when building models over a new system with limited data, by aligning the features and transforming the old model based on observations over these domains. However, most existing domain adaptation approaches are designed for non-sequential domains with a fixed number of dimensions, and their neglect of temporal information is an important source of performance degradation when they are applied to time series data directly. Recently, domain adaptation for time series data has received wide attention. [26] adversarially captures complex and domain-invariant temporal relationships by using a variational recurrent neural network [7]. However, this method ignores the causal mechanism in time series data, because it mainly takes the hidden state of the final time step into account instead of the hidden states of all the time steps.

In order to address the aforementioned challenges, we consider what can be transferred and what hinders transferability in time series domain adaptation. First, we assume that the causal mechanism is invariant, because the physical mechanism is invariant among domains and the causal mechanism is one kind of physical mechanism. Here a causal mechanism denotes a directed path between two random variables; in short, a set of cause variables has an impact on a set of effect variables [25]. According to our observation, there are two significant causal mechanisms in the time series data of mechanical systems. One is the dynamic causal mechanism, which means that one sensor value influences another sensor value at any time step. The other is the temporal causal mechanism, which means that the past values of one sensor contain information that helps predict the values of another sensor.

However, in order to transfer the causal mechanism, three obstacles need to be tackled, as shown in Figure 1. (1) Inter-domain value range shift means that the value ranges of sensors vary with domains. For example, the value range of the temperature sensors varies with the location of the machine, and a model trained on a machine with a lower temperature range might not be suitable for a machine with a higher temperature range. (2) Inter-domain time lag shift means that the time lags of causal effects vary with domains. Consider the ideal gas law $PV = nRT$, where $P$, $V$ and $T$ are the pressure, volume and absolute temperature respectively, $R$ is the ideal gas constant and $n$ is the number of moles of gas. Because different boilers use different kinds of fuels, the ratio between temperature and pressure is different, which leads to different response times. (3) Intra-domain causal mechanism shift refers to the causal mechanism among sensors at each time step. For example, in Figure 1, the drop of temperature ($T$) causes the drop of pressure ($P$), i.e., $T \rightarrow P$, and the operation status ($O$) is decided by temperature and pressure jointly, i.e., $(T, P) \rightarrow O$.

In this paper, we utilize the invariant causal mechanism and overcome the aforementioned obstacles by proposing a novel causal mechanism transfer network (CMTN). First, because of the different value ranges in different domains, we devise separate feature extractors for the source and target domains. Then, we introduce two kinds of attention mechanisms to transfer the two kinds of causal mechanisms, according to our observations over the real data as shown in Figure 1. To tackle the inter-domain time lag shift, we propose the transferable temporal attention mechanism. To tackle the intra-domain causal mechanism shift, we propose the transferable intra-sensor attention mechanism. Furthermore, we apply CMTN to two case studies, chiller plant optimization under a lack of configuration coverage and boiler failure detection under a lack of label coverage, and achieve significant improvements in modeling accuracy and consequently promising performance in their respective settings.

The rest of the paper is organized as follows. Section II reviews existing studies on time series modeling, domain adaptation, domain adaptation on time series and attention mechanisms. Section III provides the problem definition of time series domain adaptation and the adversarial domain adaptation model. Section IV presents the motivation, based on observations over time series data in mechanical systems, and our causal mechanism transfer network for time series domain adaptation. Section V presents case studies in two completely different areas and conducts an ablation study on CMTN. Section VI concludes the paper with a discussion of future work.

II Related Work

In this section, we first review the existing techniques for time series modeling and domain adaptation, and then give a brief introduction to time series domain adaptation and attention mechanisms.

Time Series: Modeling and prediction on time series is a traditional research problem in computer science, with a number of successful cases, e.g., the autoregressive model [20] and ARMA [4]. With the introduction of domain expertise and graphical models, new approaches have been proposed, e.g., [27], to enhance prediction accuracy. The rapid growth of computation power, on the other hand, has propelled the success of deep neural network models specifically designed for the time series domain, e.g., RNN, LSTM [18] and GRU [6]. In this paper, we adopt LSTM as our backbone network to model time series data.

Domain Adaptation: Unsupervised domain adaptation is a very important problem. The mainstream methods aim to extract domain-invariant features between domains. Maximum Mean Discrepancy (MMD), computed in a reproducing kernel Hilbert space, is one of the most popular alignment measures [3, 19, 13]. Second-order statistics have also been proposed for unsupervised domain adaptation [30], and second- or higher-order scatter statistics can be used to measure alignment in CNNs [22].

Another essential approach in unsupervised domain adaptation is to extract the domain-invariant representation by introducing a domain adversarial layer for domain alignment. [10] introduces a gradient reversal layer to fool the domain classifier and extract the domain-invariant representation, while [32] borrows the idea of the generative adversarial network (GAN) [16] and proposes a unified framework for adversarial domain adaptation.

Taking a causal view of the variables, the adaptation scenario can be determined by the causal mechanism. [37] discusses three application scenarios in domain adaptation: target shift, conditional shift and generalized target shift. Building on [37], [36, 15] further investigate generalized target shift in the context of domain adaptation.

Domain Adaptation on Time Series: Though unsupervised domain adaptation performs well in many computer vision tasks, there is limited work on domain adaptation for time series data. In NLP, [9] uses distributed representations for sequence labeling tasks. [24] simultaneously uses domain-specific and domain-invariant representations for domain adaptation in sentiment classification, while [29] solves the same problem by combining generic embeddings with domain-specific ones. [26] uses a variational method to produce a latent representation that captures the underlying temporal latent dependencies of time series samples from different domains. However, this method extracts the domain-invariant representation from the final hidden state of the RNN, which ignores the rest of the time series and its properties. In this paper, we propose an unsupervised domain adaptation method for time series data that extracts the domain-invariant representation at the time series level and considers the causal mechanism in time series data. Moreover, we address time series domain adaptation from a causal view.

Attention Mechanism: Attention mechanisms are also significant in time series modeling. Motivated by how human beings pay visual attention to different regions of an image or correlate words in one sentence, attention mechanisms have become an integral part of network architectures in natural language processing and computer vision tasks. [1] introduced a general attention mechanism into the machine translation model, which allows the model to automatically search for the relevant words. [23] achieves promising performance in image captioning with a global-local attention method that integrates local representations at the object level with global representations at the image level. Based on the Transformer [33], a general attention architecture, BERT [8] achieves state-of-the-art performance in question answering and language inference. Observing that not all regions of an image are transferable, [35] introduces an attention mechanism into domain adaptation that focuses on the transferable regions of an image.

In this paper, we introduce attention mechanisms into time series domain adaptation, focusing on two kinds of transferable causal mechanisms: the dynamic causal mechanism and the temporal causal mechanism. We first present how the causal mechanisms arise in time series through data observation, and then explain how to transfer them by introducing a dual attention mechanism.

III Preliminary

III-A Problem Definition

We first denote $x = (x_1, x_2, \dots, x_\tau)$ as a multivariate time series sample with $\tau$ time steps, where $x_t \in \mathbb{R}^{M}$, and $y$ as the corresponding label. When $y$ is a real number, the prediction of $y$ is a regression problem over time series. When $y$ is a categorical value, it becomes a multi-class classification problem. We assume that $P_S(x, y)$ and $P_T(x, y)$, which represent the source domain and the target domain respectively, have different distributions but share the same causal structure. $\mathcal{D}_S$ and $\mathcal{D}_T$, which are sampled from $P_S(x, y)$ and $P_T(x, y)$ separately, denote the source and target domain datasets. We further assume that each source domain time series sample $x^S$ comes with a label $y^S$, while the target domain has no labelled samples, and our goal is to devise a model that can predict the label $y^T$ given a time series sample $x^T$ from the target domain.

III-B Base Model

We pick the recurrent neural network as the base approach for our time series modelling, because of its large performance improvement over conventional approaches [6]. Specifically, we develop domain adaptation techniques based on Long Short-Term Memory (LSTM) [28]. In this subsection, we present the basics of LSTM and its usage in our target mechanical system. Formally, we define:

$$ (h_1, h_2, \dots, h_\tau) = G_f(x; \theta_f), \qquad (1) $$

in which $G_f$ denotes the LSTM that accepts a time series sample $x$ as input and outputs the sequence of hidden states $h_1, \dots, h_\tau$, and $\theta_f$ represents the parameters of the LSTM.

Dozens of domain adaptation algorithms proposed in the last decade have shown significant performance improvements in their respective settings. We opt to use the strategy proposed by Ganin et al. [10]. Generally speaking, their strategy learns features that are invariant across domains by training the feature extractor against a domain predictor, so that the domain predictor fails to tell whether an extracted feature comes from the source or the target domain. We consider features extracted this way to be more robust across multiple domains. One of the biggest benefits of the strategy is that the domain prediction loss, i.e., the loss of the domain predictor, can be easily merged into the regression/classification prediction loss, therefore enabling holistic model training for both domain adaptation and label prediction.

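A straightforward solution to time series domain adaptation is to directly reuse existing algorithms originally designed for non-sequential data. Because the final hidden state is assumed to contain all the information of the time series, we take $h_\tau$ as the input of the label predictor and the domain predictor, as shown in equation (2). When training the LSTM with data from multiple domains, the objective loss function consists of two parts: the label loss on the source domain data and the domain prediction loss over both source and target domains. The label loss is used to minimize the error of the LSTM when predicting the labels, while the domain prediction loss is used to control the alignment of features such that the extracted features are consistent across domains.

$$ \hat{y} = G_y(h_\tau; \theta_y), \qquad \hat{d} = G_d(h_\tau; \theta_d), \qquad (2) $$

in which $G_y$ represents the label predictor with parameters $\theta_y$ and $G_d$ represents the domain predictor with parameters $\theta_d$. The parameters $\theta_f$, $\theta_y$ and $\theta_d$ are trained with the following objective function, which is minimized with respect to $\theta_f$ and $\theta_y$ and maximized with respect to $\theta_d$ as in [10]:

$$ \mathcal{L}(\theta_f, \theta_y, \theta_d) = \sum_{(x_i, y_i) \in \mathcal{D}_S} \mathcal{L}_y\big(G_y(h_\tau^{(i)}; \theta_y),\, y_i\big) - \lambda \sum_{x_i \in \mathcal{D}_S \cup \mathcal{D}_T} \mathcal{L}_d\big(G_d(h_\tau^{(i)}; \theta_d),\, d_i\big), \qquad (3) $$

in which $h_\tau^{(i)}$ is the final hidden state produced by the LSTM with parameters $\theta_f$ for sample $x_i$, $\lambda$ weights the domain prediction loss, and $d_i$ denotes the domain label. We let $d_i = 0$ and $d_i = 1$ be the source and target domain labels, respectively. In the next section, we introduce our causal mechanism transfer network (CMTN), motivated by our data observations.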
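As a minimal sketch of how such a two-part objective can be evaluated for one batch, the Python snippet below combines a label loss on the labelled source samples with a domain prediction loss on all samples. The function names, the squared-error label loss, the per-batch averaging and the trade-off weight `lam` are illustrative assumptions, not details taken from [10] or from this paper.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one predicted probability in (0, 1)."""
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(prob + eps) + (1 - target) * math.log(1 - prob + eps))


def adversarial_objective(source_batch, target_batch, lam):
    """Label loss on source samples minus lam times the domain loss on all samples.

    source_batch: list of (label_prediction, label, domain_prob) tuples
    target_batch: list of domain_prob values (target samples carry no label)
    domain_prob is the domain predictor's probability of "target" (domain label 1).
    """
    # Assumed label loss: mean squared error, as in a regression task.
    label_loss = sum((pred - y) ** 2 for pred, y, _ in source_batch) / len(source_batch)
    # Domain labels: 0 for source samples, 1 for target samples.
    domain_terms = [bce(d, 0) for _, _, d in source_batch]
    domain_terms += [bce(d, 1) for d in target_batch]
    domain_loss = sum(domain_terms) / len(domain_terms)
    return label_loss - lam * domain_loss, label_loss, domain_loss


# A domain predictor that outputs 0.5 everywhere cannot tell the domains apart.
total, label_loss, domain_loss = adversarial_objective(
    [(1.0, 1.0, 0.5), (2.0, 2.5, 0.5)], [0.5, 0.5], lam=1.0
)
```

When the domain predictor outputs 0.5 for every sample, its loss is ln 2 per sample, which is the value expected when the extracted features of the two domains are indistinguishable.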

IV Model

The above base model only considers the alignment of the hidden representation of the data, while ignoring the inherent properties of time series data. Fortunately, we find that the causal mechanisms are invariant across domains, because all the machines from different domains still follow the same physical mechanism. Here the causal mechanism refers to a process in which a cause contributes to the production of an effect. For example, as shown in Figure 1, in the boiler system, the variation of temperature ($T$) causes the variation of pressure ($P$) (i.e., $T \rightarrow P$). Furthermore, the temperature ($T$) and the pressure ($P$) jointly affect the operation status ($O$) (i.e., $(T, P) \rightarrow O$).

Such an invariant causal mechanism motivates our Causal Mechanism Transfer Network (CMTN) for time series domain adaptation, which extends the existing time series representation model with the causal mechanism of the data. Generally, we split the sequence representation model into two parts, the domain-invariant causal mechanism part and the domain-specific part. Formally, we extend $G_f(x; \theta_f)$ to $G_f(x; \theta_S, \theta_T, \theta_c)$ by splitting the parameters $\theta_f$ into three parts: $\theta_S$, $\theta_T$ and $\theta_c$. Among them, $\theta_S$ and $\theta_T$ denote the domain-specific parameters for the source and target domains respectively, and $\theta_c$ denotes the domain-invariant parameters.

However, it is still challenging to model the invariant causal mechanisms over dynamic time series data, which is hindered by the following three phenomena: inter-domain value range shift, inter-domain time lag shift and intra-domain causal mechanism shift. These phenomena come from our observations over the data. For example, as shown in Figure 1, the value range of the temperature of a chiller or boiler varies with the location of the machine, and the time lag of a causal effect (e.g., $T \rightarrow P$) varies with domains. The factors that affect the operation status can be more complex; for example, temperature and pressure jointly affect the operation status. In the following, we provide the details of how to overcome these three obstacles under the above general causal mechanism transfer framework.

IV-A Domain Specific Feature Extractor

Observation 1 (Inter-domain value range shift): First of all, the value range of the input vectors clearly varies across domains, as shown in Figure 2. In a boiler system, for example, the minimal and maximal values of certain sensor readings are very different from boiler to boiler. Traditional domain adaptation techniques, e.g., [10], leave this to feature alignment, which may hurt the LSTM model shared by all domains when it generates the features for the final classification or regression task.

Fig. 2: Illustration of value range shift. The value ranges of the source domains may be different from those of the target domains.

Motivated by our observation of varying value ranges of the input vectors in different domains, we insert a domain-specific feature extractor between the input and the LSTM. If we use Ganin's method [11] directly, the shared LSTM that simply aligns sensor readings of different value ranges will not achieve ideal performance. In our solution, we intentionally add a new layer for domain-specific feature extraction, i.e., the feature extractor in Figure 5. It is expected to handle a wide spectrum of domain alignment problems by pre-processing the input values automatically. Formally, we have:

$$ r^S = F_S(x^S; W_S), \qquad r^T = F_T(x^T; W_T), $$

in which $F_S$ and $F_T$ are each a simple neural network, $W_S$ and $W_T$ are learnable projection matrices, and $r^S$ is the feature generated by the source domain-specific feature extractor. Similarly, we let $r^T$ denote the feature generated by the target domain-specific feature extractor, and further let $r$ and $r_t$ denote the feature generated by either domain-specific feature extractor and its value at time step $t$, respectively. Subsequently, we take $r$ as the input of the base model in Section III-B.

In summary, $W_S$ and $W_T$ in this section are the domain-specific parameters, which are used to capture the different value ranges of the source and target domains respectively. The LSTM parameters are the domain-sharing parameters, which are used to model the domain-invariant causal mechanism.
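The idea of separate extractors feeding one shared model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each extractor is a single affine layer, the weights are set by hand rather than learned, and the class name `DomainSpecificExtractor` and the temperature ranges are hypothetical.

```python
def linear(x, weight, bias):
    """Affine map; weight is a list of rows, x and bias are lists of floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class DomainSpecificExtractor:
    """Holds one (weight, bias) pair per domain; downstream layers would be shared."""

    def __init__(self, params):
        self.params = params  # e.g. {"source": (W_S, b_S), "target": (W_T, b_T)}

    def __call__(self, x_t, domain):
        weight, bias = self.params[domain]
        return linear(x_t, weight, bias)


# Hypothetical ranges: source temperature in [0, 100], target temperature in [0, 200].
# Each domain gets its own scale, so both land in the same [0, 1] feature range.
extractor = DomainSpecificExtractor({
    "source": ([[0.01]], [0.0]),
    "target": ([[0.005]], [0.0]),
})
r_source = extractor([50.0], "source")   # mid-range reading in the source domain
r_target = extractor([100.0], "target")  # mid-range reading in the target domain
```

With these hand-set weights a mid-range reading maps to roughly the same feature value in both domains, which is the effect the learned extractors are meant to achieve before the shared layers see the data.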

IV-B Transferable Temporal Causal Mechanism

Observation 2 (Inter-domain time lag shift): The temporal causal mechanism [31, 17, 5] is important to the modeling of multivariate time series data; for example, the relationship between temperature and pressure follows Gay-Lussac's law. However, because of domain-specific properties, such as the different degrees of aging of different machines, the time lags of this mechanism differ between domains, as shown in Figure 3.

Fig. 3: The illustration of temporal causal mechanism. The time lags vary across the domains, but the causal mechanism among the sensors (i.e., P is the cause of in the two domains) is transferable across the domains.

In mechanical systems, the readings of sensors follow temporal causality, such as the relationship between temperature and pressure. Formally, a time series $X$ is said to be a temporal cause of a time series $Y$ if it can be shown that the past values of $X$ provide statistically significant information about future values of $Y$.
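To make the notion of a domain-dependent time lag concrete, the sketch below estimates the lag at which one series best predicts another by scanning lagged Pearson correlations. This is a simple stand-in chosen for illustration, not a statistical causality test and not part of CMTN; the function names and the toy series are hypothetical.

```python
import math


def lagged_correlation(cause, effect, lag):
    """Pearson correlation between cause[t] and effect[t + lag]."""
    xs = cause[:len(cause) - lag]
    ys = effect[lag:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_lag(cause, effect, max_lag):
    """Lag in 0..max_lag with the highest lagged correlation."""
    return max(range(max_lag + 1), key=lambda lag: lagged_correlation(cause, effect, lag))


def delayed(series, lag):
    """Copy of series shifted right by lag steps, padded with zeros."""
    return [0.0] * lag + series[:len(series) - lag]


# Toy data: the same cause drives the effect with a different delay in each domain.
temperature = [0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
pressure_source = delayed(temperature, 1)
pressure_target = delayed(temperature, 3)
lag_source = best_lag(temperature, pressure_source, max_lag=4)
lag_target = best_lag(temperature, pressure_target, max_lag=4)
```

In this toy setting the relationship between the two sensors is identical in both domains and only the delay differs, which is the situation the temporal attention mechanism is designed to absorb.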

We find that temporal causality is ubiquitous in mechanical systems, but it comes with time lags that depend on the properties of each domain. For example, in chiller plant systems, the aging of pumps might lead to lags in response when the temperature is changing. To address this, we introduce a supervised attention mechanism that adaptively selects the relevant hidden states: the contributing hidden state is assigned a larger weight, so that the effect of time lags becomes negligible. Specifically, to calculate the context vector over each hidden state $h_t$ at time step $t$ before the final time step $\tau$, we define the weight $\beta_t$ of each hidden state as follows:

$$ u_t = \tanh(w_\beta^\top h_t + b_\beta), \qquad \beta_t = \frac{\exp(u_t)}{\sum_{j=1}^{\tau-1} \exp(u_j)}, \qquad c = \sum_{t=1}^{\tau-1} \beta_t h_t, $$

in which $w_\beta$ and $b_\beta$ are trainable parameters, and $c$ is the candidate context vector over all the hidden states except the last one. We generate the final context vector $\tilde{c}$ by concatenating $c$ and the final hidden state $h_\tau$. The aforementioned process is as follows:

$$ \tilde{c} = [c; h_\tau]. $$

In summary, $w_\beta$ and $b_\beta$ are the domain-sharing parameters, which are used to model the transferable temporal causal mechanism proposed in this subsection.
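The following Python sketch shows one way attention over hidden states can make the context representation insensitive to when an informative state occurs. It assumes a tanh-scored softmax over all hidden states except the last, with a hand-set scoring vector `w` and bias `b`; these choices and the toy hidden states are illustrative assumptions rather than the exact formulation of the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(hidden_states, w, b):
    """Weight every hidden state except the last, then append the last one.

    hidden_states: list of equally sized vectors, one per time step
    Returns (attention weights, context vector concatenated with the final state).
    """
    *earlier, last = hidden_states
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in earlier]
    weights = softmax(scores)
    context = [
        sum(wt * h[k] for wt, h in zip(weights, earlier))
        for k in range(len(last))
    ]
    return weights, context + list(last)


# The same informative state appears early in one sequence and late in the other.
early = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.2, 0.2]]
late = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.2, 0.2]]
w_early, c_early = temporal_attention(early, [1.0, 1.0], 0.0)
w_late, c_late = temporal_attention(late, [1.0, 1.0], 0.0)
```

The largest weight follows the informative state to wherever it occurs, and the resulting representation is the same for both sequences, so a predictor that consumes it does not need to know the lag.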

IV-C Transferable Dynamic Causal Mechanism

Observation 3 (Intra-domain causal mechanism shift): As shown in Figure 4, the causal effects between sensors change over time, depending on the sensor readings at the last time step. In a chiller plant system, higher temperature leads to an increase in relative humidity, which in turn revs up the chilled water pump, while lower temperature leads to a drop in relative humidity, which in turn revs up the condenser water pump. These causal effects are physical mechanisms, so it is reasonable to transfer them from the source domain to the target domain.

Fig. 4: Illustration of the dynamic causal mechanism. The causal mechanism among sensors changes over time within a domain, but such a mechanism is transferable across domains.

Next we introduce the transferable dynamic causal mechanism motivated by the aforementioned observation. In other words, the source domain and the target domain share the same dynamic causal mechanism. To capture it, given the $k$-th dimension of the $t$-th time step of the domain-specific extracted feature (i.e., $r_t^k$), we employ a self-attention mechanism that generates a transferable weight over sensors to adaptively capture the dynamic correlation of the multivariate time series data. Formally, we calculate the weight of the $k$-th feature at the $t$-th time step (i.e., $\alpha_t^k$) by:

$$ e_t = W_\alpha [h_{t-1}; r_t] + b_\alpha, \qquad \alpha_t^k = \frac{\exp(e_t^k)}{\sum_{j=1}^{M} \exp(e_t^j)}, $$

in which $W_\alpha$ and $b_\alpha$ are trainable parameters and $M$ is the dimension of $r_t$. The attention weights are jointly generated by the historical hidden state $h_{t-1}$ of the LSTM and the current domain-specific feature $r_t$, and they also represent which sensor plays an important role in the final prediction. Here, we let $\alpha_t = (\alpha_t^1, \dots, \alpha_t^M)$ be the vector of weights of the sensors. After generating the intra-sensor attention weights, the weighted sensor readings are calculated with:

$$ \tilde{r}_t = (\alpha_t^1 r_t^1, \alpha_t^2 r_t^2, \dots, \alpha_t^M r_t^M). $$

The aforementioned process is as follows:

$$ h_t = \mathrm{LSTM}(h_{t-1}, \tilde{r}_t). $$

In summary, $W_\alpha$ and $b_\alpha$ are the domain-sharing parameters, which are used to model the transferable dynamic causal mechanism proposed in this subsection.
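A minimal Python sketch of intra-sensor attention is given below. It assumes a linear scoring layer over the concatenation of the previous hidden state and the current feature vector, followed by a softmax across sensors; the function name `sensor_attention` and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sensor_attention(h_prev, r_t, weight, bias):
    """Per-sensor weights computed from the previous hidden state and current features.

    weight has one row per sensor and len(h_prev) + len(r_t) columns.
    Returns (attention weights, element-wise weighted features).
    """
    joint = list(h_prev) + list(r_t)
    scores = [sum(w * v for w, v in zip(row, joint)) + b for row, b in zip(weight, bias)]
    alpha = softmax(scores)
    weighted = [a * r for a, r in zip(alpha, r_t)]
    return alpha, weighted


r_t = [1.0, 2.0]
# With zero weights every sensor is treated equally.
uniform_alpha, uniform_weighted = sensor_attention(
    [0.0], r_t, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0]
)
# Here the score of sensor 0 grows with the previous hidden state,
# so the same reading is weighted differently at different times.
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
alpha_low, _ = sensor_attention([0.0], r_t, weight, [0.0, 0.0])
alpha_high, _ = sensor_attention([2.0], r_t, weight, [0.0, 0.0])
```

Because the weights depend on the previous hidden state, the emphasis placed on each sensor can change from one time step to the next, while the scoring parameters themselves stay fixed and can be shared across domains.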

IV-D Model Summary

Fig. 5: The architecture of the Causal Mechanism Transfer Network (CMTN). From the input to the output, the domain-specific feature extractor (in orange) employs an MLP layer to mitigate the inter-domain value range shift; the dynamic causal mechanism layer (in pink) employs a self-attention mechanism to capture the intra-domain causal mechanism shift; the temporal causal transfer layer employs a supervised attention layer to extract the important hidden states for the final prediction. (Best viewed in color.)

The architecture of CMTN is shown in Figure 5. First, we take the time series sensor values $x$ as the input of the domain-specific feature extractors, which mitigate the influence of different value ranges; the output of the extractors is the feature $r$. Second, the features are aligned by the dynamic causal transfer layer, which utilizes the feature $r_t$ and the hidden state $h_{t-1}$ from the last time step to produce the weighted feature $\tilde{r}_t$. Third, by taking the hidden state $h_{t-1}$ from the last time step and the weighted feature $\tilde{r}_t$ as input, the LSTM generates the hidden state $h_t$. Fourth, by utilizing all the hidden states, the temporal causal transfer layer calculates the final context representation $\tilde{c}$, which not only contains all the information of the time series but also extracts and highlights the most important state. Finally, we employ the gradient reversal layer to fool the domain predictor, and use the label predictor to generate the final decision.

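The overall objective function of our approach is summarized as follows:

$$ \mathcal{L}(\theta_S, \theta_T, \theta_c, \theta_y, \theta_d) = \frac{1}{N_S} \sum_{(x_i, y_i) \in \mathcal{D}_S} \mathcal{L}_y\big(G_y(\tilde{c}_i; \theta_y),\, y_i\big) - \frac{\lambda}{N_S + N_T} \sum_{x_i \in \mathcal{D}_S \cup \mathcal{D}_T} \mathcal{L}_d\big(G_d(\tilde{c}_i; \theta_d),\, d_i\big), $$

where $\tilde{c}_i$ is the final context representation of sample $x_i$ computed by $G_f(\cdot\,; \theta_S, \theta_T, \theta_c)$, $N_S$ and $N_T$ are the sizes of the source domain and target domain datasets, and $\lambda$ is the parameter that trades off the label prediction loss and the domain prediction loss in this unified optimization.

In the training procedure, we employ the stochastic gradient descent algorithm to find the optimal parameter set $(\hat{\theta}_S, \hat{\theta}_T, \hat{\theta}_c, \hat{\theta}_y, \hat{\theta}_d)$ as follows. In this procedure, all the samples are used, including the labelled source domain samples and the unlabelled target domain samples.

$$ (\hat{\theta}_S, \hat{\theta}_T, \hat{\theta}_c, \hat{\theta}_y) = \arg\min_{\theta_S, \theta_T, \theta_c, \theta_y} \mathcal{L}(\theta_S, \theta_T, \theta_c, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} \mathcal{L}(\hat{\theta}_S, \hat{\theta}_T, \hat{\theta}_c, \hat{\theta}_y, \theta_d). $$

In the predicting procedure, we input the target domain samples into the model through the target feature extractor, and the labels of the target domain samples are predicted as follows:

$$ \hat{y}^T = G_y\big(G_f(x^T; \hat{\theta}_T, \hat{\theta}_c); \hat{\theta}_y\big). $$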
V Case Studies and Experiment

In this section, the proposed CMTN method is studied experimentally on two real-world applications: Chiller Plant Optimization and Boiler Fault Detection.

V-A Datasets

Chiller Plant Optimization: The chiller plant data, which is provided by Kaer Pte. Ltd, consists of chiller plant sensor data collected from Building Management Systems (BMS) at two sites, each considered as one domain. The learning task is to predict the total system power of a chiller plant, which is a regression problem, for energy optimization. We extract training data samples from the target domain in which the VSD speeds of the condenser water pumps, chilled water pumps and cooling tower fans are restricted to a limited portion of the allowed range. This simulates the situation at new chiller sites with insufficient data. Such data insufficiency is also common at chiller sites that have been running for years; we have encountered several chiller sites with VSD speeds fixed at a single value all the time. The test data of the target domain contains data samples with the full range of VSD speeds. Details of the dataset in terms of the start and end dates and the sizes of the source domain and the target domain are provided in Table I. Table II lists all the features. Training and test data are split according to time: the earliest portion of the data is used as training data, while the rest is used as test data.
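The two data-preparation steps described above, restricting the training samples to a narrow band of VSD speeds and splitting by time, can be sketched as follows. The function names, the field names `t` and `vsd`, the speed band and the split fraction are all hypothetical, since the actual values are not given here.

```python
def restrict_vsd(rows, low, high, keys):
    """Keep only rows whose VSD speed fields all lie in [low, high]."""
    return [row for row in rows if all(low <= row[key] <= high for key in keys)]


def chronological_split(rows, train_fraction):
    """Split rows that are already sorted by time; the earliest part is for training."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Toy records sorted by timestamp "t", with one VSD speed field.
rows = [{"t": i, "vsd": 40 + 5 * i} for i in range(10)]
train, test = chronological_split(rows, train_fraction=0.5)
narrow_train = restrict_vsd(train, low=50, high=70, keys=["vsd"])
```

Splitting by time rather than at random keeps every test sample later than every training sample, so the evaluation does not benefit from information about the future.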

Unlike the approach in [34], which decomposes a chiller plant into multiple components and models each component separately, we use a black-box approach based on LSTM to model the total system power. This is because it is less straightforward, and even difficult, to apply domain adaptation techniques to a complex system with multiple inter-connected models.

| | Start Date | End Date | Size |
|---|---|---|---|
| Source domain | 28/06/2017 | 14/07/2017 | 211K |
| Target domain | 11/06/2017 | 14/07/2017 | 254K |

TABLE I: Duration and size of chiller data

| Feature Name |
|---|
| VSD speed of chilled water pump (%) |
| VSD speed of cooling tower fan (%) |
| VSD speed of condenser water pump (%) |
| Relative humidity (%) |
| Dry bulb temperature (outdoor) (°C) |
| System cooling load (RT) |
| Number of chillers on |
| Number of chilled water pumps on |
| Number of cooling towers on |
| Number of condenser water pumps on |

TABLE II: Features of chiller data

Boiler Fault Detection: The boiler data, which is provided by SK Telecom, consists of sensor data from five boilers from 24/3/2014 to 30/11/2016. Each boiler is considered as one domain. The learning task is to predict a faulty blow down valve for each boiler. All the features used for this task are listed in Table III. In data pre-processing, for columns whose values increase continuously over time, as indicated by "delta" in Table III, we replace each value $x_t$ with its increment $x_t - x_{t-1}$. Notice that the boiler data is extremely unbalanced, as can be seen from the statistics of the five boilers listed in Table IV. Only a small fraction of the total samples have faulty labels; boiler 1, for instance, has 1334 faulty samples out of 89969. Due to the lack of faulty labels, we use all the faulty data of the source domains as training data for domain adaptation. To handle the extreme unbalance of the data, we apply down-sampling on the normal samples of the source domain to obtain a balanced training dataset.
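The two pre-processing steps can be sketched in Python as follows. The function names are hypothetical, the first increment is set to 0 by assumption because no earlier reading exists, and the random down-sampling with a fixed seed is one simple choice among several.

```python
import random


def to_deltas(values):
    """Replace a cumulative counter by its step-to-step increase (first step set to 0)."""
    return [0] + [later - earlier for earlier, later in zip(values, values[1:])]


def downsample_normals(samples, labels, seed=0):
    """Keep all faulty samples (label 1) and an equally sized random subset of normals."""
    rng = random.Random(seed)
    faulty = [i for i, y in enumerate(labels) if y == 1]
    normal = [i for i, y in enumerate(labels) if y == 0]
    kept_normal = rng.sample(normal, min(len(faulty), len(normal)))
    keep = sorted(faulty + kept_normal)  # preserve the original time order
    return [samples[i] for i in keep], [labels[i] for i in keep]


deltas = to_deltas([10, 12, 15, 15])
samples = list(range(10))
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
balanced_samples, balanced_labels = downsample_normals(samples, labels)
```

After down-sampling, the training set contains as many normal samples as faulty ones, and every faulty sample is retained.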

Feature Name
Steam pressure main header
Outdoor temperature
Temperature concentrated water
Operating time feed water (delta)
Temperature exhaust gas
Volume feed water (delta)
Temperature feed water
Temperature tube wall
Damper angle
Temperature scale
Temperature external
Operating status
Operating code
Input status
Power usage meter (delta)
Steam pressure
Operating time chemical injection (delta)
Combustion time (delta)
Number of ignition (delta)
Gas consumption
TABLE III: Features of boiler data
Boiler ID | # of samples | # of faulty samples | Ratio
1 | 89969 | 1334 | 0.98
2 | 90120 | 7170 | 0.92
3 | 83145 | 1168 | 0.98
4 | 89718 | 4936 | 0.94
5 | 89639 | 6712 | 0.92
TABLE IV: Statistics of the boiler data. ‘Ratio’ is the ratio of # normal samples over # of samples.

V-B Evaluation Metrics

We use application-specific criteria to evaluate our model and the baselines. For the Chiller Plant Optimization case, we use the mean absolute percentage error (MAPE), formally defined as follows:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and $n$ is the number of test samples. For Boiler Fault Detection, we use two further criteria:

  • Accuracy of fault detection as the percentage of correctly predicted samples.

  • Area under the curve (AUC) of the correctly predicted faulty samples.

It is worth noting that we report the AUC over the fault samples in our experiments. Because the boiler data is extremely unbalanced, a model that always predicts ‘normal’ would still achieve a high accuracy, so the AUC over the fault samples gives a better understanding of model performance.
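The metrics in this subsection can be illustrated with a short sketch. It assumes the standard definitions: MAPE as the mean of absolute relative errors in percent, and AUC as the usual ROC AUC, computed here as the probability that a faulty sample receives a higher score than a normal one. The paper's exact "AUC over the fault samples" may be defined differently, and all data below are made up for the example.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent (actual values must be non-zero)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def accuracy(labels, predictions):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)

def roc_auc(labels, scores):
    """Probability that a faulty sample (label 1) is scored above a normal
    sample (label 0); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical power readings and predictions.
example_mape = mape([100.0, 200.0], [110.0, 190.0])

# Hypothetical unbalanced labels: 98 normal samples, 2 faulty ones.
labels = [0] * 98 + [1] * 2
always_normal = [0] * 100      # hard predictions of a trivial model
constant_scores = [0.0] * 100  # the same model expressed as scores
trivial_accuracy = accuracy(labels, always_normal)
trivial_auc = roc_auc(labels, constant_scores)
```

On this toy data the trivial model is right on 98 of 100 samples, yet its AUC is at chance level because it cannot rank faulty samples above normal ones, which is the reason accuracy alone is misleading on unbalanced data.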

V-C Baselines

We compare our approach against the following baselines:

  • LSTM_S2T uses source domain data to train an LSTM model and applies it to the target domain without any adaptation (S2T stands for source to target). It is expected to provide the lower-bound performance.

  • Ganin implements the domain adaptation architecture proposed in [10], with a GRL (Gradient Reversal Layer) on an LSTM, which is a straightforward solution for time series domain adaptation.

  • VRADA implements the domain adaptation architecture proposed in [26], which combines the GRL with VRNN [7]. However, it only aligns the final latent representation from the recurrent latent variable model.
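The Gradient Reversal Layer used by the Ganin and VRADA baselines can be illustrated with a framework-free sketch: the layer is the identity in the forward pass and multiplies the incoming gradient by a negative coefficient in the backward pass. The class name, the `lam` parameter and the numbers below are assumptions for illustration; a real implementation would hook into the automatic differentiation of the framework (for example through TensorFlow's `tf.custom_gradient`).

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped and scaled gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (hypothetical name)

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # The gradient of the domain loss is reversed before it reaches the
        # feature extractor, pushing features towards domain confusion.
        return [-self.lam * g for g in grad_output]

grl = GradientReversal(lam=0.5)
features = [0.5, -1.0]
passed = grl.forward(features)

# Hypothetical gradient of the domain classification loss w.r.t. the features.
domain_grad = [0.2, -0.4]
feature_grad = grl.backward(domain_grad)
```

With this sign flip, a single backward pass trains the domain classifier to separate the domains while the feature extractor is updated in the opposite direction.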

Besides the above baselines, we also consider three variants of our approach to evaluate the effect of each component:

  • CMTN-NDE: We only remove the domain specific extractors.

  • CMTN-NGA: We only remove the temporal causal transfer layer.

  • CMTN-NLA: We only remove the dynamic causal transfer layer.

Our model and the baselines are implemented with TensorFlow [14] on a server with one GTX-1080 GPU and an Intel 7700K CPU. We set the length of each time series sample to 6. The settings of each model are provided in Table V.

Batch size 512 512 512 512
LSTM hidden layer size 500 500 500 500
LSTM layer 1 1 1 1
MLP hidden layer size 100 100 100 100
MLP layer 1 1 1 1
Domain specific feature size 100 100 100 100
Optimizer Adam Adam Adam Adam
Learning rate 0.0001 0.003 0.003 0.003
Coefficient - 0.0001 0.0001 0.005
Dropout rate 0.5 0.1 0.2 0.1
TABLE V: Settings of Models on Chiller Data

V-D Results on Chiller Plant Optimization

Method | MAPE (%)
LSTM_S2T | 371.86
Ganin | 4.71
VRADA | 4.21
CMTN | 3.28
TABLE VI: MAPE on total system power prediction on chiller data

Accuracy of the system power prediction: The MAPE of all models for total system power prediction is reported in Table VI. Our approach achieves the lowest MAPE among all models: 3.28%, compared with 4.71% for Ganin and 4.21% for VRADA. The MAPE of CMTN-NDE is also lower than that of Ganin and VRADA. This indicates the effectiveness of the transferable temporal and dynamic causal mechanisms, which differ from the alignment used in Ganin and VRADA. The MAPE of LSTM_S2T is by far the worst, which implies that applying source domain knowledge directly to the target domain without adaptation does not work on the chiller plant.

Power saving after using the power prediction: To evaluate the usefulness of the domain adaptation models for energy saving, we simulate real-time VSD speed optimization on the test data of the target domain, as proposed in [34]. The main idea is to search, at fixed intervals of time steps, for the VSD speeds of the pumps and fans that minimize the total system power predicted by the domain adaptation model, assuming that the other features (e.g., weather and cooling load) remain the same.
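The search step can be sketched as an exhaustive search over candidate speed triples. Everything here is hypothetical: `predict_power` is a toy surrogate standing in for a trained domain adaptation model, the candidate grid and cooling load are made-up values, and the search procedure of [34] may well differ from plain enumeration.

```python
from itertools import product

def predict_power(chwp, ct_fan, cwp, cooling_load):
    """Toy stand-in for a trained power model (not the paper's model).

    Speeds are fractions of full speed for the chilled water pump,
    cooling tower fan and condenser water pump.
    """
    # Pumps and fans: power grows roughly with the cube of the speed.
    drive_power = 40.0 * (chwp ** 3 + ct_fan ** 3 + cwp ** 3)
    # Chillers: work harder when water and air flow are low.
    chiller_power = cooling_load * (0.5 + 0.1 / chwp + 0.1 / ct_fan + 0.1 / cwp)
    return drive_power + chiller_power

def best_speeds(cooling_load, candidates):
    """Return the speed triple with the lowest predicted total power,
    keeping the non-controllable feature (cooling load) fixed."""
    return min(product(candidates, repeat=3),
               key=lambda speeds: predict_power(*speeds, cooling_load))

candidates = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
cooling_load = 300.0
optimal = best_speeds(cooling_load, candidates)
optimal_power = predict_power(*optimal, cooling_load)
full_speed_power = predict_power(1.0, 1.0, 1.0, cooling_load)
```

Because the chosen speeds are only as good as the model that scores them, an inaccurate predictor can select settings that raise the real power, which is why prediction accuracy matters for the savings reported in this section.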

Model | Energy (kWh) | Energy Difference (%)
Original | 15858 | -
Ganin | 17385 | +9.62
VRADA | 16971 | +7.02
CMTN-NDE | 16930 | +6.76
CMTN-NGA | 16003 | +0.91
CMTN-NLA | 16535 | +4.28
CMTN | 15532 | -2.05
TABLE VII: Percentage of total system power saving on chiller data

After finding the optimal speeds, we train an LSTM_T2T model, which is trained and tested on the target domain training and test datasets respectively. We then apply the most accurate LSTM_T2T model to predict the corresponding total system power and compare it against the original power. The results of Ganin, VRADA and our approach are plotted in Figures 6, 7 and 8 respectively, with 5-day simulations covering 4 weekdays and 1 weekend day. Note that the energy consumption of the original setting is already the outcome of optimization by our previous data-driven method in [34].

Our approach with optimization further reduces energy consumption, consistently reaching lower power in most cases, as shown in Figure 8, whereas Ganin’s approach yields similar or even higher power after optimization because of its higher MAPE on power prediction.

The corresponding energy consumption (kWh) and the percentage difference in total system power for all models are reported in Table VII. Because the optimization demands accurate modeling, only our approach achieves an energy saving in the simulation (2.05%, Table VII). With an electricity tariff of around SGD$0.20 per kWh, the optimization based on our domain adaptation model saves roughly SGD$65.2 in five days. Since none of the domain adaptation techniques tested here uses any labels from the target domain, the saving achieved by our approach is significant.
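The arithmetic behind the reported saving can be reproduced from the energy column of Table VII. The sketch below assumes the tariff is quoted per kWh; recomputed percentages can differ from the table in the last digit, since the tabulated energies appear to be rounded.

```python
# Energy over the 5-day simulation, taken from Table VII (kWh).
energy_kwh = {
    "Original": 15858,
    "Ganin": 17385,
    "VRADA": 16971,
    "CMTN-NDE": 16930,
    "CMTN-NGA": 16003,
    "CMTN-NLA": 16535,
    "CMTN": 15532,
}
TARIFF_SGD_PER_KWH = 0.20  # assumed unit: SGD per kWh

baseline = energy_kwh["Original"]

def difference_pct(model):
    """Energy difference relative to the original setting, in percent
    (negative means the model saves energy)."""
    return 100.0 * (energy_kwh[model] - baseline) / baseline

saved_kwh = baseline - energy_kwh["CMTN"]
saved_sgd = saved_kwh * TARIFF_SGD_PER_KWH
cmtn_pct = difference_pct("CMTN")
models_that_save = [m for m in energy_kwh
                    if m != "Original" and difference_pct(m) < 0]
```

The 326 kWh saved by CMTN corresponds to roughly SGD$65.2 at this tariff, and CMTN is the only entry with a negative difference.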

V-E Results on Boiler Fault Detection

Accuracy of the boiler fault detection: We use boiler 4 as the source domain, which has the median number of fault labels among the five boilers. The rest of the boilers are used as target domains. We report the accuracy and AUC of each source-target pair in Tables VIII, IX, X and XI respectively.

Fig. 6: Comparison of original total system power with that of Ganin’s approach with optimization.
Fig. 7: Comparison of original total system power with that of VRADA’s approach with optimization.
Fig. 8: Comparison of original total system power with that of CMTN’s approach with optimization.
Fig. 9: Comparison of AUC with different length of time series input under the setting .
Method | Accuracy | AUC
LSTM_S2T | 0.970 | 0.475
Ganin | 0.975 | 0.533
VRADA | 0.985 | 0.634
CMTN-NDE | 0.982 | 0.640
CMTN-NGA | 0.980 | 0.642
CMTN-NLA | 0.985 | 0.6763
CMTN | 0.985 | 0.707
TABLE VIII: Results on Boiler 4 (Source) and 1 (Target)

Method | Accuracy | AUC
LSTM_S2T | 0.940 | 0.864
Ganin | 0.971 | 0.909
VRADA | 0.972 | 0.934
CMTN-NDE | 0.976 | 0.926
CMTN-NGA | 0.975 | 0.925
CMTN-NLA | 0.971 | 0.947
CMTN | 0.977 | 0.948
TABLE IX: Results on Boiler 4 (Source) and 2 (Target)

Method | Accuracy | AUC
LSTM_S2T | 0.978 | 0.300
Ganin | 0.979 | 0.475
VRADA | 0.986 | 0.720
CMTN-NDE | 0.982 | 0.534
CMTN-NGA | 0.978 | 0.709
CMTN-NLA | 0.986 | 0.800
CMTN | 0.986 | 0.877
TABLE X: Results on Boiler 4 (Source) and 3 (Target)

Method | Accuracy | AUC
LSTM_S2T | 0.975 | 0.930
Ganin | 0.967 | 0.932
VRADA | 0.980 | 0.945
CMTN-NDE | 0.969 | 0.936
CMTN-NGA | 0.976 | 0.929
CMTN-NLA | 0.981 | 0.949
CMTN | 0.986 | 0.954
TABLE XI: Results on Boiler 4 (Source) and 5 (Target)

Overall, our approach achieves the highest accuracy and AUC in all settings. It outperforms Ganin and VRADA on the AUC over faulty samples; for example, on the pair Boiler 4 to Boiler 3 it raises the AUC from 0.475 (Ganin) and from 0.720 (VRADA) to 0.877 (Table X). All models perform well on the pairs Boiler 4 to Boiler 5 and Boiler 4 to Boiler 2: even LSTM_S2T achieves AUCs of 0.930 and 0.864 respectively. This is probably because these boilers (i.e., boilers 2, 4 and 5) encountered similar problems, namely a faulty blow down valve, after installation. They therefore tend to share, even without adaptation, common properties related to faults caused by installation issues. Even in such cases, domain adaptation further improves the accuracy and AUC over LSTM_S2T (Tables IX and XI).

However, the performance on the pairs Boiler 4 to Boiler 1 and Boiler 4 to Boiler 3 is much worse than in the other cases. The highest AUC over faulty samples on the pair Boiler 4 to Boiler 1 is only 0.707 (Table VIII). The reasons are twofold. First, these two target domains, Boiler 1 and Boiler 3, contain far fewer faulty labels than the others, which makes it more difficult to learn the domain specific feature extractor. Second, these two boilers did not encounter the ‘faulty blow down valve’ problem after installation, so they tend to share fewer properties with the source domain.

Nevertheless, the improvement of our domain adaptation approach over the AUC of LSTM_S2T is significant in these cases, e.g., from 0.475 to 0.707 on Boiler 4 to Boiler 1 (Table VIII) and from 0.300 to 0.877 on Boiler 4 to Boiler 3 (Table X), though the results have not yet reached the level required for reliable industrial adoption. Inspired by these observations, a possible way to quickly check whether a domain adaptation technique would apply to a new domain is to use S2T as the baseline: if S2T achieves reasonable performance, domain adaptation has a higher chance of yielding a promising result. We leave this as future work.

V-F Ablation Study

Study on the domain specific feature extractor: Table XII shows the value ranges of some sensors that differ widely across boilers; boiler 3 has the largest difference in value range among all the boilers. At the same time, the experimental results reveal that CMTN-NDE, which removes the domain specific extractors, suffers a significant drop compared with CMTN and even obtains a lower AUC than VRADA on this boiler (0.534 versus 0.720, Table X). From the results of boiler fault detection, we observe that: 1) different value ranges of sensors can lead to negative transfer; 2) the domain specific feature extractors can mitigate the domain-variant influence.

Boiler | Operating time feed water | Temperature exhaust gas | Power usage meter | Temperature tube wall
TABLE XII: Value ranges of some sensors with large differences across boilers

Study on the transferable temporal causal mechanism: Motivated by the fact that the temporal causal mechanism remains invariant across domains while the time lag varies, we adopt an attention mechanism for the transferable temporal causal mechanism module, which considers not only the final hidden state but also all the others. The longer the input time series, the less information about earlier time steps is retained in the final hidden state. Therefore, we evaluate the effect of the transferable temporal causal mechanism module by taking time series of different lengths as input; the results are shown in Figure 9.

According to the results, we observe that: 1) The performance of CMTN-NGA is still better than that of VRADA, and the longer the sequence, the larger the gap between CMTN-NGA and VRADA, which reflects the usefulness of the domain specific extractors and the transferable dynamic causal mechanism. 2) The AUC of Ganin, VRADA and CMTN-NGA drops sharply as the length of the time series increases, while the slope for CMTN is much smaller than that of the other approaches. This is because CMTN applies the temporal causal mechanism to all the hidden states, which reduces the effect of the domain-variant time lag and captures the temporal causality between time series at the same time. Although VRADA can capture complex and domain-invariant temporal relationships, it fails at time-series-level feature alignment, so increasing the sequence length has a great impact on its transferability.
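The difference between using only the final hidden state and attending over all hidden states can be sketched as follows. This is an illustrative dot-product attention with a hypothetical query vector and made-up hidden states; the attention form actually used in CMTN is not specified in this section and may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Weighted sum of all hidden states; the weights come from
    dot-product scores between each hidden state and the query."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

# Three hypothetical LSTM hidden states (one per time step) of size 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]  # hypothetical learned query vector

pooled, weights = attention_pool(hidden_states, query)
final_only = hidden_states[-1]  # what a last-state model would use
```

In this toy case the first time step receives the largest weight, so information from early in the window reaches the pooled representation directly instead of having to survive in the final hidden state.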

Study on the transferable dynamic causal mechanism: As shown in Tables VIII, IX, X and XI, we observe that: 1) the combination of domain specific extractors and the transferable temporal causal mechanism (CMTN-NLA) is superior to VRADA, with the largest AUC margin on Boiler 4 to Boiler 3 (0.800 versus 0.720, Table X). 2) After adding the dynamic causal mechanism, the results improve further, which demonstrates the importance of the transferable dynamic causal mechanism. VRADA and Ganin simply assume that every sensor carries the same weight at every time step; the main drawback is that some sensor values may be useless or may even interfere with detection.

VI Conclusion

In this paper, we present a novel Causal Mechanism Transfer Network for time series domain adaptation. We demonstrate the usefulness of the approach on two real-world case studies on mechanical systems. The case studies show improved model performance even when the mechanical system lacks labels over historical data. By deploying these data-driven models, we are able to reduce the energy consumption of a chiller plant and to detect boiler failures accurately. Furthermore, we not only mitigate the different value ranges and time lags among different machines in mechanical systems, but also exploit the causal mechanisms among time series data to transfer knowledge from the source domain to the target domain.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [2] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann (2013) Unsupervised domain adaptation by domain invariant projection. In ICCV, pp. 769–776.
  • [3] K. M. Borgwardt, A. Gretton, M. J. Rasch, H. P. Kriegel, B. Schölkopf, and A. J. Smola (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22 (14), pp. e49.
  • [4] G. E. Box and D. A. Pierce (1970) Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association 65 (332), pp. 1509–1526.
  • [5] Y. Chikahara and A. Fujino (2018) Causal inference in time series via supervised learning. In IJCAI, pp. 2042–2048.
  • [6] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
  • [7] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio (2015) A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980–2988.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [9] S. G (2009) An empirical analysis of domain adaptation algorithms for genomic sequence analysis.
  • [10] Y. Ganin and V. S. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180–1189.
  • [11] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17 (59), pp. 1–35.
  • [12] P. Germain, A. Habrard, F. Laviolette, and E. Morvant (2016) A new PAC-Bayesian perspective on domain adaptation. In ICML, pp. 859–868.
  • [13] M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang (2017) Scatter component analysis: a unified framework for domain adaptation and domain generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (7), pp. 1414–1430.
  • [14] S. S. Girija (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. Software available from tensorflow.org.
  • [15] M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf (2016) Domain adaptation with conditional transferable components. In ICML, pp. 2839–2848.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [17] C. W. Granger (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pp. 424–438.
  • [18] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [19] J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola (2007) Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pp. 601–608.
  • [20] R. J. Hyndman and G. Athanasopoulos (2012) Forecasting: principles and practice.
  • [21] D. Jung, Z. Zhang, and M. Winslett (2017) Vibration analysis for IoT enabled predictive maintenance. In ICDE, pp. 1271–1282.
  • [22] P. Koniusz, Y. Tas, and F. Porikli (2016) Domain adaptation by mixture of alignments of second-or higher-order scatter tensors. arXiv preprint arXiv:1611.08195.
  • [23] L. Li, S. Tang, L. Deng, Y. Zhang, and Q. Tian (2017) Image caption with global-local attention. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [24] P. M (2018) Cross-domain sentiment classification with target domain specific information.
  • [25] J. Pearl (2002) Causality: models, reasoning, and inference. IIE Transactions 34 (6), pp. 583–589.
  • [26] S. Purushotham, W. Carvalho, T. Nilanon, and Y. Liu (2016) Variational recurrent adversarial deep domain adaptation.
  • [27] G. Qi, J. Tang, J. Wang, and J. Luo (2017) Mixture factorized Ornstein-Uhlenbeck processes for time-series forecasting. In SIGKDD, pp. 987–995.
  • [28] H. Sak, A. W. Senior, and F. Beaufays (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Interspeech, pp. 338–342.
  • [29] Sarma (2018) Domain adapted word embeddings for improved sentiment classification.
  • [30] B. Sun, J. Feng, and K. Saenko (2016) Return of frustratingly easy domain adaptation. In AAAI, pp. 2058–2065.
  • [31] X. Sun (2008) Assessing nonlinear Granger causality from multivariate time series. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 440–455.
  • [32] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176.
  • [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [34] H. D. Vu, K. Chai, B. Keating, N. Tursynbek, B. Xu, K. Yang, X. Yang, and Z. Zhang (2017) Data driven chiller plant energy optimization with domain knowledge. In CIKM, pp. 1309–1317.
  • [35] X. Wang, L. Li, W. Ye, M. Long, and J. Wang (2019) Transferable attention for domain adaptation.
  • [36] K. Zhang, M. Gong, and B. Schölkopf (2015) Multi-source domain adaptation: A causal view. In AAAI, pp. 3150–3157.
  • [37] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang (2013) Domain adaptation under target and conditional shift. In ICML, pp. 819–827.