HierMUD
MUD: a Multi-task Unsupervised Domain adaptation framework
view repo
Monitoring bridge health using vibrations of drive-by vehicles has various benefits, such as no need for directly installing and maintaining sensors on the bridge. However, many of the existing drive-by monitoring approaches are based on supervised learning models that require labeled data from every bridge of interest, which is expensive and time-consuming, if not impossible, to obtain. To this end, we introduce a new framework that transfers the model learned from one bridge to diagnose damage in another bridge without any labels from the target bridge. Our framework trains a hierarchical neural network model in an adversarial way to extract task-shared and task-specific features that are informative to multiple diagnostic tasks and invariant across multiple bridges. We evaluate our framework on experimental data collected from 2 bridges and 3 vehicles. We achieve accuracies of 95 for localization, and up to 72 improvements from baseline methods.
READ FULL TEXT VIEW PDFMUD: a Multi-task Unsupervised Domain adaptation framework
Bridges are key components of transportation infrastructure. Aging bridges all over the world pose challenges to the economy and public safety. According to the 2016 National Bridge Inventory of the Federal Highway Administration, 139,466 of 614,387 bridges in the U.S. are structurally deficient or functionally obsolete (asce). The state of aging bridges demands researchers to develop efficient and scalable approaches for monitoring a large stock of bridges.
Currently, bridge maintenance is based on manual inspection (hartle2002bridge), which is inefficient, incurs high labor costs, and fails to detect damages in a timely manner. To address these challenges, structural health monitoring techniques (sun2020review), where structures are instrumented using sensors (e.g., strain gauge, accelerometers, cameras, etc.) to collect structural performance data, have been developed to achieve continuous and autonomous bridge health monitoring (BHM). Yet, such sensing methods are hard to scale up as they require on-site installation and maintenance of sensors on every bridge and cause interruptions to regular traffic for running tests and maintaining instruments (yang2020state).
To address the drawbacks of current BHM, drive-by BHM approaches were proposed to use vibration data of a vehicle passing over the bridge for diagnosing bridge damage. Drive-by BHM is also referred to as the vehicle scanning method (yang2020vehicle) or indirect structural health monitoring (cerda2014indirect; lederman2014damage; liu2020diagnosis). Vehicle vibrations contain information about the vehicle-bridge interaction (VBI) and thus can indirectly inform us of the dynamic characteristics of the bridge for damage diagnosis (yang2004extracting; liu2020diagnosis). This is a scalable sensing approach with low-cost and low-maintenance requirements because each instrumented vehicle can efficiently monitor multiple bridges. Also, there is no need for direct installations and on-site maintenance of sensors on every bridge.
Previous drive-by BHM focuses on estimating bridge modal parameters (e.g., fundamental frequencies
(yang2004extracting; lin2005use; liu2019expectation), mode shapes (yang2014constructing; malekjafarian2014identification), and damping coefficients (gonzalez2012identification; liu2019expectation)) that can be used for detecting and localizing bridge damage. However, since the VBI system is a complex physical system involving many types of noise and uncertainties (e.g., environmental noise, vehicle operational uncertainties, etc.), such modal analysis methods for drive-by BHM are susceptible to them (liu2020diagnosis). This makes the modal parameters’ estimation inaccurate and limits the ability of drive-by BHM to diagnose bridge damage (e.g., localize and quantify damage severity).More recently, data-driven approaches use signal processing and machine learning techniques to extract informative features from the vehicle acceleration signals
(nguyen2010multi; mcgetrick2013parametric; liu2019expectation; liu2020scalable; eshkevari2020signal; cerda2014indirect; lederman2014damage; liu2020diagnosis; liu2020damage; mei2019indirect; sadeghi2020modal). The extracted features are more robust to noise, enabling more sophisticated diagnoses such as damage localization and quantification. However, such data-driven approaches generally use supervised learning models developed on available labeled data (i.e., a set of bridges with known damage labels) that is expensive, if not impossible, to obtain. This labeled data requirement is further exacerbated by having multiple diagnostic tasks (e.g., damage detection, localization, and quantification). Furthermore, the standard supervised learning-based approaches learned using vehicle vibration data collected from one bridge are inaccurate for monitoring other bridges because data distributions of the vehicle passing over different bridges are shifted. Having to re-train for each new bridge in multiple bridge monitoring is time-consuming and costly. Therefore, for multiple bridge monitoring with multiple diagnostic tasks, one needs to transfer or generalize the multi-task damage diagnostic model learned from one bridge to other bridges, in order to eliminate the need for requiring training labeled data from every bridge in multiple tasks.To this end, we introduce HierMUD, a new Hierarchical Multi-task Unsupervised Domain adaptation framework that transfers a model learned from one bridge data to predict multiple diagnostic tasks (including damage detection, localization, and quantification) in another bridge in an unsupervised way. Specifically, HierMUD makes a prediction for the target bridge, without any labels from the target bridge in any of the tasks. We achieve this goal by extracting features that are 1) informative to multiple tasks (i.e., task-informative) and 2) invariant across the source and target domains (i.e., domain-invariant).
Our framework is inspired by an unsupervised domain adaptation (UDA) approach that has been developed in the machine learning community to address the data distribution shift between two different domains, namely source and target domains (zhang2019transfer; jiang2007instance; pan2010domain; cao2018unsupervised; luo2020unsupervised; xu2021phymdan)
. In our work, we denote the bridge with labeled data as “Source Domain,” and the new bridge of interest without any labels as “Target Domain.” UDA focuses on the unsupervised learning tasks in the target domain, which transfers the model learned using source domain data and labels (e.g., vehicle vibration data with the corresponding bridge damage labels) to predict tasks in the target domain without labels (e.g., vehicle vibration data from other bridges without knowing damage labels). In particular, domain adversarial learning algorithm extracts a feature that simultaneously maximizes the damage diagnosis performance for the source domain based on source domain labeled data while minimizing the performance of classifying which domain the feature came from using both source and target domain data
(saito2018maximum; ganin2016domain; zhang2019bridging; xu2021phymdan). This algorithm can better match complex data distributions between the source and target domains by learning feature representations through neural networks, when compared to other conventional UDA approaches.However, directly matching different domains’ data distributions over each diagnostic task separately through domain adversarial learning limits the overall performance of drive-by BHM, which has multiple diagnostic tasks. This is because these tasks are coupled/related with each other and share damage-sensitive information. The prediction performance of each task depends on that of other coupled tasks.
Therefore, our algorithm integrates multi-task learning (MTL) with domain-adversarial learning to fuse information from different bridges and multiple diagnostic tasks for effectively improving diagnostic accuracy and scalability of drive-by BHM. MTL is helpful because it simultaneously solves multiple learning tasks to improve the generalization performance of all the tasks, when compared to training the models for each task independently or sequentially (i.e., independent or sequential task learning) (caruana1997multitask; luong2015multi; augenstein2018multi; dong2015multi; hashimoto2016joint; yang2016deep; misra2016cross; jou2016deep; wan2019bayesian; liu2019damage).
Yet, when combining MTL and domain-adversarial learning, two particular research challenges exist:
Distinct distribution shift: When we have multiple damage diagnostic tasks, the distribution shift problem becomes more challenging than having only one task because different tasks have distinct shifted distributions between the source and target domains. Therefore, it is important to develop efficient optimization strategies to find an optimal trade-off for matching different domains’ distributions over multiple tasks.
Varying task difficulty: The learning difficulties of different tasks vary with the complexity of mappings from the vehicle vibration data distribution to the damage label distributions. Some tasks (e.g., damage quantification task) would have highly non-linear mappings between data and label distributions and, therefore, be more difficult to learn than other tasks (i.e., hard-to-learn). Therefore, we need to distribute learning resources (e.g., representation capacities) according to the learning difficulty of tasks.
To address the distinct distribution shift challenge, we introduce a new loss function that prioritizes domain adaptations for tasks having more severe distribution shifts between different domains through a soft-max objective function. To formulate this new loss function, we first derive a new generalization risk bound for multi-task UDA, by optimizing which the new loss function is designed. Specifically, we minimize our loss function to jointly optimize three components: 1) feature extractors, 2) task predictors, and 3) domain classifiers. The parameters of task predictors are optimized to predict task labels in the source domain training set, which ensures that the extracted features are task informative. The parameters of feature extractors are optimized with the domain classifiers in an adversarial way, such that the best trained domain classifier cannot distinguish which domains the extracted features come from. During the optimization, our new loss function adaptively weighs more on minimizing the distribution divergence of tasks having more shifted distributions between different domains. In this way, the model is optimized to automatically find a trade-off for matching different domains’ distributions over multiple tasks.
Further, to address the varying task difficulty challenge, we develop hierarchical feature extractors to allocate more learning resources to hard-to-learn tasks. We first split the multiple tasks into easy-to-learn tasks (e.g., damage detection and localization) and hard-to-learn tasks (e.g., damage quantification) based on their degrees of learning difficulty. Specifically, we model the learning difficulty of each task to be inversely proportional to performance in the source domain (e.g., damage localization and quantification accuracy in supervised learning settings). Then, the hierarchical feature extractors learn two-level features: task-shared and task-specific features. To achieve high prediction accuracy for multiple tasks without creating a complex model, we extract task-shared features from input data for easy-to-learn tasks and then extract task-specific features from the task-shared features for only the hard-to-learn tasks. In this way, we allocate more learning resources (learning deeper feature representations) to learn hard-to-learn tasks, which improves the overall performance for all the tasks.
We evaluate our framework on the drive-by BHM application using lab-scale experiments with two structurally different bridges and three vehicles of different weights. In the evaluation, we train our framework using labeled data collected from a vehicle passing over one bridge to diagnose damage in another bridge with unlabeled vehicle vibration data. Our framework outperform five baselines without UDA, MTL, the new loss function, or hierarchical structure.
In summary, this paper has three main contributions:
We introduce HierMUD, a new multi-task UDA framework that transfers the model learned from one bridge to achieve multiple damage diagnostic tasks in another bridge without any labels from the target bridge in any of the tasks. To the best of our knowledge, this new framework is the first domain adaptation framework for multi-task bridge monitoring. We have since released a PyTorch
(paszke2017automatic) implementation of HierMUD at https://github.com/jingxiaoliu/HierMUD.We derive a generalization risk bound that provides a theoretical guarantee to achieve domain adaptation on multiple learning tasks. This work bridges the gaps between the theories and algorithms for multi-task UDA. Based on this bound, we design a new loss function to find a trade-off for matching different domains’ distributions over multiple tasks, which addresses the distinct distribution shift challenge.
We develop a hierarchical architecture for our multi-task and domain-adversarial learning algorithm. This hierarchical architecture ensures that the framework accurately and efficiently transfers the model for predicting multiple tasks in the target domain, which addresses the varying task difficulty challenge.
The remainder of this paper is divided into six sections. In section 2, we study the MTL and data distribution shift challenges in the drive-by BHM application. Section 3 derives the generalization risk bound for multi-task UDA. Section 4 presents our HierMUD framework, which includes the description of our framework, loss function, and algorithm design. Section 5 describes the evaluation of our framework on the drive-by BHM application, following by Section 6 that presents the evaluation results. Section 7 concludes our work and provides discussions about future work.
In this section, we describe the physical insights that enable our drive-by BHM and explain the associated challenges in achieving this scalable BHM approach. We first characterize the structural dynamics of the VBI system. Next, we study the multi-task learning and data distribution shift challenges, respectively, by proving the error propagation between multiple damage diagnostic tasks and characterizing the shifting of the joint distribution of vehicle vibration and damage labels.
To provide physical insights of drive-by BHM, we model the VBI system, as shown in Figure 1, as a sprung mass (representing the vehicle) traveling with a constant speed on a simply supported beam (representing the bridge). We assume the beam is of the Euler-Bernoulli type with a constant cross section. We also assume that there is no friction force between the ‘wheel’ and the beam. The damage is simulated by attaching a mass (magnitude/severity level: ) at location () on the beam. The added mass changes the mass of the bridge and its dynamic characteristics (malekjafarian2015review). Modifying the weight of the attached mass is a non-destructive way of creating physical changes to the VBI system to mimic structural damage (DERAEMAEKER20181; Taddei2018; doi:10.1177/1475921717699375).
In our prior works (liu2019damage; liu2020diagnosis)
, we have derived the theoretical formulation of the VBI system in the frequency domain, which is summarized in the following paragraphs.
The equations of motion for the vehicle and bridge in the time domain are first derived as
(1) | ||||
(2) | ||||
where , and are the mass, stiffness, damping coefficient, and total displacement of the vehicle, respectively; , and
are the density, sectional area, Young’s modulus, moment of inertia, and the static displacement of the bridge, respectively;
is the Dirac delta function; and and are the dynamic displacements of vehicle and bridge, respectively.Then the -th mode frequency response of the vehicle’s acceleration is
(3) | ||||
where is the -th mode frequency response function (FRF) of the bridge element acceleration at the location ; and and are the -th mode FRF of the vehicle acceleration and the FRF of the Dirac delta function, respectively.
From Equation (3), we obtain the following important physical understandings of the drive-by vehicle vibration:
Non-linear property: Vehicle acceleration () is a high-dimensional signal that has a complex non-linear relationship with bridge properties () and damage parameters (). Thus, it is difficult to infer damage states for different bridges by directly analyzing the raw vehicle signals. It is important to model features that can represent the non-linearity of the VBI system.
Coupled diagnostic tasks: Different damage locations () and severity levels () only vary the term . Let’s define the damage information as , representing structural dynamic characteristic changes due to damage. The damage localization and quantification tasks are coupled with each other through the same damage information , and thus the estimation of them depends on each other’s estimation.
We incorporate these physical insights of the VBI system to develop our multi-task UDA framework. The non-linear property instructs us to use non-linear models or extract non-linear features from the vehicle vibrations for estimating bridge damage. In this work, we use a neural network-based model to non-linearly extract task-informative features from vehicle vibration data.
Moreover, the coupled tasks property of the VBI system informs us to use or extract the shared damage information (e.g., task-shared features) instead of independently learning multiple diagnostic tasks. This is further discussed in the following subsection.
The shared information among multiple tasks can be learned simultaneously from multiple tasks or learned sequentially from one task to the next. In this section, we illustrate that simultaneous learning (i.e., MTL) is more accurate than sequential learning through a theoretical study of the VBI system. The study shows that the sequential learning method results in a significant error propagation from the previous task to the next.
For instance, if we localize and quantify the bridge damage sequentially (localize the damage first and then quantify the severity of the damage at the obtained damage location), the estimation of damage severity is , where is the estimated damage location. Then, the propagation of error from the damage location estimation to the severity estimation is
(4) |
where , , are errors of severity, location, and damage information estimations, respectively. and in the denominators are smaller than 1, which makes the estimation error of every large as their values decrease. Especially, when the damage is close to the bridge supports (i.e., ), it leads to and . Thus, damage location estimation error propagates, which results in a very inaccurate estimation of damage severity level.
To this end, we solve multiple tasks simultaneously, which can improve the overall accuracy by minimizing error propagation and learning the shared information (e.g., task-shared feature representations) from the coupled tasks.
Besides simultaneously learning multiple tasks, a scalable drive-by BHM approach needs to work for multiple domains (i.e., bridges). The following subsection discusses the data distribution shift challenge for drive-by monitoring of multiple bridges.
The joint distributions of vehicle vibrations and damage labels are shifted as the vehicle passes by different bridges. If we consider the process of the VBI system as a stochastic process, according to Equation 3, the joint distributions of the vehicle accelerations () and damage labels ( or ) changes non-linearly with bridge properties (e.g., ). An example of the shifting of these joint data distributions for different bridges is visualized in a low-dimensional space in Figure 2. It shows a two dimensional t-Distributed Stochastic Neighbor Embedding (tSNE) (van2008visualizing) visualization of vehicle vibration data distributions for Bridges#1 and #2. Each vibration signal is collected from a vehicle passing over an 8-feet bridge model with a damage at the location () of 2 feet, 4 feet, or 6 feet. The data for the two structurally different bridges (Bridge#1 and #2) are represented by filled and unfilled markers, respectively. More details of this experiment and dataset are in Evaluation Section. We can observe from Figure 2 that directly applying the model learned from one bridge’s dataset (e.g., Bridge#1) to localize damage on the other bridge (e.g., Bridge#2) results in a low prediction accuracy because of the joint distribution shift.
To address the distribution shift challenge and achieve a scalable drive-by BHM that is invariant across multiple bridges (i.e., it can predict damage without requiring training data from every bridge), we introduce a new multi-task UDA approach. In the next section, we first investigate the multi-task UDA problem theoretically through deriving a generalization risk bound.
In this section, we derive the upper bound of the generalization risk for multi-task UDA problems to investigate the theoretical guarantee of its performance on target domain unseen data. The generalization risk (or error) of a model is the difference between the empirical loss on the training set and the expected loss on a test set, as defined in statistical learning theory
(jakubovitz2019generalization). In other words, it represents the ability of the trained model to generalize from the training dataset to a new unseen dataset. In our problem, the generalization risk is defined to represent how accurately a classifier trained using source domain labeled data and target domain unlabeled data predicts class labels in the target domain. Therefore, deriving the upper bound of the generalization risk provides insights on how to develop learning algorithms to efficiently optimize it.We first derive a generalization risk bound for UDA and then integrate it with the risk bound for MTL. Next, we characterize the newly derived generalization risk bound to provide insights to our multi-task UDA problem.
We first derive a new generalization risk bound for UDA by representing the original data distribution in a feature space, which has been ignored by the existing risk bounds for UDA (ganin2016domain; zhang2019bridging; zhao2018adversarial). Having a feature space enables the modeling of task-shared feature representation when we have multiple tasks. This results in a tighter generalization risk bound for multi-task UDA than independently estimating each task’s generalization risk bound (maurer2016benefit). Yet, this feature space requires us to estimate the discrepancy between the marginal feature distributions of the source and target domains for obtaining the generalization risk bound, which is introduced in this section.
We consider a classification task that labels input as belonging to different classes . We also consider mappings
where , , and
are random variables of input, feature representation, and class label, which are taken from the input, feature, and output space
, , and , respectively. The function is a -dimension feature transformation, and the function is a hypothesis on the feature space (i.e., a labeling function). Then, we have a predictor , that is, , for every .Further, we define a domain as a distribution on the input space and the output space . UDA problems involve two domains, a source domain and a target domain . We denote , , , and as the marginal data () and label () distributions in the source () and target () domains, respectively. Note that we also have feature representation distributions, and , in the source and target domains. Mathematically, an unsupervised domain adaptation algorithm has independent and identically distributed (i.i.d.) labeled source samples drawn from and i.i.d. unlabeled target samples drawn from , as shown below:
where and are the number of samples in the source and target domains, respectively. The goal of UDA is to learn and with a low target domain risk under distribution , which is defined as: where is the ground truth labeling function and .
Since we do not have labeled data in the target domain, we cannot directly compute the target domain risk. Therefore, the upper bound of the target domain risk is estimated by the source domain risk and the discrepancy between the marginal data distributions of the source and target domains, and . The discrepancy between and is quantified through the -divergence (ben2010theory) that measures distribution divergence with finite samples of unlabeled data from and . It is defined as
where is a hypothesis space, and is the symmetric difference hypothesis.
Then, we can derive the generalization bound for UDA in Theorem 1.
Let be a hypothesis space on with VC dimension and be a hypothesis space on with VC dimension . If and are samples of size from and , respectively, and and follow distributions and , respectively,, then for any
with probability at least
, for every and :where
See Appendix. ∎
We prove in Theorem 1 that the upper bound of the target domain risk consists of five components:
The source domain risk:
which quantifies the error for estimating class labels in the source domain.
The minimal risk:
which quantifies the error for estimating class labels using the ideal joint hypothesis over the source and target domains. It is the smallest error we can achieve using the best predictor in the hypothesis set.
The empirical symmetric divergence between marginal feature distributions:
which quantifies the distribution difference between the source and target domain marginal feature distributions.
The supremum of empirical symmetric divergence set between marginal data distributions:
which quantifies the distribution difference between the source and target domain marginal data distributions.
The Big-O terms that measure the complexity of the estimation of divergence.
Next, in the following subsection, we derive a generalization risk bound for multi-task UDA problems by considering the feature space being the task-shared feature space for multiple tasks.
We first consider multiple classification tasks that label input as belonging to different classes , for , where is the total number of tasks. The mappings for this multi-task learning problem becomes
where , , and are the -th task’s random variables of input, feature representation, and class label, respectively, which are taken from the input, feature, and output space , , and , respectively. The function is a task-shared k-dimensional feature transformation, and the function is a task-specific hypothesis for the -th task on the feature space. For each task, we have a predictor . We define the task-averaged true risk under the joint distribution as
where is the ground truth labeling function for the -th task and for . We also define the task-averaged empirical risk as
where is the total number of samples in each task, which is assumed to be the same for each task, and is the indicator function. We define that are i.i.d. samples drawn from the joint distribution . For the -th task,
Further, we consider a multi-task UDA problem, which has labeled samples drawn from a joint source domain and unlabeled samples drawn from a joint target domain . The goal of multi-tasks UDA is to learn and with a low target domain task-averaged risk under the joint distribution , which is defined as:
Our generalization risk bound for multi-task UDA is built on our Theorem 1 by combining it with the risk bound for multi-task learning. The multi-task learning risk bound was introduced in the work of maurer2016benefit, which showed that the upper bound of the task-averaged risk consists of the task-averaged empirical risk, the complexity measure relevant to the estimation of the representation, and the complexity measure of estimating task-specific predictors. Specifically, with probability at least , where , in the draw of , it holds for every and every that
(5) | ||||
where is the Lipschitz constant for ; and are universal constants; is the Gaussian average that measures the empirical complexity relevant to the estimation of the feature representation; and
Now, by integrating the multi-task learning risk bound in Equation (5) with the unsupervised domain adaptation risk bound in Theorem 1, we obtain the following theorem:
Let and be probability measure on . Let be a hypothesis space on with VC dimension and be a hypothesis space on with VC dimension . Let . With probability at least in the draw of , (i.e., and for ), and that follow distributions and , it holds for every and every that
(6) | ||||
where
We show in Theorem 2 that the upper bound of task-averaged target domain risk contains seven components:
The source domain empirical risks:
the task-averaged minimal risks:
the complexity measure relevant to the estimation of the representation:
the complexity measure of estimating task-specific predictors:
the empirical symmetric divergence between marginal feature distributions:
the supremum of empirical symmetric divergence set (for multiple tasks) between marginal data distributions:
the Big-O complexity measures of the estimation of divergence.
Once we determine the hypothesis sets and , the task-averaged minimal risk and complexity terms are fixed (zhao2018adversarial). Therefore, we can minimize the target domain risk bound (Equation 6) by minimizing the sum of the source domain empirical risks, the empirical divergence between marginal data distributions, and the empirical divergence between marginal feature distributions:
(7) | ||||
To efficiently solve Equation (7) that minimizes the generalization risk bound for multi-task UDA, we interpret and characterize the bound in the following subsection.
We can learn two main insights of the new generalization risk bound for multi-task UDA from Equation (7):
Feature divergence minimization: Some UDA methods (e.g., zhang2019bridging; saito2018maximum) minimize the empirical symmetric divergences between marginal feature distributions (i.e., ) to make the task-specific classifiers (i.e., ) invariant across domains. However, these methods are not scalable as the number of tasks grows because they require every task-specific classifier to be domain-invariant. Therefore, our multi-task UDA approach avoids directly minimizing the feature divergence, which requires adapting classifiers of each task separately.
Data divergence minimization: Some UDA methods (e.g., ganin2016domain; long2015learning) minimize the empirical symmetric divergence between marginal data distributions (i.e., ) to extract domain-invariant features. Such methods would be successful and scalable if the feature distributions and are matched because in this case the empirical symmetric divergence between marginal feature distributions (i.e., ) would be also very small or even be zero. However, directly minimizing this supremum divergence is difficult and data inefficient under the MTL setting because different tasks have distinct distribution shifts between the source and target domains. Therefore, we introduce a new efficient optimization strategy to find an optimal trade-off for minimizing the data divergence over multiple tasks.
In summary, to develop an algorithm that is scalable to the number of tasks, we need to minimize the empirical divergence between marginal data distributions to learn a feature mapping that matches feature distributions between different domains. Therefore, we can rewrite Equation (7) as
(8) | ||||
For a classification problem, minimizing the first two terms in Equation (8) can be achieved by minimizing the cross-entropy loss between predicted labels and ground truth labels in source domain:
where is the number of classes for the -th tasks.
Further, if we consider are hypotheses independently drawn from the hypothesis class , we can write the last term in Equation (8) as
(9) |
without loss of generality. This means that we can minimize the maximum divergence between marginal feature distributions over all the tasks to achieve the minimization of the divergence term in Equation (8). An approximation of the empirical symmetric divergence between distributions is computed by learning a domain discriminator () that distinguishes samples from different domains (ganin2016domain):
(10) | ||||
To this end, there are three ways to minimize Equation (8):
Hard-max objective: directly minimizing the maximum divergence over all tasks:
(11) | ||||
Average objective: minimizing the average divergence:
(12) | ||||
Soft-max objective: minimizing a soft maximum of Equation (10):
(13) | ||||
The hard-max objective is data inefficient because the gradient of the max function is only non-zero for with the maximum divergence, and the algorithm only updates its parameters based on the gradient from one of the tasks. The average objective updates algorithm parameters based on the average gradient from all tasks. However, this objective considers each task as equally contributing to updating the algorithm parameters, which may not allocate enough computational and learning resources to optimize tasks with larger divergence. The soft-max objective combines the gradients from all the tasks and adaptively assigns the loss of task that have a larger divergence with a heavier weight (zhao2018adversarial). In this way, the model automatically applies larger gradient magnitudes to tasks having more shifted distributions between different domains. As a result, we propose to use the soft-max objective (Equation 13) to optimize the multi-task UDA problem.
We now proceed to introduce our HierMUD framework that transfers the model learned from a source domain to predict multiple tasks on a target domain without any labels from the target domain in any of the tasks. In our drive-by BHM application, the two domains are vibration data and damage information collected from a vehicle passing over two structurally different bridges, and the multiple learning tasks are damage diagnostic tasks, such as damage detection, localization, and quantification. The overview flowchart of our framework is shown in Figure 3. The framework contains three modules: 1) a data pre-processing module, 2) a multi-task UDA module, and 3) a target domain prediction module. In the following subsections, we present each module in detail.
The data pre-processing module contains two steps: data augmentation and initial processing. In the first step, to avoid over-fitting and data biases while providing sufficient information of each class, we conduct data augmentation on raw data by multiple procedures including adding white noise, randomly cropping or erasing samples. The data augmentation expands the size of the dataset and introduces data variability, which improves the robustness of the learned multi-task UDA model.
In the second step, we create the input to our multi-task UDA module, including the source domain data with the corresponding labels and the target domain data. Feature transforms are applied to the raw input data to provide information in other feature space. For example, a Fast Fourier Transform can be used to convert the signal from its original domain (time or space) to the frequency domain. Short-Time Fourier Transform (STFT) or wavelet transform can be used to convert the time or space domain signal to the time-frequency domain. Specifically, in our drive-by BHM application, we conduct data augmentation by adding white noise to vehicle vibration signals. Then, we compute the STFT of each vertical acceleration record of the vehicle traveling over the bridge to preserve the time-frequency domain information.
In this module, we introduce our hierarchical multi-task and domain adversarial learning algorithm (as shown in Figure 4) that exploits the derived generalization risk bound for multi-task UDA based on the theoretical study in the previous section. This algorithm integrates domain adversarial learning and hierarchical multi-task learning to achieve an optimal trade-off between domain invariance and task informativeness.
Our algorithm consists of three components: hierarchical feature extractors (orange blocks), task predictors (blue blocks), and domain classifiers (red blocks). Domain adversarial learning utilizes the domain classifier to minimize the domain discrepancy through an adversarial objective (e.g., Equation (10)) for training against the feature extractors, which encourages the extracted feature to be domain-invariant (zhang2019transfer). The architectures of each component are presented in the following paragraphs.
Hierarchical feature extractors. The hierarchical feature extractors extract domain-invariant and task-informative features. To ensure domain invariance, the parameters of the extractors are optimized with domain classifiers in an adversarial way to extract features that cannot be differentiated by the domain classifiers (i.e., domain-invariant) while the domain classifiers are optimized to best distinguish which domain the extracted features come from.
To ensure task informativeness, we implement hierarchical feature extractors that learn task-shared and task-specific feature representations for tasks with different learning difficulties. Inspired by human learning (e.g., kenny2012placing and kenny2014effectiveness) and the work of guo2018dynamic, we separate the total of tasks into easy-to-learn and hard-to-learn tasks based on the task difficulty, which is inversely proportional to the learning performance (e.g., prediction accuracy) in the source domain. In particular, we train, respectively, classifiers with the same model complexity using source domain data of the tasks, and obtain testing accuracy values, , for the tasks. A threshold is determined based on our domain knowledge and empirical observations of the problem. We then consider tasks with as easy-to-learn tasks, and tasks with as hard-to-learn tasks. For example, in our drive-by BHM application, damage detection and localization are considered as easy-to-learn tasks, and damage quantification is considered as a hard-to-learn task.
Furthermore, we implement two types of feature extractors: one task-shared feature extractor and task-specific feature extractors. We denote as the task-shared feature extractor with parameters , and as the task-specific feature extractor for the -th hard-to-learn task with parameters , where . The source and target domain data after being processed in the data pre-processing module, (), are input to the task-shared feature extractor to extract task-shared features, and , for the source and target domain, respectively. For each hard-to-learn task, task-specific features, and , are extracted from the task-shared features using the corresponding task-specific feature extractor.
The task-shared feature extractor is implemented as a deep convolutional neural network (CNN) that combines convolutional layers and pooling layers to extract feature representations. We utilize CNN to extract features because it has an excellent performance in understanding spatial hierarchies and structures of features in various resolutions
(goodfellow2016deep). Further, task-specific feature extractors are implemented as deep fully-connected neural networks, consisting of multiple fully-connected layers that map task-shared features to task-specific features.Task predictors. Task predictors are trained to ensure that the extracted features from the hierarchical feature extractors are task-informative. They are implemented as deep fully-connected neural networks that map features to task labels. There are task predictors for all the learning tasks. We denote as the task predictor for the -th task with parameters , where . In the training phase, the input to the task predictor of each easy-to-learn task is the task-shared feature in the source domain. The input to the task predictor of each hard-to-learn task is the corresponding task-specific feature in the source domain. Each predictor outputs the predicted source domain labels, , in each task.
Domain classifiers. Domain classifiers are trained to distinguish which domain the extracted features are from. We also have two types of domain classifiers: one task-shared domain classifier and task-specific domain classifiers. We denote as the task-shared domain classifier with parameters , and as the task-specific domain classifier for the -th hard-to-learn task with parameters , where . The task-shared domain classifier takes the task-shared features in all tasks from the source domain or the target domain as input and predicts if the feature sample comes from the source domain or not (i.e., a binary classification). Each task-specific domain classifier takes the task-specific features for each task from the source domain or the target domain as input and also classifies the feature sample into two classes (as the source or the target domain).
We implement the domain adversarial learning by back-propagation, inspired by ganin2016domain, using the gradient reversal layer (GRL). We use GRL because it can be easily incorporated into any existing neural network architecture that can handle high-dimensional signals and multiple learning tasks. In particular, each domain classifier is connected to the corresponding feature extractor via a GRL that multiplies the gradient by a negative constant during back-propagation updating. With GRL, feature extractors and domain classifiers are trained in an adversarial way, such that the extracted features are as indistinguishable as possible for even well-trained domain classifiers.
Domain classifiers are implemented as deep fully-connected neural networks that map feature to domain labels. One should note that the architecture of domain classifiers is simpler than that of task predictors to avoid overusing learning resources to train domain classifiers over task predictors, which could reduce task informativeness of the extracted features (ganin2016domain).
In this subsection, we present the loss function for our hierarchical multi-task and domain-adversarial learning algorithm, which minimizes the objective function in Equation (13).
After considering the hierarchical structure, we can rewrite the objective function of the optimization in Equation (13) as
(14) | ||||
where is the cross-entropy loss for the -th easy-to-learn task; is the number of classes for the -th task; is the cross-entropy loss for the -th hard-to-learn tasks;
Comments
There are no comments yet.