DeepMerge II: Building Robust Deep Learning Algorithms for Merging Galaxy Identification Across Domains

03/02/2021 · by A. Ćiprijanović, et al. · Fermilab

In astronomy, neural networks are often trained on simulation data with the prospect of being used on telescope observations. Unfortunately, training a model on simulation data and then applying it to instrument data leads to a substantial, potentially detrimental, decrease in model accuracy on the new target dataset. Simulated and instrument data represent different data domains, and for an algorithm to work in both, domain-invariant learning is necessary. Here we employ domain adaptation techniques, Maximum Mean Discrepancy (MMD) as an additional transfer loss and Domain Adversarial Neural Networks (DANNs), and demonstrate their viability to extract domain-invariant features within the astronomical context of classifying merging and non-merging galaxies. Additionally, we explore the use of Fisher loss and entropy minimization to enforce better in-domain class discriminability. We show that the addition of each domain adaptation technique improves the performance of a classifier when compared to conventional deep learning algorithms. We demonstrate this on two examples: between two Illustris-1 simulated datasets of distant merging galaxies, and between Illustris-1 simulated data of nearby merging galaxies and observed data from the Sloan Digital Sky Survey. The use of domain adaptation techniques in our experiments leads to an increase of target domain classification accuracy of up to ∼20%. With further development, these techniques will allow astronomers to successfully implement neural network models trained on simulation data to efficiently detect and study astrophysical objects in current and future large-scale astronomical surveys.

Code repository: DeepMergeDomainAdaptation (Fisher deep domain adaptation for astronomy: galaxy mergers)
1 Introduction

Studies of galaxy mergers are crucial for understanding the evolution of galaxies as astronomical objects, including their star formation rates, chemistry, particle acceleration, and other properties. They are equally important for cosmology, the understanding of structure formation, and the study of the evolution of matter in the universe. Being able to leverage large samples of merging galaxies and to connect the knowledge obtained from large-scale simulations and astronomical surveys will play an important role in these studies.
Standard methods for classifying merging galaxies, such as visual inspection (Lin et al., 2004) or the extraction of parametric measurements of structure, e.g. the Sérsic index (Sérsic, 1963), the Gini coefficient, M20 (the second-order moment of the brightest 20 percent of the galaxy's flux; Lotz et al., 2004), and CAS (Concentration, Asymmetry, Clumpiness; Conselice et al., 2003), can be time consuming, prone to biases, or require high-quality images. Due to these limitations, it has been shown that machine learning can greatly advance the study of merging galaxies (Ackermann et al., 2018; Snyder et al., 2019; Pearson et al., 2019; Ćiprijanović et al., 2020), improving both the quality of results and the speed of working with big datasets. Studies of merging galaxies using machine learning can greatly benefit from models trained on simulated data, which can then be applied to newly observed images from present and future large-scale surveys.
Simulated and observed images have different origins and represent different data domains. In this case, labeled simulation images represent the source domain we start from, while observed data (often unlabeled) is the target domain. While images produced by simulations are made to mimic real observations from a particular telescope, unavoidable small differences can cause a model trained on simulated images to perform substantially worse when applied to real data. This substandard performance was demonstrated directly in the case of merging galaxies by Ćiprijanović et al. (2020). The authors show that even when the only difference between two merging galaxy datasets is the inclusion of noise, convolutional neural networks (CNNs) trained on one dataset cannot classify the other dataset at all: the classification accuracy for in-domain images was high, while the accuracy for out-of-domain images was equivalent to random guessing. Additionally, Pearson et al. (2019) use a dataset from the EAGLE simulation (Schaye et al., 2015), made to mimic Sloan Digital Sky Survey (SDSS) observations, together with real SDSS images (Lintott et al., 2008; Darg et al., 2010). Their work provides further evidence that a classifier trained on one dataset has much lower accuracy when classifying the other dataset, with the classifier trained on real SDSS images performing particularly poorly when classifying EAGLE simulation images. These two examples are indicative of a great need for more sophisticated deep learning methods in cross-domain studies in astrophysical contexts.
An important area of deep learning research is the development of Domain Adaptation (DA) techniques (Csurka, 2017; Wang and Deng, 2018; Wilson and Cook, 2020). These allow a model to learn the invariant features shared between domains and align the extracted latent feature distributions, so that the model can find a decision boundary that distinguishes between classes in multiple domains at the same time. One group, divergence-based DA methods, involves finding and minimizing some divergence criterion between the source and target data distributions. Among the best-known methods are Maximum Mean Discrepancy (MMD; Gretton et al. (2012a)), Correlation Alignment (CORAL; Sun and Saenko (2016); Sun et al. (2016)), Contrastive Domain Discrepancy (CDD; Kang et al. (2019)), and the Wasserstein metric (Shen et al., 2018). Adversarial-based DA methods, on the other hand, use either generative models (Liu and Tuzel, 2016) to create synthetic target data related to the source domain, or simpler models that utilize a domain-confusion loss (Ganin et al., 2016), which measures how well the model distinguishes between different data domains.
In this paper we employ two different domain adaptation techniques, Maximum Mean Discrepancy (MMD) and domain adversarial training with a domain-confusion loss, to improve cross-domain applications of deep learning models to the problem of distinguishing between merging and non-merging galaxies. Maximum Mean Discrepancy works by minimizing a distance measure between the mean embeddings of the two domain distributions in latent feature space (Gretton et al., 2012a); it is applied to standard classification networks as a transfer loss. Domain adversarial neural networks (DANNs; Ganin et al. (2016)) use adversarial training between a label classifier, which distinguishes mergers from non-mergers, and a domain classifier, which classifies the source and target domain of images. This kind of training employs a gradient reversal layer within the domain classifier branch, thereby maximizing the loss in this branch and leading to the extraction of domain-invariant features from both sets of images. Following methods from Zhang et al. (2020), we also add Fisher loss and entropy minimization (Grandvalet and Bengio, 2005), which can be used as additional losses with either MMD or domain adversarial training, to improve the overall performance of the classifier. Both of these loss functions enforce additional discriminability between the classes in the source (Fisher loss) and target (entropy minimization) domains by producing more compact classes in the latent feature space.


We test two networks to compare results across architectures: DeepMerge, a simple convolutional network for classification of galaxies presented in Ćiprijanović et al. (2020), and the more complex, well-known ResNet18 (He et al., 2015). We demonstrate both methods on a dataset similar to the one from Ćiprijanović et al. (2020), using simulated distant merging galaxies from Illustris-1 (Vogelsberger et al., 2014) at redshift z = 2, both without (source) and with (target) the addition of random sky shot noise to mimic observations from the Hubble Space Telescope.
Additionally, we test these methods on a harder and more realistic example, where the source domain includes simulated galaxies from the final (z = 0) snapshot of the Illustris-1 simulation (Vogelsberger et al., 2014), made to mimic SDSS observations, and the target domain consists of real SDSS images of merging galaxies (Lintott et al., 2008; Darg et al., 2010). These two domains exhibit a much larger discrepancy, and simply applying MMD or adversarial training does not perform well. We demonstrate that combining MMD with transfer learning from the model trained on the first dataset of distant merging galaxies can solve this harder domain adaptation problem.


With the use of the domain adaptation techniques mentioned above, we manage to increase the target domain classification accuracy by up to ∼20% in our experiments, which allows the models to be successfully used in both domains. It is our hope that the use and continued development of these techniques will allow astronomers and cosmologists to develop deep learning algorithms that can combine information from simulations and real data, or combine observations from different telescopes.
The remainder of the paper is structured as follows: In Section 2, we introduce and explain the domain adaptation methods used in this paper. We explain the neural network architectures we use in Section 3. In Section 4, we give details about the images we use in our experiments and talk more about the experimental setup in Section 5. Finally, our results are given in Section 6, followed by a discussion in Section 7.

2 Methods

Deep learning is already bringing advances in astronomy and survey science, as in other academic fields and industry. Many astronomical applications require these models to perform well on new datasets: features learned from simulations must remain applicable to data the model was not initially trained on, including newly available observed data and cross-telescope applications. Since labelling new data is slow and prone to errors, retraining these neural networks on new datasets in order to maintain high performance is often impractical. In these situations, a discriminative model that is able to transfer knowledge between training (source domain) and new data (target domain) is necessary. This can be achieved by using domain adaptation techniques, which extract invariant features between two domains, so that a neural network classifier trained on the source domain can also be applied successfully to a target domain. As previously underscored, this functionality is very useful in situations often found in astronomy, where the target domain is comprised of new observational data that has very few identified objects or is completely unlabeled. Here we will test several DA techniques that can be effective in the situation where the target domain is unlabeled. These techniques include adding a transfer loss to the widely adopted cross-entropy loss used for standard classification of images.

Cross-entropy loss is given as:

$$\mathcal{L}_c = -\sum_{k=1}^{M} y_k \log(p_k) \qquad (1)$$

where $k$ is a particular class and the total number of classes is $M$. The true label for class $k$ is given as $y_k$, and $p_k$ is the neural network assigned score, i.e. the output prediction from the last layer, for a given class $k$. Minimizing cross-entropy loss leads to output predictions approaching real label values, which results in an increase in classification accuracy.
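As a concrete illustration, the following minimal PyTorch sketch computes Eq. 1 for the two-class merger/non-merger problem (batch size and tensor contents are illustrative only):

```python
import torch
import torch.nn.functional as F

# Two-class merger/non-merger toy batch.
logits = torch.randn(8, 2)           # raw network outputs for 8 images
labels = torch.randint(0, 2, (8,))   # ground-truth class indices (0 or 1)

# PyTorch's cross_entropy applies log-softmax to the logits internally,
# so it implements Eq. 1 directly on raw scores.
loss_c = F.cross_entropy(logits, labels)
```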

The inclusion of a transfer loss allows DA techniques to impact the way the network learns via backpropagation. We explore two different transfer losses in this paper: Maximum Mean Discrepancy (MMD; Gretton et al. (2012b)) and the discriminator loss from a Domain Adversarial Neural Network (DANN; Ganin et al. (2016)). Additionally, we explore adding Fisher loss (Zhang et al., 2020), which enforces feature discriminability between classes in the source domain, and entropy minimization, which pushes each target sample toward one of the compacted and separated source classes (Grandvalet and Bengio, 2005), leading to better class separability in the target domain.

The resultant total classifier loss (Zhang et al., 2020) has multiple components:

$$\mathcal{L} = \mathcal{L}_c + \nu\,\mathcal{L}_f + \gamma\,\mathcal{L}_{em} + \lambda\,\mathcal{L}_t \qquad (2)$$

where we define $\mathcal{L}$, $\mathcal{L}_c$, $\mathcal{L}_f$, $\mathcal{L}_{em}$, $\mathcal{L}_t$ as total loss, classifier loss, Fisher loss, entropy minimization, and transfer loss, respectively. The contribution of these additional losses can be weighted using the weights $\nu$, $\gamma$, and $\lambda$. Further details about the different losses used are given below.
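For orientation, the weighted combination of Eq. 2 can be written as a one-line helper; the component losses are computed as described in the subsections below, and the default weight values here are placeholders rather than the values used in our experiments:

```python
def total_loss(loss_c, loss_f, loss_em, loss_t, nu=1.0, gamma=1.0, lam=1.0):
    """Weighted sum of Eq. 2: classifier + Fisher + entropy-minimization
    + transfer losses. Default weights are placeholders."""
    return loss_c + nu * loss_f + gamma * loss_em + lam * loss_t
```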

2.1 Transfer Loss

The transfer loss $\mathcal{L}_t$ is calculated from the DA technique, whose goal is to decrease the discrepancy between the source and the target domains. This involves representing data from both domains in a higher-dimensional latent feature space. In this paper, we explore the use of both MMD and domain adversarial training as transfer criteria. MMD frames the domain problem in terms of high-dimensional statistics and involves calculating the distance between the mean embeddings of the source and target domain distributions. In the case of adversarial training, the domain discrepancy problem is addressed by adopting a DANN, a neural network that seeks to find the common feature space between the source and target distributions by jointly minimizing the training loss in the source domain while maximizing the loss of the domain classifier.

We will denote the source and target domains as $\mathcal{D}_s$ and $\mathcal{D}_t$, respectively. Source domain images are labeled, so we have pairs of images and labels $(x_s, y_s)$, while in the case of the target domain we have unlabeled images $x_t$. Images from both domains are associated with domain labels: $d = 0$ for the source domain and $d = 1$ for the target domain.

2.1.1 Maximum Mean Discrepancy (MMD)

Maximum Mean Discrepancy (MMD) is a statistical technique that calculates a nonparametric distance between the mean embeddings of the source and target probability distributions using the $L^2$ norm. Following Smola et al. (2007) and Gretton et al. (2012a), we designate the source probability distribution as $p$ and the target probability distribution as $q$. It is possible to estimate the densities of $p$ and $q$ from the observed source and target data using kernel methods, but this estimation, which is often computationally expensive and introduces bias, is unnecessary in practice (Pan et al., 2011). Instead, we use kernel methods to determine their means for subtraction instead of estimating the full distributions:

$$\mathrm{MMD}(\mathcal{F}, p, q) = \sup_{f \in \mathcal{F}} \big( \mathbb{E}_{x_s \sim p}[f(x_s)] - \mathbb{E}_{x_t \sim q}[f(x_t)] \big) \qquad (3)$$

where $\mathrm{MMD}(\mathcal{F}, p, q)$ denotes the kernel distance as a proxy for discrepancy, $x_s$ and $x_t$ are random variables drawn from $p$ and $q$ respectively, the function class $\mathcal{F}$ closely resembles the set of CDF functions in vector space with total variance less than one operating on the domain $\mathcal{X}$, and the supremum is the least element greater than or equal to every value of the chosen $f$, i.e. the max over the subset. By this definition, if $p = q$, then $\mathrm{MMD}(\mathcal{F}, p, q) = 0$ (Gretton et al., 2012b; Pan et al., 2011). If $p \neq q$, then there must exist some $f \in \mathcal{F}$ such that the distance between the two means is maximized. This becomes an optimization problem in which the criterion aims to maximize the discrepancy by separating the two distributions as far as possible in some high-dimensional feature space. Kernel methods are well suited to this task since they are able to map means into higher-dimensional Reproducing Kernel Hilbert Spaces (RKHSs). Furthermore, this embedding linearizes the metric by mapping the input space into a feature vector space.

RKHSs possess several properties that facilitate the calculation of the MMD, including a property of norms that rescales the output of a function to fit within a unit ball, which greatly restricts the many possibilities of the function class $\mathcal{F}$, and a reproducing property that reduces calculations to the inner product of function outputs. For example, $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$, where $k(x, \cdot)$ is a kernel that has one argument fixed at $x$ and the second free. Performing this calculation in an RKHS means there is no need to explicitly calculate the mapping function $\phi$ that maps $p$ and $q$ into an RKHS feature space, due to the equivalence between $\phi(x)$ and $k(x, \cdot)$. Therefore, Eq. 3 can be re-expressed as an inner product with the kernel mean embeddings in an RKHS:

$$\mathrm{MMD}(\mathcal{F}, p, q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \langle \mu_p - \mu_q, f \rangle_{\mathcal{H}} \qquad (4)$$

where $\mu_p$ and $\mu_q$ are the source and target distribution's mean embeddings and $f$ is still bounded by the unit ball of the RKHS. Clearly, the inner product is maximized when $f$ points along $\mu_p - \mu_q$. Therefore, to maximize the mean discrepancy we take $f \propto \mu_p - \mu_q$, leaving us with the final formula:

$$\mathrm{MMD}^2(p, q) = \mathbb{E}_{x_s, x_s'}\big[k(x_s, x_s')\big] - 2\,\mathbb{E}_{x_s, x_t}\big[k(x_s, x_t)\big] + \mathbb{E}_{x_t, x_t'}\big[k(x_t, x_t')\big] \qquad (5)$$

where all kernel functions $k$ come from the simplification of the inner product $\langle \mu_p - \mu_q, \mu_p - \mu_q \rangle_{\mathcal{H}}$, following the logic of the equivalence between the mapping function and the kernel established previously. Here it is clear that the distance is expressed as the difference between the self-similarities of the source and target domains and their cross-similarity.

In practice, this is discretized to give the unbiased estimator $\widehat{\mathrm{MMD}}^2$:

$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_{s,i}, x_{s,j}) + \frac{1}{m(m-1)} \sum_{i \neq j} k(x_{t,i}, x_{t,j}) - \frac{2}{nm} \sum_{i,j} k(x_{s,i}, x_{t,j}) \qquad (6)$$

where $n$ and $m$ are the number of samples drawn from $p$ and $q$, respectively. While in practice $k$ can be considered a general kernel, we follow Zhang et al. (2020) and substitute $k$ with a linear combination of multiple Gaussian Radial Basis Function (RBF) kernels, to extend across a range of mean embeddings. The Gaussian RBF kernel can be written as:

$$k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) \qquad (7)$$

where $\|x - x'\|$ is the Euclidean distance norm (with $x$ being $x_s$ or $x_t$ depending on the domain), and $\sigma$ is the free parameter which determines the width of the kernel.

Finally, we use MMD as our transfer loss, $\mathcal{L}_t = \widehat{\mathrm{MMD}}^2$, effectively drawing the source and target distributions together in latent space as the network aims to minimize the loss via backpropagation.
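A minimal PyTorch sketch of this transfer loss, operating on batches of latent features from both domains; the set of RBF bandwidths here is an illustrative assumption rather than the kernel combination used in our experiments:

```python
import torch

def rbf_kernel(a, b, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Linear combination of Gaussian RBF kernels (Eq. 7) between rows of a and b."""
    d2 = torch.cdist(a, b) ** 2  # squared Euclidean distances between all pairs
    return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

def mmd2_unbiased(xs, xt):
    """Unbiased estimator of MMD^2 (Eq. 6); requires at least 2 samples per domain."""
    n, m = xs.size(0), xt.size(0)
    k_ss, k_tt, k_st = rbf_kernel(xs, xs), rbf_kernel(xt, xt), rbf_kernel(xs, xt)
    term_ss = (k_ss.sum() - k_ss.diagonal().sum()) / (n * (n - 1))  # drop i == j terms
    term_tt = (k_tt.sum() - k_tt.diagonal().sum()) / (m * (m - 1))
    return term_ss + term_tt - 2.0 * k_st.mean()
```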

2.1.2 Domain Adversarial Training

Domain adversarial training employs a Domain Adversarial Neural Network (DANN) to distinguish between the source and target domains (Ganin et al., 2016). DANNs are comprised of three parts: a feature extractor, a label predictor ($C$), and a domain classifier ($D$). The first two parts can be found in any Convolutional Neural Network (CNN): the feature extractor is built from convolutional layers which extract features from images, while the label predictor usually has fully-connected (dense) layers which output the class label. The last part, the domain classifier, is unique to DANNs. It is built from dense layers and optimized to predict the domain labels, and is added after the feature extractor as a parallel branch to the label predictor. It includes a gradient reversal layer which maximizes the loss for this branch of the neural network, thus achieving the adversarial objective of confusing the discriminator. When the domain classifier fails to distinguish latent features from the two domains, the domain-invariant features, i.e. the shared feature space, are found.

Compared to regular CNNs, which can learn the best features for classification, training DANNs can lead to a slight drop in classification accuracy for the source domain, because only the domain-invariant features are used. However, this also leads to an increase in classification accuracy in the new target domain, which is our objective. The total loss for a DANN is $\mathcal{L} = \mathcal{L}_c - \lambda\,\mathcal{L}_d$, where $\mathcal{L}_c$ is the loss for the image class label predictor $C$, while $\mathcal{L}_d$ is the loss from the domain classifier $D$. Fine-tuning the trade-off between these two quantities during the learning process is done with the regularization parameter $\lambda$. Domain classifier loss is calculated as:

$$\mathcal{L}_d = -\frac{1}{n}\sum_{i=1}^{n} \log\big(1 - \hat{d}_{s,i}\big) - \frac{1}{m}\sum_{j=1}^{m} \log \hat{d}_{t,j} \qquad (8)$$

where $\hat{d}_s$ and $\hat{d}_t$ are the output scores for the source domain and target domain labels, respectively. As with the class label predictor, the output scores for domain labels enter a cross-entropy loss, here computed on the domain labels. Finally, we can use the domain classifier loss as our transfer loss:

$$\mathcal{L}_t = \mathcal{L}_d \qquad (9)$$
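The gradient reversal layer at the heart of this scheme can be sketched in a few lines of PyTorch; the construction below is the standard one rather than a copy of our training code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, gradients scaled by
    -lambda on the backward pass, realizing the adversarial objective."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negated gradient flows back into the feature extractor;
        # the second return value is the (unused) gradient w.r.t. lambd.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

Features are passed through grad_reverse before entering the domain classifier, so minimizing the domain loss with respect to the domain classifier's weights simultaneously maximizes it with respect to the feature extractor.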

2.2 Fisher Loss

The addition of Fisher loss to the classification and transfer losses was demonstrated to further improve classification performance for source domain images in  Zhang et al. (2020). This improvement in source classification can aid the performance of both MMD and domain adversarial training transfer criteria. It is more generally applicable than Scatter Component Analysis (Ghifary et al., 2017), which also results in class compactness, but can only be used in conjunction with MMD and would not be practical for use with adversarial training methods  (Zhang et al., 2020).

Minimizing Fisher loss leads to within-class compactness and between-class separability in the latent feature space, which makes the distinction between classes easier in the source domain. Fisher loss produces a centroid for each class and effectively pushes labeled classes toward their respective centroids, thereby creating more tightly clustered classes that lie further apart from each other. It can be defined as a function $J$:

$$\mathcal{L}_f = J\big(\mathrm{tr}(S_w),\, \mathrm{tr}(S_b)\big) \qquad (10)$$

Here, $\mathrm{tr}(S_w)$ captures the intra-class dispersion of samples within each class, with $S_w = \sum_{j=1}^{M} \sum_{i} (F_{ij} - \mu_j)(F_{ij} - \mu_j)^{T}$, where $F_{ij}$ is the latent feature of the $i$-th sample of the $j$-th class and $\mu_j$ is the centroid of the $j$-th class (with $M$ being the total number of classes). On the other hand, $\mathrm{tr}(S_b)$, with $S_b = \sum_{j=1}^{M} (\mu_j - \mu_0)(\mu_j - \mu_0)^{T}$, describes the distances of all class centroids to the global center $\mu_0$. This global center is meant to be optimized such that the centroids of the classes are pushed as far apart as possible. Traces are used in the computation of the Fisher loss since they are computationally efficient.

To achieve the intended result of intra-class compactness and inter-class separability, Eq. 10 must be monotonically increasing with respect to the trace of the intra-class matrix $S_w$ and monotonically decreasing with respect to the trace of the inter-class matrix $S_b$. Thus, as the loss is minimized via backpropagation, the distances within classes will grow smaller and the distances between classes will grow larger. There are two simple ways one can construct a Fisher loss function obeying these constraints: the Fisher trace ratio, $J = \mathrm{tr}(S_w)/\mathrm{tr}(S_b)$, or the Fisher trace difference, $J = \mathrm{tr}(S_w) - \mathrm{tr}(S_b)$. In this paper, we have chosen to use the Fisher trace ratio as our Fisher loss.

As we mentioned for the total loss in Eq. 2, we can control the contribution of all additional losses using the weight parameter $\nu$. In the case of Fisher loss, we can separately weight the importance of the two scatter matrices using $\nu_w$ and $\nu_b$. Since we have chosen to use the trace ratio Fisher loss, this gives $\nu\,\mathcal{L}_f = \nu_w\,\mathrm{tr}(S_w) / \big(\nu_b\,\mathrm{tr}(S_b)\big)$.
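A minimal sketch of the trace-ratio Fisher loss over a batch of labeled source features (assuming every class is represented in the batch; the eps guard against division by zero is ours):

```python
import torch

def fisher_trace_ratio(features, labels, num_classes=2, eps=1e-8):
    """Trace-ratio Fisher loss tr(S_w)/tr(S_b) for a batch of source features."""
    centroids = []
    tr_sw = features.new_zeros(())
    for c in range(num_classes):
        fc = features[labels == c]
        mu_c = fc.mean(dim=0)                       # class centroid
        centroids.append(mu_c)
        tr_sw = tr_sw + ((fc - mu_c) ** 2).sum()    # intra-class dispersion tr(S_w)
    centroids = torch.stack(centroids)
    mu0 = centroids.mean(dim=0)                     # global center
    tr_sb = ((centroids - mu0) ** 2).sum()          # inter-class separation tr(S_b)
    return tr_sw / (tr_sb + eps)
```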

2.3 Entropy Minimization

Fisher loss can only be used in the source domain, since it requires ground-truth labels to calculate intra-class centroids and the between-class global center. However, Fisher loss can aid the discrimination between classes within the unlabeled target domain as well, through entropy minimization loss, which pushes examples from the target domain toward the source domain class centroids. Entropy minimization therefore ensures better generalization of the decision boundary between optimally discriminative and compact source domain classes to the target domain (Grandvalet and Bengio, 2005).

Entropy minimization loss is defined as:

$$\mathcal{L}_{em} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{M} p_k(x_{t,i}) \log p_k(x_{t,i}) \qquad (11)$$

where $p_k(x_t)$ is the classifier output for class $k$ and the true label is not needed. The above formula is based on Shannon's entropy (Shannon, 1948), which for a discrete probability distribution $p$ can be written as $H(p) = -\sum_k p_k \log p_k$.
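Entropy minimization reduces to a few lines, since only the classifier outputs on unlabeled target images are needed (a sketch, not our exact training code):

```python
import torch.nn.functional as F

def entropy_loss(target_logits):
    """Mean Shannon entropy of the softmax predictions on unlabeled target
    images (Eq. 11); no ground-truth labels are required."""
    p = F.softmax(target_logits, dim=1)
    log_p = F.log_softmax(target_logits, dim=1)  # numerically stable log-probabilities
    return -(p * log_p).sum(dim=1).mean()
```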

3 Neural network architectures

We present the performance of domain adaptation using the aforementioned techniques in two neural networks, the simpler DeepMerge architecture with 174,626 trainable parameters (Ćiprijanović et al., 2020) and the more complex, well-known ResNet18 with 22,484,866 trainable parameters (He et al., 2015), to compare results across architectures. We decided to use the smallest standard ResNet architecture in order to more easily tackle possible overfitting, given the small sizes of the merging galaxy datasets.

The DeepMerge network, first introduced by Ćiprijanović et al. (2020), is a simple CNN comprised of three convolutional layers (each followed by batch normalization, max pooling, and dropout) and three dense layers. In this paper, the dropout layers have been removed, so that the only regularization happens via L2 regularization through the weight decay parameter of the optimizer. Additionally, the last layer of the original DeepMerge network was updated to include two neurons rather than one. For more details about the architecture, see Table 3 in the Appendix.

ResNets were first proposed in the seminal paper by He et al. (2015) and have become one of the most widely used network architectures for image recognition. They are comprised of residual blocks; in the case of ResNet18, blocks of two 3x3 convolutional layers are followed by a ReLU nonlinearity. The chaining of these residual blocks enables the network to retain high training accuracy even with increasing network depth.

The domain classifier used in adversarial domain training to calculate the transfer loss comprises three dense layers, the first of which has the same dimension as the extracted features in the base network (either DeepMerge or ResNet18), such that these features form the input into the domain classifier. The second layer has 1024 neurons, followed by ReLU activation and dropout, and the third has one output neuron followed by Sigmoid activation, conveying the domain chosen by the network.
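A sketch of this head in PyTorch; the ReLU after the first layer and the dropout rate are assumptions, since those details are not fully specified above:

```python
import torch.nn as nn

def make_domain_classifier(in_features):
    """Domain classifier head following the description above."""
    return nn.Sequential(
        nn.Linear(in_features, in_features),  # first layer matches the feature dimension
        nn.ReLU(),                            # activation assumed
        nn.Linear(in_features, 1024),
        nn.ReLU(),
        nn.Dropout(p=0.5),                    # rate assumed
        nn.Linear(1024, 1),
        nn.Sigmoid(),                         # score interpreted as P(target domain)
    )
```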

Details about training the networks can be found in Appendix A. We also list all hyperparameters used for training in the different experiments in Table 4 and Table 5. Our parameter choices for each experiment were informed by hyperparameter searches using DeepHyper (Balaprakash et al., 2018, 2019).

4 Data

Here we present two dataset pairs: one for classifying distant merging galaxies (at z = 2) and one for nearby merging galaxies (at z = 0).

4.1 Simulation-to-Simulation: Distant Merging Galaxies from Illustris

It is often very difficult to obtain real-sky observational data of labeled mergers for deep learning models, especially at higher redshifts. Therefore, both the source and target domain of our distant merging galaxy dataset are simulated.

We use the same dataset as in Ćiprijanović et al. (2020), where the authors extract galaxies at redshift z = 2 from the Illustris-1 cosmological simulation (Vogelsberger et al., 2014). The objects in this dataset are labeled as mergers if they underwent a major merger (selected by a stellar mass ratio cut) within a time window around the moment the Illustris snapshot was taken. This means our merger sample includes both past mergers (which happened before the snapshot) and future mergers (which happened after the snapshot). In Ćiprijanović et al. (2020), images contain two filters, which mimic Hubble Space Telescope (HST) observations. In this paper we add a third (middle) filter to produce three-filter HST images (ACS F814W, NC F356W, WFC3 F160W). This allows us to use the images with the more complex ResNet18 architecture.

We produce two groups of images: source "pristine" images, which are convolved with a point-spread function (PSF), and target "noisy" images, which are convolved with the PSF and have added random sky shot noise. This sky shot noise produces a limiting surface brightness of 25 magnitudes per square arcsecond. More details about the dataset can be found in Ćiprijanović et al. (2020) and Snyder et al. (2019). The source and target domains contain the same galaxies, labeled as mergers and non-mergers, differing only in the added noise. All images share the same pixel dimensions. We divide these datasets into training, validation, and testing samples.

See Figure 1 for example images from this Illustris dataset: mergers are shown in the left column and non-mergers in the right column. The top row shows images from the source domain, while the middle row shows the target galaxies with the added noise. The bottom row shows the same group of top-row source images with logarithmic color mapping in order to make the galaxies more visible to the human eye.


Figure 1: Galaxy images from the Illustris-1 simulation at z = 2. The left column shows merging galaxies and the right column shows non-mergers. The same objects are repeated across rows, with the top showing the source domain, the middle showing the target domain, and the bottom displaying the source objects with logarithmic color map normalization for enhanced visibility.

4.2 Simulation-to-Real: Nearby Merging Galaxies from Illustris and Real SDSS Images

To test the ability of MMD and adversarial training in the astronomical situations where they show the most promise— training on simulated data with the prospect of applying the models to real data— we need to use merging galaxies at lower redshifts where more real data is available.

4.2.1 Source dataset: simulated images

In this scenario, making simulated images resemble real observations from a particular telescope is an important step for domain adaptation techniques, since decreasing differences between domains makes domain adaptation easier. Here our source data is comprised of simulated merging galaxies (major mergers) from the final snapshot of Illustris-1, at z = 0, within a time window before the snapshot was taken. The fact that we use the final snapshot of the simulation is an extremely important difference between this source domain dataset and the one described earlier: only past mergers (also called post-mergers) are included, instead of both past and future mergers. Since the simulation was stopped after this snapshot, images of galaxies that would have merged after the snapshot are not available.
This dataset was originally produced in Snyder et al. (2015). Here, images also include the effects of dust, implemented as a slab model based on the gas and metal density along the line of sight to each pixel, similar to the models of Nelson et al. (2018), De Lucia and Blaizot (2007), and Kitzbichler and White (2007). Images have three SDSS filters (g, r, i), are convolved with a Gaussian PSF, and are re-binned to a constant pixel scale, corresponding to the seeing of an object observed by SDSS at low redshift. Finally, random sky shot noise was added to these images by independently drawing from a Gaussian distribution for each pixel, producing a realistic average signal-to-noise ratio.
The simulated galaxies in this dataset contain a lower number of mergers compared to the snapshot used in our simulation-to-simulation experiments. Observational evidence shows that merger rates today are much lower than at the merger-rate peak during the "cosmic high noon" at z ≈ 2 (Madau and Dickinson, 2014), which is where the galaxies from the previous example are located. Our source domain dataset contains only post-mergers and non-mergers. We employ data augmentation to enlarge the source dataset, focusing in particular on mergers to balance the classes. We first augment mergers using mirroring (vertical and horizontal) and rotation, and these images are then additionally augmented by random-angle rotation or zooming in/out, as sketched below. The final source dataset contains equal numbers of post-mergers and non-mergers (we truncate the non-mergers to balance the classes).
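A rough sketch of such an augmentation pipeline using torchvision; note that the text describes a deterministic mirroring-and-rotation stage followed by random rotation or zooming, which the purely random transforms below only approximate, and all parameter values are assumptions:

```python
import torchvision.transforms as T

# Approximate merger augmentation: mirroring plus random rotation/zoom.
augment = T.Compose([
    T.RandomHorizontalFlip(),                       # horizontal mirroring
    T.RandomVerticalFlip(),                         # vertical mirroring
    T.RandomAffine(degrees=180, scale=(0.8, 1.2)),  # random-angle rotation and zoom in/out
])
```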

4.2.2 Target dataset: observed images

Our target dataset is composed of observational SDSS images. We follow the dataset selection of Ackermann et al. (2018) and use the SDSS online image cutout server to get RGB (red, green, blue) JPEG images of both merging and non-merging galaxies. These RGB images correspond to the (i, r, g) SDSS filters, as opposed to (g, r, i) in our source domain; we later reorder the filters in our source domain to match. All of the selected galaxies are from the Galaxy Zoo project (Lintott et al., 2008, 2010), which used crowd-sourcing to generate labels for 900,000 galaxies. We use the 3003 mergers identified in the Darg et al. (2010) catalogue; three of these mergers could not be retrieved due to faulty weblinks. This catalogue identified mergers through the weighted-merger-vote fraction, $f_m$, which describes the confidence in the crowd-sourced label. Mergers were defined as galaxies with $f_m > 0.4$, where $f_m = 0$ describes objects that are not merger-like and $f_m = 1$ describes merger-like objects. Mergers included in Galaxy Zoo were also required to lie between redshifts 0.005 and 0.1.
All available SDSS mergers also include a merger stage: separated mergers, interacting systems, and post-mergers. Since our source domain includes only post-mergers, we restrict our target dataset to the post-merger subclass from SDSS. To obtain a sufficient number of mergers, we augment the SDSS post-merger images using the same techniques used in the source domain. To complete our target dataset, non-merger galaxies in the same redshift range were randomly selected from the Galaxy Zoo project's entire dataset by requiring a sufficiently low weighted-merger-vote fraction.

We resize images from both domains to the same pixel size as in the simulation-to-simulation example, and use the same split into training, validation, and testing samples. In Figure 2 we plot images from both domains: in the top row, post-mergers (left) and non-mergers (right) from the Illustris simulation at z = 0, and in the bottom row, post-mergers (left) and non-mergers (right) from SDSS.

Images from the target domain were visually classified, and most of the target post-mergers clearly exhibit two bright galaxy cores, while images from the source domain display a greater variety of characteristics. Consequently, the two domains are far more dissimilar than the two domains in the simulation-to-simulation example. The choices we detail above (using only post-mergers in both domains, including observational and dust effects, and choosing a small time window to avoid including very relaxed merger systems in the source domain) were made in an attempt to make the two domains as similar as possible. Still, the small number of individual mergers in both domains, paired with their often quite different appearance, makes any domain adaptation effort challenging.


Figure 2: Galaxy images from the Illustris simulation at z = 0, made to mimic SDSS observations (top row), and real SDSS images (bottom row). The left column shows post-merger galaxies, while the right column shows non-mergers. Source domain images in the top row were plotted with a logarithmic color map to make features more visible. Even when we select only post-mergers from SDSS, we can still see that the merger class differs across the two domains: while the source domain contains more relaxed systems, the target contains galaxies near each other, with two bright, clearly visible cores.

5 Experiments

CNNs outperform other machine learning methods for classification of merging galaxies (Snyder et al., 2019; Ćiprijanović et al., 2020). However, in both Ćiprijanović et al. (2020) and Pearson et al. (2019), it was shown that, even though training and evaluating a CNN on images from the same domain gives very good results, a simple model trained in one domain cannot perform classification in a different domain with high accuracy. To increase the performance of deep learning classifiers on a target dataset, we use the DA techniques described in Section 2.

We first train neural networks without the implementation of any DA techniques to determine the base performance on source and target images for each pair of datasets: simulated-to-simulated (Illustris with and without noise) and simulated-to-real (Illustris and SDSS). While we possess labels for both the source and target domains in both scenarios, this training is performed using labeled source images exclusively; we only use the target image labels in the testing phase to assess accuracy. These accuracies serve as the baseline we aim to improve upon through domain adaptation.

We then run several domain adaptation experiments with both the DeepMerge and ResNet18 architectures on the simulation-to-simulation dataset: MMD as transfer loss, adversarial training with the DANN domain classifier loss as transfer loss, MMD as transfer loss with Fisher loss and entropy minimization, and finally DANN adversarial training with Fisher loss and entropy minimization. Training with domain adaptation is performed using labeled source data and unlabeled target data. In the case of adversarial training, an additional domain classifier branch is added, receiving as input the features from the base network. The parameters used for training both DeepMerge and ResNet18 in all simulation-to-simulation experiments are given in Table 4 in the Appendix.

Since larger networks are prone to overfitting given the limited size of both the source and target datasets in our simulation-to-real experiments, we decided to test only the smaller DeepMerge network in this setting. As in the experiments described above for the simulation-to-simulation dataset, we first train DeepMerge without any domain adaptation to determine the base performance. We then attempted to improve target domain accuracy by training with MMD and with adversarial training, both with and without Fisher loss and entropy minimization. However, despite performing hyperparameter searches, domain adaptation was not successful, i.e. the target domain accuracy was no better than random guessing. We then turned to combining MMD and adversarial training with transfer learning from models successfully trained in the simulation-to-simulation experiments. Hyperparameters used in the simulation-to-real experiments are given in Table 5 in the Appendix.

To ensure reproducibility of our results, prior to training we fix the random seeds used for image shuffling (before division into training, validation, and testing samples), as well as for the random weight initialization of our neural networks. The same images were used across experiments for training, as well as for testing, to produce the reported results. In Section 6 we report results for a fixed seed = 1.
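A minimal helper for this kind of seed fixing (the set of generators covered is an assumption about what the training code touches):

```python
import random
import numpy as np
import torch

def set_seed(seed=1):
    """Fix all random number generators that affect image shuffling and
    network weight initialization, so runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```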

6 Results

Throughout this paper we consider mergers the positive class (label 1) and non-mergers the negative class (label 0). Consequently, correctly/incorrectly classified mergers are true positives (TP)/false negatives (FN), while correctly/incorrectly classified non-mergers are true negatives (TN)/false positives (FP).
We report classification accuracy; precision, or purity: TP/(TP + FP); recall, or completeness: TP/(TP + FN); and the F1 score: 2 · precision · recall/(precision + recall). We also report the Area Under the Curve (AUC) score, the area under the Receiver Operating Characteristic (ROC) curve, which conveys the trade-off between the true-positive rate and the false-positive rate. Finally, we provide Brier score values, which measure the mean squared error between the predicted scores and the true labels; a perfect classifier would have a Brier score of zero.
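For reference, all of these metrics are available in scikit-learn; a sketch, assuming per-image merger scores in [0, 1] and an assumed 0.5 decision threshold:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, brier_score_loss)

def report_metrics(y_true, merger_scores, threshold=0.5):
    """Compute the metrics reported in this section; mergers are the positive class."""
    y_pred = (np.asarray(merger_scores) >= threshold).astype(int)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),  # purity: TP / (TP + FP)
        "recall":    recall_score(y_true, y_pred),     # completeness: TP / (TP + FN)
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, merger_scores),
        "brier":     brier_score_loss(y_true, merger_scores),
    }
```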

6.1 Simulation-to-Simulation Experiments

Simulated-to-Simulated results (Table 1 structure): for each training configuration (No Domain Adaptation; MMD; MMD + Fisher + Entropy; Adversarial; Adv. + Fisher + Entropy), the table lists AUC, Accuracy, Precision, Recall, F1 score, and Brier score for the DeepMerge and ResNet18 networks on both the source and target test sets.
Table 1: Performance metrics of the DeepMerge and ResNet18 CNNs, on source and target domain test sets, without domain adaptation (first row) and when domain adaptation techniques are used during training (all other rows). The table shows AUC, Accuracy, Precision, Recall, F1 score, and Brier score.
Figure 3: The top panel shows classification results for the DeepMerge network and the bottom panel for ResNet18. Left: Performance metrics for the no domain adaptation experiment (labeled "noDA") in navy blue, MMD in purple, MMD with Fisher loss and entropy minimization (labeled "MMD+F") in dark purple, adversarial training (labeled "ADA") in yellow, and adversarial training with Fisher loss and entropy minimization (labeled "ADA+F") in pink. We plot values for accuracy, precision, recall, F1 score, Brier score, and AUC. Dashed bars show results for the source domain and solid colored bars for the target domain. Right: ROC curves with the same color and line-style scheme. In the legend we also give AUC values for all five experiments.

Results of training the two classifiers without domain adaptation are given in the first row of Table 1. Training was performed on the source domain images, and test accuracy on source images is high for both networks in the base case without DA. As expected, without any domain adaptation both classifiers are almost unable to classify target domain images, with test accuracy in the target domain dropping drastically for both DeepMerge and ResNet18. Additionally, as expected, we noticed that the more complex ResNet18 was much more prone to overfitting early in training than DeepMerge. We therefore implemented early stopping in all experiments, as well as saving of the best model before training stops due to substantial overfitting.

We then trained DeepMerge and ResNet18 with both domain adaptation techniques, each with and without Fisher loss and entropy minimization. Results from these DA experiments are also given in Table 1. We conclude that it is difficult to determine a single best technique across architectures. Additionally, the inclusion of Fisher loss and entropy minimization, implemented for within-class compactness in both the source and target domains, does not always help. This might be because multiple losses interact differently depending on the network architecture and the complexity of the feature space. In short, the simple hyperparameter grid searches that informed our parameter choices are not a perfect way to find the optimal hyperparameters for different experiments (a non-trivial task that we leave for future studies). Despite our imperfect hyperparameter choices, we assert that the results presented here convey an overall demonstration of the performance and improvements of domain adaptation techniques for cross-domain studies.
Next we take a closer look at the experiments that were most successful for DeepMerge and ResNet18. The best-performing DeepMerge experiments (MMD, adversarial training, and adversarial training with Fisher loss and entropy minimization) reached high source domain accuracies. The target domain accuracy was largest with adversarial training, followed closely by MMD, yielding a substantial increase in target domain accuracy compared to the classifier without domain adaptation. Again, each experiment's results could potentially be further improved with a different set of hyperparameters.
For ResNet18, we see a slightly smaller increase in target domain accuracies than for DeepMerge. Because it is a more complex network, it is harder to stop ResNet18 from learning intricate details that are only found in the source domain. Target domain accuracy increased most in the best-performing experiment, MMD with additional Fisher loss and entropy minimization.
In experiments without DA, we allow the network to learn from all available features that can be extracted, rather than restricting it to the set of domain-invariant features. It follows that, in training without DA, a perfectly optimized network should reach its highest source accuracy when not being forced to learn domain-invariant features. In contrast to this expectation, we observe that the addition of transfer and other losses slightly increases source domain accuracies in almost all of our experiments. We posit that this is first and foremost a consequence of not finding the best set of hyperparameters, and that the additional transfer loss serves as a good regularizer, enabling longer training without overfitting. This is more prominent for ResNet18, where the source domain accuracy is highest in the case of adversarial training.
For easier comparison, in Figure 3 we plot the performance values from Table 1 for our test set of images. The top row of plots shows results for the DeepMerge network, while the bottom row is for ResNet18. Bar plots on the left show all performance metrics (accuracy, precision, recall, F1 score, Brier score, and AUC) for our simulated-to-simulated experiments: no domain adaptation (navy blue), MMD (purple), MMD with Fisher and entropy minimization (dark purple), adversarial training (yellow), and adversarial training with Fisher and entropy (pink). The two right panels show ROC curves for all DeepMerge and ResNet18 experiments, with the same color coding as in the bar plots. In all four panels, the solid bars and lines show values for the target domain while the dashed bars and lines show source domain performance.

See Appendix B for additional performance comparison between training with and without domain adaptation for both networks on the simulation-to-simulation dataset.

6.2 Simulation-to-Real Experiments

We also evaluated the performance of these DA methods in the situation where the classifier is trained on simulated source domain images and tested on a target domain of observational telescope images. Examples like this are much more complex than our simulation-to-simulation experiments due to the larger discrepancy between domains, as discussed at length in Subsection 4.2.

Due to the perils of training a large network on such a small dataset, we decided to test domain adaptation techniques with only the smaller DeepMerge network. Without DA, DeepMerge reached a moderate accuracy in the source domain but performed much worse in the target domain during testing. This was the baseline we try to improve upon in the simulation-to-real DA experiments. Due to the extreme discrepancy between domains, we also trained the DeepMerge network on the target domain directly, which resulted in a higher accuracy on the target domain images than on the source domain of simulation images, confirming that the target domain is easier to train on since it contains visually more apparent merging features.
We then tried running hyperparameter grid searches with MMD and adversarial training, but were unable to use domain adaptation to improve target domain accuracies. This led us to conclude that the size of our dataset, as well as the difference between domains, was preventing the successful learning of domain-invariant features. In the top row of Table 2 we report all performance metrics for the no-domain-adaptation case, and in the middle row we report results using MMD alone as transfer loss. Since no other method was successful, we omit the numbers for all other DA setups tested in the simulation-to-real experiments. Larger simulated training samples, more sophisticated domain adaptation methods that allow for better overlap of discrepant feature distributions, or a combination of the two will advance the study of merging galaxies across the simulated-to-real domain gap in the future.

Simulated-to-Real results (Table 2 structure): for each training configuration (No Domain Adaptation; MMD; Transfer Learning + MMD), the table lists AUC, Accuracy, Precision, Recall, F1 score, and Brier score for the DeepMerge network on the source and target test sets.
Table 2: Performance metrics of DeepMerge, on source simulated data and target observational data in the testing phase: without domain adaptation (first row), MMD only (middle row), and MMD with transfer learning (bottom row). The table shows AUC, accuracy, precision, recall, F1 score, and Brier score.
Figure 4: Left: Performance metrics for the DeepMerge network for the no domain adaptation experiment (labeled "noDA") in navy blue, MMD in purple, and MMD with transfer learning (labeled "MMD+TL") in yellow. We plot values for accuracy, precision, recall, F1 score, Brier score, and AUC. Dashed bars show results for the source domain and solid colored bars for the target domain. Right: ROC curves with the same color and line-style scheme. In the legend we also give AUC values for all three experiments.

6.2.1 Transfer Learning

The approach we took to overcome our small dataset limitation was to use transfer learning, where the weights from a neural network pre-trained on different data are loaded before training the classifier on the data of interest.

Transfer learning has been used in previous studies of merging galaxies. For example, in Ackermann et al. (2018), the authors use Xception (Chollet, 2016), a large deep learning model pre-trained on images of everyday objects from ImageNet (Deng et al., 2009). They successfully trained the model on observed images of merging galaxies from SDSS and report high classification precision, recall, and F1 scores. Similarly, in Wang et al. (2020), the authors use a VGG network (Simonyan and Zisserman, 2015) pre-trained on ImageNet to train on simulated images from the IllustrisTNG simulation (Springel et al., 2018; Pillepich et al., 2018). They report the classification accuracy achieved on simulated images, and then use the simulation-trained model to detect major mergers in KiDS (de Jong et al., 2013) and GAMA (Driver et al., 2009) observations.

We decided to test whether our simulated-to-real DA setup would benefit from transfer learning from a dataset more similar to ours than ImageNet. Rather than proceeding with random weight initialization, we load the weights from the DeepMerge networks successfully trained in our simulated-to-simulated experiments. This way we can utilize extracted features related to distant merging galaxies, which are much more similar to nearby merging galaxies than features extracted from everyday objects.


We tried training both without freezing layers (allowing all weights in the network to fine-tune to the new datasets) and with freezing of the convolutional and batch normalization layers. Training performed much better when all weights of the model were allowed to train from their loaded checkpoint. This may be due both to the smaller size of the DeepMerge network and to possible differences in the appearance of galaxies in our experiments, with particular emphasis on the very different appearance of real SDSS mergers compared to simulated ones. This likely made it necessary for the network to find better-suited domain-invariant features once real data was included, which it can do more easily when the convolutional layers are allowed to train.
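A minimal sketch of this initialization step; the checkpoint filename and the layer-name prefixes used for freezing are hypothetical:

```python
import torch

def load_pretrained(model, ckpt_path="deepmerge_sim2sim.pt", freeze_conv=False):
    """Initialize DeepMerge from a simulation-to-simulation checkpoint
    (assumed to store a state dict) rather than random weights."""
    model.load_state_dict(torch.load(ckpt_path))
    if freeze_conv:  # option tested above; full fine-tuning worked better for us
        for name, param in model.named_parameters():
            if name.startswith(("conv", "bn")):  # convolutional / batch-norm layers
                param.requires_grad = False
    return model
```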
We also performed a hyperparameter search for both MMD and adversarial domain adaptation with transfer learning, and were able to find a configuration for successful DA with MMD. Since the domain discrepancy in the case of simulated-to-real images was large, successfully learning common features led to a reduction in source domain accuracy relative to training without domain adaptation. However, and most significantly, the target domain accuracy during the testing phase of MMD with transfer learning increased substantially compared to no domain adaptation. In the bottom row of Table 2 we report the performance metrics for this transfer learning case.
For ease of comparison, we also plot performance metric values in the testing phase in Figure 4. Bar plots on the left show all performance metrics (accuracy, precision, recall, F1 score, Brier score, and AUC) for our simulated-to-real experiments— no domain adaptation (navy blue), MMD (purple), MMD with transfer learning (yellow). ROC curves for these experiments are presented on the right, with the same color coding as in the bar plots. In both panels, the solid bars and lines show values for the target domain while the dashed bars and lines show source domain performance.
This experiment demonstrates that domain adaptation techniques are very powerful. However, to be useful in a scientific context, very careful data preprocessing to reduce domain discrepancies and/or transfer learning to mitigate the problem of small datasets is necessary. It is our hope that the introduction of these techniques to the astronomy community will spur innovation and encourage the use of more sophisticated DA methods that optimize domain alignment for difficult astronomical tasks of this sort.

7 Discussion

We have demonstrated how MMD and domain adversarial training substantially increase the performance of simulated-to-simulated learning in the context of galaxy merger classification, with clear average target domain accuracy gains for both DeepMerge and ResNet18. While we were unable to show positive results with adversarial domain adaptation on our simulated-to-real dataset, pairing MMD with transfer learning achieved a substantial increase in target domain accuracy (up to ∼20%) with DeepMerge.
We believe that both techniques show great promise for use in astronomy. Here we discuss their interpretability with the aid of t-Distributed Stochastic Neighbor Embeddings (t-SNEs) and Gradient-weighted Class Activation Mappings (Grad-CAMs), and provide an outlook on their potential for use within the scientific community.

7.1 Model Interpretability: Understanding the Extracted Features with t-SNEs

To better understand the effect of domain adaptation, we visualize the distribution of the extracted features with t-Distributed Stochastic Neighbor Embedding (t-SNE) plots by projecting the high-dimensional feature space onto a more familiar two-dimensional plane (van der Maaten and Hinton, 2008). This method calculates a probability distribution over pairs of data points, assigning a higher probability to similar objects and a lower probability to dissimilar pairs, in both the latent feature space and the two-dimensional mapping. By minimizing the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) between the two distributions, the t-SNE method ensures similarity between the actual distribution and the projection. Despite its usefulness, we emphasize that t-SNE is a non-linear algorithm that adapts to the data by performing different transformations in each region. This can lead to clumps of highly concentrated points appearing as very large groups, i.e. it is difficult to compare the relative sizes of clusters in t-SNE renderings. Additionally, the output two-dimensional embeddings depend entirely on several user-defined parameters. For more details on t-SNE best practices, see Wattenberg et al. (2016).
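A minimal sketch of such a visualization with scikit-learn and matplotlib, assuming pooled latent features from both domains (the perplexity and styling are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats_s, labels_s, feats_t, labels_t, perplexity=30):
    """Embed pooled source+target latent features in 2D; color by class,
    with transparency distinguishing source (faint) from target (opaque)."""
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(
        np.concatenate([feats_s, feats_t]))
    n = len(feats_s)
    for dom_emb, labels, alpha in ((emb[:n], labels_s, 0.3), (emb[n:], labels_t, 1.0)):
        for cls, color in ((1, "red"), (0, "blue")):  # mergers red, non-mergers blue
            pts = dom_emb[np.asarray(labels) == cls]
            plt.scatter(pts[:, 0], pts[:, 1], c=color, alpha=alpha, s=5)
    plt.show()
```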

We implemented an option to plot t-SNEs during training to demonstrate the changes in extracted features from the source and target domains across a series of epochs. In Figure 5 we show t-SNE plots for DeepMerge: before the start of training in the first panel, and after some training in the remaining panels, when no domain adaptation is implemented (second panel), when MMD is used as transfer loss (third panel), and when MMD with Fisher loss and entropy minimization is used (fourth panel). We confirm that domain adversarial training t-SNEs are virtually indistinguishable from the MMD plots, so we omit them here. On all t-SNE plots, red and blue dots represent the two classes (mergers and non-mergers, respectively); transparent dots represent the source domain, while opaque dots represent the target domain.

Before the start of training, classes are completely mixed together and domains are separated (first panel). With no domain adaptation, we see that even after some training, domains remain separated (second panel). With domain adaptation but no Fisher loss or entropy minimization (third panel), we see that features from both domains completely overlap. We can see that both classes exhibit some clumping but the structure of both classes is quite complex. Finally, in the fourth panel, we also include Fisher loss and entropy minimization which helps the two classes separate more in both domains.

Figure 5: t-SNE plots for the DeepMerge network. Red and blue dots represent mergers and non-mergers, respectively; transparent dots represent the source domain, while opaque dots represent the target domain. The first panel shows the classes before any training, while all other panels show embeddings after training. The second panel represents training without any domain adaptation, the third panel shows training using MMD as transfer loss, and the fourth panel shows MMD with Fisher loss and entropy minimization.

7.2 Model Interpretability: Visualizing Salient Regions in Input Images with Grad-CAMs

Another way of probing deep neural network models is by identifying regions in the input images that proved most important for classification as a particular class. Domain adaptation should lead to differences in these important regions. In particular, without domain adaptation, the neural network can often identify incorrect or spurious regions in images it was not trained on in the target domain, while the classifier that works correctly should focus on regions that contain useful information for the given classification task.

Here we use the Gradient-weighted Class Activation Mapping (Grad-CAM) method (Selvaraju et al., 2020) to visualize the regions that differently trained models identify as the most salient information in the image. This method calculates the class-specific gradients of the output score $y^c$ for class $c$ with respect to the activation maps (i.e. feature maps) $A^k$ of the last convolutional layer in the network, where the activation maps have spatial dimensions $u \times v$ pixels and $k$ indexes the feature maps. These gradients are global-average-pooled to calculate the importance weights $\alpha_k^c$ for a particular class $c$:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}, \qquad (12)$$

where $Z = u \times v$ is the number of pixels in each activation map.

Grad-CAMs are then produced by applying a ReLU function (to extract the positive activation regions for the particular class $c$) to the weighted combination of feature maps in the last convolutional layer:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_k \alpha_k^c A^k \right). \qquad (13)$$
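As an illustration of Equations (12) and (13), the following is a minimal PyTorch sketch of Grad-CAM using forward and backward hooks. The stand-in network, class index, and random input are assumptions for demonstration, not the exact DeepMerge pipeline.

```python
# Minimal Grad-CAM sketch (Eqs. 12-13) for a generic PyTorch CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in CNN; in practice `last_conv` would be the final convolutional
# layer of the trained network and `x` a (1, 3, 75, 75) galaxy image.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)
last_conv = model[0]

acts, grads = {}, {}
last_conv.register_forward_hook(lambda m, inp, out: acts.update(a=out))
last_conv.register_full_backward_hook(lambda m, gin, gout: grads.update(g=gout[0]))

x = torch.randn(1, 3, 75, 75)
score = model(x)[0, 1]                 # output score y^c for class c = 1
score.backward()

# Eq. (12): global-average-pool the gradients to get the weights alpha_k^c.
alpha = grads["g"].mean(dim=(2, 3), keepdim=True)
# Eq. (13): ReLU of the weighted combination of feature maps.
cam = F.relu((alpha * acts["a"]).sum(dim=1))       # shape (1, 75, 75)
```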

In Figure 6, we plot the last convolutional layer Grad-CAMs for simulation-to-simulation experiments with the DeepMerge network. We display plots for DeepMerge rather than ResNet18 because the spatial dimensions of the last convolutional layer in ResNet18 are smaller, resulting in low-resolution Grad-CAMs that are much harder to interpret. The first column shows an example of a merging galaxy from the source domain at the top and from the target domain at the bottom; recall that these two domains differ only by the inclusion of noise. The second, third, and fourth columns show Grad-CAMs for these images for classification into the merger class when training without DA, with MMD, and with MMD, Fisher loss, and entropy minimization, respectively.

In the case of training without domain adaptation (second column), the network focuses on the periphery of the galaxy in the source domain, exactly where many interesting asymmetric and clumpy features are expected to appear in mergers. These features are faint, and much of this information is lost in the target domain due to the inclusion of mimicked observational noise. As expected, the classifier does not work in the target domain: we can see that the network focuses on the noise instead. When domain adaptation is introduced— MMD in the third column and MMD with Fisher loss and entropy minimization in the fourth column— the network learns to focus on the central brightest regions of the galaxy, which are visible in both domains, and successfully performs classification in both cases.

Figure 6: Grad-CAMs for simulation-to-simulation experiments with the DeepMerge network, made from feature maps in the final convolutional layer. The top left image shows an example galaxy merger from the source domain (plotted with a logarithmic colormap to enhance visibility), while the bottom left image is the same galaxy in the target domain. The second, third, and fourth columns show Grad-CAMs for those images for classification into the merger class when training without domain adaptation, with MMD, and with MMD, Fisher loss, and entropy minimization, respectively. In the case of training without domain adaptation, the network focuses on the periphery of the galaxy in the source domain and on the noise in the target domain. When domain adaptation is introduced, the network learns to focus on the central brightest regions of the galaxy, which are visible in both domains.

Likewise, we plot Grad-CAMs for the simulated-to-real experiments with the DeepMerge network in Figure 7. Here we show example true merger images from the source and target domains at the top left and top right, respectively. The second and third rows show Grad-CAMs for these images when training without domain adaptation, highlighting what the network focuses on for the merger and non-merger classes. Finally, the fourth and fifth rows show merger class and non-merger class Grad-CAMs for training with MMD and transfer learning.
In the source domain Grad-CAMs— where the classifier works both with and without domain adaptation— the neural network searches the periphery when classifying an example as a merger, while it focuses on the bright center when classifying non-mergers. This is to be expected, since mergers often carry a lot of useful information on the periphery, while non-mergers are often very compact, with only a bright center in the middle of the image. On the other hand, the Grad-CAMs for the target domain without domain adaptation, i.e. for the unsuccessful classifier, demonstrate the network’s focus on the noise. This even inverts the characteristics described for the source domain: here classification as a merger depends on the bright center and classification as a non-merger depends on the peripheral information, leading the classifier to fail to distinguish mergers from non-mergers in the target domain. Finally, when training with MMD and transfer learning, the Grad-CAMs come to resemble those in the source domain, and classification is successful in the target domain as well.

Figure 7: Grad-CAMs for simulated-to-real experiments with the DeepMerge network. The top row shows examples of true mergers from the simulated source dataset on the left and the real target dataset on the right. The second row shows Grad-CAMs for these example images for classification into the merger class, while the third row shows Grad-CAMs for the same images for the non-merger class. We can see that, in the case of the source domain, the peripheries are important for positive classification into the merger class, while central regions are important for positive classification as a non-merger. In the case of the target domain, both mergers and non-mergers look very different, so the Grad-CAMs become noisy and display inverted behavior compared with the source domain. Finally, the fourth and fifth rows show merger and non-merger Grad-CAMs for the model trained with MMD and transfer learning. The successful domain adaptation is apparent, as the network performs both source and target domain classification in a manner similar to source domain classification without DA.

7.3 Issues for Deployment in the Sciences: Avoiding Negative Transfer and Dealing With Small Datasets

The ability to achieve success in combining cross-domain knowledge will largely depend on data availability and quality. We stress the importance of making efforts to achieve likeness between domains, particularly in the sciences. Most domain adaptation techniques were developed for use across very similar domains, such as pictures of office supplies in different settings in the Office-31 dataset (Saenko et al., 2010; Venkateswara et al., 2017). Scientists hoping to deploy these DA techniques face a much more nontrivial task, demonstrated by the challenges we encountered when trying to apply MMD and adversarial domain adaptation to the simulated-to-real datasets, which were both small and considerably different across domains.

Substantial domain divergence can result in suboptimal performance or even render domain adaptation techniques virtually useless. The presence of additional classes in the target domain, or of outliers within the same class, makes domain adaptation very difficult. Training a model on a demanding problem across dissimilar domains, which occurs frequently in the sciences, may lead to what is known as negative transfer: rather than aiding domain adaptation, the knowledge learned from the source domain actually degrades the performance of the classifier in the target domain. For a comprehensive survey of negative transfer and common methods to mitigate it, see Zhang et al. (2020).
Therefore, despite its promise to help mitigate the issue of dealing with large unlabeled datasets, applying domain adaptation in the sciences may not be straightforward. Scientists who wish to use DA techniques should strive to make their domains as similar as possible through careful dataset selection and image pre-processing (adding realistic noise, the PSF, and other observational effects to simulated astronomical images). In crafting the simulated-to-real dataset in this paper, we made several choices to increase the likeness between our two domains, including using a small merger window to remove relaxed merger systems from our source domain, since such systems were not present in our target domain. However, even with our target distribution restricted to post-mergers, DA was not very successful without the inclusion of transfer learning.
Other clever approaches may be taken to decrease domain discrepancy. For example, Bottrell et al. (2019) showed that standard deep learning algorithms without domain adaptation, trained to distinguish between galaxy merger stages in realistic data, do not perform well when trained on pristine simulated images, or even on simulated images with realistic PSF and noise included. The authors achieved their best performance when simulated images were inserted into realistic sky fields in order to introduce examples of crowding from nearby sources, which enabled the model to learn the distinction between crowding by nearby sources and close merging pairs. They also showed that achieving this observational realism matters more for good classification across domains than adding more intricate details to computationally expensive simulations.
Beyond dealing with negative transfer, data augmentation and transfer learning can aid substantially in training with small datasets. In this paper, we demonstrated that a combination of MMD and transfer learning from our simulated-to-simulated dataset enhanced the learning of correct features in our extremely small simulated-to-real dataset. Even in the case of small and quite discrepant domains, domain adaptation techniques can be successfully used to improve performance of deep learning algorithms on unlabeled observational data.
In the future, more refined domain adaptation techniques will likely be needed in the sciences. In the all too prevalent case when classes look very different across source and target domains, future work will include applying methods such as class-aware Contrastive Domain Discrepancy networks (Kang et al., 2019), that promote class compactness and overlap between class distributions from different domains— rather than the overlap of entire source and target distributions performed with MMD— to achieve even higher accuracies in the target domain.
Despite the issues mentioned above, we maintain that the prospect for domain adaptation’s use in the sciences is extremely promising. With ongoing domain adaptation development both within computer science and by those who leverage it creatively in other sciences, we are confident that DA will soon become a staple in the natural scientist’s toolkit to leverage all available data.

8 Conclusion

In this paper, we focus on applying domain adaptation techniques to the astronomical context of studying galaxy mergers. Galaxy mergers are crucial in the study of galaxy morphology, evolution, star formation, as well as particle acceleration and the evolution of matter in the universe. Finding comprehensive samples of merging galaxies in different merger stages is very important for the study of these long processes, and we are excited about what the next era of large-scale survey data will bring.

MMD and adversarial training using DANNs show great promise for use in classifying galaxy mergers across domains. We showed here how they can help improve classification accuracies in an unlabeled target domain, thereby allowing models trained on simulated labeled data to be successfully applied to both mimicked and real observational data. While we demonstrate successful implementation of both methods on the simulated-to-simulated dataset for both DeepMerge and ResNet18, we also present promising results for MMD combined with transfer learning on the simulated-to-real dataset with DeepMerge. In both types of experiments we were able to increase the target dataset accuracy in the testing phase by up to ∼20%.

While we found that MMD and adversarial training can be challenging to fine-tune for nontrivial scientific tasks, we conclude that domain adaptation techniques will soon flourish as a necessary tool in astronomy and other natural sciences. These techniques will play an important role in successful use of deep learning algorithms on huge datasets from future large astronomical surveys, and might even help in real-time detection of transient objects and other interesting phenomena. We affirm that domain adaptation techniques will prove essential to building deep learning models that can combine and harness all available observational and simulated data, a tantalizing prospect in the sciences.

Acknowledgements

This manuscript has been supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. This research has been partially supported by the High Velocity Artificial Intelligence grant as part of the Department of Energy High Energy Physics Computational HEP sessions program. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is a user facility supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.

The authors of this paper have committed themselves to performing this work in an equitable, inclusive, and just environment, and we hold ourselves accountable, believing that the best science is contingent on a good research environment. We acknowledge the Deep Skies Lab as a community of multi-domain experts and collaborators who have facilitated an environment of open discussion, idea-generation, and collaboration. This community was important for the development of this project.

We also thank K. Pedro, N. Tran, W. J. Pearson, Y. Zhang and M. Vasist for valuable discussion and comments.

Author Contributions

A. Ćiprijanović— Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Visualization, Project administration, Supervision, Writing of original draft; D. Kafkes— Formal analysis, Investigation, Methodology, Resources, Software, Visualization, Writing of original draft; K. Downey— Data curation, Formal analysis, Investigation, Software, Visualization; S. Jenkins— Formal analysis, Investigation, Software, Visualization, Writing (review & editing); G. N. Perdue— Investigation, Methodology, Project administration, Resources, Software, Supervision, Writing (review & editing); S. Madireddy— Resources, Software, Methodology, Supervision, Writing (review & editing); T. Johnson— Methodology, Supervision, Writing (review & editing); G. F. Snyder—Methodology, Conceptualization, Data curation, Writing (review & editing); B. Nord— Methodology, Conceptualization, Supervision, Writing (review & editing).

Data and Code Availability

All simulated Illustris datasets and the observed SDSS dataset are available on Zenodo. For code, see our GitHub page.

References

  • S. Ackermann, K. Schawinski, C. Zhang, A. K. Weigel, and M. D. Turp (2018) Using transfer learning to detect galaxy mergers. MNRAS 479 (1), pp. 415–425. External Links: Document, 1805.10289 Cited by: §1, §4.2.2, §6.2.1.
  • P. Balaprakash, M. Salim, T. D. Uram, V. Vishwanath, and S. M. Wild (2018) DeepHyper: asynchronous hyperparameter search for deep neural networks. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC), Vol. , pp. 42–51. Cited by: Appendix A, §3.
  • P. Balaprakash, R. Egele, M. Salim, S. Wild, V. Vishwanath, F. Xia, T. Brettin, and R. Stevens (2019) Scalable reinforcement-learning-based neural architecture search for cancer deep learning research. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19, New York, NY, USA. External Links: ISBN 9781450362290, Link, Document Cited by: Appendix A, §3.
  • C. Bottrell, M. H. Hani, H. Teimoorinia, S. L. Ellison, J. Moreno, P. Torrey, C. C. Hayward, M. Thorp, L. Simard, and L. Hernquist (2019) Deep learning predictions of galaxy merger stage and the importance of observational realism. Monthly Notices of the Royal Astronomical Society 490 (4), pp. 5390–5413. External Links: ISSN 0035-8711, Document, Link, https://academic.oup.com/mnras/article-pdf/490/4/5390/30736869/stz2934.pdf Cited by: §7.3.
  • F. Chollet (2016) Xception: deep learning with depthwise separable convolutions. CoRR abs/1610.02357. External Links: Link, 1610.02357 Cited by: §6.2.1.
  • A. Ćiprijanović, G. F. Snyder, B. Nord, and J. E. G. Peek (2020) DeepMerge: Classifying high-redshift merging galaxies with deep neural networks. Astronomy and Computing 32, pp. 100390. External Links: Document, 2004.11981 Cited by: §1, §3, §3, §4.1, §4.1, §5.
  • C. J. Conselice, M. A. Bershady, M. Dickinson, and C. Papovich (2003) A Direct Measurement of Major Galaxy Mergers at z less than ~3. AJ 126, pp. 1183–1207. External Links: astro-ph/0306106, Document Cited by: §1.
  • G. Csurka (2017) A comprehensive survey on domain adaptation for visual applications. In Domain Adaptation in Computer Vision Applications, pp. 1–35. External Links: ISBN 978-3-319-58347-1, Document, Link Cited by: §1.
  • D. W. Darg, S. Kaviraj, C. J. Lintott, K. Schawinski, M. Sarzi, S. Bamford, J. Silk, R. Proctor, D. Andreescu, P. Murray, R. C. Nichol, M. J. Raddick, A. Slosar, A. S. Szalay, D. Thomas, and J. Vandenberg (2010) Galaxy Zoo: the fraction of merging galaxies in the SDSS and their morphologies. Monthly Notices of the Royal Astronomical Society 401 (2), pp. 1043–1056. External Links: ISSN 0035-8711, Document, Link, https://academic.oup.com/mnras/article-pdf/401/2/1043/3958956/mnras0401-1043.pdf Cited by: §1, §4.2.2.
  • J. T. A. de Jong, K. Kuijken, D. Applegate, K. Begeman, A. Belikov, C. Blake, J. Bout, D. Boxhoorn, H. Buddelmeijer, A. Buddendiek, M. Cacciato, M. Capaccioli, A. Choi, O. Cordes, G. Covone, M. Dall’Ora, A. Edge, T. Erben, J. Franse, F. Getman, A. Grado, J. Harnois-Deraps, E. Helmich, R. Herbonnet, C. Heymans, H. Hildebrandt, H. Hoekstra, Z. Huang, N. Irisarri, B. Joachimi, F. Köhlinger, T. Kitching, F. La Barbera, P. Lacerda, J. McFarland, L. Miller, R. Nakajima, N. R. Napolitano, M. Paolillo, J. Peacock, B. Pila-Diez, E. Puddu, M. Radovich, A. Rifatto, P. Schneider, T. Schrabback, C. Sifon, G. Sikkema, P. Simon, W. Sutherland, A. Tudorica, E. Valentijn, R. van der Burg, E. van Uitert, L. van Waerbeke, M. Velander, G. Verdoes Kleijn, M. Viola, and W. -J. Vriend (2013) The Kilo-Degree Survey. The Messenger 154, pp. 44–46. Cited by: §6.2.1.
  • G. De Lucia and J. Blaizot (2007) The hierarchical formation of the brightest cluster galaxies. MNRAS 375 (1), pp. 2–14. External Links: Document, astro-ph/0606519 Cited by: §4.2.1.
  • J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. External Links: Document Cited by: §6.2.1.
  • S. P. Driver, P. Norberg, I. K. Baldry, S. P. Bamford, A. M. Hopkins, J. Liske, J. Loveday, J. A. Peacock, D. T. Hill, L. S. Kelvin, A. S. G. Robotham, N. J. G. Cross, H. R. Parkinson, M. Prescott, C. J. Conselice, L. Dunne, S. Brough, H. Jones, R. G. Sharp, E. van Kampen, S. Oliver, I. G. Roseboom, J. Bland-Hawthorn, S. M. Croom, S. Ellis, E. Cameron, S. Cole, C. S. Frenk, W. J. Couch, A. W. Graham, R. Proctor, R. De Propris, I. F. Doyle, E. M. Edmondson, R. C. Nichol, D. Thomas, S. A. Eales, M. J. Jarvis, K. Kuijken, O. Lahav, B. F. Madore, M. Seibert, M. J. Meyer, L. Staveley-Smith, S. Phillipps, C. C. Popescu, A. E. Sansom, W. J. Sutherland, R. J. Tuffs, and S. J. Warren (2009) GAMA: towards a physical understanding of galaxy formation. Astronomy and Geophysics 50 (5), pp. 5.12–5.19. External Links: Document, 0910.5123 Cited by: §6.2.1.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17 (59), pp. 1–35. External Links: Link Cited by: §1, §2.1.2, §2.
  • M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang (2017) Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE TPAMI 39 (7), pp. 1414–1430. External Links: Link, 1510.04373, Document Cited by: §2.2.
  • Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou (Eds.), pp. 529–536. External Links: Link Cited by: §1, §2.3, §2.
  • A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012a) A kernel two-sample test. Journal of Machine Learning Research 13 (25), pp. 723–773. External Links: Link Cited by: §1, §2.1.1.
  • A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur (2012b) Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1205–1213. External Links: Link Cited by: §2.1.1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep Residual Learning for Image Recognition. arXiv e-prints, pp. arXiv:1512.03385. External Links: 1512.03385 Cited by: Appendix A, §1, §3, §3.
  • G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann (2019) Contrastive Adaptation Network for Unsupervised Domain Adaptation. arXiv e-prints, pp. arXiv:1901.00976. External Links: 1901.00976 Cited by: §1, §7.3.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and (. LeCun (Eds.), External Links: Link Cited by: Appendix A.
  • M. G. Kitzbichler and S. D. M. White (2007) The high-redshift galaxy population in hierarchical galaxy formation models. MNRAS 376 (1), pp. 2–12. External Links: Document, astro-ph/0609636 Cited by: §4.2.1.
  • S. Kullback and R. A. Leibler (1951) On information and sufficiency. Ann. Math. Statist. 22 (1), pp. 79–86. External Links: Document, Link Cited by: §7.1.
  • L. Lin, D. C. Koo, C. N. A. Willmer, D. R. Patton, C. J. Conselice, R. Yan, A. L. Coil, M. C. Cooper, M. Davis, S. M. Faber, B. F. Gerke, P. Guhathakurta, and J. A. Newman (2004) The DEEP2 Galaxy Redshift Survey: Evolution of Close Galaxy Pairs and Major-Merger Rates up to z ~1.2. ApJ 617 (1), pp. L9–L12. External Links: Document, astro-ph/0411104 Cited by: §1.
  • C. J. Lintott, K. Schawinski, A. Slosar, K. Land, S. Bamford, D. Thomas, M. J. Raddick, R. C. Nichol, A. Szalay, D. Andreescu, P. Murray, and J. Vandenberg (2008) Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey*. Monthly Notices of the Royal Astronomical Society 389 (3), pp. 1179–1189. External Links: ISSN 0035-8711, Document, Link, https://academic.oup.com/mnras/article-pdf/389/3/1179/3325962/mnras0389-1179.pdf Cited by: §1, §4.2.2.
  • C. Lintott, K. Schawinski, S. Bamford, A. Slosar, K. Land, D. Thomas, E. Edmondson, K. Masters, R. C. Nichol, M. J. Raddick, A. Szalay, D. Andreescu, P. Murray, and J. Vandenberg (2010) Galaxy Zoo 1: data release of morphological classifications for nearly 900 000 galaxies*. Monthly Notices of the Royal Astronomical Society 410 (1), pp. 166–178. External Links: ISSN 0035-8711, Document, Link, https://academic.oup.com/mnras/article-pdf/410/1/166/18442057/mnras0410-0166.pdf Cited by: §4.2.2.
  • M. Liu and O. Tuzel (2016) Coupled generative adversarial networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 469–477. External Links: Link Cited by: §1.
  • J. M. Lotz, J. Primack, and P. Madau (2004) A New Nonparametric Approach to Galaxy Morphological Classification. AJ 128, pp. 163–182. External Links: astro-ph/0311352, Document Cited by: §1.
  • P. Madau and M. Dickinson (2014) Cosmic star-formation history. Annual Review of Astronomy and Astrophysics 52 (1), pp. 415–486. External Links: Document, Link, https://doi.org/10.1146/annurev-astro-081811-125615 Cited by: §4.2.1.
  • D. Nelson, A. Pillepich, V. Springel, R. Weinberger, L. Hernquist, R. Pakmor, S. Genel, P. Torrey, M. Vogelsberger, G. Kauffmann, F. Marinacci, and J. Naiman (2018) First results from the IllustrisTNG simulations: the galaxy colour bimodality. MNRAS 475 (1), pp. 624–647. External Links: Document, 1707.03395 Cited by: §4.2.1.
  • S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang (2011) Domain adaptation via transfer component analysis. Trans. Neur. Netw. 22 (2), pp. 199–210. External Links: ISSN 1045-9227, Link, Document Cited by: §2.1.1, §2.1.1.
  • W. J. Pearson, L. Wang, J. W. Trayford, C. E. Petrillo, and F. F. S. van der Tak (2019) Identifying galaxy mergers in observations and simulations with deep learning. A&A 626, pp. A49. External Links: Document, 1902.10626 Cited by: §1, §5.
  • A. Pillepich, V. Springel, D. Nelson, S. Genel, J. Naiman, R. Pakmor, L. Hernquist, P. Torrey, M. Vogelsberger, R. Weinberger, and F. Marinacci (2018) Simulating galaxy formation with the IllustrisTNG model. MNRAS 473 (3), pp. 4077–4106. External Links: Document, 1703.02970 Cited by: §6.2.1.
  • K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV’10, Berlin, Heidelberg, pp. 213–226. External Links: ISBN 364215560X Cited by: §7.3.
  • J. Schaye, R. A. Crain, R. G. Bower, M. Furlong, M. Schaller, T. Theuns, C. Dalla Vecchia, C. S. Frenk, I. G. McCarthy, J. C. Helly, A. Jenkins, Y. M. Rosas-Guevara, S. D. M. White, M. Baes, C. M. Booth, P. Camps, J. F. Navarro, Y. Qu, A. Rahmati, T. Sawala, P. A. Thomas, and J. Trayford (2015) The EAGLE project: simulating the evolution and assembly of galaxies and their environments. MNRAS 446 (1), pp. 521–554. External Links: Document, 1407.7040 Cited by: §1.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2020) Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Int. J. Comput. Vis. 128, pp. 336–359. External Links: Document, 1610.02391 Cited by: §7.2.
  • J. L. Sérsic (1963) Photometry of southern galaxies. IX:NGC 1313. Boletin de la Asociacion Argentina de Astronomia La Plata Argentina 6, pp. 99. Cited by: §1.
  • C. E. Shannon (1948) A mathematical theory of communication. The Bell System Technical Journal 27 (3), pp. 379–423. Cited by: §2.3.
  • J. Shen, Y. Qu, W. Zhang, and Y. Yu (2018) Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 4058–4065. External Links: Link Cited by: §1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. External Links: 1409.1556 Cited by: §6.2.1.
  • L. N. Smith and N. Topin (2019) Super-convergence: very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, T. Pham (Ed.), , Vol. 11006, pp. 369 – 386. External Links: Document, Link Cited by: Appendix A.
  • A. Smola, A. Gretton, L. Song, and B. Schölkopf (2007) A hilbert space embedding for distributions. In Algorithmic Learning Theory, Lecture Notes in Computer Science 4754, Biologische Kybernetik, Berlin, Germany, pp. 13–31. Cited by: §2.1.1.
  • G. F. Snyder, V. Rodriguez-Gomez, J. M. Lotz, P. Torrey, A. C. N. Quirk, L. Hernquist, M. Vogelsberger, and P. E. Freeman (2019) Automated distant galaxy merger classifications from Space Telescope images using the Illustris simulation. MNRAS 486 (3), pp. 3702–3720. External Links: Document, 1809.02136 Cited by: §1, §4.1, §5.
  • G. F. Snyder, P. Torrey, J. M. Lotz, S. Genel, C. K. McBride, M. Vogelsberger, A. Pillepich, D. Nelson, L. V. Sales, D. Sijacki, L. Hernquist, and V. Springel (2015) Galaxy morphology and star formation in the Illustris Simulation at z = 0. MNRAS 454 (2), pp. 1886–1908. External Links: Document, 1502.07747 Cited by: §4.2.1.
  • V. Springel, R. Pakmor, A. Pillepich, R. Weinberger, D. Nelson, L. Hernquist, M. Vogelsberger, S. Genel, P. Torrey, F. Marinacci, and J. Naiman (2018) First results from the IllustrisTNG simulations: matter and galaxy clustering. MNRAS 475 (1), pp. 676–698. External Links: Document, 1707.03397 Cited by: §6.2.1.
  • B. Sun, J. Feng, and K. Saenko (2016) Return of frustratingly easy domain adaptation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 2058–2065. Cited by: §1.
  • B. Sun and K. Saenko (2016) Deep CORAL: correlation alignment for deep domain adaptation. In Computer Vision - ECCV 2016 Workshops - Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III, G. Hua and H. Jégou (Eds.), Lecture Notes in Computer Science, Vol. 9915, pp. 443–450. External Links: Link, Document Cited by: §1.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9 (86), pp. 2579–2605. External Links: Link Cited by: §7.1.
  • H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. CoRR abs/1706.07522. External Links: Link, 1706.07522 Cited by: §7.3.
  • M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, D. Nelson, and L. Hernquist (2014) Introducing the Illustris Project: simulating the coevolution of dark and visible matter in the Universe. MNRAS 444 (2), pp. 1518–1547. External Links: Document, 1405.2921 Cited by: §1, §4.1.
  • L. Wang, W. J. Pearson, and V. Rodriguez-Gomez (2020) Towards a consistent framework of comparing galaxy mergers in observations and simulations. A&A 644, pp. A87. External Links: Document, 2009.02974 Cited by: §6.2.1.
  • M. Wang and W. Deng (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135 – 153. External Links: ISSN 0925-2312, Document, Link Cited by: §1.
  • M. Wattenberg, F. Viégas, and I. Johnson (2016) How to use t-sne effectively. Distill. External Links: Link, Document Cited by: §7.1.
  • G. Wilson and D. J. Cook (2020) A survey of unsupervised deep domain adaptation. ACM Trans. Intell. Syst. Technol. 11 (5). External Links: ISSN 2157-6904, Link, Document Cited by: §1.
  • W. Zhang, L. Deng, L. Zhang, and D. Wu (2020) Overcoming Negative Transfer: A Survey. arXiv e-prints, pp. arXiv:2009.00909. External Links: 2009.00909 Cited by: §7.3.
  • Y. Zhang, Y. Zhang, Y. Wei, K. Bai, Y. Song, and Q. Yang (2020) Fisher Deep Domain Adaptation. arXiv e-prints, pp. arXiv:2003.05636. External Links: 2003.05636 Cited by: §1, §2.1.1, §2.2, §2, §2.

Appendix A Neural Network Hyperparameters and Training Details

Here we list all relevant details related to training the neural networks used in this paper. In Table 3, we give details about the architecture of DeepMerge; for the ResNet18 architecture, see He et al. (2015).
For both DeepMerge and ResNet18, we performed hyperparameter optimization experiments with DeepHyper (Balaprakash et al., 2018, 2019), an open-source mixed-integer nonlinear optimization framework which employs a parallel asynchronous model-based search approach to find high-performing parameter configurations in this mixed (categorical, continuous, integer) search space. Our objective was to minimize the total loss in the MMD hyperparameter searches, and a more complex objective in the adversarial domain adaptation searches: the total loss augmented with penalty terms designed to prevent domain classifier mode collapse (all images classified as one domain) and to limit how far the domain classifier’s source domain and target domain accuracies strayed from 50%, the ideal result for a perfectly confused domain classifier.
Different hyperparameters were needed for different experiments, i.e. their values are highly dependent not only on the network, but also on the domain adaptation technique used and the dataset in question. We list all hyperparameters used in the simulated-to-simulated experiments in Table 4 and the simulated-to-real experiments in Table 5.
For training both DeepMerge and ResNet18, we use the Adam optimizer (Kingma and Ba, 2015). Additionally, we implement "one-cycle" learning rate scheduling (Smith and Topin, 2019), which splits a given cycle length in half and applies equal-length linear scaling of the learning rate up from a minimum to a maximum value and back down (like a sawtooth pattern). This technique offers the best of both worlds: higher learning rates assist regularization by enabling egress from saddle points, while lower learning rates prevent training from diverging (Smith and Topin, 2019); it was also shown there that one-cycle learning leads to much faster convergence of training accuracy. If present within the chosen optimizer, momentum also follows this scaling, with optional annihilation. If specified, annihilation rapidly scales the learning rate down to small values after the cycling-down period, in order to enable a steeper descent into a local minimum of the loss landscape.
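For illustration, a schedule of this shape can be obtained with PyTorch's built-in scheduler, as in the following sketch; the stand-in network, learning rates, and cycle length are placeholder assumptions, and our own implementation may differ in details such as annihilation.

```python
# Illustrative one-cycle learning rate schedule with PyTorch.
import torch

model = torch.nn.Linear(10, 2)                      # stand-in network
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-3,              # peak LR, found via the range scan below
    total_steps=1000,              # cycle length in optimizer steps
    pct_start=0.5,                 # equal-length ramp up and ramp down
    anneal_strategy="linear")      # linear scaling, as described above

for step in range(1000):
    opt.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()  # placeholder loss
    loss.backward()
    opt.step()
    sched.step()                   # advance the LR along the cycle
```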
Following Smith and Topin (2019), the maximum and minimum learning rates for this technique were chosen using a learning rate range scan, in which we linearly increased the learning rate over several orders of magnitude during the first few epochs of training and plotted the associated total loss against the epoch. The learning rate at the minimum of this curve was taken as the maximum learning rate, with the minimum learning rate set one order of magnitude lower. As for cycle length, the results of Smith and Topin (2019) suggest that learning should be relatively robust to cycle lengths between 2 and 10 epochs; here, we selected the cycle length through the aforementioned DeepHyper hyperparameter search for both DeepMerge and ResNet18.
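A minimal sketch of such a range scan is given below, with a stand-in model and loss; the scanned range and number of steps are illustrative assumptions.

```python
# Sketch of a learning-rate range scan: linearly increase the LR and record
# the loss at each step; the LR at the loss minimum becomes the one-cycle peak.
import torch

model = torch.nn.Linear(10, 2)                      # stand-in network
opt = torch.optim.Adam(model.parameters(), lr=1e-6)
lrs, losses = [], []
for step in range(500):
    lr = 1e-6 + step * (1e-1 - 1e-6) / 500          # linear ramp over the scan
    for g in opt.param_groups:
        g["lr"] = lr
    opt.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()  # placeholder loss
    loss.backward()
    opt.step()
    lrs.append(lr)
    losses.append(loss.item())

max_lr = lrs[losses.index(min(losses))]   # peak LR for the one-cycle schedule
min_lr = max_lr / 10                      # one order of magnitude lower
```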
In the case of domain adversarial training, we found the performance of the domain classifier to be quite finicky: mode collapse in the domain classifier was common despite training achieving high classification accuracy in the base network. We added the ability to multiply the learning rate of this part of the network by a user-defined constant in order to implement differential learning rates across the networks. For tasks similar to ours, we believe that this factor should often be less than one— indeed small fractions proved quite helpful to the domain adversarial training, especially with Fisher loss included.
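In PyTorch, such differential learning rates can be expressed with per-parameter-group options, as in this sketch; the module names and the multiplier value are illustrative assumptions.

```python
# Sketch of differential learning rates: the domain classifier head gets its
# base LR multiplied by a user-defined constant (here 0.1, illustrative only).
import torch

feature_extractor = torch.nn.Linear(32, 16)   # stand-ins for the real modules
domain_classifier = torch.nn.Linear(16, 2)

base_lr, dc_lr_mult = 1e-3, 0.1
opt = torch.optim.Adam([
    {"params": feature_extractor.parameters(), "lr": base_lr},
    {"params": domain_classifier.parameters(), "lr": base_lr * dc_lr_mult},
])
```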

Layers | Properties | Stride | Padding | Output Shape | Parameters
Input | - | - | - | (3, 75, 75) | 0
Convolution (2D) | Filters: 8, Kernel: 5×5, Activation: ReLU | 1 | 2 | (8, 75, 75) | 608
Batch Normalization | - | - | - | (8, 75, 75) | 16
MaxPooling | Kernel: 2×2 | 2 | 0 | (8, 37, 37) | 0
Convolution (2D) | Filters: 16, Kernel: 3×3, Activation: ReLU | 1 | 1 | (16, 37, 37) | 1168
Batch Normalization | - | - | - | (16, 37, 37) | 32
MaxPooling | Kernel: 2×2 | 2 | 0 | (16, 18, 18) | 0
Convolution (2D) | Filters: 32, Kernel: 3×3, Activation: ReLU | 1 | 1 | (32, 18, 18) | 4640
Batch Normalization | - | - | - | (32, 18, 18) | 64
MaxPooling | Kernel: 2×2 | 2 | 0 | (32, 9, 9) | 0
Flatten | - | - | - | (2592) | -
Fully connected | Activation: ReLU | - | - | (64) | 165952
Fully connected | Activation: ReLU | - | - | (32) | 2080
Fully connected | Activation: Softmax | - | - | (2) | 66
Table 3: Architecture of the DeepMerge CNN used in this paper. We use the "channel first" image data format.
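For reference, the following PyTorch sketch reproduces the DeepMerge architecture of Table 3; the kernel sizes, strides, and paddings follow from the table's output shapes and parameter counts (e.g. 608 = 8 × (3 × 5 × 5 + 1) for the first convolution), so this should be read as a reconstruction rather than the released implementation.

```python
# PyTorch sketch of the DeepMerge CNN from Table 3 (channel-first inputs).
import torch.nn as nn

deepmerge = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.BatchNorm2d(8),
    nn.MaxPool2d(kernel_size=2, stride=2),            # -> (8, 37, 37)
    nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.BatchNorm2d(16),
    nn.MaxPool2d(kernel_size=2, stride=2),            # -> (16, 18, 18)
    nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.BatchNorm2d(32),
    nn.MaxPool2d(kernel_size=2, stride=2),            # -> (32, 9, 9)
    nn.Flatten(),                                     # -> 2592 features
    nn.Linear(2592, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2), nn.Softmax(dim=1),
)
```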
Simulated-to-Simulated
Experiment Hyperparameters DeepMerge ResNet18
No Domain Adaptation Learning rate
Beta
Weight Decay
Epsilon
Cycle Length
Early stopping patience
Dropout N/A
MMD Learning rate
Beta
Weight Decay
Epsilon
Cycle Length
Early stopping patience
MMD + Fisher + Entropy Learning rate
Beta
Weight Decay
Epsilon
Cycle Length
Early stopping patience
Adversarial Learning rate
Beta
Weight Decay
Epsilon
Cycle Length
Early stopping patience
Domain class. LR mult.
Adversarial + Fisher + Entropy Learning rate
Beta
Weight Decay
Epsilon
Cycle Length
Early stopping patience
Domain class. LR mult.
Table 4: The hyperparameters used to train in the simulated-to-simulated dataset experiments. The third and fourth columns list parameters for DeepMerge and ResNet18, respectively. The first row shows parameters for our baseline case without domain adaptation, while all other rows give parameters used in different domain adaptation experiments.
DeepMerge: Simulated-to-Real
Hyperparameters noDA MMD TL+MMD
Learning rate
Beta
Weight Decay
Epsilon
Cycle Length
Early stopping patience
Table 5: The hyperparameters used to train DeepMerge on the simulated-to-real dataset. The second column of the table shows values for the baseline without domain adaptation, while the third column gives parameters for MMD and the fourth for MMD with transfer learning from the simulated-to-simulated dataset.

All experiments were run on Google Colab GPU instances and a Google Cloud virtual machine with an NVIDIA Tesla T4 GPU. Our code uses the PyTorch framework and a fixed random seed (seed=1). It is important to note that completely reproducible results are not guaranteed across PyTorch releases or between CPU and GPU executions, even when using fixed random seeds: when training with GPUs, some PyTorch functions that use CUDA can introduce non-deterministic results. We fix this in the backend so that rerunning the code always leads to the same results. Still, even though our code produces deterministic results when run on a single machine, slight differences may be present when running on different machines.
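A sketch of these reproducibility settings, assuming a standard PyTorch setup, is:

```python
# Fix the Python, NumPy, and PyTorch seeds and force deterministic
# CUDA/cuDNN behavior, as described above.
import random
import numpy as np
import torch

seed = 1
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)            # no-op when no GPU is present
torch.backends.cudnn.deterministic = True   # deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning
```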

Appendix B More on Network Performance

In this section we provide more details to further compare the performance of DeepMerge and ResNet18 in experiments with and without domain adaptation. To complement the results presented in Section 6.1 for simulated-to-simulated experiments, Table 6 compactly displays confusion matrices— for the test phase using both source and target data— for all simulated-to-simulated experiments, with DeepMerge at the top and ResNet18 at the bottom.

We also plot histograms of the output scores for DeepMerge in Figure 8 (left two columns) and ResNet18 (right two columns) for the simulated-to-simulated dataset. In these figures, the source domain is on the left with mergers as dark purple and non-mergers as yellow, and the target domain is on the right with mergers as navy blue and non-mergers as pink. Histograms from top to bottom are ordered as: no domain adaptation, MMD, MMD with Fisher loss and entropy minimization, adversarial domain adaptation, and adversarial domain adaptation with Fisher loss and entropy minimization.

From Figure 8, it is clear that adversarial training, for both DeepMerge and ResNet18, leads to outputs more tightly concentrated around 0 and 1, i.e. the network seems more confident about the classification, while using MMD leads to slightly more spread-out results. Still, confidence is no guarantee of the best classification accuracy, and indeed ResNet18 performs best in the target domain with MMD with Fisher loss and entropy minimization (although we note that the overall differences in performance between methods are not large). Also, in the case of the labeled source domain, it is evident that the inclusion of additional losses works as a regularization mechanism, allowing the network to train for longer without overfitting. This effect is particularly noticeable for ResNet18, which is a larger network and thus more prone to overfitting. As a result, in the source domain, the ResNet18 histograms have more outputs very close to 0 or 1 in experiments with domain adaptation than without it.

Using the confusion matrices and histograms for the simulated-to-simulated experiments, we notice that, depending on the network, the inclusion of different losses leads to different behaviors. In the case of DeepMerge, the best performing model uses adversarial training; the inclusion of Fisher loss and entropy minimization with this technique leads to slightly worse performance. On the other hand, in the case of ResNet18, the best performance results from using MMD with Fisher loss and entropy minimization. When these additional losses are added to the adversarially trained ResNet18, performance drops significantly, with a substantial fraction of the merger class classified incorrectly.

More details related to the results presented in Section 6.2 for simulated-to-real experiments can be found in Table 8, where we present confusion matrices for classification of test source and target datasets with DeepMerge in the case of training without domain adaptation, with MMD, and with MMD with transfer learning.

We also plot histograms of the output scores for simulated-to-real experiments in Figure 9— from top to bottom we present no domain adaptation, MMD, and MMD with transfer learning, respectively. It is clear that the addition of MMD results in a broadening of the source histogram distributions, especially for the non-merger class. When transfer learning is employed, the distributions spread further. This provides a visual illustration of how a network trained with domain adaptation is restricted to learning the set of domain-invariant features, rather than exploiting all available features in the source domain, which would result in greater confidence. Meanwhile, in the target domain, we see that the classifier does not work without DA or with MMD alone, with many examples from both classes classified incorrectly. The inclusion of transfer learning allows correct classification even in the target domain, but with a large spread of output values, especially for the non-merger class.

Finally, we checked the stability of the neural network performance with respect to the choice of random seed. To test this, we trained the DeepMerge architecture on each dataset pair ten times, using a different seed for each run. Table 7 shows the means and standard deviations of all relevant reported performance metrics for the simulation-to-simulation dataset of distant merging galaxies. Since the dataset in the simulated-to-real experiments is very small, we expect a larger spread in performance metrics in the testing stage when different random seeds are used. In Table 9, we again give the means and standard deviations of the performance metrics, which show this slightly larger spread.

DeepMerge: Simulated-to-Simulated
Experiment noDA MMD MMD + F ADA ADA + F
True Label M NM M NM M NM M NM M NM
Source Predicted Label M
NM
Target M
NM
ResNet18: Simulated-to-Simulated
Experiment noDA MMD MMD + F ADA ADA + F
True Label M NM M NM M NM M NM M NM
Source Predicted Label M
NM
Target M
NM
Table 6: Normalized confusion matrices for simulated-to-simulated experiments. The top table gives results for DeepMerge, while ResNet18 is presented in the bottom table. Here the true labels are presented horizontally and the predicted labels are given vertically. Finally, in each table, the top row shows confusion matrices for the source domain test set of images, while the bottom row shows results from the target test dataset.
DeepMerge: Simulated-to-Simulated
Experiment Metric Source Target
No Domain Adaptation AUC
Accuracy
Precision
Recall
F1 score
Brier score
MMD AUC
Accuracy
Precision
Recall
F1 score
Brier score
MMD + Fisher + Entropy AUC
Accuracy
Precision
Recall
F1 score
Brier score
Adversarial AUC
Accuracy
Precision
Recall
F1 score
Brier score
Adv. + Fisher + Entropy AUC
Accuracy
Precision
Recall
F1 score
Brier score
Table 7: Results from running DeepMerge experiments with ten different random seeds. Seeds are used for image shuffling, weight initialization, and CUDA backend. We present means and standard deviations for all aforementioned performance metrics. We do not see substantial variation within each experiment.
Figure 8: Simulated-to-Simulated: Histograms of the output scores of DeepMerge (left two columns) and ResNet18 (right two columns) for the test set of images, with the source domain (simulated pristine images) on the left— mergers as dark purple and non-mergers as yellow— and the target domain (simulated noisy images) on the right— mergers as navy blue and non-mergers as pink. From top to bottom, results are given for training without domain adaptation, MMD, MMD with Fisher loss and entropy minimization, adversarial training, and adversarial training with Fisher loss and entropy minimization. We plot histograms of the output scores for all images that represent true mergers, and of 1 − score for all non-merger images, in order to separate the classes for better visibility. Note that the vertical axis range is the same for all experiments except no domain adaptation for DeepMerge and adversarial domain adaptation for ResNet18 (in order to accommodate larger bars).
DeepMerge: Simulated-to-Real
Experiment noDA MMD MMD + TL
True Label M NM M NM M NM
Source Predicted Label M
NM
Target M
NM
Table 8: Normalized confusion matrices for simulated-to-real experiments with DeepMerge. True labels are presented horizontally, while predicted labels are vertical. The top row shows confusion matrices for the source domain test set of images, while the bottom row gives results of the classification of the target test dataset.
Figure 9: Simulated-to-Real: Histograms of the output scores of DeepMerge for the test set of images, with the source domain (simulated Illustris images) on the left— mergers as dark purple and non-mergers as yellow— and the target domain (real SDSS images) on the right— mergers as navy blue and non-mergers as pink. From top to bottom, results are given for training without domain adaptation, MMD, and MMD with transfer learning. We plot histograms of the output scores for all images that represent true mergers, and of 1 − score for all non-merger images, in order to separate the classes for better visibility. Note that the vertical axis range is the same for all experiments except no domain adaptation (in order to accommodate larger bars).
DeepMerge: Simulated-to-Real
Experiment Metric Source Target
No Domain Adaptation AUC
Accuracy
Precision
Recall
F1 score
Brier score
MMD AUC
Accuracy
Precision
Recall
F1 score
Brier score
MMD + Transfer Learning AUC
Accuracy
Precision
Recall
F1 score
Brier score
Table 9: Results from running DeepMerge simulated-to-real experiments with ten different random seeds. Seeds are used for image shuffling, weight initialization, and CUDA backend. We present means and standard deviations for all aforementioned performance metrics.