Over the years, U-net and its variants have become one of the most commonly used network architectures for semantic segmentation problems because of their continuous success in various benchmarks. However, the difference between the data distributions of training and test images, caused by atmospheric effects, the position of the sun, intra-class variability, etc., makes U-net likely to fail to generate good maps. To overcome this issue, traditional data augmentation methods such as gamma correction, random contrast change [12, 11], histogram matching, and color constancy algorithms are widely used. However, adopting such augmentation techniques is usually insufficient to generalize the model to unseen data, especially when the domain shift between training and test data is very large.
A better data augmentation strategy would be to generate semantically consistent, test-stylized fake training data. To accomplish this task, especially in the field of computer vision, various image-to-image translation (I2I) approaches have already been proposed [14, 7, 4, 6]. The biggest challenge for the style transfer problem is to generate a test-stylized fake training image that semantically matches the real training image, so that one can use the fake training image and the ground-truth of the real training image to train or to fine-tune a model. The recent I2I approaches [14, 7, 4, 6] fail to keep the real and the fake training images semantically the same. For instance, some of them replace trees with buildings and buildings with roads, whereas others create totally artificial objects in the fake training image that do not exist in the real training data. Thus, the fake training images generated by these approaches and the ground-truth for the real training data do not correspond, and training a model on such image and ground-truth pairs causes the model to yield a poor performance. Another approach is to adapt the model to the test data rather than modifying the data, which is a more difficult task.
To overcome the limitations described above, Tasar et al. have recently proposed ColorMapGAN, which learns to change the color distribution of the training data without making any structural change. Although the method works very well, it has some limitations. Firstly, it does not support style transfer when the images in the two domains have different numbers of channels. Secondly, it learns to map the colors of the source domain linearly, which may not be expressive enough in some cases. Finally, since it transforms each color separately, its output is usually slightly noisy. In this work, we propose the semantically consistent image-to-image translation (SemI2I) method, which overcomes all the limitations of ColorMapGAN. By following the same segmentation strategy as ColorMapGAN, we use the fake training images and the original ground-truth to fine-tune the U-net trained on real data.
Let the two domains be denoted by A and B. The essential goal of SemI2I is to generate fake A with the style of B, and fake B with the style of A. The design of SemI2I is illustrated in Fig. 1. Its main components and stages are as follows:
Style transfer. The crucial components for the style transfer are adaptive instance normalization (AdaIN) and adversarial losses. To generate B-stylized fake A and A-stylized fake B, we first encode both A and B. In the encoded embeddings, we compute the mean μ and the standard deviation σ for each feature channel. The embedding itself carries the content information, whereas μ and σ carry the style information. Using AdaIN, we combine the content of A with the style of B, and the content of B with the style of A. After this operation, since the styles of A and B are switched, we obtain fake A with decoder B and fake B with decoder A. Moreover, to make B and fake A, and A and fake B, visually as similar as possible, we minimize the adversarial losses between them. The combination of an encoder and a decoder forms a generator. For the discriminator, we use the same architecture as in CycleGAN. As can be verified from Fig. 1, there are 6 generators and 2 discriminators in SemI2I.
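The AdaIN operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name and shapes are our assumptions (channel-first feature maps, statistics taken over the spatial dimensions):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: re-style the `content` embedding
    with the per-channel mean/std of the `style` embedding.
    Both inputs have shape (channels, height, width)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Strip the content's own style, then apply the target style.
    return s_std * (content - c_mean) / (c_std + eps) + s_mean
```

Feeding the embedding of A as `content` and the embedding of B as `style` yields features whose per-channel statistics match B, which the corresponding decoder then turns into fake A.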
Semantic consistency. By following the same strategy described above, we switch the styles of fake A and fake B to obtain the cross reconstructions A″ and B″. In an ideal transformation, A and A″, and B and B″, have to be exactly the same. In addition, when we encode A with encoder A and decode the embedding with decoder A, we must reconstruct A itself. We call this operation self reconstruction; it likewise applies to B. Besides, we compute the image gradients from A and fake A with the Sobel filter, and we force the difference between the gradients calculated from A and fake A to be as small as possible to help them remain semantically consistent. Again, the same rule applies to B. Finally, we have visually verified that the filters in the first convolution layer of each encoder learn low-level features such as edges. We re-size and concatenate these low-level features from each encoder with the input to each deconvolution layer in the corresponding decoder (see the gray arrows in Fig. 1). Such concatenation allows the decoder to keep a footprint of the objects in the real data; therefore, it guides the decoder to place the right objects in the correct locations.
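The Sobel-based gradient consistency penalty can be illustrated as follows. This is a hedged sketch with our own helper names, written for a single-channel image in pure NumPy; the paper's actual implementation may differ in padding and channel handling:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def _filter2d(img, kernel):
    """Valid-mode 2-D correlation of a single-channel image."""
    h, w = img.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def gradient_loss(real, fake):
    """L1 distance between the Sobel gradient magnitudes of two images:
    small values mean the fake image keeps the real image's edges."""
    def grad_mag(img):
        gx = _filter2d(img, SOBEL_X)
        gy = _filter2d(img, SOBEL_Y)
        return np.hypot(gx, gy)
    return np.abs(grad_mag(real) - grad_mag(fake)).mean()
```

Because the penalty compares edge maps rather than raw pixels, the fake image is free to change colors while object boundaries are encouraged to stay put.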
Training. We now formulate the losses used to train SemI2I. We compute the cross reconstruction loss L_cross, which is the sum of the L1 norms between A and its cross reconstruction A″ and between B and B″. Similarly, the self reconstruction loss L_self is computed by summing the L1 norms between A and its self reconstruction, and between B and its self reconstruction. By adding the L1 norms between the image gradients calculated from A and fake A, and between the gradients from B and fake B, we compute the gradient loss L_grad. Finally, we use the loss functions in LSGAN to compute the adversarial losses. The adversarial loss for the generators, L_adv^gen, is the sum of the LSGAN generator losses between A and fake B, and between B and fake A. Similarly, the adversarial loss for the discriminators, L_adv^disc, is computed by adding the LSGAN discriminator losses between A and fake B, and between B and fake A. The overall generator loss that is minimized in the training stage is computed as:

L_gen = L_adv^gen + λ1 L_cross + λ2 L_self + λ3 L_grad,   (1)

where λ1, λ2, and λ3 are used to adjust the relative importance of each loss. The discriminator loss is defined as:

L_disc = λ4 L_adv^disc.   (2)

To train SemI2I, we simultaneously minimize L_gen and L_disc.
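The least-squares adversarial losses mentioned above can be written compactly. The sketch below follows the standard LSGAN formulation (real label 1, fake label 0, factor 1/2); the function names and the 1/2 factor are our conventions, not taken from the paper:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """LSGAN discriminator loss: push scores on real images towards 1
    and scores on fake images towards 0."""
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake):
    """LSGAN generator loss: push the discriminator's scores on fake
    images towards 1 (i.e., fool the discriminator)."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

The total generator objective would then combine `lsgan_g_loss` for both translation directions with the weighted reconstruction and gradient terms of Eq. 1.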
Test. In the test stage, to generate B-stylized fake A, we need encoder A and decoder B (see the top left generator in Fig. 1 that generates fake A). However, as can be seen in Fig. 1, before the embedding encoded by encoder A from real A is fed to decoder B, it needs to be normalized by AdaIN using the μ and σ calculated from the embedding of B. When generating fake A, one may think of randomly sampling a patch from B, encoding it with encoder B, computing its μ and σ values, and using them to normalize the embedding of A. However, this is not a good idea, because depending on which patch is sampled from B, we may end up generating fake A with a different style in each run. We have exactly the same problem when generating fake B. We want SemI2I to generate exactly the same fake data every time we test it. To do this, we estimate the global μ and σ values for the embeddings of both domains via Alg. 1 during training. In Alg. 1, d_rate is a parameter ranging between 0 and 1. Note that this parameter needs to be set to a value close to 1 (e.g., 0.95), so that the current patches do not change the global mean and standard deviation values too much. When generating fake A, we use the global μ and σ of B; similarly, we use the global μ and σ of A while generating fake B.
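Alg. 1 itself is not reproduced here, but its description (a d_rate close to 1 that keeps individual patches from moving the global statistics too much) suggests an exponential moving average. A sketch under that assumption, with function and variable names of our choosing:

```python
import numpy as np

def update_global_stats(global_mu, global_sigma, embedding, d_rate=0.95):
    """One EMA-style update of the global per-channel statistics,
    in the spirit of Alg. 1: blend the running values with the
    current patch's embedding statistics.
    `embedding` has shape (channels, height, width)."""
    mu = embedding.mean(axis=(1, 2))
    sigma = embedding.std(axis=(1, 2))
    new_mu = d_rate * global_mu + (1 - d_rate) * mu
    new_sigma = d_rate * global_sigma + (1 - d_rate) * sigma
    return new_mu, new_sigma
```

Calling this once per training iteration on each domain's embedding yields stable global μ and σ values that make test-time generation deterministic.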
3 Experiments and Conclusion
We use Pléiades images collected from Bad Ischl and Villach in Austria, covering km² and km², respectively. The images contain RGB color channels, and their spatial resolution is 1 m. The annotations for the building, road, and tree classes have been manually prepared. We perform city-to-city domain adaptation: in the first experiment, we use Bad Ischl as training and Villach as test data; in the second experiment, we switch the training and the test data.
Table 1. Quantitative comparison. Columns: Method; Training: Bad Ischl, Test: Villach; Training: Villach, Test: Bad Ischl.
We split the cities into 256 × 256 patches with an overlap of 32 pixels. In each training iteration, we randomly sample only one patch from each domain. We train SemI2I for 25 epochs, where the number of iterations in each epoch is the minimum of the numbers of patches from the two domains. We optimize SemI2I with the Adam optimizer. We set the learning rate to 0.001 in the first 15 epochs, and compute it in the remaining epochs as:

LR × (num_epochs − epoch_no) / (num_epochs − decay_epoch),
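The patch-splitting step can be sketched as a sliding window with a fixed overlap. The helper name and the clamping of the last tile (so no pixels are dropped at the image border) are our assumptions:

```python
def patch_coords(height, width, patch=256, overlap=32):
    """Top-left corners of patch x patch tiles with the given overlap,
    with the last row/column clamped so the final tile still fits
    inside the image. Assumes the image is at least patch x patch."""
    stride = patch - overlap
    ys = list(range(0, height - patch + 1, stride))
    xs = list(range(0, width - patch + 1, stride))
    if ys[-1] != height - patch:
        ys.append(height - patch)
    if xs[-1] != width - patch:
        xs.append(width - patch)
    return [(y, x) for y in ys for x in xs]
```

Each (y, x) pair indexes a 256 × 256 crop; consecutive tiles share a 32-pixel band, which avoids seams when the fake images are stitched back together.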
where LR, num_epochs, and epoch_no are the current learning rate, the total number of epochs, and the current epoch number, respectively, and decay_epoch, set to 15 in our experiments, is the epoch at which the learning rate starts to be reduced. We set λ1, λ2, λ3, and λ4 in Eqs. 1 and 2 to 10, 10, 10, and 1, respectively; we found these values empirically. To generate the maps, we first train a U-net for 8,000 iterations on the real training data. We then fine-tune it for 2,500 iterations using the fake image patches generated by SemI2I and the original ground-truth. In each training iteration of both the initial training and the fine-tuning steps, we sample a mini-batch of 32 patches. We optimize U-net in both steps via the Adam optimizer with a learning rate of 0.0001.
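Under the stated schedule (constant for 15 epochs, then linear decay to zero at epoch 25), the learning rate can be computed as below; the function name and the assumption that the decay is linear to exactly zero are ours:

```python
def learning_rate(epoch_no, base_lr=0.001, num_epochs=25, decay_epoch=15):
    """Learning-rate schedule: constant base_lr for the first
    decay_epoch epochs, then linearly annealed to 0 at num_epochs."""
    if epoch_no <= decay_epoch:
        return base_lr
    return base_lr * (num_epochs - epoch_no) / (num_epochs - decay_epoch)
```

For example, the rate stays at 0.001 through epoch 15, drops to 0.0005 at epoch 20, and reaches 0 at epoch 25.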
Among the I2I approaches, we compare SemI2I with CycleGAN, UNIT, MUNIT, DRIT, histogram matching, and the gray world algorithm. To make a fair comparison, we replace the fake images used in the fine-tuning step with the fake images generated by these methods. We also provide the results for the traditional U-net without domain adaptation, and for AdaptSegNet, which aims at adapting the model instead of modifying the data.
Fig. 2 depicts close-ups from the city pair used in the experiments, the test-stylized fake training images generated by SemI2I, the ground-truth, and the predictions by U-net and by our framework. As can be seen, our method performs the style transfer well, and the real and fake images are semantically consistent. The figure also shows that U-net performs extremely poorly when there is a large shift between the data distributions of the training and test images; the improvement of our framework over U-net can clearly be observed. The quantitative results for our segmentation framework and for the other methods are reported in Table 1, which demonstrates the superior performance of our methodology.
In this work, we presented SemI2I, a novel data augmentation method. We compared our segmentation framework with 10 different methods and showed that it exhibits a much better performance than the existing domain adaptation approaches. In the future, we plan to tackle the multiple-cities-to-multiple-cities adaptation problem instead of performing city-to-city adaptation.
-  (1980) A spatial processor model for object colour perception. Journal of the Franklin institute. Cited by: §1, Table 1, §3.
-  (2006) Digital image processing (3rd edition). Cited by: §1, Table 1, §3.
-  (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, Cited by: §2.
-  (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §1, Table 1, §3.
Multi-task deep learning for satellite image pansharpening and segmentation. In IGARSS, Cited by: §1.
-  (2018) Diverse image-to-image translation via disentangled representations. In ECCV, Cited by: §1, Table 1, §3.
-  (2017) Unsupervised image-to-image translation networks. In NIPS, Cited by: §1, Table 1, §3.
-  (2017) Least squares generative adversarial networks. In ICCV, Cited by: §2.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §1, §3.
-  (2019) ColorMapGAN: unsupervised domain adaptation for semantic segmentation using color mapping generative adversarial networks. arXiv. Cited by: §1, Table 1.
-  (2019) Continual learning for dense labeling of satellite images. In IGARSS, Cited by: §1.
-  (2019) Incremental learning for semantic segmentation of large-scale remote sensing data. IEEE JSTARS. Cited by: §1.
-  (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §1, Table 1, §3.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, Cited by: §1, §2, Table 1, §3.